++
Trauma researchers must be familiar with injury scoring schemes to accurately risk-adjust injury group comparisons in an attempt to isolate the effects of an independent predictor variable on a dependent outcome variable. Four types of severity scores are typically used: (1) anatomic injury scores, (2) physiological scores, (3) comorbidity scores, and (4) combinations of the three and other factors. The scores vary considerably in complexity of calculation and ease of use. Injury severity score selection should be based on a clear sense of what one wants to measure and why, and a good understanding of each score's strengths and limitations.
+++
Anatomic Injury Classification Systems, Scoring Systems, and Models
++
Anatomic injury classification systems, scores, and models describe the site of injury and extent of damage. Many anatomic scores have been proposed in the literature, but this review is limited to general scores, scales, and models that have gained practical acceptance. We will not review the large number of specialty classification and grading systems, such as the AO Foundation/Orthopaedic Trauma Association (AO/OTA) Classification of Fractures and Dislocations. Most general scoring algorithms are designed to predict mortality (Table 5-2) and are not specifically validated for other outcomes, such as length of stay (LOS) or functional status, although moderate correlations may exist. The majority of scores are based on either the trauma-specific AIS coding classification or the more general ICD-9-CM taxonomy. Injury scoring and modeling systems are continuously being revised, tested, and compared with each other, and no consensus has been reached on which is best. It is likely that the optimal scoring system is contextual and therefore depends on the research question at hand and on the population under study.
++
+++
Anatomic Injury Classification Systems
+++
Abbreviated Injury Scale (AIS)
++
The AIS was first conceived as a system to define the type and severity of injuries arising from aircraft and motor vehicle crashes.13 It began with a dictionary of 73 injuries in 1971 and now has more than 2000, adding to its precision but also to its cost and complexity. Major revisions to the AIS occur every few years (eg, AIS85, AIS90),14,15 with a new update in 2015. To calculate AIS scores, medical records are transcribed into specific codes that capture individual injuries. The AIS is a proprietary classification system requiring specialized training for coding personnel. Because of this, AIS is not captured at every hospital.
++
AIS divides the body into nine regions: head, face, neck, thorax, abdomen, spine, upper extremities, lower extremities, and external. The actual AIS code consists of two numerical components separated by a decimal point. The first is a six-digit injury descriptor “pre-dot” code, which classifies the injury by region, type of anatomic structure and specific structure injured, and level of injury. The “post-dot” component is a severity score ranging from 1 (minor) to 6 (virtually unsurvivable), as shown in Table 5-1. AIS severity scores are consensus-derived assessments assigned by a group of experts. The maximum AIS (MAIS), which is the highest AIS severity among all of a patient’s injuries, is often used to quantify a patient’s injury severity. This score is highly correlated with mortality but ignores information on concomitant injuries.16
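++
The code structure described above can be illustrated with a minimal sketch that splits a code into its pre-dot descriptor and post-dot severity and then takes the MAIS across a patient's injuries. The function names and the example codes are hypothetical placeholders, not entries from the AIS dictionary, and the sketch assumes every code carries a severity of 1 to 6.

```python
from typing import NamedTuple


class AISCode(NamedTuple):
    pre_dot: str   # six-digit injury descriptor
    severity: int  # post-dot severity: 1 (minor) to 6 (virtually unsurvivable)


def parse_ais(code: str) -> AISCode:
    """Split a code of the form 'dddddd.s' into descriptor and severity."""
    pre_dot, _, post_dot = code.partition(".")
    if len(pre_dot) != 6 or not (pre_dot.isascii() and pre_dot.isdigit()):
        raise ValueError(f"expected a six-digit pre-dot code, got {pre_dot!r}")
    if not (post_dot.isascii() and post_dot.isdigit()) or not 1 <= int(post_dot) <= 6:
        raise ValueError(f"expected a post-dot severity of 1-6, got {post_dot!r}")
    return AISCode(pre_dot, int(post_dot))


def mais(codes):
    """Maximum AIS: the highest severity among all of a patient's injuries."""
    return max(parse_ais(code).severity for code in codes)


# Placeholder codes for illustration only (not real AIS dictionary entries).
patient = ["111111.2", "222222.4", "333333.3"]
print(mais(patient))  # 4
```

Note that the MAIS of 4 is the same whether or not the severity 2 and 3 injuries are present, which is the loss of information on concomitant injuries mentioned above.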
++
AIS has many intrinsic shortcomings, including anatomically incorrect body regions, inability to code combat injuries,17,18 and inconsistent scaling of severity across body regions.19 For example, an AIS 3 in the abdomen is associated with a higher probability of mortality than an AIS 3 in the chest or head. Even when performed by experts, AIS coding has low interrater reliability and 65–82% of AIS codes are not used, even in the largest registries.20 Further, the limitations of the AIS are carried forward and multiplied when it is used as the basis for models (eg, the Injury Severity Score [ISS], discussed below).21 Redactions of AIS are now available for both civilian and military injury databases.
+++
International Classification of Diseases
++
Owned by the World Health Organization, the International Classification of Diseases (ICD) is not injury specific, but is a general, all-purpose diagnosis taxonomy for all health conditions. It is well over a century old and is currently in its 11th revision (ICD-11). In most countries, the tenth revision, ICD-10, is being used.22 However, in the United States, ICD-9-CM is the standard for coding hospital diagnoses including injuries.23 In the ICD-9 lexicon, ICD codes exist for more than 10,000 medical conditions, about 2000 of which are physical injuries (the block of ICD-9-CM codes from 800.0 to 959.9 encompasses traumatic injuries). For trauma researchers, AIS codes are generally preferred over ICD-9 because of their greater specificity of injury description (the pre-dot classification) and the availability of an injury severity classification (post-dot code).
++
In 1987, the American Association for the Surgery of Trauma (AAST) introduced the Organ Injury Scale (OIS).24 The goal of the scale was not to predict outcomes, but to standardize the descriptive language of injuries to improve communication between trauma surgeons and other physicians. Like the AIS, the OIS provides an ordinal scale to each level of organ disfigurement, ranging from grade 1 (relatively minor) to grade 5 (likely fatal). As such, it emulated similar efforts by orthopedic surgeons to classify fractures that go back decades. These scales exist for 32 organ and body region systems.25,26,27,28,29,30 The OIS is rarely used in modern injury outcome research.
+++
Barell Injury Diagnosis Matrix
++
The Barell Injury Diagnosis Matrix is a matrix of ICD-9-CM codes for classifying injury diagnosis by type and anatomic region into “injury profiles.”31 The Barell Matrix enables epidemiologic, management, and clinically oriented analysis by serving as a standard for case-mix comparison and characterization of injury patterns. The matrix consists of 12 columns based on ICD-9-CM sequence representing nature of injury (eg, fractures, amputations), and rows with varying levels of detail relating to body region (eg, head and neck, traumatic brain injury [TBI]). Using this matrix, data can be aggregated into summary reports that indicate injury distribution, enabling descriptive comparisons.
+++
Anatomic Injury Scoring System Models
++
Current anatomic injury models are based on the AIS or the ICD, and may also be based on injury scores derived by expert consensus (AIS injury severity score) or derived empirically (probabilities of mortality calculated using large injury databases). AIS-based consensus models include the Injury Severity Score (ISS), the New Injury Severity Score (NISS), and the Anatomic Profile (AP). AIS-based empirical models include the Trauma Registry Abbreviated Injury Score (TRAIS) and the Trauma Mortality Prediction Model (TMPM). ICD-based models are all derived empirically and include the ICD Injury Severity Score (ICISS) and its derivatives, as well as the OIS.
+++
Injury Severity Score
++
In 1974, Baker et al first posited a multi-injury score by introducing the Injury Severity Score (ISS).32 Injuries in each AIS region are given an AIS score and the highest AIS scores in the three most severely injured regions are squared and summed to form the ISS. The ISS range is an ascending scale of severity from 1 (least severe) to an assigned 75 (originally designated as unsurvivable). AIS 6 automatically translates to ISS 75.
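++
The calculation just described can be sketched as follows. This is a minimal illustration rather than a registry-grade implementation: the function name and the region labels are our own, and the input is assumed to be a list of (body region, AIS severity) pairs with severities from 1 to 6.

```python
def iss(injuries):
    """Injury Severity Score from (body_region, ais_severity) pairs."""
    injuries = list(injuries)
    if not injuries:
        raise ValueError("at least one injury is required")
    # Any AIS 6 injury automatically translates to ISS 75.
    if any(severity == 6 for _, severity in injuries):
        return 75
    # Keep only the highest AIS severity within each body region.
    worst_by_region = {}
    for region, severity in injuries:
        worst_by_region[region] = max(severity, worst_by_region.get(region, 0))
    # Square and sum the three most severely injured regions.
    top_three = sorted(worst_by_region.values(), reverse=True)[:3]
    return sum(severity ** 2 for severity in top_three)


example = [("head", 4), ("head", 3), ("thorax", 3), ("abdomen", 2), ("external", 1)]
print(iss(example))  # 4**2 + 3**2 + 2**2 = 29
```

In this example the second head injury (AIS 3) does not contribute to the score because only one injury per region is counted.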
++
ISS correlates with mortality and remains for researchers the most widely used anatomical scoring system. However, ISS has many limitations.33 Its association with the probability of mortality is neither smooth nor monotonic. Further, it only considers one injury in each body region and thus ignores important injury information, especially in penetrating and ballistic injury (Table 5-3).18,34 Because of these shortcomings, we continue to believe ISS should be retired and replaced by one of the more modern injury models now available (see below).
++
+++
New Injury Severity Score
++
The New Injury Severity Score (NISS) was formulated by Osler et al to address the ISS's shortcoming in quantifying multiple occurrences of serious injuries within the same body region.36 NISS is the sum of the squares of the three most severe AIS severities, regardless of body region and, as with the ISS, AIS 6 automatically translates to NISS 75. This modification offers a slight predictive advantage but has several of the same shortcomings as the ISS.
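++
A minimal sketch of the NISS calculation, assuming a simple list of AIS severities (1 to 6) as input; the function name is hypothetical. Because body region is ignored, no region labels are needed.

```python
def niss(severities):
    """New Injury Severity Score from a list of AIS severities."""
    severities = list(severities)
    if not severities:
        raise ValueError("at least one injury is required")
    # As with the ISS, any AIS 6 injury automatically translates to 75.
    if 6 in severities:
        return 75
    # Square and sum the three most severe injuries, regardless of region.
    top_three = sorted(severities, reverse=True)[:3]
    return sum(severity ** 2 for severity in top_three)


# Two head injuries (AIS 4 and 3), one thoracic (AIS 3), one abdominal (AIS 2).
print(niss([4, 3, 3, 2]))  # 4**2 + 3**2 + 3**2 = 34
```

For the same patient the ISS would be 29 (head 4, thorax 3, abdomen 2), because the ISS counts only the worst injury in each region; the NISS is therefore never lower than the ISS.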
++
The Anatomic Profile (AP), developed by Sacco et al, uses AIS severity scores with an adjustment for differences across body regions.18 Three modified components are weighted to form a single scalar based on anatomic location of all serious injuries (AIS ≥3). The AP overcomes many modeling defects inherent in ISS and NISS and avoids the excessive complexity and nontransportability in scores such as the TMPM.37 The fact that the AP has not supplanted the ISS, however, is probably a reflection of the general lack of understanding of injury severity modeling in the injury research community.
+++
ICD Injury Severity Score
++
The ICD Injury Severity Score (ICISS), created by Osler et al, took an empirical estimation approach to injury severity scoring with the formulation of ICD-9 survival probability (Ps).38 Ps is an ICD-9 code-specific estimate of the survival probability associated with that particular injury. For a set of patients, Ps for a particular injury code is the number of patients who survive that injury divided by the number of patients who display the injury. The traditional ICISS is calculated as the product of Ps for as many as 10 injuries, and ranges from 0 (unsurvivable) to 1 (high likelihood of survival). Other versions of the ICISS include Ps of the worst injury, and independent Ps calculated on patients with isolated injuries. A similar approach was used in the 1970s, with each ICD code being given a conditional and definitive Ps related to the presence of other injuries.39
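++
The two steps described above (deriving code-specific Ps from a reference population, then multiplying Ps across a patient's injuries) can be sketched as follows. The function names and the toy codes "A", "B", and "C" are hypothetical. The text does not specify which injuries are retained when a patient has more than 10, so keeping the 10 lowest Ps values is an assumption of this sketch.

```python
from collections import defaultdict
from math import prod


def survival_probabilities(patients):
    """Derive Ps per code: survivors with the code / all patients with the code.

    `patients` is an iterable of (codes, survived) pairs.
    """
    seen = defaultdict(int)
    survivors = defaultdict(int)
    for codes, survived in patients:
        for code in set(codes):  # count each code once per patient
            seen[code] += 1
            survivors[code] += int(survived)
    return {code: survivors[code] / seen[code] for code in seen}


def iciss(codes, ps, max_injuries=10):
    """Traditional ICISS: product of Ps over as many as `max_injuries` injuries.

    Assumption: when more injuries are present, the lowest Ps values are kept.
    """
    values = sorted(ps[code] for code in codes)[:max_injuries]
    return prod(values)


# Toy reference population with placeholder codes (not real ICD-9 codes).
reference = [
    (["A"], True),
    (["A", "B"], True),
    (["A", "B"], False),
    (["B", "C"], False),
    (["C"], True),
    (["A"], True),
]
ps = survival_probabilities(reference)
print(ps["A"], ps["C"])        # 0.75 0.5
print(iciss(["A", "C"], ps))   # 0.375
```

Note that Ps for code "A" is lowered by a death in a patient who also had code "B", which illustrates how Ps can reflect the influence of associated injuries.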
++
ICISS offers several advantages over other anatomic scores. First, because it is based on the ICD coding lexicon, it can be used in any clinical setting, including smaller centers that typically do not perform AIS coding. Second, unlike the consensus-derived AIS severity scores, ICISS’ empirical approach enables powerful statistical estimates of injury-specific survival if the patient population used to derive them is large enough. Consequently, unlike ISS and NISS, ICISS is a smooth, if nonlinear, function of mortality.
++
ICISS does, however, have limitations. First, although it resembles an overall probability, ICISS can only be considered a scalar because Ps is often “contaminated” by patients with multiple injuries. Independent Ps can be calculated as mentioned earlier, but these are not available for all codes because many injuries rarely occur in isolation.40 Second, Ps is database-specific and the degree to which it is applicable in injury populations with different patient characteristics remains uncertain.41 Finally, because empirically based injury severity scores are based on observed all-cause mortality, the ICISS is not an independent measure of anatomical injury but inherently incorporates physiological reserve and physiological reaction to injury, which vary across injury diagnoses.
+++
Trauma Registry Abbreviated Injury Score
++
ICD-9 codes are nominal, meaning they are unordered, qualitative categories not ranked by severity. If one ignores the AIS severity score, AIS codes can also be treated nominally, taking advantage of their specificity in injury classification. As such, AIS injury descriptor codes can be used to calculate Ps, similar to the ICISS. The Trauma Registry Abbreviated Injury Score (TRAIS) is the product of AIS-derived Ps. Kilgo et al showed that ICISS and TRAIS behave similarly in a large group of patients coded both ways and that TRAIS predicts mortality more accurately than its AIS counterparts ISS, NISS, and AP.42 Although ICISS and TRAIS are derived from two different coding systems, they behave similarly in terms of their association with mortality. This suggests that empirical approaches might obviate the inherent structure of the coding systems.
+++
Trauma Mortality Prediction Model
++
The Trauma Mortality Prediction Model (TMPM) is also based on Ps associated with AIS codes, generated using NTDB data. It adds a level of complexity as it is calculated as a weighted sum of coefficients from two probit models of mortality: one based on AIS pre-dot codes and one on body region injured.
++
To address the lack of an injury severity scoring mechanism in the ICD classification system, MacKenzie and colleagues developed an algorithm to convert ICD-9-CM codes to AIS90 codes.43 The algorithm has more recently been updated to map ICD-10 codes to AIS98 codes.44 Mapped AIS codes can be used to calculate AIS consensus-based severity scores including the ISS, NISS, and AP. This algorithm has proven useful for quantifying injury severity in administrative databases but mapping is not achieved for all ICD codes, and the performance of mapped AIS injury severity scores for predicting mortality is significantly inferior to directly coded AIS severity scores and ICD-based scores (ICISS).16,45
++
Trauma clinicians, outcomes researchers, and hospital administrators may ask which of these approaches is best. There is no consensus, and many publications each year continue to debate this question.
++
Several large studies, including ones by Sacco et al and Meredith et al, compared these anatomic scores in terms of their ability to predict mortality.16,46 Both studies found that AP and ICISS better discriminate survivors from nonsurvivors than ISS, NISS, and the ICD-MAP versions of ISS, NISS, and AP. A surprising finding was that MAIS performed better than its multi-injury counterparts ISS and NISS. Based on this result, Kilgo et al showed that the patient’s worst injury, regardless of coding lexicon (ICD-9 or AIS) or estimation approach (AIS severity-consensus or empirical Ps), was a better predictor of mortality than multi-injury scores.42 Harwood et al reported that the NISS was better than the ISS and equivalent to the MAIS in the prediction of mortality in blunt trauma patients.47
++
However, many of these studies have not accounted for other risk factors inherently incorporated into empirical scores (age, comorbidities, physiological reaction). Studies that do not account for these risk factors when comparing consensus-based and empirical scores unfairly disadvantage the former.48 As mentioned earlier, each of these scores has strengths and limitations, and the choice of score should be adapted to the outcome, data, and study population at hand.
+++
Physiological Scoring Systems and Models
++
Anatomic injury unleashes a set of physiological and biochemical consequences that require modulation and mitigation as part of the patient’s treatment. The concept of integrating physiology into injury severity modeling recognizes the dynamic and time-dependent changes to physiological status following injury. A patient with a ruptured spleen who is seen within 1 hour of injury might be normotensive and, when appropriately and promptly treated at a trauma center, will have a very low risk of dying. A patient with the identical anatomic injury who is seen 4 hours later, for whatever reason, may present with a systolic blood pressure (SBP) of 60 mm Hg and a significant risk of death despite availability of the same system of in-hospital care. The hospital should not be penalized if the patient dies (but perhaps the system should). Integration of physiological parameters on arrival with other data is therefore essential for accurate case mix adjustment and outcome prediction in injury research.
++
Early death following injury occurs as a result of central nervous system (CNS) impairment, hemorrhage, or respiratory causes, or combinations of these. Clinical markers, including respiratory rate (RR), SBP, base deficit, and reaction to stimuli/state of consciousness, are important prognosticators of outcome and are routinely used in clinical management. However, unlike anatomic injuries and preexisting comorbidities, which are fixed at the time of hospital admission, physiological parameters are ever-changing, both spontaneously and in response to therapy. Thus, it is necessary to obtain a “snapshot” of physiological status at one point in time, usually immediately upon ED or trauma center arrival. The main physiological scores currently in use in injury research include the Glasgow Coma Scale (GCS) and the Revised Trauma Score (RTS).
+++
Glasgow Coma Scale (GCS)
++
The GCS was first proposed by Teasdale & Jennett as a means to directly triage brain-injured patients and to monitor postoperative craniotomy patients.49,50 The GCS was subsequently integrated into the Trauma Score/Revised Trauma Score (RTS) and Triage Score to describe level of consciousness without using subjective terms such as “semicomatose” and “lethargic” to classify head injury following trauma.51 GCS measures brain function via three components: (1) motor (GCS-M), (2) verbal (GCS-V), and (3) eye opening (GCS-E), each with ordinal characterizations of severity (Table 5-4). The scale ranges from 3 (completely unresponsive) to 15 (fully conscious) and the GCS score has been shown to be strongly correlated with survival.52 The motor component alone has been shown to be almost as powerful as the full GCS score,53 and could theoretically replace the full GCS score for predicting mortality. However, the verbal and eye components discriminate noncomatose patients and are thus valuable for predicting nonfatal outcomes.54
++
++
The Trauma Score, later updated to the Revised Trauma Score (RTS), was designed by Champion et al as an approach to combining clinical and observational physiological data into one score.55,56
++
Two forms of the RTS exist, one for triage (Triage-RTS) and one for outcomes evaluation and risk adjustment (RTS). Both are based on the GCS, SBP, and RR (Table 5-5). The Triage-RTS is calculated by summing the coded values for each of the three variables, and ranges between 0 and 12. The RTS equation for outcomes evaluation computes indexed values of GCS, SBP, and RR by weighting their coded value with logistic regression coefficients and summing them.
+
RTS = 0.9368(GCS) + 0.7326(SBP) + 0.2908(RR)
++
++
The RTS ranges from 0 to 7.84, with lower scores translating into more physiological derangement. RTS correlates strongly with mortality57 and remains important in injury scoring through its contribution to the TRISS model (see below). Studies have also shown that the combined use of SBP and GCS-M is just as effective at predicting patient survival as the RTS.58 Disadvantages of the RTS include the fact that coefficients are based on Major Trauma Outcome Study (MTOS) data and have not been updated, and that categories of the GCS, SBP, and RR used to calculate the RTS often have very sparse data.57
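++
Both forms of the RTS can be sketched directly from the equation above. The sketch takes the coded values of GCS, SBP, and RR as inputs and assumes each is an integer from 0 to 4, which is consistent with the stated ranges (0 to 12 for the Triage-RTS and 0 to 7.84 for the RTS). The mapping from raw measurements to coded values (Table 5-5) is deliberately not reproduced here, and the function names are hypothetical.

```python
# Logistic regression weights from the RTS equation in the text.
RTS_WEIGHTS = {"gcs": 0.9368, "sbp": 0.7326, "rr": 0.2908}


def _check_coded(**coded):
    """Validate that each coded value is an integer from 0 to 4 (assumed range)."""
    for name, value in coded.items():
        if value not in range(5):
            raise ValueError(f"coded {name} must be an integer 0-4, got {value!r}")


def triage_rts(gcs, sbp, rr):
    """Triage-RTS: unweighted sum of the three coded values (0 to 12)."""
    _check_coded(gcs=gcs, sbp=sbp, rr=rr)
    return gcs + sbp + rr


def rts(gcs, sbp, rr):
    """RTS for outcomes evaluation: weighted sum of the coded values."""
    _check_coded(gcs=gcs, sbp=sbp, rr=rr)
    return (RTS_WEIGHTS["gcs"] * gcs
            + RTS_WEIGHTS["sbp"] * sbp
            + RTS_WEIGHTS["rr"] * rr)


print(triage_rts(4, 4, 4), round(rts(4, 4, 4), 2))  # 12 7.84
print(round(rts(3, 4, 4), 3))                       # 6.904
```

The second call shows how heavily the GCS is weighted: a one-category drop in coded GCS lowers the RTS by 0.9368, more than three times the effect of a one-category drop in coded RR.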
+++
Comorbidity Scoring Systems
++
Injury outcomes research has long recognized the importance of comorbidities to patient risk and outcomes. For that reason, comorbidities were integrated into the American College of Surgeons Committee on Trauma (ACS-COT) field triage decision scheme developed by Champion.59 Morris et al, among others, identified several preexisting conditions that worsen prognosis following trauma, most notably liver cirrhosis, chronic obstructive pulmonary disease (COPD), congenital coagulopathy, diabetes, and congenital heart disease.60 Morbid obesity has now been added to this list.61 The incorporation of preexisting conditions into injury severity models is difficult because so many potential comorbidities exist, each of which may occur with variable severity. Further, many are relatively rare and may be inconsistently recorded.
++
Specific comorbidity adjustments, such as the Charlson Comorbidity Index (CCI), which are widely used in other disciplines,62 are frequently used in injury severity models in attempts to enhance their predictive abilities. Results, however, have been poor.63,64 This may be because such scores are not adapted to acute injury populations.64 Indeed, the CCI, a weighted sum of 17 preexisting conditions, is based on coefficients derived in a population of cancer patients using a Cox proportional hazards model and is therefore clearly not appropriate for injury admissions. The number of Charlson comorbidities has been shown to predict injury mortality as well as the CCI.64 Other approaches include using the presence of individual comorbidities or classes of conditions (ICD-9-CM ranges) in risk-adjustment methods or simply using patient age as a surrogate for comorbidities. With the aging of trauma populations, comorbidity and multimorbidity will increase, and accounting for these factors in injury research will become increasingly important. Efforts should therefore be made to develop an injury-specific comorbidity index.
+++
Combat Injury: A Special Case
++
Since the addition of descriptors for coding penetrating injuries with the AIS 1985 edition, researchers have had a tool for evaluating both blunt and penetrating injuries. The descriptors of penetrating injuries included in AIS versions since 1985 describe low-kinetic-energy injuries treated in civilian trauma centers and hospitals. Subsequent iterations have been used to code military combat injuries as well.65 These codes, however, did not adequately describe commonly seen penetrating combat injuries such as multiple and massive soft-tissue fragment wounds; high-velocity penetration; blast overpressure injuries (mutilating or nonmutilating); and bilateral and multiple injuries17,18 that result from explosive devices including improvised explosive devices (IEDs), which account for 55–75% of combat injuries.10 Nor were they designed to account for some of the injury phenomena associated with mass casualty incidents, for example, crush injuries in earthquake disasters.
+++
Abbreviated Injury Scale—Military Edition (AIS-Mil)
++
To address these issues, a committee of military physicians was formed to work with the International Injury Scaling Committee (IISC) of the Association for the Advancement of Automotive Medicine (AAAM) to propose guidelines for developing a version of the AIS specifically for coding combat injuries. These physicians represented all three services of the US military (Army, Navy/Marines, Air Force) as well as a spectrum of medical specialties relevant to combat casualty care including emergency medicine and trauma, orthopedic, neurosurgery, and general surgery. AIS 2005-Military is used for coding of all injuries in the three combat trauma registries: (1) the DoD Trauma Registry (DoDTR, formerly the Joint Theatre Trauma Registry [JTTR]) based in San Antonio, TX; (2) the Navy/Marine Combat Trauma Registry (CTR) Expeditionary Medical Encounter Database (EMED) based in San Diego, CA; and (3) the Mortality Trauma Registry (MTR) based at the Office of the Armed Forces Medical Examiner (OAFME) in Rockville, MD. These registries also code in AIS 2005 (civilian version) and AIS 1998 for future comparisons with civilian trauma registry data.
++
Development of AIS 2005-Military coincided with the revision efforts that would culminate in publication of the civilian AIS 2005, which included additional expanded descriptors for orthopedic trauma based on the OTA scale and expanded bilateral injury codes, particularly for vessel injuries. The same consensus model used in determining changes to each injury description by the IISC was used to determine AIS 2005-Military scores.
+++
Military Combat Injury Scale
++
Despite revisions that culminated in the AIS-Military, numerous combat injuries such as those caused by explosive devices still could not be coded or adequately described. Trying to adapt AIS was only moderately successful. Therefore, the Military Combat Injury Scale (MCIS)20 was drafted by a large panel of military and civilian experts. First, a more anatomically correct and militarily relevant set of body regions was developed (head and neck, torso, arms, legs, multiple). Next, five combat severity levels were determined (minor through likely lethal), and combat-relevant injury descriptions were tabulated.
++
Using these new body regions, severity levels, and injury descriptors, a five-digit MCIS coding scheme was developed and 269 codes were assigned. Digit 1 indicates injury severity; digit 2 indicates body region; digit 3 indicates the type of tissue involved; and digits 4 and 5 together indicate the specific injury when combined with digits 1, 2, and 3. This coding scheme allows for injuries to the skull and brain to be identified separately from injuries to the face or neck, and for injuries to the chest, abdomen, and pelvis to be separately identified despite being assigned to the same body region. It also allows for identification of unilateral or bilateral injuries, right or left for specific injuries, and easy identification of junctional-area vascular injuries.
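++
The digit layout just described can be illustrated with a short parsing sketch. The function name and the example code are hypothetical, and the numeric values assigned to particular body regions and tissue types are not given in this section, so the sketch only separates the fields without interpreting them.

```python
from typing import NamedTuple


class MCISCode(NamedTuple):
    severity: int         # digit 1: injury severity (1-5)
    body_region: int      # digit 2: body region
    tissue_type: int      # digit 3: type of tissue involved
    specific_injury: str  # digits 4-5: specific injury within the above


def parse_mcis(code: str) -> MCISCode:
    """Split a five-digit MCIS code into its component fields."""
    if len(code) != 5 or not (code.isascii() and code.isdigit()):
        raise ValueError(f"expected a five-digit code, got {code!r}")
    severity = int(code[0])
    if not 1 <= severity <= 5:
        raise ValueError(f"severity digit must be 1-5, got {severity}")
    return MCISCode(severity, int(code[1]), int(code[2]), code[3:])


# Placeholder code for illustration only (not a real MCIS entry).
print(parse_mcis("32107"))
# MCISCode(severity=3, body_region=2, tissue_type=1, specific_injury='07')
```

Digits 4 and 5 are kept as a string because, as noted above, they identify a specific injury only in combination with the first three digits.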
+++
Military Functional Incapacity Scale
++
The Military Functional Incapacity Scale (MFIS) was developed at the request of military personnel to correlate immediate functional impairment with MCIS injury severity for ground troops, and later for shipboard environments.20 The ground operational requirement is based on the ability of an injured combatant to (1) communicate, (2) move, and (3) fire a weapon. The MFIS was developed as an ascending scale of functional impairment with four levels, as follows: (1) able to continue mission, (2) able to contribute to sustaining mission, (3) lost to mission, and (4) lost to military.
++
MFIS levels of incapacity were linked directly to MCIS injury severity. MCIS Severity 1 injuries are not associated with immediate functional incapacity and casualties are able to continue with the mission; MCIS Severity 2 injuries usually result in immediate functional impairment with the potential for the casualty to contribute to the mission; and MCIS Severity 3, 4, or 5 injuries require medical treatment—casualties who sustain one or more of these injuries are lost to the mission or to the military. Specific Army and Navy scales have been developed.
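++
The link between MCIS severity and MFIS level can be expressed as a simple lookup. This is an illustrative sketch only: the function name is hypothetical, the casualty's worst injury is assumed to determine incapacity, and the service-specific Army and Navy scales are not represented.

```python
MFIS_LEVELS = {
    1: "able to continue mission",
    2: "able to contribute to sustaining mission",
    3: "lost to mission",
    4: "lost to military",
}


def possible_mfis_levels(mcis_severities):
    """Return the MFIS level(s) linked to a casualty's worst MCIS severity."""
    severities = list(mcis_severities)
    if not severities or any(s not in range(1, 6) for s in severities):
        raise ValueError("MCIS severities must be integers from 1 to 5")
    worst = max(severities)
    if worst == 1:
        return [1]  # no immediate functional incapacity
    if worst == 2:
        return [2]  # usually impaired, but may still contribute
    return [3, 4]   # severity 3-5: lost to the mission or to the military


for level in possible_mfis_levels([2, 4]):
    print(level, MFIS_LEVELS[level])
```

For severities 3 to 5 the sketch returns both levels 3 and 4, because the text does not state which of the two applies to a given injury.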
+++
Combination Injury Severity Models
++
Combination injury severity models attempt to combine some or all of the three concepts of risk described by MacKenzie66: (1) pre-injury physiological reserve (eg, age, comorbidities), (2) physiologic status of the injured patient (eg, GCS, RR, SBP), and (3) anatomic injury severity (eg, ISS, NISS, MAIS, ICISS).
++
Injury severity models are generally used to quantify patient case mix or to perform adjusted comparisons across injury groups. To date, most injury severity models have been based on mortality but models based on nonfatal outcomes (eg, complications, readmissions, resource use), have more recently been proposed.67,68,69 The most common use is probably for institutional benchmarking.
++
The most well-known injury severity model is the Trauma and Injury Severity Score (TRISS).70 Documented limitations of TRISS (discussed in more detail below) include the fact that age is modeled in just two categories, which assumes equivalent mortality risk in all patients 55 years of age or older; that it does not account for comorbidities; and that it carries other limitations inherent to the ISS and RTS (discussed above).71 Further, the original coefficients are more than 20 years old, although they have been recently revised.
++
In an attempt to address these limitations, many other injury mortality prediction models have been proposed. A Severity Characterization of Trauma (ASCOT) includes age modeled in five categories and uses the AP instead of the ISS.72 The Harborview Assessment for Risk of Mortality (HARM) is based uniquely on hospital discharge data and includes anatomic injury descriptors (ICD), mechanism, comorbidities, and injury intent.73 The Trauma Risk Adjustment Model (TRAM) includes complications and transfer status, and employs flexible modeling techniques to use all information on continuous covariates (ie, age, GCS, RR, SBP), and to preserve their nonlinear associations with mortality.74 Several studies have compared the predictive accuracy of injury severity models, and results indicate that the most complex models offer significantly better predictive accuracy and change the results of trauma center benchmarking analyses.16,45,48 However, these more complex models have trouble supplanting TRISS.
++
To address the problem of limited injury data available in low- and middle-income countries (LMICs), the Kampala Trauma Score (KTS) has also been proposed. Originating in Uganda, the KTS is a “simplified composite of the RTS and the ISS and closely resembles… TRISS,”75 and is calculated along a descending scale of severity, ie, 5 to <11 (severe), 11–13 (moderate), and 14–16 (mild).
+++
Development of Injury Severity Models
++
Development of an injury severity model implies rigorous statistical methods in line with guidelines proposed for prediction models.76 The purpose of the model must be clearly defined, the model should be derived on a large sample of (representative) patients subject to the highest standards of care, the choice of potential risk factors should be based on literature review and expert opinion in line with a conceptual model, and the model should allow for nonlinear associations with the outcome (eg, the probability of mortality does not increase linearly with age or the ISS and associations with SBP and RR are nonmonotonic). Both the internal and external validity of the model should be evaluated. Some injury severity models published to date have been evaluated in terms of apparent performance (discrimination and calibration on the sample used for derivation) but relatively few have been subject to rigorous validation.
+++
Evaluating Predictive Models
++
The performance of models is evaluated according to their capacity to accurately predict the outcome of interest. Injury severity models based on binary outcomes (eg, mortality, readmission, complications) are generally based on the logistic regression model. The predictive accuracy of logistic models is evaluated by calculating measures of discrimination and calibration.
++
Model discrimination describes the accuracy of the model for distinguishing between survivors and nonsurvivors, and is generally measured using the area under the Receiver Operating Characteristic (ROC) curve (AUC). This area varies between 0 and 1, where 0.5 indicates a model that discriminates no better than chance alone (noninformative) and 1 indicates a model that discriminates perfectly. Discrimination depends on the frequency of the outcome but, unlike calibration, tends to be relatively stable from one population to another. For example, injury severity models generally have excellent discrimination for predicting mortality (AUC >0.9),74,77 and good discrimination for complications (AUC = 0.807),69 but poor discrimination for unplanned readmission (AUC = 0.65).67 This indicates that baseline risk (physiological reserve, physiological parameters on arrival, and anatomical injury severity) explains mortality well but that complications and unplanned readmissions are explained to a greater extent by other factors such as quality of care. Discrimination is usually considered to be more important than calibration because it cannot generally be improved by modeling strategies.
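++
The AUC has a useful interpretation: it equals the probability that a randomly chosen patient who experienced the outcome was assigned a higher predicted risk than a randomly chosen patient who did not, with ties counted as one half. The sketch below computes the AUC directly from that definition. The function name and the toy data are illustrative, and the pairwise loop is chosen for clarity rather than speed.

```python
def auc(outcomes, risks):
    """AUC as the proportion of (event, non-event) pairs ranked correctly.

    `outcomes` holds 1 for patients with the event (eg, death) and 0 otherwise;
    `risks` holds the model's predicted probability for each patient.
    """
    events = [r for y, r in zip(outcomes, risks) if y == 1]
    non_events = [r for y, r in zip(outcomes, risks) if y == 0]
    if not events or not non_events:
        raise ValueError("both outcome classes must be present")
    concordant = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                concordant += 1.0   # pair ranked correctly
            elif e == n:
                concordant += 0.5   # tie counts as half
    return concordant / (len(events) * len(non_events))


died = [0, 0, 0, 1, 1, 0, 1, 0]
risk = [0.05, 0.10, 0.20, 0.30, 0.80, 0.40, 0.90, 0.02]
print(round(auc(died, risk), 3))  # 0.933
```

Here 14 of the 15 possible pairs are ranked correctly; the single discordant pair is the survivor with a predicted risk of 0.40 and the nonsurvivor with a predicted risk of 0.30.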
++
Model calibration (or goodness of fit) indicates how well the model fits the data or how closely model risk estimates approximate observed event rates across different levels of risk. Good model calibration is dependent on the data at hand and can, to a large extent, be ensured by appropriate model specification, respecting clinically plausible associations between each independent variable and the outcome of interest.
++
Calibration is often quantified using the Hosmer–Lemeshow (HL) statistic,78 based on the difference between observed and predicted probabilities of the outcome of interest in prespecified risk groups. The HL statistic has several limitations, including the fact that it is sensitive to sample size (a large, statistically significant value does not necessarily indicate poor model fit), is dependent on the risk groups used (deciles or other), and cannot be compared over different patient samples.76,79,80,81
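++
A minimal sketch of the HL calculation is given here, assuming equal-sized risk groups formed by sorting patients on predicted risk. The statistic sums, over groups, the squared difference between observed and expected events divided by n × p̄ × (1 − p̄), where p̄ is the mean predicted risk in the group. The function name and the toy data are hypothetical.

```python
def hosmer_lemeshow(y_true, y_prob, groups=10):
    """Hosmer-Lemeshow chi-square statistic over equal-sized risk groups."""
    n = len(y_true)
    if n < groups:
        raise ValueError("need at least one patient per group")
    # Sort on predicted risk only; the sort is stable, so tied risks keep
    # their input order instead of being ordered by outcome.
    pairs = sorted(zip(y_prob, y_true), key=lambda pair: pair[0])

    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        size = len(chunk)
        observed = sum(y for _, y in chunk)   # observed events in the group
        expected = sum(p for p, _ in chunk)   # expected events in the group
        mean_p = expected / size
        variance = size * mean_p * (1.0 - mean_p)
        if variance > 0:                      # skip groups with risk of exactly 0 or 1
            stat += (observed - expected) ** 2 / variance
    return stat


# Hypothetical data: the low-risk group is predicted at 0.2 but has 5/10 events.
risks = [0.2] * 10 + [0.8] * 10
deaths = [1] * 5 + [0] * 5 + [1] * 8 + [0] * 2
print(hosmer_lemeshow(deaths, risks, groups=2))
```

In the derivation sample the statistic is conventionally referred to a chi-square distribution with (number of groups − 2) degrees of freedom; computing the p value requires a chi-square distribution function from a statistics package, which is omitted here. Rerunning the function with a different `groups` value shows the dependence on grouping noted in the text.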
++
Calibration should therefore also be evaluated using other strategies, the most useful of which is Cox’s calibration curve. This curve is based on plotting predicted against observed probabilities of the outcome, thus providing a global impression of how well the model fits the data and enabling the analyst to identify areas where the fit is problematic. The intercept a and slope b of the calibration curve, which should be as close to a = 0 and b = 1 as possible, are useful summary indicators of calibration.82
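++
One common way to obtain the calibration intercept and slope is to regress the observed outcome on the logit of the predicted risk with a logistic model. The sketch below does this with a small Newton-Raphson routine written in plain Python; the function names, the iteration limit, and the toy data are assumptions made for illustration, and predicted risks must lie strictly between 0 and 1.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def calibration_intercept_slope(y_true, y_prob, max_iter=25, tol=1e-10):
    """Fit logit(P(y = 1)) = a + b * logit(predicted risk); return (a, b)."""
    x = [logit(p) for p in y_prob]
    a, b = 0.0, 1.0                      # start at the perfect-calibration values
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0  # score vector and information matrix
        for xi, yi in zip(x, y_true):
            mu = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = mu * (1.0 - mu)
            g0 += yi - mu
            g1 += (yi - mu) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det == 0.0:
            raise ValueError("predicted risks must take at least two distinct values")
        # Newton step: solve the 2 x 2 system (information matrix) * step = score.
        step_a = (h11 * g0 - h01 * g1) / det
        step_b = (h00 * g1 - h01 * g0) / det
        a += step_a
        b += step_b
        if abs(step_a) + abs(step_b) < tol:
            break
    return a, b


# Hypothetical overfitted model: predicts 0.1 and 0.9, but observed rates are 0.2 and 0.8.
risks = [0.1] * 10 + [0.9] * 10
deaths = [1] * 2 + [0] * 8 + [1] * 8 + [0] * 2
print(calibration_intercept_slope(deaths, risks))
```

On the logit scale, predictions that match the observed rates give an intercept near 0 and a slope near 1. In the toy data the predictions are too extreme, so the fitted slope falls below 1, the pattern usually associated with overfitting.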
++
Models can also be evaluated in terms of explanatory power using, among others, r-squared adapted to binary outcomes, the Akaike Information Criterion (AIC),83 and the Brier score.84 AIC is one of several information criteria that can be used in model evaluation85,86 but is the preferred choice because the underlying concepts emphasize expected predictive accuracy in new data.86,87 In AIC, information loss is incurred when a model is substituted for the true model. Smaller AIC values mean less loss, so the model with the smallest AIC is the best choice in model comparisons. Large values do not necessarily indicate extreme information loss, however, because AIC increases with sample size. This is one reason that AIC can only be used comparatively within a given study. One caution regarding AIC is that it can be very sensitive to small effects and may favor unnecessarily complex models.
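++
The AIC is defined as 2k − 2 ln L, where k is the number of estimated parameters and L is the maximized likelihood. For a binary outcome, ln L is the Bernoulli log-likelihood of the fitted probabilities. The sketch below compares two models on the same patients; the function names are illustrative, and the probabilities labeled `richer_fit` are hypothetical values standing in for the output of a fitted three-parameter model.

```python
import math


def bernoulli_log_likelihood(y_true, y_prob):
    """Log-likelihood of 0/1 outcomes given fitted event probabilities."""
    return sum(math.log(p if y == 1 else 1.0 - p) for y, p in zip(y_true, y_prob))


def aic(y_true, y_prob, n_params):
    """Akaike Information Criterion: 2k - 2 ln L (smaller is better)."""
    return 2.0 * n_params - 2.0 * bernoulli_log_likelihood(y_true, y_prob)


outcomes = [1, 0, 0, 1, 0, 0, 0, 1]

# Intercept-only model: every patient gets the overall event rate (3/8), k = 1.
simple_fit = [0.375] * 8
# Hypothetical fitted probabilities from a model with k = 3 parameters.
richer_fit = [0.8, 0.1, 0.2, 0.7, 0.1, 0.3, 0.2, 0.6]

print(aic(outcomes, simple_fit, 1))
print(aic(outcomes, richer_fit, 3))
```

Because ln L is a sum over patients, the AIC grows with sample size, which is why the two values are only meaningful relative to each other and only when computed on the same data.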
++
The Brier score is a promising alternative because its decomposition yields some insight beyond a simple misfit summary.88,89,90 The U-test derived by regressing outcomes on risk estimates is another option that may have greater sensitivity to model discrepancies.91 Ready availability in computer programs has been one reason for the wide use of the HL test, but improvements in data analysis packages have made the other options more readily available.
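++
The Brier score is the mean squared difference between predicted risk and the 0/1 outcome. One widely used decomposition, usually attributed to Murphy, splits it into reliability (miscalibration), resolution (how far group event rates depart from the overall rate), and uncertainty (outcome variance), with Brier = reliability − resolution + uncertainty. Whether this is the decomposition intended by the cited references is an assumption. The sketch groups patients by distinct predicted value, for which the identity is exact; with continuous risks, predictions would first need to be binned.

```python
from collections import defaultdict


def brier_score(y_true, y_prob):
    """Mean squared error between predicted risk and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def murphy_decomposition(y_true, y_prob):
    """Return (reliability, resolution, uncertainty), grouping by predicted value."""
    n = len(y_true)
    base_rate = sum(y_true) / n
    by_risk = defaultdict(list)
    for y, p in zip(y_true, y_prob):
        by_risk[p].append(y)

    reliability = 0.0
    resolution = 0.0
    for p, ys in by_risk.items():
        observed_rate = sum(ys) / len(ys)
        reliability += len(ys) * (p - observed_rate) ** 2          # calibration error
        resolution += len(ys) * (observed_rate - base_rate) ** 2   # separation of groups
    uncertainty = base_rate * (1.0 - base_rate)
    return reliability / n, resolution / n, uncertainty


# Hypothetical data: two risk groups whose observed rates match the predictions.
risks = [0.2] * 10 + [0.8] * 10
deaths = [1] * 2 + [0] * 8 + [1] * 8 + [0] * 2
rel, res, unc = murphy_decomposition(deaths, risks)
print(brier_score(deaths, risks), rel - res + unc)
```

A lower reliability term indicates better calibration and a higher resolution term indicates better separation of risk groups, so two models with similar Brier scores can still be distinguished by their components.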
++
Because the performance of predictive models tends to be overoptimistic in the sample used to derive them, predictive models should be validated in the study population from which they were derived (internal validation or temporal validation) and in a completely independent sample (external validation).
++
The internal validity of a model may be evaluated using split-sampling, cross-validation, or bootstrapping. In split-sampling, the model is derived on a random sample of the study population (eg, 2/3) and it is validated by applying the same model to the remaining observations and calculating metrics of discrimination and calibration. In cross-validation, the sample is split into k subsamples of equal size. The model is repeatedly derived on one or several subsamples and its predictive accuracy evaluated on the remaining subsamples. In bootstrapping, the whole sample is used to derive the model and it is validated on repeated random samples drawn from the original sample. The advantage of split-sampling is that the validation sample is theoretically independent from the derivation sample (although in practice it has the same characteristics as it is a random sample). However, bootstrapping has been found to be equivalent to split-sampling and is generally preferred because it uses all observations to derive the model, thereby increasing model precision.92 The temporal validity of the model can then be evaluated by applying the model to data collected in the same population at a different time. If the model has acceptable internal and temporal validity, model performance should then be evaluated on a completely independent sample (external validity).
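++
As an illustration of one of these internal validation strategies, the sketch below implements k-fold cross-validation in its most common form: each subsample is held out once while the model is derived on the others. The helper names (`k_fold_indices`, `cross_validate`, `fit_event_rate`, `brier`) are hypothetical, and the "model" is deliberately trivial (the event rate in the derivation data) so that the example stays self-contained; in practice `fit` would derive a logistic model and `score` would return a discrimination or calibration metric.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds whose sizes differ by at most 1."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(records, k, fit, score, seed=0):
    """Derive the model on k-1 folds and score it on the held-out fold, k times."""
    if k < 2:
        raise ValueError("k must be at least 2")
    results = []
    for held_out in k_fold_indices(len(records), k, seed):
        held = set(held_out)
        derivation = [r for i, r in enumerate(records) if i not in held]
        validation = [records[i] for i in held_out]
        results.append(score(fit(derivation), validation))
    return results


# Toy model (assumption): predict the derivation-sample event rate for everyone.
def fit_event_rate(derivation):
    return sum(outcome for _, outcome in derivation) / len(derivation)


def brier(predicted_risk, validation):
    return sum((predicted_risk - outcome) ** 2 for _, outcome in validation) / len(validation)


# Hypothetical records: (patient id, 0/1 outcome).
patients = [(i, 1 if i % 4 == 0 else 0) for i in range(40)]
fold_scores = cross_validate(patients, k=5, fit=fit_event_rate, score=brier)
print(sum(fold_scores) / len(fold_scores))  # mean held-out Brier score
```

Passing `fit` and `score` as functions keeps the resampling logic separate from the model, so the same loop can be reused with any performance metric.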
++
Current documented limitations do not invalidate the available injury severity models. Indeed, empirical validation studies provide strong evidence that all available models yield risk estimates of acceptable accuracy for groups of patients. The ongoing concerns are how to determine which model is best and how to improve available models. Several trends in recent modeling efforts provide initial answers to both questions. Models that reduce the weight given to secondary injuries relative to primary injuries,93 incorporate interactions between injuries, and utilize better body region information are examples of promising directions for improving the accuracy of outcome predictions.94,95,96
++
Multilevel modeling and methods that smooth the risk function (eg, spline regression, fractional polynomials) demonstrate directions for analytic refinement.97,98,99,100 Data simulation techniques such as multiple imputation improve the feasibility of adding physiologic variables to the current anatomic/demographic models.101 The growing access to extensive databases, improvements in analytic tools, and increased sophistication of substantive models lead to a straightforward conclusion: Today’s models are good; tomorrow’s will be better.