We will use operative deaths to illustrate the statistical treatment of early events.
The mean (point estimate) operative mortality is computed as the number of operative deaths divided by the number of patients. Multiplying by 100 converts this decimal to a percentage (P). The SE of a proportion P based on N patients equals the square root of P (1 − P)/N. Thus, as shown in Table 8-1, the percentages of patients with early death are 4.6% (SE = 0.9%) and 8.1% (SE = 1.0%) for the PREVIOUS and the FINAL valve model groups, respectively. Table 8-1 also contains the 95% CI, computed by two popular methods. The first method is the simple (asymptotic) method based on the fact that the binomial distribution, which governs proportions, can be approximated by the normal (bell-shaped) distribution as the sample size increases.9 This CI is computed easily as the point estimate plus and minus twice the SE. A second method uses the (exact) binomial distribution directly.10 Although the “exact” method sounds like it obviously would be the most desirable, there are other methods that may have better statistical properties.11
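As a minimal sketch of the calculation just described, the Python below computes the point estimate, the SE, and the simple (asymptotic) CI from the death counts reported in Table 8-3 (25 of 543 and 58 of 712). The function name is illustrative, and the multiplier of 2 follows the "twice the SE" wording above (1.96 is the more precise normal quantile).

```python
import math

def proportion_summary(deaths, patients, z=2.0):
    """Point estimate, SE, and simple (asymptotic) CI for a proportion."""
    p = deaths / patients
    se = math.sqrt(p * (1 - p) / patients)
    return p, se, (p - z * se, p + z * se)

# Death counts and group sizes as reported in Table 8-3
for label, deaths, patients in [("PREVIOUS", 25, 543), ("FINAL", 58, 712)]:
    p, se, (low, high) = proportion_summary(deaths, patients)
    print(f"{label}: {100 * p:.1f}% (SE = {100 * se:.1f}%), "
          f"CI {100 * low:.1f}% to {100 * high:.1f}%")
```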
To demonstrate a univariable comparison, we compare operative mortality between the two valve model groups. This comparison is not very interesting clinically because valve model should have little to do with operative mortality; nevertheless, many valve comparison papers attempt to draw clinical conclusions from just such questionable comparisons. Comparing two proportions gives rise to a matrix with two rows (for the two valve groups) and two columns (for the two possible outcomes) called a two-by-two contingency table. Several methods have been used to assess the significance of such tables.12 The most common method for extracting a p-value from such a matrix is the (Pearson) chi-square test. This test has an alternative, more conservative form using a continuity correction. Validity of the chi-square test depends on having an adequate sample size (technically, each cell of the table should have an expected count of at least 5), and when this is not the case, Fisher's exact test is often used. All three tests find that the FINAL valve model has significantly higher operative mortality because the p-values are smaller than .05 (see Table 8-1).
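The three tests can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the software used for Table 8-1: the function names are hypothetical, the cell counts are taken from the observed mortality figures in Table 8-3, and the two-sided Fisher p-value uses the common convention of summing all tables that are no more probable than the observed one.

```python
import math

def chi_square_2x2(a, b, c, d, continuity=False):
    """Pearson chi-square test (1 df) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(((a, b), (c, d))):
        for j, observed in enumerate(row):
            expected = rows[i] * cols[j] / n
            diff = abs(observed - expected)
            if continuity:
                diff = max(diff - 0.5, 0.0)  # continuity (Yates) correction
            stat += diff ** 2 / expected
    # Upper tail of the chi-square distribution with 1 df
    return stat, math.erfc(math.sqrt(stat / 2))

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact p-value from the hypergeometric distribution."""
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        return math.comb(r1, x) * math.comb(r2, c1 - x) / math.comb(n, c1)

    p_obs = prob(a)
    low, high = max(0, c1 - r2), min(r1, c1)
    probs = (prob(x) for x in range(low, high + 1))
    return sum(p for p in probs if p <= p_obs * (1 + 1e-7))

# Rows: PREVIOUS, FINAL; columns: operative deaths, survivors
table = (25, 543 - 25, 58, 712 - 58)
print("Pearson:", chi_square_2x2(*table))
print("Continuity-corrected:", chi_square_2x2(*table, continuity=True))
print("Fisher exact p:", fisher_exact_2x2(*table))
```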
The simple comparison above showed that operative mortality with the FINAL valve model was significantly higher than with the PREVIOUS valve model. But patients with the FINAL valve model were older and had more concomitant CABG and re-replacement operations (see Table 8-1). Could the apparent difference in operative mortality between valve models be a result of these patient characteristics, instead of the valve itself? We explore this possibility using a multivariable analysis.
For binary (dichotomous) outcomes such as operative mortality, the most common method for developing a multivariable model is logistic regression.13 In this model, operative death is the outcome (dependent variable), and patient characteristics, plus valve model, are the potential risk factors (independent variables). For technical reasons, logistic regression does not use the probability (p) of death directly as the dependent variable in the model. Instead, it uses the logarithm of the odds, p/(1 − p), of death. To facilitate interpretation of a regression coefficient (B) from such a model, the coefficient can be converted into an odds ratio (OR) by using the exponential function. Most statistical programs do this automatically, and the ORs are sometimes labeled exp(B). The 95% CI for the OR is computed as the exponential of the normal approximation CI (mean plus and minus twice the SE) for the coefficient itself.
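The conversion from coefficient to OR can be sketched as below. The coefficient and SE are hypothetical values, chosen only so that the result lands close to the OR of 2.76 (1.69, 4.52) reported for concomitant CABG; they are not taken from Table 8-2. The multiplier 1.96 is used in place of the rounded "twice" in the text.

```python
import math

def odds_ratio_with_ci(coefficient, se, z=1.96):
    """Convert a logistic regression coefficient (log odds) to an OR and its CI."""
    low = coefficient - z * se
    high = coefficient + z * se
    # Exponentiate the point estimate and both CI limits
    return math.exp(coefficient), (math.exp(low), math.exp(high))

# Hypothetical inputs (assumed for illustration, not from Table 8-2)
b_cabg, se_cabg = 1.015, 0.251
odds_ratio, (low, high) = odds_ratio_with_ci(b_cabg, se_cabg)
print(f"OR = {odds_ratio:.2f} (95% CI {low:.2f} to {high:.2f})")
```

Because the interval is symmetric on the log-odds scale, it is asymmetric around the OR after exponentiation.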
A stepwise regression program begins with a univariable test of each potential risk factor,13 using a model with a single variable to get the OR and p-value associated with that variable. If the OR is greater than 1, that variable is a risk factor (meaning that it adds to the risk). If the OR is less than 1, it is a protective factor. For the heart valve example (Table 8-2), age, concomitant CABG, and valve model are statistically significant (their p-values are less than .05). These variables, plus any others showing a trend toward association with operative mortality (usually p < .2), would be included in the next step of the stepwise logistic regression. In the final regression model, only age and concomitant CABG are still significant (see Table 8-2). After those effects are accounted for, the effect of valve model is no longer significant (p = .515). Thus, by this analysis, the apparent increase in operative mortality in the FINAL valve model group seems to be an artifact; FINAL valve model is apparently a surrogate for older age and more bypass surgery, which themselves are primarily responsible for the increased mortality. There are no doubt other clinical variables to consider in this model, but because we used the data only for demonstration purposes, not all possible variables were included. As a rule of thumb, 10 events can support one risk factor considered in a risk model.14 In our data set, there are 83 operative deaths, so we would have been justified in considering about eight risk factors. In practice, researchers would reference the published models and study their own data to select more variables for consideration.
Table 8-2 Univariable and Multivariable (Logistic Regression) Modeling of Early Mortality
|Variable|Univariable p-value|Univariable Odds Ratio|Multivariable Coefficient|Multivariable SE|Multivariable p-value|Multivariable Odds Ratio (95% CI)|
|Age (per year)| | | | | |1.04 (1.02, 1.07)|
|Concomitant CABG| | | | | |2.76 (1.69, 4.52)|
|FINAL valve model| | | | | | |
The OR of a binary variable such as concomitant CABG (2.76 in Table 8-2) means that the odds of mortality for a patient having concomitant CABG are 2.76 times those of a patient not undergoing concomitant CABG. This is the point estimate; the interval estimate (see Table 8-2) ranges from 1.69 to 4.52. When the lower limit of the 95% CI is greater than 1 (as it is for concomitant CABG, ie, 1.69), the OR will be significantly greater than 1. For a continuous variable such as age, the OR of 1.04 means that for each year of age, the odds of an operative death are multiplied by 1.04.
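Because the per-year OR acts multiplicatively, its effect compounds over larger age differences. The short sketch below (with an illustrative function name) shows what the OR of 1.04 per year implies for age differences of 10 and 20 years, assuming the effect of age is linear on the log-odds scale, as the logistic model does.

```python
def odds_multiplier(or_per_unit, units):
    """Multiplicative change in odds for a difference of `units` in a continuous factor."""
    return or_per_unit ** units

# Age: OR of 1.04 per year (Table 8-2)
for years in (1, 10, 20):
    print(f"{years} year(s) older: odds multiplied by {odds_multiplier(1.04, years):.2f}")
```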
Evaluating the Risk Model
Discrimination: Receiver Operating Characteristic (ROC) Curve
The discrimination of a risk model is its ability to separate those who will have an event from those who will not. Traditionally, discrimination is evaluated by the c-index, which is the area under the receiver operating characteristic (ROC) curve.15 This is the probability that a randomly selected patient who died will have a higher predicted risk than a randomly selected survivor. Generally, a c-index between 0.7 and 0.8 is considered acceptable discrimination, a c-index between 0.8 and 0.9 is considered excellent discrimination, and a c-index greater than 0.9 is considered outstanding discrimination.16
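The pairwise definition of the c-index translates directly into code. The sketch below compares every death with every survivor; counting tied risks as one half is a common convention assumed here, and the four-patient data set is hypothetical.

```python
def c_index(risks, died):
    """Fraction of (death, survivor) pairs in which the death has the higher predicted risk."""
    death_risks = [r for r, d in zip(risks, died) if d]
    survivor_risks = [r for r, d in zip(risks, died) if not d]
    if not death_risks or not survivor_risks:
        raise ValueError("need at least one death and one survivor")
    concordant = 0.0
    for risk_death in death_risks:
        for risk_survivor in survivor_risks:
            if risk_death > risk_survivor:
                concordant += 1.0
            elif risk_death == risk_survivor:
                concordant += 0.5  # ties count as half (assumed convention)
    return concordant / (len(death_risks) * len(survivor_risks))

# Hypothetical predicted risks and outcomes (1 = operative death)
risks = [0.10, 0.40, 0.35, 0.80]
died = [0, 0, 1, 1]
print(c_index(risks, died))
```

This brute-force version is quadratic in the number of patients, which is adequate for a data set of about a thousand patients; rank-based formulas give the same value more efficiently.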
Calibration: Hosmer-Lemeshow Statistics
Calibration is the measure of how close the predictions are to reality. For example, if 100 patients had risks of 5% from a well-calibrated model, then 5 of them would be expected to die. Calibration is evaluated by the Hosmer-Lemeshow (H-L) statistic, which computes the significance of the difference between the observed and expected events.17 If the H-L statistic is significant (p < .05), it may be a sign of poor calibration. For our final model in Table 8-2, the c-index is 0.710 (95% CI 0.653–0.767) and the H-L statistic is p = .365. These values can be considered optimistic, however, because the data used to generate the model also were used to test it. Ideally, one would use a different data set, or bootstrap resampling of the original data, to test the model.18 There are some technical issues with the H-L statistic.19,20,21 Accordingly, in the next section we introduce a visual, continuous analog of the H-L test using the CUSUM methodology.
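A minimal sketch of the H-L calculation follows, assuming the usual form of the test: patients are sorted by predicted risk and split into 10 groups of nearly equal size, observed and expected deaths and survivors are compared in each group, and the statistic is referred to a chi-square distribution with (groups − 2) degrees of freedom. The function names and the synthetic data are illustrative; the closed-form tail probability used here is valid only for even degrees of freedom.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable; closed form for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    half = x / 2
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))

def hosmer_lemeshow(risks, died, groups=10):
    """Return (statistic, p_value); risks must lie strictly between 0 and 1."""
    n = len(risks)
    if n < groups:
        raise ValueError("need at least one patient per group")
    order = sorted(range(n), key=lambda i: risks[i])
    stat = 0.0
    for g in range(groups):
        members = order[g * n // groups:(g + 1) * n // groups]
        size = len(members)
        observed = sum(died[i] for i in members)
        expected = sum(risks[i] for i in members)
        # Contributions from deaths and from survivors in this risk group
        stat += (observed - expected) ** 2 / expected
        stat += (observed - expected) ** 2 / (size - expected)
    return stat, chi2_sf_even_df(stat, groups - 2)

# Synthetic, perfectly calibrated data: 20 patients at each risk k/20 with k deaths
risks = [k / 20 for k in range(1, 11) for _ in range(20)]
died = [1 if j < k else 0 for k in range(1, 11) for j in range(20)]
print(hosmer_lemeshow(risks, died))
```

With perfectly calibrated data the statistic is essentially zero and the p-value is essentially 1; a small p-value would indicate observed deaths departing from predictions in one or more risk groups.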
Using the Model for Risk-Adjusted Provider Comparisons
The predicted (expected) mortality from logistic regression can be used to compare the risk-adjusted performance between groups of patients, eg, to compare different surgical techniques or different providers. If the ratio of observed (O) to expected (E) mortality, the O/E ratio, is greater than 1, then there are more deaths than expected by the model, and if the O/E ratio is less than 1, there are fewer deaths than expected. The CI of the O/E ratio can be calculated by using a normal approximation method, which, as usual, gives a symmetric interval around the point estimate, or by using a logarithmic transformation, which provides a more appropriate asymmetric interval.22,23 Table 8-3 contains these values for our heart valve example. The CIs for the O/E ratios for both groups include 1, which means that their risk-adjusted mortalities are not significantly different from those predicted by the model.
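The two kinds of interval can be illustrated as follows. This sketch makes a simplifying assumption that is not stated in the text: the observed death count is treated as a Poisson variable and the expected count as fixed, so that SE(O/E) = √O/E and SE(ln O/E) = 1/√O. The cited methods may differ in detail, so the resulting limits are illustrative only. The observed and expected counts are those of Table 8-3.

```python
import math

def oe_ratio_ci(observed, expected, z=1.96):
    """O/E ratio with a symmetric (normal) CI and an asymmetric (log-transformed) CI."""
    ratio = observed / expected
    # Normal approximation on the ratio scale (Poisson assumption for observed deaths)
    se = math.sqrt(observed) / expected
    normal_ci = (ratio - z * se, ratio + z * se)
    # Normal approximation on the log scale, then back-transformed
    se_log = 1 / math.sqrt(observed)
    log_ci = (ratio * math.exp(-z * se_log), ratio * math.exp(z * se_log))
    return ratio, normal_ci, log_ci

for label, observed, expected in [("PREVIOUS", 25, 27.5), ("FINAL", 58, 55.5)]:
    ratio, normal_ci, log_ci = oe_ratio_ci(observed, expected)
    print(f"{label}: O/E = {ratio:.2f}, "
          f"normal CI {normal_ci[0]:.2f} to {normal_ci[1]:.2f}, "
          f"log CI {log_ci[0]:.2f} to {log_ci[1]:.2f}")
```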
Table 8-3 Ratio of Observed to Expected Early Mortality (O/E Ratio)
| |PREVIOUS valve model|FINAL valve model|
|Observed mortality|25/543 = 4.6%|58/712 = 8.1%|
|Expected mortality|27.5/543 = 5.1%|55.5/712 = 7.8%|
|95% confidence interval| | |
|95% confidence interval| | |
Another method to compare the risk-adjusted performance between groups is the OR, which is technically more suitable. The OR is the ratio of the observed odds, O/(1 − O), to the expected odds, E/(1 − E), where O and E are the observed and expected mortality expressed as proportions. An OR of 1 indicates that death is as likely to occur as predicted; an OR greater than 1 indicates that death is more likely to occur than predicted; an OR less than 1 indicates that death is less likely to occur than predicted. The CI of the OR can be calculated by using a likelihood-based method or, more easily, as an output from the logistic regression.24
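The point estimate of this OR follows directly from the observed and expected mortality in Table 8-3, as the sketch below shows. The function names are illustrative, and the CI is not computed here because it requires the likelihood-based or regression-based methods mentioned above.

```python
def odds(proportion):
    """Odds corresponding to a probability or proportion."""
    return proportion / (1 - proportion)

def observed_expected_or(observed_deaths, expected_deaths, patients):
    """OR of observed to expected mortality for one group of patients."""
    return odds(observed_deaths / patients) / odds(expected_deaths / patients)

# Observed and expected deaths from Table 8-3
print(f"PREVIOUS: OR = {observed_expected_or(25, 27.5, 543):.2f}")
print(f"FINAL:    OR = {observed_expected_or(58, 55.5, 712):.2f}")
```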
Cumulative sum (CUSUM) analysis methods are often used to examine the performance of a provider across time, by plotting the cumulative sum of observed minus expected events as a function of surgery date.25 For a data set whose observed mortality exactly fits the expected, the line would lie along the horizontal line y = 0. When the CUSUM lies below the y = 0 line, it means fewer deaths were observed than were expected, and when the CUSUM lies above the y = 0 line, it means more deaths were observed than were expected. When the CUSUM is going up, it means the performance is getting worse than expected; when the CUSUM is going down, it means the performance is getting better than expected. Thus CUSUM can be used to detect a learning curve.26 The 95% prediction limits (point-wise 95% confidence intervals) account for the excursions from y = 0 that could be expected to happen by chance.27
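A minimal sketch of the CUSUM calculation is given below. It assumes the patients are already sorted by surgery date and that each has a predicted risk from the model. The point-wise limits use a normal approximation with the variance of a sum of independent binary outcomes, Σ p(1 − p), which is an assumption of this sketch and may differ from the cited method. The data are hypothetical.

```python
import math
from itertools import accumulate

def cusum_with_limits(died, risks, z=1.96):
    """Cumulative observed-minus-expected deaths and point-wise 95% limits around y = 0."""
    cusum = list(accumulate(d - r for d, r in zip(died, risks)))
    # Variance of the running sum of independent 0/1 outcomes
    variances = accumulate(r * (1 - r) for r in risks)
    limits = [z * math.sqrt(v) for v in variances]
    return cusum, limits

# Hypothetical outcomes (1 = death) and predicted risks, in order of surgery date
died = [0, 0, 1, 0, 0, 0, 1, 0]
risks = [0.05, 0.10, 0.20, 0.05, 0.08, 0.03, 0.15, 0.04]
cusum, limits = cusum_with_limits(died, risks)
for value, limit in zip(cusum, limits):
    print(f"{value:+.2f} (limits ±{limit:.2f})")
```

Sorting the patients by another variable before calling the function changes only the horizontal axis of the resulting plot, not the calculation.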
CUSUM can be used for other purposes by using different variables for the x-axis. When the dependent (outcome) variable is dichotomous (death), it is difficult to appreciate graphically its relationship to a continuous risk factor, eg, age. The CUSUM25 can be used to overcome this difficulty by plotting the CUSUM against age to give us a graphic view. This technique can also be used to examine the fit of a model, by plotting the cumulative sum of observed minus predicted deaths as a function of predicted mortality (Fig. 8-1). For a model whose observed mortality exactly fits the expected, the line would lie along the horizontal line y = 0. When the horizontal axis equals the predicted risk, the CUSUM could be thought of as a continuous version of the H-L test of model calibration, which is based on the differences in observed minus expected deaths in each of the 10 deciles of risk, shown by shaded vertical bars in Fig. 8-1.
Figure 8-1 CUSUM plot of operative death. The vertical axis is the cumulative sum of observed deaths minus the deaths predicted by the logistic regression model in Table 8-2. The horizontal axis is scaled in number of patients (ordered by the predicted risk), so it is nonlinear in predicted risk of death. The blue/white bars each contain 10% of the patients.