In previous sections of the chapter we briefly defined study designs. The classic hierarchy of study designs was based on their ability to decrease bias and confounding and ranked research designs in the following order: (1) Systematic reviews and meta-analyses; (2) RCTs with confidence intervals (CIs) that do not overlap the threshold of clinically significant effect; (3) RCTs with point estimates that suggest clinically significant effects but with overlapping CIs; (4) Cohort studies; (5) Case-control studies; (6) Cross-sectional surveys; and (7) Case reports.33 Currently, however, several investigators and organizations recognize that most clinical trials fail to provide the evidence needed to inform medical decision making.34 Thus we must use the best research design available to define the best evidence.35
Randomized clinical trials are considered the most unbiased design because random group assignment provides an unbiased treatment allocation and often (but not always) results in a similar distribution of confounders across study arms. Ultimately, the goal of randomization is “to ensure that all patients have the same opportunity or equal probability of being allocated to a treatment group.”36 Random allocation means that each patient has an equal chance of being assigned to each experimental group, and the assignment cannot be predicted for any individual patient.37 However, emergency trauma research poses difficulties for randomization, as enrollment is time sensitive and the interventions must be made available without delay.38,39 In emergency research trials, the patients are not recruited; they are enrolled as they suffer an injury, in a completely random fashion. If effective, randomization creates groups of patients at the start of the study with similar prognoses; therefore, the trial results can be attributed to the interventions being evaluated.40
How to conduct effective randomization is a challenge in emergency and trauma research. Conventional randomization schemes (eg, sealed envelopes with computer-generated random assignment) can impose unethical delays in providing treatment. Another major obstacle to complex randomization schemes in emergency research is adherence to protocol; thus, alternative schemes (eg, prerandomization in alternate weeks) have been proposed.41 In a recent trial, we randomized severely injured patients for whom a massive transfusion protocol was activated to two groups: (1) viscoelastic (thrombelastography)-guided, goal-directed massive transfusion or (2) conventional coagulation assays (eg, prothrombin time, etc) and balanced blood product ratios, on predefined alternating weeks.42 The system was formidably successful in producing comparable groups at baseline. Another example is the Prehospital Acute Neurological Treatment and Optimization of Medical care in Stroke Study (PHANTOM-S), published in 2014, in which patients were randomly assigned to weeks with and without availability of a Stroke Emergency Mobile.43 These alternative randomization approaches are recognized as appropriate in emergency research.38
Adaptive designs, including adaptive randomization, have been proposed to make trials more efficient.44,45 A 2015 draft guidance document from the Food and Drug Administration defines an adaptive design clinical study as “a study that includes a prospectively planned opportunity for modification of one or more specified aspects of the study design and hypotheses based on analysis of data (usually interim data) from subjects in the study.”46 For example, the recently published PROPPR trial used an adaptive design to grant its Data and Safety Monitoring Board (DSMB) authority to increase the sample size and reach adequate power.47 The initial sample size (n = 580) was planned to detect a clinically meaningful difference of 10 percent points in 24-hour mortality based on previous evidence. The DSMB recommended increasing the sample size to 680 based on the results of an interim analysis.
In sum, the research design must be appropriate to the research question, ethical, and valid, both internally and externally. Internal validity refers to the extent to which the results of the study are free from bias and confounding. In the next sections, we will discuss bias and confounding in more detail, but basically it comes down to one question: Is the association between outcome and effect reported in the study real? How much of it may be due to bias and/or confounders? External validity, on the other hand, reflects the extent to which the study's findings are generalizable.
Primary and Secondary Data
When researchers use data collected for the purpose of addressing the research question, they are using primary data. On the other hand, in the era of “Big Data,” it is common to see the use of secondary data, collected for purposes unrelated to the specific research question. There are undeniable advantages to secondary datasets, as they are usually large and inexpensive. Yet they have had mixed results when used for tasks such as risk adjustment.48,49 Administrative datasets collected for billing purposes, for example, are often influenced by financial considerations, which can favor overcoding or undercoding, and the number of recorded diagnoses may be capped or disincentivized because of declining marginal returns in billing.49,50
In addition, medical coding and clinical practices are subject to change over time for a variety of reasons. For example, the coding of “illegal drugs” upon trauma admission is likely to change in unpredictable ways in states where cannabis became a legal, recreational drug. Another glaring example relates to the collection and coding of the social construct variable “race and ethnicity,” which has changed dramatically over the past few decades.51 Comorbidities may be misdiagnosed as complications and vice versa. To address this problem, since 2008 most hospitals report a “present on admission” (POA) indicator for each diagnosis in their administrative data as a means to distinguish hospital-acquired conditions from comorbidities.50 Of course, using data from before and after such modifications (eg, the inclusion of a POA code) were implemented affects the internal validity of the study. Whenever longitudinal data are used in a study, especially if covering long periods of time, it is important to verify whether there were changes in data collection, health policies, regulations, etc that can potentially affect the data.
Sometimes the distinction between primary and secondary data becomes blurry, as happens in registries such as state-mandated and hospital-based trauma registries or the National Trauma Data Bank (NTDB), a voluntary, national trauma dataset maintained by the American College of Surgeons. These datasets were developed to provide a comprehensive epidemiological characterization of trauma, thus one can assume that when the research question is related to the frequency, risk factors, treatments, and prognosis of trauma, these represent legitimate primary data. However, registries may lack the granularity to address specific study hypotheses; for example, analyzing the effects of early transfusions of blood components on coagulation-related deaths. In addition, low-volume hospitals may not contribute enough to aggregate estimates, biasing mortality estimates toward high-volume facilities. The Center for Surgical Trials and Outcomes Research (Department of Surgery, The Johns Hopkins School of Medicine, Baltimore) has made a commendable effort documenting differences in risk adjustment and, more important, providing standardized analytic tools to improve risk adjustment and decrease low-volume bias in studies using the NTDB.52,53,54
Hypothesis: Significance and Power
All studies that use a statistical test, even purely descriptive studies, have hypotheses. That is because a statistical test is based on a hypothesis. Every hypothesis can be placed in the following format:
Variable X distribution in Group A
is different (or not different) from
Variable X distribution in Group B
Despite its simplicity, this is a widely applicable model for constructing hypotheses.55 It sets the stage for elements that must be included in the methods section. The authors must define what characterizes Group A and Group B and what makes them comparable (aside from Variable X). Variable X, the variable of interest, must be defined in a way that allows the reader to completely understand how the variable is measured. The hypotheses should be defined using the above-mentioned PICO framework. For example, “we hypothesize that adult trauma patients (P) receiving pharmacoprophylaxis for venous thromboembolism (I) will have fewer venous thromboembolisms (O) than patients not receiving pharmacoprophylaxis (C).”
The commonly reported p-value is the probability of obtaining the observed effect (or a larger one) under the null hypothesis that there is no effect.56 Colloquially, we can interpret the p-value as the probability that the finding was the result of chance.57 The p-value is the chance of committing what is called a type 1 error, that is, wrongfully rejecting the null hypothesis (ie, accepting a difference when in reality there is none). Equally important is what the p-value is not: “how sure one can be that the difference found is the correct difference.”
Significance is the level of the p-value below which we consider a result “statistically significant.” It has become a worldwide convention to use the 0.05 level, although it is completely arbitrary, not based on any objective data, and, in fact, inadequate in several instances. It was suggested initially by the famous statistician Ronald Fisher, who later rejected it and proposed that researchers report the exact level of significance.58
The readers will often see in research articles that p-values were “adjusted for multiple comparisons,” resulting in significance being set at p-values smaller than the traditional 0.05 threshold. This is one of the most controversial issues in biostatistics, with experts debating the need for such adjustment.59,60,61 Those who defend the use of multiple comparisons adjustments claim that multiple comparisons increase the chances of finding a p-value less than 0.05 and inflate the likelihood of a type 1 error.60 Those who criticize their use argue that these adjustments lead to type 2 errors, that is, failing to find a difference when indeed there is one.59,60,61 One of the authors of this chapter (Angela Sauaia) recalls her biostatistics professor claiming that, if multiple comparisons indeed increased type 1 error, then biostatisticians should stop working at the age of 40, as any differences after then would be significant just by chance. Our recommendation is: if the hypotheses being tested were predefined (ie, before seeing the data), then multiple comparisons adjustment is probably unnecessary; however, if hypothesis testing was motivated by trends in the data, then it is possible that even the most restrictive multiple comparison adjustment will not be able to account for the immense potential for investigator bias. In the end, readers should check the exact p-values and make their own judgment about a significance threshold based on the impact of the condition under study. Lethal conditions for which there are few or no treatment options may warrant looser significance cutoffs, while, at the other end of the spectrum, benign diseases with many treatment options may demand stricter significance values.
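For readers who want to see the mechanics, the Bonferroni correction, one of the simplest and most conservative adjustments, just multiplies each raw p-value by the number of comparisons (capping the result at 1). A minimal sketch in Python, with made-up p-values for illustration:

```python
def bonferroni(pvals):
    """Bonferroni adjustment: multiply each raw p-value by the
    number of comparisons, capping the result at 1.0."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

raw = [0.010, 0.020, 0.040]        # hypothetical raw p-values
adjusted = bonferroni(raw)         # roughly [0.03, 0.06, 0.12]
significant = [p < 0.05 for p in adjusted]
# After adjustment, only the first comparison stays below 0.05.
```

Note how two findings that were “significant” at the raw 0.05 threshold no longer are after adjustment, which is exactly the trade-off the two camps debate.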
A special case of multiple comparisons is the interim analysis, ie, a preplanned, sequential analysis conducted during a clinical trial. These analyses are almost obligatory in contemporary trials because of cost and ethical factors. The major rationale for interim analyses relies on the ethics of holding subjects hostage to a fixed sample size when a new therapy is potentially harmful, overwhelmingly beneficial, or futile. Interim analyses allow investigators, upon the advice of an independent Data Safety Monitoring Board (DSMB), to stop a trial early due to efficacy (the tested treatment has already proven to be of benefit), futility (the treatment-control difference is smaller than a predetermined value), or harm (treatment resulted in some harmful effect).62 For example, Burger et al, in their RCT testing the effect of prehospital hypertonic resuscitation after traumatic hypovolemic shock, reported that the DSMB stopped the study on the basis of potential harm in a preplanned subgroup analysis of non-transfused subjects.63
The 95% CI, a concept related to significance, means that, if we were to repeat the experiment multiple times and, each time, calculate a 95% CI, 95% of these intervals would contain the true effect. A more informal interpretation is that the 95% CI represents the range within which we can be 95% certain that the true effect lies. Although the calculation of the 95% CI is closely related to the process used to obtain the p-value, CIs provide more information on the degree of uncertainty surrounding the study findings. There have been initiatives to replace the p-value with the 95% CI, but these have met with much resistance; most journals now require that both be reported.
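The repeated-sampling interpretation can be demonstrated with a short simulation. The sketch below (a toy example, not data from any study cited in this chapter) draws 1,000 samples from a known distribution, builds a 95% CI around each sample mean, and counts how often the interval captures the true mean; the proportion lands near 95%.

```python
import math
import random

random.seed(1)  # reproducible toy simulation

TRUE_MEAN, SD, N, REPS = 0.0, 1.0, 30, 1000
covered = 0
for _ in range(REPS):
    sample = [random.gauss(TRUE_MEAN, SD) for _ in range(N)]
    mean = sum(sample) / N
    half_width = 1.96 * SD / math.sqrt(N)  # known-sigma z-interval
    if mean - half_width <= TRUE_MEAN <= mean + half_width:
        covered += 1

coverage = covered / REPS  # close to 0.95
```

Any single interval either contains the true mean or it does not; the 95% refers to the long-run behavior of the procedure, which is what this simulation makes visible.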
For example, in the CRASH-2 trial, a randomized controlled study on the effects of TXA in bleeding trauma patients, the TXA group showed a lower death rate (14.5%) than the placebo group (16.0%), with a p-value of 0.0035.64 We can interpret this p-value as: “There is a 0.35% chance that the difference in mortality rates was found by chance.” The authors also reported the effect size as a relative risk of 0.91 with a 95% CI of 0.85–0.97. This means, in a simplified interpretation, that we can be 95% certain that the true relative risk lies between 0.85 and 0.97. Some people prefer to interpret the effect as an increase, in which case one just needs to calculate the inverse: 1/0.91 = 1.10; 95% CI: 1.03–1.18. If the authors had not provided the 95% CI, you could visit a free, online statistics calculator (eg, www.vassarstats.net) to easily obtain the 95% CI of the difference of 1.5 percent points: 0.5–2.5 percent points. In an abridged interpretation, we can be 95% certain that the “true” difference lies between 0.5 and 2.5 percent points. Incidentally, notice that we consistently use “percent points” to indicate that this is an absolute (as opposed to relative) difference between two percentages.
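The arithmetic behind such a calculator is simple. The sketch below reproduces the 95% CI for the absolute difference using the normal approximation; because we quote only the total enrollment here, the per-arm sizes are assumed to be roughly equal halves of the 20,211 patients (the exact arm sizes appear in the trial report).

```python
import math

# CRASH-2 death rates; per-arm sizes assumed ~equal (illustrative)
p_txa, p_placebo = 0.145, 0.160
n_txa = n_placebo = 10105          # assumed: about half of 20,211 each

diff = p_placebo - p_txa           # 0.015, ie, 1.5 percent points
se = math.sqrt(p_txa * (1 - p_txa) / n_txa
               + p_placebo * (1 - p_placebo) / n_placebo)
lo, hi = diff - 1.96 * se, diff + 1.96 * se
# lo and hi round to 0.005 and 0.025: a 0.5 to 2.5 percent-point CI
```

With arms this large the normal approximation is very accurate, which is why the hand calculation matches the online calculator.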
The example above reminds us that statistical significance does not necessarily mean practical or clinical significance. Small effect sizes can be statistically significant if the sample size is very large. The above-mentioned CRASH-2 trial, for example, enrolled 20,211 patients.64 P-values are related to many factors extraneous to whether the finding was by chance or not, including the effect size (larger effect sizes usually produce smaller p-values), sample size (larger sample sizes often result in significant p-values), and multiple comparisons (especially when unplanned and driven by the data).65
In addition, we must make sure that the study used appropriate methods for hypothesis testing. Statistical tests are based on assumptions, and if these assumptions are violated, the tests may not produce reliable p-values. Many tests (eg, t-test, ANOVA, Pearson correlation) rely on the normality assumption. The Central Limit Theorem (the distribution of the average of a large number of independent, identically distributed variables will be approximately normal, regardless of the underlying distribution) and the law of large numbers (the sample mean converges to the distribution mean as the sample size increases) are often invoked to justify the use of parametric tests to compare non-normally distributed variables. However, these allowances apply to large (n > 30) samples; gross skewness and small sample sizes (n < 30) will render parametric tests inappropriate. Thus, if the data are markedly skewed, as is often the case with the number of blood product units transfused, length of hospital stay, and viscoelastic measurements of fibrinolysis (eg, clot lysis in 30 minutes), or the sample size is small (as is often the case in basic science experiments), nonparametric tests (eg, Wilcoxon rank-sum, Kruskal-Wallis, Spearman correlation, etc) or appropriate transformations (eg, log, Box-Cox power transformation) to approximate normality are more appropriate. More on this topic appears in the section on sample descriptors.
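To make the transformation idea concrete, the toy sketch below computes moment-based sample skewness for a right-skewed series of doubling values (the sort of shape transfusion counts often take) before and after a log transformation; in this contrived example the log scale removes the skew entirely.

```python
import math

def skewness(xs):
    """Moment-based sample skewness: m3 / m2^(3/2)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

raw = [1, 2, 4, 8, 16]                 # strongly right-skewed toy data
logged = [math.log2(x) for x in raw]   # 0, 1, 2, 3, 4 -- symmetric

print(skewness(raw))     # clearly positive (about 0.89)
print(skewness(logged))  # essentially zero
```

Real clinical data will rarely become perfectly symmetric after a transformation, but checking skewness before and after is a quick way to judge whether a parametric test on the transformed scale is defensible.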
Statistical power is the counterpart to the p-value. It relates to type 2 error, or the failure to reject a false null hypothesis (a “false negative”). Statistical power is the probability of rejecting the null hypothesis when there is actually a difference, that is, the probability of detecting a true difference. Despite its importance, it is one of the most neglected aspects of research articles. Most studies are superiority studies, that is, the researchers are searching for a significant difference. When a difference is not found in a superiority study, there are two alternatives: to declare “failure to find a significant difference” or to report a power analysis to determine how confident we can be in declaring the interventions (or risk factors) under study indeed equivalent. The latter alternative is more appealing when it is preplanned as an equivalence or noninferiority trial rather than an afterthought in a superiority study.
Whether statistical power is calculated beforehand (ideally) or afterwards (better than not at all), it must always contain the following four essential components: (1) power: usually 80% (another arbitrary cutoff), (2) confidence: usually 95%, (3) variable value in the control group/comparator, and (4) difference to be detected. For example, Burger et al stated in the above mentioned prehospital hypertonic resuscitation RCT63:
The study was powered to detect a 4.8% overall difference in survival (from 64.6% to 69.4%) between the NS group and at least 1 of the 2 hypertonic groups. These estimates were based on data from a Phase II trial of similar design completed in 2005.8 There was an overall power of 80% (62.6% power for individual agent) and 5 planned interim analyses. On the basis of these calculations a total sample size of 3726 patients was required.
We call attention to how the difference to be detected was based on previous evidence. The difference to be detected is chiefly a clinical decision based on evidence. Power should not be calculated from the observed difference: we determine the appropriate difference first and then obtain the power to detect such a difference. The basic formula relating sample size to power is shown below:

N = 2σ²(Za/2 + Zb)² / difference²

where N is the sample size in each group (assuming equal sizes), σ is the standard deviation of the outcome variable, Zb represents the desired power (0.84 for power = 80%), Za/2 represents the desired level of statistical significance (1.96 for alpha = 5%), and “difference” is the proposed, clinically meaningful difference between means. There are free, online power calculators; however, they are usually meant for simple calculations. Power analysis can become more complex when we need to take into account multiple confounders (covariates), in which case the correlation between the covariates must be considered, or when cluster effects (explained in more detail later) exist, in which case an inflation factor, dependent on the level of intracluster correlation and the number of clusters, is used to estimate the sample size.
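Plugging numbers into the sample-size formula described above takes only a few lines. The example below, with an assumed σ of 1 and a clinically meaningful difference of half a standard deviation, yields the classic textbook answer of 63 subjects per group for 80% power at a two-sided alpha of 5%.

```python
import math

def n_per_group(sigma, difference, z_alpha=1.96, z_beta=0.84):
    """Sample size per group for comparing two means:
    N = 2 * sigma^2 * (Za/2 + Zb)^2 / difference^2,
    rounded up to the next whole subject."""
    return math.ceil(2 * sigma**2 * (z_alpha + z_beta)**2 / difference**2)

# Detect a difference of 0.5 SD with 80% power, two-sided alpha = 5%
print(n_per_group(sigma=1.0, difference=0.5))  # 63 per group
```

Notice how the required N grows with the square of 1/difference: halving the difference to be detected quadruples the sample size, which is why the choice of a clinically meaningful difference dominates the calculation.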
Equivalence and noninferiority studies are becoming much more common in the era of comparative effectiveness.66,67 Their null hypothesis assumes that there is a difference between arms, while for the more common superiority trials the null hypothesis assumes that there is no difference between groups. A noninferiority trial seeks to determine whether a new treatment is not worse than a standard treatment by more than a predefined margin of noninferiority for one or more outcomes (or side effects or complications). Noninferiority trials and equivalence trials are similar, but equivalence trials are two-sided studies, where the study is powered to detect whether the new treatment is not worse and not better than the existing one. Equivalence trials are not common in clinical medicine.
In noninferiority studies, the researchers must prespecify the difference they intend to detect, known as the noninferiority margin, irrelevant difference, or clinically acceptable amount.66 Within this specified noninferiority margin, the researcher is willing to accept the new treatment as noninferior to the standard treatment. This margin is determined by clinical judgment combined with statistical factors and is used in the sample size and power calculations. For example, is a difference of 4% in infection rates between two groups large enough to sway your decision about antibiotics? Would a difference of 10% make you think they are different? Or do you need something smaller? These decisions are based on clinical factors such as the severity of the disease and the variation of the outcomes.
The USA Multicenter Prehospital Hemoglobin-based Oxygen Carrier Resuscitation Trial is an example of a dual superiority/noninferiority assessment trial.68 The noninferiority hypothesis assumed that patients in the experimental group would have no more than a 7% higher mortality rate compared with control patients, based on the available medical literature. The rationale for the noninferiority question was that the blood substitute product would be used in scenarios in which blood products were needed but not available or permissible. Incidentally, this is another example in which an adaptive power analysis was performed after enrollment of 250 patients to ensure that no increase in the trial size was necessary.
Reviews of the quality of noninferiority trials have shown major problems.69,70 Essential elements to be included in reports of this type of study are: (1) hypotheses that specify the noninferiority margin and explain the rationale for that choice; (2) whether the participants in the noninferiority trial were similar to those included in the previous trials that established the efficacy of the control treatment; (3) whether the control treatment is in fact similar to the treatment tested in the efficacy trials; and (4) whether secondary outcomes were tested for noninferiority or superiority.67
We often hear the comment that a small study with a significant difference may lack power. This is incorrect. Once a study has a significant difference, questions about statistical power are irrelevant.
Association and Causation
Most studies will describe associations between outcomes and effects of interest. Whether these associations represent a cause-effect relationship is often an issue. A useful tool to make this determination is the set of nine criteria proposed by Sir Austin Bradford Hill in 1965, which still proves to be useful as a guideline.71,72
Strength of the association: This criterion does not equate to the size of the p-value; rather, it refers to the effect size. Large effect sizes are more likely to be causal. For example, in the initial studies on risk factors for MOF, there was a strong, independent association between transfusions of red blood cells (RBC) and post-injury MOF, which triggered further investigations on the specific causative, harmful role of RBCs.73,74,75
Consistency: The results have been replicated under different conditions by independent investigators.
Specificity: The effect of interest is associated with a specific outcome rather than a wide range of outcomes. Its presence can help the case for a causal effect, but its absence does not discard it, as most outcomes have multifactorial, interdependent causes.
Temporality: There is a clear temporal relationship in which the effect of interest precedes the disease. Although this may seem obvious, it is important that we take into account how the outcomes and the effect are measured. For example, although lung failure seems to precede liver and renal dysfunction in the development of post-injury MOF, the tests used to assess the function of these organs have different sensitivities to levels of organ damage. PaO2/FIO2 ratios may detect early, mild levels of pulmonary failure, while bilirubin and creatinine only rise after substantial organ derangement.76,77 Thus, the temporal relationship may be unclear.
Biological gradient or dose response: Increasing/decreasing exposure is associated with increasing/decreasing risk of disease. This is a powerful criterion when measurements are accurate. In the above mentioned studies on the association of RBC and MOF, a dose-response relationship was observed both in the direction of higher MOF incidence associated with larger number of RBC units transfused,78 and also with the observed decreased incidence of MOF when judicious transfusion practices dramatically limited the amount of RBCs transfused post-injury.75
Plausibility: Although novel findings may not fit this criterion, when there is a proposed scientific mechanism that can explain the association, the case for causation is strengthened.
Coherence: The association is consistent with what is known about the disease. Of course, once again, novel observations may not fit this criterion. Yet there is no denying that if the reported finding does not fit with current knowledge, there is a tendency toward a healthy skepticism.
Experimental evidence: Hill proposed that “Causation is more likely if evidence is based on randomized experiments.”71
Analogy: In the presence of previous evidence of a causal effect by one class of agent (eg, RBC and MOF), we are more likely to accept causation when another agent from the same class (eg, plasma) is implicated as a risk factor for the same outcome. However, as Rothman and Greenland indicated, we must be careful in the application of this criterion: “Whatever insight might be derived from analogy is handicapped by the inventive imagination of scientists who can find analogies everywhere. At best, analogy provides a source of more elaborate hypothesis about the associations under study; absence of such analogies only reflects lack of imagination or lack of evidence.”79
These criteria are to be used as guidelines, as Sir Bradford Hill himself wrote: “None of my nine viewpoints can bring indisputable evidence for or against the cause-and-effect hypothesis and none can be required as a sine qua non.”71
Physicians can appreciate the acronym CBC as a mnemonic for the three main reasons for a spurious association: Chance, Bias and Confounding. Chance is dealt with by statistical testing, while appropriate designs and analytic techniques can assist with eliminating or minimizing bias and confounding.
Bias is the deviation of results due to systematic errors in the research methods. Although there are several different names for biases, two types seem to capture most of the biases presented in the surgical literature: (1) selection bias, which occurs when the study groups differ systematically in some way or when the study sample differs from the study population; and (2) observer/information bias, which occurs when there are systematic differences in the way information is collected for the groups being studied. The article by del Junco et al80 on the “seven deadly sins in trauma outcomes research” is an excellent review of some of the most common biases.
There are several types of selection bias that commonly appear in the trauma literature. One of the most common is data missing not at random. The treatment of missing data has become the focus of much attention recently. Indeed, the Food and Drug Administration (FDA) requested that the National Academy of Sciences create a Panel on the Handling of Missing Data in Clinical Trials. The panel was charged with proposing appropriate study designs and follow-up methods to reduce missing data and appropriate statistical methods to address missing data in the analysis of results.81 Although their focus was on RCTs, the report is very informative for other study designs as well.
The most important things to consider about missing data are (1) the proportion of patients with missing data; and (2) whether the data are missing at random, thus not biasing the results in a significant way, or whether there is some pattern that can bias the results. Finding out that data are missing not at random (MNAR) does not mean that the study is automatically flawed, but appropriate statistical methods must be used to deal with them. If the proportion of missing data is high and MNAR, one cannot ignore it and proceed to analyze only the complete data without further consideration. Let us illustrate this with a little story: the father of one of the authors (Sauaia) was a physician interested in congenital heart defects. He was conducting a population-based study of the incidence of such defects in school-age children and visited several schools, screening children for heart defects. Some children were, of course, absent that day and were not screened. At the end of the day, this young researcher considered the absent children (his missing data) for a moment and, wanting complete data, decided to visit them at home. Not surprisingly, some of the absent children did in fact have a heart defect. This made sense, as these ill children were more likely to miss school because of symptoms or medical appointments.
Missing data not at random in trauma occur for two radically different reasons: (1) patients are too sick to have the test (eg, died early, intravenous access not possible, chaotic trauma scene, etc), in which case adverse outcomes are common; or (2) patients are not sick enough to justify the test (ie, early discharge, hemodynamically stable, not on mechanical ventilation, etc), in which case adverse outcomes are rare. For example, in the late 1990s, we developed predictive models for post-injury MOF and observed that lactate was a significant independent predictor of MOF.73,82 As expected, lactate measurement was available only for the group of severely injured patients. We then addressed the missing data using two analyses: the first included only patients for whom lactate was measured; the second included all patients and used a “missing indicator” for unavailable lactate levels (ie, each patient was assigned one of three possible values for lactate: missing, normal, or abnormal). The results of the two analyses were remarkably similar, which increased the strength of the findings. In a more recent example, Odom and colleagues83 addressed this issue in their study on the value of lactate as a predictor of trauma mortality. They astutely observed that the selection bias created by the missing lactate values would bias their results toward the null hypothesis rather than the positive effect they found.
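The “missing indicator” approach described above can be coded in a few lines. In the sketch below, the 2.0 mmol/L cutoff separating normal from abnormal lactate is an illustrative assumption, not the threshold used in the cited studies.

```python
def lactate_category(value, cutoff=2.0):
    """Three-level indicator: missing / normal / abnormal.
    The 2.0 mmol/L cutoff is illustrative only."""
    if value is None:
        return "missing"
    return "abnormal" if value > cutoff else "normal"

# Hypothetical admission lactate values (None = not measured)
lactates = [None, 1.4, 3.8, None, 2.1]
categories = [lactate_category(v) for v in lactates]
# ['missing', 'normal', 'abnormal', 'missing', 'abnormal']
```

The three-level variable can then enter a regression model directly, so that patients with missing values contribute to the analysis instead of being dropped.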
Many methods are available to deal with missing data, such as the “last value carried forward.” Albeit highly criticized, this technique has its place in longitudinal studies when most variables are not missing and there is high predictability of the missing values.84 For example, we used this technique to impute the values of daily liver function tests, carrying the last obtained value forward until a subsequent result was available.85
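The technique itself is only a few lines of code. A minimal sketch (values before the first observation are left missing; the lab series is hypothetical):

```python
def locf(values, missing=None):
    """Last value carried forward: fill each gap with the most
    recent observed value; leading gaps stay missing."""
    out, last = [], missing
    for v in values:
        if v is not missing:
            last = v
        out.append(last)
    return out

daily_bilirubin = [0.7, None, None, 1.9, None]  # hypothetical lab series
print(locf(daily_bilirubin))  # [0.7, 0.7, 0.7, 1.9, 1.9]
```

The simplicity is also the weakness: the method assumes the value truly stayed flat over the gap, which is exactly why it is defensible only when missing values are few and highly predictable.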
More sophisticated techniques, such as multiple imputation by chained equations (MICE), are becoming popular. In simple words, this method imputes missing values based on regression equations derived M times, followed by an analysis of each imputed dataset and finalized by a combination of the M analyses. Although multiple imputation functions better for data missing at random, it has been shown to provide good estimates even in certain cases when data are MNAR. For example, Moore and colleagues from Canada tested the use of MICE to impute values of the Glasgow Coma Scale in 2005, and again in 2015 for the Glasgow Coma Scale, respiratory rate, and systolic blood pressure for a model to evaluate quality of trauma care with good results.86,87
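The impute-analyze-pool skeleton of multiple imputation can be sketched in a few lines. The toy example below is not full MICE: it handles a single incomplete variable with one regression equation (so there is nothing to “chain”) and pools only the point estimates, omitting Rubin's variance-combination rules; the data are invented.

```python
import random

random.seed(42)

# Toy data: y is missing (None) for two subjects; x is complete
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, None, None]

# Step 1: fit a least-squares line on the complete cases
cc = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
n = len(cc)
mx = sum(xi for xi, _ in cc) / n
my = sum(yi for _, yi in cc) / n
slope = (sum((xi - mx) * (yi - my) for xi, yi in cc)
         / sum((xi - mx) ** 2 for xi, _ in cc))
intercept = my - slope * mx
resid_sd = (sum((yi - (intercept + slope * xi)) ** 2
                for xi, yi in cc) / n) ** 0.5

# Steps 2-4: impute M times (prediction + random residual draw),
# analyze each completed dataset, then pool the estimates
M = 20
means = []
for _ in range(M):
    filled = [yi if yi is not None
              else intercept + slope * xi + random.gauss(0, resid_sd)
              for xi, yi in zip(x, y)]
    means.append(sum(filled) / len(filled))

pooled_mean = sum(means) / M  # pooled estimate of the mean of y
```

The random residual added to each imputed value is what distinguishes multiple imputation from simply plugging in the regression prediction: it preserves the natural variability of the data rather than artificially shrinking it.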
Another important type of selection bias, especially in trauma and emergency care, is survivor bias. This occurs when the individual does not survive long enough to have the “opportunity” to receive the complete intervention. These early nonsurvivors contribute to increasing the mortality rate of the group not receiving the intervention, artificially inflating the effect of the intervention. The studies on fixed, balanced blood product ratios (1:1:1 RBCs:plasma:platelets) are well-known examples of this problem.80,88,89 Survival analysis, which analyzes time to event and allows for censoring patients who died before experiencing the intervention, is a helpful technique to deal with this problem. However, truth be told, one can never know what would have happened to nonsurvivors had they survived long enough to get the intervention.
One can also add a time-varying covariate to the survival analysis, which is a variable that, as the name says, varies over time. For example, the ratio of blood products varies hour to hour during the dynamic resuscitation period. If a patient receives 6 RBC units and 3 units of plasma in the first 3 hours and no blood products between hour 4 and hour 6, the patient's RBC:plasma ratio at hour 6 is 2:1, which is exactly the same hour 6 ratio as someone who receives 6 RBC units in the first 3 hours and 3 units of plasma from hour 4 to the end of hour 6 (a “catch-up” practice). The big difference here is that the first patient experienced the 2:1 ratio at all times, while the second had an initial ratio of 3:0 followed by a ratio of 0:3. It is quite possible that the outcomes are different for these two extremes. Using a time-varying covariate, we can actually express the changes in RBC:plasma ratio hourly. The best solution, however, to deal with survivor bias is an RCT. Indeed, the PROPPR trial, an RCT testing the effect of fixed, balanced blood product ratios, failed to find a difference despite excellent statistical power, contradicting the results of previous observational studies that found the intervention beneficial.90
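The hourly ratio arithmetic described above can be sketched in a few lines of Python, with hypothetical unit counts chosen to mirror the two patients in the example:

```python
def cumulative_ratio(rbc_by_hour, plasma_by_hour):
    """Return the cumulative RBC:plasma ratio at the end of each hour."""
    ratios = []
    rbc_total = plasma_total = 0
    for rbc, plasma in zip(rbc_by_hour, plasma_by_hour):
        rbc_total += rbc
        plasma_total += plasma
        # Before any plasma is given the ratio is undefined (effectively "all RBC").
        ratios.append(rbc_total / plasma_total if plasma_total else float("inf"))
    return ratios

# Patient 1: 2 RBC + 1 plasma in each of hours 1-3, nothing in hours 4-6.
p1 = cumulative_ratio([2, 2, 2, 0, 0, 0], [1, 1, 1, 0, 0, 0])
# Patient 2: 2 RBC in each of hours 1-3, then 1 plasma in each of hours 4-6.
p2 = cumulative_ratio([2, 2, 2, 0, 0, 0], [0, 0, 0, 1, 1, 1])
```

Both patients end hour 6 at a 2:1 ratio, yet their hour-by-hour trajectories differ; feeding these hourly values into a survival model as a time-varying covariate preserves exactly that difference.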
Some authors exclude early deaths from the analysis to deal with survivor bias. This may be a solution, but it limits the generalizability, as the study findings apply only to patients who survive the acute post-injury period. Other types of commonly encountered selection biases are: loss to follow-up not at random (eg, patients failed to return to follow-up visits due to long-term injury-related complications), refusal to participate or withdrawal due to side effects or invasiveness of the intervention, consent not obtained due to traumatic brain injury, etc. This is a good reminder to always read the inclusion and exclusion criteria to determine whether they resulted in selection bias. Common exclusion criteria that may limit the generalizability of the investigation are advanced age, comorbidities, early deaths, and incomplete data. It is important to emphasize that the findings apply only to the population that fits both the inclusion and exclusion criteria.
Another type of common bias in surgical research is intervention bias, defined as using an intervention to define a population or group as opposed to preintervention risk factors. Notable examples of interventions used to define groups to be compared are massive transfusion versus no massive transfusion, or Resuscitative Endovascular Balloon Occlusion of the Aorta (REBOA) versus resuscitative thoracotomy. Bias is smaller if the indications for the intervention are well established and there is minimal variation among providers, but it becomes larger when the intervention is given “at the attending’s discretion” and there is large variation among providers.
Moving on to the final C of the CBC acronym: confounding, that is, a third variable responsible for all or part of the association between two other variables. The most quoted example is the association of coffee drinking and lung cancer, which is confounded by the real culprit, smoking, a variable associated with both coffee drinking and lung cancer. To be considered a confounder, a variable must be associated with both the outcome and the effect of interest (ie, a risk factor or an intervention). In trauma, injury severity, age, and comorbidities are frequent confounders of the association between an intervention and the outcome.
There are two ways to deal with confounders: (1) research designs, including RCT or matching; and (2) analytic tools such as stratified analysis (when there are few confounders) or multivariate analyses in the case of multiple confounders. These techniques will be discussed in a subsequent section of this chapter. The first table of articles describing RCTs usually shows the distribution of potential confounders between the control and experimental groups. Appropriately so, readers should not find p-values in this table, as any differences in the distribution of these confounders are, by definition, a result of chance. Randomization, in general, increases the likelihood of similar distribution of confounders, which then are not associated with group membership, allowing us to assess the effect of the intervention (the main difference between the two groups) on the outcome. Single blinding (only the patient) or double blinding (both the patient and the researcher) are powerful additions that improve the likelihood that the researchers’ and patients’ own biases do not interfere with the design. Surgical and emergency interventions, however, are often not amenable to blinding.91
Matching is the alternative option to RCTs in research design to deal with confounders. This can be accomplished via traditional matching when only a few, known risk factors exist, or via propensity score matching (PSM), which we mentioned in a previous section. In PSM, a multivariate model including potential reasons for receiving the intervention is used to derive a propensity score, that is, the probability of receiving the intervention. In this model, the “outcome” or “dependent variable” is the intervention. Using this propensity score, we can choose matching control patients, that is, patients who did not receive the intervention despite having the same propensity score as the patients who received the intervention. The downside of this procedure is that the reason why patients did not receive the intervention despite having the same propensity score is often unknown, or due to unmeasured variables that may themselves have an effect on the outcomes. Finally, matching limits the ability of the investigators to examine the effects of the matching variables; once used in the matching, a variable is no longer available for analysis. Therefore, it is important that the investigators are certain that the matching variable is of no interest in the analysis of the outcomes.
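As an illustration of the matching step only, a minimal greedy nearest-neighbor matcher with a caliper might look like the sketch below. The propensity scores here are hypothetical numbers, not the output of a fitted model; real analyses would first estimate them with, eg, a logistic regression of the intervention on its candidate predictors.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical propensity scores (probability of receiving the intervention).
treated_ps = np.array([0.81, 0.42, 0.65, 0.90])
control_ps = rng.uniform(0.1, 0.95, 50)

def nearest_neighbor_match(treated, controls, caliper=0.05):
    """Greedy 1:1 matching on propensity score, without replacement."""
    available = list(range(len(controls)))
    pairs = []
    for i, ps in enumerate(treated):
        j = min(available, key=lambda k: abs(controls[k] - ps))
        if abs(controls[j] - ps) <= caliper:  # only accept close matches
            pairs.append((i, j))
            available.remove(j)               # each control used at most once
    return pairs

pairs = nearest_neighbor_match(treated_ps, control_ps)
```

The caliper (here an arbitrary 0.05) caps how dissimilar a matched pair may be; treated patients with no control inside the caliper go unmatched, which is itself a form of selection the reader should watch for.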
Interactions and Effect Modification
In confounding, variable A is responsible for part or all of the association between variables B and C, and we want to adjust for it (in other words, minimize its effect) to be able to assess the association between B and C. There is little or no interest in any effects mediated by variable A. Conversely, in effect modification or interaction, variable A modifies the association between B and C. This type of association must be described, not adjusted for, as it provides important information about the mechanism underlying the association of B and C. Thus, when appraising multivariate models, the reader should make sure pertinent interactions were tested and, if significant, described appropriately.
An example of an important interaction can be found in our study on the effect of pre-injury antiplatelet therapy on post-injury outcomes.92 We showed that antiplatelet therapy modified the relationship between blood product transfusions and mortality.92 Specifically, as shown in Fig. 63-1, among patients who were taking antiplatelet agents prior to the injury, the odds of mortality associated with RBCs transfused were lower than among patients not receiving this medication.
Example of an important effect modification (or interaction) in the study by Harr et al investigating the effect of pre-injury antiplatelet therapy on post-injury outcomes.92 These investigators showed that antiplatelet therapy modified the relationship between blood product transfusions and mortality, decreasing the risk associated with requirement for transfusions.
Descriptive statistics, such as mean and standard deviation, median and interquartile range, frequency, and percentages, are used to provide the reader with the best possible description of the sample. Since the reader does not have access to the raw data, it is up to the authors to provide readers with a clear picture of what the sample looked like. More important, this picture should allow the readers to use the PICO framework and compare the study’s sample to the population to whom they intend to apply the findings. Are they similar enough that one may directly apply the findings, or are they older, younger, more severely injured, etc?
Categorical variables, such as sex and blunt versus penetrating mechanism, are expressed as frequency (N) and percentages. Continuous variables, such as age, injury severity score (ISS), Glasgow Coma Scale, length of stay, ventilator time, etc, can be described in different ways. Central tendency of the data is usually described by means or medians. When the variable distribution is normal and symmetric, the median and the mean are identical. However, when the variable is skewed or there are outliers, the median, rather than the mean, is the better descriptor. As shown in Fig. 63-2, using data from a multicenter prospective study of severely injured patients, the distributions of systolic blood pressure at the emergency department and age were approximately normal, thus the median and mean both reflect the “typical patient.” On the other hand, the mean does not describe well the typical number of RBC units transfused in the first 12 hours or the length of stay in the intensive care unit.
Distribution of variables commonly reported in trauma and surgery research. Note how means and medians differ in ICU (Intensive care unit) days and RBC 12 hours (number of red blood cell units in the first 12 hours post-injury), which are skewed, compared to age or ED SBP (emergency department systolic blood pressure), which are closer to normal. (Data from the Glue Grant, analysis by the authors.)
Data dispersion for normally distributed variables is usually represented by the standard deviation (68% of the data should be contained within mean ±1 standard deviation, 95% of the data within mean ±2 standard deviations), while for skewed data we most commonly use the interquartile range (the lower and upper quartile; 50% of the data are contained within this interval) or the range (maximum and minimum values). Box plots, shown in Fig. 63-3, are excellent ways to represent the data central tendency and dispersion.
Example of a box plot showing the amount of plasma (FFP) units given in the first (h1), second (h2) and third hour (h3) post-injury in two study groups (C: resuscitation guided by conventional coagulation tests; TEG: resuscitation guided by thromboelastography).42 The length of the box represents the interquartile range (the distance between the 25th and 75th percentiles); the symbol in the box interior, the group mean; the horizontal line inside the box interior, the group median; and the vertical lines (“whiskers”) the minimum and maximum values (range). Note the difference in mean and median, suggesting skewness.
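A quick simulation shows why the mean describes a roughly normal variable well but misleads for a skewed one; the distributions below are hypothetical stand-ins for age and ICU days, not the Glue Grant data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Roughly normal variable (eg, age) vs a right-skewed one (eg, ICU days).
age = rng.normal(45, 15, 1000)
icu_days = rng.exponential(5, 1000)  # many short stays, a few very long ones

for name, x in [("age", age), ("icu_days", icu_days)]:
    mean, median = x.mean(), np.median(x)
    q1, q3 = np.percentile(x, [25, 75])  # interquartile range for skewed data
    print(f"{name}: mean={mean:.1f} median={median:.1f} IQR=({q1:.1f}, {q3:.1f})")
```

For the normal variable the mean and median nearly coincide; for the skewed one the mean is pulled well above the median by the long right tail, so the median and IQR are the more honest summary.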
Studies usually present bivariate analyses (sometimes called univariate) first. These consist of unadjusted, crude comparisons between two (or more) groups of subjects. The main point for readers in this part of the article is to pay attention to variables that can be confounders, that is, variables with different distribution between the groups (thus associated with group membership) that are also associated with the outcome. This difference does not need to be statistically significant, as the sample may be small and thus prone to a type 2 error. Deciding whether a variable may be playing a confounder role is a clinical, not a statistical, decision. After examining this table, the reader can decide which variables are potential confounders.
Analytic techniques that minimize confounding are stratification and multivariate analysis. Stratification may work when there are just a few risk factors and a large sample size. For example, in order to account for the confounding effect of smoking in the association between coffee drinking and lung cancer, one may stratify the analysis on smoking status and observe whether the association between coffee drinking and lung cancer holds within each stratum. Multivariate analyses are basically an advanced form of stratification done on multiple variables. There are several types of multivariate models depending on the distribution of the outcome. Binary outcomes (eg, death yes or no) are commonly addressed using logistic regression. Categorical outcomes with more than two strata can be analyzed with polytomous logistic regression. When time to event is of interest or there is need to censor data (eg, patients who died or are discharged before experiencing the outcome of interest), survival analysis is an option.
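The coffee, smoking, and lung cancer example can be worked through with hypothetical counts: the crude (pooled) comparison suggests an association, but the stratum-specific relative risks are 1.0, unmasking smoking as the confounder.

```python
# Hypothetical 2x2 counts per stratum: (cases, noncases).
# Within each smoking stratum coffee has no effect; smoking drives both.
strata = {
    "smokers":    {"coffee": (30, 70), "no_coffee": (15, 35)},   # risk 0.30 both
    "nonsmokers": {"coffee": (5, 95),  "no_coffee": (10, 190)},  # risk 0.05 both
}

def risk(cases, noncases):
    return cases / (cases + noncases)

# Crude (pooled) risks suggest an association between coffee and cancer...
coffee_cases = sum(s["coffee"][0] for s in strata.values())
coffee_total = sum(sum(s["coffee"]) for s in strata.values())
nocoffee_cases = sum(s["no_coffee"][0] for s in strata.values())
nocoffee_total = sum(sum(s["no_coffee"]) for s in strata.values())
crude_rr = (coffee_cases / coffee_total) / (nocoffee_cases / nocoffee_total)

# ...but the stratum-specific relative risks are 1.0: confounding by smoking.
stratum_rrs = [risk(*s["coffee"]) / risk(*s["no_coffee"]) for s in strata.values()]
```

Here the crude relative risk is 1.75 while both stratum-specific relative risks equal 1.0, which is precisely the signature of confounding that stratification is designed to reveal.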
Linear regression assumes that the outcome is continuous, has a distribution not too far from normal, and has, as the name says, a linear relationship with the covariates. When these assumptions are not applicable (eg, the outcome is categorical or too skewed), we may apply a larger category of models, named generalized linear models. Please note that there is a difference between generalized linear models and general linear models. The economic analysis by Schwartz and colleagues93 on delays in laparoscopic cholecystectomy is an example of a study using a generalized linear model. The generalized linear models are a broad class of models that include logistic regression, Poisson regression, log-linear models, gamma-distribution models (as used in the above-mentioned article), and others.
The description of generalized linear models usually includes the term “link,” which is a function linking the actual outcome Y to the estimated Y in a model. In simple words, it is the transformation done to the outcome variable to convert it to continuous. In a linear regression model, the link is the identity, that is, the estimated and the actual outcome are expressed the same, and no transformation is needed. In a logistic regression, the transformation or link used is called the logit and the distribution is binomial, that is, a yes or no type of variable. For the gamma model (used in the Schwartz et al paper93), the link is the log, and the distribution is right-skewed with a variation that increases with the mean. This type of model is often used in econometrics because it fits the distribution of cost in health care, that is, care for most patients results in little costs, but a few patients require very costly treatments. In the log-linear and the Poisson regression, the link is a log and the distribution is the Poisson distribution. Poisson regression is usually applied to count data, for example, number of trauma deaths over a period of time, as seen in a 2013 article by Kahl et al assessing time-trends in annual trauma mortality rates.94
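A minimal sketch of the most common link, the logit, may make the idea concrete; the numbers are purely illustrative.

```python
import math

# The "link" maps the mean of the outcome onto the unbounded linear-predictor scale.
def logit(p):
    """Logistic regression link: probability -> log-odds."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Inverse link: linear predictor (log-odds) -> probability."""
    return 1 / (1 + math.exp(-x))

# A fitted linear predictor of, say, -1.5 corresponds to a probability:
p = inv_logit(-1.5)
```

The identity link of linear regression leaves the outcome untransformed, whereas the log link used by Poisson and gamma models maps strictly positive means onto the whole real line, which is what lets the linear machinery apply to counts and skewed costs.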
Each multivariate model has its own set of assumptions, and although many are robust to limited violations, this is an important consideration when appraising the article. For example, the Cox proportional hazards regression model, a type of survival analysis, requires, as the name says, that the hazards are proportional, that is, that they do not vary relative to each other over time. Variation over time is not an insurmountable problem; it can be tested for and remedied by introducing time-varying covariates into the model.
Another assumption in regression models relates to the shape of the association between the outcome and the risk factor. Is it linear? If linear, is it a straight line or U-shaped? The computer software used for linear regression will draw a straight line unless we add what is called a quadratic term (or second-order polynomial), in which case it will draw a U-shaped line (the U can face up or down). For example, in a recent study on fibrinolysis, we showed that the association of fibrinolysis and mortality was U-shaped, with higher mortality associated with very low (fibrinolysis shutdown) and very high (hyperfibrinolysis) levels of fibrinolysis and a mortality nadir with moderate fibrinolysis levels.95 We showed a similar U-shaped relationship between RBC:plasma ratios and mortality, with mortality peaking at both low and high ratios and the lowest mortality associated with medium ratios.96 Higher-order polynomials are seldom used.
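A quadratic term can be illustrated with simulated, noiseless data; fitting a second-order polynomial recovers the U shape and its nadir (the fibrinolysis scale and coefficients below are invented for the illustration, not taken from the cited study).

```python
import numpy as np

# Simulated U-shaped relation: mortality lowest at moderate fibrinolysis.
fibrinolysis = np.linspace(0, 10, 50)
mortality = 0.02 * (fibrinolysis - 5) ** 2 + 0.10  # true nadir at 5

# A straight line would miss the U; a quadratic term captures it.
c2, c1, c0 = np.polyfit(fibrinolysis, mortality, 2)  # second-order polynomial
nadir = -c1 / (2 * c2)                               # vertex of the parabola
```

A positive coefficient on the squared term (`c2 > 0`) gives a U facing up; a negative one gives an inverted U. The nadir formula is just the vertex of the fitted parabola.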
Most multivariate models assume independence between observations. In other words, if one cannot predict a subject’s outcome based on the outcome of other subjects, their outcomes are said to be independent. This is not completely true for patients within the same center, or even the same surgeon. Patients within a center tend to have similar outcomes, violating the independence assumption. This similarity between patients of the same center or same provider is named cluster effect and it should be accounted for in the statistical modeling and in sample size calculations. The larger the correlation between subjects within a cluster, the larger is the required sample size. This correlation can be assessed by the intraclass correlation coefficient, which measures how similar patients are within centers and how different they are from patients in other centers.97
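A one-way intraclass correlation coefficient can be sketched from the between-center and within-center mean squares; the toy data below assume a balanced design (equal patients per center) and are purely illustrative.

```python
import numpy as np

def icc_oneway(groups):
    """ICC(1): between-center vs within-center variance (balanced design)."""
    k = len(groups[0])                         # patients per center
    means = np.array([np.mean(g) for g in groups])
    grand = np.mean(np.concatenate(groups))
    msb = k * np.sum((means - grand) ** 2) / (len(groups) - 1)  # between-center MS
    msw = np.mean([np.var(g, ddof=1) for g in groups])          # within-center MS
    return (msb - msw) / (msb + (k - 1) * msw)

# Centers with very different means: patients cluster strongly by center.
high_icc = icc_oneway([[10, 11, 12], [20, 21, 22], [30, 31, 32]])
# Centers with near-identical means: little clustering.
low_icc = icc_oneway([[10, 20, 30], [11, 21, 31], [12, 22, 32]])
```

A high ICC means patients within a center resemble each other much more than patients across centers, which inflates the required sample size; values near zero (or slightly negative, as sampling noise allows) indicate the independence assumption is reasonable.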
Another important aspect of multivariate models is their performance, which can be evaluated in several ways, depending on the model. The R-square gauges how much of the variation of the outcome is explained by the model. We must keep in mind that in medicine, given the multifactorial nature of most diseases and clinical scenarios, it is uncommon to see large R-squares (>0.30). The R-square should be accompanied by a p-value indicating the probability that an R-square of the observed magnitude would arise by chance alone. Together they compose an assessment of the model’s performance.
Model discrimination, that is, the ability of the model to discriminate individuals with and without the outcome (eg, survivors and nonsurvivors), is often evaluated by the Area Under the Receiver Operating Characteristic Curve or AUROC (also known as the c-statistic). ROC curves were derived from studies on radar and sonar detection during World War II to ascertain the best radar setting to distinguish enemy airplanes from harmless targets (eg, flocks of birds). As you can see in Fig. 63-4, which shows the AUROC for thromboelastography values as predictors of massive transfusion, the Y-axis shows the sensitivity (eg, % of deaths predicted), while the X-axis depicts [1 – Specificity] (eg, [1 – % of survivors correctly predicted to survive]). The AUROC is a good assessment of the overall accuracy of the model, and it has been arbitrarily categorized into: 0.90–1 = excellent; 0.80–0.90 = good; 0.70–0.80 = fair; 0.60–0.70 = poor; 0.50–0.60 = fail. AUROCs should be accompanied by 95% CIs, to allow the reader to determine whether they are significantly different from the 0.50 no-value AUROC, and to compare different AUROCs (if the CIs do not overlap, the AUROCs are significantly different). For example, our group conducted a comparison between the Denver MOF score and the Multiple Organ Dysfunction Syndrome score (MODS) as predictors of death using AUROCs.98 The Denver MOF score’s AUROC was 0.88 (95% CI: 0.84–0.92) while the MODS’s AUROC was 0.86 (95% CI: 0.82–0.89), suggesting that both scoring systems performed similarly well.
Example of a receiver-operating-characteristic (ROC) curve to compare the predictive power of the thromboelastography (TEG) values (ACT, activated clotting time; MA, maximum amplitude; angle; and LY30, fibrinolysis) for massive transfusion defined as ≥4 units of red blood cells in the first hour post-injury (excludes deaths within 1 hour post-injury). The area under the ROC curve (AUROC) for the angle was 0.79 (95% CI: 0.69–0.89), MA: 0.78 (95% CI: 0.68–0.89), Ly30: 0.71 (95% CI: 0.56–0.85) and ACT: 0.70 (95% CI: 0.57–0.83).98 Note that the 95% CIs overlap, suggesting no difference in prediction performance.
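The AUROC has a convenient interpretation: the probability that a randomly chosen patient with the outcome receives a higher predicted risk than a randomly chosen patient without it (its Mann-Whitney U interpretation). This sketch computes it directly from hypothetical risk scores, not from the study data.

```python
def auroc(scores_pos, scores_neg):
    """AUROC as the probability a random positive outscores a random negative
    (ties count half) -- the Mann-Whitney U interpretation."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical predicted risks for nonsurvivors (positives) and survivors.
dead = [0.9, 0.8, 0.7, 0.4]
alive = [0.6, 0.3, 0.2, 0.1, 0.05]
area = auroc(dead, alive)
```

A model with no discrimination scores 0.5 (a coin flip); perfect separation scores 1.0. The toy scores above give 0.95, because a single nonsurvivor is out-scored by one survivor.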
ROC curves may also be used to determine the best threshold (to discriminate between normal/abnormal values) for a continuous variable. There are several methods to define the best cutoff, such as the Youden Index (aka Youden J statistic), defined as J = sensitivity + specificity –1 and ranging from 0 (no discrimination) to 1 (complete discrimination). The Youden Index corresponds to the maximum vertical distance between the ROC curve and the diagonal, noninformative line. Another common measure is the upper leftmost point. Although these points represent optimal combination of sensitivity and specificity, researchers may decide to privilege sensitivity or specificity depending on the condition. Certain conditions demand more sensitivity (eg, when one does not want to miss a case due to potential disastrous consequences) while others may require more specificity (eg, when the consequences of missing a case are minimal, but the cost of the test is high). In Fig. 63-4, the upper leftmost point in the ROC for the TEG variable angle was 65° and the Youden Index was 60°, thus researchers may choose a cutoff between these two values to define an optimal cutoff to predict the need for massive transfusion.
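Computing the Youden Index over a set of candidate cutoffs is straightforward; the operating points below are hypothetical, not the TEG angle data from Fig. 63-4.

```python
def youden_cutoff(thresholds, sensitivities, specificities):
    """Return the cutoff maximizing J = sensitivity + specificity - 1."""
    js = [se + sp - 1 for se, sp in zip(sensitivities, specificities)]
    best = max(range(len(js)), key=js.__getitem__)
    return thresholds[best], js[best]

# Hypothetical operating points along a ROC curve for a continuous test.
cutoffs = [50, 55, 60, 65, 70]
sens = [0.95, 0.90, 0.80, 0.65, 0.40]
spec = [0.30, 0.50, 0.72, 0.85, 0.95]
best_cut, j = youden_cutoff(cutoffs, sens, spec)
```

Here the cutoff of 60 maximizes J; a researcher privileging sensitivity would instead slide the cutoff toward 50, accepting more false positives to miss fewer cases.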
Calibration is another important assessment of model performance. It gauges the model’s ability to correctly estimate the probability of an event. It is commonly assessed using the Hosmer-Lemeshow statistic, which compares the predicted and observed rates within deciles. In this statistic, the larger the p-value (ie, the lower the likelihood of disparity between predicted and observed rates), the better the model’s calibration. One must be careful, however, with large sample sizes, when the p-value can be small (ie, significant) even for models that are not poorly calibrated.99
Multivariate models are often over-fitted, that is, have more variables than appropriate. Although there is software to calculate how many variables one can have in a multivariate model, a good rule of thumb, used even by expert statisticians, is 10 subjects with the outcome per variable in the model (not 10 subjects per variable, but 10 subjects with the outcome). When the number of confounders is higher than this threshold, an alternative is to use the above-mentioned propensity score, this time as an adjustment. Thus, instead of 20 variables, one has only one propensity score, representing the combined effect of the 20 variables. For example, Brown and colleagues used this approach in a study on the use of prehospital (PH) crystalloids in patients with and without PH hypotension. A propensity score was used to adjust the mortality comparison between the groups receiving high versus low volumes of PH crystalloids. Instead of using five covariates (PH time, PH blood, PH SBP, ISS, and initial base deficit), the authors had a single variable, a propensity score ranging from 0 to 1.100
There is controversy whether multivariate adjustment for confounders is better than propensity score.101 The answer seems to be: it depends. When the sample is large (>8–10 patients with the outcome per confounder), the multivariate method provides more information as one can assess the effect of individual confounders. Conversely, the propensity score is better when the sample is small.
When using a multivariate model for confounder adjustment, it is important to compare the unadjusted effect and the adjusted (with confounders) effect. For example, if the unadjusted relative risk of a given variable is 2.0 (95% CI 1.5–2.5) and, after adjustment for confounders, the adjusted relative risk is 1.2 (95% CI 0.8–1.6, a CI that now includes 1.0), one can conclude that confounders were responsible for much or all of the effect seen in the unadjusted analysis.
Principal Components and Cluster Analyses
So far, we have discussed how to analyze values of specific variables. Sometimes, however, researchers need to analyze the variables themselves. This is especially true when there are numerous variables, and researchers may wish to examine patterns and combinations of variables using factor analysis. A special case of factor analysis is principal components analysis (PCA), in which the factors or components are combinations of variables that do not correlate, thus representing independent components. Two examples in the trauma literature may help illustrate the use of PCA. In the first example, the San Francisco group used PCA to define three uncorrelated groups of coagulation assays (prothrombin; factors V, VII, VIII, IX, X; D-dimer; activated and native protein C; and antithrombin III levels): component 1 was defined as global clotting factor depletion; component 2 corresponded to the activation of protein C and fibrinolysis; and component 3 to factor VII elevation and factor VIII depletion.102 The authors reported that component 1 predicted mortality and INR/PTT elevation, while component 2 (fibrinolytic coagulopathy) predicted infection, end-organ failure, and mortality. A second example is seen in a similar study by our group, this time using TEG values.103 As shown in Table 63-2, PCA generated a number of components, each of which contains a number of variables, each associated with a factor loading (which can be interpreted much like a simple correlation coefficient, ranging from –1 to 1). In our analysis of TEG variables, three components were responsible for 93% of the variation in the data. Component 1 included K, angle, maximum amplitude, maximal rate of thrombus generation, and total thrombus generation; component 2 included activated clotting time and time to maximal rate of thrombus generation; component 3 reflected fibrinolysis.
Taken together, both studies supported the conclusion that trauma-induced coagulopathy has distinct, independent mechanisms, which may require tailored hemostatic therapeutic approaches.
TABLE 63-2 Principal Components Analysis using thromboelastogram (TEG) values in trauma patients. Shaded cells represent the variables that had a loading greater than |60| in that component.103

| Composition of Principal Components (PC) | PC 1 | PC 2 | PC 3 |
| --- | --- | --- | --- |
| % variance explained by component | 63% | 17% | 13% |
| Activated clotting time | –30 | 90 | 6 |
| K | –95 | 15 | –4 |
| Angle | 92 | –26 | 5 |
| Maximum amplitude | 95 | –15 | –10 |
| % lysis at 30 min | –4 | 0 | 99 |
| Time to maximal rate of thrombus generation | –13 | 95 | –5 |
| Maximal rate of thrombus generation | 94 | –23 | 3 |
| Total thrombus generation | 94 | –14 | –16 |
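For readers who want to see the mechanics, PCA reduces to a singular value decomposition of the standardized data matrix. This sketch simulates four assays driven by two latent processes; the variable names and coefficients are hypothetical, not the TEG panel above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated assay panel: 2 latent processes drive 4 correlated variables.
n = 200
clotting = rng.normal(size=n)   # latent "clot strength" process
lysis = rng.normal(size=n)      # latent "fibrinolysis" process
X = np.column_stack([
    clotting + 0.1 * rng.normal(size=n),
    0.9 * clotting + 0.1 * rng.normal(size=n),
    lysis + 0.1 * rng.normal(size=n),
    -0.8 * lysis + 0.1 * rng.normal(size=n),
])

# PCA via SVD of the standardized (centered, unit-variance) data matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
explained = s**2 / np.sum(s**2)  # proportion of variance per component
loadings = Vt                    # rows = components, cols = variables
```

Because only two latent processes generate the data, the first two components capture nearly all the variance, mirroring how three TEG components explained 93% of the variation in the study above.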
PCA also has assumptions, including normal distribution and more subjects than variables. Therefore, this technique is not applicable to many of the “omics” studies (ie, genomics, proteomics, and metabolomics), where the number of variables far exceeds the number of subjects. In this case, an option is Projection to Latent Structures (PLS, also known as partial least squares) analysis, commonly followed by a discriminant analysis (PLS-DA). Although a description of the technique is beyond the scope of the chapter, the principles of PLS are similar to those of the PCA explained above. The PLS will result in combinations of variables, which can then be used to predict outcomes.
Cluster analysis, on the other hand, groups subjects into clusters based on their responses. One can use the factors or components derived from the PCA or the PLS to define the clusters. Cluster analysis is not really a statistical tool, rather it is a mathematical model to group subjects based on their responses. In a recent example, our group used cluster analysis to group patients according to PCA-derived TEG components (Fig. 63-5). Patients in Cluster 2 (hypocoagulable with no fibrinolysis) were more severely injured (higher ISS), but had a lesser degree of shock compared to patients in Clusters 3 and 4. Patients in Cluster 4 required significantly more blood products and had an increased hemorrhage-related mortality rate compared to the other groups.
Cluster analysis: trauma patients plotted according to ACT/angle/MA (X-axis) and fibrinolysis (Y-axis). FLYSIS, fibrinolysis; Nl, normal.
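Cluster analysis can be illustrated with a plain k-means sketch on simulated two-dimensional "component" data; a real analysis would feed in the PCA- or PLS-derived components, and the group locations here are invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def kmeans(X, k, iters=50):
    """Plain k-means: assign each point to its nearest centroid, update, repeat."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels, centroids

# Two well-separated hypothetical patient groups in component space.
X = np.vstack([rng.normal(0, 0.5, size=(30, 2)),
               rng.normal(5, 0.5, size=(30, 2))])
labels, centroids = kmeans(X, 2)
```

With well-separated groups the algorithm recovers the two clusters regardless of the random start; in practice the number of clusters k is itself a modeling choice, usually guided by clinical interpretability of the resulting groups.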