*Good doctors use both individual clinical expertise and the best available external evidence, and neither alone is enough. Without clinical expertise, practice risks becoming tyrannized by evidence, for even excellent external evidence may be inapplicable to or inappropriate for an individual patient. Without current best evidence, practice risks becoming rapidly out of date, to the detriment of patients.*

*Sackett et al, 1996*

Evidence-based medicine (EBM) is “the conscientious, explicit, and judicious use of current best evidence in making decisions about the care of individual patients.”^{1,2} The phrase evidence-based medicine was coined by a group of physicians at McMaster University in Hamilton, Ontario, in the early 1990s.^{1,3} EBM is a combination of clinical expertise and best evidence, as eloquently stated by Sackett and colleagues in the above quotation.^{1} The EBM definition has two basic components. The first is “the conscientious, explicit, and judicious use,” which, in trauma, applies to split-second decisions in the face of an immense variety of unexpected clinical scenarios. The time pressure and the irreversibility of many surgical procedures heighten the anxiety of decision making.^{4}

The second component of the EBM definition is “best evidence.” What constitutes “best evidence”? Best evidence is “clinically relevant research,”^{1} which can either invalidate previously accepted procedures or replace them with new methods that are more powerful, efficacious, safe, and cost containing.^{5} In simple words, it comes down to “How does the article I read today change (or not) how I treat my patients tomorrow?”^{6}

Searching for evidence usually starts with the formulation of a searchable clinical question. For this purpose, the PICO framework is very helpful.^{7} PICO stands for patient (P), intervention (I), comparator (C), and outcome (O). When searching for evidence, or determining whether the most recently read article applies to the patient in front of you, quickly assess whether the PICO elements observed in the study are similar enough to your setting to allow application of the study results to the individual patients in your practice.

Once you determine whether a published report applies to your practice, critical appraisal is the next step. Critical appraisal of the internal and external validity of a scientific report is essential to filter the information that can improve individual patient care as well as contribute to an efficient, high-quality health care system. Critical appraisal of an article requires an inquisitive and skeptical mindset combined with a basic understanding of scientific and statistical methods.^{8} Many would say that scientific and statistical methods are only for researchers; yet these methods are essential to becoming an excellent consumer of research and an EBM practitioner. This chapter will provide a basic review of methods for searching, critically appraising, and applying evidence, including an introduction to statistical topics essential to a reliable interpretation of the trauma literature.

Best evidence rarely comes from a single study; more often it is the final step of a long scientific journey, in which experts collect, appraise, and summarize the findings of several individual studies using a specific, standard methodology.^{1} Systematic reviews, such as those published by the Cochrane Collaboration, and collections of evidence-based readings (eg, Selected Readings in General Surgery at https://www.facs.org/publications/srgs, which includes selections in Trauma) are excellent ways to distill the daunting amount of information available. When first learning about a topic, systematic reviews are an excellent first step. These reviews are commonly accompanied by a level of evidence, which gauges the confidence in the summative collection of research on a specific topic.

Yet a systematic review is not always available; thus, how will busy health care providers manage the formidable volume of information that becomes available every day, much of which is contradictory? Appraisal is the answer. There are several systems to appraise and grade the level of evidence provided by the literature. For example, the GRADE (Grades of Recommendation, Assessment, Development and Evaluation)^{9} system follows a detailed stepwise process to rate evidence (http://www.gradeworkinggroup.org/news, accessed May 30, 2015), which is summarized later in this chapter. The *Journal of Trauma and Acute Care Surgery* recently adapted the GRADE system to gauge the level of uncertainty of individual articles as shown in Table 63-1. Their system retains study design as a major factor in the classification, but recognizes that each type of clinical question (therapeutic, diagnostic accuracy, etc) demands different types of study designs and tolerates different levels of uncertainty.^{10,11}

**Table 63-1** Types of Studies

| Level | Therapeutic/care management | Prognostic and epidemiological | Diagnostic tests or criteria | Economic and value-based evaluations | Systematic reviews and meta-analyses | Guidelines |
|---|---|---|---|---|---|---|
| Level I | RCT with no negative criteria^{a} and with a significant difference; RCT with no negative criteria^{a} without significant difference and adequate statistical power^{b} | Prospective^{c} study with large effect^{d} and no negative criteria^{a} | Testing of previously developed diagnostic criteria in consecutive patients (all compared to “gold” standard) | Sensible costs and alternatives; values obtained from many sources; multiway sensitivity analyses | Systematic review (SR) or meta-analysis (MA) of predominantly level I studies and no SR/MA negative criteria^{e} | GRADE system |
| Level II | RCT with significant difference and only 1 negative criterion^{a}; prospective^{c} comparative study with no negative criteria^{a} | Prospective^{c} study with less than large effect^{d} and no negative criteria^{a}; untreated controls from RCT; prospective/retrospective^{c} study with large effect^{d} and only 1 negative criterion^{a} | Development of diagnostic criteria on consecutive patients (all compared to “gold” standard) | Sensible costs and alternatives; values obtained from limited sources; multiway sensitivity analyses | SR/MA of predominantly level II studies with no SR/MA negative criteria^{e} | |
| Level III | Case-control study with no negative criteria^{a}; prospective^{c} comparative study with only 1 negative criterion^{a}; retrospective^{c} comparative study with no negative criteria^{a} | Case-control study with no negative criteria^{a}; prospective or retrospective^{c} study with up to 2 negative criteria^{a} | Nonconsecutive patients (without consistently applied “gold” standard) | Analyses based on limited alternatives and costs; poor estimates | SR/MA with up to 2 negative criteria^{e} | |
| Level IV | Prospective or retrospective^{c} study using historical controls or having more than 1 negative criterion^{a} | Prospective or retrospective^{c} study with up to 3 negative criteria^{a} | Case-control study with no negative criteria^{a} | No sensitivity analyses | SR/MA with more than 2 negative criteria^{e} | |
| Level V | Case series; studies with quality worse than level IV | Case series; studies with quality worse than level IV | No or poor “gold” standard | | | |

^{a}Negative criteria (decreases level of evidence):

1. <80% follow-up

2. >20% missing data or missing data not at random without proper use of missing data statistical techniques

3. Limited control of confounding (eg, mortality comparisons with inadequate risk adjustment)

4. More than minimal bias (selection bias, publication bias, report bias, etc)

5. Heterogeneous populations (eg, institutions with distinct protocols/patient volume; conditions caused by distinct pathogenic mechanisms)

6. For RCT only: no blinding or improper randomization

^{b}Adequate statistical power: this only applies to studies *not* finding statistical differences and it is defined as power >80% for declaring “failure to detect a significant difference” or power >90% for declaring “bio-equivalence or noninferiority or comparative effectiveness” of a *predefined* difference based on previous evidence.

^{c}Prospective versus retrospective: studies with data collected to answer predefined questions are prospective; studies using data originally collected for purposes unrelated to the current research question are retrospective.

^{d}Large effect:

• Study with large RR (>5 or <0.2) about condition of low to moderate morbidity/mortality

• Study with moderate to large RR (2–5 or 0.2–0.5) about condition of high morbidity/mortality

^{e}Negative criteria for systematic reviews and meta-analysis (decreases level of evidence):

1. No or inadequate standard search protocol

2. More than minor chance of publication bias or publication bias nonassessed

3. Moderate heterogeneity of included studies and/or populations (eg, elective surgery and acute surgery)

4. Predominance of level III or lower studies

5. No measures or inappropriate measures of pooled risk (for meta-analysis only)

The determination of the level of evidence of a study involves four steps.

*Step 1: Define study type*

Therapeutic and care management studies evaluate a treatment's efficacy, effectiveness, and/or potential harm, including comparative effectiveness research and investigations focusing on adherence to standard protocols, recommendations, guidelines, and/or algorithms.

Prognostic and epidemiological studies^{12} assess the influence of selected predictive variables or risk factors on the outcome of a condition. These predictors are not under the control of the investigator(s). Epidemiological investigations describe the incidence or prevalence of disease or other clinical phenomena, risk factors, diagnosis, prognosis, or prediction of specific clinical outcomes, as well as the quality of health care.

Diagnostic tests or criteria studies^{13} describe the validity and applicability of diagnostic tests/procedures or of sets of diagnostic criteria used to define certain conditions (eg, definitions of adult respiratory distress syndrome, multiple organ failure (MOF), or post-injury coagulopathy).

Economic and value-based evaluations focus on which type of care management can provide the highest quality or greatest benefit for the least cost. Several types of economic-evaluation studies exist, including cost-benefit, cost-effectiveness, and cost-utility analyses. More recently, Porter proposed value-based health care evaluations, in which value is defined as the health outcomes achieved per dollar spent.^{14,15}

Systematic reviews and meta-analyses evaluate the body of evidence on a topic; meta-analyses specifically include the quantitative pooling of data.

Guidelines are systematically developed statements to assist practitioner and patient decisions about appropriate health care for specific clinical circumstances.^{16}

*Step 2: Define the research question, the hypothesis(es), and the research design*

The research design is the plan used by the investigators to test a hypothesis related to a research question as unambiguously as possible given practical and ethical constraints. Even descriptive and exploratory studies should have research questions and, most of the time, a testable hypothesis. This step is where you assess whether the research design was appropriate to address the research question and test the hypothesis. Certain designs are better than others. Randomized clinical trials (RCTs) remain the paragon of biomedical research, followed by cohort comparative studies, case-control studies, and case series, in that order. In RCTs, by randomly distributing patients to the study groups, potential risk factors are more likely to be evenly dispersed. When blinding of patients and investigators is added to the design, bias is further minimized. These processes diminish the risk of confounding factors influencing the results. As a result, the findings generated by RCTs are likely to be closer to the true effect than the findings generated by other research methods.

Controlled experimentation, however, is not always possible or ethical; thus other research models must be utilized to answer research questions.^{17} Although the application of study designs other than RCTs may be well justified, we must still recognize and beware of the higher level of uncertainty, and the consequently lower level of evidence, associated with them. Cohort comparative studies and case-control studies include a comparison group, that is, a group of patients who received a different type of care/procedure/test. Alternatively, the investigator can compare the same group of patients before and after an intervention, or use historical controls. The fundamental difference between cohort and case-control studies resides in where they start, that is, with the outcome or with the intervention/risk factor. In case-control studies, investigators find a group of patients with a specific outcome (eg, patients with severe torso trauma and coagulopathy) and a comparable group of patients without the outcome (eg, patients with severe torso trauma and no coagulopathy); the investigator then compares the incidence of a risk factor (eg, hemodynamic instability in the field), or whether a particular intervention was instituted (eg, plasma in the field). For example, Wu et al^{18} used a case-control design to compare the bone mineral density (BMD) of 87 elderly patients with hip fractures to that of 87 elderly patients without hip fractures and found BMD to be significantly lower among patients with the outcome (ie, hip fracture).

In cohort studies, investigators define a group of patients with the risk factor or intervention (eg, severe torso trauma patients who received plasma in the field), and a comparable group without the risk factor or intervention (eg, severe torso trauma patients who did not receive plasma in the field); then the outcomes are compared. To make things more confusing, a case-control study is sometimes a later offspring of a well-planned cohort study, as in the case-control study by Shaz et al^{19} on post-injury coagulopathy. Although important medical discoveries were made through case-control studies, such as the association of smoking and lung cancer,^{20} this design has several limitations that place it lower on the evidence hierarchy. These include, but are not limited to, potential for bias in the selection of the control group and uncontrolled confounding in the assessment of the risk factor or intervention.

Case series evaluate a single group of patients subjected to a type of care/procedure/test without a comparison group. Case series have a role in rare conditions or in describing a preliminary experience with a new intervention. However, the lack of a comparator lowers our confidence in the findings, and they are less likely to change practice. Of course, if an innovative treatment targets a currently incurable, lethal disease, we may adopt it even with low confidence, for lack of better options. The urgency to adopt the new treatment, however, does not change the fact that our confidence is still low, and that further research is crucial to increase our confidence level.

Propensity score matching (PSM) has gained increasing popularity in clinical research as an alternative design when RCTs are not possible.^{21,22,23,24,25} In brief, the propensity score is the probability of treatment assignment conditional on observed (emphasis on observed) baseline characteristics. Each treated patient is matched to one or more control patients with similar propensity scores but who did not receive the intervention. This works well when the intervention is relatively new and there is variation in adoption, with some professionals using it and others not, regardless of indications. Results conflict regarding how well PSM studies reproduce the results of corresponding RCTs. Lonjon et al^{23} reported no significant differences in effect estimates between RCTs and PSM observational studies regarding surgical procedures. Conversely, Zhang et al^{21,22} reported that PSM studies tended to report larger treatment effects than RCTs in the field of sepsis, but the opposite was true for studies in critical care medicine. It is likely that differences in populations and differential control of confounding played a role in these disparities.

RCTs control for unmeasured confounders, while a PSM study is only as good as the measured confounders included in the propensity score model. While appraising a PSM study, make sure to inspect the model used to generate the propensity score: were all important confounders or indications included in the model? Was the model goodness-of-fit (ie, how well the model fit the data) reported and appropriate (as discussed later in this chapter)? More on this design will be presented in subsequent sections.
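To make the mechanics concrete, below is a minimal sketch of PSM on synthetic data, assuming Python with scikit-learn and NumPy; the covariates (age and an injury severity score), the treatment-assignment model, and the simple 1:1 nearest-neighbor matching without replacement are all hypothetical choices for illustration, not a definitive implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 1000

# Hypothetical baseline covariates (measured confounders)
age = rng.normal(45, 15, n)
iss = rng.normal(20, 8, n)  # illustrative injury severity score

# Treatment assignment depends on the covariates (confounding by indication)
p_treat = 1 / (1 + np.exp(-(-3 + 0.02 * age + 0.08 * iss)))
treated = rng.random(n) < p_treat

# Step 1: model the probability of treatment from observed covariates
X = np.column_stack([age, iss])
ps = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]

# Step 2: 1:1 nearest-neighbor matching on the propensity score
treated_idx = np.where(treated)[0]
available = set(np.where(~treated)[0])
pairs = []
for i in treated_idx:
    if not available:
        break
    candidates = np.array(sorted(available))
    j = candidates[np.argmin(np.abs(ps[candidates] - ps[i]))]
    pairs.append((i, j))
    available.remove(j)

# Step 3: check covariate balance in the matched sample
m_t = np.array([i for i, _ in pairs])
m_c = np.array([j for _, j in pairs])
print("Mean age, treated vs matched controls:",
      round(age[m_t].mean(), 1), round(age[m_c].mean(), 1))
print("Mean ISS, treated vs matched controls:",
      round(iss[m_t].mean(), 1), round(iss[m_c].mean(), 1))
```

In practice one would also impose a caliper (a maximum allowable propensity score distance between matched pairs) and, as noted above, inspect the propensity model itself; no amount of matching recovers confounders that were never measured.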

*Step 3: Define effect size*

Effect size represents the magnitude and the direction of the difference or association between studied groups.^{26,27} Large effect sizes tend to strengthen the evidence. Although there is no specific threshold to define a large effect, an effect greater than 5 (or <0.2, when the desired direction is to decrease the outcome) in a condition of low to moderate morbidity or mortality (eg, stable femur fracture in a young person) is considered a large effect.^{6} In highly morbid or lethal conditions (eg, penetrating chest wounds), an effect size greater than 2 (or <0.5) may be considered large enough.

There are several measures of effect size; some of the most commonly reported are the following:

• *Relative risk or risk ratio (RR):* The RR compares the outcome probability in two groups (eg, with and without an intervention, or with and without a risk factor). When it equals 1, there is no evidence of effect.

• *Odds ratio (OR):* This index compares the odds of the outcome in two groups. Unless the reader is a gambler, the concept of odds is not intuitive; yet, because the highly popular logistic regression (discussed later in this chapter) produces odds ratios, this measure is often reported in the literature, and it is important that we understand it. For example, in a group of pediatric severe trauma admissions:

Odds of death in the group receiving tranexamic acid (TXA) = number of deaths/number of survivors

Odds of death in the group *not* receiving TXA = number of deaths/number of survivors

Odds ratio = odds of death in the TXA group/odds of death in the non-TXA group

Odds ratios are good estimates of the relative risk when the outcome is relatively rare (<10–20%); however, this is not true when the outcome is more common. In studies with a high prevalence of the outcome, odds ratios may exaggerate the association or effect and should be interpreted with caution.^{28} To illustrate the dangers of misinterpreting odds ratios, consider the study in the area of health disparities by Schulman and colleagues titled “Effects of Race and Sex on Physicians’ Referrals for Cardiac Catheterization,” published in the *New England Journal of Medicine* (NEJM) in 1999.^{29} In this experiment, doctors were asked to predict compliance and potential benefit from revascularization for cases portrayed by actors of different races and sexes. A logistic regression model showed that Black race was associated with a lower likelihood of receiving a referral to cardiac catheterization, with an odds ratio of 0.6. The authors concluded that the “race and sex of a patient independently influence how physicians manage chest pain.” Their study received extensive coverage in the news media, including a feature story on ABC’s Nightline, with Surgeon General David Satcher providing commentary.^{30} For the most part, the media interpreted the findings as “Blacks were 40% less likely to be referred for cardiac testing than Whites.” In a subsequent *NEJM* Sounding Board article, Schwartz and colleagues called attention to the dangers of using the odds ratio as an effect size measure when the outcome (referral for cardiac catheterization) is very common (>80% in Schulman et al’s study). The odds ratio, in this case, led to a gross exaggeration of the actual relative risk: the reported odds ratio of 0.6 ([Blacks referred/Blacks not referred]/[Whites referred/Whites not referred]) actually corresponded to a risk ratio of 0.93. In other words, Blacks were, at most, 7% less likely than Whites to receive the referral for further cardiac testing. Quite a difference! When faced with the report of odds ratios in studies with outcomes more frequent than 20%, one can use the formula proposed by Zhang and Yu^{28} to obtain an estimate of the relative risk (see the sketch after this list):

Relative risk (or risk ratio) = odds ratio/[(1 – Po) + (Po × odds ratio)]

where Po is the probability of the event in the control (or reference) group. Although this formula has been criticized for overestimating the risk ratio (the overestimation being less pronounced than that of the logistic regression-derived odds ratio),^{31} it is a simple formula that can assist the reader in obtaining a more realistic estimate of the effect size when the odds ratio was inappropriately used.

• *Cohen’s d:* This measurement is the standardized mean difference; it is not commonly reported, but can be easily calculated as (mean 1 – mean 2)/(pooled standard deviation).^{26} There are several free online calculators of Cohen’s d (eg, Ellis PD. Effect size calculators [2009]. http://www.polyu.edu.hk/mm/effectsizefaqs/calculator/calculator.html, accessed April 15, 2015; Soper D. Statistics calculators version 3.0 beta. http://danielsoper.com/statcalc3/default.aspx; and Lee A. Becker. Effect size calculators. http://www.uccs.edu/lbecker/index.html, both accessed April 16, 2015). A Cohen’s d around 0.2 is considered small; 0.5, medium; and greater than or equal to 0.8, large.

• *Correlation:* Pearson’s correlation coefficient *r* (or its nonparametric equivalent, the Spearman’s rho) measures the correlation between two continuous outcomes (eg, volume transfused and systolic blood pressure) and ranges from –1 to 1. Cohen (the same statistician who proposed the Cohen’s d) suggested the following rule of thumb for correlations: small = |0.10|, medium = |0.30|, and large = |0.50|.^{32}
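As a quick check of the arithmetic above, the following sketch (Python) implements the Zhang-Yu conversion; the baseline referral probability of 0.906 is our approximation of the reference-group rate in the Schulman study, used purely for illustration.

```python
def or_to_rr(odds_ratio: float, p0: float) -> float:
    """Approximate a risk ratio from an odds ratio (Zhang & Yu),
    where p0 is the outcome probability in the reference group."""
    return odds_ratio / ((1 - p0) + p0 * odds_ratio)

# Very common outcome (reference-group rate assumed ~0.906):
# the OR of 0.6 corresponds to an RR near the directly computed 0.93
# (the formula slightly overestimates, as noted above)
print(round(or_to_rr(0.6, 0.906), 2))  # -> 0.94

# Rare outcome: the OR approximates the RR reasonably well
print(round(or_to_rr(0.6, 0.05), 2))   # -> 0.61
```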

*Step 4: Assess the limitations of the study*

The next step recognizes that all research designs, even RCTs, are more or less limited by confounding, bias, inadequate sample size and statistical power, heterogeneity of included subjects, differences between control and study groups, missing data, loss to follow-up, etc. These factors increase the uncertainty surrounding the findings of an investigation and, consequently, decrease its level of evidence. They will be explored in detail in the next sections of the chapter.

In previous sections of the chapter we briefly defined study designs. The classic hierarchy of study designs was based on their ability to decrease bias and confounding and ranked research designs in the following order: (1) systematic reviews and meta-analyses; (2) RCTs with confidence intervals (CIs) that do not overlap the threshold of clinically significant effect; (3) RCTs with point estimates that suggest clinically significant effects but with overlapping CIs; (4) cohort studies; (5) case-control studies; (6) cross-sectional surveys; and (7) case reports.^{33} Currently, however, several investigators and organizations recognize that most clinical trials fail to provide the evidence needed to inform medical decision making.^{34} Thus we must use the best research design available to define the best evidence.^{35}

Randomized clinical trials are considered the most unbiased design because random group assignment provides an unbiased treatment allocation and often (but not always) results in a similar distribution of confounders across study arms. Ultimately, the goal of randomization is “to ensure that all patients have the same opportunity or equal probability of being allocated to a treatment group.”^{36} Random allocation means that each patient has an equal chance of being assigned to each experimental group, and the assignment cannot be predicted for any individual patient.^{37} However, emergency trauma research imposes difficulties on randomization, as enrollment is time sensitive and the interventions must be made available without any delay.^{38,39} In emergency research trials, the patients are not recruited; they are enrolled as they suffer an injury, in a completely random fashion. If effective, randomization creates groups of patients with similar prognoses at the start of the study; therefore, the trial results can be attributed to the interventions being evaluated.^{40}

How to conduct effective randomization is a challenge in emergency and trauma research. Conventional randomization schemes (eg, sealed envelopes with computer-generated random assignments) can impose unethical delays in providing treatment. Another major obstacle of complex randomization schemes in emergency research is adherence to protocol; thus alternative schemes (eg, prerandomization in alternate weeks) have been proposed.^{41} In a recent trial, we randomized severely injured patients for whom a massive transfusion protocol was activated to two groups: (1) viscoelastic (thrombelastography)-guided, goal-directed massive transfusion or (2) conventional coagulation assays (eg, prothrombin time, etc) and balanced blood product ratios, on predefined alternating weeks.^{42} The system was formidably successful in producing comparable groups at baseline. Another example is the Prehospital Acute Neurological Treatment and Optimization of Medical care in Stroke Study (PHANTOM-S), published in 2014, in which patients were randomly assigned weeks with and without availability of a Stroke Emergency Mobile.^{43} These alternative randomization approaches are recognized as appropriate in emergency research.^{38}
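For contrast with these alternative schemes, a conventional computer-generated allocation list is simple to produce. Below is a minimal sketch (Python) of permuted-block randomization, the kind of list that would be prepared in advance and sealed in envelopes; the block size, arm labels, and seed are arbitrary illustrative choices.

```python
import random

def permuted_block_randomization(n_patients: int, block_size: int = 4,
                                 arms=("A", "B"), seed: int = 2015):
    """Generate an allocation list in permuted blocks so that group
    sizes stay balanced throughout enrollment."""
    assert block_size % len(arms) == 0, "block must split evenly across arms"
    rng = random.Random(seed)
    allocation = []
    while len(allocation) < n_patients:
        block = list(arms) * (block_size // len(arms))
        rng.shuffle(block)  # permute each block independently
        allocation.extend(block)
    return allocation[:n_patients]

print(permuted_block_randomization(12))
# eg ['B', 'A', 'A', 'B', 'A', 'B', ...]
```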

Adaptive designs, including adaptive randomization, have been proposed to make trials more efficient.^{44,45} A 2015 draft guidance document from the Food and Drug Administration defines an adaptive design clinical study as “a study that includes a prospectively planned opportunity for modification of one or more specified aspects of the study design and hypotheses based on analysis of data (usually interim data) from subjects in the study.”^{46} For example, the recently published PROPPR trial used an adaptive design to grant its Data and Safety Monitoring Board authority to increase the sample size and reach adequate power.^{47} The initial sample size (*n* = 580) was planned to detect a clinically meaningful difference of 10 percentage points in 24-hour mortality based on previous evidence. The DSMB recommended increasing the sample size to 680 based on the results of an interim analysis.

In sum, the research design must be appropriate to the research question, ethical, and valid, both internally and externally. Internal validity refers to the extent to which the results of the study are biased or confounded. In the next sections, we will discuss bias and confounding in more detail, but basically this comes down to the following questions: Is the association between outcome and effect reported in the study real? How much of it may be due to bias and/or confounders? External validity, on the other hand, reflects the extent to which the study findings are generalizable.

When researchers use data collected for the purpose of addressing the research question, they are using primary data. On the other hand, in the era of “Big Data,” it is common to see the use of secondary data, collected for purposes unrelated to the specific research question. There are undeniable advantages to secondary datasets, as they are usually large and inexpensive. Yet they have had mixed results when used for tasks such as risk adjustment.^{48,49} Administrative datasets collected for billing purposes, for example, are often influenced by financial considerations, which can favor overcoding or undercoding, and the number of diagnoses recorded may be capped or disincentivized owing to declining marginal returns in billing.^{49,50}

In addition, medical coding and clinical practices are subject to change over time for a variety of reasons. For example, the coding of “illegal drugs” upon trauma admission is likely to change in unpredictable ways in states where cannabis became a legal, recreational drug. Another glaring example relates to the collection and coding of the social construct variable “race and ethnicity,” which has changed dramatically over the past few decades.^{51} Comorbidities may be misdiagnosed as complications and vice versa. To address this problem, since 2008, most hospitals report a “present on admission” (POA) indicator for each diagnosis in their administrative data as a means to distinguish hospital-acquired conditions from comorbidities.^{50} Of course, using data from before and after such modifications (eg, the introduction of the POA indicator) were implemented affects the internal validity of the study. Whenever longitudinal data are used in a study, especially if covering long periods of time, it is important to verify whether there were changes in data collection, health policies, regulations, etc, that can potentially affect the data.

Sometimes the distinction between primary and secondary data becomes blurry, as happens with registries such as state-mandated and hospital-based trauma registries or the National Trauma Data Bank (NTDB), a voluntary, national trauma dataset maintained by the American College of Surgeons. These datasets were developed to provide a comprehensive epidemiological characterization of trauma; thus one can assume that when the research question is related to the frequency, risk factors, treatments, and prognosis of trauma, these represent legitimate primary data. However, registries may lack the granularity to address specific study hypotheses, for example, analyzing the effects of early transfusions of blood components on coagulation-related deaths. In addition, low-volume hospitals may not contribute enough data to aggregate estimates, biasing mortality estimates toward high-volume facilities. The Center for Surgical Trials and Outcomes Research (Department of Surgery, The Johns Hopkins School of Medicine, Baltimore) has made a commendable effort documenting differences in risk adjustment and, more importantly, providing standardized analytic tools to improve risk adjustment and decrease low-volume bias in studies using the NTDB.^{52,53,54}

All studies that use a statistical test, even purely descriptive studies, have hypotheses. That is because a statistical test is based on a hypothesis. Every hypothesis can be placed in the following format:

*Variable X distribution in Group A*

*is different (or not different) from*

*Variable X distribution in Group B*

Despite its simplicity, this is a widely applicable model for constructing hypotheses.^{55} It sets the stage for elements that must be included in the methods section. The authors must define what characterizes Group A and Group B and what makes them comparable (aside from Variable X). Variable X, which is the variable of interest, must be defined in a way that allows the reader to completely understand how the variable is measured. The hypotheses should be defined using the above-mentioned PICO framework. For example, “we hypothesize that adult trauma patients (P) receiving pharmacoprophylaxis for venous thromboembolism (I) will have fewer venous thromboembolisms (O) than patients not receiving pharmacoprophylaxis (C).”

The commonly reported p-value is the probability of obtaining the observed effect (or larger) under the null hypothesis that there is no effect.^{56} Colloquially, we can interpret the p-value as the probability that the finding was the result of chance.^{57} The p-value is the chance of committing what is called a type 1 error, that is, wrongfully rejecting the null hypothesis (ie, accepting a difference when in reality there is none). Now, more important, what the p-value is not: “how sure one can be that the difference found is the correct difference.”

Significance is the level of the p-value below which we consider a result “statistically significant.” It has become a worldwide convention to use the 0.05 level, although it is completely arbitrary, not based on any objective data, and, in fact, inadequate in several instances. It was suggested initially by the famous statistician Ronald Fisher, who rejected it later and proposed that researchers reported the exact level of significance.^{58}

The readers will often see in research articles that p-values were “adjusted for multiple comparisons,” resulting in significance set at p-values smaller than the traditional 0.05 threshold. This is one of the most controversial issues in biostatistics, with experts debating the need for such adjustment.^{59,60,61} Those who defend the use of multiple comparisons adjustments claim that multiple comparisons increase the chances of finding a p-value less than 0.05 and inflate the likelihood of a type 1 error.^{60} Those who criticize its use argue that this leads to type 2 errors, that is, the chance of not finding a difference when indeed there is one.^{59,60,61} One of the authors of this chapter (Angela Sauaia) recalls her biostatistics professor claiming that, if multiple comparisons indeed increased type 1 error, then biostatisticians should stop working at the age of 40, as any differences after then would be significant just by chance. Our recommendation is: if the hypotheses being tested were predefined (ie, before seeing the data), then multiple comparisons adjustment is probably unnecessary; however, if hypothesis testing was motivated by trends in the data, then it is possible that even the most restrictive multiple comparison adjustment will not be able to account for the immense potential for investigator bias. In the end, the readers should check the exact p-values and make their own judgment about a significance threshold based on the impact of the condition under study. Lethal conditions for which there are few or no treatment options may require looser significance cutoffs, while, at the other end of the spectrum, benign diseases with many treatment options may demand strict significance values.
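To see the inflation that proponents of adjustment worry about, and what a correction actually does, consider the short sketch below (Python). The arithmetic in the first line is standard; the family of p-values fed to statsmodels is invented purely for illustration.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# With 20 independent tests and no true effects, the chance of at
# least one p < 0.05 is 1 - 0.95**20 -- roughly 64%
print(round(1 - 0.95 ** 20, 2))  # -> 0.64

# Bonferroni adjustment of a (hypothetical) family of p-values,
# eg from predefined subgroup analyses
pvals = np.array([0.001, 0.012, 0.030, 0.040, 0.200])
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")
print(p_adj.round(3))  # each p multiplied by 5, capped at 1.0
print(reject)          # only the smallest p-values remain significant
```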

A special case of multiple comparisons is the interim analysis: preplanned, sequential analyses conducted during a clinical trial. These analyses are almost obligatory in contemporary trials due to cost and ethical factors. The major rationale for interim analyses relies on the ethics of holding subjects hostage to a fixed sample size when a new therapy is potentially harmful, overwhelmingly beneficial, or futile. Interim analyses allow investigators, upon the advice of an independent Data Safety Monitoring Board (DSMB), to stop a trial early due to efficacy (the tested treatment has already proven to be of benefit), futility (the treatment-control difference is smaller than a predetermined value), or harm (treatment resulted in some harmful effect).^{62} For example, Burger et al, in their RCT testing the effect of prehospital hypertonic resuscitation after traumatic hypovolemic shock, reported that the DSMB stopped the study on the basis of potential harm in a preplanned subgroup analysis of nontransfused subjects.^{63}

The 95% CI, a concept related to significance, means that, if we were to repeat the experiment multiple times and, each time, calculate a 95% CI, 95% of these intervals would contain the true effect. A more informal interpretation is that the 95% CI represents the range within which we can be 95% certain that the true effect lies. Although the calculation of the 95% CI is closely related to the process used to obtain the p-value, CIs provide more information on the degree of uncertainty surrounding the study findings. There have been initiatives to replace the p-value with the 95% CI, met with much resistance; most journals now require that both be reported.

For example, in the CRASH-2 trial, a randomized controlled study on the effects of TXA in bleeding trauma patients, the TXA group showed a lower death rate (14.5%) than the placebo group (16.0%), with p-value = 0.0035.^{64} We can interpret this p-value as: “There is a 0.35% chance that the difference in mortality rates was found by chance.” The authors also reported the effect size as a relative risk of 0.91 with a 95% CI of 0.85–0.97. This means, in a simplified interpretation, that we can be 95% certain that the true relative risk lies between 0.85 and 0.97. Some people prefer to interpret the effect as an increase, in which case one just needs to calculate the inverse: 1/0.91 = 1.10; 95% CI: 1.03–1.18. If the authors did not provide the 95% CI, you can visit a free, online statistics calculator (eg, www.vassarstats.net) to easily obtain the 95% CI of the difference of 1.5 percentage points: 0.5–2.5 percentage points. In an abridged interpretation, we can be 95% certain that the “true” difference lies between 0.5 and 2.5 percentage points. Incidentally, notice that we consistently use “percentage points” to indicate that this is an absolute (as opposed to relative) difference between two percentages.
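The same interval can be reproduced with the normal approximation for a difference of two proportions, as in the sketch below (Python); the per-arm denominators are our approximation of the published CRASH-2 group sizes (roughly half of the 20,211 patients in each arm) and are used only for illustration.

```python
from math import sqrt

# Approximate CRASH-2 arms (illustrative denominators)
p1, n1 = 0.145, 10_060  # TXA group mortality
p2, n2 = 0.160, 10_067  # placebo group mortality

diff = p2 - p1
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
lo, hi = diff - 1.96 * se, diff + 1.96 * se
print(f"difference = {diff:.3f}, 95% CI {lo:.3f} to {hi:.3f}")
# -> difference = 0.015, 95% CI 0.005 to 0.025 (0.5-2.5 percentage points)

print(f"relative risk = {p1 / p2:.2f}")  # -> 0.91, as reported
```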

The example above reminds us that statistical significance does not necessarily mean practical or clinical significance. Small effect sizes can be statistically significant if the sample size is very large. The above-mentioned CRASH-2 trial, for example, enrolled 20,211 patients.^{64} P-values are related to many factors extraneous to whether the finding occurred by chance, including the effect size (larger effect sizes usually produce smaller p-values), the sample size (larger sample sizes often result in significant p-values), and multiple comparisons (especially when unplanned and driven by the data).^{65}

In addition, we must make sure that the study used appropriate methods for hypothesis testing. Statistical tests are based on assumptions, and if these assumptions are violated, the tests may not produce reliable p-values. Many tests (eg, t-test, ANOVA, Pearson correlation) rely on the normality assumption. The Central Limit Theorem (the distribution of the average of a large number of independent, identically distributed variables will be approximately normal, regardless of the underlying distribution) and the law of large numbers (the sample mean converges to the distribution mean as the sample size increases) are often invoked to justify the use of parametric tests to compare non-normally distributed variables. However, these allowances apply to large (*n* > 30) samples; gross skewness and small sample sizes (*n* < 30) will render parametric tests inappropriate. Thus, if the data are markedly skewed, as is often the case with the number of blood product units transfused, length of hospital stay, and viscoelastic measurements of fibrinolysis (eg, clot lysis in 30 minutes), or the sample size is small (as is often the case in basic science experiments), nonparametric tests (eg, Wilcoxon rank-sum, Kruskal-Wallis, Spearman correlation, etc) or appropriate transformations (eg, log, Box-Cox power transformation) to approximate normality are more appropriate. More on this topic appears in the section on sample descriptors.
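As an illustration of the skewness point, the sketch below (Python with SciPy) compares a heavily right-skewed outcome across two small groups using the t-test, its nonparametric counterpart, and a t-test after log transformation; the simulated “units transfused” data are entirely synthetic.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Heavily right-skewed outcomes (eg, units transfused), small samples
group_a = rng.lognormal(mean=1.0, sigma=1.0, size=15)
group_b = rng.lognormal(mean=1.5, sigma=1.0, size=15)

# Parametric test: assumes approximate normality -- shaky here
t_stat, p_t = stats.ttest_ind(group_a, group_b)

# Nonparametric alternative: compares ranks, robust to skewness
u_stat, p_u = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

# Or transform toward normality and test on the log scale
t_log, p_log = stats.ttest_ind(np.log(group_a), np.log(group_b))

print(f"t-test p = {p_t:.3f}; Wilcoxon rank-sum p = {p_u:.3f}; "
      f"t-test on log scale p = {p_log:.3f}")
```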

Statistical power is the counterpart to the p-value. It relates to type 2 error, or the failure to reject a false null hypothesis (a “false negative”). Statistical power is the probability of rejecting the null hypothesis when there is actually a difference. Despite its importance, it is one of the most neglected aspects of research articles. Most studies are superiority studies, that is, the researchers are searching for a significant difference. When a difference is not found in a superiority study, there are two alternatives: to declare “failure to find a significant difference” or to report a power analysis to determine how confident we can be in declaring the interventions (or risk factors) under study indeed equivalent. The latter alternative is more appealing when it is preplanned as an equivalence or noninferiority trial rather than an afterthought in a superiority study.

Whether statistical power is calculated beforehand (ideally) or afterwards (better than not at all), it must always contain the following four essential components: (1) power: usually 80% (another arbitrary cutoff), (2) confidence: usually 95%, (3) variable value in the control group/comparator, and (4) difference to be detected. For example, Burger et al stated in the above mentioned prehospital hypertonic resuscitation RCT^{63}:

The study was powered to detect a 4.8% overall difference in survival (from 64.6% to 69.4%) between the NS group and at least 1 of the 2 hypertonic groups. These estimates were based on data from a Phase II trial of similar design completed in 2005.^{8} There was an overall power of 80% (62.6% power for individual agent) and 5 planned interim analyses. On the basis of these calculations a total sample size of 3726 patients was required.

We call attention to how the difference to be detected was based on previous evidence. The difference to be detected is primarily a clinical decision based on evidence. Power should not be calculated based on the observed difference: we determine the appropriate difference and then obtain the power to detect such a difference. The basic formula relating sample size to power is shown below:

$$N = \frac{2\sigma^{2}\left(Z_{\alpha/2} + Z_{\beta}\right)^{2}}{\text{difference}^{2}}$$

where *N* is the sample size in each group (assuming equal sizes), σ is the standard deviation of the outcome variable, *Z*_{β} represents the desired power (0.84 for power = 80%), *Z*_{α/2} represents the desired level of statistical significance (1.96 for alpha = 5%), and “difference” is the proposed, clinically meaningful difference between means. There are free, online power calculators; however, they are usually meant for simple calculations. Power analysis becomes more complex when multiple confounders (covariates) must be taken into account, in which case the correlation between the covariates needs to be considered, or when cluster effects (explained in more detail later) exist, in which case an inflation factor, dependent on the level of intracluster correlation and the number of clusters, is used to estimate the sample size.
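A direct translation of this formula into code, under the same assumptions (two equal groups, comparison of means), might look like the sketch below (Python with SciPy); the blood pressure numbers in the example are hypothetical.

```python
from math import ceil
from scipy.stats import norm

def n_per_group(sigma: float, difference: float,
                alpha: float = 0.05, power: float = 0.80) -> int:
    """Sample size per group to detect a difference between two means."""
    z_alpha = norm.ppf(1 - alpha / 2)  # 1.96 for alpha = 5%
    z_beta = norm.ppf(power)           # 0.84 for power = 80%
    n = 2 * sigma**2 * (z_alpha + z_beta) ** 2 / difference**2
    return ceil(n)

# Hypothetical example: detect a 10 mm Hg difference in systolic blood
# pressure, assuming a standard deviation of 20 mm Hg
print(n_per_group(sigma=20, difference=10))  # -> 63 per group
```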

Equivalence and noninferiority studies are becoming much more common in the era of comparative effectiveness.^{66,67} Their null hypothesis assumes that there is a difference between arms, while for the more common superiority trials the null hypothesis assumes that there is no difference between groups. A noninferiority trial seeks to determine whether a new treatment is not worse than a standard treatment by more than a predefined margin of noninferiority for one or more outcomes (or side effects or complications). Noninferiority trials and equivalence trials are similar, but equivalence trials are two-sided studies, where the study is powered to detect whether the new treatment is not worse and not better than the existing one. Equivalence trials are not common in clinical medicine.

In noninferiority studies, the researchers must prespecify the difference they intend to detect, known as the noninferiority margin, irrelevant difference, or clinically acceptable amount.^{66} Within this specified noninferiority margin, the researcher is willing to accept the new treatment as noninferior to the standard treatment. This margin is determined by clinical judgment combined with statistical factors and is used in the sample size and power calculations. For example, is a difference of 4% in infection rates between two groups large enough to sway your decision about antibiotics? Would a difference of 10% make you think they are different? Or do you need something smaller? These decisions are based on clinical factors such as the severity of the disease and the variation of the outcomes.

The USA Multicenter Prehospital Hemoglobin-based Oxygen Carrier Resuscitation Trial is an example of a dual superiority/noninferiority assessment trial.^{68} The noninferiority hypothesis assumed that patients in the experimental group would have no more than a 7% higher mortality rate compared with control patients, based on the available medical literature. The noninferiority question arose because the blood substitute product would be used in scenarios in which blood products were needed but not available or permissible. Incidentally, this is another example in which an adaptive power analysis was performed after enrollment of 250 patients to ensure that no increase in the trial size was necessary.

Reviews of the quality of noninferiority trials have shown major problems.^{69,70} Essential elements to be included in the reports of this type of study are (1) the hypotheses must specify the noninferiority margin and explain the rationale for that choice; (2) whether the participants in the noninferiority trial were similar to those that were included in previous trials that established the efficacy of the control treatment; (3) whether the control treatment is in fact similar to the treatment tested in efficacy trials; (4) whether secondary outcomes were tested for noninferiority or superiority.^{67}

We often hear the comment that a small study with a significant difference may lack power. This is incorrect. Once a study has a significant difference, questions about statistical power are irrelevant.

Most studies will describe associations between outcomes and effects of interest. Whether these associations represent a cause-effect relationship is often an issue. A useful tool to make this determination is the set of nine criteria proposed by Sir Austin Bradford Hill in 1965, which still proves to be useful as a guideline.^{71,72}

• *Strength of the association:* This criterion does not equate to the size of the p-value; rather, it refers to the effect size. Large effect sizes are more likely to be causal. For example, in the initial studies on risk factors for MOF, there was a strong, independent association between transfusions of red blood cells (RBC) and post-injury MOF, which triggered further investigations into the specific causative, harmful role of RBCs.^{73,74,75}

• *Consistency:* The results have been replicated under different conditions by independent investigators.

• *Specificity:* The effect of interest is associated with a specific outcome rather than a wide range of outcomes. Its presence can help the case for a causal effect, but its absence does not discard it, as most outcomes have multifactorial, interdependent causes.

• *Temporality:* There is a clear temporal relationship in which the effect of interest precedes the disease. Although this may seem obvious, it is important that we take into account how the outcomes and the effect are measured. For example, although lung failure seems to precede liver and renal dysfunction in the development of post-injury MOF, the tests used to assess the function of these organs have different sensitivities to levels of organ damage. PaO_{2}/FIO_{2} ratios may detect early, mild levels of pulmonary failure, while bilirubin and creatinine only rise after substantial organ derangement.^{76,77} Thus, the temporal relationship may be unclear.

• *Biological gradient or dose response:* Increasing/decreasing exposure is associated with increasing/decreasing risk of disease. This is a powerful criterion when measurements are accurate. In the above-mentioned studies on the association of RBC and MOF, a dose-response relationship was observed both in the direction of higher MOF incidence associated with a larger number of RBC units transfused,^{78} and also in the observed decreased incidence of MOF when judicious transfusion practices dramatically limited the amount of RBCs transfused post-injury.^{75}

• *Plausibility:* Although novel findings may not fit this criterion, when there is a proposed scientific mechanism that can explain the association, the case for causation is strengthened.

• *Coherence:* The association is consistent with what is known about the disease. Once again, novel observations may not fit this criterion. Yet there is no denying that if the reported finding does not fit with current knowledge, there is a tendency toward a healthy skepticism.

• *Experimental evidence:* Hill proposed that “Causation is more likely if evidence is based on randomized experiments.”^{71}

• *Analogy:* In the presence of previous evidence of a causal effect by one class of agent (eg, RBC and MOF), we are more likely to accept causation when another agent from the same class (eg, plasma) is implicated as a risk factor for the same outcome. However, as Rothman and Greenland indicated, we must be careful in the application of this criterion: “Whatever insight might be derived from analogy is handicapped by the inventive imagination of scientists who can find analogies everywhere. At best, analogy provides a source of more elaborate hypothesis about the associations under study; absence of such analogies only reflects lack of imagination or lack of evidence.”^{79}

These criteria are to be used as guidelines, as Sir Bradford Hill himself wrote: “None of my nine viewpoints can bring indisputable evidence for or against the cause-and-effect hypothesis and none can be required as a sine qua non.”^{71}

Physicians can appreciate the acronym CBC as a mnemonic for the three main reasons for a spurious association: Chance, Bias and Confounding. Chance is dealt with by statistical testing, while appropriate designs and analytic techniques can assist with eliminating or minimizing bias and confounding.

Bias is the deviation of results due to systematic errors in the research methods. Although there are several different names for biases, two types seem to capture most of the biases presented in the surgical literature: (1) selection bias, which occurs when the study groups differ systematically in some way or when the study sample differs from the study population; and (2) observer/information bias, which occurs when there are systematic differences in the way information is collected for the groups being studied. The article by del Junco et al^{80} on the “seven deadly sins in trauma outcomes research” is an excellent review of some of the most common biases.

There are several types of selection bias that commonly appear in the trauma literature. One of the most common is data missing not at random. The treatment of missing data has become the focus of much attention recently. Indeed, the Food and Drug Administration (FDA) requested that the National Academy of Sciences create a Panel on the Handling of Missing Data in Clinical Trials. The panel was charged with proposing appropriate study designs and follow-up methods to reduce missing data, as well as appropriate statistical methods to address missing data in the analysis of results.^{81} Although its focus was on RCTs, the report is very informative for other study designs as well.

The most important things to consider about missing data are (1) the proportion of patients with missing data; and (2) whether the data are missing at random, thus not biasing the results in a significant way, or whether there is some pattern that can bias the results. Finding out that data are missing not at random (MNAR) does not mean that the study is automatically flawed, but appropriate statistical methods must be used to deal with them. If the proportion of missing data is high and MNAR, one cannot ignore it and proceed to analyze the complete dataset without further consideration. Let us illustrate this with a little story: the father of one of the authors (Sauaia) was a physician interested in congenital heart defects. He was conducting a population-based study about the incidence of such defects in school-age children and visited several schools, screening children for heart defects. Some children were, of course, absent that day and were not screened. At the end of the day, this young researcher considered the absent children (his missing data) for a moment and, wanting complete data, decided to visit them at home. Not surprisingly, some of the absent children in fact had a heart defect. This made sense, as these ill children were more likely to miss school because of symptoms or medical appointments.

Missing data not at random in trauma occur for two radically different reasons: (1) patients are too sick to have the test (eg, died early, intravenous access not possible, chaotic trauma scene, etc), in which case adverse outcomes are common; or (2) patients are not sick enough to justify the test (ie, early discharge, hemodynamically stable, not on mechanical ventilation, etc), in which case adverse outcomes are rare. For example, in the late 1990s, we developed predictive models for post-injury MOF and observed that lactate was a significant independent predictor of MOF.^{73,82} As expected, lactate measurement was available only for the group of severely injured patients. We then addressed the missing data using two analyses: the first included only patients for whom lactate was measured; the second included all patients and used a “missing indicator” for unavailable lactate levels (ie, each patient was assigned one of three possible values for lactate: missing, normal, or abnormal). The results of the two analyses were remarkably similar, which strengthened the findings. In a more recent example, Odom and colleagues^{83} addressed this issue in their study on the value of lactate as a predictor of trauma mortality. They astutely observed that the selection bias created by the missing lactate values would bias their results toward the null hypothesis rather than the positive effect they found.

Many methods are available to deal with missing data, such as the “last value carried forward.” Albeit highly criticized, this technique has its place in longitudinal studies, when most variables are not missing and there is high predictability of missing values.^{84} For example, we used this technique to impute the values of daily liver function tests, carrying the last obtained value forward until a subsequent result was available.^{85}

More sophisticated techniques, such as multiple imputation by chained equations (MICE), are becoming popular. In simple words, this method imputes missing values based on regression equations derived M times, followed by an analysis of each imputed dataset and finalized by a combination of the M analyses. Although multiple imputation functions better for data missing at random, it has been shown to provide good estimates even in certain cases when data are MNAR. For example, Moore and colleagues from Canada tested the use of MICE to impute values of the Glasgow Coma Scale in 2005, and again in 2015 for the Glasgow Coma Scale, respiratory rate, and systolic blood pressure for a model to evaluate quality of trauma care with good results.^{86,87}
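For readers who want to see what chained-equation imputation looks like in code, below is a minimal sketch using scikit-learn's IterativeImputer, a MICE-style imputer; the correlated physiologic variables are synthetic, and the pooling of M imputed datasets described above is omitted for brevity.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n = 200

# Synthetic, correlated variables (illustrative only)
sbp = rng.normal(120, 20, n)  # systolic blood pressure
gcs = np.clip(15 - 0.05 * (120 - sbp) + rng.normal(0, 2, n), 3, 15)
X = np.column_stack([sbp, gcs])

# Punch random holes in the data (~15% missing per variable)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.15] = np.nan

# Impute each variable from the others with chained regressions;
# sample_posterior=True adds noise so that repeated runs with
# different seeds yield the M datasets of multiple imputation
imputer = IterativeImputer(sample_posterior=True, random_state=1)
X_imputed = imputer.fit_transform(X_missing)
print(np.isnan(X_imputed).sum())  # -> 0; no missing values remain
```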

Another important type of selection bias, especially in trauma and emergency care, is survivor bias. This occurs when the individual does not survive long enough to have the “opportunity” to receive the complete intervention. These early nonsurvivors contribute to increasing the mortality rate of the group not receiving the intervention, artificially inflating the effect of the intervention. The studies on fixed, balanced blood product ratios (1:1:1 RBCs:plasma:platelets) are well-known examples of this problem.^{80,88,89} Survival analysis, which analyzes time to event and allows for censoring patients who died before experiencing the intervention, is a helpful technique to deal with this problem. However, truth be told, one can never know what would have happened to nonsurvivors had they survived long enough to get the intervention.

One can also add a time-varying covariate to the survival analysis, which is a variable that, as the name says, varies over time. For example, the ratio of blood products varies hour to hour during the dynamic resuscitation period. If a patient receives 6 RBC units and 3 units of plasma in the first 3 hours and no blood products between hour 4 and hour 6, the patient's RBC:plasma ratio at hour 6 is 2:1, which is exactly the same hour-6 ratio as that of someone who receives 6 RBC units in the first 3 hours and 3 units of plasma from hour 4 to the end of hour 6 (a “catch-up” practice). The big difference here is that the first patient experienced the 2:1 ratio at all times, while the second had an initial ratio of 6:0 followed by 0:3. It is quite possible that the outcomes are different for these two extremes. Using a time-varying covariate, we can actually express the changes in the RBC:plasma ratio hourly. The best solution to deal with survivor bias, however, is an RCT. Indeed, the PROPPR trial, an RCT testing the effect of a fixed, balanced blood product ratio, failed to find a difference despite excellent statistical power, contradicting the results of previous observational studies that had found a benefit.^{90}
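For illustration, a time-varying covariate model can be fit with the lifelines package's CoxTimeVaryingFitter, which expects long-format data with one row per patient per interval; the toy dataset below (column names, ratios, and event times all invented) is far too small for real inference and serves only to show the data layout and the call.

```python
import pandas as pd
from lifelines import CoxTimeVaryingFitter

# Long format: one row per patient per time interval, letting the
# covariate (here, the RBC:plasma ratio) change across intervals
df = pd.DataFrame({
    "id":    [1, 1, 2, 2, 3, 3, 4, 5, 5, 6],
    "start": [0, 3, 0, 3, 0, 3, 0, 0, 3, 0],
    "stop":  [3, 6, 3, 6, 3, 5, 2, 3, 6, 4],
    "ratio": [2.0, 2.0, 3.0, 0.5, 1.0, 1.0, 4.0, 2.0, 1.0, 3.0],
    "death": [0, 0, 0, 1, 0, 1, 1, 0, 0, 1],  # event at interval end
})

ctv = CoxTimeVaryingFitter()
ctv.fit(df, id_col="id", event_col="death",
        start_col="start", stop_col="stop")
ctv.print_summary()  # hazard ratio for the time-varying ratio covariate
```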

Some authors exclude early deaths from the analysis to deal with survivor bias. This may be a solution, but it limits the generalizability, as the study findings apply only to patients who survive the acute post-injury period. Other commonly encountered selection biases are: loss to follow-up not at random (eg, patients failed to return to follow-up visits due to long-term injury-related complications), refusal to participate or withdrawal due to side effects or invasiveness of the intervention, consent not obtained due to traumatic brain injury, etc. This is a good reminder to always read the inclusion and exclusion criteria to determine whether they resulted in selection bias. Common exclusion criteria that may limit the generalizability of the investigation are advanced age, comorbidities, early deaths, and incomplete data. It is important to emphasize that the findings apply only to the population that fits both the inclusion and exclusion criteria.

Another common type of bias in surgical research is intervention bias, which arises when an intervention, rather than preintervention risk factors, is used to define the populations or groups being compared. Notable examples of interventions used to define comparison groups are massive transfusion versus no massive transfusion, or Resuscitative Endovascular Balloon Occlusion of the Aorta (REBOA) versus resuscitative thoracotomy. The bias is smaller if the indications for the intervention are well established and there is minimal variation among providers, but it becomes larger when the intervention is given “at the attending’s discretion” and there is large variation among providers.

Moving on to the final C of the CBC acronym: confounding, that is, a third variable responsible for all or part of the association between two other variables. The most quoted example is the association of coffee drinking and lung cancer, which is confounded by the real culprit, smoking, a variable associated with both coffee drinking and lung cancer. To be considered a confounder, a variable must be associated with both the outcome and the effect of interest (ie, a risk factor or an intervention). In trauma, injury severity, age, and comorbidities are frequent confounders of the association between an intervention and the outcome.

There are two ways to deal with confounders: (1) research designs, including RCTs or matching; and (2) analytic tools such as stratified analysis (when there are few confounders) or multivariate analyses in the case of multiple confounders. These techniques will be discussed in a subsequent section of this chapter. The first table of articles describing RCTs usually shows the distribution of potential confounders between the control and experimental groups. Appropriately so, readers should not find p-values in this table, as any differences in the distribution of these confounders are, by definition, a result of chance. Randomization, in general, increases the likelihood of a similar distribution of confounders, which then are not associated with group membership, allowing us to assess the effect of the intervention (the main difference between the two groups) on the outcome. Single blinding (only the patient) or double blinding (the patient and the researcher) are powerful additions that reduce the likelihood that the researchers’ and patients’ own biases interfere with the results. Surgical and emergency interventions, however, are often not amenable to blinding.^{91}

Matching is the alternative option to RCTs in research design to deal with confounders. This can be accomplished via traditional matching when only a few, known risk factors exist, or via PSM, which we mentioned in a previous section. In PSM, a multivariate model including potential reasons for receiving the intervention is used to derive a propensity score, that is, the probability of receiving the intervention. In this model, the “outcome” or “dependent variable” is the intervention. Using this propensity score, we can choose matching control patients, that is, patients who did not receive the intervention despite having the same propensity score as the patients who received the intervention. The downside of this procedure is that the reason why patients did not receive the intervention despite having the same propensity score is often unknown, or due to unmeasured variables that may have an effect themselves on the outcomes. Finally, matching limits the ability of the investigators to examine the effects of the matching variables; once used in the matching, a variable is no longer available for analysis. Therefore, it is important that the investigators are certain that the matching variable is of no interest in the analysis of the outcomes.
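
A toy sketch of 1:1 nearest-neighbor propensity score matching on synthetic data (variable names and the random treatment assignment are hypothetical; real analyses typically add a caliper and match without replacement):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

# Synthetic cohort with covariates believed to drive the intervention.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "iss": rng.integers(1, 75, n),
    "age": rng.integers(16, 90, n),
    "sbp": rng.normal(110, 25, n),
})
df["treated"] = rng.integers(0, 2, n)

# Step 1: the propensity score is the modeled probability of receiving
# the intervention; the intervention is the "outcome" of this model.
ps_model = LogisticRegression().fit(df[["iss", "age", "sbp"]], df["treated"])
df["ps"] = ps_model.predict_proba(df[["iss", "age", "sbp"]])[:, 1]

# Step 2: for each treated patient, find the untreated patient with the
# nearest propensity score.
treated, control = df[df["treated"] == 1], df[df["treated"] == 0]
nn = NearestNeighbors(n_neighbors=1).fit(control[["ps"]])
_, idx = nn.kneighbors(treated[["ps"]])
matched_controls = control.iloc[idx.ravel()]
```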

In confounding, variable A is responsible for part or all of the association between variables B and C, and we want to adjust for it (in other words, minimize its effect) to be able to assess the association between B and C. There is little or no interest in any effects mediated by variable A. Conversely, in effect modification or interaction, variable A modifies the association between B and C. This type of association must be described, not adjusted for, as it provides important information about the mechanism underlying the association of B and C. Thus, when appraising multivariate models, the reader should make sure pertinent interactions were tested and, if significant, described appropriately.

An example of an important interaction can be found in our study on the effect of pre-injury antiplatelet therapy on post-injury outcomes.^{92} We showed that antiplatelet therapy modified the relationship between blood product transfusions and mortality.^{92} Specifically, as shown in Fig. 63-1, among patients who were taking antiplatelet agents prior to the injury, the odds of mortality associated with RBCs transfused were lower than among patients not receiving this medication.

###### FIGURE 63-1

Example of an important effect modification (or interaction) in the study by Harr et al investigating the effect of pre-injury antiplatelet therapy on post-injury outcomes.^{92} These investigators showed that antiplatelet therapy modified the relationship between blood product transfusions and mortality, decreasing the risk associated with requirement for transfusions.

Descriptive statistics, such as mean and standard deviation, median and interquartile range, frequency, and percentages, are used to provide the reader with the best possible description of the sample. Since the reader does not have access to the raw data, it is up to the authors to provide readers with a clear picture of what the sample looked like. More important, this picture should allow the readers to use the PICO framework and compare the study’s sample to the population to whom they intend to apply the findings. Are they similar enough that one may directly apply the findings, or are they older, younger, more severely injured, etc?

Categorical variables, such as sex and blunt versus penetrating mechanism, are expressed as frequencies (*N*) and percentages. Continuous variables, such as age, injury severity score (ISS), Glasgow Coma Scale, length of stay, ventilator time, etc, can be described in different ways. Central tendency of the data is usually described by means or medians. When the variable distribution is normal and symmetric, medians and means are identical. However, when the variable is skewed or there are outliers, medians, rather than means, are better descriptors. As shown in Fig. 63-2, using data from a multicenter prospective study of severely injured patients, the distributions of systolic blood pressure at the emergency department and age were approximately normal, thus the median and mean both reflect the “typical patient.” On the other hand, the mean does not describe well the typical number of RBC units transfused in the first 12 hours or the length of stay in the intensive care unit.
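
The pull of a long tail on the mean is easy to demonstrate with simulated data (the distributions below are synthetic stand-ins, not the Glue Grant data shown in Fig. 63-2):

```python
import numpy as np

# Roughly symmetric variable (eg, ED systolic blood pressure):
sbp = np.random.default_rng(1).normal(120, 20, 1000)
print(np.mean(sbp), np.median(sbp))   # nearly identical

# Right-skewed variable (eg, ICU length of stay in days):
los = np.random.default_rng(1).exponential(5, 1000)
print(np.mean(los), np.median(los))   # mean pulled up by the long tail
```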

###### FIGURE 63-2

Distribution of variables commonly reported in trauma and surgery research. Note how means and medians differ in ICU (Intensive care unit) days and RBC 12 hours (number of red blood cell units in the first 12 hours post-injury), which are skewed, compared to age or ED SBP (emergency department systolic blood pressure), which are closer to normal. (Data from the Glue Grant, analysis by the authors.)

Data dispersion for normally distributed variables is usually represented by the standard deviation (68% of the data should be contained within mean ±1 standard deviation, 95% of the data within mean ±2 standard deviations), while for skewed data we most commonly use the interquartile range (the lower and upper quartile; 50% of the data are contained within this interval) or the range (maximum and minimum values). Box plots, shown in Fig. 63-3, are excellent ways to represent the data central tendency and dispersion.

###### FIGURE 63-3

Example of a box plot showing the amount of plasma (FFP) units given in the first (h1), second (h2), and third hour (h3) post-injury in two study groups (C: resuscitation guided by conventional coagulation tests; TEG: resuscitation guided by thromboelastography).^{42} The length of the box represents the interquartile range (the distance between the 25th and 75th percentiles); the symbol in the box interior, the group mean; the horizontal line inside the box interior, the group median; and the vertical lines (“whiskers”), the minimum and maximum values (range). Note the difference in mean and median, suggesting skewness.

Studies usually present bivariate analyses first (sometimes called univariate). These consist of unadjusted, crude comparisons between two (or more) groups of subjects. The main point for readers in this part of the article is to pay attention to variables that can be confounders, that is, variables with a different distribution between the groups (thus associated with group membership) that are also associated with the outcome. This difference does not need to be statistically significant, as the sample may be small and thus prone to a type 2 error. Deciding whether a variable may be playing a confounding role is a clinical, not a statistical, decision. After examining this table, the reader can decide which variables are potential confounders.

Analytic techniques that minimize confounding are stratification and multivariate analysis. Stratification may work when there are just a few risk factors and a large sample size. For example, in order to account for the confounding effect of smoking in the association between coffee drinking and lung cancer, one may stratify the analysis on smoking status and observe whether the association between coffee drinking and lung cancer holds within each stratum. Multivariate analyses are basically an advanced form of stratification done on multiple variables. There are several types of multivariate models depending on the distribution of the outcome. Binary outcomes (eg, death yes or no) are commonly addressed using logistic regression. Categorical outcomes with more than two strata can be analyzed with polytomous logistic regression. When time to event is of interest or there is need to censor data (eg, patients who died or are discharged before experiencing the outcome of interest), survival analysis is an option.
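
A toy sketch of confounder adjustment with logistic regression, fitted with statsmodels on synthetic data (variable names and effect sizes are invented for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic cohort: death depends on ISS, age, and the intervention.
rng = np.random.default_rng(2)
n = 1000
df = pd.DataFrame({
    "intervention": rng.integers(0, 2, n),
    "iss":          rng.integers(1, 75, n),
    "age":          rng.integers(16, 90, n),
})
logit_p = -4 + 0.04 * df["iss"] + 0.02 * df["age"] + 0.3 * df["intervention"]
df["death"] = (rng.random(n) < 1 / (1 + np.exp(-logit_p))).astype(int)

# The coefficient on `intervention` is adjusted for ISS and age;
# exponentiated coefficients are adjusted odds ratios.
fit = smf.logit("death ~ intervention + iss + age", data=df).fit()
print(np.exp(fit.params))
```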

Linear regression assumes that the outcome is continuous, has a distribution not too far from normal, and has, as the name says, a linear relationship with the covariates. When these assumptions are not applicable (eg, the outcome is categorical or too skewed), we may apply a larger category of models, named gene*ralized* linear models. Please note that there is a difference between gene*ralized* linear models and general li*near* models. The economic analysis by Schwartz and colleagues^{93} on delays in laparoscopic cholecystectomy is an example of a study using a generalized linear model. Generalized linear models are a broad class of models that include logistic regression, Poisson regression, log-linear models, gamma-distribution models (as used in the abovementioned article), and others.

The description of generalized linear models usually includes the term “link,” which is a function linking the actual outcome Y to the estimated Y in a model. In simple words, it is the transformation applied to the outcome variable to place it on a continuous scale. In a linear regression model, the link is the identity, that is, the estimated and the actual outcome are expressed the same way, and no transformation is needed. In logistic regression, the transformation or link used is called the *logit* and the distribution is binomial, that is, a yes or no type of variable. For the gamma model (used in the Schwartz et al paper^{93}), the link is the log, and the distribution is right-skewed with a variance that increases with the mean. This type of model is often used in econometrics because it fits the distribution of cost in health care, that is, care for most patients results in low costs, but a few patients require very costly treatments. In the log-linear and Poisson regressions, the link is the log and the distribution is the Poisson distribution. Poisson regression is usually applied to count data, for example, the number of trauma deaths over a period of time, as seen in a 2013 article by Kahl et al assessing time trends in annual trauma mortality rates.^{94}
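
A sketch of the two generalized linear models named above, fitted with statsmodels on synthetic data (variable names, effect sizes, and distributions are invented for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)

# Gamma GLM with a log link: right-skewed costs whose variance rises
# with the mean, the combination often used for health care costs.
costs = pd.DataFrame({"delay_days": rng.integers(0, 10, 300),
                      "age": rng.integers(18, 90, 300)})
costs["cost"] = rng.gamma(2.0, 1000 * (1 + 0.2 * costs["delay_days"]))
gamma_fit = smf.glm(
    "cost ~ delay_days + age", data=costs,
    family=sm.families.Gamma(link=sm.families.links.Log())).fit()

# Poisson GLM (log link) for count outcomes, eg, annual trauma deaths.
trends = pd.DataFrame({"year": np.arange(2000, 2013),
                       "deaths": rng.poisson(120, 13)})
poisson_fit = smf.glm("deaths ~ year", data=trends,
                      family=sm.families.Poisson()).fit()
```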

Each multivariate model has its own set of assumptions, and although many models are robust to limited violations, this is an important consideration when appraising an article. For example, the Cox proportional hazards regression model, a type of survival analysis, requires, as the name says, that the hazards be proportional, that is, that the relative risks do not vary over time. Variation over time is not an insurmountable problem; it can be tested for and remedied by introducing time-varying covariates into the model.

Another assumption in regression models relates to the shape of the association between the outcome and the risk factor. Is it a straight line or U-shaped? The computer software used for linear regression will fit a straight line unless we add what is called a *quadratic term* (or second-order polynomial), in which case it will fit a U-shaped curve (the U can face up or down). For example, in a recent study on fibrinolysis, we showed that the association of fibrinolysis and mortality was U-shaped, with higher mortality associated with very low (fibrinolysis shutdown) and very high (hyperfibrinolysis) levels of fibrinolysis and a mortality nadir at moderate fibrinolysis levels.^{95} We showed a similar U-shaped relationship between RBC:plasma ratios and mortality, with mortality peaking at both low and high ratios and lowest at medium ratios.^{96} Higher-order polynomials are seldom used.
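
In formula notation, a quadratic term is added with I(x**2); a toy sketch with a synthetic U-shaped relationship (all numbers invented):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic U-shape: mortality rises at both extremes of fibrinolysis,
# with a nadir at a moderate level (here, LY30 near 7).
rng = np.random.default_rng(4)
df = pd.DataFrame({"ly30": rng.uniform(0, 15, 800)})
logit_p = -2 + 0.08 * (df["ly30"] - 7) ** 2
df["death"] = (rng.random(800) < 1 / (1 + np.exp(-logit_p))).astype(int)

# I(ly30**2) adds the quadratic term; without it the fitted log-odds
# would be forced into a straight line.
fit = smf.logit("death ~ ly30 + I(ly30**2)", data=df).fit()
print(fit.params)
```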

Most multivariate models assume independence between observations. In other words, if one cannot predict a subject’s outcome based on the outcomes of other subjects, the outcomes are said to be independent. This is not completely true for patients within the same center, or even under the same surgeon. Patients within a center tend to have similar outcomes, violating the independence assumption. This similarity between patients of the same center or same provider is named the cluster effect, and it should be accounted for in the statistical modeling and in sample size calculations. The larger the correlation between subjects within a cluster, the larger the required sample size. This correlation can be assessed by the intraclass correlation coefficient, which measures how similar patients are within centers and how different they are from patients in other centers.^{97}
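
One way to estimate the intraclass correlation coefficient is from the variance components of a random-intercept mixed model; a sketch with synthetic multicenter data (the center count, sizes, and variances are invented):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic multicenter data: 20 centers, 50 patients each.
rng = np.random.default_rng(5)
centers = np.repeat(np.arange(20), 50)
center_effect = rng.normal(0, 1.0, 20)[centers]   # between-center variation
y = 10 + center_effect + rng.normal(0, 2.0, centers.size)
df = pd.DataFrame({"center": centers, "y": y})

# Random-intercept model: the ICC is the share of total variance
# attributable to the center level.
fit = smf.mixedlm("y ~ 1", data=df, groups=df["center"]).fit()
between = fit.cov_re.iloc[0, 0]   # variance of the center intercepts
within = fit.scale                # residual (within-center) variance
print(f"ICC = {between / (between + within):.2f}")
```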

Another important aspect of multivariate models is their performance, which can be evaluated in several ways, depending on the model. The R-square gauges how much of the variation of the outcome is explained by the model. We must keep in mind that in medicine, given the multifactorial nature of most diseases and clinical scenarios, it is uncommon to see large R-squares (>0.30). The R-square should be accompanied by a p-value testing whether an R-square of the observed magnitude could have arisen by chance alone. Together they compose an assessment of the model’s performance.

Model discrimination, that is, the ability of the model to discriminate individuals with and without the outcome (eg, survivors and nonsurvivors), is often evaluated by the area under the receiver operating characteristic curve, or AUROC (also known as the c-statistic). ROC curves derive from studies on radar and sonar detection during World War II to ascertain the best radar setting to distinguish enemy airplanes from harmless targets (eg, flocks of birds). As you can see in Fig. 63-4, which shows the AUROC for thromboelastography values as predictors of massive transfusion, the Y-axis shows the sensitivity (eg, % of deaths predicted), while the X-axis depicts [1 – specificity] (eg, [1 – % of survivors correctly predicted to survive]). The AUROC is a good assessment of the overall accuracy of the model, and it has been arbitrarily categorized as: 0.90–1 = excellent; 0.80–0.90 = good; 0.70–0.80 = fair; 0.60–0.70 = poor; 0.50–0.60 = fail. AUROCs should be accompanied by 95% CIs, to allow the reader to determine whether they are significantly different from the no-value AUROC of 0.50, and to compare different AUROCs (if the CIs do not overlap, the AUROCs are significantly different). For example, our group conducted a comparison between the Denver MOF score and the Multiple Organ Dysfunction Syndrome score (MODS) as predictors of death using AUROCs.^{98} The Denver MOF score’s AUROC was 0.88 (95% CI: 0.84–0.92) while the MODS’s AUROC was 0.86 (95% CI: 0.82–0.89), suggesting that both scoring systems performed similarly well.
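
Computing an AUROC from predicted probabilities takes one line with scikit-learn; the vectors below are invented for illustration:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])                  # observed outcomes
y_prob = np.array([0.1, 0.3, 0.8, 0.2, 0.6, 0.9, 0.4, 0.7])  # model predictions

# 0.5 = no discrimination; 1.0 = perfect discrimination.
print(roc_auc_score(y_true, y_prob))
```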

###### FIGURE 63-4

Example of a receiver-operating-characteristic (ROC) curve to compare the predictive power of the thromboelastography (TEG) values (ACT, activated clotting time; MA, maximum amplitude; angle; and LY30, fibrinolysis) for massive transfusion defined as ≥4 units of red blood cells in the first hour post-injury (excludes deaths within 1 hour post-injury). The area under the ROC curve (AUROC) for the angle was 0.79 (95% CI: 0.69–0.89), MA: 0.78 (95% CI: 0.68–0.89), Ly30: 0.71 (95% CI: 0.56–0.85) and ACT: 0.70 (95% CI: 0.57–0.83).^{98} Note that the 95% CIs overlap, suggesting no difference in prediction performance.

ROC curves may also be used to determine the best threshold (to discriminate between normal and abnormal values) for a continuous variable. There are several methods to define the best cutoff, such as the Youden Index (aka the Youden J statistic), defined as J = sensitivity + specificity – 1 and ranging from 0 (no discrimination) to 1 (complete discrimination). The Youden Index corresponds to the maximum vertical distance between the ROC curve and the diagonal, noninformative line. Another common choice is the point closest to the upper left corner of the plot. Although these points represent an optimal combination of sensitivity and specificity, researchers may decide to privilege sensitivity or specificity depending on the condition. Certain conditions demand more sensitivity (eg, when one does not want to miss a case due to potentially disastrous consequences) while others may require more specificity (eg, when the consequences of missing a case are minimal, but the cost of the test is high). In Fig. 63-4, the upper leftmost point on the ROC curve for the TEG variable angle was 65° and the Youden Index point was 60°, thus researchers may choose a value between these two to define an optimal cutoff to predict the need for massive transfusion.
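
Because J = sensitivity + specificity - 1 = tpr - fpr, the Youden cutoff is simply the threshold that maximizes tpr - fpr along the ROC curve; a sketch reusing the invented vectors from the AUROC example:

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.3, 0.8, 0.2, 0.6, 0.9, 0.4, 0.7])

# fpr = 1 - specificity; tpr = sensitivity; one entry per candidate cutoff.
fpr, tpr, thresholds = roc_curve(y_true, y_prob)

# The Youden cutoff maximizes the vertical distance to the diagonal.
j = tpr - fpr
print(thresholds[np.argmax(j)])
```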

Calibration is another important assessment of model performance. It gauges the model’s ability to correctly estimate the probability of an event. It is commonly assessed using the Hosmer-Lemeshow statistic, which compares the predicted and observed rates within deciles of predicted risk. In this statistic, the larger the p-value (ie, the lower the likelihood of disparity between predicted and observed rates), the better the model’s calibration. It is important to be careful, however, with large sample sizes, for which the p-value can be small (ie, significant) even for models that are not poorly calibrated.^{99}
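
A minimal implementation of the decile-based Hosmer-Lemeshow statistic (the function name and grouping details are ours; standard statistical software offers equivalents):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2

def hosmer_lemeshow(y_true, y_prob, groups=10):
    """Compare observed vs predicted event rates within deciles of risk."""
    df = pd.DataFrame({"y": y_true, "p": y_prob})
    df["decile"] = pd.qcut(df["p"], groups, labels=False, duplicates="drop")
    obs = df.groupby("decile")["y"].sum()       # observed events per decile
    exp = df.groupby("decile")["p"].sum()       # expected events per decile
    n = df.groupby("decile")["y"].count()
    stat = (((obs - exp) ** 2) / (exp * (1 - exp / n))).sum()
    return stat, chi2.sf(stat, len(obs) - 2)    # statistic, p-value

# Simulated well-calibrated predictions should yield a large p-value.
rng = np.random.default_rng(7)
p = rng.uniform(0.05, 0.95, 2000)
y = (rng.random(2000) < p).astype(int)
print(hosmer_lemeshow(y, p))
```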

Multivariate models are often overfitted, that is, they have more variables than appropriate. Although there is software to calculate how many variables one can have in a multivariate model, a good rule of thumb, used even by expert statisticians, is 10 subjects with the outcome per variable in the model (not 10 subjects per variable, but 10 subjects with the outcome). When the number of confounders exceeds what this threshold allows, an alternative is to use the above-mentioned propensity score, this time as an adjustment. Thus, instead of 20 variables, one has only one propensity score, representing the combined effect of the 20 variables. For example, Brown and colleagues used this approach in a study of prehospital (PH) crystalloids in patients with and without PH hypotension. A propensity score was used to adjust the mortality comparison between the groups receiving high versus low volumes of PH crystalloids. Instead of using five covariates (PH time, PH blood, PH SBP, ISS, and initial base deficit), the authors had a single variable, a propensity score ranging from 0 to 1.^{100}

There is controversy over whether multivariate adjustment for confounders is better than propensity score adjustment.^{101} The answer seems to be: it depends. When the sample is large (>8–10 patients with the outcome per confounder), the multivariate method provides more information, as one can assess the effect of individual confounders. Conversely, the propensity score is better when the sample is small.

When using a multivariate model for confounder adjustment, it is important to compare the unadjusted effect with the adjusted effect. For example, if the unadjusted relative risk of a given variable is 2.0 (95% CI 1.5–2.5) and, after adjustment for confounders, the adjusted relative risk is 1.2 (95% CI 0.8–1.6), one can conclude that confounders were responsible for the effect seen in the unadjusted analysis.

So far, we have discussed how to analyze the values of specific variables. Sometimes, however, researchers need to analyze the variables themselves. This is especially true when there are numerous variables, and researchers may wish to examine patterns and combinations of variables using factor analysis. A special case of factor analysis is principal components analysis (PCA), in which the factors or components are combinations of variables that do not correlate with one another, thus representing independent components. Two examples in the trauma literature may help illustrate the use of PCA. In the first example, the San Francisco group used PCA to define three uncorrelated groups of coagulation assays (prothrombin; factors V, VII, VIII, IX, X; D-dimer; activated and native protein C; and antithrombin III levels): component 1 was defined as global clotting factor depletion; component 2 corresponded to the activation of protein C and fibrinolysis; and component 3 to factor VII elevation and factor VIII depletion.^{102} The authors reported that component 1 predicted mortality and INR/PTT elevation, while component 2 (fibrinolytic coagulopathy) predicted infection, end-organ failure, and mortality. A second example is seen in a similar study by our group, this time using TEG values.^{103} As shown in Table 63-2, PCA generated a number of components, each of which contains a number of variables, each one associated with a factor loading (which can be interpreted much like a simple correlation coefficient, ranging from –1 to 1). In our analysis of TEG variables, three components were responsible for 93% of the variation in the data. Component 1 included K, angle, maximum amplitude, maximal rate of thrombus generation, and total thrombus generation; component 2 included activated clotting time and time to maximal rate of thrombus generation; and component 3 reflected fibrinolysis (a code sketch of this procedure follows Table 63-2). Taken together, both studies supported the conclusion that trauma-induced coagulopathy has distinct, independent mechanisms, which may require tailored hemostatic therapeutic approaches.

###### TABLE 63-2

Composition of Principal Components (PC)

| | PC 1 | PC 2 | PC 3 |
|---|---|---|---|
| % variance explained by component | 63% | 17% | 13% |
| Activated clotting time | –30 | 90 | 6 |
| K | –95 | 15 | –4 |
| Angle | 92 | –26 | 5 |
| Maximum amplitude | 95 | –15 | –10 |
| % lysis at 30 min | –4 | 0 | 99 |
| Time to maximal rate of thrombus generation | –13 | 95 | –5 |
| Maximal rate of thrombus generation | 94 | –23 | 3 |
| Total thrombus generation | 94 | –14 | –16 |
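
A sketch of PCA with loadings using scikit-learn on a synthetic stand-in for a TEG matrix (the data are random noise, so the components are meaningless; only the mechanics are illustrated):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical matrix: one row per patient, one column per TEG assay.
rng = np.random.default_rng(6)
teg = pd.DataFrame(rng.normal(size=(200, 5)),
                   columns=["ACT", "K", "angle", "MA", "LY30"])

# Standardize, then extract three uncorrelated components.
pca = PCA(n_components=3)
scores = pca.fit_transform(StandardScaler().fit_transform(teg))

print(pca.explained_variance_ratio_)         # % variance per component
loadings = pd.DataFrame(pca.components_.T,   # variable-to-component loadings
                        index=teg.columns, columns=["PC1", "PC2", "PC3"])
print(loadings)
```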

PCA also has assumptions, including normal distribution and more subjects than variables. Therefore, this technique is not applicable to many of the “omics” studies (ie, genomics, proteomics, and metabolomics), where the number of variables far exceeds the number of subjects. In this case, an option is projection to latent structures (PLS, originally known as partial least squares) analysis, commonly followed by a discriminant analysis (PLS-DA). Although a description of the technique is beyond the scope of this chapter, the principles of PLS are similar to those of the PCA explained above. PLS results in combinations of variables, which can then be used to predict outcomes.

Cluster analysis, on the other hand, groups subjects into clusters based on their responses. One can use the factors or components derived from the PCA or the PLS to define the clusters. Cluster analysis is not really a statistical tool; rather, it is a mathematical method to group subjects based on their responses. In a recent example, our group used cluster analysis to group patients according to PCA-derived TEG components (Fig. 63-5). Patients in Cluster 2 (hypocoagulable with no fibrinolysis) were more severely injured (higher ISS) but had a lesser degree of shock compared to patients in Clusters 3 and 4. Patients in Cluster 4 required significantly more blood products and had an increased hemorrhage-related mortality rate compared to the other groups.
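
A sketch of this second step: grouping patients by their component scores with k-means (the number of clusters is illustrative, and the synthetic scores carry over from the PCA sketch above):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Recreate synthetic component scores as in the PCA sketch above.
rng = np.random.default_rng(6)
teg = rng.normal(size=(200, 5))
scores = PCA(n_components=3).fit_transform(StandardScaler().fit_transform(teg))

# Partition patients into four groups by their component profile.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(scores)
```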

A systematic review (SR) collects, appraises and collates all evidence addressing a specific research question by using explicit, reproducible, systematic methods. SRs were created to assist healthcare professionals in managing the daunting amount of health-related information. If the reader is naïve about a topic, SRs are the essential first step to begin to understand that topic. The Cochrane Collaboration has led the field by establishing the Cochrane Library, one of the most important, useful and regularly updated SR collections, available at http://www.cochranelibrary.com/ (accessed July 6, 2015). Their use of rigorous methods assures the reader that available evidence has been collected in a manner that minimizes bias and informs medical decision making. In addition, the *Cochrane Handbook for Systematic Reviews of Interventions* contains methodological guidance for anyone preparing or appraising systematic reviews.^{56}

Many SRs are accompanied by a meta-analysis (MA), in which the results of the independent studies in the SR are combined with the goal of deriving more precise estimates than each individual study could provide.^{56} Well-done SR/MAs also examine the consistency of evidence as well as explore the differences across studies. Many SR/MAs are now registered in PROSPERO, an international registry of systematic reviews with a searchable database (http://www.crd.york.ac.uk/PROSPERO/, accessed July 7, 2015). As of July 7, 2015, a simple search using the term “trauma” listed 354 SRs completed or in progress.

In order to make this as practical and applied as possible, we will examine these concepts using a 2015 Cochrane Library SR/MA: “Blood-clot promoting drugs for acute traumatic injury,” produced by the Cochrane Injuries Group based at the London School of Hygiene & Tropical Medicine.^{104} This is the third update of this review of RCTs assessing the effects of antifibrinolytics in trauma patients; the first was published in 2004 and the second in 2012, all available from the same Web site. Although this review was restricted to RCTs, there are SRs that include other types of studies. The search strategy included a variety of databases (PubMed, Embase, the Cochrane Injuries Group Specialised Register and Cochrane Central Register of Controlled Trials, ClinicalTrials.gov, the WHO International Clinical Trials Registry Platform, and others). In addition, as is common, the authors checked all references in the identified trials and background papers, and contacted study authors and pharmaceutical companies to identify relevant published and unpublished data.

The search for unpublished data is essential to minimize publication bias, which occurs when the dissemination of research findings is influenced by the nature and direction of results. In general, this affects negative (nonsignificant) studies, which are less likely to be published and, when published, are less likely to appear in a high-impact, English-language journal. In addition, publication bias should be further assessed by using funnel plots, which are scatter plots of the intervention effects from individual studies, usually plotted on the horizontal axis, against some measure of study size on the Y axis (eg, sample size, standard error, variance). The effects from small, less precise studies will spread widely at the bottom of the graph, while larger studies’ effects will cluster together at the top. Although our example did not contain a funnel plot, we refer the reader to the excellent examples in the “Cochrane Handbook, Fig. 10-4a: hypothetical funnel plots.”^{56} In the presence of publication bias (eg, unpublished small negative studies), the funnel plot will have an asymmetrical appearance, and a combined effect derived in the meta-analysis may overestimate the treatment’s effect. It is important to mention that there are other reasons for asymmetry, such as trials of lower quality, which tend to overestimate effects.^{105}

The abovementioned SR was unrestricted by publication date and language, but some reviews limit searches to articles in the English language, which can create substantial bias. The terms used for the search are found in its Appendix 1 and can be useful to guide the readers’ own searches. Our example’s first figure shows that out of 1371 records, plus four trials already identified in previous versions of this SR, three RCTs were finally included in the meta-analysis. This information educates the reader about the amount of research available in the area, and it also provides information on trials still being conducted that can be found on clinicaltrials.gov (in our example: NCT01402882; NCT01990768; NCT02187120; NCT02086500).

In a SR, each study’s risk of bias (aka quality) should be appraised; in this example, the GRADE system, which is described in more detail later in this chapter, was used for this purpose. The risk of bias was considered low (high quality) for the outcomes of mortality, need for further surgery, and blood transfusion, while the quality was considered moderate for the vascular occlusive outcomes (including heart attacks, deep vein thrombosis, stroke, and pulmonary embolism).

Many SRs will stop here and derive some qualitative conclusion from the summative body of evidence. Others, such as our example, will move forward to combine the results of the independent studies in a meta-analysis. Before combining studies, the heterogeneity must be assessed. Statistical heterogeneity (ie, variability in intervention effects between studies) is usually assessed using the I^{2} statistic, which describes the percentage of total variation across studies due to heterogeneity rather than chance. The I^{2} statistic should be accompanied by 95% CI or a Chi-square test, for which a larger p-value indicates less heterogeneity. When statistical heterogeneity is high (I^{2} > 50%), a meta-analysis is almost always inappropriate. In addition to statistical heterogeneity, clinical heterogeneity (variability in designs, populations, measurement of outcomes, etc) is also an important consideration.
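
The I^{2} statistic can be computed from Cochran’s Q under a fixed-effect model; the effects and variances below are invented to show the arithmetic, not taken from any cited meta-analysis:

```python
import numpy as np
from scipy.stats import chi2

# Log relative risks and their variances for k = 3 studies (invented).
effects = np.array([-0.10, -0.12, -0.05])
variances = np.array([0.002, 0.050, 0.120])

w = 1 / variances                                # inverse-variance weights
pooled = np.sum(w * effects) / np.sum(w)
Q = np.sum(w * (effects - pooled) ** 2)          # Cochran's Q
dof = len(effects) - 1
i_squared = max(0.0, (Q - dof) / Q) * 100        # % variation beyond chance
p_value = chi2.sf(Q, dof)                        # Chi-square test for Q
print(i_squared, p_value)
```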

The studies included in the post-injury antifibrinolytic SR/MA showed no evidence of statistical heterogeneity (I^{2} = 13%, *p* = 0.32). Therefore, based on the pooled data, the authors concluded that these agents reduced the risk of death from any cause from 16% to 14.5% (relative risk: 0.90; 95% CI: 0.85–0.96; *p* = 0.002). However, we should note that one of the three included studies (the abovementioned CRASH-2)^{64} was responsible for 98% of the data; the second study^{106} focused on traumatic brain injury only; and the third study^{107} assessed aprotinin (the other two used TXA) in 77 participants with major skeletal trauma and shock. Thus, despite low statistical heterogeneity, there seemed to be substantial clinical heterogeneity.

In another example of conflicting clinical and statistical heterogeneity, Natanson and colleagues, in a SR/MA on blood substitutes,^{108} combined 16 trials and concluded that these products were associated with a statistically higher risk of death (relative risk: 1.30; 95% CI: 1.05–1.61) and of myocardial infarction (relative risk: 2.71; 95% CI: 1.67–4.40). However, the SR/MA encompassed different populations (elective surgery, trauma, acute care surgery, and ischemic stroke patients), different controls (fluids, blood products), and different blood substitutes. Although the statistical heterogeneity among these studies was not significant for either mortality or MIs (for both, I^{2} = 0, *p* ≥ 0.60), conclusions based on a mix of studies differing at so many levels must be taken with caution.^{109} It should be emphasized that heterogeneity is not a failure; rather, it is an important finding that should be further explored.

It is not rare to find articles, sometimes in the same journal issue, with conflicting results. How should we decide between them? Again, to make this practical and applicable, we can invoke two recent examples of these “collisions.” In the May 2013 issue of the *Journal of Trauma and Acute Care Surgery*, two articles explored the role of crystalloids in early trauma resuscitation. Both studies used data from the same database, the National Institute of General Medical Sciences-funded Glue Grant, and seemingly arrived at disparate conclusions.^{110,111} The accompanying editorial by Dr David Hoyt was able to reconcile these differences, expertly navigating the reader through the two research studies and coming up with a unified message.^{112} Dr Hoyt highlighted that the exclusion of patients who died within the first 48 hours in the study by Kasotakis and colleagues resulted in a dramatically different population than that of the study by Brown et al. Basically, the message of both studies was that fluid resuscitation should be guided by blood pressure and oxygen delivery to avoid the harmful effects of excess fluids.

In a second example, we published in March 2014 a comprehensive review on the temporal trends of post-injury MOF using the above-mentioned Glue Grant database and concluded that the incidence of post-injury MOF decreased over time while the MOF case-fatality rate remained stable.^{85} Conversely, Fröhlich et al in April 2014, using the Trauma Register DGU of the German Trauma Society, reported a significant increase in MOF incidence but a decrease in case-fatality rate.^{113} We were left with two apparently disparate messages. How can they be reconciled so that the messages can be appropriately translated to our clinical practice and/or advance our research agenda? As astutely done by Dr Hoyt in the previous example, we should invoke the PICO framework to determine to which *P*opulation each study specifically applied and also to ensure that the *O*utcomes (ie, post-injury MOF) were indeed similarly measured. Once this is done, the differences are glaring. The entry criteria for the two studies were different: the patients enrolled in the Glue Grant study were more severely injured (ie, blunt torso trauma with hemorrhagic shock; all required at least 1 RBC unit within 12 hours) than the German patients (24% required 1 RBC unit between hospital arrival and ICU admission). The German study population included a large proportion of traumatic brain injury victims (59%), while the Glue Grant study specifically excluded these patients. Second, the two studies used different definitions of the *O*utcome post-injury MOF. The German investigators employed the Sequential Organ Failure Assessment (SOFA) score, which assesses the dysfunction of six organ systems including the central nervous system (CNS). In contrast, the Glue Grant investigators used the Denver MOF score, which does not assess the CNS, and a modified version of the Multiple Organ Dysfunction Score (MODS) without its CNS component. Thus, the two studies apply to different *P*opulations and to somewhat different *O*utcomes.

According to the Institute of Medicine 2011 report, “clinical practice guidelines are statements that include recommendations intended to optimize patient care that are informed by a systematic review of evidence and an assessment of the benefits and harms of alternative care options.”^{114} This report established standards for the rigorous development of guidelines including:

Based on systematic review of the evidence

Led by a knowledgeable, multidisciplinary expert panel and representatives from key affected groups

Considers relevant patient subgroups and patient preferences

Based on an explicit and transparent process that minimizes distortions, biases, and conflicts of interest

Explains the association between care options and health outcomes

Rates both the quality of evidence and the strength of recommendations

Reconsidered and revised as appropriate

The GRADE system, mentioned earlier in this chapter, is widely adopted as a transparent, systematic process to appraise the quality of the summative body of evidence and define the strength of a recommendation.^{9} We will use a recent example of a guideline presented at the Eastern Association for the Surgery of Trauma (EAST) and subsequently published in the *Journal of Trauma and Acute Care Surgery*. Fox et al^{115} addressed the evaluation and management of blunt traumatic aortic injury (BTAI). They formulated several pertinent questions using the PICO framework, of which we will follow the process for PICO Question 2: “In patients with BTAI (P), should endovascular (I) repair be performed versus open repair (C) to minimize mortality, stroke, paraplegia, and renal failure (O)?” These four outcomes were deemed “critical for decision making” using the GRADE priority scale for outcomes, while other outcomes (cost, length of stay) were considered less important.

Using a standardized search covering the period from 1997 to 2013, the authors selected 37 studies; all 37 addressed the mortality outcome, 21 reported on paralysis, and 12 on stroke. No reliable evidence on the renal failure outcome was available. The selection process (ie, how many articles were initially found, how many were excluded and why) was well detailed in the article, reassuring the reader that no articles were excluded due to the researchers’ bias and that all pertinent studies on this research question were considered. Often, this process is illustrated using a PRISMA (Preferred Reporting Items for Systematic reviews and Meta-Analysis) diagram, which depicts the flow of information through the SR phases, mapping out the number of records identified, included and the reasons for exclusions (http://www.prisma-statement.org/statement.htm, accessed July 7, 2015).

The grading of the evidence was accomplished using the GRADE system, which begins with the research design of the included studies.^{116} In intervention related topics, RCTs start as “high quality” while observational designs start as “low quality.” Any other design starts as “very low quality.” For diagnostic studies, the GRADE system allows observational studies to be rated initially as “high quality” if there was a direct comparison of the test results with an appropriate reference standard.^{117} From this starting point (high- or low-quality research design), other factors may increase or decrease the quality of evidence.

The factors that can further decrease the quality of the evidence fall into five domains: risk of bias, consistency, directness, precision, and publication bias. Incidentally, the Effective Health Care Program sponsored by the Agency for Healthcare Research and Quality uses similar domains.^{118}

*Risk of bias:* This domain refers to the execution of the studies. For RCTs, problems in allocation concealment, lack of blinding, and loss to follow-up increase the risk of bias and decrease the quality. For other comparative studies, the researchers need to identify the potential for biases as described in previous sections of this chapter.

*Consistency:* This refers to the degree to which the included studies find either the same direction or a similar magnitude of effect. In diagnostic studies, this involves looking at the sensitivity and specificity observed in the studies included in the review.

*Directness:* This refers to the extent to which the population, interventions, diagnostic tests, and outcome measures are similar to those of interest. For example, if the patients in the selected studies were less injured or older than those to whom the intervention would be applied, directness would be low, and the quality would be downgraded. Lack of studies directly comparing two interventions or diagnostic tests of interest can also lower the quality.

*Precision:* This refers to the degree of certainty surrounding an effect estimate with respect to a given outcome.

*Publication bias:* As explained in previous sections of this chapter, this relates to the possibility of selective publication or reporting of research findings based on the favorability of the direction or magnitude of effect.

Other factors can increase the quality of the evidence. These include large effects (significant relative risk >2.0 or <0.5) and evidence of a dose response. In addition, if it is reasonable to assume that the observed effect would have been even larger had potential confounders been accounted for, the quality may be upgraded.

In the Fox et al SR on BTAI, there were no RCTs, only observational studies, thus the quality started as “low.” The authors did not observe any serious risk of bias, inconsistency, indirectness, imprecision, or publication bias for any of the outcomes; thus, the quality of the evidence remained “low.” They evaluated heterogeneity (I^{2} = 0 for all outcomes, indicating low heterogeneity) and were able to combine the studies into a meta-analysis. Endovascular repair was associated with reduced mortality compared to open repair, with a relative risk of 0.56 (95% CI: 0.44–0.73). This relative risk did not indicate an effect large enough to upgrade the quality of the evidence for mortality. On the other hand, endovascular repair was associated with a significantly lower rate of paraplegia (relative risk: 0.36; 95% CI: 0.19–0.71); thus, for this outcome, the quality of the evidence was upgraded to “moderate.” Regarding stroke, there was no significant difference between the two procedures (relative risk: 1.48; 95% CI: 0.67–3.27), thus the quality of the evidence remained unaltered.

Appropriately so, the EAST panel took into consideration the patients’ perspective, and deliberated that, despite the low to moderate quality of evidence, patients would likely place a high value on a less invasive procedure associated with lower mortality and risk of paraplegia. Based on these considerations as well as other logistical issues, the panel strongly recommended the use of endovascular repair in BTAI patients who do not have contraindications to it. A limitation of this review was the lack of evaluation of the long-term outcomes of endovascular repair.

Critical appraisal is the systematic evaluation of research reports addressing the quality of several items including:

The research question: Is it well formulated? Does it address an existing gap in current knowledge?

Internal validity: Is the research design appropriate to address the research question? Was sampling well conducted, bias and confounding minimized, effect modification explored, and appropriate analytic techniques used? Were the effects appropriately measured, their size, direction and uncertainty well estimated?

Relevance: Are these valid results relevant?

External validity: To whom do these valid, relevant results apply?

There are several tools to conduct critical appraisal of evidence. Although originally created to guide peer review of manuscripts, the *Journal of Trauma and Acute Care Surgery* standardized research methods review can be a helpful tool to guide critical appraisal (Table 63-3). The Oxford Centre for Evidence-Based Medicine (CEBM) provides appraisal worksheets specific for systematic reviews as well as for prognostic, diagnostic, and therapeutic studies (http://www.cebm.net/critical-appraisal/, accessed July 7, 2015), along with helpful examples.

###### TABLE 63-3

Checklist for statistical assessment of general papers:

- Appropriate study design used to achieve the objective(s).
- Source of subjects/data appropriately described.
- Sampling/sample size appropriately described.
- Entry and exclusion criteria clearly defined.
- Data exclusions are stated/explained and impact on results explored.
- N reported at the start of the study, for each data set, and for each analysis.
- Discrepancies in the value of N between analyses clearly explained/justified.
- Missing data are explained, and impact on findings minimized/explained.
- Satisfactory follow-up/response rate.
- Appropriate measure(s) of center (eg, mean or median).
- Appropriate measure(s) of variability (eg, standard deviation or range).
- Adequate uni/bivariate statistical analyses used/described.
- Adequate multivariate statistical analyses used/described.
- Confounding and bias explored and minimized.
- Assumptions of tests applied met (particular attention paid to nonnormal data sets or small sample sizes).
- Effects (odds ratios, relative risks, risk differences) in the “expected” direction, or if not, unexpected direction explained.
- Adjustments made for multiple testing explained.
- Unit of analysis given for all comparisons.
- Alpha level given for all statistical tests.
- Actual P values given for primary analyses.
- Unusual/complex statistical methods clearly explained.
- Method of group assignment (eg, randomization) explained and justified.
- Any data transformations clearly described and justified.
- Confidence intervals given for the main results.
- Conclusion drawn from the statistical analysis is justified.

Adapted from the BMJ statistical checklist at http://resources.bmj.com/bmj/authors/checklists-forms/statisticians-checklist (accessed October 8, 2011) and the Nature statistical adequacy checklists at www.nature.com/nature/authors/gta and www.nature.com/ncomms/…/Checklist_of_statistical_adequacy.doc (accessed October 8, 2011).

Alternatively, the reporting standards, created to promote transparent and accurate reporting of research studies, may also be used to appraise them. The first high-impact reporting guideline was the CONSORT (CONsolidated Standards Of Reporting Trials, available at http://www.consort-statement.org/, accessed July 7, 2015), mainly directed at RCTs. The CONSORT diagram, which describes the inclusions and exclusions of patients in trials, is now considered an obligatory component by most important medical journals. The impact of the CONSORT was so dramatic that reporting standards quickly appeared for other types of studies, including STROBE for observational studies; PRISMA for systematic reviews and meta-analyses; CHEERS for economic evaluations; COREQ for qualitative research; and GRADE for guidelines. These reporting standards have checklists that may guide critical appraisal. They are available at the Web site of the EQUATOR network (*E*nhancing the *QUA*lity and *T*ransparency *O*f health *R*esearch), a group of renowned international experts that grew out of the work of CONSORT and other guideline development groups (http://www.equator-network.org/, accessed July 7, 2015).

*BMJ*. 1996;312(7023):71–72. [PubMed: 8555924]

*JAMA*. 1992;268(17):2420–2425. [PubMed: 1404801]

*Health Policy*. 1995;32(1–3):125–139. [PubMed: 10156633]

*Health Program Planning: An Educational and Ecological Approach*. 4th ed. New York, NY: McGraw-Hill; 2005.

*J Trauma Acute Care Surg*. 2012;72(6):1484–1490. [PubMed: 22695411]

*JAMA*. 2011;305(19):2005–2006. [PubMed: 21586716]

*Cochrane Database Syst Rev*. 2011(11):CD001270.

*BMJ*. 2008;336(7650):924–926. [PubMed: 18436948]

*J Bone Joint Surg Am*. 2004;86-A(8):1717–1720. [PubMed: 15292420]

*J Bone Joint Surg Am*. 2005;87(12):2632–2638. [PubMed: 16322612]

*World J Surg*. 2005;29(5):567–9. [PubMed: 15830117]

*World J Surg*. 2005;29(5):561–566. [PubMed: 15827842]

*N Engl J Med*. 2009;361(2):109–112. [PubMed: 19494209]

*Clinical Practice Guidelines: Directions for a New Program*. Field ML, Lohr KN, eds. Washington, DC: National Academy Press; 1990.

*Crit Care Med*. 2010;38(10):S534–S538. [PubMed: 21164394]

*J Trauma*. 2011;71(6):1720–1725. [PubMed: 21841516]

*J Trauma Acute Care Surg*. 2011;70(6):1401–1407.

*BMJ*. 1950;2(4682):739–748. [PubMed: 14772469]

*BJSM*. 2003;37(3):197–206. [PubMed: 12782543]

*J Am Coll Cardiol*. 2011;58(24):e123–e210. [PubMed: 22070836]

*JAMA*. 2004;292(13):1612–1614. [PubMed: 15467065]

*Transfusion*. 2006;46:327–338. [PubMed: 16533273]

*Am J Prev Med*. 2012;43(3):337–350. [PubMed: 22898128]

*Health Behavior and Health Education: Theory, Research, and Practice*. 4th ed. San Francisco, CA: John Wiley & Sons; 2008.

*Health Educ Res*. 2011;26(5):872–885. [PubMed: 21536712]

*JAMA*. 1998;280(19):1690–1691. [PubMed: 9832001]

*N Engl J Med*. 1999;340(8):618–626. [PubMed: 10029647]

*N Engl J Med*. 1999;341(4):279–283. [PubMed: 10413743]

*CMAJ*. 2012;184(8):895–899. [PubMed: 22158397]

*Statistical Power Analysis for the Behavioral Sciences*. New York, NY: Academic Press; 1977.

*JAMA*. 1995;274(22):1800–1804. [PubMed: 7500513]

*JAMA*. 2011;305(7):713–714. [PubMed: 21325190]

*Ann Intern Med*. 2007;147(12):871–875. [PubMed: 18087058]

*BMJ*. 1999 May 1; 318(7192):1209.

*Resuscitation*. 2005;65(1):65–69. [PubMed: 15797277]

*Bone Joint Res*. 2014;3(4):123–129. [PubMed: 24764547]

*Users’ Guides to the Medical Literature: A Manual for Evidence-Based Clinical Practice*. Chicago, IL: American Medical Association Press; 2002.

*Acad Emerg Med*. 2010;17(10):1104–1112. [PubMed: 21040112]

*Ann Surg*. 2016 Jun;263(6):1051–1059. [PubMed: 26720428]

*JAMA*. 2014;311(16):1622–1631. [PubMed: 24756512]

*The Theory of Response-Adaptive Randomization in Clinical Trials*. New York, NY: Wiley; 2006.

*J Biopharm Stat*. 2012;22(4):719–736. [PubMed: 22651111]

*JAMA*. 2015;313(5):471–482. [PubMed: 25647203]

*JAMA Surg*. 2015;150(1):24–29. [PubMed: 25372451]

*AIDS Educ Prev*. 2006;18(Suppl):59–73. [PubMed: 16987089]

*J Acquir Immune Defic Syndr*. 2008;47(Suppl 1):S40–S46. [PubMed: 18301133]

*J Trauma Acute Care Surg*. 2014;76(4):1061–1069. [PubMed: 24662872]

*J Am Coll Surg*. 2012;214(5):756–768. [PubMed: 22321521]

*J Trauma Acute Care Surg*. 2013;75(1):166–172. [PubMed: 23940864]

*J Trauma Acute Care Surg*. 2014;76(5):1322–1327. [PubMed: 24747468]

*Statistical Methods and Scientific Inference*. Oxford, England: Hafner Publishing Co; 1956: viii, 175.

*Epidemiology*. 1990;1(1):43–46. [PubMed: 2081237]

*Restor Dent Endodont*. 2015;40(2):172–176.

*BMJ*. 1998 Apr 18;316(7139):1236–1238. [PubMed: 9553006]

*Crit Care*. 2005;9(1):34–36. [PubMed: 15693981]

*Ann Surg*. 2011;253(3):431–441. [PubMed: 21178763]

*Health Technol Assess*. 2013;17(10):1–79.

*Crit Care*. 2002;6(3):222–225. [PubMed: 12133182]

*JAMA*. 2012;308(24):2594–2604. [PubMed: 23268518]

*Crit Care Clin*. 2009;25(2):325–356. [PubMed: 19341912]

*Trials*. 2012;13:214. [PubMed: 23157733]

*JAMA*. 2006;295(10):1147–1151. [PubMed: 16522835]

*Proc R Soc Med*. 1965;58(5):295–300. [PubMed: 14283879]

*Epidemiol Perspect Innov*. 2009;6:2. [PubMed: 19534788]

*World J Surg*. 1996;20(4):392–400. [PubMed: 8662125]

*Arch Surg*. 1997;132(6):620–624. [PubMed: 9197854]

*Arch Surg*. 2005;140(5):432–438. [PubMed: 15897438]

*Surgery*. 2006;140(4):640–647. [PubMed: 17011912]

*Surgery*. 2005;138(4):749–757. [PubMed: 16269305]

*Arch Surg*. 2010;145(10):973–977. [PubMed: 20956766]

*Modern Epidemiology*. 2nd ed. Philadelphia, PA: Lippincott Williams & Wilkins: 1998.

*J Trauma Acute Care Surg*. 2013;75(1 Suppl 1): S97–S103. [PubMed: 23778519]

*N Engl J Med*. 2012;367(14):1355–1360. [PubMed: 23034025]

*J Trauma Injury Infect Crit Care*. 1998;45(2):291–303.

*J Trauma Acute Care Surg*. 2013;74(4):999–1004. [PubMed: 23511137]

*Intensive Care Med*. 2004;30(12):2237–2244. [PubMed: 15502934]

*J Trauma Acute Care Surg*. 2014;76(3):582–593. [PubMed: 24553523]

*J Trauma Injury Infect Crit Care*. 2005;59(3):698–704.

*J Trauma Acute Care Surg*. 2015;78(6):1168–1175. [PubMed: 26151519]

*J Trauma Acute Care Surg*. 2012;73(2):358–364. [PubMed: 22846940]

*Anesthesiology*. 2012;116(3):716–728. [PubMed: 22270506]

*JAMA*. 2015;313(5):471–482. [PubMed: 25647203]

*BMJ*. 2002 Jun 15; 324(7351):1448–1451. [PubMed: 12065273]

*Crit Care Med*. 2013;41(2):399–404. [PubMed: 23263579]

*J Trauma Acute Care Surg*. 2015;79(1):15–21. [PubMed: 26091309]

*J Trauma Acute Care Surg*. 2013;75(2):195–201. [PubMed: 23823614]

*J Trauma Acute Care Surg*. 2014 Dec;77(6):811–817; discussion 817. [PubMed: 25051384]

*J Trauma*. 2008;65(2):261–270. [PubMed: 18695460]

*Ann Intern Med*. 2002;136:111–121. [PubMed: 11790062]

*Shock*. 2009 May;31(5):438–447. [PubMed: 18838942]

*Crit Care Med*. 2007;35(9):2052–2056. [PubMed: 17568333]

*J Trauma Acute Care Surg*. 2013;74(5):1207–1214. [PubMed: 23609269]

*Am J Epidemiol*. 2003;158(3):280–287. [PubMed: 12882951]

*J Trauma Acute Care Surg*. 2013;74(5): 1223–1229; discussion 9-30. [PubMed: 23609271]

*Surgery*. 2014;156(3):570–577. [PubMed: 24962188]

*Cochrane Database Syst Rev*. 2015, Issue 5, Art No: CD004896. http://onlinelibrary.wiley.com/doi/10.1002/14651858.CD004896.pub4/abstract. Accessed July 6, 2015.

*JAMA*. 1995;273(5):408–412. [PubMed: 7823387]

*BMC Emerg Med*. 2013;13:20. [PubMed: 24267513]

*Circ Shock*. 1982;9(2):107–116. [PubMed: 6177441]

*JAMA*. 2008;299(19):2304–2312. [PubMed: 18443023]

*JAMA*. 2008;300(11):1295–1299.

*J Trauma Acute Care Surg*. 2013;74(5):1215–1221; discussion 21-2. [PubMed: 23609270]

*J Trauma Acute Care Surg*. 2013;74(5):1207–1212; discussion 12-4. [PubMed: 23609269]

*J Trauma Acute Care Surg*. 2014;76(4):921–928. [PubMed: 24662853]

*Clinical Practice Guidelines We Can Trust*. Washington, DC: National Academies Press; 2011.

*J Trauma Acute Care Surg*. 2015;78(1):136–146. [PubMed: 25539215]

*BMJ*. 2004 Jun 19;328(7454):1490.

*BMJ*. 2008;336(7653):1106–1110. [PubMed: 18483053]

*J Clin Epidemiol*. 2010;63(5):513–523. [PubMed: 19595577]