Diagnostic testing and the average absolute likelihood ratio: application to diagnosing wide QRS complex tachycardia and other ED diseases
American Journal of Emergency Medicine (2012) 30, 1895-1906
Original Contribution
Keith A. Marill MD*
Department of Emergency Medicine, Massachusetts General Hospital, Harvard Medical School, Boston, MA
Received 27 September 2011; revised 1 April 2012; accepted 1 April 2012
Abstract The Bayesian approach to disease diagnosis in the emergency department is facilitated by the use of likelihood ratios (LRs) to evaluate diagnostic tests. The use of dichotomous, interval, and joint LRs for single and multiple tests is reviewed, and comparison is made to regression modeling.
The clinical motivation for a single statistic to describe the average change in the odds of disease associated with the use of a particular test or series of tests is described. This new extension of the LR concept is termed the average absolute LR (AALR).
Illustrative examples include the use of elevated electrocardiogram ST segment and troponin to diagnose acute myocardial infarction, and serum D-dimer and computed tomographic angiography to diagnose pulmonary embolism. Finally, a detailed example with original data demonstrating the use of the AALR to compare QRS duration, QRS axis, and the 2 tests combined to diagnose ventricular tachycardia in patients with stable sustained regular wide QRS tachycardia is provided. Application of both tests together to patients with wide QRS complex tachycardia changes the odds of ventricular tachycardia, on average, by a factor of 3.5 (95% confidence interval, 2.4-6.2). Challenges are described, and methods are provided to estimate the 95% confidence interval of the LR and AALR using bootstrapping techniques. The AALR is a test statistic that may be helpful for clinicians and researchers in evaluating and comparing diagnostic testing approaches.
Introduction
Diagnostic testing is a critical aspect of the practice of emergency medicine. Disease diagnosis begins with the history and physical examination. Hematologic tests and analyses of other body fluids, electrocardiography, and diverse imaging modalities are then used to diagnose definitively or more precisely define the likelihood of serious illnesses. An important challenge in the field is to compare and contrast these myriad available diagnostic tests to find the most valuable test or test combination to assess for serious disease in any given patient.

^{?} Financial support and financial interests for this work: None.
^{??} Prior presentation: Abstract poster presentation, Society for Academic Emergency Medicine Annual Meeting; Chicago, IL; May 12, 2012.
* Zero Emerson Place, Suite 3B, Boston, MA 02114. Tel.: +1 617 726 6636; fax: +1 617 724 0917.
E-mail address: [email protected].
The purpose of this study is 2-fold. The first goal is to review likelihood ratios (LRs), interval LRs, and joint LRs and to provide insight into their desirable properties for assessing diagnostic tests. This includes briefly contrasting the LR approach with logistic regression modeling. The second goal is to introduce a new statistic for evaluating diagnostic tests, the average absolute LR (AALR). This statistic represents an extension of the LR concept. The AALR provides a more global picture of the utility of a diagnostic test or test algorithm before the test results are known and an alternate means to compare potential testing strategies. Finally, a bootstrapping method for estimating the 95% confidence interval (CI) of the LR and AALR is provided.

0735-6757/$ - see front matter (C) 2012. http://dx.doi.org/10.1016/j.ajem.2012.04.002
Likelihood ratios
The LR has emerged as an important characteristic to evaluate the use of a diagnostic test. For a single binary test, there are 2 results, labeled positive LR and negative LR. The positive and negative LRs describe the change in the odds of disease for positive and negative test results, respectively. Recall that the odds of disease is the ratio of those with to those without disease, A/B, whereas the probability of disease is the ratio of those with disease to all patients, A/(A + B). An LR value greater than 1 describes an increase in the likelihood of disease, a value of 1 does not change the likelihood of disease, and a value between 0 and 1 describes a decrease in the likelihood of disease. Precisely, the LR multiplied by the pretest odds of disease yields the posttest odds of disease (Fig. 1). One attraction of this approach is that this adjustment of the odds or probability of disease based on a test result is generally the way clinicians qualitatively think (probability can be adjusted using the Fagan nomogram) [1]. We constantly reassess the likelihood of disease based on new information obtained or a change in the patient's condition.
The LR can be described in terms of the sensitivity and specificity of the test, which describe characteristics of the test for patients with and without the disease of interest, respectively (Fig. 1). Another particular strength of the LR is that although it depends on patients with and without disease, it does not depend on the relative proportions of these patients. It is not a function of the prevalence or pretest probability of disease in the sample of subjects tested.
In many instances in the emergency department (ED), a single test is not diagnostic, and multiple tests are used to adjust the odds of disease. For example, both an electrocardiogram (ECG) and troponin biomarkers are used to assess for acute myocardial infarction (MI). The simplest example would be 2 separate binary diagnostic tests. If the tests are truly independent with no interaction effect, then the LR from each test can be applied in series to determine an adjusted odds of disease. The pretest odds of disease is multiplied by the LR for each test result, whether positive or negative, to determine the final posttest odds of disease. If the 2 independent tests are congruent and both positive or both negative, then they generally produce a combined LR that is more extreme than either test alone.
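As a minimal sketch, the serial Bayesian update described above can be written in a few lines; the 10% pretest probability and the LR values of 4 and 5 below are hypothetical, not drawn from the article's data:

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)

def prob(o):
    """Convert odds back to a probability."""
    return o / (1 + o)

def serial_update(pretest_prob, *lrs):
    """Multiply the pretest odds by each test's LR in turn.
    Valid only when the tests are independent with no interaction."""
    o = odds(pretest_prob)
    for lr in lrs:
        o *= lr
    return prob(o)

# Hypothetical example: 10% pretest probability, two positive tests
# with LRs of 4 and 5
post = serial_update(0.10, 4.0, 5.0)   # posttest probability, about 0.69
```

Note that an LR of exactly 1 leaves the odds, and hence the probability, unchanged.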
Fig. 1 The LR and regression modeling.

If the tests are not independent, then multiplying individual LRs is not a valid approach. For example, ST segment elevation on ECG testing and positive serum troponin testing are both predictive of acute MI. Perhaps the increase in the odds of disease with a positive serum troponin differs between patients with and without ST segment elevation. It could be that the odds of disease when both tests are positive are even higher than might be predicted from a positive result on each test individually. This example describes a positive interaction effect. Serially multiplying the pretest odds of disease by the positive LR for the ECG and troponin tests individually would then underestimate the posttest odds of disease.
Interval and joint LRs
In many instances, diagnostic tests are not binary but have multiple possible results or may be described on a numerical scale. The serum troponin can take on a spectrum of values that are sometimes simplified to the binary "positive" or "negative." Such simplification can sometimes be suboptimal. Often, a middle range of troponin values is identified as indeterminate because it is associated with only a slightly higher risk of acute MI. When there are more than 2 relevant test results, interval LRs can be calculated to provide an estimate of the change in the odds of disease when each specified range of a spectrum of test results is obtained [2]. In essence, interval LRs make finer use of the data from an interval scale than simple binary dichotomization.
Interval LRs can also be applied to the common situation where multiple tests are performed [3]. A matrix of joint LR values is obtained. Each joint LR value corresponds to a specified result interval for each test. By determining the joint LR when multiple tests are used and by comparing it to the expected result when individual LRs are multiplied in series, evidence of an interaction effect between the tests can be determined.
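Interval LRs follow directly from result counts in the diseased and nondiseased groups; a sketch, with hypothetical counts for 3 troponin intervals (the same cell-by-cell formula yields the joint LR matrix when 2 or more tests are combined):

```python
def interval_lrs(diseased, nondiseased):
    """Interval LRs from parallel result counts: element i is the LR for
    result interval i, the fraction of diseased patients with that result
    divided by the fraction of nondiseased patients with it."""
    d_tot, n_tot = sum(diseased), sum(nondiseased)
    return [(d / d_tot) / (n / n_tot)
            for d, n in zip(diseased, nondiseased)]

# Hypothetical counts: negative / indeterminate / positive troponin,
# 100 diseased and 100 nondiseased patients
lrs = interval_lrs([5, 15, 80], [60, 30, 10])   # approximately [0.083, 0.5, 8.0]
```

Comparing each joint LR with the product of the corresponding single-test LRs is one way to look for an interaction effect.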
Receiver operator characteristic curve and area under the curve
For a given test, the receiver operator characteristic (ROC) curve plots the sensitivity against (1 - specificity) as the threshold for a positive test is varied. The area under the curve (AUC) is the total area beneath the curve. It provides a measure of the ability of the test to correctly distinguish patients with and without disease across the spectrum of possible test thresholds. For most serious illnesses, however, emergency physicians are primarily interested in only that portion of the ROC curve corresponding to a high test sensitivity. The AUC can be used to compare the utility of 2 individual tests. However, for the clinician, the AUC for a single diagnostic test has limited clinical use (it is the probability that a patient with disease will have a more positive test result than a patient without disease when there are an equal number of patients with and without disease). Furthermore, the ROC and AUC concepts are not directly applicable when multiple tests are used because the ROC curve is generally used to map the characteristics of a single test. Three-dimensional ROC surfaces that can simultaneously map more than one test have been described but are not in general use.
Likelihood ratio as compared with regression modeling
Regression modeling is another analytic approach commonly used to assess the value of a diagnostic test or algorithm of multiple tests. Multiple regression can parse out the independent association of each test with disease while adjusting for the presence of other tests in the algorithm. For example, the odds of MI can be determined for patients who have a positive vs negative troponin test, while adjusting for the fact that the simultaneous presence of ST elevation on ECG also increases the odds of MI. The relative strength of the association of different tests with disease can be compared, and interactions among tests can be assessed.
Regression, however, provides a model of the sample characteristics. The fit of the model to the data must be assessed, and it is not always satisfactory. If the fit is satisfactory, then the predictive ability of the model can be evaluated with a number of metrics, including the C statistic or net reclassification improvement. The most commonly used models assume a progressive and, most commonly, linear dependence for predictor tests that are measured on an interval scale. Conversely, LRs directly describe the sample data with minimal and flexible assumptions.
Regression modeling can be used to indirectly calculate a modeled LR using the pretest probability of disease, but it is most commonly used to compare the posttest odds of disease based on the test results. The LR, however, directly links the pretest odds to the posttest odds of disease (Fig. 1). This distinction is critical. The LR link underlies the Bayesian method of transforming the pretest to the posttest odds, mirroring the clinician thought process.
Likelihood ratio deficiencies
Although the LR has many desirable properties as a measure of diagnostic test performance, it is not perfect. High or low LRs generally signify a test with good discriminative value. Are tests with a single high or low LR always highly useful? How can positive, negative, or interval LRs be used to compare the utility of 2 individual diagnostic tests or multiple test algorithms?
Consider a binary test with a high positive LR and a negative LR close to 1. This can generally be considered a useful test, particularly when there is a positive test result. However, what if only a small minority of patients test positive? For most patients, who test negative, the test is not particularly useful. What if the test has other undesirable properties such as risk of adverse effects and morbidity, discomfort, or high cost? Would the diagnostic benefit to the small population of patients who test positive still be worth the adverse implications to the entire group of patients tested? It would be useful to know the average benefit to be expected for our patient, given that we do not know what the test result will be a priori.

Consider a series of 2 tests, each with high positive and low negative LRs. Each test is effective in adjusting the odds of disease. However, what if many patients who test positive for one test also test negative for the other, and vice versa? The results for the 2 tests are often incongruent. In this case, the combined approach for the 2 tests may be poorly effective in altering the likelihood of disease in most patients, although individually, the tests are effective.

As clinicians and researchers evaluating diagnostic tests, we may ask a simple question, "How does this test change the odds of disease?" We ask this question not for a subgroup of patients defined by the test (test positive or test negative) but for all patients who will be subjected to the test. Only when we know the average benefit to be expected for the patient can we make a reasonable risk-benefit assessment as to whether to perform the test. Furthermore, we would like to extend this concept to the common scenario when multiple tests or a test algorithm is used to assess for disease. "How does this test or test algorithm change the odds of disease, particularly as compared with other diagnostic approaches?"

Average absolute LR

A new statistic was developed empirically to answer these questions, and it is termed the AALR. The AALR describes the average of the absolute change in the odds of disease, whether the change is an increase or a decrease. It can be used to compare the overall diagnostic utility of candidate single-test or multitest approaches.

For some patients, the LR may be greater than 1, and for others, it may be less than 1. For this reason, the "absolute value" or absolute change in odds with respect to 1 is used. For those test results that increase the odds of disease with an LR greater than 1, the LR is multiplied by the proportion of patients with that test result. For those test results that lower the odds of disease with an LR value less than 1, the inverse of this value is used. For example, if the negative LR happened to be 1/2, then its inverse would be 2/1, or 2.

The AALR is defined as

AALR = (1/N_{Total}) [Σ(N_{i} x LR_{i}) (for LR > 1) + Σ(N_{k} / LR_{k}) (for LR < 1)],

where LR_{i} and LR_{k} are the interval LRs, N_{i} and N_{k} are the number of patients with test results within the corresponding intervals, and Σ means to take the sum over all of the LRs. For example, for a single dichotomous test, generally

AALR = (1/N_{Total}) [N^{+} x LR^{+} + N^{-} / LR^{-}].
The AALR provides the overall average change in the odds of disease with the application of any given test or test algorithm. Thus, it allows the clinician to compare alternative testing approaches and determine which provides the greatest overall change in the odds of disease. In combination with other factors such as potential test complications, discomfort, cost, and others, this metric can be used to help to determine the best testing approach for a given patient and symptom complex.
Average absolute LR theoretical example
The following is a theoretical example contrasting the use of the AUC and the AALR when comparing 2 tests. Suppose there were 2 tests, A and B, used to diagnose a disease condition. Test A has a sensitivity and specificity of 90% and 70%, with an LR positive and negative of 3 and 1/7, respectively. Test B has a sensitivity and specificity of 70% and 90%, with an LR positive and negative of 7 and 1/3, respectively. The AUC for both tests is equal at 0.80. In practice, however, the ability of the 2 tests to adjust the odds of disease may be quite different. Suppose the prevalence of disease in the sample was 10%. Then, the AALR for test A is 5.6, and for test B, it is 3.6. Most of these patients do not have the disease of interest and will test negative, so test A, with its more extreme negative LR, provides more useful information in adjusting the odds of disease.
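The comparison above can be checked with a short sketch using the test characteristics given in the text (the helper function names are illustrative):

```python
def binary_aalr(sens, spec, prevalence):
    """AALR of a single binary test: each result's LR (inverted when
    less than 1) weighted by the fraction of patients with that result."""
    lr_pos = sens / (1 - spec)
    lr_neg = (1 - sens) / spec
    n_pos = prevalence * sens + (1 - prevalence) * (1 - spec)
    n_neg = 1 - n_pos
    return (n_pos * max(lr_pos, 1 / lr_pos) +
            n_neg * max(lr_neg, 1 / lr_neg))

def binary_auc(sens, spec):
    """ROC AUC of a single binary test (trapezoid through one point)."""
    return (sens + spec) / 2

aalr_a = binary_aalr(0.90, 0.70, 0.10)   # test A: about 5.6
aalr_b = binary_aalr(0.70, 0.90, 0.10)   # test B: about 3.6
```

Both tests share an AUC of 0.80, yet their AALRs differ at a 10% prevalence, which is the point of the example.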
Lack of generalizability is a relative disadvantage of the AALR due to its dependence on disease prevalence. However, this dependence on disease prevalence can also be perceived as an advantage because it allows the AALR to demonstrate how the test can be expected to perform in the population of interest. Often in emergency medicine, a false-negative test is more dangerous than a false-positive test, so sensitivity is maximized at the expense of specificity. This asymmetry may be best expressed by the individual LR+ and LR- as opposed to the AALR. However, when multiple diagnostic tests are used and the situation becomes more complex, the overall adjustment in the odds of disease for the population of interest may still be best summarized with the AALR.
Average absolute LR application to a known clinical algorithm: diagnosing pulmonary embolism with D-dimer and computed tomography
Pulmonary embolism (PE) is a relatively common and potentially lethal illness that is difficult to diagnose clinically. In addition to the history and physical examination to assess the pretest probability of disease, clinicians today rely on the serum D-dimer test and PE protocol computed tomography (PE CT), and these are often used in series to determine the diagnosis. In patients with a sufficiently low pretest probability, the D-dimer test is performed first, and if it is positive, then the PE CT is performed to diagnose PE. This approach is used in lieu of performing a PE CT on all of these patients to minimize the radiation exposure, time, and expense of scanning. When faced with a patient at risk for PE in whom these 2 approaches are contemplated, but before the knowledge of any test outcomes, how can the clinician compare the average expected benefit?
Consider a theoretical study of 10 000 patients with a 5% pretest probability of PE (500 patients have PE) (Fig. 2). Assume that a criterion standard conventional angiogram is available to definitively diagnose all patients. Assume that a quantitative serum D-dimer test is used with a sensitivity of 94% and a specificity of 55%, and PE CT is used with a sensitivity of 90% and a specificity of 95% when applied to patients with a positive D-dimer [4-6]. Thus, in this study, the LR+ and LR- for serum D-dimer are 2.1 and 1/9.2, respectively, and for PE CT, they are 18.0 and 1/9.5, respectively. If the D-dimer test is performed first on all 10 000 patients, then 5255 test negative, and 30 of these actually have PE. All of the 4745 who test positive for D-dimer undergo a PE CT, and 4108 test negative, including 47 who actually have PE. Six hundred thirty-seven patients test positive for both D-dimer and PE CT and are diagnosed as having PE. Four hundred twenty-three of these do and 214 do not have a PE. The AALR for this test algorithm is as follows: (1/10 000)[(5255)(9.2) + (4108)(4.5) + (637)(37.6)] = 9.1. The AALR for PE CT performed alone on all 10 000 patients is as follows: (1/10 000)[(9075)(9.5) + (925)(18.0)] = 10.3 (Table 1); if the PE CT and D-dimer tests are independent without interaction, the PE CT interval LRs are the same whether or not the D-dimer test is performed first.
Fig. 2 D-dimer and PE CT algorithm for diagnosing PE.
Application of the D-dimer/PE CT test algorithm will change the odds of disease by a factor of 9.1 on average. This value is substantial but less than the value of 10.3 for PE CT alone. The benefit of the 2-step D-dimer/PE CT approach is avoiding PE CT in 5255 (53%) of patients, and thus, it is widely used.
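The algorithm's AALR can be reproduced from the expected cell counts in the Fig. 2 example, using fractional counts before rounding (the helper function is illustrative, not from the article):

```python
def aalr(cells):
    """AALR from joint result counts: cells is a list of
    (n_diseased, n_nondiseased) pairs, one per test result category."""
    d_tot = sum(d for d, n in cells)
    n_tot = sum(n for d, n in cells)
    total = 0.0
    for d, n in cells:
        lr = (d / d_tot) / (n / n_tot)      # interval (or joint) LR
        total += (d + n) * max(lr, 1 / lr)  # invert when LR < 1
    return total / (d_tot + n_tot)

# 10 000 patients, 5% pretest probability of PE (fractional expected counts)
two_step = aalr([(30, 5225),        # D-dimer negative
                 (47, 4061.25),     # D-dimer positive, PE CT negative
                 (423, 213.75)])    # D-dimer positive, PE CT positive
ct_alone = aalr([(50, 9025),        # PE CT negative
                 (450, 475)])       # PE CT positive
```

Rounded to one decimal place, these reproduce the 9.1 and 10.3 values discussed in the text.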
Comparing the AALR provides one simple perspective from which to compare the average expected benefit of the 2 test approaches for a prospective patient before any test results are known. A conventional analysis might be to calculate the ROC AUC for the 2-step D-dimer/PE CT and sole PE CT approaches; these are 0.91 and 0.93, respectively. A disadvantage of this analytic approach is the less tangible clinical meaning of the AUC when applied to patient care. An alternative would be to provide the 2-step approach LR+ of 37.6 and LR- of 1/6.3, where a combined positive D-dimer and negative PE CT result is considered a negative test result, and compare these LRs to the PE CT LRs provided above. However, it becomes more difficult to compare these 4 values (the LR+ favors the 2-step approach, and the LR- favors PE CT alone), particularly when they apply to different proportions of patients with positive and negative test results using the 2 different diagnostic approaches. Finally, one could simply note that the 2-step approach yields 77 false-negative tests in patients with PE, whereas the sole PE CT approach would yield only 50. This, however, provides no information about most patients without disease. Table 1 demonstrates how the AALRs vary with different possible pretest probabilities of disease for the combined algorithm and for PE CT alone. The individual D-dimer, PE CT, and combined algorithm LRs do not vary with pretest probability. False-negative and true-negative categories for the combined algorithm include patients in whom the D-dimer is negative, or the D-dimer is positive and the PE CT is negative. Note that the combined approach has a greater AALR than PE CT alone when the pretest probability of PE reaches 20%. This is because of the high sensitivity of the D-dimer test.
Average absolute LR clinical application example: ECG diagnosis of wide QRS complex tachycardia
Regular wide QRS complex tachycardia is an uncommon but dangerous dysrhythmic presentation in the ED. Diagnosing the underlying mechanism has important therapeutic and prognostic implications. The first steps in diagnosing stable undifferentiated regular wide QRS complex tachycardia include taking a history and physical examination and analyzing the ECG. A number of ECG characteristics may be used to differentiate the 2 major differential diagnoses, supraventricular tachycardia (SVT) and ventricular tachycardia (VT). Two automated ECG measurements that previous data suggest to be useful include the QRS duration (width) and the axis of the major QRS deflection in the limb leads [7,8], although the optimal application of these tests may also depend on QRS morphology [9].

Table 1  Two diagnostic testing approaches for PE as a function of pretest probability for 10 000 patients

Pretest probability   D-dimer/PE CT combined algorithm ^{a}    PE CT alone ^{b}
(odds) of PE          TP    FN    FP    TN    AALR             TP    FN    FP    TN    AALR
0.5% (1/199)          42    8     224   9726  7.9              45    5     497   9453  10.0
1% (1/99)             85    15    223   9677  8.2              90    10    495   9405  10.0
5% (1/19)             423   77    214   9286  9.1              450   50    475   9025  10.3
10% (1/9)             846   154   202   8798  10.3             900   100   450   8550  10.6
20% (1/4)             1692  308   180   7820  14.0             1800  200   400   7600  11.4

TP, true positive; FN, false negative; FP, false positive; TN, true negative.
^{a} The LR+ and LR- for the combined D-dimer/PE CT test algorithm remain fixed at 37.6 and 1/6.3, respectively.
^{b} The LR+ and LR- for the PE CT test alone remain fixed at 18.0 and 1/9.5, respectively.
^{c} Fractional 2 x 2 table values rounded to the nearest whole number.
Two institutional review board-approved retrospective cohort studies were recently performed, investigating the use of adenosine administration for undifferentiated regular wide QRS complex tachycardia [10] and the efficacy of amiodarone or procainamide infusion for the termination of sustained stable VT [11]. These studies described consecutive patients who presented with regular sustained wide QRS complex tachycardia.
In this clinical example demonstrating the use of the AALR, one wide QRS complex ECG from each unique patient from these 2 studies was included. All patients whose diagnosis required blinded assessment of ECG characteristics by an electrophysiologist were excluded. Most diagnoses were based on tachydysrhythmia reproduction in the electrophysiology laboratory, diagnostic response to adenosine, or clear evidence of atrioventricular dissociation as listed in the source articles. The goal was to define the diagnostic utility of the QRS duration and the QRS axis, separately and in combination, to determine the underlying rhythm diagnosis of SVT or VT. No power calculation was performed, although a technique to do so for the LR has been described [12].
An ROC curve was constructed to determine empirically the optimal intervals for QRS duration to distinguish SVT from VT. No distinction was made for patients already taking antidysrhythmic agents, which could lengthen the QRS duration. No distinction was made for right, left, or undefined bundle-branch block morphology. Previous data have suggested that a bizarre upward and rightward (180°-270°) QRS axis is associated with VT. Axis was dichotomized to this quadrant or all others.
Fig. 3 QRS interval and wide QRS tachycardia diagnosis (frequency histograms of QRS interval in milliseconds for patients with SVT and with VT).
Likelihood ratios and the AALR for identifying VT were determined for QRS duration and QRS axis individually and for the 2 tests combined. For each statistic, open source R software (version 2.12), along with the freely downloadable boot package, was used to bootstrap 10 000 replicates by sampling with replacement, and the bias-corrected and accelerated (BCa) percentile method was used to estimate the 95% CI. SPSS version 17 (IBM Corp, Somers, NY) was used to compute the ROC AUC.

Table 2  Patient numbers based on test results and underlying rhythm

1A: QRS duration (ms) and rhythm
        QRS <= 130   130 < QRS < 160   QRS >= 160
SVT     48           36                11
VT      15           28                49

1B: QRS axis (deg) and rhythm
        -90 < axis < 181   180 < axis < 271
SVT     86                 9
VT      69                 23

1C: QRS duration (ms), QRS axis (deg), and rhythm
                           QRS <= 130   130 < QRS < 160   QRS >= 160
SVT   -90 < axis < 181     46           31                9
      180 < axis < 271     2            5                 2
VT    -90 < axis < 181     13           21                35
      180 < axis < 271     2            7                 14
One hundred eighty-seven patients were included: 95 with SVT and 92 with VT. Eight (8%) and 23 (25%) patients with SVT and VT, respectively, were taking oral antidysrhythmic medicines. Inspection of the QRS width data (Fig. 3) and the associated ROC curve suggested that the optimal intervals in milliseconds for distinguishing VT from SVT were as follows: QRS <= 130, 130 < QRS < 160, and QRS >= 160. Table 2 contains the data for the 3 tests: QRS duration, QRS axis, and the 2 tests combined. Both LR and AALR results are listed in Table 3, and joint LRs for the tests combined are depicted graphically using a semilog plot that provides symmetry above and below the value 1 on the vertical axis (Fig. 4). The ROC AUC for each of the 2 tests was 0.75 (95% CI, 0.68-0.82) and 0.58 (95% CI, 0.50-0.66) for QRS width and axis, respectively.
Fig. 4 Joint interval LRs for VT as a function of QRS duration (msec) and QRS axis (deg), plotted on a log scale.

Table 3  LRs and AALRs for QRS duration, QRS axis, and the combined tests

2A: LRs and AALR for QRS duration (ms)
               QRS <= 130      130 < QRS < 160   QRS >= 160
LR (95% CI)    0.3 (0.2-0.5)   0.8 (0.5-1.2)     4.6 (2.8-9.7)
AALR (95% CI)  3.0 (2.1-4.4)

2B: LRs and AALR for QRS axis (deg)
               -90 < axis < 181   180 < axis < 271
LR (95% CI)    0.8 (0.7-1.0)      2.6 (1.3-6.2)
AALR (95% CI)  1.5 (1.2-2.2)

2C: LRs (95% CI) and AALR for combined QRS duration (ms) and axis (deg)
                    QRS <= 130         130 < QRS < 160      QRS >= 160
-90 < axis < 181    0.3 (0.2-0.5)      0.7 (0.4-1.1)        4 (2.3-9.8)
180 < axis < 271    1 (0.1-4.3) ^{a}   1.4 (0.4-6.2) ^{a}   7.2 (2.3-37.0) ^{a}
AALR (95% CI)       3.5 (2.4-6.2)

^{a} These joint LR intervals included sensitivity or specificity estimates close to zero. They required replacement of zero in the bootstrap samples with the lowest population value whose upper 95% CI was the estimated sensitivity or specificity using the binomial sampling procedure described in the text.

QRS duration may be more predictive than QRS axis, but neither test has tremendous ability to distinguish VT from SVT based on their AALRs and ROC AUCs. Without the benefit of the AALR, however, it is difficult to determine whether and how much predictive improvement there is from using the 2 tests together. All patients with 130 < QRS < 160 demonstrate a gain, whereas some patients with a QRS of 130 or less demonstrate the same and others a complete loss of discriminatory ability when moving from the single QRS duration test to the combined QRS duration and axis tests. Similarly, some patients with a QRS of 160 or more gain, whereas others lose. The degree of benefit, if any, from combined testing depends on the distribution of test results, and it cannot be discerned from the LRs alone.
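The joint LR point estimates in Table 3 (part 2C) can be reproduced from the patient counts in Table 2 (part 1C); a sketch:

```python
def joint_lrs(svt, vt):
    """Joint interval LRs for VT: each cell's LR is the fraction of VT
    patients with that result pair divided by the fraction of SVT
    patients with it."""
    svt_tot = sum(sum(row) for row in svt)
    vt_tot = sum(sum(row) for row in vt)
    return [[(v / vt_tot) / (s / svt_tot) for s, v in zip(srow, vrow)]
            for srow, vrow in zip(svt, vt)]

# Table 2, part 1C: rows are axis intervals (-90 to 181, 180 to 271 deg);
# columns are QRS duration intervals (<=130, 130-160, >=160 ms)
svt = [[46, 31, 9],
       [2, 5, 2]]
vt = [[13, 21, 35],
      [2, 7, 14]]
lrs = joint_lrs(svt, vt)
```

Rounding each cell to one decimal place recovers the published point estimates (0.3, 0.7, 4; 1, 1.4, 7.2).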
Conversely, the AALR succinctly summarizes the discriminatory benefit of the combined testing strategy as compared with the individual tests. Although there is some overlap of the QRS duration and combined test AALR CIs, using the 2 tests in a combined fashion may have some benefit, with the AALR improving from 3.0 to 3.5.
It is interesting to note that even for 2 independent tests, the combined AALR will usually be less than the product of the individual AALRs. This contraction of the AALR is due to incongruent test results (one test positive and the other negative). As additional tests are performed, the incremental average benefit may decrease due to increasingly complex incongruent test results. A summary of the AALR as compared with other diagnostic test assessment techniques is provided in Table 4.
Limitations of the AALR
The AALR depends on the LR for each test result or combination of test results for multiple tests and the distribution of test results in the sample studied. Unlike the individual LRs, the distribution of test results depends on
the pretest probability of disease. A sample with a high pretest probability of disease would tend to have a higher proportion of patients with a positive test result and positive LR. This dependence of the AALR on the pretest probability of disease may limit its generalizability to populations that differ from the derivation population.
For tests with interval data, as with the LRs, the AALR may vary depending on the number and cutoffs of the LR intervals chosen. However, the AALR can be used to assist the process of determining the optimal LR intervals for clinical use.
As the number of tests and test result combinations increases, there is a rapid increase in the sample size required to reasonably populate each possible test result combination. Each unique set of test results is a silo. The results from all other test result combinations provide limited benefit when calculating the LR for a particular combination. This is in contrast to regression modeling, where the totality of data from all test results in the sample is used to calculate the odds ratio for each test.
If the sample estimate for a test includes a sensitivity or specificity of 1 or 0, then the LR or inverse LR may be zero or infinity. This becomes problematic when calculating the LRs, AALR, and their CIs. Approaches can be taken to use a less extreme, and likely more realistic, estimate for the sensitivity or specificity.
Average absolute LR CI
For any given test or combination of tests, the 95% CI of the LR can be estimated using a number of closed formulas, and the log method is commonly used [12,13]. Bootstrap resampling with replacement methods can also be used [14,15]. Bootstrap resampling means to draw repeated samples, with replacement, from the original data set.
Table 4  Comparison of analytic techniques for diagnostic test assessment (for each technique: use, strength, limitation, and potential use)

Logistic regression modeling
  Use: Comparing the odds of disease as a function of test outcome.
  Strength: Allows for adjustment for confounding factors. Inclusion of all relevant predictors tends to decrease the remaining uncertainty in the model and narrow the associated CIs.
  Limitation: Does not span pretest to posttest odds. May assume linearity of predictors.
  Potential use: Assessment of diagnostic tests, particularly when adjustment for confounding factors and interactions is necessary. Assessment of incremental test benefit after adjustment for other tests.

LRs: binary
  Use: Dichotomous (positive/negative) test results.
  Strength: Allows for the Bayesian adjustment of the pretest odds of disease. Mirrors clinician thought process. Independent of disease prevalence.
  Limitation: Does not adjust for confounders.
  Potential use: Assesses the utility or informativeness of a binary test.

LRs: interval
  Use: Finer use of interval or ordinal test data by allowing >2 possible test outcomes.
  Strength: For interval or ordinal data, provides better delineation of the truly useful test result ranges.
  Limitation: The CI for each interval LR tends to widen as the number of patients within incrementally smaller intervals of interest decreases.
  Potential use: Assesses the utility of a test with >2 test outcomes of interest.

LRs: joint
  Use: Allows for the combined results of >=2 diagnostic tests.
  Strength: Allows for assessment of interaction between 2 tests.
  Limitation: Even finer partition of data requires a larger sample size to avoid wide CIs.
  Potential use: Determines the utility of multiple tests used together to assess for disease.

ROC curve
  Use: Allows visualization of the LR slopes and intervals, and the general AUC.
  Strength: Graphic depiction of numerical data.
  Limitation: It is often graphically difficult to assess CIs or >1 test simultaneously.
  Potential use: Qualitative assessment of diagnostic test use.

ROC AUC
  Use: Quantitative measure of diagnostic test utility.
  Strength: It can be calculated along with its 95% CI and used to compare diagnostic tests. Independent of disease prevalence.
  Limitation: It is not used clinically to adjust the pretest odds of disease. It is not generally used to assess multiple combined tests.
  Potential use: Used to compare the informativeness between >=1 alternative diagnostic tests.

AALR
  Use: Quantitative measure of diagnostic test use as applied clinically with the Bayesian approach.
  Strength: Mirrors clinician thought process and provides average benefit prior to knowledge of the test result. Can be used to compare single or combined testing algorithms.
  Limitation: Test result is a function of sample disease prevalence. Larger sample required for multiple possible test outcomes to avoid wide CIs.
  Potential use: Assesses and compares the expected average change in the odds of disease for single and combined test algorithms.
random samples from a theoretical population, with each sampled element replaced in the population before the next draw. The collection of samples withdrawn is the object of interest; characteristics such as the distribution of the resampled means can be used to estimate CIs. The 95% CI for the AALR cannot easily be described with a closed formula or an extension of the LR log method.
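For comparison with the bootstrap, the closed-formula log method mentioned above can be sketched as follows, here applied to the QRS > 160 ms counts from the worked example arranged as a 2x2 table; the variable names are illustrative only.

```r
# Log-method 95% CI for a positive LR (closed formula)
tp <- 49; fn <- 43  # patients with VT: test positive / test negative
fp <- 11; tn <- 84  # patients without VT: test positive / test negative

lr_pos <- (tp / (tp + fn)) / (fp / (fp + tn))            # sens / (1 - spec)
se_log <- sqrt(1/tp - 1/(tp + fn) + 1/fp - 1/(fp + tn))  # SE of ln(LR+)
ci <- exp(log(lr_pos) + c(-1.96, 1.96) * se_log)
round(lr_pos, 2)  # 4.60
round(ci, 2)      # roughly 2.6 to 8.3
```

The CI is computed on the log scale and exponentiated back, which is why it is asymmetric about the point estimate.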
The bootstrapping approach for estimating the 95% CI of the AALR is complex but feasible. When bootstrapping is performed to estimate the 95% CI of the LR, there is independent uncertainty in the sample estimates of the test sensitivity and specificity, and the numbers of patients with and without disease often differ. Sensitivity and specificity must often be resampled separately, and the results must then be combined.
The AALR adds an additional layer of uncertainty corresponding to the distribution of test results. The distribution of patients with disease and the associated distribution of patients with various test results will vary across different samples of the larger patient population. The test result distribution uncertainty must also be resampled. When determining the AALR CI, the lowest possible value for any LR in a bootstrap sample should be 1. All values between 0 and 1 should be inverted. An example of code written in R to bootstrap the AALR for the QRS duration test in the example above is provided (Appendix A).
Perhaps the greatest challenge when estimating the 95% CI of the LR or AALR occurs when the estimated sensitivity or specificity is close to or equal to 0 or 1, and the study sample size is relatively small. As the specificity of a test approaches 1, the denominator of the positive LR, 1 – specificity, approaches zero, and the LR+ approaches infinity. Similarly, as the sensitivity approaches zero, the inverse of the LR+ approaches infinity. When closed formulas are used to calculate the LR CI and some cells contain zero, 0.5 is sometimes empirically added to all cells to eliminate the infinite values [12,16].
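A minimal sketch of that continuity correction, using hypothetical counts in which no patient without disease tests positive, is shown below; the counts are illustrative only.

```r
# Continuity correction: add 0.5 to every cell when a zero cell would
# otherwise make the LR (or its logarithm) infinite
tp <- 18; fn <- 2; fp <- 0; tn <- 30  # hypothetical 2x2 with specificity = 1
if (any(c(tp, fn, fp, tn) == 0)) {
  tp <- tp + 0.5; fn <- fn + 0.5; fp <- fp + 0.5; tn <- tn + 0.5
}
lr_pos <- (tp / (tp + fn)) / (fp / (fp + tn))  # finite after correction
lr_pos
```

The corrected LR+ is large but finite, which allows the log-method CI to be computed where the uncorrected table would yield division by zero.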
For small study sample sizes with sample sensitivity and specificity estimates near the edges of 0 or 1, bootstrap resampling may yield many bootstrap samples with estimates of infinity, and this may skew the distribution of the bootstrap estimates. In part, this is an artifact due to the coarseness of the bootstrap samples induced by a relatively small sample size. The possibilities of fine gradations of specificity and sensitivity approaching, but not equaling, 1 or 0 are unavailable.
Estimating a CI from this distribution can be problematic and requires some assumption and approximation. A number of approaches can be used to mitigate the infinite values that may result when a bootstrap sample has a specificity of 1 or a sensitivity of 0. A slight offset can be introduced routinely to remove actual infinite values, for example, by replacing 1 with 1.000001 in the expression for the LR. When the estimated sensitivity or specificity is close to the boundary of 0 or 1, many bootstrap samples may take a value of exactly 0 or 1. For example, if 29 of 30 patients without disease test negative, then the specificity estimate will be close to 1, and many bootstrap samples will have a specificity of exactly 1. It makes little sense to include these in the CI because one person without disease has already tested positive, so the test cannot have perfect specificity; furthermore, these values may greatly distort the CI results. Using random binomial sampling, the highest population specificity whose lower 95% CI bound is 29/30 (0.967) for a sample size of 30 can be determined; it is 0.999. An example of this computation is provided in Appendix B. This population specificity value can be substituted for 1 in the bootstrap samples, and it may then replace 1 as the upper 95% CI bound for the sample of interest.
Conclusion
Likelihood ratios allow the incorporation of the Bayesian approach to the assessment of diagnostic test utility, which is
similar to the thought process commonly used by clinicians. However, a test or combination of tests may sometimes be less helpful than the LRs suggest. The AALR provides an estimate of the total average change in the odds of disease as a result of the use of a diagnostic test or combination of tests. The AALR can be used as a single valuable statistic to evaluate and compare different test algorithms. The 95% CI of the AALR can be estimated by the bootstrapping technique, and a method is provided.
Future directions include further development and standardization of methods to address extreme values when bootstrapping LR and AALR CIs, as well as the possible development of a power calculation to determine the sample size required to demonstrate a result above a defined AALR threshold. Finally, as opposed to multiplying LRs for 2 independent diagnostic tests, the progressive contraction expected in the AALR when combining 2 or more independent or interacting tests may be further defined.
Acknowledgments
The author gratefully acknowledges the thoughtful conversations with Dr Brian K. Nelson that inspired this work, and the encouragement and support of Mrs Lori B. Glazier, Loretta Benkert, and Dr James Ware.
References
[1] Fagan TJ. Letter: nomogram for Bayes theorem. N Engl J Med 1975;293:257.
[2] Brown MD, Reeves MJ. Evidence-based emergency medicine/skills for evidence-based emergency care. Interval likelihood ratios: another advantage for the evidence-based diagnostician. Ann Emerg Med 2003;42:298-303.
[3] Liteplo AS, Marill KA, Villen T, et al. Emergency thoracic ultrasound in the differentiation of the etiology of shortness of breath (ETUDES): sonographic B-lines and N-terminal pro-brain-type natriuretic peptide in diagnosing congestive heart failure. Acad Emerg Med 2009;16:201-10.
[4] Courtney DM. Pulmonary embolism. In: Adams JG, Barton ED, Collings J, DeBlieux PMC, Gisondi MA, Nadel ES, editors. Emergency medicine. Philadelphia, PA: Saunders Elsevier; 2008. p. 715-21.
[5] Thompson BT, Hales CA. Diagnosis of acute pulmonary embolism. In: www.uptodate.com. Alphen aan den Rijn, the Netherlands: Wolters Kluwer Health. Accessed 3/18/12.
[6] Stein PD, Fowler SE, Goodman LR, et al. Multidetector computed tomography for acute pulmonary embolism. N Engl J Med 2006;354:2317-27.
[7] Wellens HJ, Bar FW, Lie KI. The value of the electrocardiogram in the differential diagnosis of a tachycardia with a widened QRS complex. Am J Med 1978;64:27-33.
[8] Akhtar M, Shenasa M, Jazayeri M, et al. Wide QRS complex tachycardia. Ann Intern Med 1988;109:905-12.
[9] Wellens HJ. Ventricular tachycardia: diagnosis of broad QRS complex tachycardia. Heart 2001;86:579-85.
[10] Marill KA, Wolfram S, deSouza IS, et al. Adenosine for wide-complex tachycardia: efficacy and safety. Crit Care Med 2009;37:2512-8.
[11] Marill KA, deSouza IS, Nishijima DK, et al. Amiodarone or procainamide for the termination of sustained stable ventricular tachycardia: an historical multicenter comparison. Acad Emerg Med 2010;17:297-306.
[12] Simel DL, Samsa GP, Matchar DB. Likelihood ratios with confidence: sample size estimation for diagnostic test studies. J Clin Epidemiol 1991;44:763-70.
[13] Altman DG. Diagnostic tests. In: Altman DG, Machin D, Bryant TN, Gardner MJ, editors. Statistics with confidence. Oxford, UK: BMJ Books, Blackwell Publishing; 2000. p. 108-10.
[14] Efron B, Tibshirani RJ. An introduction to the bootstrap. Boca Raton (Fla): Chapman & Hall/CRC; 1993.
[15] Haukoos JS, Lewis RJ. Advanced statistics: bootstrapping confidence intervals for statistics with "difficult" distributions. Acad Emerg Med 2005;12:360-5.
[16] Haldane JBS. The estimation and significance of the logarithm of a ratio of frequencies. Ann Hum Genet 1956;20:309-11.
Appendix A
# Calculating likelihood ratios and the AALR for the
# diagnosis of VT in patients with wide QRS
# tachycardia using QRS duration
# Lines beginning with # are treated as explanatory
# comments in R, not code
library(boot)
# QRS duration test: first find 3 interval likelihood ratios
# Durations: long (QRS>160), medium (130<=QRS<160), short (QRS<130)
# QRS>160
# Bootstrap sensitivity, specificity, and construct LR bootstrap
sensl <- rep(1:0, c(49,43))
specl <- rep(1:0, c(84,11))
senslf <- function(sensl, i) {return(mean(sensl[i] + .000001))}
senslb <- boot(sensl, senslf, R=10000)
speclf <- function(specl, i) {return(1/(1.000001 - mean(specl[i])))}
speclb <- boot(specl, speclf, R=10000)
lrl <- (senslb$t * speclb$t)
# Analyze LR bootstrap finding the median, and the percentile and
# BCa 95% CIs
# To obtain a BCa CI on a non-boot result, use a dummy boot
# and replace t and t0 with the results of interest.
dummy <- rep(1:0, c(6,4))
dummyf <- function(dummy, i) {return(mean(dummy[i]))}
dummyb <- boot(dummy, dummyf, R=10000)
m <- list()  # holds the substituted t and t0
m$t <- matrix(lrl, nrow=10000, byrow=T)
m$t0 <- 4.60
median(lrl)
boot.ci(dummyb, t0=m$t0, t=m$t, type=c("perc", "bca"))
# 130<=QRS<160
sensm <- rep(1:0, c(28,64))
specm <- rep(1:0, c(59,36))
sensmf <- function(sensm, i) {return(mean(sensm[i] + .000001))}
sensmb <- boot(sensm, sensmf, R=10000)
specmf <- function(specm, i) {return(1/(1.000001 - mean(specm[i])))}
specmb <- boot(specm, specmf, R=10000)
lrm <- (sensmb$t * specmb$t)
# Replace the elements t0 and t of the dummy boot, dummyb, as above.
m$t <- matrix(lrm, nrow=10000, byrow=T)
m$t0 <- 0.80
median(lrm)
boot.ci(dummyb, t0=m$t0, t=m$t, type=c("perc", "bca"))
# QRS<130
senssh <- rep(1:0, c(15,77))
specsh <- rep(1:0, c(47,48))
sensshf <- function(senssh, i) {return(mean(senssh[i] + .000001))}
sensshb <- boot(senssh, sensshf, R=10000)
specshf <- function(specsh, i) {return(1/(1.000001 - mean(specsh[i])))}
specshb <- boot(specsh, specshf, R=10000)
lrsh <- (sensshb$t * specshb$t)
m$t <- matrix(lrsh, nrow=10000, byrow=T)
m$t0 <- 0.32
median(lrsh)
boot.ci(dummyb, t0=m$t0, t=m$t, type=c("perc", "bca"))
# Calculate the AALR and associated 95% CI
# First bootstrap the distribution of test results
sh <- rep(1:0, c(63,124))
t <- c(0,1,0)
med <- rep(t, c(63,64,60))  # medium-duration indicator; named med so the list m above is not overwritten
l <- rep(0:1, c(127,60))
f <- data.frame(sh, med, l)
ff <- function(f, i) {return(c(mean(f$sh[i]), mean(f$med[i]), mean(f$l[i])))}
fb <- boot(f, ff, R=10000)
colnames(fb$t) <- c("sh", "med", "l")
dist <- fb$t
# All LRs that are less than 1 in the boot samples must be
# raised to the -1 power (inverted) to yield the CI for the
# AALR (1 is the lowest LR estimate possible for the AALR)
for (i in 1:10000) {
  if (lrsh[i] < 1) lrsh[i] <- (1/lrsh[i])
  if (lrm[i] < 1) lrm[i] <- (1/lrm[i])
  if (lrl[i] < 1) lrl[i] <- (1/lrl[i])
}
aalr <- (dist[,1] * lrsh) + (dist[,2] * lrm) + (dist[,3] * lrl)
m$t <- matrix(aalr, nrow=10000, byrow=T)
m$t0 <- 2.96
median(aalr)
boot.ci(dummyb, t0=m$t0, t=m$t, type=c("perc", "bca"))
Appendix B
# Producing the 95% CI for a random binomial sample of
# size 30 with probability of one set at 0.999.
p <- rbinom(1000, size=30, prob=.999)/30
q <- quantile(p, c(.025,.975))
q