Can medical record reviewers reliably identify errors and adverse events in the ED?
Abstract
Background: Chart review has been the mainstay of medical quality assurance practices since its introduction more than a century ago. The validity of chart review, however, has been vitiated by a lack of methodological rigor.
Objectives: By measuring the degree of interrater agreement among a 13-member review board of emergency physicians, we sought to validate the reliability of a chart review-based quality assurance process using computerized screening based on explicit case parameters.
Methods: All patients presenting to an urban, tertiary care academic medical center emergency department (annual volume of 57,000 patients) between November 2012 and November 2013 were screened electronically. Cases were programmatically flagged for review according to explicit criteria: return within 72 hours, procedural evaluation, floor-to-ICU transfer within 24 hours of admission, death within 24 hours of admission, physician complaints, and patient complaints. Each case was reviewed independently by a 13-member emergency department quality assurance committee, all of whom were board certified in emergency medicine and trained in the use of the tool. None of the reviewers was involved in the care of the patients whose charts they reviewed. Reviewers used a previously validated 8-point Likert scale to rate the (1) coordination of patient care, (2) presence and severity of adverse events, (3) degree of medical error, and (4) quality of medical judgment. Agreement among reviewers was assessed with the intraclass correlation coefficient (ICC) for each parameter.
Results: Agreement and the degree of significance for each parameter were as follows: coordination of patient care (ICC = 0.67; P < .001), presence and severity of adverse events (ICC = 0.52; P = .001), degree of medical error (ICC = 0.72; P < .001), and quality of medical judgment (ICC = 0.67; P < .001).
Conclusion: Agreement in the chart review process can be achieved among physician-reviewers. The degree of agreement attainable is comparable to or superior to that of similar studies reported to date. These results highlight the potential for the use of computerized screening, explicit criteria, and training of expert reviewers to improve the reliability and validity of chart review-based quality assurance.
Background
Chart review has been the mainstay of medical quality assurance activities since its introduction by Codman [1] more than a century ago at the Massachusetts General Hospital. The validity of chart review, however, has been vitiated by a lack of methodological rigor [2].
Commonly cited weaknesses include unclear inclusion criteria, unsystematic case identification, implicit methods, subjective standards, inadequate reviewer training, lack of internal consistency, and conflation of correlation and causation (ie, the imprecise use of scientific terms such as dependence, association, and correlation) [3-6]. Reviews have also suffered from confusion between assessments of process (ie, physician performance) and outcome (ie, results of care) [7]. Remediation of these weaknesses began in the 1960s with the work of Donabedian [8], who differentiated between assessments of process and assessments of outcome.
We sought to address these issues through the use of explicit case definitions and programmatic case identification within a robust, protocol-driven quality assurance (QA) process in which all cases were reviewed by a 13-member board of board-certified emergency physicians.
Methods
Study design, goals, and oversight
The study sample was a prospective cohort comprising all patients presenting to a tertiary care academic emergency department (ED) between November 2012 and November 2013. The ED has an annual census of 57,000 patients. The study hospital's institutional review board waived jurisdiction over the study.
To assess the degree of agreement among reviewers engaged in chart review, we used 6 predefined high-risk conditions that are commonly used in QA processes: return to the ED within 72 hours, procedures performed in the ED (eg, intubation, tube thoracostomy), transfer from floor to ICU within 24 hours of admission, death within 24 hours of admission, complaints from physicians outside of the ED, and complaints from patients.
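For illustration only, the sketch below shows how explicit flag criteria of this kind might be encoded for programmatic screening. It is not the authors' dashboard or health information system; the class, function, and field names (eg, return_within_72h, floor_to_icu_within_24h) are hypothetical.

```python
# Hedged sketch: explicit QA flag criteria expressed as predicates over a
# hypothetical ED visit record. All names are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EDVisit:
    visit_id: str
    return_within_72h: bool
    ed_procedure_performed: bool
    floor_to_icu_within_24h: bool
    death_within_24h: bool
    physician_complaint: bool
    patient_complaint: bool


# One predicate per predefined high-risk criterion.
FLAG_CRITERIA: Dict[str, Callable[[EDVisit], bool]] = {
    "Return within 72 h": lambda v: v.return_within_72h,
    "Procedural evaluation": lambda v: v.ed_procedure_performed,
    "Floor-to-ICU within 24 h": lambda v: v.floor_to_icu_within_24h,
    "Death within 24 h": lambda v: v.death_within_24h,
    "Physician complaint": lambda v: v.physician_complaint,
    "Patient complaint": lambda v: v.patient_complaint,
}


def flag_for_review(visits: List[EDVisit]) -> Dict[str, List[str]]:
    """Return, for each visit that meets 1 or more criteria, the reasons for review."""
    flagged: Dict[str, List[str]] = {}
    for v in visits:
        reasons = [name for name, test in FLAG_CRITERIA.items() if test(v)]
        if reasons:
            flagged[v.visit_id] = reasons
    return flagged
```

The design point is that each criterion is explicit and checkable by code, rather than left to implicit reviewer judgment.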
The ED QA committee provided oversight. The QA committee is integrated into the medical center's overall QA operations through formal processes and procedures as described previously [9].
Selection of participants
An electronic medical record was created for each patient. The database of all electronic medical records was searched for cases meeting 1 or more of the above 6 predefined high-risk criteria using an electronic QA dashboard that interfaced with a commercially available health information system [10]. Cases meeting criteria were flagged for individual review by each of the 13 QA committee members. The QA committee reviewers were all board-certified, attending-level physicians. A case identification flowchart can be found in Fig. 1. The QA scorecard review instrument can be found in Fig. 2 and Fig. 3.

Fig. 1. Case identification flowchart.
Data collection and processing
Thirteen board-certified emergency physician-reviewers reviewed each case independently. Each of the physician-reviewers was trained in the use of the rating instrument. None were involved in the care of the patients whose cases were reviewed. Each case was scored according to an 8-point Likert scale (the QA Scorecard) (Fig. 2) to determine whether (1) errors were made by the ED team, (2) adverse events occurred, (3) medical judgment of the ED team was adequate, and (4) care was coordinated appropriately (ie, an assessment of the adequacy of communication among the ED team members, resident oversight, and ED length of stay in relation to complexity of the case).
Provision was made for free-text comments by the reviewers. Each case was adjudicated in a manner consistent with our previous work [11]. Confidence intervals, P values, and ICCs were generated using Microsoft Excel 2010.
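For illustration only (this is not the authors' QA Scorecard software; the class, field names, and validation logic are assumptions), a single reviewer's ratings for one case could be captured in a small validated structure such as the following:

```python
# Hedged sketch of one reviewer's scorecard entry for a single case.
# The four rated domains and the 1-8 Likert range come from the Methods;
# everything else is an illustrative assumption.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScorecardEntry:
    case_id: str
    reviewer_id: str
    medical_error: int             # 1-8 Likert rating
    adverse_event: int             # 1-8 Likert rating
    medical_judgment: int          # 1-8 Likert rating
    coordination_of_care: int      # 1-8 Likert rating
    comment: Optional[str] = None  # optional free-text comment

    def __post_init__(self) -> None:
        # Reject ratings that fall outside the 8-point scale.
        for name in ("medical_error", "adverse_event",
                     "medical_judgment", "coordination_of_care"):
            score = getattr(self, name)
            if not 1 <= score <= 8:
                raise ValueError(f"{name} must be on the 8-point scale (1-8), got {score}")
```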
Agreement
Interrater reliability for the 13-member group was assessed for each criterion with an intraclass correlation coefficient (ICC), calculated according to the method of Fisher [12].
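The article does not publish its calculation code (statistics were generated in Microsoft Excel), so the following is only a minimal sketch of a one-way random-effects ICC in the spirit of Fisher's formulation, computed from a cases-by-raters matrix. The function name and simulated data are illustrative, and the exact ICC variant used by the authors is not specified beyond the citation to Fisher [12].

```python
# Hedged sketch: one-way random-effects intraclass correlation, ICC(1,1),
# computed from a cases x raters matrix of Likert scores.
import numpy as np


def icc_oneway(scores: np.ndarray) -> float:
    """scores: array of shape (n_cases, n_raters), e.g., (44, 13)."""
    n_cases, n_raters = scores.shape
    grand_mean = scores.mean()
    case_means = scores.mean(axis=1)

    # One-way ANOVA mean squares: between cases and within cases.
    ms_between = n_raters * np.sum((case_means - grand_mean) ** 2) / (n_cases - 1)
    ms_within = np.sum((scores - case_means[:, None]) ** 2) / (n_cases * (n_raters - 1))

    # Proportion of total variance attributable to differences between cases
    # rather than to rater disagreement or noise.
    return (ms_between - ms_within) / (ms_between + (n_raters - 1) * ms_within)


# Illustrative use with simulated ratings (44 cases, 13 raters, 8-point scale);
# real input would be the QA scorecard ratings, not random numbers.
rng = np.random.default_rng(0)
ratings = rng.integers(1, 9, size=(44, 13)).astype(float)
print(round(icc_oneway(ratings), 2))
```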
Results
Cases
Computer-generated random selection among the 926 cases flagged by computerized screening yielded 44 cases for analysis by the 13-member review board. These 44 cases were further classified by reason for review as return within 72 hours, procedural evaluation, floor-to-ICU within 24 hours, death within 24 hours, physician complaint, and patient complaint (Table 1).
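As a sketch only (the study describes its procedure simply as computer-generated random selection; the seed and identifier format here are hypothetical), drawing 44 of the 926 flagged cases could look like this:

```python
# Hedged sketch: reproducible random selection of 44 of 926 flagged cases.
import random

flagged_case_ids = [f"case-{i:04d}" for i in range(1, 927)]  # 926 flagged cases
rng = random.Random(42)  # fixed seed for reproducibility
review_sample = rng.sample(flagged_case_ids, k=44)
print(len(review_sample))  # 44 cases sent to the 13-member review board
```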
Agreement
Among the 13-member board, agreement was assessed for each of the 44 cases randomly selected for review using the ICC (Table 2). Results for agreement and the degree of significance for each parameter were as follows: "coordination of patient care" (0.67; P < .001), "degree of medical error" (0.72; P < .001), and "quality of medical judgment" (0.67; P < .001) (Table 2). The only parameter for which acceptable reliability was not attained was "presence and severity of adverse events" (0.52; P = .001).
Discussion
Our study demonstrates that high levels of agreement are achievable in the chart review process. As described below, this was accomplished through the incorporation of a number of techniques, each of which has been independently associated with improved agreement: explicit review criteria; independent focus on processes and outcomes; automated methods for case identification; a large reviewer group; trained, board-certified reviewers; and a QA process that had been fully integrated into routine departmental operations.

Fig. 3. Were there adverse event(s) resulting from the care of the ED team?

Score | Description | Performance level | QA response
1 | No adverse event occurred | No error/no harm | No reviewer feedback to team necessary, no QA committee review necessary
2 | An event may have occurred that had the capacity to cause injury, but did not reach patient | Near miss |
3 | An event occurred that may have reached the patient, but did not cause harm | | Reviewer gives feedback to team, but no QA review necessary
4 | Circumstances or events required additional monitoring or screening tests (eg, telemetry, serial physical examinations, or laboratory tests) but did not require additional treatment | Monitoring only | Discussion in QA committee with appropriate feedback and +/- remediation
5 | An event occurred that resulted in the need for treatment or intervention, and caused temporary patient harm/injury/need for additional treatment | Minor |
6 | An event occurred that resulted in initial (if outpatient) or prolonged hospitalization and caused temporary patient harm/injury/disease progression | Moderate |
7 | An event occurred that resulted in permanent patient harm/injury/disease progression | Major |
8 | An event directly contributed to death of patient (NB: do not check if patient death was unrelated to event) | Death |
Assessment of agreement
Assessment of agreement is among the most commonly cited weaknesses of QA studies [12,13]. This weakness has also been noted specifically in the emergency medicine literature [14,15]. We measured the degree of agreement among our 13-member committee of physician-reviewers using Fisher's ICC [12]. Although the establishment of cut points within a continuous variable to define ordinal categories (eg, good, fair, poor) necessitates value judgments, Osborne's [16] Best Practices in Quantitative Methods sets a value of greater than or equal to 0.60 for ICCs as the cutoff for acceptable reliability. By this rigorous standard, we demonstrated acceptable reliability for "coordination of patient care" (ICC = 0.67, P < .001), "degree of medical error" (ICC = 0.72, P < .001), and "quality of medical judgment" (ICC = 0.67, P < .001).
The only parameter for which acceptable reliability was not attained was "presence and severity of adverse events" (ICC = 0.52, P = .001).
According to the less stringent but oft-cited schema of Landis and Koch [17], we demonstrated moderate agreement for "presence and severity of adverse events" (ie, kappa = 0.41-0.60) and substantial agreement (ie, kappa = 0.61-0.80) for "coordination of patient care," "degree of medical error," and "quality of medical judgment." As discussed below, this degree of agreement is comparable to or superior to similar studies reported to date.

In previous studies, higher numbers of reviewers and more experienced reviewers have led to greater agreement. Brennan et al [18] demonstrated good agreement (kappa = 0.57) between senior physicians and physician-reviewers trained in the use of an explicit adverse event form for the presence of adverse events. This comports with our process and agrees well with our finding of an ICC of 0.52 for "presence and severity of adverse events." Although kappa and ICC are not synonymous, Fleiss and Cohen [19] consider the weighted kappa to be equivalent to the ICC.

In a later study of similarly trained reviewers, Brennan et al [20] found good agreement for causation of an adverse event (kappa = 0.57) and for negligence (kappa = 0.62). Our findings of ICC = 0.72 for "degree of medical error" in conjunction with our finding of ICC = 0.52 for "presence and severity of adverse events" seem similar in intent and magnitude.

Hayward et al [21] noted poorer levels of agreement among physician-reviewers for focused quality problems and resource utilization (kappa ≤ 0.2) than were observed in our study. Hayward et al measured agreement among pairs of reviewers and hypothesized that higher levels of agreement might have been achieved with larger numbers of reviewers. Thomas et al [22] found moderate to poor interrater reliability among 3 physicians for adverse events. Therefore, our use of 13 reviewers per case may partially explain our superior results. The implicit (ie, subjective) criteria of Hayward et al, in contrast to our explicit methods, may have also contributed to the difference in results. In further support of this notion, Hofer et al [23] noted an intermediate level of agreement (ICC = 0.16-0.46) with the use of structured implicit criteria, perhaps suggesting a dose-response phenomenon for methodological rigor.

Localio et al [24] observed frequent disagreement regarding the occurrence of adverse events. Such a finding might be expected, however, as that study used pairs of reviewers and implicit criteria. Localio et al also hypothesized that greater reviewer experience might have produced greater levels of agreement. Our use of experienced, trained reviewers may then have contributed to our superior results. The work of Allison et al [4] supports this contention. In their study, several rounds of training and refinement improved interrater reliability from 80% to 96%.
Methodology, general
Case review studies have come under generalized criticism for weak methodology. We were able to satisfy 7 of the 8 general methods recommended by Gilbert et al [2] for chart review: training, case selection, definition of variables, abstraction forms, meetings, monitoring, and testing of interrater agreement. The eighth recommendation, blinding, was not practicable in our QA process, the details of which were described in our previous work [9].

Table 1. Cases reviewed by QA flag criteria

Reason for review | Cases, total (#) | Cases, total (%) | Cases, reviewed (#) | Cases, reviewed (%)
Return within 72 h | 333 | 36.0% | 20 | 45.5%
Procedural evaluation | 267 | 28.8% | 14 | 31.8%
Floor-to-ICU within 24 h | 122 | 13.2% | 5 | 11.4%
Death within 24 h | 65 | 7.0% | 4 | 9.1%
Physician complaint | 131 | 14.1% | 1 | 2.3%
Patient complaint | 8 | 0.9% | 0 | 0.0%

Table 2. Interrater reliability of case review

QA category | ICC | 95% Confidence interval | P value
Coordination of patient care | 0.67 | 0.47-0.81 | <.001
Presence and severity of adverse events | 0.52 | 0.22-0.73 | .001
Degree of medical error | 0.72 | 0.55-0.94 | <.001
Quality of medical judgment | 0.67 | 0.46-0.81 | <.001
Methodology, specific
Implicit vs explicit: Limitations in the repeatability of subjective categorical scoring of hospital QA activities have been noted since the 1950s [25,26,21]. The superiority of explicit (ie, objective) rather than implicit (ie, subjective) methods was demonstrated by Brook and Appel [7] in the 1970s and was extended by Brennan et al [18] in the 1980s. We therefore restricted our purview to the explicit criteria (ie, return within 72 hours, procedural evaluation, floor-to-ICU within 24 hours, death within 24 hours, physician complaint, patient complaint) that we had validated in our previous work [9].
Process vs outcome: We adapted the methodology of Brook and Appel [7], defining processes as actions that the physician undertakes on behalf of the patient (eg, coordination of care, medical judgment) and defining outcomes as the results of care (eg, medical errors, adverse events).
Automated vs manual: Although automated techniques for the systematic identification of cases (initially through the use of machine-readable paper) have also been available since the 1950s, the methodology did not become widely practicable until the advent of computerized medical records in the 1990s [27]. Myers et al demonstrated the viability of automated case identification within a computerized medical record system with a limited set of Healthcare Effectiveness Data and Information Set measures in the Harvard Pilgrim system [28]. The benefits of such an approach with regard to interrater agreement have been validated by other investigators [29,30].
Groups vs individuals: A number of investigators have argued that, although adequate agreement can be achieved among physician-reviewers with regard to groups of patients, reliable judgments cannot be rendered with regard to individual patients [31,7]. In contrast, we found good agreement among reviewers with regard to the care of individual patients. These results may be a function of our use of explicit criteria, automated case identification, reviewer training, and large reviewer panels. This contention is supported by the work of several other investigators [24,22,27].
Limitations
The potential for outcome bias is the most significant limitation of our study. Outcome bias might have occurred secondary to the lack of blinding. Caplan et al [32] described an inverse relationship between the severity of outcome and the reviewers' impression of the quality of care. Among emergency physicians, Gupta et al [33] also demonstrated that agreement was highest at the extremes of outcome; that is, agreement was highest with good outcomes and bad outcomes and was lowest with intermediate outcomes. Our data might appear to contradict these observations, with similar levels of agreement being found for "presence and severity of adverse events" (ICC = 0.52) and "quality of medical judgment" (ICC = 0.67), yet case-by-case concordance on these parameters was not sought in our study and should not be inferred. Our results might be strengthened with blinding, which is not currently supported by our automated QA system.
Collegiality may also pose a limitation, as all of the physicians on the QA committee were also colleagues of the physicians who cared for the study patients. Although collegiality may have biased physician-reviewers toward judging care more favorably, a consistent bias toward agreement cannot be inferred in light of the potentially divergent influence of competing biases (eg, outcome bias).
Finally, the use of 13 independent reviewers for each chart was a limitation. Such resource-intensive redundancy is not practicable in most departments. Nevertheless, it demonstrates the concept that increasing levels of agreement may be associated with larger numbers of reviewers. Determining the optimal number of reviewers that optimizes efficiency while maintaining acceptable levels of agreement will be an important focus for further research.
Conclusions
Agreement in the chart review process can be achieved among physician-reviewers. The degree of agreement attainable is comparable to or superior to that of similar studies reported to date. These results highlight the potential for the use of computerized screening, explicit criteria, and training of expert reviewers to improve the reliability and validity of chart review-based QA [34].
References
[1] Codman EA. A study in hospital efficiency: as demonstrated by the case report of the first five years of a private hospital. Facsimile; 1869-1940 [originally published 1918].
[2] Gilbert EH, Lowenstein SR, Koziol-McLain J, Barta DC, Steiner J. Chart reviews in emergency medicine research: where are the methods? Ann Emerg Med 1996;27(3):305-8.
[3] Sanazaro PJ, Mills D. A critique of the use of generic screening in quality assessment. JAMA 1991;265(15):1977-81.
[4] Allison JJ, Wall TC, Spettell CM, Calhoun J, Fargason CA, Kobylinski, et al. The art and science of chart review. Jt Comm J Qual Patient Saf 2000;26(3):115-36.
[5] Panacek EA. Performing chart review studies. Air Med J 2007;26(5):206-10.
[6] Altman N, Krzywinski M. Points of significance: association, correlation and causation. Nat Methods 2015;12(10):899-900.
[7] Brook RH, Appel FA. Quality-of-care assessment: choosing a method for peer review. N Engl J Med 1973;288(25):1323-9.
[8] Donabedian A. Evaluating the quality of medical care. Milbank Mem Fund Q 1966;44(3):166-206.
[9] Klasco RS, Wolfe RE, Wong M, Edlow J, Chiu D, Anderson PD, Grossman SA. Assessing the rates of error and adverse events in the ED. Am J Emerg Med 2015;33:1786-9.
[10] Handel DA, Wears RL, Nathanson LA, Pines JM. Using information technology to improve the quality and safety of emergency care. Acad Emerg Med 2011;18(6):e45-51.
[11] Handler JA, Gillam M, Sanders AB, Klasco R. Defining, identifying, and measuring error in emergency medicine. Acad Emerg Med 2000;7(11):1183-8.
[12] Fisher RA. Statistical methods for research workers. 5th ed, revised and enlarged. London: Oliver and Boyd; 1934. Available from: http://www.haghish.com/resources/materials/Statistical_Methods_for_Research_Workers.pdf.
[13] Dans PE. Clinical peer review: burnishing a tarnished icon. Ann Intern Med 1993;118(7):566-8.
[14] Badcock D, Kelly A-M, Kerr D, Reade T. The quality of medical record review studies in the international emergency medicine literature. Ann Emerg Med 2005;45(4):444-7.
[15] Worster A, Bledsoe RD, Cleve P, Fernandes CM, Upadhye S, Eva K. Reassessing the methods of medical record review studies in emergency medicine research. Ann Emerg Med 2005;45(4):448-51.
[16] Osborne JW. Best practices in quantitative methods. SAGE; 2008.
[17] Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33(1):159-74.
[18] Brennan TA, Localio RJ, Laird NL. Reliability and validity of judgments concerning adverse events suffered by hospitalized patients. Med Care 1989;27(12):1148-58.
[19] Fleiss JL, Cohen J. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educ Psychol Meas 1973;33(3):613-9.
[20] Brennan TA, Localio AR, Leape LL, Laird NM, Peterson L, Hiatt HH, Barnes BA. Identification of adverse events occurring during hospitalization. A cross-sectional study of litigation, quality assurance, and medical records at two teaching hospitals. Ann Intern Med 1990;112(3):221-6.
[21] Hayward RA, McMahon LF, Bernard AM. Evaluating the care of general medicine inpatients: how good is implicit review? Ann Intern Med 1993;118(7):550-6.
[22] Thomas EJ, Lipsitz SR, Studdert DM, Brennan TA. The reliability of medical record review for estimating adverse event rates. Ann Intern Med 2002;136(11):812-6.
[23] Hofer TP, Asch SM, Hayward RA, Rubenstein LV, Hogan MM, Adams J, Kerr EA. Profiling quality of care: is there a role for peer review? BMC Health Serv Res 2004;4(1):9.
[24] Localio AR, Weaver SL, Landis JR, Lawthers AG, Brennan TA, Hebert L, Sharp TJ. Identifying adverse events caused by medical care: degree of physician agreement in a retrospective chart review. Ann Intern Med 1996;125(6):457-64.
[25] Myers RS. Hospital statistics don't tell the truth. Mod Hosp 1954;83(1):53-4.
[26] Rosenfeld LS. Quality of medical care in hospitals. Am J Public Health Nations Health 1957;47(7):856-65.
[27] Myers RS, Slee VN, Hoffmann RG. The medical audit protects the patient, helps the physician, and serves the hospital. Mod Hosp 1955;85(3):77-83.
[28] Luck J, Peabody JW, Lewis BL. An automated scoring algorithm for computerized clinical vignettes: evaluating physician performance against explicit quality criteria. Int J Med Inform 2006;75(10-11):701-7.
[29] Cassidy LD, Marsh GM, Holleran MK, Ruhl LS. Methodology to improve data quality from chart review in the managed care setting. Am J Manag Care 2002;8(9):787-93.
[30] Rubenstein LV, Kahn KL, Reinisch EJ, Sherwood MJ, Rogers WH, Kamberg C, et al. Changes in quality of care for five diseases measured by implicit review, 1981 to 1986. JAMA 1990;264(15):1974-9.
[31] Myers RS, Slee VN, Hoffmann RG. The medical audit protects the patient, helps the physician, and serves the hospital. Mod Hosp 1955;85(3):77-83.
[32] Caplan RA, Posner KL, Cheney FW. Effect of outcome on physician judgments of appropriateness of care. JAMA 1991;265(15):1957-60.
[33] Gupta M, Schriger DL, Tabas JA. The presence of outcome bias in emergency physician retrospective judgments of the quality of care. Ann Emerg Med 2011;57(4):323-8 [e9].
[34] Office of the National Coordinator for Health Information Technology. Health IT-enabled quality improvement: a vision to achieve better health and health care. Available from: https://www.healthit.gov/sites/default/files/HITEnabledQualityImprovement-111214.pdf.