
Developing neural network models for early detection of cardiac arrest in emergency department

Abstract

Background: Automated surveillance for cardiac arrests would be useful in overcrowded emergency departments. The purpose of this study is to develop and test artificial neural network (ANN) classifiers for early detection of patients at risk of cardiac arrest in emergency departments.

Methods: This is a single-center, electronic health record-based study. The primary outcome was the development of cardiac arrest within 24 h after prediction. Three ANN models were trained: multilayer perceptron (MLP), long-short-term memory (LSTM), and hybrid. These were compared to other classifiers, including the modified early warning score (MEWS), logistic regression, and random forest. We used the area under the receiver operating characteristic curve (AUROC), sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) for the comparison.

Results: During the study period, there were a total of 374,605 ED visits and 2,910,321 patient status updates. The ANN models (MLP, LSTM, and hybrid) achieved higher AUROCs (0.929, 0.933, and 0.936; 95% confidence intervals: 0.926-0.932, 0.930-0.936, and 0.933-0.939, respectively) than the non-ANN models, and the hybrid model exhibited the best performance. The ANN classifiers displayed higher performance in most of the test characteristics when their threshold levels were fixed to produce the same rate of positive results as the three MEWS thresholds (score ≥3, ≥4, and ≥5), and when compared with each other.

Conclusions: The ANN models improve upon MEWS and conventional machine learning algorithms for the prediction of cardiac arrest in emergency departments. The hybrid ANN model utilizing both baseline and sequence information achieved the best performance.


Introduction

The increase in the number of patients and the overcrowding of emergency departments (EDs) cause treatment delays, which affect the incidence of adverse outcomes, including inpatient mortality [1]. To solve these problems, it would be desirable to resolve ED overcrowding; however, achieving this in reality presents many difficulties. Ultimately, it is important to identify patients more likely to experience adverse outcomes in an overcrowded ED with limited resources and to take appropriate management measures [2]. Although most EDs use a triage system to identify critically ill patients, if the initial findings pertaining to vital signs and mental status at the time of the visit are not significantly affected, patients who are likely to deteriorate quickly can be overlooked.

Previous studies of in-hospital patients suggest that a large number of cardiac arrests can be predicted, because signs of deterioration (such as a change in a vital sign) can be present up to 24 h before the event [3]. Because a single vital sign does not accurately predict patient prognosis [4], many clinical scores have been developed to provide an early warning of cardiac arrest in hospitalized patients [5]. However, a simple clinical score does not include variables that are difficult to quantify, such as a chief complaint or a change in vital signs. Furthermore, several attempts have been made to predict cardiac arrest using various machine-learning algorithms, and some hospitals are already using these scores and alarm systems [6-8]. These methods performed well in identifying hospitalized patients who were more likely to suffer a cardiac arrest, but they were not developed for primary use in EDs. Although some studies have been conducted on ED patients limited to specific groups (such as those presenting with sepsis or chest pain), there has been no comprehensive study of all patients in EDs [9,10].

Recently, there have been several attempts to adopt various artificial neural network (ANN) algorithms to predict adverse clinical events [6,11,12]. The performance of an ANN algorithm is strongly influenced by the type of information provided and by a design that facilitates the maximum utilization of that information. Therefore, it may vary greatly based on the patient population, type of information used, and structure of the network. Until now, there has been no published attempt to describe what kind of information to use, how to train an ANN for predicting cardiac arrests in EDs, or how to determine its performance after training. We hypothesized that utilizing ANN algorithms would lead to better predictive performance than that obtained with current popular scoring systems, such as the Modified Early Warning Score (MEWS), or with algorithms used in previous studies [8,13], and would help to detect cardiac arrest earlier and with better accuracy in the ED. Therefore, the main purpose of this study is to identify optimized network designs of ANNs for the prediction of cardiac arrest in EDs and to test the performance of the trained networks. Because there were no existing instructions for developing and optimizing these networks, we also share our methodologies and experiences. Accordingly, we report on the performance of our algorithms and share a methodology for designing appropriate warning systems by observing changes in sensitivity and specificity according to threshold values.

Methods

Study design, setting, and primary outcome

This is a retrospective study assessing the performance of various ANN models for the prediction of cardiac arrest in ED patients. The electronic health records (EHR) of a tertiary academic hospital with >80,000 annual ED visits were used for this study. The primary outcome was the development of cardiac arrest within 24 h after a new update on patient condition (vital signs and consciousness) in the ED. The institutional review board of the study site (Seoul National University Bundang Hospital) approved the study and provided a waiver of informed consent.
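
As an illustration of this outcome definition, the following sketch shows one way each status update could be labeled against a 24-h horizon. It is a minimal, assumption-laden example: the column names (visit_id, update_time, arrest_time) are hypothetical and not taken from the study's data model.

```python
import pandas as pd

# Minimal sketch (hypothetical column names): label each status update as
# positive if a cardiac arrest occurred within 24 h after the update.
HORIZON = pd.Timedelta(hours=24)

def label_updates(updates: pd.DataFrame, arrests: pd.DataFrame) -> pd.DataFrame:
    """updates: one row per vital-sign/consciousness update (visit_id, update_time).
    arrests: one row per visit that ended in cardiac arrest (visit_id, arrest_time)."""
    merged = updates.merge(arrests[["visit_id", "arrest_time"]],
                           on="visit_id", how="left")
    within = (merged["arrest_time"].notna()
              & (merged["arrest_time"] > merged["update_time"])
              & (merged["arrest_time"] - merged["update_time"] <= HORIZON))
    merged["label"] = within.astype(int)  # 1 = arrest within 24 h of this update
    return merged
```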

Study population and predictors

The inclusion criteria were non-traumatic ED visits of patients (aged 18 and over) from 2008 to 2016. Exclusion criteria were out-of-hospital cardiac arrest (OHCA), dead on arrival (DOA), and transfer to another hospital from the ED.

The following variables were used to predict an event of cardiac arrest: demographic information (age and sex), chief complaints, vital signs (systolic blood pressure [SBP], diastolic blood pressure [DBP], heart rate [HR], respiratory rate [RR], and body temperature [BT]), and level of consciousness (measured according to the AVPU scale).

The variables are grouped into ‘baseline’ and ‘updatable’. The baseline variables include demographic information, chief complaints, the first measurements of vital signs, and level of consciousness at ED triage. The updatable variables include vital signs and level of consciousness measured within 6 h of each prediction. The ‘update’ on the current state is only made when there is a new measurement.

Data partition and preprocessing

A prediction is made at each of the update events. For convenience, we constructed a large dataset composed of all the prediction events paired with the corresponding baseline and updatable variables for up to 12 recent updates. The dataset was partitioned into training, validation, and test sets with a partition ratio of 70:15:15. Because a patient can visit the same ED multiple times and each visit can produce multiple measurements of the same variables, we used the patient's unique identification number for data partitioning to prevent information leakage between datasets.
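
A grouped split of this kind could be implemented, for example, with scikit-learn's GroupShuffleSplit; this is a hedged sketch rather than the study's actual code, and the variable names (events, patient_ids) are placeholders.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Sketch: 70/15/15 split of prediction events grouped by patient ID, so that
# all events from one patient land in exactly one partition.
def split_by_patient(events: np.ndarray, patient_ids: np.ndarray, seed: int = 42):
    gss = GroupShuffleSplit(n_splits=1, test_size=0.30, random_state=seed)
    train_idx, holdout_idx = next(gss.split(events, groups=patient_ids))
    # Split the 30% holdout in half (15% validation, 15% test), again by patient.
    gss2 = GroupShuffleSplit(n_splits=1, test_size=0.50, random_state=seed)
    val_rel, test_rel = next(gss2.split(events[holdout_idx],
                                        groups=patient_ids[holdout_idx]))
    return train_idx, holdout_idx[val_rel], holdout_idx[test_rel]
```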

Categorical variables were one-hot encoded, and numerical variables were centered and scaled by their standard deviations. Because of its high cardinality (>1000), the chief complaint was embedded into a 64-dimensional vector space using the following procedure: 1) pairing the chief complaints of each visit with their corresponding discharge diagnoses, 2) calculating the pointwise mutual information (PMI) of each co-occurring pair to construct a PMI matrix [14], and 3) applying singular value decomposition to the PMI matrix, M_PMI = UΣV*, where M_PMI is the PMI matrix, U is an m × m unitary matrix, Σ is an m × n diagonal matrix with non-negative real numbers on the diagonal, and V* is the conjugate transpose of V, which is an n × n unitary matrix. The term m indicates the total number of unique symptoms coded in the Systematized Nomenclature of Medicine - Clinical Terms (SNOMED-CT), and n indicates the total number of unique diagnoses coded in the International Classification of Diseases, 10th Revision (ICD-10). We used the left singular vectors (in U) with the largest singular values (N = 64) for the embedding [15].
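
The embedding step might look roughly like the following sketch, which assumes a precomputed complaint-by-diagnosis co-occurrence count matrix (counts); the handling of zero counts is illustrative, not necessarily that of the study.

```python
import numpy as np

# Sketch of the PMI + SVD embedding. `counts` is a hypothetical (m x n) matrix
# of co-occurrence counts: m unique SNOMED-CT chief complaints by n unique
# ICD-10 discharge diagnoses.
def pmi_embedding(counts: np.ndarray, dim: int = 64) -> np.ndarray:
    total = counts.sum()
    p_xy = counts / total
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal over chief complaints
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal over diagnoses
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_xy / (p_x * p_y))
    pmi[~np.isfinite(pmi)] = 0.0            # zero out log(0) / zero-count entries
    U, S, Vt = np.linalg.svd(pmi, full_matrices=False)
    # Left singular vectors for the 64 largest singular values: one row per complaint.
    return U[:, :dim]
```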

ANN models and training

We designed three ANN models, as shown in Fig. 1. The first was a model with a multilayer perceptron (MLP) architecture. The MLP model uses the baseline variables (age, sex, chief complaint vector embedding, initial vital signs, and level of consciousness) and the last status of the updatable variables (last vital signs and level of consciousness) for prediction. The second model consists of sequential long-short-term memory (LSTM) layers stacked on MLP layers. This LSTM model uses up to 12 recent updates within the past 6 h in a sequential format. The third is the ‘hybrid’ model, in which the baseline and sequence data are first processed separately and then fused to predict a single outcome. Specifically, the baseline variables are processed within the MLP architecture while the sequence data are processed within the LSTM architecture; the two representations are then concatenated into a single vector that is fed into an MLP. The specific design choices for the models, such as the number of units and layers, were made using a grid search with a search space that was determined empirically. We used 20% of the training dataset for the grid search to improve the search speed.
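
For illustration only, a hybrid model of this shape could be sketched in Keras as below; the layer widths, input dimensions, and variable names are placeholders, not the values selected by the grid search.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical dimensions: a baseline feature vector and a 12-step sequence of
# vital signs plus level of consciousness.
N_BASELINE = 80
N_STEPS, N_SEQ_FEATURES = 12, 9

baseline_in = keras.Input(shape=(N_BASELINE,), name="baseline")
seq_in = keras.Input(shape=(N_STEPS, N_SEQ_FEATURES), name="sequence")

x1 = layers.Dense(64, activation="relu")(baseline_in)   # MLP branch (baseline data)
x2 = layers.LSTM(64, activation="tanh")(seq_in)         # LSTM branch (sequence data)
merged = layers.concatenate([x1, x2])                   # fuse the two representations
merged = layers.Dense(32, activation="relu")(merged)
risk = layers.Dense(1, activation="sigmoid")(merged)    # probability of arrest within 24 h

hybrid_model = keras.Model(inputs=[baseline_in, seq_in], outputs=risk)
```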

The models were trained using the whole training dataset with a batch size of 128. We used 20% of the permuted training dataset for each training session (epoch) in a sequential manner. The loss was measured using binary cross-entropy, and the optimization algorithm was adaptive moment estimation (Adam). When there was no significant improvement in the loss for five iterations, an early stopping rule was applied to stop the training.
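
A training setup along these lines is sketched below; it reuses hybrid_model from the previous sketch. The dataset variable names and the epoch cap are hypothetical, while the batch size, loss, optimizer, and patience follow the description above.

```python
from tensorflow import keras

# Sketch of the training configuration (continuation of the previous sketch).
hybrid_model.compile(optimizer=keras.optimizers.Adam(),
                     loss="binary_crossentropy",
                     metrics=[keras.metrics.AUC()])

early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                           restore_best_weights=True)

# x_train_baseline, x_train_seq, y_train and the validation arrays are
# hypothetical names for the prepared inputs described in the Methods.
hybrid_model.fit([x_train_baseline, x_train_seq], y_train,
                 validation_data=([x_val_baseline, x_val_seq], y_val),
                 batch_size=128, epochs=100, callbacks=[early_stop])
```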

Statistical analyses

Categorical variables were reported using frequencies and proportions, and continuous variables were reported using the median and interquartile range (IQR).

Fig. 1. A) Structure of the multilayer perceptron (MLP) model; B) structure of the long-short-term memory (LSTM) model; C) structure of the hybrid model. ReLU, rectified linear unit; tanh, hyperbolic tangent.

Wilcoxon's rank-sum test, the chi-squared test, or Fisher's exact test was performed (as appropriate) for comparisons between groups.

For comparison, we built two non-ANN models: a logistic regression model and a random forest (RF) model. The ANN models were compared to the non-ANN models and MEWS based on AUROC, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). Using the latter four test characteristics required a fixed threshold level. The threshold level for each model was set such that the model would have the same chance of producing a positive test result as MEWS at specific cut-off values (3, 4, and 5) in the ED.
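
One way to realize this threshold-matching and compute the resulting test characteristics is sketched below; the array names (scores, mews, labels) are placeholders, and the quantile-based matching is an assumption about the implementation rather than the study's actual code.

```python
import numpy as np

# Sketch: choose a model threshold whose positive rate matches that of a given
# MEWS cut-off, then compute sensitivity/specificity/PPV/NPV at that threshold.
def matched_threshold(scores: np.ndarray, mews: np.ndarray, cutoff: int) -> float:
    positive_rate = np.mean(mews >= cutoff)          # e.g. cutoff = 3, 4, or 5
    return np.quantile(scores, 1.0 - positive_rate)  # same fraction flagged positive

def test_characteristics(scores: np.ndarray, labels: np.ndarray, threshold: float) -> dict:
    pred = scores >= threshold
    tp = np.sum(pred & (labels == 1)); fp = np.sum(pred & (labels == 0))
    fn = np.sum(~pred & (labels == 1)); tn = np.sum(~pred & (labels == 0))
    return {"sensitivity": tp / (tp + fn), "specificity": tn / (tn + fp),
            "PPV": tp / (tp + fp), "NPV": tn / (tn + fn)}
```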

A p-value of <0.05 was considered statistically significant. The ANN models were trained in Python 3.6.5 using TensorFlow 1.9.0 and Keras 2.1.6. Initial data handling, construction of the non-ANN models, and overall model comparison were performed using R version 3.5.0 (R Foundation for Statistical Computing, Vienna, Austria).

Results

There were 374,605 eligible ED visits of 233,763 patients during the study period from 2008 to 2016. There were 2,910,321 updates in patient condition, and thus the same number of prediction events (7.8 times per visit, on average). The outcome event (cardiac arrest) occurred for 1097 patients (0.3%). Table 1 shows the clinical characteristics of ED visits partitioned into three groups, based on the unique patient identification number.

The ANN hyper-parameters described previously were determined with a grid search using 20% of the training dataset, and the ANN models were then trained with these hyper-parameters on the whole training dataset. Fig. 2 (Table 2) shows the overall predictive performance of the ANN models as well as MEWS and the non-ANN models. The three ANN models (MLP, LSTM, and hybrid) achieved higher AUROCs (0.929, 0.933, and 0.936; 95% CI: 0.926-0.932, 0.930-0.936, and 0.933-0.939, respectively) than the best-performing non-ANN model (RF: AUROC 0.923; 95% CI 0.919-0.926; p < 0.001 for all pairs). Among the ANNs, the hybrid model showed the best performance compared with the MLP and LSTM models (p < 0.001 for both comparisons).

Comparing the test characteristics (sensitivity, specificity, and predictive values) required a fixed threshold level for each classifier. We used three MEWS thresholds (score ≥3, ≥4, and ≥5) as baselines and set the threshold levels of the non-MEWS classifiers to have the same probability of producing a positive result as MEWS. Compared to MEWS, all three ANN classifiers consistently showed improved sensitivity and positive and negative predictive values, but not specificity (p < 0.001 for all pairs; Fig. 3, Supplementary Table 1). The improvements over the top-performing non-ANN classifier (RF) were statistically significant only for the ANN classifiers using the LSTM architecture (the LSTM and hybrid models, p < 0.001 for all pairs).

Discussion

In a busy ED with many patients whose medical stability is unknown, an accurate AI system for the prediction of cardiac arrest can provide great benefits.

Fig. 2. Predictive performance (AUROC) of the three artificial neural network (ANN) models, the modified early warning score (MEWS), logistic regression, and random forest. The three ANN models show higher AUROC compared to the non-ANN models.

Table 2
AUROC of models.

Model          AUROC (95% CI)         P for difference
                                      vs. RF     vs. MLP    vs. LSTM
ANN: MLP       0.929 (0.926-0.932)    <0.001
ANN: LSTM      0.933 (0.930-0.936)    <0.001     <0.001
ANN: Hybrid    0.936 (0.933-0.939)    <0.001     <0.001     <0.001
MEWS           0.886 (0.882-0.891)
Logistic       0.914 (0.910-0.918)
RF             0.923 (0.919-0.926)

ANN, artificial neural network; MLP, multilayer perceptron; LSTM, long-short-term memory; MEWS, modified early warning score; Logistic, logistic regression; RF, random forest.

Table 1
Baseline characteristics of ED visit cases.

                                         Training set           Validation set         Test set               P value
                                         (N = 261,926)          (N = 56,368)           (N = 56,311)
Age, median (IQR)                        52.0 (35.0-68.0)       51.0 (35.0-68.0)       51.0 (35.0-67.0)       0.537
Sex, male, N (%)                         120,007 (45.8)         26,026 (46.2)          25,577 (45.4)          0.041
Initial SBP, median (IQR)                132.0 (118.0-150.0)    132.0 (118.0-150.0)    132.0 (118.0-150.0)    0.488
Initial DBP, median (IQR)                76.0 (67.0-86.0)       76.0 (67.0-86.0)       76.0 (67.0-86.0)       0.734
Initial HR, median (IQR)                 82.0 (72.0-96.0)       82.0 (72.0-96.0)       82.0 (72.0-96.0)       0.071
Initial RR, median (IQR)                 18.0 (18.0-20.0)       18.0 (18.0-20.0)       18.0 (18.0-20.0)       0.475
Initial BT, median (IQR)                 36.6 (36.3-36.9)       36.6 (36.3-36.9)       36.6 (36.3-36.9)       0.891
Initial AVPU, N (%)
  - A                                    252,223 (96.3)         54,297 (96.3)          54,278 (96.4)          0.763
  - V                                    4225 (1.6)             878 (1.6)              884 (1.6)
  - P                                    4087 (1.6)             897 (1.6)              875 (1.6)
  - U                                    1391 (0.5)             296 (0.5)              274 (0.5)
Cardiac arrest, N (%)                    784 (0.3)              150 (0.3)              163 (0.3)              0.411
Disposition, N (%)
  - Discharged                           200,454 (76.5)         43,213 (76.7)          43,101 (76.5)          0.372
  - Admitted to ICU                      8176 (3.1)             1690 (3.0)             1743 (3.1)
  - Admitted to ward                     52,843 (20.2)          11,382 (20.2)          11,360 (20.2)
  - Death                                453 (0.2)              83 (0.1)               107 (0.2)
Length of stay, minutes, median (IQR)    208.0 (123.0-421.0)    205.0 (121.0-418.0)    206.0 (122.0-418.0)    0.051

IQR, interquartile range; SBP, systolic blood pressure; DBP, diastolic blood pressure; HR, heart rate; RR, respiratory rate; BT, body temperature; ICU, intensive care unit.

Fig. 3. Comparison of sensitivity, specificity, positive predictive value, and negative predictive value of each model according to different threshold levels. Three MEWS threshold levels (score ≥3, ≥4, and ≥5) were used, and the thresholds of the other models were set to produce positive results at rates similar to those of MEWS.

In this study, we trained and tested various ANN classifiers to identify patients at risk of developing cardiac arrest in the ED. The results show that ANN-based classifiers, especially those utilizing both static and dynamic information, can improve upon the conventional scoring system (MEWS) and a popular high-performing machine-learning algorithm (RF). To the best of our knowledge, this is the first study testing the performance of ANN models for the prediction of cardiac arrest in EDs.

Churpek et al. [8] compared the performance of various machine-learning algorithms for the prediction of deterioration in hospitalized patients. They demonstrated that RF achieved the best performance, while the MLP classifier performed worse. However, the performance of ANNs can vary greatly depending on their architectures and training methodologies [16]. Unlike feedforward networks, recurrent neural networks (including LSTM) process sequence data by iterating through the elements of the sequence and storing state information during the process [17]. Therefore, they are better suited to handling time-series data such as vital signs. Recently, Kwon et al. [6] showed that their LSTM classifier achieved better performance when predicting cardiac arrest in hospitalized patients. This finding is supported by our present study, in which the LSTM-based classifiers demonstrated better performance than both the RF and MLP classifiers.

In this study, we found that the superiority of the LSTM architecture also applies to the ED setting. It is important to point out that this architecture is equally applicable to ED patients, who may experience dynamic changes in the early stages of a disease. In addition, we found that including the chief complaint (which has a very high cardinality) and the initial vital status as separate inputs to a hybrid network of LSTM and MLP can improve accuracy. Using high-cardinality variables has been perceived as a difficult task by machine learning practitioners. In addition to embedding in a vector space, we could have utilized impact coding, which is a more direct approach.

In this study, the ANN models displayed statistically significant differences in indexes such as AUROC compared with the other classification techniques; however, the actual numerical differences were relatively small. This is not because the ANN models are not meaningfully different from the other models, but because even MEWS (which has a relatively low accuracy compared to other tools) showed a very high performance (AUROC 0.886) in our ED. This contrasts with previous studies on the accuracy of predicting poor outcomes or cardiac arrest using MEWS in the ED, which reported AUROCs only slightly better than 0.7 [10,18]. Given the statistically significant differences in our study, early warning systems using the ANN model may be even more useful in other EDs, where tools such as MEWS have performed poorly.

Lastly, we need to discuss how prediction models such as ours could be tested and refined in a prospective manner. Trained machine learning models can be exported to existing EHR systems. Such implementations may help clinicians detect adverse events earlier and take preventive measures. In addition, with proper adjustment of thresholds, they may even reduce "alarm fatigue" and improve the overall efficiency of health care delivery. In this context, prediction accuracy would not reflect the real performance of a model, because its alerts would influence future events. Instead, we can directly measure the beneficial effects, such as the reduction of cardiac arrest events or alarm frequency, using randomization. The randomization can take place among individuals within an EHR system or among institutions within a large multi-center study. Considering the relative scarcity of significant adverse events that would go undetected without such implementations, we think a large multicenter study would be the preferable choice.

There are some limitations to our study. First, to evaluate the adequacy of an early warning system, it may not be sufficient to compare only indicators that do not incorporate temporal concepts, such as sensitivity and PPV [19]. We assumed that the prediction system worked properly if a cardiac arrest occurred within 24 h of the last information input. On this basis, it is difficult to gauge whether an early warning system using this model provides the clinician with information about a deteriorating patient in a timely manner. Furthermore, considering that the neural network model is a black box that yields prediction results without explaining why they were produced, the clinician may not understand why a particular prediction was made, and it may be difficult to determine what action to take. There is no clinical study evaluating a model with respect to temporal adequacy, and further methodological work will be needed to improve the model in this respect.

Second, this study only examined the performance of the ANN models in identifying patients who are more likely to have a cardiac arrest. It is not yet clear how many cardiac arrests could be prevented by using our model. A prospective study is required to evaluate how a warning system using this ANN model affects therapeutic decisions and patient outcomes. There have been reports that early warning systems helped to improve outcomes for sepsis patients [7,20]. We believe that helping clinicians pre-check and identify patients who are more likely to have a cardiac arrest in an overcrowded ED may help to alert the clinician earlier. Additionally, we believe that this model could be applied to the triage system as well as to the early warning system for patients staying in the ED.

Conclusion

The ANN model improves upon MEWS and conventional machine learning algorithms in the prediction of cardiac arrest in EDs. We found that the hybrid ANN model utilizing both baseline and sequence information achieved the best performance.

Supplementary data to this article can be found online at https://doi.org/10.1016/j.ajem.2019.04.006.

Author contributions

Research conception & design: D-H Jang, J Kim, Y H Jo, J H Lee, J E Hwang.
Data analysis and interpretation: J Kim, Y H Jo, I Park, S M Park, D K Lee.
Drafting of the manuscript: D-H Jang, J Kim, J E Hwang, D Kim, H Chang.
Critical revision and editing: D-H Jang, J Kim, I Park, S M Park, D K Lee.
Approval of final manuscript: all authors.

Disclosures

Conflicts of interest: none

Acknowledgements

The authors thank the Division of Statistics in the Medical Research Collaborating Centre at Seoul National University Bundang Hospital for the statistical analyses. This study was supported by Seoul National University Bundang Hospital (SNUBH) grant 13-2017-015.

References

  1. Sun BC, Hsia RY, Weiss RE, Zingmond D, Liang LJ, Han W, et al. Effect of emergency department crowding on outcomes of admitted patients. Ann Emerg Med 2013;61:605-11 e6.
  2. Nannan Panday RS, Minderhoud TC, Alam N, Nanayakkara PWB. Prognostic value of early warning scores in the emergency department (ED) and acute medical unit (AMU): a narrative review. Eur J Intern Med 2017;45:20-31.
  3. Hodgetts TJ, Kenward G, Vlachonikolis IG, Payne S, Castle N. The identification of risk factors for cardiac arrest and formulation of activation criteria to alert a medical emergency team. Resuscitation 2002;54:125-31.
  4. Hong W, Earnest A, Sultana P, Koh Z, Shahidah N, Ong ME. How accurate are vital signs in predicting clinical outcomes in critically ill emergency department patients. Eur J Emerg Med 2013;20:27-32.
  5. Prytherch DR, Smith GB, Schmidt PE, Featherstone PI. ViEWS-towards a national early warning score for detecting adult inpatient deterioration. Resuscitation 2010;81:932-7.
  6. Kwon JM, Lee Y, Lee Y, Lee S, Park J. An algorithm based on deep learning for predicting in-hospital cardiac arrest. J Am Heart Assoc 2018:7.
  7. McCoy A, Das R. Reducing patient mortality, length of stay and readmissions through machine learning-based sepsis prediction in the emergency department, intensive care unit and hospital floor units. BMJ Open Qual 2017;6:e000158.
  8. Churpek MM, Yuen TC, Winslow C, Meltzer DO, Kattan MW, Edelson DP. Multicenter comparison of machine learning methods and conventional regression for predicting clinical deterioration on the wards. Crit Care Med 2016;44:368-74.
  9. Liu N, Koh ZX, Goh J, Lin Z, Haaland B, Ting BP, et al. Prediction of adverse cardiac events in emergency department patients with chest pain using machine learning for variable selection. BMC Med Inform Decis Mak 2014;14:75.
  10. Ong ME, Lee Ng CH, Goh K, Liu N, Koh ZX, Shahidah N, et al. Prediction of cardiac arrest in critically ill patients presenting to the emergency department using a machine learning score incorporating heart rate variability compared with the modified early warning score. Crit Care 2012;16:R108.
  11. Hu SB, Wong DJ, Correa A, Li N, Deng JC. Prediction of clinical deterioration in hospitalized adult patients with hematologic malignancies using a neural network model. PLoS One 2016;11:e0161401.
  12. Rajkomar A, Oren E, Chen K, Dai AM, Hajaj N, Hardt M, et al. Scalable and accurate deep learning with electronic health records. NPJ Digital Med 2018;1:18.
  13. Subbe CP, Kruger M, Rutherford P, Gemmel L. Validation of a modified early warning score in medical admissions. QJM 2001;94:521-6.
  14. Church KW, Hanks P. Word association norms, mutual information, and lexicography. Comput Linguist 1990;16:22-9.
  15. Zheng A, Casari A. Feature engineering for machine learning: principles and techniques for data scientists. O’Reilly Media, Inc.; 2018.
  16. Penny W, Frost D. Neural networks in clinical medicine. Med Decis Making 1996;16:386-98.
  17. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput 1997;9:1735-80.
  18. Ho LO, Li H, Shahidah N, Koh ZX, Sultana P, Hock Ong ME. Poor performance of the modified early warning score for predicting mortality in critically ill patients presenting to an emergency department. World J Emerg Med 2013;4:273-8.
  19. Scully CG, Daluwatte C. Evaluating performance of early warning indices to predict physiological instabilities. J Biomed Inform 2017;75:14-21.
  20. Umscheid CA, Betesh J, VanZandbergen C, Hanish A, Tait G, Mikkelsen ME, et al. Development, implementation, and impact of an automated early warning and response system for sepsis. J Hosp Med 2015;10:26-31.
