Comparison of artificial intelligence versus real-time physician assessment of pulmonary edema with lung ultrasound

a b s t r a c t

Background: Lung ultrasound can evaluate for pulmonary edema, but data suggest moderate inter-rater reliability among users. Artificial intelligence (AI) has been proposed as a means to increase the accuracy of B line interpretation. Early data suggest a benefit among more novice users, but data are limited among average residency-trained physicians. The objective of this study was to compare the accuracy of AI versus real-time physician assessment for B lines.

Methods: This was a prospective, observational study of adult Emergency Department patients presenting with suspected pulmonary edema. We excluded patients with active COVID-19 or interstitial lung disease. A physician performed thoracic ultrasound using the 12-zone technique. The physician recorded a video clip in each zone and provided an interpretation of positive (>=3 B lines or a wide, dense B line) or negative (<3 B lines and the absence of a wide, dense B line) for pulmonary edema based upon the real-time assessment. A research assistant then utilized the AI program to analyze the same saved clip to determine if it was positive versus negative for pulmonary edema. The physician sonographer was blinded to this assessment. The video clips were then reviewed independently by two expert physician sonographers (ultrasound leaders with >10,000 prior ultrasound image reviews) who were blinded to the AI and initial determinations. The experts reviewed all discordant values and reached consensus on whether the field (i.e., the area of lung between two adjacent ribs) was positive or negative using the same criteria as defined above, which served as the gold standard.

Results: 71 patients were included in the study (56.3% female; mean BMI: 33.4 [95% CI 30.6-36.2]), with 88.3% (752/852) of lung fields being of adequate quality for assessment. Overall, 36.1% of lung fields were positive for pulmonary edema. The physician was 96.7% (95% CI 93.8%-98.5%) sensitive and 79.1% (95% CI 75.1%-82.6%) specific. The AI software was 95.6% (95% CI 92.4%-97.7%) sensitive and 64.1% (95% CI 59.8%-68.5%) specific.

Conclusion: Both the physician and AI software were highly sensitive, though the physician was more specific. Future research should identify which factors are associated with increased diagnostic accuracy.


  1. Introduction

Dyspnea is one of the most common reasons for presentation to the Emergency Department (ED), comprising nearly 5 million visits in 2020 [1]. There are a multitude of causes for acute dyspnea in ED patients (e.g., heart failure, asthma, chronic obstructive pulmonary disease [COPD], pulmonary embolism). One of the more common etiologies is acute decompensated heart failure, which requires urgent diagnosis and targeted interventions to reduce morbidity and mortality. However, it can be challenging to diagnose this clinically, as patients often have more than one medical condition that can predispose to dyspnea [2-4]. Moreover, many of the common history and physical examination findings have been found to have poor diagnostic utility [5]. Chest radiographs may also be less accurate, with data suggesting that thoracic point-of-care ultrasound (POCUS) may be superior for identifying pulmonary edema [6].

* Corresponding author.

E-mail addresses: [email protected] (M. Gottlieb), [email protected] (D. Patel), [email protected] (M. Viars), [email protected] (J. Tsintolas), [email protected] (G.D. Peksa), [email protected] (J. Bailitz).

When using POCUS, pulmonary edema is identified by the presence of sonographic B lines. B lines are discrete, beam-like vertical hyperechoic reverberation artifacts arising from the pleural line that extend to the bottom of the ultrasound screen and move synchronously with lung sliding [7]. The presence of three or more B lines in the space between two contiguous ribs is suggestive of pulmonary edema [7]. Identifying these findings while the clinician is at the patient's bedside can improve overall diagnostic accuracy and shorten the time to diagnosis [6]. This could also be useful in resource-limited settings and for assessing responses to therapeutic interventions [8].

Importantly, POCUS is a user-dependent skill that requires structured training and assessment [9,10]. Artificial intelligence (AI) has been increasingly utilized to automate assessments and improve diagnostic accuracy. Prior studies of AI for pulmonary edema have primarily involved non-physicians or early learners [11,12], while others have been limited by small sample sizes [13]. There are limited data directly comparing the diagnostic accuracy of AI with real-time assessment by trained physicians. This comparison is critical to understanding how AI would perform relative to typical physician B line assessment in practice.

The primary aim of this study was to directly compare the sensitivity and specificity of lung ultrasound interpreted by AI versus a trained physician sonographer performing the assessment in real time. As a secondary objective, we sought to compare the sensitivity and specificity among patients with a low versus high body mass index (BMI).

  2. Methods

This was a prospective, observational study comparing the diagnostic accuracy of AI with real-time physician assessment of B lines among patients with suspected pulmonary edema. We followed the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guidelines [14]. The study was conducted at Rush University Medical Center, a 70,000-visit-per-year ED located in Chicago, Illinois. Rush University Medical Center is a quaternary care hospital with a three-year Emergency Medicine residency program and a Clinical Ultrasound fellowship. This study was approved by the Rush University Medical Center institutional review board, and all patients signed informed consent.

Adult (age >= 18 years) ED patients with suspected cardiogenic pulmonary edema were identified by research assistants via a convenience sample obtained when the physician sonographer was present. The physician sonographer was present three days per week from 8:00-16:00. Patients were eligible for inclusion if they were willing to participate in the study, spoke English as their primary language, and the treating clinician had a clinical concern for pulmonary edema. Patients were excluded if they had symptomatic COVID-19 infection or interstitial lung disease, were clinically unstable, declined to participate, had been previously enrolled in the study, or if the treating clinician was not concerned for pulmonary edema. Prior to beginning the study, research assistants collected the patient's age, gender, height and weight (to calculate BMI), and relevant past medical history (defined as hypertension, heart failure, end-stage renal disease, cirrhosis, and asthma/COPD).

All ultrasound examinations were performed by a single physician sonographer, a new ultrasound fellow who had previously completed an Emergency Medicine residency. The physician received a one-hour study-specific training session followed by 25 proctored lung ultrasound examinations by an expert sonographer (a fellowship-trained physician with Advanced Emergency Medicine Ultrasound Focused Practice Designation certification). The physician did not receive any additional training, and the study was initiated in early fellowship so as to best reflect the average residency-trained physician who would use thoracic ultrasound in practice [15].

The ultrasound examinations were performed using a C1-5 transducer (Venue; GE Healthcare) in the lung preset at a depth of 18 cm [16]. Images were obtained in the sagittal plane between two adjacent rib spaces. A 12-zone technique (six views per side) was utilized, comprising the following views: Left-Anterior-Inferior, Left-Anterior-Superior, Left-Lateral-Inferior, Left-Lateral-Superior, Left-Posterior-Inferior, Left-Posterior-Superior, Right-Anterior-Inferior, Right-Anterior-Superior, Right-Lateral-Inferior, Right-Lateral-Superior, Right-Posterior-Inferior, and Right-Posterior-Superior. The physician sonographer recorded a single six-second video clip from each region and reported a real-time assessment of whether the lung field was positive (>=3 B lines or a wide, dense B line) or negative (<3 B lines and the absence of a wide, dense B line) for pulmonary edema in each region. The research assistant then utilized the AI software (Auto B-Lines; GE Healthcare) to retrospectively analyze the same saved clip to determine whether it was positive versus negative for pulmonary edema (Fig. 1). We used a similar protocol for the AI software, wherein we defined positive as >=3 B lines or a large, dense B line and negative as <3 B lines and the absence of a large, dense B line. The physician sonographer was blinded to the AI assessment. Lung regions were excluded if they could not be adequately visualized (e.g., cardiac interference, obscuring structure).

The clips were then reviewed by two expert sonographers (ultrasound leaders with >10,000 prior ultrasound image reviews) who were blinded to both the initial assessment and AI findings, as well as any clinical, laboratory, or alternate imaging findings from the patient. The experts began by reviewing images together to reach consensus on image acceptability and interpretation of B lines. After this initial stage, they independently reviewed all video loops and provided separate interpretations. The experts then reviewed all discordant values and reached consensus on whether the field was positive or negative using the same criteria as defined above.

Fig. 1. Artificial intelligence for the sonographic assessment of pulmonary edema. A, positive lung field; B, negative lung field.

We reported descriptive statistics for the overall population. We calculated the overall accuracy, sensitivity, specificity, positive likelihood ratio (LR+), and negative likelihood ratio (LR-) with 95% confidence intervals (CI) for AI and real-time physician assessment. We also performed a subgroup analysis based upon a BMI <30 kg/m2 versus a BMI >=30 kg/m2.
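As an illustration of how these measures relate, the sketch below recomputes accuracy, sensitivity, specificity, and likelihood ratios from a 2x2 table. The cell counts are reconstructed approximately from the reported percentages (752 analyzable fields, roughly 36% positive) and are illustrative only, not the study's raw data.

```python
# Diagnostic accuracy measures from a 2x2 table.
# Counts are reconstructed from the reported percentages and are
# illustrative only (not the study's raw data).
tp, fn = 263, 9     # gold-standard positive fields (n = 272)
fp, tn = 100, 380   # gold-standard negative fields (n = 480)

sensitivity = tp / (tp + fn)                # true positive rate
specificity = tn / (tn + fp)                # true negative rate
accuracy = (tp + tn) / (tp + fn + fp + tn)  # overall agreement with gold standard
lr_pos = sensitivity / (1 - specificity)    # positive likelihood ratio
lr_neg = (1 - sensitivity) / specificity    # negative likelihood ratio

print(f"Accuracy {accuracy:.1%}, Sens {sensitivity:.1%}, "
      f"Spec {specificity:.1%}, LR+ {lr_pos:.2f}, LR- {lr_neg:.2f}")
```

A very low LR- (near 0.04) is what drives the conclusion that a negative scan effectively excludes pulmonary edema, while a modest LR+ tempers the value of a positive scan.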

Our sample size was determined using the following assumptions: 80% power, 5% alpha, an overall physician accuracy of 89%, and an AI accuracy no more than 5% lower than the physician accuracy (84%). The resultant sample size required a minimum of 732 measurements in each group.
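The stated minimum is consistent with the standard normal-approximation formula for comparing two proportions. The sketch below reproduces the calculation under those assumptions; it is an illustration of the formula, not necessarily the exact software the authors used.

```python
from math import sqrt
from statistics import NormalDist

# Per-group sample size to detect a difference between two proportions
# (assumed physician accuracy 0.89 vs. AI accuracy 0.84) with 80% power
# and a two-sided alpha of 0.05, via the normal-approximation formula.
p1, p2 = 0.89, 0.84
alpha, power = 0.05, 0.80
z_a = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, ~1.96
z_b = NormalDist().inv_cdf(power)          # power term, ~0.84
p_bar = (p1 + p2) / 2                      # pooled proportion

n = (z_a * sqrt(2 * p_bar * (1 - p_bar))
     + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / (p1 - p2) ** 2
print(round(n))  # ~732 measurements per group
```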

  3. Results

Two hundred eighty-four patients were approached for the study. Of those, 84 were excluded due to no clinical concern for pulmonary edema, 52 for symptomatic COVID-19, 41 declined participation, 20 did not speak English, seven had interstitial lung disease, five were clinically unstable, and four had previously been enrolled. A total of 71 patients were included in the study, with a mean age of 62 years; 53.5% were women. The mean BMI was 33.4 (95% CI 30.6-36.2) kg/m2. See Table 1 for patient demographic information.

Out of 852 potential lung fields, 88.3% (n = 752) were able to be adequately visualized for analysis. Overall, 36.1% of lung fields were positive for pulmonary edema. The physician was 96.7% (95% CI 93.8%-98.5%) sensitive and 79.1% (95% CI 75.1%-82.6%) specific (Table 2). The AI software was 95.6% (95% CI 92.4%-97.7%) sensitive and 64.1% (95% CI 59.8%-68.5%) specific.

In a subgroup analysis of BMI <30 kg/m2, the physician was 96.5% (95% CI 92.6%-98.7%) sensitive and 67.9% (95% CI 60.2%-74.8%) specific, while the AI software was 98.3% (95% CI 95.0%-99.6%) sensitive and 51.2% (95% CI 43.4%-59.0%) specific. In patients with BMI >=30 kg/m2, the physician was 90.7% (95% CI 83.1%-95.7%) sensitive and 71.0% (95% CI 65.7%-76.0%) specific, and the AI software was 96.9% (95% CI 91.2%-99.4%) sensitive and 85.0% (95% CI 80.6%-88.8%) specific.

Table 1
Patient demographics.

Age in years, mean (min-max)      62 (28-89)
Women, n (%)                      38 (53.5%)
Men, n (%)                        32 (45.1%)
Transgender, n (%)                1 (1.4%)
BMI, mean (95% CI)                33.4 (30.6-36.2)
Past medical history, n (%)
  Heart failure                   57 (80.3%)
  Hypertension                    56 (78.9%)
  Asthma or COPD                  24 (33.8%)
  ESRD                            23 (32.4%)
  Cirrhosis                       3 (4.2%)

BMI, body mass index; CI, confidence interval; COPD, chronic obstructive pulmonary disease; ESRD, end-stage renal disease.

  4. Discussion

In this prospective study directly comparing real-time physician assessment versus AI, we found that both physician and AI assessment were highly sensitive for detecting pulmonary edema. However, the AI was less specific, with more false positives than real-time physician assessment. Overall, this suggests that AI parallels real-time physician assessment with an excellent ability to exclude the diagnosis of pulmonary edema, though it is more limited in its role for ruling in pulmonary edema. A prior study by Russell et al. of 29 patients, in which lung ultrasound was performed by medical students or resident physicians and assessed later against experts, reported an intraclass correlation coefficient (ICC) of 0.56 for AI versus 0.82 when comparing between experts [11]. Similar to our study, they found that AI was highly sensitive but had a lower specificity and tended to overcount B lines. In contrast, Moore et al. studied 80 patients (including patients with COVID-19) using a handheld device and reported an ICC of 0.84, though they found that their AI software tended to undercount compared with expert assessment [12]. Another study of four patients (including only one with pulmonary edema and another with interstitial lung disease) reported an ICC ranging from 0.485 to 0.826 [13]. Others have evaluated patients with COVID-19, with one study of 90 patients identifying an ICC of 0.52-0.53 [17], while another study of 10 patients reported a Cohen's kappa of 0.822 [18].

We also sought to better understand the impact of BMI on these measures, given that some studies have identified differences in diagnostic accuracy based upon BMI [19-22]. When we analyzed the data by BMI, both the physician and AI assessment were more sensitive among low BMI patients, whereas they were more specific among higher BMI patients. This difference was even more pronounced with the AI software, which had significantly higher specificity than the physician among high BMI patients. This may be due to attenuation of the ultrasound waves with greater soft tissue, leading to fewer B lines [23,24]. Future work is needed to better understand what factors are associated with improved AI accuracy.

Our study has several key strengths compared with prior work. First, many of the studies discussed above focused primarily on specific numbers of B lines, rather than overall field positivity. Assessing specific numbers of B lines has been shown to have substantial variability in the literature even among experts, so the variability in specific numbers with AI is not surprising [25]. Moreover, the difference between single numbers may be less useful to the treating clinician than simply defining a field as positive or negative [25]. Therefore, we focused on the more clinically useful measure of positive versus negative lung fields. Additionally, our study differed by using real-time (rather than delayed) assessment, which more closely reflects actual practice in the clinical environment. In contrast to most recent studies, we also excluded patients with COVID-19, which is important as COVID-19 has been shown to produce different findings on lung ultrasound compared with non-COVID-19 pulmonary edema and may limit applicability to non-COVID-19 patients [26].

However, there are also several important limitations to consider. This study was performed at a single center and may not reflect other institutions. Additionally, all examinations were performed by a single physician, who was an ultrasound fellow. While the fellow was trained only to the level of the American College of Emergency Physicians emergency ultrasound guidelines to best reflect the average user, they did receive additional training in other POCUS applications during the portion of their fellowship dedicated to enrollment, which may have led to greater overall ultrasound skill than the average user. Moreover, we utilized a single AI software, and these findings may not apply to other software or future AI updates. The AI analysis was conducted using saved loops. While this was important to ensure the physician and AI software used the exact same images, it did not allow use of the AI quality indicator, which may have influenced the accuracy. Finally, we did not compare the accuracy of AI using different probes or depths, and future work may be needed to determine how these affect AI accuracy.

Table 2
Diagnostic accuracy of real-time physician assessment and artificial intelligence for pulmonary edema using ultrasound.

                          Physician               Artificial Intelligence
Accuracy (95% CI)         85.4% (82.6%-87.8%)     75.4% (72.2%-78.4%)
Sensitivity (95% CI)      96.7% (93.8%-98.5%)     95.6% (92.4%-97.7%)
Specificity (95% CI)      79.1% (75.1%-82.6%)     64.1% (59.8%-68.5%)
LR+ (95% CI)              4.61 (3.87-5.49)        2.66 (2.36-3.01)
LR- (95% CI)              0.04 (0.02-0.08)        0.07 (0.04-0.12)

CI, confidence interval; LR+, positive likelihood ratio; LR-, negative likelihood ratio.

  5. Conclusion

Both physician and AI software were highly sensitive for pulmonary edema, though the physician was more specific. This suggests that the greatest utility of AI for this application may be for excluding the presence of pulmonary edema, while caution should be exercised when interpreting positive findings. Future research should identify which factors are associated with increased diagnostic accuracy.

Prior presentations



This study was supported by the Society for Academic Emergency Medicine Foundation/Academy of Emergency Ultrasound Research Grant. The ultrasound machines were temporarily donated by GE Healthcare for the duration of the study only. GE Healthcare had no influence over the study design, data acquisition, data analysis, manuscript, or decision to publish.

Declaration of Competing Interest

We have no conflicts of interest to declare and this manuscript has not been submitted elsewhere.


The authors wish to thank all of the patients who were a part of this study and GE Healthcare for donating the ultrasound machine. The authors also wish to thank Faith Geevarghese, Fae Kayarian, Leslie Martinez, Jonas Neichin, and Simone Ymson for their help with the study.


References

1. Centers for Disease Control and Prevention. National Center for Health Statistics. National Hospital Ambulatory Medical Care Survey: Emergency Department Summary Tables. 2020. Accessed April 13, 2023. https://www.cdc.gov/nchs/data/nhamcs/web_tables/2020-nhamcs-ed-web-tables-508.pdf
2. Rutten FH, Cramer MJM, Lammers JWJ, Grobbee DE, Hoes AW. Heart failure and chronic obstructive pulmonary disease: an ignored combination? Eur J Heart Fail. 2006;8(7):706-11. https://doi.org/10.1016/j.ejheart.2006.01.010
3. Jabbour A, Macdonald PS, Keogh AM, et al. Differences between beta-blockers in patients with chronic heart failure and chronic obstructive pulmonary disease: a randomized crossover trial. J Am Coll Cardiol. 2010;55(17):1780-7. https://doi.org/10.1016/j.jacc.2010.01.024
4. Hawkins NM, Jhund PS, Simpson CR, et al. Primary care burden and treatment of patients with heart failure and chronic obstructive pulmonary disease in Scotland. Eur J Heart Fail. 2010;12(1):17-24. https://doi.org/10.1093/eurjhf/hfp160
5. Martindale JL, Wakai A, Collins SP, et al. Diagnosing acute heart failure in the emergency department: a systematic review and meta-analysis. Acad Emerg Med. 2016;23(3):223-42. https://doi.org/10.1111/acem.12878
6. Chiu L, Jairam MP, Chow R, et al. Meta-analysis of point-of-care lung ultrasonography versus chest radiography in adults with symptoms of acute decompensated heart failure. Am J Cardiol. 2022;174:89-95. https://doi.org/10.1016/j.amjcard.2022.03.022
7. Volpicelli G, Elbarbary M, Blaivas M, et al. International evidence-based recommendations for point-of-care lung ultrasound. Intensive Care Med. 2012;38(4):577-91. https://doi.org/10.1007/s00134-012-2513-4
8. Pang PS, Russell FM, Ehrman R, et al. Lung ultrasound-guided emergency department management of acute heart failure (BLUSHED-AHF): a randomized controlled pilot trial. JACC Heart Fail. 2021;9(9):638-48. https://doi.org/10.1016/j.jchf.2021.05.008
9. Gullett J, Donnelly JP, Sinert R, et al. Interobserver agreement in the evaluation of B-lines using bedside ultrasound. J Crit Care. 2015;30(6):1395-9. https://doi.org/10.1016/j.jcrc.2015.08.021
10. Gottlieb M, Duanmu Y. Beyond the numbers: assessing competency in point-of-care ultrasound. Ann Emerg Med. 2023;81(4):427-8. https://doi.org/10.1016/j.annemergmed.2023.01.020
11. Russell FM, Ehrman RR, Barton A, Sarmiento E, Ottenhoff JE, Nti BK. B-line quantification: comparing learners novice to lung ultrasound assisted by machine artificial intelligence technology to expert review. Ultrasound J. 2021;13(1):33. https://doi.org/10.1186/s13089-021-00234-6
12. Moore CL, Wang J, Battisti AJ, et al. Interobserver agreement and correlation of an automated algorithm for B-line identification and quantification with expert sonologist review in a handheld ultrasound device. J Ultrasound Med. 2022;41(10):2487-95. https://doi.org/10.1002/jum.15935
13. Short J, Acebes C, Rodriguez-de-Lema G, et al. Visual versus automatic ultrasound scoring of lung B-lines: reliability and consistency between systems. Med Ultrason. 2019;21(1):45-9. https://doi.org/10.11152/mu-1885
14. von Elm E, Altman DG, Egger M, et al. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. Ann Intern Med. 2007;147(8):573-7. https://doi.org/10.7326/0003-4819-147-8-200710160-00010
15. Ultrasound guidelines: emergency, point-of-care and clinical ultrasound guidelines in medicine. Ann Emerg Med. 2017;69(5):e27-54. https://doi.org/10.1016/j.annemergmed.2016.08.457
16. Duggan NM, Goldsmith AJ, Saud AAA, Ma IWY, Shokoohi H, Liteplo AS. Optimizing lung ultrasound: the effect of depth, gain and focal position on sonographic B-lines. Ultrasound Med Biol. 2022;48(8):1509-17. https://doi.org/10.1016/j.ultrasmedbio.2022.03.015
17. Damodaran S, Kulkarni AV, Gunaseelan V, Raj V, Kanchi M. Automated versus manual B-lines counting, left ventricular outflow tract velocity time integral and inferior vena cava collapsibility index in COVID-19 patients. Indian J Anaesth. 2022;66(5):368-74. https://doi.org/10.4103/ija.ija_1008_21
18. Tsaban G, Galante O, Almog Y, Ullman Y, Fuchs L. Feasibility of machine integrated point of care lung ultrasound automatic B-lines tool in the Corona-virus 2019 critical care unit. Crit Care. 2021;25(1):345. https://doi.org/10.1186/s13054-021-03770-8
19. Schuh S, Man C, Cheng A, et al. Predictors of non-diagnostic ultrasound scanning in children with suspected appendicitis. J Pediatr. 2011;158(1):112-8. https://doi.org/10.1016/j.jpeds.2010.07.035
20. Trout AT, Sanchez R, Ladino-Torres MF, Pai DR, Strouse PJ. A critical evaluation of US for the diagnosis of pediatric acute appendicitis in a real-life setting: how can we improve the diagnostic value of sonography? Pediatr Radiol. 2012;42(7):813-23. https://doi.org/10.1007/s00247-012-2358-6
21. Macaione I, Galvano A, Graceffa G, et al. Impact of BMI on preoperative axillary ultrasound assessment in patients with early breast cancer. Anticancer Res. 2020;40(12):7083-8. https://doi.org/10.21873/anticanres.14736
22. Jeeji AK, Ekstein SF, Ifelayo OI, et al. Increased body mass index is associated with decreased imaging quality of point-of-care abdominal aortic ultrasonography. J Clin Ultrasound. 2021;49(4):328-33. https://doi.org/10.1002/jcu.22929
23. Abu-Zidan FM, Hefny AF, Corr P. Clinical ultrasound physics. J Emerg Trauma Shock. 2011;4(4):501-3. https://doi.org/10.4103/0974-2700.86646
24. Uppot RN. Technical challenges of imaging & image-guided interventions in obese patients. Br J Radiol. 2018;91(1089):20170931. https://doi.org/10.1259/bjr.20170931
25. Haaksma ME, Smit JM, Heldeweg MLA, Pisani L, Elbers P, Tuinman PR. Lung ultrasound and B-lines: B careful! Intensive Care Med. 2020;46(3):544-5. https://doi.org/10.1007/s00134-019-05911-8
26. Arntfield R, VanBerlo B, Alaifan T, et al. Development of a convolutional neural network to differentiate among the etiology of similar appearing pathological B lines on lung ultrasound: a deep learning study. BMJ Open. 2021;11(3):e045120. https://doi.org/10.1136/bmjopen-2020-045120
