A computable case definition for patients with SARS-CoV2 testing that occurred outside the hospital

Bibliographic Details
Published in:	JAMIA Open Vol. 6; no. 3; p. ooad047
Main Authors:	Wang, Lijing; Zipursky, Amy R; Geva, Alon; McMurry, Andrew J; Mandl, Kenneth D; Miller, Timothy A
Format:	Journal Article
Language:	English
Published:	United States: Oxford University Press, 01-10-2023
Subjects:	Computational linguistics; Electronic records; Emergency medical services; Health aspects; Hospitals; Language processing; Machine learning; Massachusetts; Medical records; Natural language interfaces; Patient education; Research and Applications; COVID-19; Text classification; Natural language processing

Description
Summary:	Objective: To identify a cohort of COVID-19 cases, including when evidence of virus positivity was only mentioned in the clinical text, not in structured laboratory data in the electronic health record (EHR).
Materials and Methods: Statistical classifiers were trained on feature representations derived from unstructured text in patient EHRs. We used a proxy dataset of patients with COVID-19 polymerase chain reaction (PCR) tests for training. We selected a model based on performance on our proxy dataset and applied it to instances without COVID-19 PCR tests. A physician reviewed a sample of these instances to validate the classifier.
Results: On the test split of the proxy dataset, our best classifier obtained 0.56 F1, 0.60 precision, and 0.52 recall scores for SARS-CoV2 positive cases. In an expert validation, the classifier correctly identified 97.6% (81/84) as COVID-19 positive and 97.8% (91/93) as not SARS-CoV2 positive. The classifier labeled an additional 960 cases as not having SARS-CoV2 lab tests in hospital, and only 177 of those cases had the ICD-10 code for COVID-19.
Discussion: Proxy dataset performance may be worse because these instances sometimes include discussion of pending lab tests. The most predictive features are meaningful and interpretable. The type of external test that was performed is rarely mentioned.
Conclusion: COVID-19 cases that had testing done outside of the hospital can be reliably detected from the text in EHRs. Training on a proxy dataset was a suitable method for developing a highly performant classifier without labor-intensive labeling efforts.
Lay Summary: For a significant period at the start of the COVID-19 pandemic, some hospitals routinely tested every patient who came to the emergency room or was admitted, using a polymerase chain reaction (PCR) test for COVID-19. However, hospitals may have skipped this test if the patient reported a recent positive test outside the hospital, and these patients would be treated as if they had tested positive at the hospital. These patients are hard to detect for later study, because hospitals will not have an electronic record of a positive test. In this work, we hypothesized that we could detect these patients by teaching machine learning methods to read the text in electronic health records, where positive tests would be mentioned by clinicians. We found that we could detect these patients with high accuracy, as validated by a clinician; that many additional cases can be found this way; and that many of these cases would be hard to detect with other methods.
Bibliography:	Kenneth D. Mandl and Timothy A. Miller contributed equally to this work.
ISSN:	2574-2531
DOI:	10.1093/jamiaopen/ooad047
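
Note on the reported proxy-dataset scores: the summary gives an F1 of 0.56 alongside a precision of 0.6 and a recall of 0.52. The minimal sketch below shows how those three figures relate, assuming the standard definition of F1 as the harmonic mean of precision and recall. The function name `f1_score` is an illustrative choice and is not taken from the article or its code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (standard F1)."""
    if precision + recall == 0:
        # Avoid division by zero when the classifier finds no true positives.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the summary for SARS-CoV2 positive
# cases on the test split of the proxy dataset.
reported_precision = 0.60
reported_recall = 0.52

f1 = f1_score(reported_precision, reported_recall)
print(round(f1, 2))  # rounds to the reported 0.56
```

Because the harmonic mean is pulled toward the smaller of its two inputs, the F1 sits closer to the recall of 0.52 than a simple average would.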