Applying computer text mining algorithms for oversampling tumor mutation status in medical records for the NCI Patterns of Care studies

Bibliographic Details
Published in:	International journal of medical informatics (Shannon, Ireland) Vol. 177; p. 105157
Main Authors:	Liu, Benmei, Stevens, Jennifer, Beverungen, Gary, Halpern, Michael T.
Format:	Journal Article
Language:	English
Published:	Ireland Elsevier B.V 01-09-2023
Subjects:	ALK; EGFR; Non-small cell lung cancer; Text mining algorithm; Tumor mutation

Description
Summary:
• The congressionally mandated 2019 National Cancer Institute Patterns of Care (POC) study used novel sampling strategies.
• Stratification for NSCLC patients was based on tumor mutation status testing results.
• Text mining algorithms identified cases with positive EGFR/ALK mutations for oversampling in the POC study.
• The text mining algorithms achieved 77.6% sensitivity, 90.8% specificity, and 84.8% overall accuracy.
• The approach can be generalized to oversample patients with rare conditions in studies using electronic medical records.
Background: The National Cancer Institute (NCI) conducts Patterns of Care (POC) studies for selected cancer sites under a Congressional mandate. These studies aim to collect treatment information beyond what is typically collected by the NCI's Surveillance, Epidemiology, and End Results (SEER) Program. The 2019 POC study focused on non-small cell lung cancer (NSCLC) and melanoma. For the NSCLC cases, one of the primary sampling objectives was to oversample patients who tested positive for EGFR/ALK mutations, but mutation test results were not available before the study sample was selected.
Methods: To address this, text mining algorithms were developed to screen all eligible NSCLC cases from the SEER database. These algorithms were designed to identify the mutation test status, allowing for stratified sampling based on SEER registry, sex, race/ethnicity, and tumor mutation test results.
Results: The final NSCLC sample included 2,434 patients aged 20+ with advanced stage (IIIB-IVB) NSCLC diagnosed in 2017 and 2018. Among this sample, 692 cases (13.2%) tested positive for EGFR/ALK mutations. An evaluation of the text mining algorithms' performance, based on cases where both algorithm results and known EGFR/ALK status from medical chart abstraction were available, showed good results: a sensitivity of 77.6%, a specificity of 90.8%, and an overall accuracy of 84.8%.
Conclusions: The adaptation of text mining algorithms proved effective for oversampling patients with uncommon conditions in studies where electronic medical records are accessible. The 2019 POC study provides valuable data for researchers to evaluate cancer therapy details and patient characteristics, particularly among patients who tested positive for EGFR/ALK mutations.
ISSN:	1386-5056 1872-8243
DOI:	10.1016/j.ijmedinf.2023.105157
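
Illustrative sketch: The summary describes screening record text for EGFR/ALK test results and then scoring the screen against chart abstraction by sensitivity, specificity, and overall accuracy. The Python sketch below shows one simple way such a screen and its evaluation could look. It is not the authors' algorithm: the regular expressions, the function names `flag_positive` and `evaluate`, and the five toy notes are all hypothetical and chosen only to make the metric definitions concrete.

```python
import re

# Hypothetical pattern: a gene name followed, within the same sentence,
# by a word suggesting a positive result. Real record text needs richer rules.
POSITIVE = re.compile(
    r"\b(EGFR|ALK)\b[^.]{0,60}?\b(positive|detected|rearranged)\b",
    re.IGNORECASE,
)
# Crude negation check applied to the whole sentence (an assumption).
NEGATED = re.compile(r"\b(not|no|negative)\b", re.IGNORECASE)


def flag_positive(note: str) -> bool:
    """Return True if any sentence looks like a positive EGFR/ALK result."""
    for sentence in note.split("."):
        if POSITIVE.search(sentence) and not NEGATED.search(sentence):
            return True
    return False


def evaluate(predicted, actual):
    """Compare screen results with reference labels (e.g. chart abstraction)."""
    pairs = list(zip(predicted, actual))
    tp = sum(p and a for p, a in pairs)            # flagged, truly positive
    tn = sum(not p and not a for p, a in pairs)    # not flagged, truly negative
    fp = sum(p and not a for p, a in pairs)        # flagged, truly negative
    fn = sum(not p and a for p, a in pairs)        # missed positives
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(pairs),
    }


# Toy notes and reference labels, invented for illustration only.
notes = [
    "EGFR exon 19 deletion detected.",
    "ALK rearrangement not detected.",
    "Molecular testing pending.",  # truly positive, but the text gives no signal
    "EGFR mutation positive.",
    "No ALK fusion detected. EGFR negative.",
]
actual = [True, False, True, True, False]

predicted = [flag_positive(n) for n in notes]
metrics = evaluate(predicted, actual)
print(predicted)
print(metrics)
```

In a sampling workflow like the one the summary outlines, a flag of this kind would serve only as a stratification variable for oversampling; the third toy note shows how a truly positive case with uninformative text lowers sensitivity while leaving specificity unaffected.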