Identification and analysis of misclassified work-zone crashes using text mining techniques


Bibliographic Details
Published in: Accident analysis and prevention Vol. 159; p. 106211
Main Authors: Sayed, Md Abu, Qin, Xiao, Kate, Rohit J., Anisuzzaman, D.M., Yu, Zeyun
Format: Journal Article
Language: English
Published: Elsevier Ltd 01-09-2021
Description
Summary:
• Applied text mining techniques to process unstructured text in the crash narrative.
• Developed a unigram + bigram noisy-OR classifier to score the probability of missed work zone (WZ) crashes.
• Identified 201 missed WZ crashes among the top 450 cases with high unigram + bigram noisy-OR scores in 2019.
• Conducted ad hoc analysis of when and where WZ crashes are more likely to be missed, as well as the plausible causes of missing them.

Work zone safety management and research rely heavily on the quality of work zone crash data. However, a police officer may misclassify a crash in the structured data because of restrictive options in the crash report, a lack of understanding of the importance of work zone information, a lack of time due to the officer's workload, or failure to recognize the work zone as one of the contributing factors in the crash. Consequently, work zone crashes are underrepresented in crash statistics. Crash narratives contain valuable information that is not included in the structured data. The objective of this study is to develop a classifier that applies text mining techniques to quickly find missed work zone (WZ) crashes in the unstructured text of crash narratives. The study used three years of crash data, from 2017 to 2019. The 2017 and 2018 data were used for training, and the 2019 data were used for testing. A unigram + bigram noisy-OR classifier was developed and shown to be an efficient and effective means of classifying work zone crashes based on key information in the crash narrative. The ad hoc analysis of misclassified work zone crashes sheds light on when and where work zone crashes are more likely to be missed, and on the plausible reasons why.
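
The summary names a unigram + bigram noisy-OR classifier but does not give its formula, so the following is only a minimal sketch of the general noisy-OR idea, not the authors' implementation. It assumes that each n-gram's probability is estimated as the share of training narratives containing it that were work zone crashes, that narratives are tokenized on whitespace, and that rare n-grams are dropped with a hypothetical `min_count` threshold. The function names and the toy narratives are illustrative only.

```python
from collections import Counter


def ngrams(text):
    """Return the set of unigrams and bigrams in a lowercased narrative."""
    tokens = text.lower().split()
    unigrams = set(tokens)
    bigrams = {" ".join(pair) for pair in zip(tokens, tokens[1:])}
    return unigrams | bigrams


def train(narratives, labels, min_count=2):
    """Estimate P(work zone | n-gram present) from labeled narratives.

    Assumption: a simple relative frequency, keeping only n-grams that
    occur in at least `min_count` training narratives.
    """
    total, positive = Counter(), Counter()
    for text, is_wz in zip(narratives, labels):
        for gram in ngrams(text):
            total[gram] += 1
            if is_wz:
                positive[gram] += 1
    return {g: positive[g] / total[g] for g in total if total[g] >= min_count}


def noisy_or_score(text, probs):
    """Noisy-OR score: 1 - product of (1 - p) over n-grams seen in training."""
    remaining = 1.0
    for gram in ngrams(text):
        remaining *= 1.0 - probs.get(gram, 0.0)
    return 1.0 - remaining


# Toy training data (invented for illustration, not from the study).
train_texts = [
    "vehicle struck cone in construction zone",
    "driver hit barrel in construction zone",
    "vehicle rear ended car at red light",
    "driver ran red light and struck vehicle",
]
train_labels = [True, True, False, False]
probs = train(train_texts, train_labels)

# Rank unlabeled narratives so the most likely missed WZ crashes come first.
candidates = [
    "car stopped at red light",
    "truck sideswiped in construction zone",
    "vehicle struck pole",
]
ranked = sorted(candidates, key=lambda t: noisy_or_score(t, probs), reverse=True)
```

Under noisy-OR, a single strongly indicative n-gram is enough to push the score toward 1, which suits a review workflow like the one described: narratives are ranked by score and the top cases are checked manually.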
ISSN: 0001-4575, 1879-2057
DOI: 10.1016/j.aap.2021.106211