Measuring the effect of different types of unsupervised word representations on Medical Named Entity Recognition

Bibliographic Details
Published in:International journal of medical informatics (Shannon, Ireland) Vol. 129; pp. 100 - 106
Main Authors: Casillas, Arantza, Ezeiza, Nerea, Goenaga, Iakes, Pérez, Alicia, Soto, Xabier
Format: Journal Article
Language:English
Published: Ireland Elsevier B.V 01-09-2019
Description
Summary:
• Medical Entity Recognition is crucial for accurate clinical text processing.
• Our approach implements neural networks and word embeddings.
• The focus is on robust dense word representations.
• This work serves as a guide to choosing the right corpora, algorithm and parameters.

This work deals with Natural Language Processing applied to the clinical domain, specifically with Medical Entity Recognition (MER) on Electronic Health Records (EHRs). Until Deep Neural Networks (DNNs) emerged, developing a MER system entailed heavy data preprocessing and feature engineering. However, the quality of the word representations, in the form of embedding layers, is still an important issue for inference with DNNs.

The main goal of this work is to develop a robust MER system by adapting general-purpose DNNs to cope with the high lexical variability shown in EHRs. In addition, given that EHRs tend to be scarce whereas out-of-domain corpora are available, the aim is to assess the impact of the word representations on the performance of the MER system as we move to other domains. To this end, exhaustive experimentation varying information generation methods and network parameters is crucial.

We adapted a general-purpose sequential tagger based on Bidirectional Long Short-Term Memory (BiLSTM) cells and Conditional Random Fields (CRFs) to make it tolerant to high lexical variability and a limited amount of corpora. To do so, we added part-of-speech (POS) and semantic-tag embedding layers to the word representations. One of the strengths of this work is the exhaustive evaluation of dense word representations obtained by varying not only the domain and genre but also the learning algorithms and their parameter settings. With the proposed method, we attained an error reduction of 1.71 (5.7%) compared to the state of the art, even though no preprocessing or feature engineering was used.
Our results indicate that dense representations built taking word order into account benefit the entity extraction system. We also found that using a medical corpus (not necessarily EHRs) to infer the representations improves performance, even if the corpus does not belong to the same genre.
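
As a rough illustration of the input representation the summary describes (word representations extended with POS and semantic-tag embeddings), the following minimal sketch concatenates three lookup vectors per token. It is not the authors' implementation: the table contents, the vector sizes, the tag names (`NOUN`, `FINDING`, `NONE`) and the helper names `make_table` and `token_vector` are all hypothetical, and random vectors stand in for pre-trained embeddings. In a full tagger, the resulting sequence would feed the BiLSTM-CRF layers.

```python
import random

# Illustrative sizes only; the summary does not state the dimensions used.
WORD_DIM, POS_DIM, SEM_DIM = 8, 3, 2

def make_table(symbols, dim, seed):
    """Build a toy lookup table mapping each symbol to a random dense vector."""
    rng = random.Random(seed)
    return {s: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for s in symbols}

# Assumption: random vectors stand in for pre-trained word embeddings.
word_table = make_table(["patient", "reports", "chest", "pain"], WORD_DIM, seed=0)
pos_table = make_table(["NOUN", "VERB"], POS_DIM, seed=1)
sem_table = make_table(["FINDING", "NONE"], SEM_DIM, seed=2)  # hypothetical tag set

def token_vector(word, pos, sem):
    """Concatenate the word, POS and semantic-tag vectors for one token."""
    # Unseen words fall back to a zero vector in this sketch.
    word_vec = word_table.get(word.lower(), [0.0] * WORD_DIM)
    return word_vec + pos_table[pos] + sem_table[sem]

# One (word, POS tag, semantic tag) triple per token.
sentence = [
    ("Patient", "NOUN", "NONE"),
    ("reports", "VERB", "NONE"),
    ("chest", "NOUN", "FINDING"),
    ("pain", "NOUN", "FINDING"),
]
inputs = [token_vector(w, p, s) for w, p, s in sentence]
print(len(inputs), len(inputs[0]))
```

Concatenation keeps the three information sources separate in the input vector, so the POS and semantic-tag parts still carry signal for a token whose word form is missing from the word table.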
ISSN:1386-5056
1872-8243
DOI:10.1016/j.ijmedinf.2019.05.022