Improving Joint Learning of Chest X-Ray and Radiology Report by Word Region Alignment
Main Authors:
Format: Journal Article
Language: English
Published: 04-09-2021
DOI: 10.48550/arxiv.2109.01949
Summary: Self-supervised learning provides an opportunity to explore unlabeled chest
X-rays and their associated free-text reports, accumulated in clinical routine,
without manual supervision. This paper proposes a Joint Image Text
Representation Learning Network (JoImTeRNet) for pre-training on chest X-ray
images and their radiology reports. The model is pre-trained for visual-textual
matching at both the global image-sentence level and the local image
region-word level. Both levels are bidirectionally constrained by
cross-entropy-based and ranking-based triplet matching losses. Region-word
matching is computed using an attention mechanism, without direct supervision
of the mapping. The pre-trained multi-modal representations pave the way for
downstream tasks involving image and/or text encoding. We demonstrate the
quality of the learned representations through cross-modality retrieval and
multi-label classification on two datasets: OpenI-IU and MIMIC-CXR.
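The abstract describes bidirectional cross-entropy-based and ranking-based triplet matching losses for visual-textual alignment. The sketch below illustrates one common formulation of such a bidirectional objective in PyTorch; it is a minimal illustration, not the paper's actual implementation, and the function names, margin, and temperature values are assumptions.

```python
import torch
import torch.nn.functional as F

def triplet_matching_loss(img_emb, txt_emb, margin=0.2):
    """Bidirectional hinge-based triplet ranking loss over a batch.

    img_emb, txt_emb: (B, D) L2-normalized embeddings; matched image-text
    pairs share the same batch index. The margin value is illustrative.
    """
    # Cosine similarity matrix: sim[i, j] = <image_i, text_j>
    sim = img_emb @ txt_emb.t()                       # (B, B)
    pos = sim.diag().view(-1, 1)                      # matched-pair scores

    # Image-to-text direction: every non-matching text is a negative.
    cost_i2t = (margin + sim - pos).clamp(min=0)
    # Text-to-image direction: every non-matching image is a negative.
    cost_t2i = (margin + sim - pos.t()).clamp(min=0)

    # Zero the diagonal so positives do not penalize themselves.
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_i2t = cost_i2t.masked_fill(mask, 0)
    cost_t2i = cost_t2i.masked_fill(mask, 0)
    return cost_i2t.mean() + cost_t2i.mean()

def cross_entropy_matching_loss(img_emb, txt_emb, temperature=0.1):
    """Symmetric cross-entropy over similarity logits (CLIP-style) --
    one common reading of a cross-entropy matching objective."""
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    img = F.normalize(torch.randn(8, 256), dim=-1)
    txt = F.normalize(torch.randn(8, 256), dim=-1)
    print(triplet_matching_loss(img, txt).item())
    print(cross_entropy_matching_loss(img, txt).item())
```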
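For the local level, the abstract states that region-word matching is computed with an attention mechanism and without direct supervision of the mapping. Below is a minimal sketch of one such attention-weighted matching score, in the spirit of stacked cross attention; the smoothing temperature and the averaging over words are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def region_word_similarity(regions, words, smooth=4.0):
    """Attention-weighted region-word matching score for one image-report pair.

    regions: (R, D) image region features; words: (W, D) word features,
    both L2-normalized. Each word attends over all regions, and the
    per-word similarities are averaged into one pairwise score.
    `smooth` is an illustrative attention temperature.
    """
    attn_logits = smooth * (words @ regions.t())      # (W, R)
    attn = F.softmax(attn_logits, dim=-1)             # word -> region attention
    attended = attn @ regions                          # (W, D) per-word region context
    word_scores = F.cosine_similarity(words, attended, dim=-1)  # (W,)
    return word_scores.mean()

if __name__ == "__main__":
    regions = F.normalize(torch.randn(36, 256), dim=-1)
    words = F.normalize(torch.randn(20, 256), dim=-1)
    print(region_word_similarity(regions, words).item())
```

A pairwise score like this, computed over all image-report pairs in a batch, can be plugged into the same bidirectional matching losses sketched above, mirroring the abstract's description of constraining both the global image-sentence and local region-word levels.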