VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning

Bibliographic Details
Published in:	2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 18009-18019
Main Authors:	Chen, Jun; Guo, Han; Yi, Kai; Li, Boyang; Elhoseiny, Mohamed
Format:	Conference Proceeding
Language:	English
Published:	IEEE 01-01-2022
Subjects:	Adaptation models; Computational modeling; Linguistics; Representation learning; Semantics; Training; Vision + language; Efficient learning and inferences; Transfer/low-shot/long-tail learning; Visualization

Description
Summary:	The limited availability of annotated data often hinders real-world applications of machine learning. To efficiently learn from small quantities of multimodal data, we leverage the linguistic knowledge from a large pre-trained language model (PLM) and quickly adapt it to new domains of image captioning. To effectively utilize a pretrained model, it is critical to balance the visual input and prior linguistic knowledge from pretraining. We propose VisualGPT, which employs a novel self-resurrecting encoder-decoder attention mechanism to quickly adapt the PLM with a small amount of in-domain image-text data. The proposed self-resurrecting activation unit produces sparse activations that prevent accidental overwriting of linguistic knowledge. When trained on 0.1%, 0.5% and 1% of the respective training sets, VisualGPT surpasses the best baseline by up to 10.0% CIDEr on MS COCO [43] and 17.9% CIDEr on Conceptual Captions [63]. Furthermore, VisualGPT achieves the state-of-the-art result on IU X-ray [15], a medical report generation dataset. Our code is available at https://github.com/Vision-CAIR/VisualGPT.
ISSN:	2575-7075
DOI:	10.1109/CVPR52688.2022.01750