Multi-view semantic understanding for visual dialog

Bibliographic Details
Published in: Knowledge-Based Systems, Vol. 268, p. 110427
Main Authors: Jiang, Tianling, Zhang, Zefan, Li, Xin, Ji, Yi, Liu, Chunping
Format: Journal Article
Language: English
Published: Elsevier B.V., 23-05-2023
Description
Summary: Visual dialog, as a challenging cross-media task, requires answering a sequence of questions about a given image while drawing on the dialog history. The key problem is therefore how to answer visually grounded questions when the dialog provides only ambiguous reference information. In this work, we propose a novel method called Multi-View Semantic Understanding for Visual Dialog (MVSU) to address the visual coreference resolution problem. The model consists of two main textual processing modules: SRR (Semantic Retention RNN) and CRoT (Coreference Resolution on Text). Specifically, the SRR module generates semantically meaningful word features by taking contextual information into account, while the CRoT module, working from a textual perspective, divides the useful nouns and pronouns into clusters that supply detailed information for semantic understanding. In experiments, we demonstrate that MVSU improves semantic understanding on the VisDial v1.0 dataset.
ISSN: 0950-7051, 1872-7409
DOI: 10.1016/j.knosys.2023.110427
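
The abstract describes the two textual modules only at a high level. As a rough illustration, the Python sketch below shows one plausible reading of that pipeline: a bidirectional recurrent encoder standing in for SRR, and a greedy similarity-based clustering step standing in for CRoT. The class names, dimensions, similarity threshold, and clustering rule are all assumptions made for illustration; this is not the authors' published implementation.

    # Hypothetical sketch of the two textual modules described in the abstract.
    # Module internals, dimensions, and the clustering rule are assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class SRRSketch(nn.Module):
        """Assumed form of SRR: a bidirectional LSTM that produces
        contextualized word features from embedded dialog tokens."""

        def __init__(self, embed_dim: int = 300, hidden_dim: int = 512):
            super().__init__()
            self.rnn = nn.LSTM(embed_dim, hidden_dim // 2,
                               batch_first=True, bidirectional=True)

        def forward(self, word_embeddings: torch.Tensor) -> torch.Tensor:
            # word_embeddings: (batch, seq_len, embed_dim)
            contextual, _ = self.rnn(word_embeddings)
            return contextual  # (batch, seq_len, hidden_dim)


    def crot_sketch(mention_feats: torch.Tensor, threshold: float = 0.8):
        """Assumed form of CRoT: greedily group noun/pronoun mention
        features into clusters by cosine similarity to a running centroid."""
        clusters = []   # list of lists of mention indices
        centroids = []  # running mean feature per cluster
        for i, feat in enumerate(mention_feats):
            sims = [F.cosine_similarity(feat, c, dim=0).item() for c in centroids]
            if sims and max(sims) >= threshold:
                j = sims.index(max(sims))
                clusters[j].append(i)
                n = len(clusters[j])
                # Incrementally update the matched cluster's centroid.
                centroids[j] = centroids[j] * ((n - 1) / n) + feat / n
            else:
                clusters.append([i])
                centroids.append(feat.clone())
        return clusters


    if __name__ == "__main__":
        srr = SRRSketch()
        tokens = torch.randn(1, 12, 300)      # one 12-token utterance
        feats = srr(tokens)                   # contextual word features
        mentions = feats[0, [2, 5, 9]]        # assumed noun/pronoun positions
        print(crot_sketch(mentions))          # e.g. [[0, 1], [2]]

In this reading, SRR supplies the context-aware word features the abstract mentions, and the cluster assignments produced by the CRoT step would serve as the supplementary coreference information for answering visually grounded questions.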