Multi-view semantic understanding for visual dialog
Published in: Knowledge-Based Systems, Vol. 268, p. 110427
Main Authors: , , , ,
Format: Journal Article
Language: English
Published: Elsevier B.V., 23-05-2023
Summary: Visual dialog, a challenging cross-media task, requires answering a sequence of questions based on a given image and the dialog history. The key problem is therefore how to answer visually grounded questions when the dialog provides ambiguous reference information. In this work, we propose a novel method called Multi-View Semantic Understanding for Visual Dialog (MVSU) to resolve the visual coreference resolution problem. The model consists of two main textual processing modules: SRR (Semantic Retention RNN) and CRoT (Coreference Resolution on Text). Specifically, the SRR module generates semantically meaningful word features by considering contextual information, while the CRoT module works from a textual perspective, dividing all useful nouns and pronouns into clusters that supply detailed supplementary information for semantic understanding. Experiments on the VisDial v1.0 dataset demonstrate that MVSU enhances the model's ability to understand semantic information.
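To illustrate the kind of output a textual coreference-resolution step like CRoT produces (noun/pronoun mentions grouped into clusters), here is a deliberately minimal toy sketch. The rule set, lexicons, and function name below are hypothetical illustrations, not the paper's actual CRoT module, which the abstract does not specify in detail:

```python
# Toy sketch of coreference clustering over dialog text: link each
# pronoun to the most recent gender/number-compatible noun, yielding
# clusters of co-referring mentions. Hypothetical rules and lexicons,
# for illustration only -- NOT the MVSU/CRoT algorithm.

PRONOUNS = {"he": "m", "she": "f", "it": "n", "they": "p"}
NOUN_TAG = {"man": "m", "woman": "f", "dog": "n", "people": "p"}

def cluster_mentions(tokens):
    clusters = {}   # antecedent noun -> list of co-referring mentions
    recent = {}     # agreement tag -> most recently seen noun
    for tok in tokens:
        t = tok.lower()
        if t in NOUN_TAG:
            clusters.setdefault(t, [t])
            recent[NOUN_TAG[t]] = t
        elif t in PRONOUNS:
            antecedent = recent.get(PRONOUNS[t])
            if antecedent is not None:
                clusters[antecedent].append(t)
    return clusters

dialog = "A man walks a dog . He holds it".split()
print(cluster_mentions(dialog))
# -> {'man': ['man', 'he'], 'dog': ['dog', 'it']}
```

Clusters like these can then act as a supplement to visual grounding: a question containing "it" can be resolved to the noun "dog" before attending to image regions.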
ISSN: 0950-7051; 1872-7409
DOI: 10.1016/j.knosys.2023.110427