Question difficulty estimation via enhanced directional modality association transformer
Published in: Applied Intelligence (Dordrecht, Netherlands), Vol. 53, No. 23, pp. 28434-28445
Main Authors: , ,
Format: Journal Article
Language: English
Published: New York: Springer US, 01-12-2023 (Springer Nature B.V.)
Summary: Estimating the difficulty of a question in video QA is one of the important reasoning steps in answering it. However, no previous question difficulty estimator considers the association between multiple modalities, even though video QA is intrinsically a multi-modal task involving both text and video. To solve this problem, this paper proposes a novel question difficulty estimator using an enhanced directional modality attention transformer (DiMAT++). The proposed estimator adopts a CNN backbone network and a transformer to represent the video modality and RoBERTa to represent the text modality. However, these modality representations alone are insufficient to classify the difficulty level of a question correctly, since the modalities affect each other during video QA. Therefore, in the proposed estimator, DiMAT++ captures directional associations from the text modality to the video modality and vice versa. DiMAT, the previous version of DiMAT++, does not represent the sequential information of each modality, even though it is designed to express the directional associations. Thus, DiMAT++ revises DiMAT to accept sequential representations of each modality as input. The effectiveness of the proposed estimator is verified on two benchmark video QA data sets. The experimental results indicate that the proposed estimator outperforms three baselines, namely the heterogeneous attention mechanism (HAM), the multi-modal fusion transformer (MMFT), and DiMAT, which shows that DiMAT++ is effective in improving the performance of video question difficulty estimation.
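The abstract describes directional cross-modal associations: text features attending over video features and vice versa, with the two directional views combined for difficulty classification. The record does not give the actual DiMAT++ architecture, so the following is only a minimal NumPy sketch of that general idea; the dimensions, the mean pooling, and the fusion-by-concatenation step are illustrative assumptions, not the paper's method.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product attention in one direction:
    queries come from one modality, keys/values from the other."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (n_queries, n_keys)
    return softmax(scores, axis=-1) @ values # (n_queries, d)

rng = np.random.default_rng(0)
text = rng.normal(size=(10, 64))   # e.g. 10 question-token features (assumed dims)
video = rng.normal(size=(16, 64))  # e.g. 16 frame features (assumed dims)

t2v = cross_attention(text, video, video)  # text -> video direction
v2t = cross_attention(video, text, text)   # video -> text direction

# pool each directional view and concatenate; a difficulty classifier
# (not shown) would operate on this fused vector
fused = np.concatenate([t2v.mean(axis=0), v2t.mean(axis=0)])
print(fused.shape)  # (128,)
```

Each direction keeps its own query modality, so the two attention maps are not symmetric; this is the "directional" aspect the abstract emphasizes over a single joint fusion.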
ISSN: 0924-669X, 1573-7497
DOI: | 10.1007/s10489-023-04988-5 |