Effectiveness of Text, Acoustic, and Lattice-Based Representations in Spoken Language Understanding Tasks

Bibliographic Details
Published in:	ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) pp. 1 - 5
Main Authors:	Villatoro-Tello, Esau; Madikeri, Srikanth; Zuluaga-Gomez, Juan; Sharma, Bidisha; Saeed Sarfjoo, Seyyed; Nigmatulina, Iuliia; Motlicek, Petr; Ivanov, Alexei V.; Ganapathiraju, Aravind
Format:	Conference Proceeding
Language:	English
Published:	IEEE 04-06-2023
Subjects:	Acoustics; Annotations; Benchmark testing; Cross-modal Attention; Human-computer Interaction; Manuals; Pipelines; Signal processing; Speech Recognition; Spoken Language Understanding; Task analysis; Word Consensus Networks

Description
Summary:	In this paper, we perform an exhaustive evaluation of different representations to address the intent classification problem in a Spoken Language Understanding (SLU) setup. We benchmark three types of systems on the SLU intent detection task: 1) text-based, 2) lattice-based, and 3) a novel multimodal approach. Our work provides a comprehensive analysis of the performance achievable by different state-of-the-art SLU systems under different circumstances, e.g., automatically- vs. manually-generated transcripts. We evaluate the systems on the publicly available SLURP spoken language resource corpus. Our results indicate that using richer forms of Automatic Speech Recognition (ASR) output, namely word consensus networks, allows the SLU system to improve over the 1-best setup (5.5% relative improvement). However, crossmodal approaches, i.e., learning from acoustic and text embeddings, obtain performance similar to the oracle setup, a relative improvement of 17.8% over the 1-best configuration, which makes them a recommended alternative for overcoming the limitations of working with automatically generated transcripts.
ISSN:	2379-190X
DOI:	10.1109/ICASSP49357.2023.10095168