BeAts: Bengali Speech Acts Recognition using Multimodal Attention Fusion
Format: Journal Article
Language: English
Published: 05-06-2023
Online Access: Get full text
Summary: Spoken languages often utilise intonation, rhythm, intensity, and structure to communicate intention, which can be interpreted differently depending on the rhythm with which an utterance is delivered. These speech acts provide the foundation of communication and are unique in expression to each language. Recent advances in attention-based models, which learn powerful representations from multilingual datasets, have performed well on speech tasks and are well suited to modelling specific tasks in low-resource languages. Here, we develop a novel multimodal approach that combines two models, wav2vec2.0 for audio and MarianMT for text translation, through multimodal attention fusion to predict speech acts in our prepared Bengali speech corpus. We also show that our model BeAts ($\underline{\textbf{Be}}$ngali speech acts recognition using Multimodal $\underline{\textbf{At}}$tention Fu$\underline{\textbf{s}}$ion) significantly outperforms both a unimodal baseline using only speech data and a simpler bimodal fusion using both speech and text data. Project page: https://soumitri2001.github.io/BeAts
DOI: 10.48550/arxiv.2306.02680
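
The summary describes the architecture only at a high level: pretrained audio and text encoders whose outputs are combined by attention-based fusion before classification. The sketch below illustrates what such a fusion could look like in PyTorch. It is a minimal illustration under stated assumptions, not the authors' released implementation: the checkpoint names (`facebook/wav2vec2-base`, `Helsinki-NLP/opus-mt-bn-en`), the text-queries-audio attention scheme, the projection size, the pooling step, and the class name are all illustrative choices.

```python
import torch.nn as nn
from transformers import MarianMTModel, Wav2Vec2Model


class BimodalAttentionFusion(nn.Module):
    """Hypothetical sketch of a BeAts-style model: cross-attention fusion of
    wav2vec2.0 audio features and MarianMT (Bengali->English) text features."""

    def __init__(self, num_classes: int, hidden: int = 512):
        super().__init__()
        # Pretrained encoders; these checkpoint names are illustrative, not
        # necessarily the ones used in the paper.
        self.audio_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.text_encoder = MarianMTModel.from_pretrained(
            "Helsinki-NLP/opus-mt-bn-en").get_encoder()
        # Project both modalities into a shared embedding space.
        self.audio_proj = nn.Linear(self.audio_encoder.config.hidden_size, hidden)
        self.text_proj = nn.Linear(self.text_encoder.config.d_model, hidden)
        # Cross-modal attention: text tokens act as queries over audio frames.
        self.cross_attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, input_values, input_ids, attention_mask=None):
        # wav2vec2.0 frame-level audio features: (batch, frames, hidden)
        audio = self.audio_proj(self.audio_encoder(input_values).last_hidden_state)
        # MarianMT encoder token-level text features: (batch, tokens, hidden)
        text = self.text_proj(self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask).last_hidden_state)
        # Fuse: every text token attends over the full audio sequence.
        fused, _ = self.cross_attn(query=text, key=audio, value=audio)
        # Mean-pool the fused sequence and classify the speech act.
        return self.classifier(fused.mean(dim=1))
```

Cross-attention, as opposed to simple concatenation of pooled embeddings (the "simpler bimodal fusion" baseline the summary mentions), lets each translated token weigh the prosodic audio frames most relevant to it, which is one plausible reading of why attention fusion helps here.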