An Interpretable and Generalizable Speech Detector Based on a CNN-LSTM Framework

Bibliographic Details
Published in: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 13231-13235
Main Authors: Wan, Zijun; Wu, Yunying; Ticha, Mohamed Baha Ben; Le Godais, Gael; Kahane, Philippe; Chabardes, Stephan; Chen, Weidong; Zhang, Shaomin; Yvert, Blaise
Format: Conference Proceeding
Language: English
Published: IEEE 14-04-2024
Description
Summary: Speech brain-computer interfaces (speech BCIs) aim to reconstruct speech from recorded brain signals. Real-time speech BCI relies on speech detection, which is strongly affected by the choice of speech-related neural frequency features. However, most studies have not investigated this aspect when designing speech detectors. In this study, an electrocorticography (ECoG) dataset and a stereo-electroencephalography (sEEG) dataset were used to investigate how the type of brain signal affects the contribution of frequency bands to speech detection. We calculated the mutual information (MI) between neural frequency bands and the audio envelope and found that the distribution of MI across frequency bands differed between the two types of brain signals. Specifically, the 40-60 Hz band of the ECoG signal and the 0-20 Hz band of the sEEG signal had the highest MI values. To address this, we propose a two-module detector that combines convolutional neural networks and long short-term memory (CNN-LSTM) for feature extraction and speech prediction. Our detector outperformed three commonly used detectors: linear discriminant analysis (LDA), support vector machine (SVM), and LSTM. Notably, a high correlation was found between the CNN output and the frequency bands, and high MI values were observed in both types of brain signals. These findings confirm the interpretability and generalizability of our proposed speech detector.
ISSN: 2379-190X
DOI: 10.1109/ICASSP48485.2024.10445835
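
Illustrative sketch: the summary describes scoring each neural frequency band by its mutual information (MI) with the audio envelope. The record does not say which MI estimator the authors used, so the code below is only a minimal sketch of one common choice, a histogram (plug-in) estimate over equal-width bins. The function names, the bin count, and the synthetic "band power" and "envelope" series are all hypothetical and stand in for real recordings.

```python
import math
import random
from collections import Counter


def _bin_indices(values, n_bins):
    """Map each value to an equal-width bin index in [0, n_bins - 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant series falls into a single bin.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def mutual_information(x, y, n_bins=8):
    """Histogram (plug-in) estimate of MI between two sequences, in bits.

    Assumption: equal-width binning with n_bins bins per variable; this is
    an illustrative estimator, not necessarily the one used in the paper.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    bx, by = _bin_indices(x, n_bins), _bin_indices(y, n_bins)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    mi = 0.0
    for (i, j), count in joint.items():
        # p(i, j) * log2(p(i, j) / (p(i) * p(j))), written with raw counts.
        mi += (count / n) * math.log2(count * n / (px[i] * py[j]))
    return mi


# Synthetic stand-ins (hypothetical data, not from the paper):
# one band tracks the envelope, the other is independent noise.
rng = random.Random(0)
envelope = [rng.random() for _ in range(2000)]
band_tracking = [e + rng.gauss(0.0, 0.1) for e in envelope]
band_unrelated = [rng.gauss(0.0, 1.0) for _ in envelope]

mi_tracking = mutual_information(band_tracking, envelope)
mi_unrelated = mutual_information(band_unrelated, envelope)
print(f"MI (tracking band):  {mi_tracking:.3f} bits")
print(f"MI (unrelated band): {mi_unrelated:.3f} bits")
```

With real data, one would presumably compute a band-limited power series per frequency band and rank the bands by this score. Plug-in estimates are biased upward for small samples, so comparisons are most meaningful between series of equal length and binning.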