Lipreading With Local Spatiotemporal Descriptors

Bibliographic Details
Published in:	IEEE Transactions on Multimedia, vol. 11, no. 7, pp. 1254-1265
Main Authors:	Zhao, Guoying; Barnard, M.; Pietikainen, M.
Format:	Journal Article
Language:	English
Published:	New York, NY: Institute of Electrical and Electronics Engineers (IEEE), 01-11-2009
Subjects:	Applied sciences | Artificial intelligence | Auditory system | Computer science; control theory; systems | Confusion | Data mining | Exact sciences and technology | Hearing | Humans | Impairment | Lipreading | Lips | local binary patterns | Machine vision | Multimedia | Pattern recognition. Digital image processing. Computational geometry | Robustness | Segmentation | spatiotemporal descriptors | Spatiotemporal phenomena | Speech and sound recognition and synthesis. Linguistics | Speech recognition | Tongue | TV broadcasting | Visual | visual speech recognition | Working environment noise | Cluster analysis | Image recognition | Grey level image | Auditory disorder | Speaker recognition | Letter | Image segmentation | Lip | Gray scale | Classification | Database | Visual information | Spatial database | Speaker

Description
Summary:	Visual speech information plays an important role in lipreading under noisy conditions or for listeners with a hearing impairment. In this paper, we present local spatiotemporal descriptors to represent and recognize spoken isolated phrases based solely on visual input. Spatiotemporal local binary patterns extracted from mouth regions are used for describing isolated phrase sequences. In our experiments with 817 sequences from ten phrases and 20 speakers, promising accuracies of 62% and 70% were obtained in speaker-independent and speaker-dependent recognition, respectively. In comparison with other methods on the AVLetters database, our method's accuracy of 62.8% clearly outperforms the others. Analysis of the confusion matrix for 26 English letters shows the good clustering characteristics of visemes for the proposed descriptors. The advantages of our approach include local processing and robustness to monotonic gray-scale changes. Moreover, no error-prone segmentation of moving lips is needed.
ISSN:	1520-9210 1941-0077
DOI:	10.1109/TMM.2009.2030637
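
The summary describes spatiotemporal local binary patterns computed over mouth-region sequences and notes their robustness to monotonic gray-scale changes. The following is a minimal sketch of that idea, not the authors' implementation. This record does not give the neighbourhood layout, radii, or block partitioning, so the sketch assumes a square 8-neighbourhood at radius 1 on each of three orthogonal planes (XY, XT, YT) and a single histogram per plane. The names `lbp_code`, `lbp_top_histograms`, and `NEIGHBOURS` are hypothetical.

```python
from typing import List, Sequence

# A grayscale video volume indexed as volume[t][y][x] (assumed layout).
Volume = Sequence[Sequence[Sequence[int]]]

# Eight neighbours at radius 1 on a plane, as (du, dv) offsets.
# Assumption: square neighbourhood, no circular interpolation.
NEIGHBOURS = [(-1, -1), (0, -1), (1, -1), (1, 0),
              (1, 1), (0, 1), (-1, 1), (-1, 0)]


def lbp_code(center: int, samples: Sequence[int]) -> int:
    """Threshold each neighbour against the centre and pack the bits."""
    code = 0
    for bit, value in enumerate(samples):
        if value >= center:
            code |= 1 << bit
    return code


def lbp_top_histograms(volume: Volume) -> List[int]:
    """Concatenate 256-bin LBP histograms from the XY, XT and YT planes."""
    n_t, n_y, n_x = len(volume), len(volume[0]), len(volume[0][0])
    hists = [[0] * 256 for _ in range(3)]
    # Skip the border so that every neighbour lies inside the volume.
    for t in range(1, n_t - 1):
        for y in range(1, n_y - 1):
            for x in range(1, n_x - 1):
                c = volume[t][y][x]
                xy = [volume[t][y + dv][x + du] for du, dv in NEIGHBOURS]
                xt = [volume[t + dv][y][x + du] for du, dv in NEIGHBOURS]
                yt = [volume[t + dv][y + du][x] for du, dv in NEIGHBOURS]
                for plane, samples in enumerate((xy, xt, yt)):
                    hists[plane][lbp_code(c, samples)] += 1
    return hists[0] + hists[1] + hists[2]


# Small synthetic volume (4 frames of 5x5 pixels), for illustration only.
demo = [[[(7 * t + 3 * y * y + 5 * x * t + x) % 17 for x in range(5)]
         for y in range(5)] for t in range(4)]

# A strictly increasing gray-scale mapping leaves every comparison unchanged.
brighter = [[[2 * v + 10 for v in row] for row in frame] for frame in demo]

descriptor = lbp_top_histograms(demo)
same_after_rescaling = descriptor == lbp_top_histograms(brighter)
```

Because each code depends only on whether a neighbour is at least as bright as the centre pixel, any strictly increasing intensity mapping yields the same descriptor, which is what `same_after_rescaling` checks. A recognizer would then feed such descriptors to a classifier; the classifier used in the paper is not stated in this record.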