3D head-talk: speech synthesis 3D head movement face animation

Bibliographic Details
Published in:	Soft computing (Berlin, Germany) Vol. 28; no. 1; pp. 363 - 379
Main Authors:	Yang, Daowu; Li, Ruihui; Yang, Qi; Peng, Yuyi; Huang, Xibei; Zou, Jing
Format:	Journal Article
Language:	English
Published:	Berlin/Heidelberg: Springer Berlin Heidelberg, 2024
Subjects:	Artificial Intelligence; Computational Intelligence; Control; Data Analytics and Machine Learning; Engineering; Mathematical Logic and Foundations; Mechatronics; Robotics; Motion field generator; Speech-driven 3D face animations; Head motion

Description
Summary:	Speech-driven 3D human face animation has made considerable progress. However, synthesizing 3D facial speakers with head motion remains an unsolved problem, because head motion, as a speech-independent appearance representation, is difficult to model with a speech-driven approach. To solve this problem, we propose 3D head-talk, which generates 3D face animations combined with extreme head motion. The key challenge is to generate natural head movements that match the speech rhythm. We first form an end-to-end autoregressive model by combining a dual-tower and a single-tower Transformer: a speech encoder encodes the long-term audio context, a facial grid encoder encodes subtle changes in the vertices of the 3D facial grid, and a single-tower decoder autoregressively predicts a sequence of 3D facial animation grids. Next, the predicted 3D facial animation sequence is edited by a motion field generator containing head motion to obtain an output sequence containing extreme head motion. Finally, the natural 3D face animation under extreme head motion is presented in combination with the input audio. Quantitative and qualitative results show that our method outperforms current state-of-the-art methods and stabilizes the non-area region while maintaining the appearance of extreme head motion.
ISSN:	1432-7643 1433-7479
DOI:	10.1007/s00500-023-09292-5