EmoSpeaker: One-shot Fine-grained Emotion-Controlled Talking Face Generation

Guanwen Feng1,2,3, Haoran Cheng1,2,3, Yunan Li1,2,3, Zhiyuan Ma1,2,3, Chaoneng Li1,2,3, Zhihao Qian3, Qiguang Miao1,2,3, Chi-Man Pun4
1Xi’an Key Laboratory of Big Data and Intelligent Vision, Xidian University
2China and Key Laboratory of Collaborative Intelligence Systems, Ministry of Education, Xidian University
3School of Computer Science and Technology, Xidian University
4Department of Computer and Information Science, University of Macau

EmoSpeaker generates emotional talking-head videos from an input audio clip, an emotion label, and a source image. It can also generate talking heads with different emotional intensities by adjusting a fine-grained emotion level. The demos on this page use a slider to step through the 15 fine-grained levels.

[Interactive slider demos: Angry, Happy, Sad; each adjustable from Level 1 to Level 15]

Generating fine-grained facial animations that accurately portray emotional expressions using only a portrait and an audio recording presents a challenge. Existing methods often rely on multiple emotional portraits or a video clip to capture different emotional expressions, while some utilize emotion labels for animation generation. However, these approaches lack precise control over facial emotional expression and face issues with lip synchronization accuracy.

To address these challenges, we propose a visual attribute-guided audio decoupler. It extracts content vectors that depend only on the audio content, which stabilizes the subsequent prediction of lip movement coefficients. To achieve more precise emotional expression, we introduce a fine-grained emotion coefficient prediction mechanism. Additionally, we propose an emotion intensity control method based on a fine-grained emotion matrix. Together, these components provide effective control over emotional expression in the generated videos and a finer classification of emotion intensity. A series of 3DMM coefficient generation networks then predicts the 3D coefficients, and a rendering network generates the final video.
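This page does not say how the fine-grained emotion matrix is built or stored. As an illustration only, the sketch below assumes it is a lookup table indexed by emotion category and intensity level (1 to 15), covering the seven categories whose level labels appear on this page. The names `EMOTIONS`, `EMBED_DIM`, `build_emotion_matrix`, and `emotion_vector` are hypothetical, and the random values stand in for entries that a trained model would learn.

```python
import numpy as np

# Categories taken from the level labels on this page; order is arbitrary.
EMOTIONS = ["angry", "contempt", "disgusted", "fear", "happy", "sad", "surprised"]
NUM_LEVELS = 15   # fine-grained intensity levels, as stated on this page
EMBED_DIM = 16    # hypothetical embedding size, not taken from the paper


def build_emotion_matrix(seed: int = 0) -> np.ndarray:
    """Return a (num_emotions, NUM_LEVELS, EMBED_DIM) table.

    Random placeholder values; a real system would learn these entries.
    """
    rng = np.random.default_rng(seed)
    return rng.standard_normal((len(EMOTIONS), NUM_LEVELS, EMBED_DIM))


def emotion_vector(matrix: np.ndarray, emotion: str, level: int) -> np.ndarray:
    """Look up the vector for one emotion category at one intensity level."""
    if emotion not in EMOTIONS:
        raise ValueError(f"unknown emotion: {emotion!r}")
    if not 1 <= level <= NUM_LEVELS:
        raise ValueError(f"level must be in 1..{NUM_LEVELS}, got {level}")
    # Levels are 1-based in the demos, rows are 0-based in the table.
    return matrix[EMOTIONS.index(emotion), level - 1]
```

Under this assumption, moving the slider from Level 1 to Level 15 changes only the `level` index passed to `emotion_vector`, while the audio and source image stay fixed.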

Our experimental results demonstrate that our proposed method, EmoSpeaker, outperforms existing emotional talking face generation methods in terms of expression variation and lip synchronization.


Overview of EmoSpeaker
EmoSpeaker consists of four stages:
(a) Source Coeff Extraction: extracts 68 facial keypoints and 3DMM coefficients from the reference image for training and generation.
(b) Visual Attribute-Guided Audio Decoupler: passes the audio through three consecutive audio encoders to obtain separate low-level and high-level audio encodings. A shared AU decoder produces AU-related features, which are compared with AU coefficients extracted from the training videos for comparative learning.
(c) Fine-grained Emotion Coefficient Prediction Mechanism: the emotion category and intensity label are specified manually. During inference, the sliding window size over the input audio is adjusted to obtain a fine-grained emotion vector synchronized with the audio. This vector is combined with the content vectors to predict expression, emotion, and pose coefficients through ExpNet, EmoNet, and PoseNet.
(d) Emotion Face Renderer: uses the predicted 3DMM coefficients to generate motion vectors for latent facial keypoints, which animate the facial image.
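The data flow in stage (c) can be sketched roughly as follows. This is not the authors' implementation: it assumes per-frame audio features stored as a NumPy array, a plain overlapping sliding window, and concatenation as the way the emotion vector is combined with the content vectors. The function names `sliding_windows` and `condition_on_emotion`, and the `hop` parameter, are hypothetical.

```python
import numpy as np


def sliding_windows(features: np.ndarray, window: int, hop: int = 1) -> np.ndarray:
    """Split (T, D) per-frame features into overlapping windows.

    Returns an array of shape (N, window, D), where N is the number of
    window start positions 0, hop, 2*hop, ... that fit inside T frames.
    """
    if window < 1 or hop < 1:
        raise ValueError("window and hop must be positive")
    num_frames = features.shape[0]
    if num_frames < window:
        raise ValueError("fewer frames than the window size")
    starts = range(0, num_frames - window + 1, hop)
    return np.stack([features[s:s + window] for s in starts])


def condition_on_emotion(content: np.ndarray, emotion_vec: np.ndarray) -> np.ndarray:
    """Attach one emotion vector to every content vector.

    content: (T, Dc), emotion_vec: (De,)  ->  (T, Dc + De)
    Concatenation is an assumption; the page only says the two are combined.
    """
    tiled = np.broadcast_to(emotion_vec, (content.shape[0], emotion_vec.shape[0]))
    return np.concatenate([content, tiled], axis=1)
```

In this sketch, changing `window` changes how many audio frames each prediction step sees, and the conditioned array is what hypothetical stand-ins for ExpNet, EmoNet, and PoseNet would take as input.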

Please refer to our paper for more details.

Quality Comparison

We compare EmoSpeaker with recent works in emotional talking-head generation.

[Three comparison panels, each shown with its source image]

Eight Emotional Results

EmoSpeaker can generate talking-head videos in eight emotion categories.

Emotional Granularity

EmoSpeaker can also generate talking heads with different emotional intensities by adjusting the fine-grained emotion level. The demonstrations here cover 15 fine-grained levels for multiple emotions; the slider selects the level.

[Interactive slider demos: Angry, Contempt, Disgusted, Fear, Happy, Sad, Surprised; each adjustable from Level 1 to Level 15]