EmoSpeaker

Abstract

Generating fine-grained facial animations that accurately portray emotional expressions using only a portrait and an audio recording presents a challenge. Existing methods often rely on multiple emotional portraits or a video clip to capture different emotional expressions, while some utilize emotion labels for animation generation. However, these approaches lack precise control over facial emotional expression and face issues with lip synchronization accuracy.

To address these challenges, we propose a visual attribute-guided audio decoupler. It extracts content vectors that depend only on the audio content, which stabilizes the subsequent prediction of lip movement coefficients. To achieve more precise emotional expression, we introduce a fine-grained emotion coefficient prediction mechanism. Additionally, we propose an emotion intensity control method based on a fine-grained emotion matrix. Together, these provide effective control over emotional expression in the generated videos and a finer classification of emotion intensity. A series of 3DMM coefficient generation networks then predicts the 3D coefficients, and a rendering network generates the final video.

Our experimental results demonstrate that our proposed method, EmoSpeaker, outperforms existing emotional talking face generation methods in terms of expression variation and lip synchronization.

Overview

Overview of EmoSpeaker:

(a) Source Coeff Extraction: Extracts 68 facial keypoints and 3DMM coefficients from the reference image for training and generation.

(b) Visual Attribute-Guided Audio Decoupler: The audio passes through three consecutive audio encoders to obtain separate low-level and high-level audio encodings. A shared AU (action unit) decoder then produces AU-related features, which are compared with AU coefficients extracted from the training videos for comparative learning.

(c) Fine-grained Emotion Coefficient Prediction Mechanism: Emotion categories and intensity labels are specified manually. During inference, the sliding window size over the input audio is adjusted to obtain a fine-grained emotion vector synchronized with the audio. These emotion vectors are combined with the content vectors to predict expression, emotion, and pose coefficients through ExpNet, EmoNet, and PoseNet.

(d) Emotion Face Renderer: The predicted 3DMM coefficients are used to generate motion vectors for latent facial keypoints, which animate the facial image.
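Step (c) can be illustrated with a minimal sketch of how window-level emotion vectors might be aligned with per-frame content vectors. Everything here is an assumption for illustration: the function names (`window_spans`, `frame_emotions`, `fuse`), the use of non-overlapping windows, and fusion by concatenation are hypothetical and are not taken from the EmoSpeaker implementation.

```python
# Hypothetical sketch, not EmoSpeaker's code: aligns one emotion vector per
# audio window with per-frame content vectors, then concatenates them.

def window_spans(num_frames, window, stride):
    """Return (start, end) frame spans, end exclusive, covering the clip."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    spans = []
    start = 0
    while start < num_frames:
        # The last window is clipped so it never runs past the clip.
        spans.append((start, min(start + window, num_frames)))
        start += stride
    return spans


def frame_emotions(num_frames, window, window_vectors):
    """Repeat each window's emotion vector for every frame in that window.

    Assumes non-overlapping windows (stride equal to window size).
    """
    spans = window_spans(num_frames, window, window)
    if len(window_vectors) != len(spans):
        raise ValueError("need exactly one emotion vector per window")
    per_frame = []
    for (start, end), vec in zip(spans, window_vectors):
        per_frame.extend(list(vec) for _ in range(end - start))
    return per_frame


def fuse(content_vectors, emotion_vectors):
    """Concatenate content and emotion vectors frame by frame."""
    if len(content_vectors) != len(emotion_vectors):
        raise ValueError("content and emotion sequences differ in length")
    return [list(c) + list(e) for c, e in zip(content_vectors, emotion_vectors)]


# Usage: 5 audio frames, window size 2, one intensity value per window.
content = [[1.0, 2.0] for _ in range(5)]
emotions = frame_emotions(5, 2, [[0.2], [0.6], [1.0]])
fused = fuse(content, emotions)
```

In this sketch a smaller window yields more windows per clip, so the emotion vector can change more often along the audio; that is one plausible reading of how the window size relates to granularity.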

Please refer to our paper for more details.

Quality Comparison

We compare EmoSpeaker with recent works in emotional talking-head generation.

Each comparison shows three panels: Source Image, EmoSpeaker, and EAMM.

Eight Emotional Results

EmoSpeaker can generate talking heads in eight emotion categories.

Emotional Granularity

EmoSpeaker can also generate talking heads at different emotional intensities by adjusting the fine-grained emotion level. Below are 15 fine-grained demonstrations of multiple emotions.
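To make the idea of a category-by-intensity lookup concrete, here is a minimal sketch of one way a fine-grained emotion matrix could be organized. It is an assumption, not the authors' encoding: the 15 levels are inferred from the 15 demonstrations mentioned above, the emotions are identified only by index, and the scaled one-hot vector is a stand-in chosen for simplicity. The names `build_emotion_matrix` and `emotion_vector` are hypothetical.

```python
# Hypothetical sketch, not EmoSpeaker's code: a lookup table indexed by
# emotion category and intensity level.

NUM_EMOTIONS = 8   # eight emotion categories, as stated on this page
NUM_LEVELS = 15    # assumed from the 15 fine-grained demonstrations


def build_emotion_matrix(num_emotions=NUM_EMOTIONS, num_levels=NUM_LEVELS):
    """Return matrix[e][k], the vector for emotion e at level k + 1.

    Each vector is a one-hot category indicator scaled by a normalized
    intensity in (0, 1]. This encoding is an illustrative assumption.
    """
    matrix = []
    for e in range(num_emotions):
        row = []
        for level in range(1, num_levels + 1):
            intensity = level / num_levels
            row.append([intensity if i == e else 0.0
                        for i in range(num_emotions)])
        matrix.append(row)
    return matrix


def emotion_vector(matrix, emotion_id, level):
    """Look up the vector for a category index and a 1-based level."""
    if not 0 <= emotion_id < len(matrix):
        raise ValueError("unknown emotion category")
    if not 1 <= level <= len(matrix[emotion_id]):
        raise ValueError("intensity level out of range")
    return matrix[emotion_id][level - 1]


matrix = build_emotion_matrix()
mild = emotion_vector(matrix, 2, 3)     # category 2 at a low level
strong = emotion_vector(matrix, 2, 15)  # category 2 at the top level
```

A discrete level selected by the user would index a column of this table, and the resulting vector could then serve as the conditioning signal for the coefficient predictors.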