LES-Talker: Fine-Grained Emotion Editing for Talking Head Generation in Linear Emotion Space

Guanwen Feng1,2,3†, Zhihao Qian1,2,3†, Yunan Li1,2,3*, Siyu Jin1,2,3, Qiguang Miao1,2,3*, Chi-Man Pun4
1School of Computer Science and Technology, Xidian University, China
2Xi’an Key Laboratory of Big Data and Intelligent Vision, Xidian University, China
3Key Laboratory of Collaborative Intelligence Systems, Ministry of Education, Xidian University, China
4Department of Computer and Information Science, University of Macau, Macao 999078, China

LES-Talker is a novel one-shot talking head generation model with high interpretability, designed to achieve fine-grained emotion editing across emotion types, levels, and facial units.

Abstract

While existing one-shot talking head generation models have made progress in coarse-grained emotion editing, fine-grained emotion editing models with high interpretability are still lacking. We argue that for an approach to be considered fine-grained, it must provide clear definitions and sufficiently detailed differentiation.

We present LES-Talker, a novel one-shot talking head generation model with high interpretability, to achieve fine-grained emotion editing across emotion types, emotion levels, and facial units. We propose a Linear Emotion Space (LES) definition based on Facial Action Units to characterize emotion transformations as vector transformations. We design the Cross-Dimension Attention Net (CDAN) to deeply mine the correlation between the LES representation and the 3D model representation. By mining multiple relationships across different feature and structure dimensions, we enable the LES representation to guide the controllable deformation of the 3D model. To adapt multimodal data with deviations to the LES and to enhance visual quality, we employ specialized network designs and training strategies. Experiments show that our method provides high visual quality along with multilevel, interpretable fine-grained emotion editing, outperforming mainstream methods.



Proposed Method

[Figure: overview of the LES-Talker framework]


The Linear Emotion Space (LES), defined on Facial Action Units (AUs), underpins the LES-Talker model and gives it high interpretability. It enables fine-grained editing across 8 emotion types, 17 facial units, and continuous intensity levels greater than 0. The model can be driven either by lightweight video input (an image sequence that supplies the AU source) or by audio alone.
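
To make the "emotion transformations as vector transformations" idea concrete, here is a minimal Python sketch that treats each emotion type as a direction in a 17-dimensional AU space, an emotion level as a non-negative scale along that direction, and a facial-unit edit as a shift of one coordinate. The emotion names, the random unit directions, and the helper functions are illustrative assumptions, not the paper's actual LES construction.

```python
import numpy as np

NUM_AUS = 17  # number of facial Action Units in the LES definition

rng = np.random.default_rng(0)

def _unit(v: np.ndarray) -> np.ndarray:
    """Normalize a vector to unit length."""
    return v / np.linalg.norm(v)

# Hypothetical unit directions, one per emotion type. In the actual model
# these would come from the paper's AU-based LES definition, not randomness;
# the 8 emotion names below are also an assumption.
EMOTION_DIRECTIONS = {
    name: _unit(rng.standard_normal(NUM_AUS))
    for name in ["neutral", "happy", "sad", "angry",
                 "fear", "disgust", "surprise", "contempt"]
}

def les_vector(emotion: str, level: float) -> np.ndarray:
    """Map an (emotion type, continuous level >= 0) pair to a point in LES."""
    if level < 0:
        raise ValueError("emotion levels are continuous and non-negative")
    return level * EMOTION_DIRECTIONS[emotion]

def edit_facial_unit(vec: np.ndarray, au_index: int, delta: float) -> np.ndarray:
    """Fine-grained edit: shift a single AU coordinate independently."""
    out = vec.copy()
    out[au_index] += delta
    return out

# Example: 'happy' at level 1.5, with one AU coordinate slightly amplified.
v = edit_facial_unit(les_vector("happy", 1.5), au_index=11, delta=0.2)
print(v.shape)  # (17,)
```

Because the space is linear, changing the emotion level only rescales the vector, and editing one facial unit leaves the other coordinates untouched, which is what makes the editing both fine-grained and interpretable.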
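The CDAN mentioned in the abstract is only described at a high level here. As a rough, hypothetical illustration of the general mechanism (3D model features attending to the LES representation so that emotion guides controllable deformation), a generic cross-attention layer in PyTorch might look like the following; the module name, dimensions, and residual design are assumptions for illustration, not the paper's exact CDAN architecture.

```python
import torch
import torch.nn as nn

class CrossDimensionAttentionSketch(nn.Module):
    """Hypothetical sketch: 3D-model feature tokens attend to LES tokens so
    the emotion representation can guide deformation. This is generic
    cross-attention, not the paper's exact CDAN design."""

    def __init__(self, model_dim: int, les_dim: int,
                 hidden: int = 64, heads: int = 4):
        super().__init__()
        self.proj_3d = nn.Linear(model_dim, hidden)   # 3D features -> shared space
        self.proj_les = nn.Linear(les_dim, hidden)    # LES features -> shared space
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.out = nn.Linear(hidden, model_dim)       # back to 3D feature space

    def forward(self, feats_3d: torch.Tensor, les: torch.Tensor) -> torch.Tensor:
        # feats_3d: (B, N, model_dim) tokens of the 3D model representation
        # les:      (B, M, les_dim)   LES tokens (e.g., per-AU coordinates)
        q = self.proj_3d(feats_3d)
        kv = self.proj_les(les)
        ctx, _ = self.attn(q, kv, kv)       # 3D tokens query the LES
        return feats_3d + self.out(ctx)     # residual, emotion-guided update

# Example shapes: 64 hypothetical 3D tokens guided by 17 per-AU LES tokens.
m = CrossDimensionAttentionSketch(model_dim=32, les_dim=1)
y = m(torch.randn(2, 64, 32), torch.randn(2, 17, 1))
```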