FaceEditTalker: Interactive Talking Head Generation with Facial Attribute Editing

Guanwen Feng1,2,3, Zhiyuan Ma1,2,3, Yunan Li1,2,3, Junwei Jing3, Jiahao Yang3, Qiguang Miao1,2,3

1Xi’an Key Laboratory of Big Data and Intelligent Vision, Xidian University, China, 2Key Laboratory of Collaborative Intelligence Systems, Ministry of Education, Xidian University, 3School of Computer Science and Technology, Xidian University


Abstract

Recent advances in audio-driven talking head generation have achieved impressive results in lip synchronization and emotional expression, but existing methods largely overlook facial attribute editing. This capability is crucial for deep personalization and for practical applications such as user-tailored digital avatars, engaging online education content, and brand-specific digital customer service. In these domains, flexible adjustment of visual attributes such as hairstyle, accessories, and subtle facial features is essential for aligning with user preferences, reflecting diverse brand identities, and adapting to varying contextual demands. In this paper, we present FaceEditTalker, a unified framework that enables controllable facial attribute manipulation while generating high-quality, audio-synchronized talking head videos. Our method consists of two key components: an image feature space editing module, which extracts semantic and detail features and allows flexible control over attributes like expression, hairstyle, and accessories; and an audio-driven video generation module, which fuses these edited features with audio-guided facial landmarks to drive a diffusion-based generator. This design ensures temporal coherence, visual fidelity, and identity preservation across frames. Extensive experiments on public datasets demonstrate that our method outperforms state-of-the-art approaches in lip-sync accuracy, video quality, and attribute controllability.

The flowchart of FaceEditTalker

FaceEditTalker Pipeline Illustration

Given a single reference image, an audio input, and an optional facial attribute input, our method generates high-quality talking head videos with editable facial attributes. It predicts facial landmark maps from the audio, applies linear edits to the semantic encoding of the reference image, and uses both to condition a diffusion model. The method generalizes well and achieves high lip-sync accuracy. The input image in this figure is a portrait from outside the dataset.
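A linear edit on a semantic code can be sketched as a shift along an attribute direction. The sketch below is illustrative only and is not taken from the FaceEditTalker code: the function name `linear_edit`, the toy 4-dimensional code, and the "Smiling" direction vector are all assumptions, and how the direction is obtained is left out.

```python
import math
from typing import List, Sequence


def linear_edit(z_sem: Sequence[float],
                direction: Sequence[float],
                strength: float) -> List[float]:
    """Shift a semantic code along a unit-normalized attribute direction.

    Hypothetical helper: computes z + strength * d / ||d||.
    """
    if len(z_sem) != len(direction):
        raise ValueError("code and direction must have the same length")
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [z + strength * d / norm for z, d in zip(z_sem, direction)]


# Toy values for illustration; real semantic codes are much higher-dimensional.
z = [0.5, -1.0, 0.0, 2.0]
smile_direction = [3.0, 0.0, 4.0, 0.0]  # assumed direction, norm 5

edited = linear_edit(z, smile_direction, strength=2.0)
unchanged = linear_edit(z, smile_direction, strength=0.0)
```

A strength of zero returns the original code, and a negative strength moves the code away from the attribute, so one scalar controls both the direction and the intensity of the edit.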

The overview of FaceEditTalker

FaceEditTalker Technical Architecture Diagram

The framework consists of two main modules: (a) Image Feature Space Editing Module, which extracts editable semantic and stochastic codes from the reference image using a dual-layer latent encoding structure. Fine-grained attribute manipulation is enabled through optional spatial editing on the semantic codes. (b) Audio-Driven Video Generation Module, which leverages the audio input to infer driving landmarks. During the diffusion process, the stochastic codes guide dynamic generation, while the semantic codes serve as conditional inputs to ensure attribute consistency and visual fidelity throughout the video.
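The data flow between the two modules can be sketched as follows. This is a structural sketch under stated assumptions, not the authors' implementation: `ReferenceCodes`, `generate_video`, and the callables passed to it are hypothetical names, and the toy lambdas merely stand in for the landmark predictor and the diffusion-based generator.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

Vector = List[float]


@dataclass
class ReferenceCodes:
    semantic: Vector    # editable, attribute-level code
    stochastic: Vector  # detail code extracted from the reference image


def generate_video(
    codes: ReferenceCodes,
    audio_chunks: Sequence[Vector],
    predict_landmarks: Callable[[Vector], Vector],
    denoise_frame: Callable[[Vector, Vector, Vector], Vector],
    edit: Optional[Callable[[Vector], Vector]] = None,
) -> List[Vector]:
    """Hypothetical driver: one output frame per audio chunk."""
    # (a) Optional attribute edit, applied once so every frame shares it.
    semantic = edit(codes.semantic) if edit is not None else codes.semantic
    frames = []
    for chunk in audio_chunks:
        # (b) Audio drives the landmarks; both codes condition the generator.
        landmarks = predict_landmarks(chunk)
        frames.append(denoise_frame(codes.stochastic, semantic, landmarks))
    return frames


# Toy stand-ins so the sketch runs; they do no real image generation.
toy_codes = ReferenceCodes(semantic=[1.0, 2.0], stochastic=[0.1, 0.2])
frames = generate_video(
    toy_codes,
    audio_chunks=[[0.0], [1.0], [2.0]],
    predict_landmarks=lambda chunk: [chunk[0] * 10.0],
    denoise_frame=lambda noise, sem, lm: noise + sem + lm,  # list concatenation
    edit=lambda sem: [s + 1.0 for s in sem],
)
```

Applying the edit once, before the frame loop, is one simple way to keep the edited attribute identical across frames; only the landmarks change from frame to frame.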

Quality Comparison

We compared FaceEditTalker with several state-of-the-art methods: Wav2Lip, which primarily focuses on lip synchronization; SadTalker, which is based on 3D facial modeling; and the diffusion-based generation methods DiffTalk, EchoMimic, and Hallo. The results below compare generation quality in regular generation mode, without attribute editing.

Attribute Editing Generation

FaceEditTalker can generate talking head videos with different facial attributes. The results below were generated by adjusting different facial attribute parameters under the same audio input and identity.

5 O'Clock Shadow

Arched Eyebrows

Attractive

Bags Under Eyes

Bald

Bangs

Big Lips

Big Nose

Black Hair

Blond Hair

Blurry

Brown Hair

Bushy Eyebrows

Chubby

Double Chin

Eyeglasses

Goatee

Gray Hair

Heavy Makeup

High Cheekbones

Male

Mouth Slightly Open

Mustache

Narrow Eyes

No Beard

Oval Face

Pale Skin

Pointy Nose

Receding Hairline

Rosy Cheeks

Sideburns

Smiling

Straight Hair

Wavy Hair

Wearing Earrings

Wearing Hat

Wearing Lipstick

Wearing Necklace

Wearing Necktie

Young