Recent advances in audio-driven talking head generation have achieved impressive results in lip synchronization and emotional expression, yet they largely overlook facial attribute editing. This capability is crucial for deep personalization and for expanding practical applications such as user-tailored digital avatars, engaging online education content, and brand-specific digital customer service. In these domains, the flexible adjustment of visual attributes, such as hairstyle, accessories, and subtle facial features, is essential for aligning with user preferences, reflecting diverse brand identities, and adapting to varying contextual demands. In this paper, we present FaceEditTalker, a unified framework that enables controllable facial attribute manipulation while generating high-quality, audio-synchronized talking head videos. Our method consists of two key components: an image feature space editing module, which extracts semantic and detail features and allows flexible control over attributes such as expression, hairstyle, and accessories; and an audio-driven video generation module, which fuses these edited features with audio-guided facial landmarks to drive a diffusion-based generator. This design ensures temporal coherence, visual fidelity, and identity preservation across frames. Extensive experiments on public datasets demonstrate that our method outperforms state-of-the-art approaches in lip-sync accuracy, video quality, and attribute controllability.
Given a single reference image, an audio clip, and optional facial attribute inputs, our method generates high-quality talking head videos with editable facial attributes. It predicts facial landmark maps, applies linear edits to the semantic feature encoding of the image, and feeds both into a diffusion model for synthesis. The method generalizes well and achieves high lip-sync accuracy. The reference image shown in this figure is a portrait from outside the training dataset.
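To make the editing step concrete, below is a minimal sketch of a linear edit in the semantic feature space. It assumes the semantic code is a single vector and that per-attribute directions have been precomputed; the function and variable names are illustrative placeholders, not taken from our implementation.

```python
import numpy as np

def edit_semantic_code(w, direction, strength):
    """Linearly shift a semantic code along a precomputed attribute direction.

    w         : (D,) semantic code extracted from the reference image
    direction : (D,) vector associated with one attribute (e.g., hairstyle)
    strength  : scalar controlling how strongly the attribute is applied
    (Hypothetical interface for illustration only.)
    """
    direction = direction / np.linalg.norm(direction)  # normalize the edit direction
    return w + strength * direction

# Example: apply a single attribute edit to a stand-in semantic code.
rng = np.random.default_rng(0)
w = rng.normal(size=512)        # stand-in for an extracted semantic code
d_hair = rng.normal(size=512)   # stand-in for a precomputed "hairstyle" direction
w_edited = edit_semantic_code(w, d_hair, strength=2.0)
```

Multiple attribute edits can be composed by applying several such shifts before the code is passed on to the video generation module.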
The framework consists of two main modules: (a) Image Feature Space Editing Module, which extracts editable semantic and stochastic codes from the reference image using a dual-layer latent encoding structure. Fine-grained attribute manipulation is enabled through optional spatial editing on the semantic codes. (b) Audio-Driven Video Generation Module, which leverages the audio input to infer driving landmarks. During the diffusion process, the stochastic codes guide dynamic generation, while the semantic codes serve as conditional inputs to ensure attribute consistency and visual fidelity throughout the video.
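The sketch below schematically wires the two modules together at inference time. The module interfaces (encode_image, edit_semantic, audio_to_landmarks, diffusion_generate) are hypothetical placeholders used only to illustrate the data flow described above, not our actual implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class FaceEditTalkerPipeline:
    """Schematic data flow of the two modules (hypothetical interfaces)."""
    encode_image: Callable        # reference image -> (semantic_code, stochastic_code)
    edit_semantic: Callable       # (semantic_code, attribute_params) -> edited code
    audio_to_landmarks: Callable  # audio -> sequence of per-frame driving landmarks
    diffusion_generate: Callable  # (landmarks, semantic, stochastic) -> video frame

    def run(self, reference_image, audio, attribute_params=None):
        # (a) Image feature space editing: dual-layer latent encoding,
        # with optional attribute manipulation on the semantic codes.
        semantic, stochastic = self.encode_image(reference_image)
        if attribute_params is not None:
            semantic = self.edit_semantic(semantic, attribute_params)

        # (b) Audio-driven video generation: audio-inferred landmarks drive the
        # diffusion process; semantic codes act as conditions for attribute
        # consistency, while stochastic codes guide the dynamic generation.
        frames = []
        for landmarks in self.audio_to_landmarks(audio):
            frames.append(self.diffusion_generate(landmarks, semantic, stochastic))
        return frames
```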
We compare FaceEditTalker with several state-of-the-art methods: Wav2Lip, which primarily focuses on lip synchronization; SadTalker, which is based on 3D facial modeling; and diffusion-based methods such as DiffTalk, EchoMimic, and Hallo. The videos below show quality comparisons in the regular generation mode, without attribute editing.
FaceEditTalker can generate talking head videos with different facial attributes. The results below are produced by varying the facial attribute parameters while keeping the audio input and identity fixed.