Neural Face Skinning for Mesh-agnostic Facial Expression Cloning


1Visual Media Lab, KAIST, Republic of Korea
2Flawless AI, United States of America

We present a method that enables direct retargeting between two facial meshes with different shapes and mesh structures.

Abstract

Accurately retargeting facial expressions to a face mesh while enabling manipulation is a key challenge in facial animation retargeting. Recent deep-learning methods address this by encoding facial expressions into a global latent code, but they often fail to capture fine-grained details in local regions. While some methods improve local accuracy by transferring deformations locally, this often complicates overall control of the facial expression. To address this, we propose a method that combines the strengths of both global and local deformation models. Our approach enables intuitive control and detailed expression cloning across diverse face meshes, regardless of their underlying structures. The core idea is to localize the influence of the global latent code on the target mesh. Our model learns to predict skinning weights for each vertex of the target face mesh through indirect supervision from predefined segmentation labels. These predicted weights localize the global latent code, enabling precise and region-specific deformations even for meshes with unseen shapes. We supervise the latent code using Facial Action Coding System (FACS)-based blendshapes to ensure interpretability and allow straightforward editing of the generated animation. Through extensive experiments, we demonstrate improved performance over state-of-the-art methods in terms of expression fidelity, deformation transfer accuracy, and adaptability across diverse mesh structures.

Key idea

Brief illustration of the key idea.

Our method localizes the influence of the global expression code onto local regions of the target identity mesh using per-vertex skinning weights, enabling precise expression cloning within each local region.

Overview

Method overview. Illustration of the data flow at inference (a) and training (b). For simplicity, the encoders are omitted from (b); the dotted red box indicates the losses that are applied exclusively to the ICT data.
Our method localizes a global expression code for each vertex based on the target mesh geometry. A skinning encoder predicts per-vertex skinning weights from the target face mesh, and these weights are used to localize the global expression code. The localized expression code is then decoded into local deformations that produce the deformed mesh.
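As a rough illustration of this inference flow, the sketch below gates a global expression code by each vertex's skinning weights before decoding per-vertex displacements. It assumes hypothetical PyTorch modules and shapes (skinning_encoder, expression_encoder, deformation_decoder, and an outer-product localization); it is not the authors' implementation.

```python
import torch

# Assumed shapes: V vertices, K skinning regions, D-dim global expression code.
# skinning_encoder  : target neutral vertices (V, 3) -> per-vertex weights z_skin (V, K)
# expression_encoder: source expression vertices     -> global expression code z_exp (D,)
# deformation_decoder: localized codes + neutral mesh -> per-vertex displacements (V, 3)

def clone_expression(target_neutral, source_expression,
                     skinning_encoder, expression_encoder, deformation_decoder):
    """Sketch of expression cloning: localize a global expression code per vertex,
    then decode per-vertex deformations and add them to the neutral target mesh."""
    z_skin = skinning_encoder(target_neutral)        # (V, K), soft region memberships
    z_exp = expression_encoder(source_expression)    # (D,), global expression code

    # Localize the global code: one simple choice (an assumption here) is to weight
    # the code by each vertex's region memberships via an outer product per vertex.
    z_local = torch.einsum('vk,d->vkd', z_skin, z_exp).reshape(z_skin.shape[0], -1)  # (V, K*D)

    displacements = deformation_decoder(z_local, target_neutral)  # (V, 3)
    return target_neutral + displacements
```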



Supervision using segmentation labels


Segmentation label for ICT face mesh.

A key aspect of our method is predicting skinning weights that capture the relationship between facial regions and the global expression code. We propose a strategy in which the skinning encoder implicitly learns to predict skinning weights for a given mesh using supervision from predefined segmentation labels. This indirect supervision regularizes the skinning encoder by encouraging consistent skinning weights for corresponding face regions across different face shapes. While the encoder can converge without this supervision, the estimated weights may lack the desired consistency.


Visualization of face skinning weights on the ICT face mesh and various unseen meshes. For rendering, we extract the maximum value of z_Skin at each vertex and take the per-triangle mode of these values.
To supervise the skinning encoder with segmentation labels, we created a segmentation map that divides the face into several regions, defined on the ICT topology by referring to facial muscle groups. We do not force z_Skin to take a strict one-hot form, allowing some degree of spatial correlation to remain during training. This supervision is applied exclusively when the training data comes from ICT.
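A minimal sketch of how such indirect supervision could be wired up is shown below. It assumes z_Skin is produced as per-vertex logits over K regions and that seg_labels are the predefined per-vertex region indices on the ICT topology; the use of cross-entropy and the gating on ICT batches are illustrative choices, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def segmentation_loss(z_skin_logits, seg_labels, is_ict_batch):
    """Supervise per-vertex skinning weights with predefined segmentation labels.

    z_skin_logits: (V, K) unnormalized per-vertex region scores
    seg_labels:    (V,)   region index per vertex (defined on the ICT topology)
    is_ict_batch:  bool   labels exist only for ICT meshes, so the loss is
                          applied exclusively to ICT training samples
    """
    if not is_ict_batch:
        return z_skin_logits.new_zeros(())

    # Cross-entropy pulls z_Skin toward the labeled region but, with a modest
    # loss weight, does not force a strict one-hot assignment, leaving room
    # for spatial correlation between neighboring regions.
    return F.cross_entropy(z_skin_logits, seg_labels)
```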

Experiments

To evaluate expression quality, we conducted a self-retargeting task in which the model retargets the expression of a face mesh onto the same identity's neutral mesh. We also conducted inverse rigging experiments to validate whether our constructed expression code conforms to the grammar of the FACS-based ICT blendshape model.
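For context, inverse rigging is commonly posed as recovering bounded blendshape weights that reproduce a given expression. The sketch below is a generic projected-gradient least-squares fit over a FACS-style blendshape basis (e.g. the ICT blendshapes); it illustrates the task only and is not the paper's evaluation protocol, and the variable names and bounds are assumptions.

```python
import torch

def inverse_rig(target_offsets, blendshape_basis, steps=200, lr=0.05):
    """Illustrative inverse rigging: find weights w in [0, 1] such that
    blendshape_basis @ w approximates the target vertex offsets.

    target_offsets:   (3V,)   target mesh minus neutral mesh, flattened
    blendshape_basis: (3V, B) per-blendshape vertex offsets (e.g. ICT blendshapes)
    """
    w = torch.zeros(blendshape_basis.shape[1], requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.mean((blendshape_basis @ w - target_offsets) ** 2)
        loss.backward()
        opt.step()
        with torch.no_grad():
            w.clamp_(0.0, 1.0)  # FACS-style activations are conventionally in [0, 1]
    return w.detach()
```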



ICT with unseen IDs and expressions (randomly sampled blendshapes)






ICT with unseen IDs and expressions (captured using LiveLink App)






Multiface with unseen ID and expressions (EXP_free_face)






Expression cloning on stylized face mesh





Application


Editing expressions





Please refer to the paper for more details!