Accurately retargeting facial expressions to a face mesh while enabling intuitive manipulation is a key challenge in facial animation. Recent deep-learning methods address this by encoding facial expressions into a global latent code, but they often fail to capture fine-grained details in local regions. While some methods improve local accuracy by transferring deformations locally, this often complicates overall control of the facial expression. To address this, we propose a method that combines the strengths of both global and local deformation models. Our approach enables intuitive control and detailed expression cloning across diverse face meshes, regardless of their underlying structures. The core idea is to localize the influence of the global latent code on the target mesh. Our model learns to predict skinning weights for each vertex of the target face mesh through indirect supervision from predefined segmentation labels. These predicted weights localize the global latent code, enabling precise, region-specific deformations even for meshes with unseen shapes. We supervise the latent code with Facial Action Coding System (FACS)-based blendshapes to ensure interpretability and to allow straightforward editing of the generated animation. Through extensive experiments, we demonstrate improved performance over state-of-the-art methods in terms of expression fidelity, deformation transfer accuracy, and adaptability across diverse mesh structures.
Our method localizes the influence of the global expression code on local regions of the target mesh via per-vertex skinning weights, enabling precise, region-specific expression cloning.
For training, we use synthetic data created with ICT-FaceKit and real data from the Multiface dataset. During training, we supervise the identity and expression encoders using ICT blendshape coefficients.
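For concreteness, a minimal PyTorch-style sketch of this expression supervision follows. The function name and tensor shapes are illustrative; we assume the expression code shares the dimensionality of the ICT expression blendshape coefficient vector.

```python
import torch.nn.functional as F

def expression_loss(z_exp, ict_coeffs):
    # z_exp:      (B, 53) expression code from the expression encoder
    # ict_coeffs: (B, 53) ground-truth FACS-based blendshape coefficients
    #             (ICT-FaceKit provides 53 expression blendshapes)
    # A simple L2 penalty ties the latent code to interpretable,
    # FACS-based blendshape coefficients, which is what makes the
    # resulting animation straightforward to edit.
    return F.mse_loss(z_exp, ict_coeffs)
```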
In addition to the identity and expression encoders, we employ a skinning encoder that predicts per-vertex skinning weights from the target mesh. These weights localize the global expression code, and the decoder maps the localized code to local deformations that deform the target mesh.
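A minimal sketch of one way to realize this localization is given below; the module names, dimensions, and the per-region linear projection are illustrative assumptions rather than our exact architecture. Each region transforms the global code, and the per-vertex skinning weights blend the region codes before decoding a displacement per vertex.

```python
import torch
import torch.nn as nn

class LocalizedExpressionDecoder(nn.Module):
    """Sketch: per-vertex skinning weights gate a global expression code."""

    def __init__(self, n_regions=14, dim_exp=53, hidden=256):
        super().__init__()
        # One learned projection of the global code per face region.
        self.region_proj = nn.Parameter(
            torch.randn(n_regions, dim_exp, dim_exp) * 0.02)
        self.decoder = nn.Sequential(
            nn.Linear(dim_exp + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),  # per-vertex displacement
        )

    def forward(self, verts, z_exp, skin_w):
        # verts:  (B, V, 3)   neutral target-mesh vertices
        # z_exp:  (B, dim_exp) global expression code
        # skin_w: (B, V, R)   per-vertex skinning weights (sum to 1 over R)
        z_region = torch.einsum('bd,rde->bre', z_exp, self.region_proj)
        z_local = torch.einsum('bvr,bre->bve', skin_w, z_region)
        offsets = self.decoder(torch.cat([z_local, verts], dim=-1))
        return verts + offsets  # deformed target mesh
```

Because the skinning weights are predicted per vertex, the same global code can drive different local deformations on meshes with unseen shapes or connectivity.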
A key aspect of our method is predicting the skinning weights that capture the relationship between local regions and the global expression code.
However, we cannot directly supervise the model with ground-truth skinning weights, as blendshape face models do not provide them. To address this, we instead supervise the skinning encoder using a segmentation label.
To supervise the skinning encoder with segmentation labels, we created a segmentation map that divides the ICT face into several regions by referring to facial muscle groups. We do not force z_Skin to take a strict one-hot form, allowing some degree of spatial correlation to remain during training. This supervision is applied only when the training sample comes from ICT-FaceKit.
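A minimal sketch of this indirect supervision, assuming a cross-entropy formulation over raw skinning logits (the function name and batching scheme are illustrative):

```python
import torch
import torch.nn.functional as F

def skinning_supervision_loss(skin_logits, seg_labels, is_ict):
    # skin_logits: (B, V, R) raw per-vertex outputs of the skinning encoder
    # seg_labels:  (B, V)    per-vertex region index (long) from the ICT
    #                        segmentation map
    # is_ict:      (B,)      bool mask; the loss applies only to ICT samples
    if not is_ict.any():
        return skin_logits.new_zeros(())
    logits = skin_logits[is_ict]   # (B', V, R)
    labels = seg_labels[is_ict]    # (B', V)
    # Cross-entropy on softmaxed weights pulls each vertex toward its
    # segmentation label without forcing a strict one-hot assignment,
    # so soft spatial correlation between neighboring regions remains.
    return F.cross_entropy(logits.reshape(-1, logits.shape[-1]),
                           labels.reshape(-1))
```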
This indirect supervision aims to regularize the skinning encoder by ensuring consistent skinning weights for corresponding face regions across different face shapes. While the skinning encoder can converge without this supervision, the estimated weights may lack the desired consistency.