Accurately retargeting facial expressions to a face mesh while enabling intuitive manipulation is a key challenge in facial animation. Recent deep-learning methods address this by encoding facial expressions into a global latent code, but they often fail to capture fine-grained details in local regions. While some methods improve local accuracy by transferring deformations locally, this often complicates overall control of the facial expression. To address this, we propose a method that combines the strengths of both global and local deformation models. Our approach enables intuitive control and detailed expression cloning across diverse face meshes, regardless of their underlying structures. The core idea is to localize the influence of the global latent code on the target mesh. Our model learns to predict skinning weights for each vertex of the target face mesh through indirect supervision from predefined segmentation labels. These predicted weights localize the global latent code, enabling precise, region-specific deformations even for meshes with unseen shapes. We supervise the latent code with Facial Action Coding System (FACS)-based blendshapes to ensure interpretability and to allow straightforward editing of the generated animation. Through extensive experiments, we demonstrate improved performance over state-of-the-art methods in terms of expression fidelity, deformation transfer accuracy, and adaptability across diverse mesh structures.
Our method localizes the influence of the global expression code on local regions of the target mesh via per-vertex skinning weights, enabling precise, region-specific expression cloning.
For training, we use synthetic data created with ICT-FaceKit and real data from the Multiface dataset. During training, we supervise the identity and expression encoders using ICT blendshape coefficients.
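For concreteness, a minimal PyTorch-style sketch of this expression supervision follows. The function name and tensor shapes are illustrative; we assume the expression code shares the dimensionality of the ICT expression blendshape coefficient vector.

```python
import torch.nn.functional as F

def expression_loss(z_exp, ict_coeffs):
    # z_exp:      (B, 53) expression code from the expression encoder
    # ict_coeffs: (B, 53) ground-truth FACS-based blendshape coefficients
    #             (ICT-FaceKit provides 53 expression blendshapes)
    # A simple L2 penalty ties the latent code to interpretable,
    # FACS-based blendshape coefficients, which is what makes the
    # resulting animation straightforward to edit.
    return F.mse_loss(z_exp, ict_coeffs)
```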
In addition to the identity and expression encoders, we employ a skinning encoder that predicts per-vertex skinning weights from the target mesh. These weights localize the global expression code, and the decoder maps the localized code to local deformations that deform the target mesh.
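A minimal sketch of one way to realize this localization is given below; the module names, dimensions, and the per-region linear projection are illustrative assumptions rather than our exact architecture. Each region transforms the global code, and the per-vertex skinning weights blend the region codes before decoding a displacement per vertex.

```python
import torch
import torch.nn as nn

class LocalizedExpressionDecoder(nn.Module):
    """Sketch: per-vertex skinning weights gate a global expression code."""

    def __init__(self, n_regions=14, dim_exp=53, hidden=256):
        super().__init__()
        # One learned projection of the global code per face region.
        self.region_proj = nn.Parameter(
            torch.randn(n_regions, dim_exp, dim_exp) * 0.02)
        self.decoder = nn.Sequential(
            nn.Linear(dim_exp + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),  # per-vertex displacement
        )

    def forward(self, verts, z_exp, skin_w):
        # verts:  (B, V, 3)   neutral target-mesh vertices
        # z_exp:  (B, dim_exp) global expression code
        # skin_w: (B, V, R)   per-vertex skinning weights (sum to 1 over R)
        z_region = torch.einsum('bd,rde->bre', z_exp, self.region_proj)
        z_local = torch.einsum('bvr,bre->bve', skin_w, z_region)
        offsets = self.decoder(torch.cat([z_local, verts], dim=-1))
        return verts + offsets  # deformed target mesh
```

Because the skinning weights are predicted per vertex, the same global code can drive different local deformations on meshes with unseen shapes or connectivity.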
A key aspect of our method is predicting the skinning weights that capture the relationship between local regions and the global expression code.
However, we cannot directly supervise the model with ground-truth skinning weights, as blendshape face models do not provide them. To address this, we instead supervise the skinning encoder using a segmentation label.
To supervise the skinning encoder with segmentation labels, we created a segmentation map that divides the ICT face into several regions by referring to facial muscle groups. We do not force z_Skin to take a strict one-hot form, allowing some degree of spatial correlation to remain during training. This supervision is applied only when the training sample comes from ICT-FaceKit.
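A minimal sketch of this indirect supervision, assuming a cross-entropy formulation over raw skinning logits (the function name and batching scheme are illustrative):

```python
import torch
import torch.nn.functional as F

def skinning_supervision_loss(skin_logits, seg_labels, is_ict):
    # skin_logits: (B, V, R) raw per-vertex outputs of the skinning encoder
    # seg_labels:  (B, V)    per-vertex region index (long) from the ICT
    #                        segmentation map
    # is_ict:      (B,)      bool mask; the loss applies only to ICT samples
    if not is_ict.any():
        return skin_logits.new_zeros(())
    logits = skin_logits[is_ict]   # (B', V, R)
    labels = seg_labels[is_ict]    # (B', V)
    # Cross-entropy on softmaxed weights pulls each vertex toward its
    # segmentation label without forcing a strict one-hot assignment,
    # so soft spatial correlation between neighboring regions remains.
    return F.cross_entropy(logits.reshape(-1, logits.shape[-1]),
                           labels.reshape(-1))
```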
This indirect supervision aims to regularize the skinning encoder by ensuring consistent skinning weights for corresponding face regions across different face shapes. While the skinning encoder can converge without this supervision, the estimated weights may lack the desired consistency.