We propose an emotion-dependent domain adaptation scheme for speech-driven affective facial feature synthesis.
Audio-to-visual (A2V) mapping models are trained separately for each of the six emotion categories (angry, disgust, fear, happy, sad, and surprise) and the neutral category. A feature-representation-transfer domain adaptation scheme is proposed to augment the affective representations of each emotion category from the abundant neutral audio-visual recordings. This domain adaptation yields better-trained affective A2V models. The proposed affective facial synthesis system operates in two stages: first, speech emotion recognition extracts soft emotion category likelihoods for each utterance; then, the emotion-dependent A2V mapping models perform a soft affective facial synthesis.
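The soft synthesis stage can be read as a likelihood-weighted blend of the per-emotion A2V model outputs. The sketch below illustrates this idea under assumptions of ours: the function name, the seven-way probability vector, and the shape of the facial-feature tracks are hypothetical, not taken from the paper's implementation.

```python
import numpy as np

# Hypothetical emotion ordering; the paper uses six emotions plus neutral.
EMOTIONS = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]

def soft_affective_synthesis(emotion_probs, a2v_outputs):
    """Blend per-emotion A2V facial-feature predictions using the soft
    emotion likelihoods from the speech emotion recognizer.

    emotion_probs : (7,) array of emotion category likelihoods
    a2v_outputs   : (7, T, D) array; one facial-feature track per emotion
                    model (T frames, D facial features)
    Returns a (T, D) blended facial-feature track.
    """
    probs = np.asarray(emotion_probs, dtype=float)
    probs = probs / probs.sum()          # ensure the likelihoods sum to 1
    outputs = np.asarray(a2v_outputs, dtype=float)
    # Weighted sum over the emotion axis: sum_e p(e) * y_e
    return np.tensordot(probs, outputs, axes=(0, 0))
```

With a one-hot likelihood vector this reduces to hard selection of a single emotion-dependent model, so the soft scheme generalizes emotion-specific synthesis.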
The following sample videos are generated from the speech signal for each of the seven emotions, using Domain Adaptation, Model Adaptation, No Transfer Learning, and the ground truth (extracted via DLib).
Sample 1: Angry
Sample 2: Disgust
Sample 3: Fear
Sample 4: Happy
Sample 5: Neutral
Sample 6: Sad
Sample 7: Surprise
Model Comparisons: Neutral
Model Comparisons: Angry