We propose a fully automatic and speaker-independent framework for speech-driven affective synthesis and animation of arm gestures. The affective content of speech is represented using the continuous attributes activation, valence, and dominance. The following synthesis methods are considered (a small input-assembly sketch follows the list):
- Motion capture synthesis (Orig): uses the captured true motion in the animations.
- Affect-only driven synthesis (A): uses models of affect attributes for synthesizing gestures.
- Prosody-only driven synthesis (P): uses models of prosody features for synthesizing gestures.
- Joint affect and prosody driven synthesis (AP): uses models of the fusion of affect attributes and prosody features for synthesizing gestures.
- Prosody given affect driven synthesis (P|A): uses conditional models of prosody features given affect attributes for synthesizing gestures.
- Prosody given estimated affect driven synthesis (P|A’): uses conditional models of prosody features given estimated affect attributes for synthesizing gestures.
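To make the difference between these input configurations concrete, below is a minimal sketch of how the observation streams for each scenario could be assembled. The feature names, dimensions, and the `build_observation` helper are illustrative assumptions; the sketch does not show the actual statistical models used for gesture synthesis.

```python
import numpy as np

# Hypothetical per-frame features; names and dimensions are placeholders,
# not the exact features used in the framework.
affect = np.random.rand(100, 3)   # activation, valence, dominance per frame
prosody = np.random.rand(100, 4)  # e.g. pitch, intensity, and their deltas

def build_observation(scenario, affect, prosody):
    """Assemble the observation stream each synthesis scenario would model."""
    if scenario == "A":    # affect-only driven synthesis
        return affect
    if scenario == "P":    # prosody-only driven synthesis
        return prosody
    if scenario == "AP":   # joint synthesis: simple feature-level fusion
        return np.concatenate([affect, prosody], axis=1)
    if scenario in ("P|A", "P|A'"):
        # Conditional scenarios: prosody is the observation, but models are
        # trained/selected per affect condition (true affect for P|A,
        # estimated affect for P|A').
        return prosody, affect
    raise ValueError(f"unknown scenario: {scenario}")

print(build_observation("AP", affect, prosody).shape)  # (100, 7)
```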
We use a dyadic, multi-speaker, multimodal dataset for the speaker-independent framework. In this dataset, speakers frequently take turns and mostly hold the floor for a short time. We also use additional test audio from a TED talk for demonstration purposes.
The following sample animation videos compare two sets of methods: (A, P, AP, P|A, Orig) and (P|A, P|A’, Orig).
TED talk sample: Methods P, P|A’
All of the above videos use M=40 gesture clusters. To better understand the effect of the number of gesture clusters on animation quality, we also present animation results for the P|A scenario with a lower number of clusters (M=10) and a higher number of clusters (M=90) in the videos below.
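As a rough illustration of what varying M means, the sketch below groups gesture segments into M clusters with k-means. The feature representation, segment count, and use of scikit-learn's KMeans are assumptions made for illustration rather than the exact clustering procedure of the framework.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical gesture representation: each gesture segment flattened into a
# fixed-length joint-angle feature vector (random placeholders here).
num_segments, feature_dim = 500, 60
gesture_features = np.random.rand(num_segments, feature_dim)

for M in (10, 40, 90):
    # Partition segments into M clusters; each cluster can serve as a
    # representative gesture unit during synthesis.
    kmeans = KMeans(n_clusters=M, n_init=10, random_state=0).fit(gesture_features)
    sizes = np.bincount(kmeans.labels_, minlength=M)
    print(f"M={M}: largest cluster has {sizes.max()} segments")
```

A larger M yields finer-grained gesture units at the cost of fewer training examples per cluster, which is the trade-off the M=10 / M=40 / M=90 videos are meant to illustrate.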
Please use Chrome, Safari, or Internet Explorer to view the videos.