Multimodal analysis of speech and arm motion for prosody-driven synthesis of beat gestures, Elif Bozkurt, Yücel Yemez, Engin Erzin, Speech Communication 85 (2016) 29–42
We propose a framework for joint analysis of speech prosody and arm motion towards automatic synthesis and animation of gestures from speech in speaker-dependent and speaker-independent settings. The multimodal datasets used in the speaker-dependent and speaker-independent systems are in Turkish and English, respectively.
Animation systems:
1. Motion capture synthesis (MOCAP): uses the captured true motion in the animations.
2. Baseline synthesis (BASELINE): creates an animation from a sequence of randomly selected gestures, choosing gesture segments based solely on joint angle continuity (see the sketch after this list).
3. HSMM-based synthesis (HSMM): the proposed system.
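As a rough illustration of the baseline, the Python sketch below chains gesture segments using only joint angle continuity. The data layout (segments as frame-by-joint-angle arrays), the candidate pool size, and all function names are illustrative assumptions, not the paper's implementation.

    import numpy as np

    def continuity_cost(prev_segment, next_segment):
        # Joint angle discontinuity between the last frame of one segment
        # and the first frame of the next (segments: frames x joint angles).
        return np.linalg.norm(prev_segment[-1] - next_segment[0])

    def baseline_select(gesture_pool, num_segments, seed=0):
        # Chain randomly proposed gesture segments, scoring candidates only by
        # joint angle continuity with the previously chosen segment.
        rng = np.random.default_rng(seed)
        sequence = [gesture_pool[rng.integers(len(gesture_pool))]]
        for _ in range(num_segments - 1):
            n_candidates = min(10, len(gesture_pool))  # candidate count is an assumption
            candidates = rng.choice(len(gesture_pool), size=n_candidates, replace=False)
            best = min(candidates, key=lambda i: continuity_cost(sequence[-1], gesture_pool[i]))
            sequence.append(gesture_pool[best])
        return np.concatenate(sequence, axis=0)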
Animations generated in speaker-independent setting:
We use a dyadic, multi-speaker, and multimodal dataset for the speaker-independent framework. In this dataset, speakers take turns frequently and mostly hold the floor for a short time. We also use additional test audio data from an audio-book and a TED talk for demonstration purposes.
Animations generated in speaker-dependent setting:
We perform subjective tests using the speaker-dependent setting for gesture synthesis and consider the two cases below for the proposed method, depending on the rhythm cost weight (β) used in the unit selection algorithm (see the sketch after this list):
- HSMM (more rhythm): β = 0.5
- HSMM (less rhythm): β = 0.9
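The sketch below shows one way a rhythm cost weight β could enter the unit selection cost. The exact formulation is defined in the paper; the convex combination used here, in which a smaller β puts more emphasis on the rhythm term (consistent with β = 0.5 labeled "more rhythm" and β = 0.9 "less rhythm" above), is an assumption for illustration only.

    def unit_selection_cost(continuity_cost: float, rhythm_cost: float, beta: float) -> float:
        # Blend the joint angle continuity cost and the rhythm (speech-gesture
        # timing) cost; smaller beta gives more weight to rhythm.
        return beta * continuity_cost + (1.0 - beta) * rhythm_cost

    # Scoring the same candidate gesture segment under both test settings:
    more_rhythm = unit_selection_cost(continuity_cost=0.8, rhythm_cost=0.3, beta=0.5)  # 0.55
    less_rhythm = unit_selection_cost(continuity_cost=0.8, rhythm_cost=0.3, beta=0.9)  # 0.75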
Below we present 8 video clips from the pairwise subjective A/B tests in Turkish and 2 video clips of synthesis results with English input speech. The same speaker's speech data is used in all of the animations.
Examples from pairwise subjective evaluations (these animations are in Turkish):
Please use the Chrome, Safari, or Internet Explorer web browser to view the videos.