Multimodal analysis of speech and arm motion for prosody-driven synthesis of beat gestures, Elif Bozkurt, Yücel Yemez, Engin Erzin, Speech Communication 85 (2016) 29–42
We propose a framework for joint analysis of speech prosody and arm motion towards automatic synthesis and animation of gestures from speech in speaker-dependent and speaker-independent settings. The multimodal datasets used in the speaker-dependent and speaker-independent systems are in Turkish and English, respectively.
Animation systems:
1. Motion capture synthesis (MOCAP): uses the captured true motion in the animations.
2. Baseline synthesis (BASELINE): creates an animation from a sequence of randomly selected gestures, choosing gesture segments based solely on joint angle continuity (see the sketch after this list).
3. HSMM-based synthesis (HSMM): the proposed system.
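As a rough illustration of the baseline, the Python sketch below chains gesture segments using only joint angle continuity. The data layout (segments as frame-by-joint-angle arrays), the candidate pool size, and all function names are illustrative assumptions, not the paper's implementation.

    import numpy as np

    def continuity_cost(prev_segment, next_segment):
        # Joint angle discontinuity between the last frame of one segment
        # and the first frame of the next (segments: frames x joint angles).
        return np.linalg.norm(prev_segment[-1] - next_segment[0])

    def baseline_select(gesture_pool, num_segments, seed=0):
        # Chain randomly proposed gesture segments, scoring candidates only by
        # joint angle continuity with the previously chosen segment.
        rng = np.random.default_rng(seed)
        sequence = [gesture_pool[rng.integers(len(gesture_pool))]]
        for _ in range(num_segments - 1):
            n_candidates = min(10, len(gesture_pool))  # candidate count is an assumption
            candidates = rng.choice(len(gesture_pool), size=n_candidates, replace=False)
            best = min(candidates, key=lambda i: continuity_cost(sequence[-1], gesture_pool[i]))
            sequence.append(gesture_pool[best])
        return np.concatenate(sequence, axis=0)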
Animations generated in speaker-independent setting:
We use a dyadic, multi-speaker, and multimodal dataset for the speaker-independent framework. In this dataset, speakers take turns frequently and mostly hold the floor for a short time. We also use additional test audio data from an audio-book and a TED talk for demonstration purposes.
Animations generated in speaker-dependent setting:
We perform subjective tests using the speaker-dependent setting for gesture synthesis and consider the two cases below for the proposed method, depending on the rhythm cost weight (β) used in the unit selection algorithm (see the sketch after this list):
- HSMM (more rhythm): β = 0.5
- HSMM (less rhythm): β = 0.9
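The sketch below shows one way a rhythm cost weight β could enter the unit selection cost. The exact formulation is defined in the paper; the convex combination used here, in which a smaller β puts more emphasis on the rhythm term (consistent with β = 0.5 labeled "more rhythm" and β = 0.9 "less rhythm" above), is an assumption for illustration only.

    def unit_selection_cost(continuity_cost: float, rhythm_cost: float, beta: float) -> float:
        # Blend the joint angle continuity cost and the rhythm (speech-gesture
        # timing) cost; smaller beta gives more weight to rhythm.
        return beta * continuity_cost + (1.0 - beta) * rhythm_cost

    # Scoring the same candidate gesture segment under both test settings:
    more_rhythm = unit_selection_cost(continuity_cost=0.8, rhythm_cost=0.3, beta=0.5)  # 0.55
    less_rhythm = unit_selection_cost(continuity_cost=0.8, rhythm_cost=0.3, beta=0.9)  # 0.75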
Below we present 8 video clips from the pairwise subjective A/B tests in Turkish and 2 video clips of synthesis results with English input speech. The same speaker's speech data is used in all of the animations.
Examples from pairwise subjective evaluations (these animations are in Turkish):
Please use the Chrome, Safari, or Internet Explorer web browser to view the videos.