Databases

Nod & Smile Events on IEMOCAP – 2022

Head nod and smile backchannel events have been annotated on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database [1] and studied in two recent works [2, 3].

The Nod & Smile backchannel annotations are available upon request for academic purposes. You may contact E. Erzin.

[1] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, “IEMOCAP: Interactive emotional dyadic motion capture database,” Language Resources and Evaluation, vol. 42, no. 4, pp. 335–359, December 2008.

[2] B. B. Turker, Z. Buçinca, E. Erzin, Y. Yemez, and M. Sezgin, “Analysis of engagement and user experience with a laughter responsive social robot,” in Proc. Interspeech, 2017, pp. 844–848.

[3] B. B. Turker, E. Erzin, T. M. Sezgin, and Y. Yemez, “Audiovisual prediction of head-nod and turn-taking events in dyadic interactions,” in Proc. Interspeech, 2018.

eHRI Database – 2021

The Engagement in Human-Robot Interaction (eHRI) database contains natural interactions between two human participants and a robot in a story-shaping game scenario. The audio-visual recordings provided with the database are fully annotated on a 5-level intensity scale for head nods and smiles, as well as with speech transcriptions and continuous engagement values. The database includes 24 video clips recorded from 12 distinct groups of participants, with a total duration of 142 minutes.
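The release format of the eHRI annotations is not described above, so the following Python sketch only illustrates one plausible way to represent frame-level rows combining the 5-level nod and smile intensities with the continuous engagement value; the field names, comma-separated layout, and value ranges are assumptions, not the database's actual format.

    from dataclasses import dataclass

    # Hypothetical record layout for eHRI-style frame-level annotations.
    # Field names and value ranges are assumptions, not the official format.
    @dataclass
    class EHRIAnnotation:
        time_s: float         # timestamp within the clip, in seconds
        nod_intensity: int    # head-nod intensity on the 5-level scale (e.g. 0-4)
        smile_intensity: int  # smile intensity on the 5-level scale (e.g. 0-4)
        engagement: float     # continuous engagement value

    def parse_annotation_line(line: str) -> EHRIAnnotation:
        """Parse one comma-separated annotation row (hypothetical layout)."""
        t, nod, smile, eng = line.strip().split(",")
        return EHRIAnnotation(float(t), int(nod), int(smile), float(eng))

    # Example row in the assumed layout: "12.40,3,1,0.72"
    print(parse_annotation_line("12.40,3,1,0.72"))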

The eHRI database will be available soon for academic purposes. Readme for eHRI.

KUTM-FID Database – 2019

The food intake database (KUTM-FID) is constructed from tracheal throat microphone (TM) recordings. KUTM-FID contains recordings of 8 subjects, 4 male and 4 female, aged 22-29 years, none of whom had a history of chewing or swallowing abnormalities; participants did not receive incentives for participation. Intake signals were collected under clean laboratory conditions with the iASUS NT3 throat microphone at a 16 kHz sampling rate. Subjects consumed the prepared foods in the same amounts and order. A total of 10 different intake tasks were recorded, including chewing and swallowing of solids (potato chips, cake, biscuits, stick crackers, peanuts, and chocolate) and swallowing of liquids (water, milk, a fizzy drink, and fruit juice). Each subject visited the laboratory 5 times and consumed all 10 food items in each recording session. The average duration of a visit is around 7 minutes, and the total duration is 276 minutes.
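As an illustration only, a short Python sanity check of the figures quoted above: 8 subjects and 5 visits each give 40 recording sessions, and at roughly 7 minutes per visit this is about 280 minutes, close to the reported 276-minute total.

    # Sanity check of the KUTM-FID figures quoted above (illustrative only).
    n_subjects = 8
    n_visits_per_subject = 5
    avg_visit_minutes = 7           # "around 7 minutes" per visit
    n_intake_tasks_per_visit = 10   # 6 solid + 4 liquid items

    n_sessions = n_subjects * n_visits_per_subject             # 40 recording sessions
    approx_total_minutes = n_sessions * avg_visit_minutes      # ~280 min (reported: 276 min)
    n_task_instances = n_sessions * n_intake_tasks_per_visit   # 400 intake-task instances

    print(n_sessions, approx_total_minutes, n_task_instances)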

The KUTM-FID database will be available soon for academic purposes.

JESTKOD Database – 2016

The JESTKOD database consists of dyadic interaction recordings of 10 participants, 4 female and 6 male, aged 20 to 25. Agreement and disagreement interactions of the 5 dyads were collected in 5 sessions, all in Turkish. Each participant interacted with the same partner in both the agreement and disagreement settings and appeared in only one session. Each session contains 19-23 clip recordings of 2-4 minutes, in each of which the participants pick a topic on which they agree or disagree and engage in a dyadic interaction. The total duration of the recordings is 259 minutes.
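For illustration, a quick Python consistency check of the JESTKOD figures above: 5 sessions of 19-23 clips give 95-115 clips in total, so the 259-minute total corresponds to an average clip length of roughly 2.3-2.7 minutes, within the stated 2-4 minute range.

    # Rough consistency check of the JESTKOD figures quoted above (illustrative only).
    n_sessions = 5
    clips_per_session = (19, 23)   # 19-23 clips per session
    total_minutes = 259

    min_clips = n_sessions * clips_per_session[0]   # 95
    max_clips = n_sessions * clips_per_session[1]   # 115
    avg_clip_minutes = (total_minutes / max_clips, total_minutes / min_clips)
    print(min_clips, max_clips, avg_clip_minutes)   # ~2.3-2.7 min per clip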

The JESTKOD database is available upon request for academic purposes. You may contact E. Erzin.

Sample files from the JESTKOD database

STCM Database – 2015

The synchronous throat and close-talk acoustic microphone (STCM) dataset consists of 799 TIMIT-like phonetically balanced sentences in Turkish, where speech samples were recorded from one male speaker under clean conditions at a 16 kHz sampling rate, with a total duration of 45 minutes. An IASUS-GP3 headset and a Sony condenser tie-pin microphone were used for the parallel throat and close-talk microphone recordings, respectively. Phone-level transcripts are also available.
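A minimal Python sketch for loading one parallel throat/close-talk utterance pair and verifying the 16 kHz sampling rate, using only the standard-library wave module; the file names below are hypothetical placeholders, since the actual naming scheme depends on how the database is distributed.

    import wave

    # Minimal sketch for loading one parallel STCM utterance pair.
    # The file names are hypothetical placeholders.
    def load_info(path):
        with wave.open(path, "rb") as w:
            return w.getframerate(), w.getnframes()

    throat_rate, throat_frames = load_info("stcm_throat_0001.wav")
    close_rate, close_frames = load_info("stcm_close_0001.wav")

    assert throat_rate == close_rate == 16000   # 16 kHz sampling rate
    print("duration (s):", throat_frames / throat_rate, close_frames / close_rate)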

The STCM database is available upon request for academic purposes. You may contact E. Erzin.

Audio-Visual Database (MVGL-AVD)

The MVGL audio-visual database has been collected for multimodal speaker identification/verification applications. The audio-visual data were acquired using a Sony DSR-PD150P video camera at the Multimedia Vision and Graphics Laboratory of Koç University. The database includes 50 subjects, each uttering ten repetitions of her/his name as the secret phrase. A set of impostor data is also available, in which each subject utters five different names from the population.
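As an illustration of the protocol above, a short Python sketch enumerating the utterances it implies: 50 subjects with 10 genuine repetitions each (500 genuine utterances) and 5 impostor utterances each (250 impostor utterances). The subject and utterance identifiers are hypothetical placeholders, not the database's actual labels.

    # Illustrative sketch of the utterance inventory implied by the protocol above.
    # Subject and utterance identifiers are hypothetical placeholders.
    n_subjects = 50
    n_genuine_per_subject = 10   # ten repetitions of the subject's own name
    n_impostor_per_subject = 5   # five utterances of other subjects' names

    genuine_utts = [(s, f"own_{r}") for s in range(n_subjects)
                    for r in range(n_genuine_per_subject)]
    impostor_utts = [(s, f"imp_{r}") for s in range(n_subjects)
                     for r in range(n_impostor_per_subject)]

    print(len(genuine_utts), len(impostor_utts))   # 500 genuine, 250 impostor utterances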

The MVGL-AVD database is available upon request for academic purposes.

Sample images and videos from the MVGL-AVD database:

Story Telling Audio-Visual Database (MVGL-MASAL)

The MVGL-MASAL is a gesture-speech database. The database includes four recordings of a single subject telling stories in Turkish. Each story is approximately 7 minutes long, and the total duration of the database is 27 minutes and 45 seconds. The audio-visual data were synchronously captured from a stereo camera and a sound card. The stereo video captures only upper-body gestures at 30 frames per second, whereas the audio is recorded at a 16 kHz sampling rate with 16 bits per sample.
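Since the video runs at 30 frames per second and the audio at 16 kHz, one video frame spans 16000/30 ≈ 533.3 audio samples. A small Python sketch of this frame-to-sample alignment, offered only as an illustration of the arithmetic rather than as the database's distribution tooling:

    # Aligning the 30 fps stereo video with the 16 kHz audio (illustrative sketch).
    AUDIO_RATE = 16000   # audio samples per second
    VIDEO_FPS = 30       # video frames per second
    SAMPLES_PER_FRAME = AUDIO_RATE / VIDEO_FPS   # ~533.3 audio samples per video frame

    def frame_to_sample_range(frame_index: int) -> tuple[int, int]:
        """Return the [start, end) audio-sample range covered by a video frame."""
        start = round(frame_index * SAMPLES_PER_FRAME)
        end = round((frame_index + 1) * SAMPLES_PER_FRAME)
        return start, end

    print(frame_to_sample_range(0))     # (0, 533)
    print(frame_to_sample_range(100))   # (53333, 53867)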