32nd IEEE International Conference on Acoustics, Speech and Signal Processing, Hawaii, United States Of America, 15 - 20 April 2007, pp.677-678, (Full Text)
We present a new framework for joint analysis of head gesture and speech prosody patterns of a speaker towards automatic realistic synthesis of head gestures from speech prosody. The proposed two-stage analysis aims to "learn" both elementary prosody and head gesture patterns for a particular speaker, as well as the correlations between these head gesture and prosody patterns from a training video sequence. The resulting audio-visual mapping model is then employed to synthesize natural head gestures from arbitrary input test speech given a head model for the speaker. Objective and subjective evaluations indicate that the proposed synthesis by analysis scheme provides natural looking head gestures for the speaker with any input test speech.