2006
Kakumanu, P; Esposito, A; Garcia, O N; Gutierrez-Osuna, R: A comparison of acoustic coding models for speech-driven facial animation. Journal Article. In: Speech Communication, vol. 48, no. 6, pp. 598–615, 2006.

@article{kakumanu2006comparison,
title = {A comparison of acoustic coding models for speech-driven facial animation},
author = {P Kakumanu and A Esposito and O N Garcia and R Gutierrez-Osuna},
url = {https://psi.engr.tamu.edu/wp-content/uploads/2018/01/kakumanu2006comparison.pdf},
year = {2006},
date = {2006-01-01},
journal = {Speech communication},
volume = {48},
number = {6},
pages = {598--615},
publisher = {Elsevier},
abstract = {This article presents a thorough experimental comparison of several acoustic modeling techniques by their ability to capture information related to orofacial motion. These models include (1) Linear Predictive Coding and Linear Spectral Frequencies, which model the dynamics of the speech production system, (2) Mel Frequency Cepstral Coefficients and Perceptual Critical Feature Bands, which encode perceptual cues of speech, (3) spectral energy and fundamental frequency, which capture prosodic aspects, and (4) two hybrid methods that combine information from the previous models. We also consider a novel supervised procedure based on Fisher’s Linear Discriminants to project acoustic information onto a low-dimensional subspace that best discriminates different orofacial configurations. Prediction of orofacial motion from speech acoustics is performed using a non-parametric k-nearest-neighbors procedure. The sensitivity of this audio–visual mapping to coarticulation effects and spatial locality is thoroughly investigated. Our results indicate that the hybrid use of articulatory, perceptual and prosodic features of speech, combined with a supervised dimensionality-reduction procedure, is able to outperform any individual acoustic model for speech-driven facial animation. These results are validated on the 450 sentences of the TIMIT compact dataset.},
keywords = {Facial animation},
pubstate = {published},
tppubtype = {article}
}
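The following is a minimal sketch of the kind of pipeline this abstract describes, not the authors' implementation: acoustic frames are projected with Fisher's Linear Discriminants and mapped to facial motion with k-nearest neighbors. The arrays, dimensions, and the k-means step used to obtain discrete orofacial classes are hypothetical placeholders.

```python
# Illustrative sketch only (not the authors' code or data). Hypothetical arrays
# stand in for frame-level acoustic features and synchronized 3-D facial points.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
acoustic = rng.standard_normal((5000, 39))   # placeholder acoustic feature frames
facial = rng.standard_normal((5000, 18))     # placeholder orofacial marker frames

# Quantize facial frames into discrete orofacial configurations so that
# Fisher's Linear Discriminants have class labels to separate (assumed step).
labels = KMeans(n_clusters=16, n_init=10, random_state=0).fit_predict(facial)

# Supervised projection onto a low-dimensional, class-discriminating subspace.
lda = LinearDiscriminantAnalysis(n_components=10).fit(acoustic, labels)
acoustic_low = lda.transform(acoustic)

# Non-parametric k-nearest-neighbors mapping from audio to facial motion.
knn = KNeighborsRegressor(n_neighbors=5).fit(acoustic_low, facial)
predicted = knn.predict(acoustic_low[:100])  # predicted orofacial frames
```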
2005
Fu, S; Gutierrez-Osuna, R; Esposito, A; Kakumanu, P; Garcia, O N: Audio/visual mapping with cross-modal hidden Markov models. Journal Article. In: IEEE Transactions on Multimedia, vol. 7, no. 2, pp. 243–252, 2005.

@article{fu2005tmm,
title = {Audio/visual mapping with cross-modal hidden Markov models},
author = {S Fu and R Gutierrez-Osuna and A Esposito and P Kakumanu and O N Garcia},
url = {https://psi.engr.tamu.edu/wp-content/uploads/2018/01/fu2005tmm.pdf},
year = {2005},
date = {2005-01-01},
journal = {Multimedia, IEEE Transactions on},
volume = {7},
number = {2},
pages = {243--252},
publisher = {IEEE},
abstract = {The audio/visual mapping problem of speech-driven facial animation has intrigued researchers for years. Recent research efforts have demonstrated that hidden Markov model (HMM) techniques, which have been applied successfully to the problem of speech recognition, could achieve a similar level of success in audio/visual mapping problems. A number of HMM-based methods have been proposed and shown to be effective by the respective designers, but it is yet unclear how these techniques compare to each other on a common test bed. In this paper, we quantitatively compare three recently proposed cross-modal HMM methods, namely the remapping HMM (R-HMM), the least-mean-squared HMM (LMS-HMM), and HMM inversion (HMMI). The objective of our comparison is not only to highlight the merits and demerits of different mapping designs, but also to study the optimality of the acoustic representation and HMM structure for the purpose of speech-driven facial animation. This paper presents a brief overview of these models, followed by an analysis of their mapping capabilities on a synthetic dataset. An empirical comparison on an experimental audio-visual dataset consisting of 75 TIMIT sentences is finally presented. Our results show that HMMI provides the best performance, both on synthetic and experimental audio-visual data.},
keywords = {Facial animation},
pubstate = {published},
tppubtype = {article}
}
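The sketch below illustrates, very loosely, the remapping flavor of cross-modal HMMs discussed in this paper: an HMM is trained on acoustic frames and each hidden state is re-mapped to the mean of the visual frames aligned with it. Data, model sizes, and the mapping itself are illustrative assumptions; this is not the R-HMM, LMS-HMM, or HMMI implementation compared in the paper.

```python
# Rough sketch, loosely in the spirit of a remapping-style HMM (illustrative
# assumptions throughout; not the paper's R-HMM, LMS-HMM, or HMMI code).
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)
audio = rng.standard_normal((2000, 13))    # placeholder acoustic frames
visual = rng.standard_normal((2000, 18))   # synchronized placeholder facial frames

hmm = GaussianHMM(n_components=8, covariance_type="diag", n_iter=25, random_state=0)
hmm.fit(audio)
states = hmm.predict(audio)                # decoded hidden-state sequence

# Re-map each hidden audio state to the mean visual frame observed under it.
state_to_visual = np.vstack([
    visual[states == s].mean(axis=0) if np.any(states == s) else visual.mean(axis=0)
    for s in range(hmm.n_components)
])

# Audio-to-visual mapping for a (here, reused) utterance: decode states, emit means.
predicted_visual = state_to_visual[hmm.predict(audio[:200])]
```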
Gutierrez-Osuna, R; Kakumanu, P; Esposito, A; Garcia, O N; Bojorquez, A; Castillo, J L; Rudomin, I: Speech-driven facial animation with realistic dynamics. Journal Article. In: IEEE Transactions on Multimedia, vol. 7, no. 1, pp. 33–42, 2005.

@article{gutierrez2005tmm,
title = {Speech-driven facial animation with realistic dynamics},
author = {R Gutierrez-Osuna and P Kakumanu and A Esposito and ON Garcia and A Bojorquez and JL Castillo and I Rudomin},
url = {https://psi.engr.tamu.edu/wp-content/uploads/2018/01/gutierrez2005tmm.pdf},
year = {2005},
date = {2005-01-01},
journal = {Multimedia, IEEE Transactions on},
volume = {7},
number = {1},
pages = {33--42},
publisher = {IEEE},
abstract = {This work presents an integral system capable of generating animations with realistic dynamics, including the individualized nuances, of three-dimensional (3-D) human faces driven by speech acoustics. The system is capable of capturing short phenomena in the orofacial dynamics of a given speaker by tracking the 3-D location of various MPEG-4 facial points through stereovision. A perceptual transformation of the speech spectral envelope and prosodic cues are combined into an acoustic feature vector to predict 3-D orofacial dynamics by means of a nearest-neighbor algorithm. The Karhunen-Loève transformation is used to identify the principal components of orofacial motion, decoupling perceptually natural components from experimental noise. We also present a highly optimized MPEG-4 compliant player capable of generating audio-synchronized animations at 60 frames/s. The player is based on a pseudo-muscle model augmented with a nonpenetrable ellipsoidal structure to approximate the skull and the jaw. This structure adds a sense of volume that provides more realistic dynamics than existing simplified pseudo-muscle-based approaches, yet it is simple enough to work at the desired frame rate. Experimental results on an audiovisual database of compact TIMIT sentences are presented to illustrate the performance of the complete system.},
keywords = {Facial animation, Speech},
pubstate = {published},
tppubtype = {article}
}
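A minimal sketch of the Karhunen-Loève (principal component) step mentioned in this abstract, applied to hypothetical marker trajectories; the 95% variance threshold and array shapes are assumptions rather than values from the paper.

```python
# Minimal sketch of the Karhunen-Loève (PCA) step on hypothetical marker data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
markers = rng.standard_normal((3000, 27))   # e.g., 9 tracked MPEG-4 points x (x, y, z)

pca = PCA(n_components=0.95).fit(markers)   # keep components covering 95% of variance
scores = pca.transform(markers)             # low-dimensional motion trajectories
denoised = pca.inverse_transform(scores)    # reconstruction with minor components dropped
```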
2001
Kakumanu, P; Gutierrez-Osuna, R; Esposito, A; Bryll, R; Goshtasby, A; Garcia, O N: Speech driven facial animation. Conference. In: Proceedings of the 2001 Workshop on Perceptive User Interfaces, pp. 1–5, ACM, 2001.

@conference{kakumanu2001speech,
title = {Speech driven facial animation},
author = {P Kakumanu and R Gutierrez-Osuna and A Esposito and R Bryll and A Goshtasby and ON Garcia},
url = {https://psi.engr.tamu.edu/wp-content/uploads/2018/01/kakumanu2001speech.pdf},
year = {2001},
date = {2001-01-01},
booktitle = {Proceedings of the 2001 workshop on Perceptive user interfaces},
pages = {1--5},
organization = {ACM},
abstract = {The results reported in this article are an integral part of a larger project aimed at achieving perceptually realistic animations, including the individualized nuances, of three-dimensional human faces driven by speech. The audiovisual system that has been developed for learning the spatio-temporal relationship between speech acoustics and facial animation is described, including video and speech processing, pattern analysis, and MPEG-4 compliant facial animation for a given speaker. In particular, we propose a perceptual transformation of the speech spectral envelope, which is shown to capture the dynamics of articulatory movements. An efficient nearest-neighbor algorithm is used to predict novel articulatory trajectories from the speech dynamics. The results are very promising and suggest a new way to approach the modeling of synthetic lip motion of a given speaker driven by his/her speech. This would also provide clues toward a more general cross-speaker realistic animation.},
keywords = {Facial animation, Speech},
pubstate = {published},
tppubtype = {conference}
}
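As an illustration of the nearest-neighbor prediction idea in this abstract, the sketch below uses MFCCs (computed with librosa) as a stand-in for the paper's perceptual spectral transformation, stacks a short context window to approximate articulatory dynamics, and retrieves the facial frame of the closest stored audio frame. The synthetic signal, window width, and facial "database" are placeholders, not the paper's data or features.

```python
# Illustrative sketch only: MFCCs stand in for the perceptual spectral
# transformation, and the facial database is a random placeholder.
import numpy as np
import librosa
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
y = rng.standard_normal(16000 * 3)                        # placeholder 3-second signal
mfcc = librosa.feature.mfcc(y=y, sr=16000, n_mfcc=13).T   # frames x 13

def stack_context(frames, width=2):
    """Concatenate each frame with its +/- `width` neighbors (coarticulation context)."""
    padded = np.pad(frames, ((width, width), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(frames)] for i in range(2 * width + 1)])

features = stack_context(mfcc)                            # frames x 65
facial_db = rng.standard_normal((len(features), 18))      # placeholder facial frames

# Nearest-neighbor lookup: retrieve the facial frame of the closest stored audio frame.
nn = NearestNeighbors(n_neighbors=1).fit(features)
_, idx = nn.kneighbors(features[:50])
predicted_facial = facial_db[idx[:, 0]]
```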