ISSN 1000-0054  CN 11-2223/N
J Tsinghua Univ (Sci & Tech), 2011, Vol. 51, No. 9, pp. 1180-1186
CLC: TP391  Article ID: 1000-0054(2011)09-1180-07  Document code: A

Real-time speech-driven talking avatar

LI Bingfeng, XIE Lei, ZHOU Xiangzeng, FU Zhonghua, ZHANG Yanning

(Shaanxi Provincial Key Laboratory of Speech and Image Information Processing, School of Computer Science, Northwestern Polytechnical University, Xi'an 710129, China)

Abstract: This paper presents a real-time speech-driven talking avatar. Unlike most talking avatars, in which the speech-synchronized facial animation is generated offline, this avatar is able to speak with live speech input. Such a life-like talking avatar has many potential applications in videophones, virtual conferences, audio/video chat, and entertainment. Since phonemes are the smallest units of pronunciation, a real-time phoneme recognizer was built, and the synchronization between the live input speech and the facial motion uses a phoneme recognition and output algorithm. Coarticulation effects are handled by a dynamic viseme generation algorithm that computes the facial animation parameters (FAPs) from the recognized phonemes. An MPEG-4 compliant avatar model is driven by the generated FAPs. Tests show that the avatar motion is synchronized and natural, with MOS values of 3.42 and 3.50.

Key words: visual speech synthesis; talking avatar; facial animation

[Introduction: the Chinese body text on this page was lost in extraction. Surviving fragments indicate it reviews motion-driven, text-driven (text-to-speech, TTS) [3], and speech-driven talking avatars; MPEG-4 facial animation parameters (FAPs) [4]; concatenative visual synthesis [5]; Brand's voice puppetry [6]; hidden Markov model (HMM) based articulatory modeling [7]; and Bayesian methods.]

Received: 2011-07-15. Supported by grants 60802085, 61175018, 2011KJXX29, and 2011JM8009. Corresponding author: XIE Lei, e-mail: lxie@nwpu.edu.cn
[Page 1181. The Chinese body text was lost in extraction; surviving fragments outline the following structure.]

1  System overview

The real-time speech-driven system (Fig. 1) comprises four modules: voice activity detection (VAD), a real-time phoneme recognizer, dynamic viseme generation, and a 3-D avatar model driven by the generated FAPs.

2  System components

2.1  3-D avatar model

The 3-D head model was created with FaceGen and configured with XfaceEd into an MPEG-4 compliant avatar; fragments indicate the model is controlled through 66 FAPs defined over 84 facial feature points.
[Page 1182. The Chinese prose was lost in extraction; the equations and table fragments below are reconstructed from the surviving symbols.]

2.2  Real-time phoneme recognition and output

Let $O = (o_1, \ldots, o_I)$ ($I \ge 1$) be the acoustic observation sequence and $W = (w_1, \ldots, w_n)$ ($n \ge 1$) a phoneme sequence drawn from the phoneme set $L$. Phoneme recognition seeks

$$\hat{W} = \arg\max_{W \in L} P(W \mid O). \quad (1)$$

With HMM-based Viterbi decoding, a partial result is available after every frame $t$:

$$\hat{W}_t = (\hat{w}_t^1, \hat{w}_t^2, \ldots, \hat{w}_t^{\mathrm{last}(t)}) = \arg\max_{W \in L} P(W \mid (o_1, \ldots, o_t)). \quad (2)$$

As the next observations $o_{t+1}, \ldots, o_{t+N-1}$ arrive, the subsequent partial results are

$$\hat{W}_{t+1} = (\hat{w}_{t+1}^1, \ldots, \hat{w}_{t+1}^{\mathrm{last}(t+1)}),\ \ldots,\ \hat{W}_{t+N-1} = (\hat{w}_{t+N-1}^1, \ldots, \hat{w}_{t+N-1}^{\mathrm{last}(t+N-1)}). \quad (3)$$

A phoneme $\hat{w}$ is output once the last phoneme of $N$ consecutive partial results agrees,

$$\hat{w}_t^{\mathrm{last}(t)} = \hat{w}_{t+1}^{\mathrm{last}(t+1)} = \cdots = \hat{w}_{t+N-1}^{\mathrm{last}(t+N-1)} = \hat{w}, \quad (4)$$

subject to $\hat{w} \ne \hat{w}_{\mathrm{pre}}$, where $\hat{w}_{\mathrm{pre}}$ is the previously output phoneme. [Figs. 2 and 3 are lost; their labels suggest they illustrated the output scheme and the choice of the confirmation depth $N$, which trades output stability against delay.]

2.3  Dynamic viseme generation

2.3.1  Visemes

A viseme is the visual counterpart of a phoneme; phonemes with the same mouth shape (e.g., /p/ and /b/) share one viseme. Surviving fragments of Table 1 suggest the Chinese phonemes were clustered into 13 viseme classes, with groups such as {b, p, m}, {g, k, h}, {f}, {j, q, x}, {d, t, n}, {zh, ch, sh, r}, {l}, {z, c, s}, finals such as {a, ang}, {ou}, {i}, {ai, an}, {e, eng}, {u}, {ao}, {ei, en}, {v (/yu/)}, {o, er}, and silence (SIL); compound finals decompose into component visemes (e.g., iao = i + ao). Each viseme class is mapped to an MPEG-4 FAP configuration.
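The phoneme output scheme of Eqs. (2)-(4) can be sketched as follows. This is a minimal illustration, not the authors' code: `stable_phoneme_stream` is a hypothetical helper that consumes a stream of Viterbi partial hypotheses and emits a phoneme only after its identity has been stable for N consecutive frames and differs from the last emitted phoneme.

```python
from collections import deque

def stable_phoneme_stream(partial_results, N=3):
    """Yield phonemes from a stream of Viterbi partial hypotheses.

    `partial_results` yields the best partial phoneme sequence after each
    new frame (Eq. (2)).  A phoneme is emitted only when the last phoneme
    of N consecutive partial results agrees (Eq. (4)) and differs from the
    previously emitted phoneme (the w_pre condition).
    """
    history = deque(maxlen=N)   # last phonemes of the N latest hypotheses
    prev_out = None             # previously emitted phoneme (w_pre)
    for hyp in partial_results:
        if not hyp:
            continue
        history.append(hyp[-1])
        if len(history) == N and len(set(history)) == 1:
            w = history[0]
            if w != prev_out:
                prev_out = w
                yield w

# toy stream: partial hypotheses grow frame by frame
hyps = [["sil"], ["sil"], ["sil"],
        ["sil", "b"], ["sil", "b"], ["sil", "b"],
        ["sil", "b", "a"], ["sil", "b", "a"], ["sil", "b", "a"]]
print(list(stable_phoneme_stream(hyps, N=3)))  # ['sil', 'b', 'a']
```

Note how each phoneme appears in the output exactly N frames after it first becomes the last element of the partial hypothesis, which is the source of the recognizer's output delay discussed in the experiments.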
[Page 1183. The Chinese prose was lost in extraction; the equations below are reconstructed from the surviving symbols. A reference to [4] at the top of the page belongs to the lost viseme discussion.]

2.3.2  [Heading lost; associated with Fig. 4.]

2.3.3  FAP generation

For FAP $p$ and current viseme $s$, the dominance of $s$ over $p$ at time $t$ follows an exponential model:

$$D_{sp} = \begin{cases} \alpha_{sp}\, e^{-\theta_{sp}^{(-)} \tau^{c}}, & \tau \ge 0,\\ \alpha_{sp}\, e^{-\theta_{sp}^{(+)} |\tau|^{c}}, & \tau < 0, \end{cases} \quad (5)$$

where $\tau = t_{so} - t$ and $t_{so}$ is the onset time of viseme $s$; $\alpha_{sp}$ is the magnitude, $\theta_{sp}^{(-)}$ and $\theta_{sp}^{(+)}$ control the decay rates on the two sides of the onset, and $c$ is a constant. The dominances of the left and right neighboring visemes $l$ and $r$ are

$$D_{lp} = \alpha_{lp}\, e^{\operatorname{sgn}(\sigma)\,\theta_{lp} |\sigma|^{c}}, \quad \sigma = t_{lo} - t, \quad (6)$$

$$D_{rp} = \alpha_{rp}\, e^{-\operatorname{sgn}(\upsilon)\,\theta_{rp} |\upsilon|^{c}}, \quad \upsilon = t_{ro} - t, \quad (7)$$

with $t_{lo}$ and $t_{ro}$ the onset times of $l$ and $r$. The value of FAP $p$ is then the dominance-weighted blend of the viseme target $T_{sp}$ and the neutral target $T_{0p}$ [3]:

$$F_{sp}(t) = \frac{D_{sp}(t)\,T_{sp} + \big(D_{lp}(t) + D_{rp}(t)\big)\,T_{0p}}{D_{sp}(t) + D_{lp}(t) + D_{rp}(t)}, \quad (8)$$

where $T_{0p}$ is the neutral value of FAP $p$. For the transition handling below, the two branches of Eq. (5) are written separately:

$$D_{sp}^{(-)} = \alpha_{sp}\, e^{-\theta_{sp}^{(-)} \tau^{c}}, \quad \tau \ge 0, \quad (9)$$

$$D_{sp}^{(+)} = \alpha_{sp}\, e^{-\theta_{sp}^{(+)} |\tau|^{c}}, \quad \tau < 0. \quad (10)$$
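A minimal numeric sketch of the dominance model of Eqs. (5), (9), (10) and the FAP blend of Eq. (8). The parameter values are illustrative only, not those estimated in the paper; `|tau|` is used in both branches so that a non-integer exponent `c` stays well defined.

```python
import math

def dominance(alpha, theta_minus, theta_plus, t, t_onset, c=1.0):
    """Exponential dominance of a viseme on a FAP (Eqs. (5), (9), (10)).

    tau = t_onset - t; the decay rate is theta_minus for tau >= 0
    (before the onset) and theta_plus for tau < 0 (after the onset).
    The dominance peaks at alpha exactly at the viseme onset.
    """
    tau = t_onset - t
    theta = theta_minus if tau >= 0 else theta_plus
    return alpha * math.exp(-theta * abs(tau) ** c)

def fap_value(D_s, D_l, D_r, T_sp, T_0p):
    """Eq. (8): blend the current viseme target T_sp with the neutral
    target T_0p, weighted by the dominances of the current (D_s),
    left-neighbor (D_l) and right-neighbor (D_r) visemes."""
    return (D_s * T_sp + (D_l + D_r) * T_0p) / (D_s + D_l + D_r)

# at its onset a viseme's dominance is maximal, so the FAP is pulled
# toward the viseme target; away from the onset it relaxes to neutral
d_peak = dominance(1.0, 2.0, 3.0, t=0.5, t_onset=0.5)   # tau = 0 -> 1.0
d_late = dominance(1.0, 2.0, 3.0, t=1.0, t_onset=0.5)   # decayed
print(d_peak, d_late)
```

With only the current viseme active (`D_l = D_r = 0`), `fap_value` returns `T_sp` exactly, which matches Eq. (8) term by term.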
[Page 1184. The Chinese prose was lost in extraction; the equation is reconstructed from the surviving symbols.]

During the transition from viseme $i$ (onset $t_{io}$) to the following viseme $j$ (onset $t_{jo}$), the value of FAP $p$ blends the two viseme targets $T_{ip}$ and $T_{jp}$, weighted by the decaying dominance $D_{ip}^{(+)}$ of the outgoing viseme and the rising dominance $D_{jp}^{(-)}$ of the incoming one:

$$F_p(t) = \frac{D_{jp}^{(-)}(t)\,T_{jp} + D_{ip}^{(+)}(t)\,T_{ip}}{D_{jp}^{(-)}(t) + D_{ip}^{(+)}(t)}. \quad (11)$$

[Figs. 5 and 6 (panels 6a-6e) are lost; surviving labels indicate they plotted the dominance curves $D_{ip}^{(-)}$, $D_{ip}^{(+)}$, $D_{jp}^{(-)}$, $D_{kp}^{(-)}$, $D_{lp}$, $D_{rp}$ and trajectory segments $L_1$-$L_4$ over intervals $T_1$-$T_3$ for three consecutive visemes $i$, $j$, $k$.]

3  Experiments

3.1  Real-time phoneme recognizer

[Prose lost. Fragments indicate: HMMs with Gaussian mixture observation densities, including a silence (sil) model, some 60 models in total; training with HTK [9] on about 90 h of speech; 16 kHz, 16-bit audio; 12 Mel-frequency cepstral coefficients (MFCC) extended to 39-dimensional features; 25 ms frames with a 10 ms shift; Viterbi decoding.]

3.2  Subjective evaluation

[Prose lost. Fragments indicate a mean opinion score (MOS) test comparing four systems, the first being 1) forced alignment with HTK, used as an upper bound; the remaining systems are listed on the next page.]
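The transition blend of Eq. (11) can be sampled numerically. This is an illustrative sketch only: the dominance shape follows Eqs. (9)-(10) with `c = 1` and a single made-up rate `theta = 10.0` for both visemes, not the parameters estimated in the paper.

```python
import math

def transition_fap(t, T_ip, T_jp, t_io, t_jo, alpha=1.0, theta=10.0, c=1.0):
    """FAP value during the transition from viseme i (onset t_io) to
    viseme j (onset t_jo), per Eq. (11): each viseme target is weighted
    by an exponential dominance centred on its own onset."""
    D_i = alpha * math.exp(-theta * abs(t - t_io) ** c)  # outgoing viseme i
    D_j = alpha * math.exp(-theta * abs(t - t_jo) ** c)  # incoming viseme j
    return (D_j * T_jp + D_i * T_ip) / (D_j + D_i)

# sample the trajectory between the onsets t_io = 0.0 s and t_jo = 0.2 s
traj = [transition_fap(t / 100, T_ip=0.0, T_jp=1.0, t_io=0.0, t_jo=0.2)
        for t in range(0, 21, 5)]
print([round(v, 3) for v in traj])
```

The trajectory rises monotonically from near `T_ip` to near `T_jp` and passes exactly through the midpoint value halfway between the two onsets, giving the smooth coarticulated motion the dominance model is designed for.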
[Page 1185. The Chinese prose was lost in extraction; the following is recovered from the surviving numbers and labels.]

The remaining compared systems were 2), 3) the proposed real-time system, and 4) (descriptions lost). Fragments suggest about 10 subjects (5 male, 5 female) rated 20 test utterances of roughly 4 s each; a figure of 26.34% survives, most likely the phoneme error rate of the real-time recognizer. Demo videos are available at http://www.nwpu-aslp.org/talkingavatar.html. Reference [10] is cited on this page.

Table 2 (reconstructed from the surviving numbers; the two MOS columns are unlabeled in the fragments, and the proposed system's values match the abstract):

  System                                  MOS     MOS     Delay/ms
  1  HTK forced alignment (upper bound)   4.34    4.01        0
  2  (description lost)                   3.98    3.81        0
  3  proposed real-time system            3.42    3.50       45
  4  (description lost)                   3.21    3.16      110

Section 3.3 analyzed the rating distributions with interquartile-range (IQR) boxplots (Figs. 7 and 8, lost), discarding outliers before computing the MOS values; the proposed system achieves MOS 3.42 and 3.50 at a delay of about 45 ms, with a gap of roughly 0.34 to the forced-alignment upper bound noted in the fragments.
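The IQR-based outlier treatment mentioned in Section 3.3 can be sketched as below. This is a generic illustration, not the authors' procedure: the paper's exact quantile convention is unknown, so linear interpolation between order statistics is assumed, with the usual 1.5-IQR fences.

```python
def iqr_filtered_mos(scores):
    """Drop ratings outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (the standard
    boxplot outlier fences), then return the mean of the rest."""
    s = sorted(scores)

    def quantile(q):
        # linearly interpolated quantile over the sorted ratings
        pos = q * (len(s) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(s) - 1)
        return s[lo] + (pos - lo) * (s[hi] - s[lo])

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    keep = [x for x in s if q1 - 1.5 * iqr <= x <= q3 + 1.5 * iqr]
    return sum(keep) / len(keep)

ratings = [3, 4, 3, 4, 3, 4, 3, 1]   # one implausibly low outlier rating
print(iqr_filtered_mos(ratings))
```

With the single outlier rating of 1 discarded, the mean is taken over the remaining seven ratings; without the filter the outlier would drag the MOS down noticeably.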
4  Conclusions

[The Chinese conclusion is lost in extraction; fragments indicate it summarizes the real-time FAP generation pipeline and the 3-D MPEG-4 avatar, with MOS values of 3.42 and 3.50.]

References

[1] Cosatto E, Ostermann J, Graf H P, et al. Lifelike talking faces for interactive services [J]. Proceedings of the IEEE, 2003, 91(9): 1406-1429.
[2] TANG Hao, FU Yun, TU Jilin, et al. Humanoid audio-visual avatar with emotive text-to-speech synthesis [J]. IEEE Transactions on Multimedia, 2008, 10(6): 969-981.
[3] WU Zhiyong, ZHANG Shen, CAI Lianhong, et al. Real-time synthesis of Chinese visual speech and facial expressions using MPEG-4 FAP features in a three-dimensional avatar [C]// The International Conference on Spoken Language Processing. Pittsburgh, 2006: 1802-1805.
[4] Pandzic I S, Forchheimer R. MPEG-4 Facial Animation [M]. New York: Wiley, 2002.
[5] HUANG Fujie, Cosatto E, Graf H. Triphone based unit selection for concatenative visual speech synthesis [C]// IEEE International Conference on Acoustics, Speech and Signal Processing. NJ: IEEE Press, 2002: 2037-2040.
[6] Brand M. Voice puppetry [C]// Proceedings of SIGGRAPH 99. NY: ACM Press, 1999: 21-28.
[7] XIE Lei, LIU Zhiqiang. Realistic mouth-synching for speech-driven talking face using articulatory modeling [J]. IEEE Transactions on Multimedia, 2007, 9(3): 500-510.
[8] WANG Zhiming, CAI Lianhong. A dynamic viseme model and parameter estimation [J]. Journal of Software, 2003, 14(3): 461-466. (in Chinese)
[9] Young S, Evermann G, Kershaw D, et al. The HTK Book [M]. Cambridge: Cambridge University Engineering Department, 2009.
[10] WANG Lijia, LIN Tao. Phonetics Course [M]. Beijing: Peking University Press, 1992. (in Chinese)