
ISSN 1000-0054  CN 11-2223/N
Journal of Tsinghua University (Science & Technology), 2011, Vol. 51, No. 9, pp. 1180-1186

A real-time speech-driven talking avatar

LI Bingfeng, XIE Lei, ZHOU Xiangzeng, FU Zhonghua, ZHANG Yanning
(Shaanxi Provincial Key Laboratory of Speech and Image Information Processing, School of Computer Science, Northwestern Polytechnical University, Xi'an 710129, China)

Abstract: This paper presents a real-time speech-driven talking avatar. Unlike most talking avatars, in which the speech-synchronized facial animation is generated offline, this talking avatar is able to speak with live speech input. Such a life-like talking avatar has many potential applications in videophones, virtual conferences, audio/video chats and entertainment. Since phonemes are the smallest units of pronunciation, a real-time phoneme recognizer was built. The synchronization between the input live speech and the facial motion is handled by a phoneme recognition and output algorithm. Coarticulation effects are covered by a dynamic viseme generation algorithm that computes the facial animation parameters (FAPs) from the input phonemes. An MPEG-4 compliant avatar model is driven by the generated FAPs. Tests show that the avatar motion is synchronized and natural, with MOS values of 3.42 and 3.50.

Key words: visual speech synthesis; talking avatar; facial animation
CLC number: TP391    Document code: A    Article ID: 1000-0054(2011)09-1180-07

[Introduction: the Chinese body text was lost in extraction. The recoverable fragments contrast motion-driven, text-driven (text-to-speech, TTS) and speech-driven talking avatars [3], and cite the MPEG-4 facial animation parameter (FAP) framework [4], concatenative visual speech synthesis [5], voice puppetry [6] and HMM-based (hidden Markov model) speech-driven mouth synching [7].]

Received: 2011-07-15. Supported by grants 60802085, 61175018, 2011KJXX29 and 2011JM8009. First author: LI Bingfeng (b. 1988). Corresponding e-mail: lxie@nwpu.edu.cn

[p. 1181: the Chinese body text was lost in extraction. The recoverable fragments indicate that Section 1 gives an overview of the real-time speech-driven pipeline (Fig. 1): live speech is screened by voice activation detection (VAD), recognized into phonemes, converted into FAP streams, and rendered on a 3-D avatar. Section 2.1 describes the avatar model: a 3-D head built with FaceGen and prepared with XfaceEd for MPEG-4 facial animation, which defines 66 low-level FAPs over 84 facial feature points.]
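The recovered text does not show how FAP frames are handed to the renderer, so the following Python sketch only illustrates a minimal MPEG-4-style FAP frame container. The 68-slot layout (2 high-level plus 66 low-level FAPs) and the FAP numbers in the example follow the MPEG-4 facial animation standard; the FapFrame class, its methods and the example values are hypothetical.

from dataclasses import dataclass, field
from typing import List

NUM_FAPS = 68  # MPEG-4: FAP 1-2 are high-level (viseme, expression), FAP 3-68 are low-level


@dataclass
class FapFrame:
    """One frame of MPEG-4 facial animation parameters (hypothetical container)."""
    frame_index: int
    values: List[float] = field(default_factory=lambda: [0.0] * NUM_FAPS)
    mask: List[bool] = field(default_factory=lambda: [False] * NUM_FAPS)  # which FAPs are set in this frame

    def set_fap(self, fap_number: int, value: float) -> None:
        """Set a FAP by its 1-based MPEG-4 number (e.g. 3 = open_jaw)."""
        self.values[fap_number - 1] = value
        self.mask[fap_number - 1] = True


# Example: a frame that only opens the jaw (FAP 3) and stretches the left lip corner (FAP 6).
frame = FapFrame(frame_index=0)
frame.set_fap(3, 512.0)   # magnitudes are expressed in FAP units (FAPU); example values only
frame.set_fap(6, 128.0)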

2.2 Real-time phoneme recognition and output
Let O = (o_1, \dots, o_I), I \ge 1, be the sequence of acoustic observations and W = (w_1, \dots, w_n), n \ge 1, a phoneme sequence over the phoneme set L. Standard recognition searches for

\hat{W} = \arg\max_{W \in L} P(W \mid O).    (1)

With HMM Viterbi decoding, a partial result is already available at every frame t:

\hat{W}_t = (\hat{w}_t^1, \hat{w}_t^2, \dots, \hat{w}_t^{\mathrm{last}(t)}) = \arg\max_{W \in L} P(W \mid (o_1, \dots, o_t)).    (2)

After the next N-1 observations o_{t+1}, \dots, o_{t+N-1} the decoder likewise yields

\hat{W}_{t+1} = (\hat{w}_{t+1}^1, \dots, \hat{w}_{t+1}^{\mathrm{last}(t+1)}), \; \dots, \; \hat{W}_{t+N-1} = (\hat{w}_{t+N-1}^1, \dots, \hat{w}_{t+N-1}^{\mathrm{last}(t+N-1)}).    (3)

If the final phoneme of all N consecutive partial results is the same,

\hat{w}_t^{\mathrm{last}(t)} = \hat{w}_{t+1}^{\mathrm{last}(t+1)} = \dots = \hat{w}_{t+N-1}^{\mathrm{last}(t+N-1)} = \hat{w},    (4)

the hypothesis \hat{w} is taken as stable: it is output if \hat{w} \ne \hat{w}_{\mathrm{pre}}, the phoneme output previously, and discarded otherwise. The look-ahead length N therefore trades recognition stability against output latency (see the sketch below). [The surrounding Chinese prose and figure discussion were lost in extraction.]
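The output rule of Eqs. (2)-(4) amounts to simple bookkeeping on top of any decoder that can report its best partial hypothesis after each frame. The following Python sketch is an illustration, not the authors' implementation: partial_best_sequence is a hypothetical hook standing in for \hat{W}_t of Eq. (2), and the default look-ahead lookahead_n = 3 is a placeholder, since the value of N used in the paper is not recoverable.

from collections import deque
from typing import Callable, Deque, List, Optional


def stream_phonemes(
    partial_best_sequence: Callable[[int], List[str]],  # hypothetical decoder hook: frame index -> best partial phoneme sequence
    num_frames: int,
    lookahead_n: int = 3,  # N in Eq. (4); placeholder value
):
    """Yield (frame, phoneme) pairs according to the stability rule of Eqs. (2)-(4)."""
    window: Deque[str] = deque(maxlen=lookahead_n)  # last phoneme of the N most recent partial results
    previous_output: Optional[str] = None           # \hat{w}_pre in Eq. (4)

    for t in range(num_frames):
        partial = partial_best_sequence(t)           # \hat{W}_t of Eq. (2)
        if not partial:
            continue
        window.append(partial[-1])                   # \hat{w}_t^{last(t)}

        # Eq. (4): the final phoneme must agree over N consecutive partial results.
        if len(window) == lookahead_n and len(set(window)) == 1:
            candidate = window[0]
            if candidate != previous_output:         # emit only when the stable phoneme changes
                previous_output = candidate
                yield t, candidate

With a 10 ms frame shift (Section 3.1), an agreement window of N frames adds roughly (N - 1) x 10 ms to the output latency, which is the trade-off that N controls.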

2.3 Dynamic viseme generation
Acoustically distinct phonemes such as /p/ and /b/ look the same on the face and are therefore grouped into visemes, each associated with a target MPEG-4 FAP configuration [4]. The phone set of 21 initials and 38 finals (59 phones in all) is clustered into viseme classes, and compound finals are decomposed into their components (e.g. iao = i + ao). Table 1 lists the groupings recoverable from the extracted text.

Table 1  Viseme classes
  b p m | g k h | a ang | ou | i
  f | j q x | ai an | e eng | u
  d t n | zh ch sh r | ao | ei en | v (/yu/)
  l | z c s | o | er | SIL (silence)

For every FAP p, each viseme class s has a target value T_{sp}, and T_{0p} denotes the corresponding neutral (rest) value.

Coarticulation is modeled with exponential dominance functions (Sections 2.3.2-2.3.3). For the current viseme s,

D_{sp}(t) = \alpha_{sp} e^{-\theta_{sp}^{(-)} |\tau|^{c}} for \tau \ge 0, and D_{sp}(t) = \alpha_{sp} e^{-\theta_{sp}^{(+)} |\tau|^{c}} for \tau < 0,    (5)

where \alpha_{sp}, \theta_{sp}^{(-)}, \theta_{sp}^{(+)} and c are model parameters and \tau = t_{so} - t measures the offset from the center time t_{so} of viseme s. The left and right neighboring visemes contribute

D_{lp}(t) = \alpha_{lp} e^{-\operatorname{sgn}(\sigma) \theta_{lp} |\sigma|^{c}}, with \sigma = t_{lo} - t,    (6)

D_{rp}(t) = \alpha_{rp} e^{-\operatorname{sgn}(\upsilon) \theta_{rp} |\upsilon|^{c}}, with \upsilon = t_{ro} - t,    (7)

and the FAP trajectory is the dominance-weighted blend of the viseme target and the neutral value:

F_{sp}(t) = \frac{D_{sp}(t) T_{sp} + (D_{lp}(t) + D_{rp}(t)) T_{0p}}{D_{sp}(t) + D_{lp}(t) + D_{rp}(t)}.    (8)

Splitting the two branches of Eq. (5) gives the one-sided dominance functions used at viseme transitions:

D_{sp}^{(-)} = \alpha_{sp} e^{-\theta_{sp}^{(-)} |\tau|^{c}}, \tau \ge 0;    (9)
D_{sp}^{(+)} = \alpha_{sp} e^{-\theta_{sp}^{(+)} |\tau|^{c}}, \tau < 0.    (10)

[The Chinese prose around these equations and the related figure discussion were lost in extraction.]
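To make Eqs. (5)-(8) concrete, here is a small Python sketch of the dominance-weighted FAP blend. It is an illustration under stated assumptions, not the paper's code: the exponents are read as |tau|^c, the neighbor dominances reuse the same two-sided form as Eq. (5), and every parameter value in the example is made up.

import math
from typing import NamedTuple


class DominanceParams(NamedTuple):
    alpha: float        # peak dominance (alpha_sp in Eq. (5))
    theta_minus: float  # decay rate used for tau >= 0
    theta_plus: float   # decay rate used for tau < 0
    c: float            # shape exponent


def dominance(params: DominanceParams, tau: float) -> float:
    """Two-sided dominance function of Eq. (5); tau = viseme center time - current time."""
    theta = params.theta_minus if tau >= 0 else params.theta_plus
    return params.alpha * math.exp(-theta * abs(tau) ** params.c)


def blended_fap(
    t: float,
    t_s: float, t_l: float, t_r: float,       # centers of the current, left and right visemes
    p_s: DominanceParams, p_l: DominanceParams, p_r: DominanceParams,
    target_sp: float,                          # T_sp: FAP target of the current viseme
    neutral_p: float,                          # T_0p: neutral value of the FAP
) -> float:
    """FAP trajectory F_sp(t) of Eq. (8): dominance-weighted blend of target and neutral values."""
    d_s = dominance(p_s, t_s - t)              # Eq. (5)
    d_l = dominance(p_l, t_l - t)              # stands in for D_lp of Eq. (6)
    d_r = dominance(p_r, t_r - t)              # stands in for D_rp of Eq. (7)
    return (d_s * target_sp + (d_l + d_r) * neutral_p) / (d_s + d_l + d_r)


# Example with made-up numbers: FAP 3 (open_jaw) evaluated just after the current viseme center.
params = DominanceParams(alpha=1.0, theta_minus=2.0, theta_plus=2.0, c=1.0)
print(blended_fap(t=0.10, t_s=0.08, t_l=0.00, t_r=0.20,
                  p_s=params, p_l=params, p_r=params,
                  target_sp=512.0, neutral_p=0.0))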

At a transition from viseme i to viseme j, the one-sided dominance functions of Eqs. (9) and (10) are combined so that the outgoing target T_{ip} is faded out while the incoming target T_{jp} is faded in:

F_{p}(t) = \frac{D_{jp}^{(-)}(t) T_{jp} + D_{ip}^{(+)}(t) T_{ip}}{D_{jp}^{(-)}(t) + D_{ip}^{(+)}(t)},    (11)

where t_{io} and t_{jo} are the center times of visemes i and j. [The Chinese walk-through of the transition segments L_1-L_4 and the time points T_1-T_3 in Figs. 5 and 6 was lost in extraction; a sketch of the blend follows.]
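The transition blend of Eq. (11), in the same illustrative style: the one-sided exponentials below stand in for D_{ip}^{(+)} and D_{jp}^{(-)} of Eqs. (9) and (10), and the timings, parameters and targets are invented for the example.

import math


def one_sided_dominance(alpha: float, theta: float, c: float, tau: float) -> float:
    """One-sided dominance of Eqs. (9)/(10): decays with the distance |tau| from the viseme center."""
    return alpha * math.exp(-theta * abs(tau) ** c)


def transition_fap(t: float, t_io: float, t_jo: float,
                   target_ip: float, target_jp: float) -> float:
    """Eq. (11): blend the outgoing viseme i and the incoming viseme j for one FAP p."""
    d_i_plus = one_sided_dominance(alpha=1.0, theta=2.0, c=1.0, tau=t_io - t)   # D_ip^(+), fading out
    d_j_minus = one_sided_dominance(alpha=1.0, theta=2.0, c=1.0, tau=t_jo - t)  # D_jp^(-), fading in
    return (d_j_minus * target_jp + d_i_plus * target_ip) / (d_j_minus + d_i_plus)


# Sweep t across a transition from viseme i (center 0.0 s) to viseme j (center 0.2 s).
for t in (0.0, 0.05, 0.10, 0.15, 0.20):
    print(round(transition_fap(t, t_io=0.0, t_jo=0.2, target_ip=100.0, target_jp=400.0), 1))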

3 Experiments
[Most of the Chinese prose of the experimental section was lost in extraction; the recoverable facts and figures are summarized below.]

3.1 Phoneme recognizer. HMMs with Gaussian observation densities, together with a silence model (sil), were trained on 90 h of speech using the HTK toolkit [9]. The audio is sampled at 16 kHz with 16-bit quantization; 12 Mel-frequency cepstral coefficients (MFCCs) are expanded into 39-dimensional feature vectors computed over 25 ms frames with a 10 ms shift, and decoding uses the Viterbi algorithm. A rate of 26.34% is reported for the recognizer (the Chinese text identifying the exact metric was lost).

3.2 Subjective evaluation. The facial animation produced by four systems was rated with the mean opinion score (MOS); the test utterances are about 4 s long, and a demo is available at http://www.nwpu-aslp.org/talkingavatar.html. System 1 generates the animation from an HTK forced alignment of the test speech and serves as the upper bound; system 3 is the proposed real-time system (its scores are the ones quoted in the abstract); the descriptions of systems 2 and 4 were not recovered.

Table 2  Recoverable MOS results (column labels inferred from the abstract's "synchronized and natural")
  System                                  MOS (sync.)   MOS (naturalness)   Delay/ms
  1  HTK forced alignment (upper bound)      4.34            4.01               0
  2  (description not recovered)             3.98            3.81               0
  3  Proposed real-time system               3.42            3.50              45
  4  (description not recovered)             3.21            3.16             110

3.3 Discussion. The MOS distributions are analyzed with box plots, using the interquartile range (IQR) to flag outliers (Figs. 7 and 8). [The Chinese discussion was lost; the recoverable fragments additionally quote an MOS difference of 0.34 between systems and cite the phonetics reference [10].]
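The IQR-based outlier screening mentioned in Section 3.3 can be reproduced with standard tools. The sketch below assumes the conventional 1.5 x IQR whisker rule (the paper's exact criterion is not recoverable) and uses made-up ratings.

import numpy as np


def iqr_outlier_mask(ratings: np.ndarray, k: float = 1.5) -> np.ndarray:
    """Boolean mask of ratings lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(ratings, [25, 75])
    iqr = q3 - q1
    return (ratings < q1 - k * iqr) | (ratings > q3 + k * iqr)


# Made-up MOS ratings for one system; real values would come from the listening test.
mos = np.array([3.0, 3.5, 3.5, 4.0, 3.0, 3.5, 1.0, 4.0, 3.5, 3.0])
outliers = iqr_outlier_mask(mos)
print("outliers:", mos[outliers])                        # the stray 1.0 rating
print("mean MOS without outliers:", round(mos[~outliers].mean(), 2))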

Conclusions
[The Chinese text of the conclusions was lost in extraction; the recoverable fragments restate that the generated FAPs drive the MPEG-4 3-D avatar and that the MOS scores are 3.42 and 3.50.]

References
[1] Cosatto E, Ostermann J, Graf H P, et al. Lifelike talking faces for interactive services [J]. Proceedings of the IEEE, 2003, 91(9): 1406-1429.
[2] TANG Hao, FU Yun, TU Jilin, et al. Humanoid audio-visual avatar with emotive text-to-speech synthesis [J]. IEEE Transactions on Multimedia, 2008, 10(6): 969-981.
[3] WU Zhiyong, ZHANG Shen, CAI Lianhong, et al. Real-time synthesis of Chinese visual speech and facial expressions using MPEG-4 FAP features in a three-dimensional avatar [C]// The International Conference on Spoken Language Processing. Pittsburgh, 2006: 1802-1805.
[4] Pandzic I S, Forchheimer R. MPEG-4 Facial Animation [M]. New York: Wiley, 2002.
[5] HUANG Fujie, Cosatto E, Graf H. Triphone based unit selection for concatenative visual speech synthesis [C]// IEEE International Conference on Acoustics, Speech and Signal Processing. NJ: IEEE Press, 2002: 2037-2040.
[6] Brand M. Voice puppetry [C]// Proceedings of SIGGRAPH 99. NY: ACM Press, 1999: 21-28.
[7] XIE Lei, LIU Zhiqiang. Realistic mouth-synching for speech-driven talking face using articulatory modeling [J]. IEEE Transactions on Multimedia, 2007, 9(3): 500-510.
[8] WANG Zhiming, CAI Lianhong. A dynamic viseme model and parameter estimation [J]. Journal of Software, 2003, 14(3): 461-466. (in Chinese)
[9] Young S, Evermann G, Kershaw D, et al. The HTK Book [M]. Cambridge: Cambridge University Engineering Department, 2009.
[10] WANG Lijia, LIN Tao. Phonetics Course [M]. Beijing: Peking University Press, 1992. (in Chinese)