Title: Synthesize Emotional Speech (Synthesis of emotional speech using subspace constraints in prosody)


> Examples of synthesized speech: Synthesis of emotional speech using subspace constraints in prosody

Our method can synthesize emotional speech with graded emotion intensity.

| target emotion | word (accent type, # of morae, trained / not trained) | emotion intensity |
| --- | --- | --- |
| "anger" | /amamizu/ (LHLL, 4, trained) | 0 (neutral) -> 3 -> 6 (max)*** |
| "boredom" | /aomori/ (LHLL, 4, not trained) | 0 (neutral) -> 3 -> 6 (max) |
| "joy" | /arawani/ (HLLL, 4, trained) | 0 (neutral) -> 3 -> 6 (max) |
| "sorrow" | /omoumamani/ (LHLLLL, 6, trained) | 0 (neutral) -> 3 -> 6 (max) |
| "disgust" | /naname/ (LHL, 3, trained) | 0 (neutral) -> 3 -> 6 (max) |
| "depression" | /oborozukiyo/ (LHHHHL, 6, trained) | 0 (neutral) -> 3 -> 6 (max) |

*** The emotion vector e given here was [6 0 0 0 0 0 0 0 0 0 0 0]^T.
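For concreteness, such a target vector can be formed as in the minimal Python/NumPy sketch below; the 12 emotion categories are those listed in the method section further down this page, and their ordering (with "anger" first) is an assumption made only so the example matches the footnote.

```python
import numpy as np

# The 12 emotion categories used in this work (K = 12). The ordering is an
# assumption; "anger" is placed first so that e = [6 0 ... 0]^T matches the
# footnote above.
EMOTIONS = ["anger", "joy", "disgust", "surprise", "despise", "pride",
            "depression", "amusing", "sorrow", "boredom", "suffering", "shame"]

def target_emotion_vector(emotion: str, intensity: float) -> np.ndarray:
    """Build a 12-dimensional emotion vector with a single target emotion
    set to the requested intensity (0 = neutral, 6 = maximum)."""
    e = np.zeros(len(EMOTIONS))
    e[EMOTIONS.index(emotion)] = intensity
    return e

print(target_emotion_vector("anger", 6.0))   # -> [6. 0. 0. ... 0.]
```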

> Research purpose and approach

The concatenative approach to speech synthesis has proven to be the most promising for achieving human-like quality because it concatenates human voice segments stored in a database. In other words, conventional concatenative systems can only synthesize what is stored in the database. However, it is difficult to collect a sufficient variety of waveforms for synthesizing various kinds and gradations of emotional content, given the degrees of freedom required by the possible emotional contexts.

We analyzed speech data containing a variety of emotions and found that the prosody patterns (i.e., speaking styles: the contours of F0 and power and the lengths of the morae) are approximately normally distributed (a mean pattern plus variation around it). We use principal component analysis to parameterize the database (training speech samples) effectively, so that our system can also synthesize prosody patterns that are NOT actually stored in the training samples.

This method can be viewed as a speech counterpart of Active Shape Models.

> Same linguistic structure gives same motions in prosody

When the word /naname/ (3 morae, LHL) conveys "anger", the speaker puts more stress on the 2nd mora because of the accent location. The same phenomenon is observed with other words that share the same linguistic structure, such as /erabu/ and /kowai/. /Naniyorimo/, on the other hand, has its accent on the 1st mora, and the speaker puts more stress on that mora when conveying "anger". We therefore assume that the motions of speech prosody when speech conveys emotion depend on the combination of the number of morae and the accent location.

We chose a representative word for each of the 20 combinations in the following table and collected speech samples with a variety of emotions for each word. For instance, /nama/ was chosen for 2 morae with HL accent.

| # of morae | accent type | word |
| --- | --- | --- |
| 2 | HL | /nama/ |
| 2 | LH | /nami/ |
| 3 | HLL | /midori/ |
| 3 | LHL | /naname/ |
| 3 | LHH | /nagame/ |
| 4 | HLLL | /arawani/ |
| 4 | LHLL | /amamizu/ |
| 4 | LHHL | /arayuru/ |
| 4 | LHHH | /omonaga/ |
| 5 | HLLLL | /naniyorimo/ |
| 5 | LHLLL | /amamizuwa/ |
| 5 | LHHLL | /amanogawa/ |
| 5 | LHHHL | /yawarageru/ |
| 5 | LHHHH | /amarimono/ |
| 6 | HLLLLL | /emoiwarenu/ |
| 6 | LHLLLL | /omoumamani/ |
| 6 | LHHLLL | /amagaeruwa/ |
| 6 | LHHHLL | /iwazumogana/ |
| 6 | LHHHHL | /oborozukiyo/ |
| 6 | LHHHHH | /warawaremono/ |

> A Synthesis Method of Emotional Speech Using Subspace Constraints in Prosody

First, the prosody patterns

p_i = [f_{i1}, f_{i2}, ..., f_{iL}, a_{i1}, a_{i2}, ..., a_{iL}, l_{i1}, l_{i2}, ..., l_{in}]^T

(i: index of the training speech sample, f: F0, a: power, l: mora length, L: speech length, n: number of morae) extracted from the training speech samples are analyzed by principal component analysis, yielding the following decomposition into the mean pattern and the variation from it:

p_i = \bar{p} + \sum_j c_{ij} v_j

where \bar{p} is the mean prosody pattern, c_{ij} is the j-th principal component score, and v_j is the j-th eigenvector.

The principal component score vector c_i gives the prosody pattern p_i of the i-th training speech sample. When synthesizing a word utterance, the system assembles wavelets from a phoneme database that contains neutral utterances and generates the waveform by TD-PSOLA driven by the prosody pattern p.
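A minimal sketch of this first step is given below, assuming the prosody patterns have already been extracted and aligned into fixed-length vectors; the file name, the scikit-learn PCA call, and the 95% variance cutoff are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from sklearn.decomposition import PCA

# P: (N, D) matrix of prosody patterns, one row per training utterance,
#    p_i = [F0 contour, power contour, mora lengths] flattened to length D.
#    (Hypothetical pre-extracted data; the contours must share a common length.)
P = np.load("prosody_patterns.npy")

pca = PCA(n_components=0.95)      # keep components covering 95% of the variance (assumed cutoff)
C = pca.fit_transform(P)          # C[i, j] = c_ij, the principal component scores
p_bar = pca.mean_                 # mean prosody pattern p_bar
V = pca.components_               # row j is the eigenvector v_j

# Reconstruction of a training pattern: p_i = p_bar + sum_j c_ij v_j
p_0 = p_bar + C[0] @ V
assert np.allclose(p_0, pca.inverse_transform(C[:1])[0])
```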

Second, subjective evaluation of each training speech sample gives a K-dimensional emotion vector e_i. In this implementation we used 12 emotions: "anger", "joy", "disgust", "surprise", "despise", "pride", "depression", "amusing", "sorrow", "boredom", "suffering", and "shame" (K = 12).
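The page does not detail how the subjective scores are aggregated; one plausible scheme, averaging the listeners' ratings per emotion category, is sketched below under assumed array shapes and rating scale.

```python
import numpy as np

def emotion_vector(ratings_for_sample: np.ndarray) -> np.ndarray:
    """Aggregate subjective ratings into the K = 12 dimensional emotion
    vector e_i for one training speech sample.

    ratings_for_sample: (num_listeners, 12) array of scores, one column per
    emotion category, e.g. on a 0-6 scale (assumed).
    """
    return ratings_for_sample.mean(axis=0)
```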

Finally, multiple regression analysis, with the emotion vector e as the predictor variables and the principal component vector c as the criterion variables, estimates the partial regression coefficients R that linearly relate the two by the equation below. The prosody pattern p to be synthesized can therefore be obtained from any given emotion vector e.

c = R e
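Putting the pieces together, the regression and the final synthesis of a prosody pattern might look like the following sketch (ordinary least squares without an intercept, matching c = R e; the variable names and shapes follow the earlier sketches and are assumptions, not the original implementation).

```python
import numpy as np

# C : (N, K_pc) principal component scores from the PCA step
# E : (N, 12)   emotion vectors from the subjective evaluation
# p_bar, V     : mean prosody pattern and eigenvectors from the PCA step

def estimate_regression(E: np.ndarray, C: np.ndarray) -> np.ndarray:
    """Estimate R such that c = R e (multiple regression, least squares)."""
    R_T, *_ = np.linalg.lstsq(E, C, rcond=None)   # solves E @ R^T = C in the least-squares sense
    return R_T.T

def synthesize_prosody(e: np.ndarray, R: np.ndarray,
                       p_bar: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Map an emotion vector e to a prosody pattern p = p_bar + sum_j (R e)_j v_j."""
    c = R @ e                  # predicted principal component scores
    return p_bar + c @ V

# Usage: grade the intensity of "anger" from neutral to maximum.
# R = estimate_regression(E, C)
# for intensity in (0.0, 3.0, 6.0):
#     e = target_emotion_vector("anger", intensity)   # sketch near the top of this page
#     p = synthesize_prosody(e, R, p_bar, V)
#     # p then drives TD-PSOLA over wavelets assembled from the neutral phoneme database
```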

> Examples of training speech*

Shown here are examples of training speech samples for the Japanese word /arayuru/ (4 morae, accent type LHHL). A single male speaker expressed 47 emotions for each of the 20 combinations of the number of morae and accent type.

neutral**, envy, contempt, hope, anger, irritation, delight, happy, joy, complaint, cynicism, fondness, disgust, longing, indifference, disliking, despise, sympathy, admiration, reluctance, amusing, tolerance, pride, depression, worry, chuckling, love, accusation, gentleness, disappointment, grief, anxiety, relief, scold, flattery, surprise, indignation, sorrow, contentment, haste, shame, scare, boredom, shock, calmness, hate, suffering

* The full dataset is available from here (as the Keio University Japanese Emotional Speech Database (Keio-ESD), Speech Resources Consortium, National Institute of Informatics).

** Emotions that the speaker tried to express (they were not necessarily conveyed successfully, nor does the method require that they be).