Title: Synthesize Emotional Speech (Synthesis of emotional speech using subspace constraints in prosody)

> Publications

> Examples of synthesized speech: Synthesis of emotional speech using subspace constraints in prosody

Our method can synthesize emotion with its gradation.

target emotion | word (accent, # of morae, trained / not trained) | emotion intensity
"anger"       | /amamizu/ (LHLL, 4, trained)        | 0 (neutral) -> 3 -> 6 (max)***
"boredom"     | /aomori/ (LHLL, 4, not trained)     | 0 (neutral) -> 3 -> 6 (max)
"joy"         | /arawani/ (HLLL, 4, trained)        | 0 (neutral) -> 3 -> 6 (max)
"sorrow"      | /omoumamani/ (LHLLLL, 6, trained)   | 0 (neutral) -> 3 -> 6 (max)
"disgust"     | /naname/ (LHL, 3, trained)          | 0 (neutral) -> 3 -> 6 (max)
"depression"  | /oborozukiyo/ (LHHHHL, 6, trained)  | 0 (neutral) -> 3 -> 6 (max)

*** The emotion vector e given here was [6 0 0 0 0 0 0 0 0 0 0 0]^T.
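The gradation above amounts to scaling a single component of the emotion vector e. A minimal sketch of that idea, assuming the 12 categories listed later in this page and putting "anger" first to match the footnote's example vector (the ordering and the helper name are illustrative, not from the paper):

```python
import numpy as np

# Assumed ordering of the K = 12 emotion categories; "anger" is placed
# first so that the footnote's example vector comes out as [6, 0, ..., 0].
EMOTIONS = ["anger", "joy", "disgust", "surprise", "despise", "pride",
            "depression", "amusing", "sorrow", "boredom", "suffering", "shame"]

def emotion_vector(emotion: str, intensity: float) -> np.ndarray:
    """Build a K-dimensional emotion vector e with one nonzero component."""
    e = np.zeros(len(EMOTIONS))
    e[EMOTIONS.index(emotion)] = intensity
    return e

# "anger" at maximum intensity 6 gives [6, 0, 0, ..., 0]^T as in the footnote;
# intensities 0 and 3 give the neutral and intermediate points of the gradation.
print(emotion_vector("anger", 6))
```

Intermediate intensities such as 3 then simply interpolate between the neutral pattern and the full-strength emotional pattern.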

> Research purpose and approach

The concatenative approach to speech synthesis has proven the most promising for achieving human-like quality because it concatenates human voice segments stored in a database. In other words, conventional concatenative systems can only synthesize what is stored in the database. It is difficult, however, to collect a sufficient variety of waveforms to synthesize the many kinds and gradations of emotional content, given the degrees of freedom required by possible emotional contexts.

We analyzed speech data containing a variety of emotions and found that the prosody patterns (speaking styles, i.e., the contours of F0 and power and the lengths of the morae) are approximately normally distributed (a mean plus variations around it). We therefore use principal component analysis to parameterize the database (training speech samples) effectively, so that our system can also synthesize patterns that are NOT actually stored in the training samples.

This method can be viewed as the speech version of the Active Shape Models.

> Same linguistic structure gives same motions in prosody

When the word /naname/ (3 morae, LHL) conveys "anger", the speaker puts more stress on the 2nd mora due to the location of the accent. The same phenomenon is observed with other words that share the same linguistic structure, such as /erabu/ and /kowai/. /Naniyorimo/, on the other hand, has the accent on the 1st mora, and the speaker puts more stress on it when conveying "anger". We assume that the motions of speech prosody when the speech conveys emotion depend on the combination of the number of morae and the accent location.

We chose a representative word for each of the 20 combinations in the following table, and collected speech samples with a variety of emotions for each word. For instance, /nama/ was chosen for 2 morae with HL accent.

# of morae | accent location | word
2 | HL     | /nama/
2 | LH     | /nami/
3 | HLL    | /midori/
3 | LHL    | /naname/
3 | LHH    | /nagame/
4 | HLLL   | /arawani/
4 | LHLL   | /amamizu/
4 | LHHL   | /arayuru/
4 | LHHH   | /omonaga/
5 | HLLLL  | /naniyorimo/
5 | LHLLL  | /amamizuwa/
5 | LHHLL  | /amanogawa/
5 | LHHHL  | /yawarageru/
5 | LHHHH  | /amarimono/
6 | HLLLLL | /emoiwarenu/
6 | LHLLLL | /omoumamani/
6 | LHHLLL | /amagaeruwa/
6 | LHHHLL | /iwazumogana/
6 | LHHHHL | /oborozukiyo/
6 | LHHHHH | /warawaremono/

> A Synthesis Method of Emotional Speech Using Subspace Constraints in Prosody

First, prosody patterns p_i = [f_i1, f_i2, ..., f_iL, a_i1, a_i2, ..., a_iL, l_i1, l_i2, ..., l_in]^T (i: index of the training sample; f: F0; a: power; l: mora length; L: speech length; n: number of morae) extracted from the training speech samples are analyzed by principal component analysis, which yields the following decomposition into the mean pattern and the variation from it:

p_i = p_bar + Σ_j c_ij v_j

p_bar: mean prosody pattern; c_ij: j-th principal component score of the i-th sample; v_j: j-th eigenvector

The principal component score vector c_i thus gives the prosody pattern p_i of the i-th training sample. When synthesizing a word utterance, the system assembles wavelets from a phoneme database containing neutral utterances and generates the waveform using TD-PSOLA with the prosody pattern p.
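The PCA parameterization above can be sketched in a few lines of numpy. The data here are random stand-ins for real prosody patterns (rows are vectors p_i concatenating F0 contour, power contour, and mora lengths), and the dimensions are illustrative assumptions:

```python
import numpy as np

# Synthetic stand-in: 47 training utterances, 30-dimensional prosody patterns.
rng = np.random.default_rng(0)
P = rng.normal(size=(47, 30))

p_bar = P.mean(axis=0)                 # mean prosody pattern p_bar
X = P - p_bar                          # deviations from the mean

# Eigenvectors v_j of the covariance matrix, via SVD of the centered data.
_, s, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T                               # columns are the eigenvectors v_j

C = X @ V                              # principal component scores c_ij

# Each pattern is recovered exactly as p_i = p_bar + sum_j c_ij * v_j.
P_rec = p_bar + C @ V.T
assert np.allclose(P, P_rec)
```

In practice one would keep only the leading components (those with large singular values s), so that a short score vector c_i describes each prosody pattern compactly.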

Second, subjective evaluation of each training speech sample gives a K-dimensional emotion vector e_i. In this implementation we used 12 emotions, namely "anger", "joy", "disgust", "surprise", "despise", "pride", "depression", "amusing", "sorrow", "boredom", "suffering", and "shame" (K = 12).

Finally, multiple regression analysis, in which the emotion vector e supplies the predictor variables and the principal component score vector c the criterion variables, estimates the partial regression coefficients R that linearly relate them through the equation below. The prosody pattern p to be synthesized can therefore be obtained from any given emotion vector e.

c = R e
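The regression step, and the way a target emotion vector then yields a prosody pattern, can be sketched as follows. The data are synthetic stand-ins, and the dimensions (47 training samples, 8 principal components, K = 12) are illustrative assumptions:

```python
import numpy as np

# Synthetic stand-ins: E holds the K = 12-dimensional emotion vectors e_i of
# the training samples; C holds their principal component score vectors c_i.
rng = np.random.default_rng(1)
E = rng.normal(size=(47, 12))          # emotion vectors (predictors)
R_true = rng.normal(size=(8, 12))      # hidden coefficients for this demo
C = E @ R_true.T                       # component scores (criterion variables)

# Multiple regression: least-squares estimate of R in c = R e.
X, *_ = np.linalg.lstsq(E, C, rcond=None)
R = X.T                                # shape (n_components, K)

# Synthesis: a target emotion vector e gives scores c = R e, from which the
# prosody pattern follows as p = p_bar + sum_j c_j v_j (PCA step above).
e = np.zeros(12)
e[0] = 6.0                             # e.g. first emotion at intensity 6
c = R @ e
assert np.allclose(c, R_true @ e)
```

Because the mapping is linear, scaling e down (intensity 6 -> 3 -> 0) moves the synthesized pattern smoothly back toward the mean (neutral) prosody, which is what produces the gradation demonstrated at the top of this page.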

> Examples of training speech*

Here are examples of the training speech samples for the Japanese word /arayuru/ (number of morae = 4, accent type = LHHL). A single male speaker expressed 47 emotions for each of the 20 combinations of number of morae and accent type.

neutral**
envy
contempt
hope
anger
irritation
delight
happy
joy
complaint
cynicism
fondness
disgust
longing
indifference
disliking
despise
sympathy
admiration
reluctance
amusing
tolerance
pride
depression
worry
chuckling
love
accusation
gentleness
disappointment
grief
anxiety
relief
scold
flattery
surprise
indignation
sorrow
contentment
haste
shame
scare
boredom
shock
calmness
hate
suffering

* The full dataset is available as the Keio University Japanese Emotional Speech Database (Keio-ESD) from the Speech Resources Consortium, National Institute of Informatics.

** Emotions the speaker tried to express. (They were not necessarily conveyed successfully, nor does the method require that they be.)