Generation of High Quality Audio Natural Emotional Speech Corpus using Task Based Mood Induction

Charlie Cullen, Dublin Institute of Technology
Brian Vaughan, Dublin Institute of Technology
Spyros Kousidis, Dublin Institute of Technology
Yi Wang, Dublin Institute of Technology
Ciaran McDonnell, Dublin Institute of Technology
D. Campbell, Dublin Institute of Technology

Document Type: Conference Paper

International Conference on Multidisciplinary Information Sciences and Technologies Extremadura (InSciT), Merida, Spain. 2006.


Detecting emotional dimensions [1] in speech is an area of great research interest, notably as a means of improving human-computer interaction in areas such as speech synthesis [2]. In this paper, a method of obtaining high-quality emotional audio speech assets is proposed. The methods of obtaining emotional content are subject to considerable debate, with distinctions between acted [3] and natural [4] speech being made on the grounds of authenticity. Mood Induction Procedures (MIPs) [5] are often employed to stimulate emotional dimensions in a controlled environment. This paper details experimental procedures based around MIP 4, using performance-related tasks to engender activation and evaluation responses from the participants. Tasks are specified involving two participants, who must co-operate in order to complete a given task [6] within the allotted time. Experiments designed in this manner also allow for the specification of high-quality audio assets (notably 24-bit/192 kHz [7]) within an acoustically controlled environment [8], thus providing a means of reducing unwanted acoustic factors within the recorded speech signal. Once suitable assets are obtained, they will be assessed for the purposes of segregation into differing emotional dimensions. The most statistically robust method of evaluation involves the use of listening tests to determine the perceived emotional dimensions within an audio clip. In this experiment, the FeelTrace [9] rating tool is employed within user listening tests to specify the categories of emotional dimensions for each audio clip.
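As an illustration of the final categorisation step, the sketch below shows one way listener ratings on the activation and evaluation dimensions might be combined to place a clip in a quadrant of the activation-evaluation space. It is not the procedure used in this paper: the rating range of -1 to 1, the averaging across listeners, the zero thresholds, and the names `quadrant` and `categorise_clip` are all assumptions made for the example, and the ratings shown are invented.

```python
from statistics import mean


def quadrant(activation: float, evaluation: float) -> str:
    """Label a point in activation-evaluation space by quadrant.

    Assumes both dimensions are centred on zero (hypothetical scale).
    """
    act = "active" if activation >= 0 else "passive"
    val = "positive" if evaluation >= 0 else "negative"
    return f"{act}/{val}"


def categorise_clip(ratings):
    """Average per-listener (activation, evaluation) pairs for one clip
    and return the quadrant of the mean rating."""
    if not ratings:
        raise ValueError("at least one listener rating is required")
    mean_activation = mean(a for a, _ in ratings)
    mean_evaluation = mean(e for _, e in ratings)
    return quadrant(mean_activation, mean_evaluation)


# Invented ratings from three hypothetical listeners for a single clip.
example_ratings = [(0.6, 0.4), (0.3, 0.5), (0.8, -0.1)]
print(categorise_clip(example_ratings))
```

Averaging is only one possible way of aggregating listeners; a real analysis would also need to consider agreement between listeners before assigning a category.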