14:00   Oral Session 6-KZ – Emotional Speech Synthesis & Affective Music Players
Chair: Marc Schröder
14:00
25 mins
Perception of synthetic emotion expressions in speech: Categorical and dimensional annotations
Judith M. Kessens, Mark A. Neerincx, Melanie Kroes, Gerrit Bloothooft, Rosemarijn Looije
Abstract: In this paper, both categorical and dimensional annotations have been made of neutral and emotional speech synthesis (anger, fear, sad, happy and relaxed). With various prosodic emotion manipulation techniques we found emotion classification rates of 40%, which is significantly above chance level (17%). The classification rates are higher for sentences whose semantics match the synthetic emotion. Manipulating pitch and duration produced perceived differences in arousal, whereas differences in valence were hardly perceived. Of the investigated emotion manipulation methods, EmoFilt and EmoSpeak performed very similarly, except for the emotion fear. Copy synthesis did not perform well, probably because of suboptimal alignments and the use of multiple speakers.
14:25
25 mins
Annotating meaning of listener vocalizations for speech synthesis
Sathish Pammi, Marc Schröder
Abstract: Generation of listener vocalizations is one of the major objectives of emotionally colored conversational speech synthesis. Success in this endeavor depends on the answers to three questions: What kinds of meaning are expressed through listener vocalizations? What form is suitable for a given meaning? And in what context should which listener vocalizations be produced? In this paper, we address the first of these questions. We present a method to record natural and expressive listener vocalizations for synthesis, and describe our approach to identifying a suitable categorical description of the meaning conveyed in the vocalizations. In our data, one actor produces a total of 967 listener vocalizations, in his natural speaking style and in three acted emotion-specific personalities. In an open categorization scheme, we find that eleven categories occur in at least 5% of the vocalizations, and that most vocalizations are better described by two or three categories than by a single one. Furthermore, an annotation of meaning reference, according to Bühler's Organon model, allows us to make interesting observations regarding the listener's own state, his stance towards the interlocutor, and his attitude towards the topic of the conversation.
14:50
25 mins
Deploying music characteristics for an affective music player
Marjolein van der Zwaag, Joyce H.D.M. Westerink, Egon L. van den Broek
Abstract: This paper describes work toward an affective music player (AMP), which is able to direct affect to a goal state by selecting music. Music has repeatedly been shown to modulate affect; however, precise guidelines for the use of music characteristics in an AMP have not been defined. To explore these, we investigated the influence of music characteristics on 32 participants who listened to 16 songs, testing effects of tempo (slow/fast), mode (minor/major), and percussiveness (low/high). Subjective affect (i.e., arousal, tension, and positive and negative valence) and physiology (i.e., skin conductance level, skin conductance responses, and heart rate variability) were measured during listening. Results show main and interaction effects of music characteristics on both subjective affect and physiology, implying that the characteristics are mutually dependent in modulating affect. Based on these results, guidelines are presented for AMPs, which can effectively direct affect through music.
15:15
25 mins
Emotional Speech Synthesis by Sensing Affective Information from Text
Mostafa Al Masum Shaikh, Antonio Rui Ferreira Rebordao, Keikichi Hirose, Mitsuru Ishizuka
Abstract: Speech is the most common medium to express subjective meanings and intents that, in order to be fully understood, rely heavily on the perception of emotion towards something or somebody. Since many applications are becoming speech-enabled, there is an increasing need for applications that efficiently perform affective speech synthesis. We have carried out several perceptual experiments that show that automatic Text-To-Speech (TTS) systems are weak in exploiting the prosodic and acoustic properties relevant to emotional expressivity. The continuing research on expressive speech synthesis has acknowledged that emotion in speech relies on parameters like fundamental frequency (F0) level, voice quality, or articulation precision. Moreover, there are several TTS systems which provide control over these parameters through XML-based formatted input. This paper describes an approach to generate such formatted input so that the synthesizer can be assigned appropriate prosodic parameters and pitch accents according to the emotion detected in the input text. Our technique utilizes several linguistic resources to recognize emotions like “happiness” or “sadness” conveyed through the input text and thereby assigns appropriate parameters to the synthesizer to carry out expressive speech synthesis. For test and evaluation purposes, the MARY TTS system was used to read out “happy” and “sad” news. The preliminary perceptual test results are encouraging: human judges could perceive the “happy” emotion significantly better when listening to the synthesized speech obtained with our approach than when listening to non-affective synthesized speech.
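The pipeline outlined in this abstract (detect an emotion in the text, then emit XML input carrying prosodic settings) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the tiny word lists, the pitch and rate values, and the helper names `detect_emotion` and `to_markup` are all hypothetical. The output uses an SSML-style `prosody` element; the markup the authors actually generate may differ.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical mini-lexicon; a real system would rely on far richer
# linguistic resources than two short word lists.
EMOTION_WORDS = {
    "happy": {"won", "win", "celebrate", "joy", "success"},
    "sad": {"loss", "died", "mourn", "tragedy", "flood"},
}

# Illustrative prosody settings per emotion (assumed values, not from the paper).
PROSODY = {
    "happy": {"pitch": "+15%", "rate": "+10%"},
    "sad": {"pitch": "-10%", "rate": "-15%"},
    "neutral": {},
}


def detect_emotion(text):
    """Return the emotion whose word list overlaps the text most, else 'neutral'."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    scores = {emo: len(words & vocab) for emo, vocab in EMOTION_WORDS.items()}
    # On a tie, max() keeps the first emotion listed in EMOTION_WORDS.
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "neutral"


def to_markup(text):
    """Wrap the text in an SSML-style <prosody> element chosen by emotion."""
    speak = ET.Element("speak")
    attrs = PROSODY[detect_emotion(text)]
    # Neutral text gets no <prosody> wrapper at all.
    node = ET.SubElement(speak, "prosody", attrs) if attrs else speak
    node.text = text
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(to_markup("The team won and the fans celebrate"))
    print(to_markup("A tragedy struck the town"))
```

A keyword count is only a stand-in for the emotion detection step; the point of the sketch is the mapping from a detected label to synthesizer parameters expressed as markup.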
15:40
25 mins
Personalized affective music player
Joris H. Janssen, Egon L. van den Broek, Joyce H.D.M. Westerink
Abstract: We introduce and test an affective music player (AMP) that selects music for mood enhancement. Through a concise overview of content, construct, and ecological validity, we elaborate five considerations that form the foundation of the AMP. Based on these considerations, computational models are developed, using regression and kernel density estimation. We show how these models can be used for music selection and how they can be extended to fit in other systems. Subsequently, the success of the models is illustrated with a user test. The AMP augments music listening, and its techniques, in general, enable automated affect guidance. Finally, we argue that our AMP is readily applicable to real-world situations as it can 1) cope with noisy situations, 2) handle the large inter-individual differences apparent in the musical domain, and 3) integrate context or other information, all in real time.
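As a rough illustration of how kernel density estimation could support music selection, the sketch below models each song by the mood changes observed after earlier listens and picks the song most likely to raise mood. This is an assumption-laden toy, not the authors' model: the per-song history, the Gaussian kernel, the fixed bandwidth, and the names `kde_pdf`, `prob_above`, and `pick_song` are all hypothetical, and the regression component mentioned in the abstract is not covered.

```python
import math


def kde_pdf(x, samples, bandwidth):
    """Gaussian kernel density estimate at x from the observed samples."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))
    return norm * sum(
        math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
    )


def prob_above(threshold, samples, bandwidth):
    """P(X > threshold) under the same Gaussian KDE (closed-form integral)."""

    def phi(z):
        # Standard normal cumulative distribution function.
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))

    return sum(
        1.0 - phi((threshold - s) / bandwidth) for s in samples
    ) / len(samples)


def pick_song(history, bandwidth=0.5):
    """Choose the song with the highest estimated chance of a positive mood change."""
    return max(history, key=lambda song: prob_above(0.0, history[song], bandwidth))


if __name__ == "__main__":
    # Hypothetical mood changes per song (positive = mood improved).
    history = {
        "song_a": [1.0, 0.8, 1.2],
        "song_b": [-1.0, -0.5, 0.1],
    }
    print(pick_song(history))
    print(kde_pdf(1.0, history["song_a"], 0.5))
```

Because the density is estimated per listener from that listener's own history, a scheme of this kind would adapt to individual differences, which is one of the properties the abstract emphasizes.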