Richard Savery

Shimi and Prosody

Using Musical Prosody for Robotic Interaction


My work with Shimi started with a broad question: what voice should a robot use to communicate? I focused on generating a new non-speech voice, aiming to avoid the uncanny valley and allow a robot to talk like a robot. The voice uses prosodic audio generated through deep learning on an embedded NVIDIA Jetson TX2. After creating the voice, we were able to show increased levels of trust in users when collaborating with Shimi. These results then led to the development of a successful, expanded NSF grant, which commenced in November 2019.


National Science Foundation, National Robotics Initiative - $669,912.00

In 2019 I was the primary author, with my advisor Gil Weinberg, of the NSF grant "Creating Trust Between Groups of Humans and Robots Using a Novel Music Driven Robotic Emotion Generator".


Shimi Will Now Sing to You in an Adorable Robot Voice, IEEE Spectrum - 05 Mar 2019


Establishing Human-Robot Trust through Music-Driven Robotic Emotion Prosody and Gesture

28th IEEE International Conference on Robot & Human Interactive Communication 2019

Richard Savery, Ryan Rose and Gil Weinberg

Abstract: As human-robot collaboration opportunities continue to expand, trust becomes ever more important for full engagement and utilization of robots. Affective trust, built on emotional relationship and interpersonal bonds, is particularly critical as it is more resilient to mistakes and increases the willingness to collaborate. In this paper we present a novel model built on music-driven emotional prosody and gestures that encourages the perception of a robotic identity, designed to avoid the uncanny valley. Symbolic musical phrases were generated and tagged with emotional information by human musicians. These phrases controlled a synthesis engine playing back prerendered audio samples generated through interpolation of phonemes and electronic instruments. Gestures were also driven by the symbolic phrases, encoding the emotion from the musical phrase to low degree-of-freedom movements. Through a user study we showed that our system was able to accurately portray a range of emotions to the user. We also showed with a significant result that our non-linguistic audio generation achieved an 8% higher mean of average trust than using a state-of-the-art text-to-speech system.

Finding Shimi’s Voice: Fostering Human-Robot Communication With Music And a NVIDIA Jetson TX2

Linux Audio Conference 2019

R Savery, R Rose, G Weinberg

Abstract: We present a novel robotic implementation of an embedded Linux system in Shimi, a musical robot companion. We discuss the challenges and benefits of this transition as well as a system and technical overview. We also present a unique approach to robotic gesture generation and a new voice generation system designed for robot audio vocalization of any MIDI file. Our interactive system combines NLP, audio capture and processing, and emotion and contour analysis from human speech input. Shimi ultimately acts as an exploration into how a robot can use music as a driver for human engagement.