You can scrape a database of movie scripts, tie each word in a script to a moment in the movie (accurate to the second, e.g. via forced alignment), and extract recordings that demonstrate a given emotion. You can even use the modifiers in a script, such as “very angrily,” to train on varying degrees of each emotion. When emotions are not inscribed in the script, you can use textual affect detection tools to infer them from the dialogue itself. There’s got to be some way to do this without having Mechanical Turk workers label every second of a recording.
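To make the idea concrete, here is a minimal sketch of the labeling step: parsing a parenthetical stage direction like “(very angrily)” into an (emotion, intensity) label, and turning word-level timestamps from a forced aligner into clip boundaries. The adverb and intensifier tables, the script-line format, and the timestamp tuples are all illustrative assumptions, not a real dataset schema.

```python
import re

# Hypothetical vocab: intensifiers scale the label, adverbs name the emotion.
INTENSIFIERS = {"very": 1.0, "extremely": 1.0, "somewhat": 0.5, "slightly": 0.3}
EMOTION_ADVERBS = {"angrily": "anger", "sadly": "sadness",
                   "happily": "joy", "fearfully": "fear"}

# Matches a parenthetical stage direction, e.g. "(very angrily)".
DIRECTION = re.compile(r"\(([^)]*)\)")

def parse_label(line):
    """Return (emotion, intensity) from a script line, or None if unlabeled."""
    m = DIRECTION.search(line)
    if not m:
        return None
    emotion, intensity = None, 0.7  # default intensity with no modifier
    for word in m.group(1).lower().split():
        if word in INTENSIFIERS:
            intensity = INTENSIFIERS[word]
        elif word in EMOTION_ADVERBS:
            emotion = EMOTION_ADVERBS[word]
    return (emotion, intensity) if emotion else None

def clip_span(word_times):
    """word_times: list of (word, start_sec, end_sec) from a forced aligner.
    Returns the (start, end) of the audio clip covering the whole line."""
    return (word_times[0][1], word_times[-1][2])

label = parse_label("JOHN: (very angrily) Get out of my house!")
span = clip_span([("get", 12.0, 12.2), ("out", 12.2, 12.5),
                  ("of", 12.5, 12.6), ("my", 12.6, 12.8),
                  ("house", 12.8, 13.3)])
```

Lines with no parenthetical, or with one that names no known emotion, come back as `None`; those are exactly the cases where a textual affect detector on the dialogue itself would have to fill in the label.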