Judging Text-to-Speech by the Wisdom of the Crowd

Title: Judging Text-to-Speech by the Wisdom of the Crowd
Speaker: Prof. Alan Black (http://www.cs.cmu.edu/~awb/)
Date: Wednesday, Nov 30th @Noon
Room: GHC 6501

One of the many hard problems in generating good synthetic speech is evaluating its quality. Objective measures are always useful when optimizing machine learning algorithms, but in speech generation what ultimately matters is what the end user actually thinks of the speech. Running human listening tests is expensive and not very reliable.

This talk lays out the techniques we've used in trying to find robust subjective evaluation methods for speech synthesis. These have been implemented in the annual Blizzard Challenge, where teams build synthetic voices from a common dataset and many listeners then judge their quality. The results are robust (ratings from different subsets of listeners correlate), and they have yielded interesting findings about the orthogonality of naturalness and intelligibility. However, as we go further into using end users as an evaluation system, we note a number of issues that must be addressed. People prefer voices they've listened to before (for good speech-perception reasons). People are not good at judging subtle differences in voice quality, intonation, timing, etc. And naturalness and intelligibility are not the only goals.

This talk will present existing crowdsourcing techniques used to evaluate speech synthesis and propose new techniques that might help us evaluate future directions in speech synthesis.