Building Synthetic Voices

Alan W Black and Kevin A. Lenzo
awb@cs.cmu.edu and lenzo@cs.cmu.edu

This tutorial will give an overview of the basic techniques available for building synthetic voices for speech synthesis systems, including an actual example of voice building.

The first part will describe the basic components of a speech synthesis system and the state-of-the-art techniques used within them. Specifically:

Text Analysis: expanding symbols, numbers, acronyms, etc., and resolving homographs
Linguistic Analysis: "from words to how to say them", addressing lexical entries, letter-to-sound rules, and prosodic modeling (phrasing, intonation, and duration)
Waveform Synthesis: "from phones and prosody to waveforms", describing basic techniques for making computers talk using recorded prompts, diphones, and general unit selection synthesis
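
As a rough illustration of the first and last of these components, the following Python sketch expands small numbers into words and turns a phone sequence into the diphone names a concatenative synthesizer would look up. It is a toy under stated assumptions, not the Festival or FestVox implementation: the function names, the 0 to 999 range, the "and" in spoken hundreds, the "pau" silence label, and the "a-b" diphone naming are all chosen here for illustration.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999 (illustrative range only)."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only handles 0..999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"
    hundreds, rest = divmod(n, 100)
    head = f"{ONES[hundreds]} hundred"
    return head if rest == 0 else f"{head} and {number_to_words(rest)}"


def normalize(text: str) -> str:
    """Text analysis step: replace stand-alone 1-3 digit numbers with words."""
    return re.sub(r"\b\d{1,3}\b",
                  lambda m: number_to_words(int(m.group())),
                  text)


def phones_to_diphones(phones):
    """Waveform synthesis step: list the phone-to-phone transitions needed.

    The utterance is padded with an assumed silence phone, "pau", so that
    the first and last phones also get a transition unit.
    """
    padded = ["pau", *phones, "pau"]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


if __name__ == "__main__":
    print(normalize("Room 42 is open"))
    print(phones_to_diphones(["h", "ax", "l", "ow"]))
```

A real text analysis module must also decide how a number is to be read (as a year, a phone number, a currency amount, and so on), and a real diphone voice maps each transition name to a recorded waveform segment; neither is attempted here.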

The second part will describe the basic stages required in building new synthetic voices (in English or other languages):

building a text analysis system
building a lexicon and letter-to-sound rules
building phrasing, intonation, and duration models
recording data for concatenative speech synthesis (diphones, unit selection and/or limited domain)

This tutorial is based on the techniques, documentation, and tools freely distributed through CMU's FestVox project (http://festvox.org/), which produce voices that run on Edinburgh University's Festival Speech Synthesis System.
