Building Synthetic Voices
This tutorial will give an overview of the basic techniques available for building synthetic voices for speech synthesis systems, including an actual example of voice building.
The first part will describe the basic components of a speech synthesis system, covering the state-of-the-art techniques used within them. Specifically:
- Text Analysis: addressing the expansion of symbols, numbers, acronyms, etc., and resolving homographs
- Linguistic Analysis: "from words to how to say them", addressing issues in lexical entries, letter-to-sound rules, and prosodic modeling (phrasing, intonation, and duration)
- Waveform Synthesis: "from phones and prosody to waveforms", describing basic techniques for making computers talk using recorded prompts, diphones, and general unit selection synthesis
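As a rough illustration of the text analysis step listed above, the following Python sketch expands digits and a couple of symbols into speakable words. It is a toy under stated assumptions, not the method used by Festival or FestVox: the names number_to_words, expand_tokens and SYMBOLS are hypothetical, only numbers below 1000 are read as quantities, and homograph resolution is not attempted.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

# Hypothetical symbol table; a real system would need a far larger one.
SYMBOLS = {"%": "percent", "&": "and"}


def number_to_words(n):
    """Spell out an integer in the range 0..999 as English words."""
    if not 0 <= n < 1000:
        raise ValueError("this sketch only handles 0..999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] if ones == 0 else TENS[tens] + " " + ONES[ones]
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words if rest == 0 else words + " " + number_to_words(rest)


def expand_tokens(text):
    """Split text into tokens and expand numbers and known symbols."""
    expanded = []
    for token in re.findall(r"\d+|[A-Za-z']+|[%&]", text):
        if token.isdigit():
            if int(token) < 1000:
                expanded.append(number_to_words(int(token)))
            else:
                # Assumption: read longer digit strings one digit at a time.
                expanded.append(" ".join(ONES[int(d)] for d in token))
        elif token in SYMBOLS:
            expanded.append(SYMBOLS[token])
        else:
            expanded.append(token.lower())
    return expanded


if __name__ == "__main__":
    print(expand_tokens("Room 42 is 100% full"))
```

Even this small case shows why text analysis is treated as its own stage: whether "1984" is a year, a quantity or a string of digits depends on context, which a fixed rule like the one above cannot decide.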
The second part will describe the basic stages required in building new synthetic voices (in English or other languages):
- building a text analysis system
- building a lexicon and letter-to-sound rules
- building phrasing, intonation and duration models
- recording data for concatenative speech synthesis (diphones, unit selection and/or limited domain)
This tutorial is based on the techniques, documentation and tools freely distributed through CMU's FestVox project (http://festvox.org/), leading to voices that can be run on Edinburgh University's Festival Speech Synthesis System.