Computers that people can converse with have been a staple of science fiction for decades. Modern computers have not yet reached the stage of being able to hold a conversation; however, they can produce speech. How do they do it?
The process of producing speech on a computer is called speech synthesis. Many modern computer applications, from automated GPS navigators to business VoIP services, take advantage of this ability. Speech synthesis is a type of computer output. The computer or other device producing the speech takes words input into the system and reads them back to the user.
Speech Synthesis Step One: Pre-processing
Speech synthesis is a three-step process. The first step is called pre-processing or normalization. In this stage, the computer analyzes the different ways the given text could be read and determines the correct one for the context. Numbers, times, dates, abbreviations, special characters and acronyms are turned into words. Because computers don't have the same ability as humans to decide how to pronounce something based on context, neural networks or statistical probability techniques are used. For example, if a computer is trying to determine whether a number represents a year or a quantity, it may look for clues in the text, such as the word "year."
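The year-versus-quantity decision described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular synthesizer works: the cue words in YEAR_CUES and the helper names as_year, as_quantity and normalize are assumptions made for this example, and a real system would rely on the statistical or neural techniques mentioned above rather than a fixed word list.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

# Hypothetical cue words: a number right after one of these is read as a year.
YEAR_CUES = {"year", "in", "since"}


def two_digits(n):
    """Spell out a number from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def as_quantity(n):
    """Spell out a number from 0 to 9999 as a quantity."""
    parts = []
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    if thousands:
        parts.append(ONES[thousands] + " thousand")
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest or not parts:
        parts.append(two_digits(rest))
    return " ".join(parts)


def as_year(n):
    """Spell out a four-digit year as two pairs of digits (simplified)."""
    high, low = divmod(n, 100)
    if high % 10 == 0:          # 2005 is read "two thousand five"
        return as_quantity(n)
    if low == 0:                # 1900 is read "nineteen hundred"
        return two_digits(high) + " hundred"
    if low < 10:                # 1905 is read "nineteen oh five"
        return two_digits(high) + " oh " + ONES[low]
    return two_digits(high) + " " + two_digits(low)


def normalize(text):
    """Replace digit tokens with words, using the previous word as context."""
    tokens = text.split()
    out = []
    for i, tok in enumerate(tokens):
        match = re.fullmatch(r"(\d{1,4})([.,!?]?)", tok)
        if not match:
            out.append(tok)
            continue
        digits, punctuation = match.groups()
        previous = tokens[i - 1].lower() if i > 0 else ""
        is_year = len(digits) == 4 and previous in YEAR_CUES
        words = as_year(int(digits)) if is_year else as_quantity(int(digits))
        out.append(words + punctuation)
    return " ".join(out)


print(normalize("In the year 1984 we sold 1984 units."))
```

The same digits are read two different ways in one sentence, depending only on the word that comes before them.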
Additionally, the computer must attempt to determine the correct pronunciation for homographs, which are words that are spelled the same but pronounced differently depending on what they mean. For example, "read" rhymes with "reed" in the present tense and with "red" in the past tense. To accomplish this, the computer looks for context clues, such as whether a sentence is written in the present or past tense.
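A rough sketch of tense-based homograph handling might look like the following. The cue words in PAST_CUES, the HOMOGRAPHS table and the function name are all assumptions invented for this example, and the pronunciations are written in ARPAbet-style symbols. A real synthesizer would use far richer context than a handful of keywords.

```python
# Hypothetical cue words that suggest a sentence is in the past tense.
PAST_CUES = {"yesterday", "had", "has", "have", "already", "ago"}

# Hypothetical table of homographs with one pronunciation per tense.
HOMOGRAPHS = {
    "read": {"present": "R IY D", "past": "R EH D"},
}


def pronounce_homograph(word, sentence):
    """Pick a pronunciation for a homograph from tense cues in the sentence."""
    words = {w.strip(".,!?").lower() for w in sentence.split()}
    tense = "past" if words & PAST_CUES else "present"
    return HOMOGRAPHS[word.lower()][tense]


print(pronounce_homograph("read", "I read that book yesterday."))
print(pronounce_homograph("read", "I read every morning."))
```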
Speech Synthesis Step Two: Phonemes
In this step, the speech synthesizer determines which sounds make up the words that need to be spoken. These sounds are called phonemes. A basic approach to this step is to provide the computer with a list of dictionary words and accompanying phonemes; however, this method does not produce very natural-sounding speech, because when humans speak sentences, the phonemes may sound different depending on the rhythm, stress and intonation of the sentence. These qualities are collectively called prosody.
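The dictionary approach can be sketched as a simple lookup. The two-word LEXICON below is a made-up stand-in for a full pronunciation dictionary, with phonemes written in ARPAbet-style symbols and stress markers left out.

```python
# Hypothetical miniature pronunciation dictionary (stress markers omitted).
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}


def lookup_phonemes(sentence):
    """Convert a sentence to phonemes by looking each word up in LEXICON."""
    phonemes = []
    for raw in sentence.split():
        word = raw.strip(".,!?").lower()
        if word not in LEXICON:
            raise KeyError(f"{word!r} is not in the dictionary")
        phonemes.extend(LEXICON[word])
    return phonemes


print(lookup_phonemes("Hello, world!"))
```

Every word gets the same phonemes no matter where it falls in the sentence, which is why speech built this way can sound flat.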
An alternative method is to divide words into graphemes, which are the individual letters or letter combinations that represent sounds in a word, and produce phonemes based on a simple rule set for each grapheme. This has the advantage of making it possible for the computer to read any word, including made-up words, words in foreign languages, proper names and technical terms. The main drawback is that some languages, such as English, have many words that are pronounced differently from how they are written.
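As an illustration of the rule-based approach, the sketch below scans a word from left to right and always tries the longest grapheme first. The RULES table is a deliberately tiny, hypothetical rule set that covers only some letters, and the output uses ARPAbet-style symbols.

```python
# Hypothetical grapheme-to-phoneme rules; a real rule set is far larger.
RULES = {
    "sh": "SH", "ch": "CH", "th": "TH", "ee": "IY", "oo": "UW",
    "a": "AE", "b": "B", "d": "D", "e": "EH", "f": "F", "g": "G",
    "h": "HH", "i": "IH", "k": "K", "l": "L", "m": "M", "n": "N",
    "o": "AA", "p": "P", "r": "R", "s": "S", "t": "T", "u": "AH",
    "v": "V", "w": "W", "z": "Z",
}
MAX_LEN = max(len(grapheme) for grapheme in RULES)


def graphemes_to_phonemes(word):
    """Convert a word to phonemes by greedy longest-match over RULES."""
    word = word.lower()
    phonemes = []
    i = 0
    while i < len(word):
        for size in range(MAX_LEN, 0, -1):
            chunk = word[i:i + size]
            if chunk in RULES:
                phonemes.append(RULES[chunk])
                i += len(chunk)
                break
        else:
            raise ValueError(f"no rule for {word[i]!r}")
    return phonemes


print(graphemes_to_phonemes("sheep"))  # a real word
print(graphemes_to_phonemes("zib"))    # a made-up word
```

The made-up word is handled just as easily as the real one, but an irregular spelling such as "though" would come out wrong under rules like these.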
Speech Synthesis Step Three: Sound
Computers produce speech sounds in three main ways. The first is to use a recording of a human speaking the phonemes. In the second, the computer generates the phonemes by using basic sound frequencies. In the third, the computer mimics the human voice by modeling how the human vocal apparatus produces sound.
Speech synthesizers that use recordings of the human voice are preloaded with short clips of human sounds that the computer arranges to form words. This is called concatenative speech synthesis. It is the most natural-sounding type of speech synthesis; however, each set of recordings is limited to a single voice and language.
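In code, the core of the concatenative idea is simply joining stored clips in order. In the sketch below, the CLIPS table holds short made-up lists of numbers in place of real recorded audio, and the function name is an assumption for this example. Real systems also smooth the joins between clips, which is not shown here.

```python
# Hypothetical clip library: each phoneme maps to placeholder audio samples.
CLIPS = {
    "HH": [0.1, 0.2],
    "AY": [0.3, 0.4, 0.5],
}


def concatenate_clips(phonemes):
    """Join the stored clip for each phoneme into one stream of samples."""
    samples = []
    for phoneme in phonemes:
        samples.extend(CLIPS[phoneme])
    return samples


print(concatenate_clips(["HH", "AY"]))
```

A phoneme that is missing from the library cannot be spoken at all, which reflects the limitation described above.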
Formant speech synthesizers generate speech based on the three to five key resonant frequencies, called formants, that characterize sounds made by the human voice. These synthesizers can say anything because they are not limited to a preloaded library of sounds.
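The idea of building a sound from a few key frequencies can be sketched by adding sine waves together. This is a heavy simplification: practical formant synthesizers typically shape a sound source with resonant filters rather than summing pure tones. The frequencies and amplitudes in AH_FORMANTS are rough illustrative values for an "ah" vowel, and all names here are assumptions for this example.

```python
import math

SAMPLE_RATE = 16000  # samples per second

# Rough, illustrative (frequency in Hz, relative amplitude) pairs for "ah".
AH_FORMANTS = [(730, 1.0), (1090, 0.5), (2440, 0.25)]


def synthesize_vowel(formants, duration=0.1, sample_rate=SAMPLE_RATE):
    """Sum one sine wave per formant and scale the result to the range -1..1."""
    count = round(duration * sample_rate)
    samples = []
    for i in range(count):
        t = i / sample_rate
        value = sum(amp * math.sin(2 * math.pi * freq * t)
                    for freq, amp in formants)
        samples.append(value)
    peak = max(abs(s) for s in samples) or 1.0
    return [s / peak for s in samples]


samples = synthesize_vowel(AH_FORMANTS)
print(len(samples))
```

Changing the list of frequencies produces a different sound without any recordings, which is why this kind of synthesizer is not tied to a library of clips.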
Articulatory synthesizers model the human voice. This is the most complex method and should be capable of producing the most natural-sounding speech; however, computer technology has not yet reached the level where machines can model the human vocal apparatus well enough to produce natural-sounding speech.
New uses for speech synthesizers are being invented all the time, and as the technology improves, talking computers are likely to become a more common part of everyday life. Today, even the most natural-sounding computer-generated speech is usually distinguishable from the real thing, but someday that will likely no longer be the case.
Publish Date: November 30, 2021 2:10 PM