Earlier this summer I stood on the roof of Lernout & Hauspie's headquarters in Ieper, together with Jo Lernout, surveying the Flanders Language Valley. It led me to ponder how far the industry has come in the twenty or so years I have been involved in it.
The Flanders Language Valley contains not only a large HQ for L&H, but also several smaller buildings for start-up ventures, clustered around a training centre. This visionary enterprise symbolises several stages of a maturing industry. The size of the HQ for one of the first-generation speech and language companies illustrates the maturing of a fledgling industry into the mainstream of Information Technology and Telecommunications. As is often the case with a maturing industry, it spawns other companies tackling specific market and technological opportunities, represented by the cluster of small buildings in the Flanders Language Valley, many yet to be completed and occupied. One of the first to be occupied will be the e-commerce joint venture between L&H and Intel.
The training centre initiative is a bold one and a recognition of the specific skills needed to make speech and language technology a success. The industry has reached a stage where the need now is not so much for research staff as for application engineers: people who understand the particular requirements of good interface design, and knowledge engineers who can extract and encode the vast knowledge bases typical of most advanced speech and language applications. This activity of course has to be replicated in the language of each target market. So concerned are Jo Lernout and Pol Hauspie about the lack of expertise available that they are investing their own money in the establishment of a number of training and education centres around the world.
The maturing of the speech technology industry has in part been brought about by a combination of dramatic advances in computing power, memory capacity and decreasing costs, together with perhaps less dramatic incremental improvements in the algorithms. In fact, looking back over several decades, it is possible to chart a number of step-function improvements in speech recognition performance, almost entirely due to the availability of adequate memory for storing speech training data and of processor power to cope with statistically based algorithms.
In one of my earlier columns I commented on how some researchers regarded text-to-speech synthesis (TTS) development as like squeezing toothpaste out of a tube, and recognition development as more like putting it back again. This was a reflection on how difficult it seemed to create large vocabulary continuous speech recognition systems compared to the achievements of unlimited vocabulary TTS. At the time I challenged this view on the basis of the lack of progress made in improving the naturalness of speech synthesis systems, or, putting it another way, in achieving that last 10% needed to create really usable TTS that members of the public would be happy to listen to. The lack of naturalness has been largely responsible for the limited application of TTS to date. I believe, however, that a step-function improvement has recently been achieved in speech synthesis, and, in a way similar to speech recognition, it is almost entirely due to the availability of cheap memory.
Speech synthesis was originally based on the pioneering work of researchers such as Gunnar Fant and Dennis Klatt. This relied on a model of human speech production in which a very small number of parameters could be used to drive an electronic synthesiser that mimicked the characteristics of the human vocal tract. Rules are required to translate ordinarily spelt sentences into a string of these parameters. More recent techniques have used small speech segments, so-called diphones, derived from real speech recordings. A diphone is a piece of sound that spans from the middle of one basic unit of speech – a phoneme – to the middle of the next. By including the sound that occurs at the boundary of two sounds, such as two different vowels, the problems of synthesising these transitions are overcome. Because there are many different transitional sounds, depending on the context, a large inventory of diphones is needed. These diphone segments are concatenated together, using pronunciation rules, to produce a word or sentence. This approach requires much more memory than the original synthesis techniques, since encoded real speech segments have to be stored.
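The concatenation step can be sketched in a few lines. This is a minimal, illustrative sketch only – the inventory, phoneme labels and waveforms are invented, not taken from any real TTS system:

```python
# Minimal sketch of diphone concatenation (all names and data are illustrative).
# A diphone spans from the middle of one phoneme to the middle of the next,
# so the hard-to-model transition between sounds is captured inside each unit.

# Hypothetical inventory: each diphone maps to a recorded waveform
# (represented here as a short list of samples).
DIPHONE_INVENTORY = {
    ("sil", "k"): [0.0, 0.1, 0.2],
    ("k", "ae"): [0.3, 0.4],
    ("ae", "t"): [0.5, 0.6],
    ("t", "sil"): [0.7, 0.0],
}

def phonemes_to_diphones(phonemes):
    """Pad a phoneme sequence with silence and pair up adjacent phonemes."""
    padded = ["sil"] + phonemes + ["sil"]
    return list(zip(padded, padded[1:]))

def synthesise(phonemes):
    """Concatenate the stored waveform segment for each diphone in turn."""
    samples = []
    for diphone in phonemes_to_diphones(phonemes):
        samples.extend(DIPHONE_INVENTORY[diphone])
    return samples

# "cat" (k-ae-t) is built from four diphones, silence to silence.
print(synthesise(["k", "ae", "t"]))
```

A real synthesiser additionally smooths pitch and energy across the joins; the point of the sketch is simply that output is assembled from stored fragments of real speech, which is why memory cost grows with inventory size.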
The availability of large, low-cost memory has pushed this approach a step further, and a new generation of synthesiser is now being launched that significantly improves naturalness. A much larger inventory of speech segments than is used in the diphone approach is joined together to produce the required word or sentence. The use of larger, as well as different, real speech segments has resulted in a much more natural-sounding TTS than the widely adopted diphone approach. L&H have released the first commercial product that I am aware of using this approach, and the improvement in speech quality is impressive. The initial release is for American English, with other languages to follow. AT&T has an interactive demonstration of their own technology, also using a similar approach, on their web site.
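With a large inventory there are many recorded candidates for each segment, and the synthesiser must pick the sequence that joins most smoothly. The sketch below illustrates that selection idea with an invented single-feature "join cost" and a greedy search; real systems use richer acoustic features and full dynamic-programming search, so treat every name and number here as an assumption:

```python
# Sketch of unit selection over a large segment inventory (illustrative only).
# Each unit has several recorded candidates, stored as (end_pitch, segment_id);
# the pitch values and segment ids are invented for the example.
CANDIDATES = {
    "he":  [(0.2, "he_1"), (0.5, "he_2")],
    "llo": [(0.3, "llo_1"), (0.6, "llo_2")],
}

def select_units(units):
    """Greedily pick, for each unit, the candidate that joins most smoothly.

    The join cost here is just the pitch mismatch with the previous segment,
    a stand-in for the multi-feature costs a real synthesiser would use.
    """
    chosen, prev_pitch = [], None
    for unit in units:
        best = min(
            CANDIDATES[unit],
            key=lambda c: 0 if prev_pitch is None else abs(c[0] - prev_pitch),
        )
        chosen.append(best[1])
        prev_pitch = best[0]
    return chosen

print(select_units(["he", "llo"]))  # picks the smoothest-joining pair
```

The design point is that quality comes from having many alternatives to choose between, which is exactly why this approach only became practical once memory got cheap.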
Although some work is still required on the pronunciation rules, particularly for names, the improvement in quality that this approach delivers should result in an important step forward for the public deployment of TTS in both PC and network applications. A growing number of companies now offer over-the-phone e-mail reading products and services, as well as information from the Internet. With the large number of subscription-free ISPs in the UK market, all seeking to differentiate their products, improved TTS quality could be one significant factor in customer acceptance for those wishing to offer such services.
Jeremy Peckham has over 20 years' experience in the voice processing industry as a scientist and consultant and latterly as a businessman and entrepreneur. He began his career with the Royal Aircraft Establishment, spent 12 years with Logica, and then founded and ran the UK speech specialist Vocalis, floating the company on the London Stock Exchange in 1996. He is currently managing director of Strategis Consulting Ltd and chairman of The Speech Recognition Company Ltd.
Published: Monday, December 2, 2002