Axel Horndasch
Using Contextual Information to Process Out-of-Vocabulary Words in Spoken Dialog Systems
Speech is a very efficient channel for communication, but words that are unknown to one of the dialogue partners pose a challenge. Human beings are very good at dealing with words that are not in their vocabulary, for example by exploiting the context in which the unknown word appears. In this way, the word can be categorized and, based on this information and the pronunciation, its spelling can be worked out as well. Named entities such as the names of cities, persons, companies and products are a key source of task-relevant out-of-vocabulary (OOV) words when they are rarely used or newly invented. In a sentence like “Please give me information about the Argentinian soccer player Jorge Burruchaga!”, for example, Mr. Burruchaga’s name is the most important piece of information. Automatic speech recognition (ASR) systems usually map words that were not part of the training material to similar-sounding words without considering the word class; such recognition errors often mean that the human-machine dialogue cannot be completed successfully. This work describes how an ASR system can be enhanced with functionality that closely resembles the human method, so that OOV words can be detected, categorized and recovered in written form. The suggested approach, hierarchical hybrid word-class-based OOV detection in combination with sub-word units, is integrated into the widely used Kaldi speech recognition toolkit. Experiments on the speech corpora EVAR and SmartWeb show that more than 70% of unknown city names and about 50% of OOV celebrity names can be detected while at the same time improving the word error rate of the system.