The Technology

 


The technologies behind Speech Recognition systems vary as much as the voice itself, but the underlying approach is essentially the same across today's major applications. In the simplest sense, speech is input into the computer, where the Speech Recognition program parses and identifies it. The processor then runs a series of algorithms to determine what was most likely said (drawing on the technologies explored below) and responds to the spoken message, treating it either as a command or as speech-to-text input.

The ultimate objective of developing SR technologies is to create a system through which humans can speak to a machine in the same way they would converse with another human being. Essentially, we would speak in natural language to a humanized computer system, without regard to perfect syntax or grammar.

"When a speech recognition system is combined with a natural language processing system, the result is an overall system that not only recognizes voice input but also understands it." (Turban)

 

Natural Language Processing (NLP) has two basic methods for interpreting voice input:

1)  Keywording: The speech is recorded and the computer generates results based on important words or phrases. This approach works well for performing tasks on an operating system: "Open file", "select all", etc. Keywording is also used in call centers (e.g., you say the party's name or extension instead of pressing keys on the number pad).

2)  Syntactic and Semantic Analysis: This process is much more complex than Keywording. As the speaker talks, the SR program parses the audio and computes what it believes the user said. This technique requires an extensive set of algorithms, rules, and definitions. For instance, when the word "two" is spoken into the system, the program can predict that "2" is intended (instead of "too" or "to"). The computer determines the appropriate meaning of this homophone by analyzing the syntax, semantics, and sentence structure. This method is best applied to word processing and data entry.
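
The two methods above can be sketched in a few lines of Python. This is only a toy illustration, not how any particular SR product works: the command table, the cue word lists, the "<tu>" placeholder for the ambiguous sound, and the function names are all assumptions made up for this example.

```python
import re

# Hypothetical command table for Keywording (illustrative only).
COMMANDS = {
    "open file": "OPEN_FILE",
    "select all": "SELECT_ALL",
}

def spot_keyword(transcript):
    """Keywording: return the action for the first known phrase, else None."""
    # Normalize to lowercase words separated by single spaces.
    text = " ".join(re.findall(r"[a-z']+", transcript.lower()))
    for phrase, action in COMMANDS.items():
        if re.search(r"\b" + re.escape(phrase) + r"\b", text):
            return action
    return None

# Toy cue lists for the homophone example (assumed, far from complete).
COUNTABLE_NOUNS = {"files", "copies", "pages"}
VERBS = {"open", "save", "print"}

def resolve_tu(next_word):
    """Pick "2", "to", or "too" for the same sound, using the next word."""
    if next_word in COUNTABLE_NOUNS:
        return "2"      # a quantity before a plural noun
    if next_word in VERBS:
        return "to"     # infinitive marker before a verb
    if next_word is None:
        return "too"    # sentence-final, as in "me too"
    return "to"         # fallback guess

def disambiguate(tokens):
    """Replace each "<tu>" placeholder with the spelling its context suggests."""
    result = []
    for i, token in enumerate(tokens):
        if token != "<tu>":
            result.append(token)
            continue
        next_word = tokens[i + 1] if i + 1 < len(tokens) else None
        result.append(resolve_tu(next_word))
    return result
```

For example, spot_keyword("Please open file now") picks out the "open file" command, while disambiguate(["print", "<tu>", "copies"]) chooses "2" because a plural noun follows. A real system would rely on a far larger set of rules and definitions than these three checks.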

Another important capability associated with SR is the program's ability to understand fluid speech rather than unnatural speech with pauses between each word. This ability marks the difference between Continuous Speech systems and Discrete Speech systems. Discrete Speech systems are not conducive to natural human speech, but they are highly accurate. Continuous Speech systems, which come closer to a person's natural way of talking, have a lower accuracy rate.
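
One way to see why pauses help is to imagine splitting the audio into words wherever it goes quiet. The sketch below assumes the audio has already been reduced to a list of per-frame loudness values; the function name, the threshold, and the minimum pause length are hypothetical choices for illustration, not values taken from any real product.

```python
def split_on_pauses(energies, threshold=0.1, min_pause=3):
    """Return (start, end) frame ranges for each stretch of speech.

    energies: per-frame loudness values; frames below threshold are silence.
    A stretch ends only after min_pause consecutive silent frames.
    The end index is exclusive.
    """
    segments = []
    start = None        # first frame of the current stretch, if any
    silent_run = 0      # consecutive silent frames seen inside a stretch
    for i, energy in enumerate(energies):
        if energy >= threshold:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_pause:
                # Pause is long enough: close the stretch before the silence.
                segments.append((start, i - silent_run + 1))
                start = None
                silent_run = 0
    if start is not None:
        # Audio ended mid-stretch; drop any trailing silent frames.
        segments.append((start, len(energies) - silent_run))
    return segments
```

With clear pauses between words, as in [0, 0.5, 0.6, 0, 0, 0, 0.7, 0.8, 0, 0, 0], the function returns two separate ranges, one per word. With fluid speech, as in [0, 0.5, 0.6, 0, 0.7, 0.8, 0], the short dip is not long enough to count as a pause, so both words come back as a single range and the word boundary has to be found some other way.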

Several companies have developed and distributed "Speech Engines." These engines are essentially databanks of possible words, phrases, syllables, phonemes, etc., which SR programs search to find a reasonable result. Each developer's speech engine operates on a different principle. For instance, the Microsoft Speech Recognition Engines use either an "acoustic model" or a "dictation language model." Other companies have their own specifications.
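
The idea of searching a databank for a reasonable result can be illustrated with a very small lookup. The sketch below is not how the Microsoft engines or any other vendor's engine works: the four-word lexicon, the phoneme spellings, and the best_match function are assumptions invented for this example, and the similarity measure comes from Python's standard difflib module.

```python
from difflib import SequenceMatcher

# Hypothetical mini "databank": phoneme sequences mapped to words.
LEXICON = {
    ("OW", "P", "AH", "N"): "open",
    ("F", "AY", "L"): "file",
    ("S", "AH", "L", "EH", "K", "T"): "select",
    ("AO", "L"): "all",
}

def best_match(phonemes):
    """Return (word, score) for the lexicon entry most similar to the input.

    The score is difflib's similarity ratio between 0.0 and 1.0.
    """
    scored = [
        (SequenceMatcher(None, phonemes, entry).ratio(), word)
        for entry, word in LEXICON.items()
    ]
    score, word = max(scored)
    return word, score
```

Calling best_match(("F", "AY", "L")) finds "file" as an exact match, and a slightly noisy input such as ("F", "AY", "L", "D") still lands on "file" because no other entry is closer. A real engine searches a vastly larger databank and weighs many more factors than raw sequence similarity.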

 
Copyright © 1999 Ira Greenberg and Andrew Bate.  All Rights Reserved.