The technologies supporting Speech Recognition (SR) systems vary as
much as the voice itself. However, the underlying process is
basically the same for all the major applications today. In the simplest
sense, speech is input into the computer and is then parsed and/or
identified by the Speech Recognition program. Next, the processor runs
a series of algorithms to determine what is believed to have been said
(based on other technologies to be explored next) and responds to the
audible message, treating it either as a command or as speech-to-text input.

The ultimate objective of developing SR technologies is to create a system
through which humans can speak to a machine in the same way they would
converse with another human being. Essentially, we will speak to the
humanized computer system in natural language, without regard to perfect
syntax or grammar.

"When a speech recognition system is combined with a natural language processing system, the result is an overall system that not only recognizes voice input but also understands it." (Turban)

Natural Language Processing (NLP) has two basic methods for interpreting voice input:
1) Keywording:
The speech is recorded and the computer generates results based on
important words or phrases. This approach works well for performing
tasks on an operating system, such as "Open file" or "Select all".
Keywording is also used in call centers (e.g., you say the
party's name or extension instead of pressing keys on the number pad).
2) Syntactic and Semantic Analysis:
This process is much more complex than Keywording. As the
speaker talks, the SR program parses the audio and computes what
it believes the user said. This technique requires
an extensive set of algorithms, rules, and definitions. For instance, when the
word "two" is spoken into the system, the program can predict that
"2" is intended (instead of "too" or "to"). The computer
may determine the appropriate meaning of this homophone by analyzing the syntax,
semantics, and sentence structure. This method is best applied to word
processing and data entry.
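
The two methods above can be illustrated with a small Python sketch. It is only a toy under stated assumptions: it works on text that has already been transcribed, and the command table, the word lists, and the function names (keyword_match, resolve_two) are hypothetical, invented here for illustration. Real systems rely on far larger rule sets and statistical models.

```python
import re

# Hypothetical command table for Keywording: spoken phrase -> action name.
COMMANDS = {
    "open file": "OPEN_FILE",
    "select all": "SELECT_ALL",
}

def keyword_match(transcript):
    """Return the action of the first command phrase (in table order)
    that appears in the transcript, or None if no phrase is found."""
    # Normalize: lowercase and keep only words, separated by single spaces.
    text = " ".join(re.findall(r"[a-z']+", transcript.lower()))
    for phrase, action in COMMANDS.items():
        if re.search(r"\b" + re.escape(phrase) + r"\b", text):
            return action
    return None

# Toy context rules for the homophone "two" / "to" / "too" (assumed lists).
NUMBER_FOLLOWERS = {"files", "copies", "pages", "dollars", "times"}
VERB_FOLLOWERS = {"open", "select", "save", "print", "go"}

def resolve_two(words, i):
    """Choose '2', 'to', or 'too' for the homophone at index i,
    looking only at the word that follows it."""
    nxt = words[i + 1] if i + 1 < len(words) else None
    if nxt in NUMBER_FOLLOWERS:
        return "2"      # a countable noun follows: a number is likely
    if nxt in VERB_FOLLOWERS:
        return "to"     # a verb follows: infinitive marker
    if nxt is None:
        return "too"    # sentence-final: "also"
    return "to"         # default: the most common spelling

if __name__ == "__main__":
    print(keyword_match("Please open file now"))
    print(resolve_two(["print", "TWO", "copies"], 1))
```

Keywording ignores everything except the phrases it knows, which is why it suits fixed command sets. The homophone resolver has to look at neighboring words, which hints at why syntactic and semantic analysis needs many more rules.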

Another important capability associated with SR is the program's ability to
understand fluid speech, as opposed to unnatural speech with pauses between
each word. This ability marks the difference between Continuous
Speech systems and Discrete Speech systems. While Discrete Speech systems are
not conducive to natural human speech, they are highly accurate. The
Continuous Speech model, on the other hand, is closer to a human's natural
way of talking but, as expected, has a lower accuracy rate.

Several companies have developed and distributed "Speech Engines." These
"engines" are essentially databanks of all possible words,
phrases, syllables, phonemes, etc., through which SR programs search to find
a reasonable result. Each developer's speech engine
operates on a different principle. For instance, the Microsoft Speech
Recognition Engines use either an "acoustic model" or a "dictation language
model." Other companies have their own specifications.
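
To make the idea of searching a databank for "a reasonable result" concrete, here is a minimal Python sketch. It assumes the audio has already been converted to a sequence of phoneme symbols, and it picks the lexicon word whose stored phonemes are closest by edit distance. The tiny LEXICON, the phoneme spellings, and the function names are hypothetical, and this is not how the Microsoft engines or any other vendor's engine actually works.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of phoneme symbols."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]

# Hypothetical four-word "databank": word -> phoneme sequence.
LEXICON = {
    "open": ["OW", "P", "AH", "N"],
    "file": ["F", "AY", "L"],
    "select": ["S", "AH", "L", "EH", "K", "T"],
    "all": ["AO", "L"],
}

def best_word(phonemes):
    """Return the lexicon word whose phonemes are closest to the input."""
    return min(LEXICON, key=lambda w: edit_distance(phonemes, LEXICON[w]))

if __name__ == "__main__":
    # One phoneme was misheard ("EH" instead of "AH").
    print(best_word(["OW", "P", "EH", "N"]))
```

Choosing the nearest entry rather than requiring an exact match lets the search tolerate small recognition errors, at the cost of sometimes returning a wrong word when the input is far from everything in the lexicon.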
Copyright © 1999 Ira Greenberg and Andrew Bate. All Rights Reserved.