Just a little reading on computers and speech:
A sound wave from speech is a complex mix of many waves at different frequencies. The particular frequencies, how they change over time, and how strongly each one comes through matter a lot in telling the difference between, say, an "ah" sound and an "ee" sound. Further mathematical operations transform the complex wave into a numerical representation of the important features.
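To make the "mix of waves" idea concrete, here is a minimal sketch that builds a made-up signal from two sine waves and then recovers their frequencies with a naive discrete Fourier transform. The sample rate, the two frequencies, and the function names are all assumptions chosen for illustration; real recognizers use faster transforms and more elaborate features.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Return the strength of each frequency bin using a naive O(n^2) DFT."""
    n = len(samples)
    mags = []
    # For a real-valued signal, bins above n/2 mirror the lower ones.
    for k in range(n // 2):
        total = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        mags.append(abs(total) / n)
    return mags

def strongest_frequencies(samples, sample_rate, count):
    """Return the `count` strongest frequencies (in Hz), sorted low to high."""
    mags = dft_magnitudes(samples)
    n = len(samples)
    top_bins = sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)[:count]
    return sorted(k * sample_rate / n for k in top_bins)

# Hypothetical 10 ms chunk: a strong 300 Hz wave plus a weaker 1200 Hz wave.
SAMPLE_RATE = 8000
N = 80
signal = [math.sin(2 * math.pi * 300 * t / SAMPLE_RATE)
          + 0.5 * math.sin(2 * math.pi * 1200 * t / SAMPLE_RATE)
          for t in range(N)]

print(strongest_frequencies(signal, SAMPLE_RATE, 2))  # [300.0, 1200.0]
```

The magnitudes also show how strongly each frequency comes through: the 300 Hz bin is twice as large as the 1200 Hz bin, matching the amplitudes used to build the signal.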
4. LOOK AT SMALL CHUNKS OF THE DIGITIZED SOUND ONE AFTER THE OTHER AND GUESS WHAT SPEECH SOUND EACH CHUNK SHOWS.
There are about 40 speech sounds, or phonemes, in English. The computer has a general idea of what each of them should look like because it has been trained on many examples. But not only do the characteristics of these phonemes vary with speakers' accents, they also change depending on the phonemes next to them: the 't' in "star" looks different from the 't' in "city." The computer must have a model of each phoneme in many different contexts to make a good guess.
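A toy version of the chunk-by-chunk guessing might look like the sketch below. It assumes each chunk has already been boiled down to two numbers (rough resonance frequencies in Hz) and simply picks the stored example that is closest. The template values, the labels, and the function name are illustrative assumptions; a real system uses statistical models trained on many speakers and contexts, not three hand-typed points.

```python
import math

# Hypothetical templates: (label, (first resonance in Hz, second resonance in Hz)).
# The numbers are rough illustrative values, not measurements from any system.
TEMPLATES = [
    ("ah", (730.0, 1090.0)),
    ("ee", (270.0, 2290.0)),
    ("oo", (300.0, 870.0)),
]

def guess_phoneme(features):
    """Return the label of the template nearest to the chunk's features."""
    nearest = min(TEMPLATES, key=lambda item: math.dist(item[1], features))
    return nearest[0]

# Three made-up chunks, each slightly off from its template.
chunks = [(700.0, 1150.0), (290.0, 2200.0), (320.0, 900.0)]
guesses = [guess_phoneme(chunk) for chunk in chunks]
print(guesses)  # ['ah', 'ee', 'oo']
```

Handling context, as in the "star" versus "city" example, would mean storing separate templates for the same phoneme in different surroundings.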
5. GUESS POSSIBLE WORDS THAT COULD BE MADE UP OF THOSE PHONEMES.
The computer has a big list of words that includes the different ways they can be pronounced. It makes guesses about what words are being spoken by splitting up the string of phonemes into strings of permissible words. If it sees the sequence "hang ten," it shouldn't split it into "hey, ngten!" because "ngten" won't find a good match in the dictionary.
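Splitting a phoneme string into permissible words can be sketched as a small recursive search. The tiny lexicon below, its phoneme spellings, and the function name are assumptions made up for this example; a real pronunciation dictionary is far larger and lists several pronunciations per word.

```python
# Hypothetical lexicon: phoneme sequence -> word.
LEXICON = {
    ("HH", "AE", "NG"): "hang",
    ("T", "EH", "N"): "ten",
    ("HH", "AE"): "ha",
    ("EH", "N"): "en",
}

def segmentations(phonemes):
    """Return every way to split the phoneme sequence into lexicon words."""
    if not phonemes:
        return [[]]  # one valid way to split nothing: no words
    results = []
    for end in range(1, len(phonemes) + 1):
        word = LEXICON.get(tuple(phonemes[:end]))
        if word is not None:
            # Keep this word only if the remainder can also be split.
            for rest in segmentations(phonemes[end:]):
                results.append([word] + rest)
    return results

print(segmentations(["HH", "AE", "NG", "T", "EH", "N"]))  # [['hang', 'ten']]
```

The split that starts with "ha" is thrown away because the leftover phonemes (NG, T, EH, N) match nothing in the lexicon, which is the same reason "ngten" fails.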
6. DETERMINE THE MOST LIKELY SEQUENCE OF WORDS BASED ON HOW PEOPLE ACTUALLY TALK.
There are no word breaks in the speech stream. The computer has to figure out where to put them by finding strings of phonemes that match valid words. There can be multiple guesses about what English words make up the speech stream, but not all of them will make good sequences of words. "What do cats like for breakfast?" could be just as good a guess as "water gaslight four brick vast?" if words are the only consideration. The computer applies models of how likely one word is to follow the next in order to determine which word string is the best guess. Some systems also take into account other information, like dependencies between words that are not next to each other. But the more information you want to use, the more processing power you need.
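One simple model of "how likely one word is to follow the next" is a bigram model. The sketch below scores two candidate word strings with a handful of invented probabilities; every number, the fallback value for unseen pairs, and the function names are assumptions for illustration, not figures from any real system.

```python
import math

# Hypothetical probabilities that the second word follows the first.
# "<s>" marks the start of the sentence.
BIGRAM_PROB = {
    ("<s>", "what"): 0.05,
    ("what", "do"): 0.2,
    ("do", "cats"): 0.01,
    ("cats", "like"): 0.1,
    ("like", "for"): 0.05,
    ("for", "breakfast"): 0.02,
    ("<s>", "water"): 0.01,
}
UNSEEN_PROB = 1e-6  # assumed fallback for word pairs never seen in training

def log_score(words):
    """Sum the log-probabilities of each word given the word before it."""
    total = 0.0
    previous = "<s>"
    for word in words:
        total += math.log(BIGRAM_PROB.get((previous, word), UNSEEN_PROB))
        previous = word
    return total

def best_guess(candidates):
    """Return the candidate word string with the highest score."""
    return max(candidates, key=log_score)

candidates = [
    "what do cats like for breakfast".split(),
    "water gaslight four brick vast".split(),
]
print(" ".join(best_guess(candidates)))  # what do cats like for breakfast
```

Working with log-probabilities instead of multiplying raw probabilities avoids tiny numbers underflowing on long sentences. Taking longer-range dependencies into account means scoring more than adjacent pairs, which is where the extra processing cost comes from.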
7. TAKE ACTION.
Once the computer has decided which guesses to go with, it can take action. In the case of dictation software, it will print the guess to the screen. In the case of a customer service phone line, it will try to match the guess to one of its pre-set menu items. In the case of Siri, it will make a call, look up something on the Internet, or try to come up with an answer to match the guess. As anyone who has used speech recognition software knows, mistakes happen. All the complicated statistics and mathematical transformations might not prevent "recognize speech" from coming out as "wreck a nice beach," but for a computer to pluck either one of those phrases out of the air is still pretty incredible.
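For the phone-menu case, matching a recognized phrase to a pre-set item could be sketched with fuzzy string matching from Python's standard library. The menu items, the cutoff value, and the function name are assumptions for this example; an actual phone system likely uses its own grammar or intent model.

```python
import difflib

# Hypothetical menu for a customer service line.
MENU_ITEMS = ["check balance", "pay bill", "speak to an agent"]

def match_menu_item(guess, cutoff=0.6):
    """Return the menu item most similar to the guess, or None if nothing is close."""
    matches = difflib.get_close_matches(guess.lower(), MENU_ITEMS, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(match_menu_item("Pay my bill"))  # pay bill
```

Returning None when nothing is similar enough gives the system a chance to ask the caller to repeat the request instead of acting on a bad guess.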
- See more at: http://mentalfloss.com/article/31609....wydD5FLV.dpuf