Glossary of Speaker Verification and Identification terms
The following explains technical words and phrases used by speech industry professionals.
biometric-based technology (also biometrics) Technology that verifies or identifies individuals by analyzing a facet of their physiology and behavior (e.g., voices, fingerprints).
equal error rate A threshold setting in a speaker-authentication application that results in an approximately equal percentage of false acceptance errors and false reject errors.
false acceptance The kind of error that occurs when a speaker-verification application allows an impostor to get in. A synonym for this is false match.
false rejection The kind of error that occurs when a verification system rejects a valid user. A synonym for this is false non-match.
impostor A person who falsely claims to be a valid user.
speaker identification The process of finding and attaching a speaker identity to the voice of an unknown speaker. Automated speaker identification does this by comparing the voice with stored samples in a database of voice models.
speaker recognition An ambiguous term. 1. A synonym for speaker identification. 2. A generic term referring to many spoken technologies applied to speakers, including speaker identification and speaker verification.
speaker verification The process of determining whether a person is who she/he claims to be. It entails a one-to-one comparison between a newly input voiceprint (by the claimant) and the voiceprint for the claimed identity that is stored in the system.
text-dependent A variant of speaker verification that requires the use of a password, pass phrase, or another pre-established identifier (e.g., the speaker’s name).
text-independent A variant of speaker verification that can process freely spoken speech (an unconstrained utterance).
text-prompted A variant of speaker verification that asks users to repeat random numbers and/or words. A typical prompt might be “Say 25 84.” Some developers consider text prompting to be a kind of text-independent technology. It is also called challenge response.
threshold The degree to which a newly-input speech sample must match a stored voice model before a speaker-authentication system will accept the claim that both samples were spoken by the same person. Most speaker-recognition products have adjustable thresholds that can be set at different levels depending upon the security requirements of the application, the user, and/or organization.
voice model (also called voiceprint) A sample of someone’s speech that has been converted to a form that a speaker-recognition system can analyze.
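The relationship between threshold, false acceptance, false rejection, and equal error rate defined above can be illustrated with a minimal Python sketch. It assumes that the system produces a numeric match score for each attempt and accepts a claim when the score is at or above the threshold; the function names and the score values are hypothetical and made up for illustration, not taken from any product.

```python
def error_rates(genuine_scores, impostor_scores, threshold):
    """Return (false_acceptance_rate, false_rejection_rate) at a threshold.

    Assumption: a claim is accepted when its match score >= threshold.
    """
    false_accepts = sum(1 for s in impostor_scores if s >= threshold)
    false_rejects = sum(1 for s in genuine_scores if s < threshold)
    return (false_accepts / len(impostor_scores),
            false_rejects / len(genuine_scores))


def approximate_eer_threshold(genuine_scores, impostor_scores):
    """Return the observed score at which the two error rates are closest."""
    candidates = sorted(set(genuine_scores) | set(impostor_scores))

    def gap(threshold):
        far, frr = error_rates(genuine_scores, impostor_scores, threshold)
        return abs(far - frr)

    return min(candidates, key=gap)


# Made-up match scores, for illustration only.
genuine = [0.91, 0.85, 0.78, 0.66, 0.95]   # valid users
impostor = [0.30, 0.45, 0.70, 0.52, 0.20]  # impostors

eer_threshold = approximate_eer_threshold(genuine, impostor)
print(eer_threshold)                                  # 0.7
print(error_rates(genuine, impostor, eer_threshold))  # (0.2, 0.2)
```

Raising the threshold in this sketch lowers the false acceptance rate and raises the false rejection rate, which is why a high-security application might choose a setting above the equal-error-rate point.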
Glossary of Speech Recognition terms
The following is a sample of terms that are commonly used in the speech recognition industry:
conversational speech Speech recognition that is designed to interact in a natural-seeming fashion with a human being. Good conversational-speech systems are built to mimic the typical dialog/conversational patterns that occur between two humans. This approach is used widely over the telephone in interactive voice systems (IVRs) to allow callers to solve problems, get information, and make transactions (e.g., make an airline reservation).
dialog The back-and-forth verbal interaction between a speech-recognition system and a speaker.
keyword spotting Speech-recognition technology that looks for specific words or phrases in what a person has said rather than trying to recognize the entire utterance. It makes the system capable of ignoring variability that does not change meaning. For example, a keyword-spotting system would be able to recognize that in the following examples the callers are all asking for the same thing:
I want the mortgage department.
Mortgage department, please.
Mortgage.
Would you connect me to the mortgage department?
Let me see…I think I need to talk with the mortgage people.
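The idea can be sketched in a few lines of Python. This is a simplification under stated assumptions: it operates on text transcripts rather than on audio, and the `KEYWORDS` table and `spot_keyword` function are hypothetical names introduced only for this illustration.

```python
import re

# Hypothetical routing vocabulary: keyword -> destination.
KEYWORDS = {"mortgage": "mortgage department"}


def spot_keyword(transcript):
    """Return the destination for the first known keyword, or None."""
    words = re.findall(r"[a-z']+", transcript.lower())
    for word in words:
        if word in KEYWORDS:
            return KEYWORDS[word]
    return None


utterances = [
    "I want the mortgage department.",
    "Mortgage department, please.",
    "Mortgage.",
    "Would you connect me to the mortgage department?",
    "Let me see... I think I need to talk with the mortgage people.",
]

# Every phrasing is routed to the same place; the surrounding words are ignored.
for utterance in utterances:
    print(spot_keyword(utterance))
```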
natural language understanding Technology and products that incorporate elements of artificial intelligence that help to process spoken input. It is used, for example, by call-routing systems that ask “How may I help you?” Sometimes this phrase is used to refer to speech recognition systems that process “conversational speech” even if they do not use artificial intelligence.
speaker dependent Speech recognition that cannot be used effectively by a speaker until that speaker has trained every word in the system. Most name dialing systems in cell phones use this type of speech recognition.
speaker independent Speech recognition that can be used by people who haven’t first trained it. Speech recognition in interactive voice response (IVR) systems is speaker independent.
voice recognition A synonym for speech recognition. It is also, though rarely, used as a synonym for speaker verification.
Glossary of Speech Analytics terms
key performance indicator (or KPI) A word, phrase, or concept that a speech-analytics system is told to look for in the speech data. It could be many things, including something that call-center agents must say (e.g., a required text that must be read word-for-word); or a word, phrase, or concept of interest to the company (e.g., a competitor’s name).
lexical processing An approach to speech analytics that employs free-flowing speech recognition (called large vocabulary continuous speech recognition) that includes a dictionary of the words and phrases it knows. That dictionary can be very large. It is contrasted with phonetic processing.
phonemes The set of sounds and sound patterns in a language. For example, English has 40+ phonemes that include consonants (e.g., p, s, d), vowels (e.g., ah, eh, ay), and semi-vowels (e.g., y [as in you], w [as in will]). Phonemics also includes the rules of a language that say how those sounds can and cannot be combined. For example, in English, words can start with “s” + “t” or “s” + “p” but native English speakers perceive the combination of “p” + “t” at the start of a word as foreign even when they succeed in saying it.
phonetic processing/phonetics An approach to speech analytics that operates at the sound-sequence level of a language rather than looking for words (the lexical approach).
This approach is called phonetic searching because it can process sounds one step deeper than phonemics. That is, it knows the standard variations within a single phoneme that a native speaker will use. For example, native English speakers will almost invariably add a puff of air after “t” when it appears at the start of a word but they rarely add that puff when “t” appears in the middle or at the end of a word.
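A rough sketch of searching at the sound-sequence level, in Python. It assumes the recording has already been converted into a sequence of sound symbols; the ARPAbet-style symbols, the transcription, and the function name are illustrative assumptions, and the sketch ignores the within-phoneme variations that a real phonetic engine would model.

```python
def find_sound_sequence(stream, query):
    """Return the start positions where `query` occurs as a contiguous run."""
    n, m = len(stream), len(query)
    return [i for i in range(n - m + 1) if stream[i:i + m] == query]


# Illustrative sound sequence for "the mortgage people" (assumed transcription).
call_sounds = ["DH", "AH", "M", "AO", "R", "G", "IH", "JH",
               "P", "IY", "P", "AH", "L"]

# The search term is also expressed as sounds, not as a dictionary word.
mortgage_sounds = ["M", "AO", "R", "G", "IH", "JH"]

print(find_sound_sequence(call_sounds, mortgage_sounds))  # [2]
```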
Frequently asked questions
What is speech analytics?
The term speech analytics refers to tools that are used to extract information from large quantities of recorded speech, such as all of the calls made to or from a corporate call center. In other words, speech analytics is data mining applied to the spoken word rather than the written word.
Speech analytics systems generally employ a two-step process:
Step 1: create an index containing all the content words in all of the calls. Content words are the words that carry the bulk of the meaning (e.g., nouns and verbs), as opposed to function words, like “the” and “or.”
Step 2: look through the index for specific words, phrases, and patterns. Patterns may include conditional things, such as making a required statement after a sale has been completed, saying one of a short list of greetings within 5 seconds of the start of the call, or not saying the name of a competitor.
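The two steps can be sketched in Python as follows. This is a minimal illustration, assuming each call has already been transcribed into (seconds from start, word) pairs; the function names, the short function-word list, and the sample calls are all hypothetical.

```python
from collections import defaultdict

# A tiny, illustrative function-word list; a real system would use a fuller one.
FUNCTION_WORDS = {"the", "and", "or", "a", "to", "for", "you", "is", "of"}


def build_index(calls):
    """Step 1: map each content word to its (call_id, seconds) occurrences.

    `calls` maps call_id -> list of (seconds_from_start, word).
    """
    index = defaultdict(list)
    for call_id, words in calls.items():
        for seconds, word in words:
            word = word.lower()
            if word not in FUNCTION_WORDS:
                index[word].append((call_id, seconds))
    return index


def calls_with_greeting(index, greetings, within_seconds=5.0):
    """Step 2: find calls where a greeting occurs early enough in the call."""
    return {call_id
            for greeting in greetings
            for call_id, seconds in index.get(greeting, [])
            if seconds <= within_seconds}


# Made-up transcripts with timestamps, for illustration only.
calls = {
    "call-1": [(0.8, "hello"), (1.2, "thank"), (1.5, "you"),
               (1.9, "for"), (2.2, "calling")],
    "call-2": [(7.4, "hello"), (9.0, "mortgage")],
}

index = build_index(calls)
print(calls_with_greeting(index, ["hello", "welcome"]))  # {'call-1'}
```

Building the index once and then querying it is what lets the same set of calls be searched repeatedly for different words and patterns without reprocessing the audio.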
Speech analytics is used by corporations to improve customer service, enhance the performance of call-center agents, and assist marketing. For example, call-center management might want to ensure that their agents are correctly saying greetings and other required statements, or a customer-service department might want to quickly identify and respond to an uptick in complaints about a specific product.
Other uses for speech analytics include gathering information from videos, podcasts, broadcast news, and surveillance tapes. A synonym for speech analytics is audio mining.
Are voice recognition and speaker verification the same?
No. Despite how it sounds, the term voice recognition is generally used as a synonym for speech recognition. That is, it recognizes WHAT a person is saying but does not verify WHO that person is. It tries to figure out the words, numbers, phrases, and sentences.
Speaker verification is a biometric technology that uses aspects of a person’s physiology and behavior to validate a claim of identity. This is not true for voice recognition/speech recognition. In fact, voice recognition systems often try to minimize individual differences in speech so that they can understand the speech of a large number of people.
What is the difference between speaker verification and speaker authentication?
In most instances the two are synonymous. They both usually refer to the use of voice-based biometrics for verifying that a person is who she/he claims to be. Sometimes these terms are used to refer to systems that verify an identity claim using speech recognition or another non-biometric technology.
Technically, speaker verification refers to a one-to-one comparison of the voice of a person claiming to be a specific authorized user with the stored voice model of that user. Speaker authentication, on the other hand, can be extended to one-to-many comparisons as well. For example, telephone-banking systems often use the account number as the password or identifier. This means that holders of joint accounts will have the same password. The job of the speaker-authentication system is to determine whether a caller belongs to the group of authorized account holders.
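The distinction between the one-to-one and the group comparison can be sketched in Python. The sketch assumes, purely for illustration, that a voice model is a short list of numbers and that two models are compared by cosine similarity; actual products use their own model formats and scoring methods, and all names and values here are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Similarity of two equal-length numeric vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def verify(sample, claimed_model, threshold):
    """One-to-one: does the sample match the single claimed user's model?"""
    return cosine_similarity(sample, claimed_model) >= threshold


def authenticate_group(sample, group_models, threshold):
    """One-to-many: return the best-matching group member, or None."""
    best_name, best_score = None, threshold
    for name, model in group_models.items():
        score = cosine_similarity(sample, model)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Made-up voice models for the holders of one joint account.
joint_account = {
    "holder_a": [1.0, 0.0, 0.2],
    "holder_b": [0.1, 1.0, 0.3],
}
caller = [0.9, 0.1, 0.2]     # sounds like holder_a
stranger = [0.0, 0.2, 1.0]   # matches neither holder

print(authenticate_group(caller, joint_account, threshold=0.9))    # holder_a
print(authenticate_group(stranger, joint_account, threshold=0.9))  # None
```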
Some developers prefer to use the term authentication to emphasize that the function of the system is to certify or authenticate the identity of someone trying to access a system, building, or information.
There are a number of synonyms for this technology. The term voice recognition, however, is not one of them. It is a synonym for speech recognition and is rarely used to refer to speaker verification. The most widely used synonyms are voice verification and voice authentication.
Will a tape recorder fool a speaker-verification system?
Most good speaker-authentication technology has ways of testing for liveness. Such systems look for acoustic patterns that suggest that the voice has been recorded. This process is called anti-spoofing. When a tape attack is suspected, many well-designed systems add a text-prompted/challenge-response step to the authentication dialog. Depending upon the design of the system, it might, for example, ask the claimant to say something the true user has never spoken to the system, such as a randomly generated string of digits or the current date.
Will speech recognition work if I have a cold?
Usually, unless you have a severe cold or laryngitis. It may not work as well as when you are healthy. The same applies to speaker authentication.
Why isn’t speaker authentication 100% accurate?
No biometric technology is 100% accurate. For example, a DNA test that finds a “match” will report the likelihood that someone else also matches the DNA pattern. Fortunately, for DNA those likelihoods tend to be very low.
The numbers for commercial biometrics, such as speaker authentication, face recognition, and fingerprint recognition, tend to be higher because those systems must rely on the co-operative behavior of the individual providing the sample and they must account for differences between samples. For example, a speaker-authentication system must contend with background noise, differences between telephones, and the variable electronic noise on the telephone line. Iris recognition must overcome differences in lighting. Face recognition must also overcome differences between cameras, facial orientation, and appearance (e.g., moustache, sunglasses). Fingerprint systems must be sensitive to changes in skin dryness and finger orientation.
Despite these issues, biometric-based security outperforms PINs and passwords because it ensures that the person interacting with the security system is actually who they claim to be. PINs and passwords can only verify that the person knows the PIN or password.