Speaker verification involves accepting or rejecting a speaker's claim to be who they say they are by comparing a voice sample with a previously stored voiceprint. It requires the person to be enrolled first: acoustic characteristics are extracted from samples of their voice to create a voiceprint. The person is verified when a newly provided voice sample matches the enrolled voiceprint.
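The enrol-then-compare flow described above can be sketched in a few lines. This is only an illustration under stated assumptions: the three-number feature vectors, the averaging used to build the voiceprint, the cosine similarity score, and the threshold of 0.8 are all hypothetical choices and do not describe how any particular product works.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def enrol(samples):
    """Build a voiceprint by averaging the feature vectors of several samples."""
    count = len(samples)
    return [sum(values) / count for values in zip(*samples)]

def verify(voiceprint, sample, threshold=0.8):
    """Accept the identity claim if the sample is close enough to the voiceprint.

    The threshold is an assumed value; a real system would tune it to balance
    false acceptances against false rejections.
    """
    return cosine_similarity(voiceprint, sample) >= threshold

# Hypothetical feature vectors standing in for extracted acoustic characteristics.
enrolment_samples = [[0.9, 0.1, 0.4], [1.1, 0.1, 0.6]]
voiceprint = enrol(enrolment_samples)

genuine_attempt = [1.0, 0.12, 0.48]   # similar to the enrolled voice
impostor_attempt = [0.1, 1.0, 0.2]    # very different from the enrolled voice

print(verify(voiceprint, genuine_attempt))
print(verify(voiceprint, impostor_attempt))
```

In this sketch the first attempt is accepted and the second is rejected. Raising the threshold makes the check stricter, at the cost of rejecting more genuine speakers.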
To find out more, see our FAQ, download our non-technical white paper and listen to our demo!
Speech Recognition
Also known as automatic speech recognition (ASR), speech recognition is a technology where a computer attempts to identify words spoken by a person into a microphone or telephone. The ideal situation is for the computer to recognise with 100% accuracy all words spoken, regardless of the speaker characteristics, background noise, the number of words to recognise, or channel conditions. Much research has concentrated on accuracy, and as a result the accuracy of recognising continuous digit strings can be greater than 99%. This is because the digits (0-9) form a small vocabulary.
When the recognition task is fairly constrained, that is, the task is to recognise a small set of words, higher rates of accuracy can be achieved compared to less constrained tasks.
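Accuracy figures such as these are commonly derived from the word error rate: the number of word substitutions, deletions and insertions needed to turn the recogniser's output into the reference transcript, divided by the number of reference words. The following is a minimal sketch of that calculation; the function name and the example strings are illustrative assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # reference word was deleted
                curr[j - 1] + 1,                  # extra word was inserted
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)

# One of four digits is misrecognised, so the error rate is 0.25
# and the word accuracy is 75%.
print(word_error_rate("one two three four", "one two tree four"))
```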
Speech recognition is difficult because of the variability of speech sounds within words (the ‘t’ in ‘treat’, ‘butter’ and ‘bat’ has different acoustic characteristics in each) and across words (the ‘t’ in ‘talent show’ isn’t typically pronounced, but in ‘talent and…’ it is).
Other difficulties for speech recognition are changes in the channel and variability between and within speakers, such as changes in a speaker’s speaking rate or voice quality, socio-linguistic background, and dialect.
Text-to-speech (TTS) is computer software that converts text into audible speech. Just as speech recognition is difficult because of the variability of speech and the way the same sound changes depending on its context, so the generation of speech is made difficult by speech variability.
The difficulties in generating speech are often underestimated. The complexities involved in TTS are partly due to the flexibility of our vocal tract in producing sounds. We use our vocal cords to voice sounds and to change the pitch of the voice. We vary the shape of the vocal tract by changing the shape and position of the articulators (tongue, lips, jaw). When we produce speech, each sound is a result of its context both within a word and across words. Our articulators move from one position to another, anticipating the next sound and being affected by the previous one. A TTS system attempts to reproduce this, but typically we find that some sounds are unnatural, overarticulated, or underarticulated.
In most commercial systems, the TTS software has no understanding of the text being read. The system uses a set of rules for producing sounds, and references lists or dictionaries to guess how to read a piece of text.
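As a rough illustration of the rules-plus-lists approach, the sketch below expands abbreviations from a lookup table and reads digit strings one digit at a time. The table contents, the rule for digits, and the function name are all hypothetical; they are not taken from any particular TTS product.

```python
import re

# Hypothetical lookup list mapping abbreviations to the words to be spoken.
ABBREVIATIONS = {"dr": "doctor", "st": "street", "rd": "road"}

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

def normalise(text):
    """Turn raw text into a sequence of speakable words using simple rules."""
    words = []
    # Split the text into runs of letters or runs of digits; punctuation is dropped.
    for token in re.findall(r"[A-Za-z]+|[0-9]+", text):
        lower = token.lower()
        if lower.isdigit():
            # Rule: read numbers digit by digit.
            words.extend(DIGIT_WORDS[int(digit)] for digit in lower)
        else:
            # List lookup: expand known abbreviations, otherwise keep the word.
            words.append(ABBREVIATIONS.get(lower, lower))
    return " ".join(words)

print(normalise("Dr Smith, 42 High St"))
# prints "doctor smith four two high street"
```

The sketch also shows why this is a guess rather than understanding: ‘St’ is always read as ‘street’ here, even in a name such as ‘St John’ where ‘saint’ is intended, and ‘42’ is read as two digits even where ‘forty-two’ would be more natural.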
TTS is typically based on concatenating speech sounds. This involves recording a human speaker producing these sounds. The recorded sounds are then joined together to produce a huge variety of words and sentences.
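The joining step can be sketched as follows. Everything here is an assumption made for illustration: the tiny inventory of ‘recorded’ units (short lists of sample values), the unit names, and the linear crossfade used to smooth each join. Real systems use much larger inventories and more sophisticated smoothing.

```python
# Hypothetical unit inventory: each recorded sound is a short list of samples.
UNITS = {
    "k": [0.0, 0.2, 0.4, 0.2],
    "a": [0.5, 0.7, 0.7, 0.5],
    "t": [0.3, 0.1, 0.0, 0.0],
}

def crossfade_join(left, right, overlap):
    """Join two waveforms, linearly blending `overlap` samples at the seam.

    Assumes `overlap` is no longer than either waveform.
    """
    if overlap == 0:
        return left + right
    head, tail = left[:-overlap], left[-overlap:]
    blended = []
    for i in range(overlap):
        weight = (i + 1) / (overlap + 1)  # fades from the left unit to the right
        blended.append((1 - weight) * tail[i] + weight * right[i])
    return head + blended + right[overlap:]

def synthesise(unit_names, overlap=1):
    """Concatenate the named units in order, crossfading at each join."""
    output = list(UNITS[unit_names[0]])
    for name in unit_names[1:]:
        output = crossfade_join(output, UNITS[name], overlap)
    return output

waveform = synthesise(["k", "a", "t"])
print(len(waveform))  # 10 samples: 12 recorded samples minus 1 per join
```

Without the crossfade (`overlap=0`) the units are simply placed end to end, which tends to leave audible discontinuities at the joins. That is one reason concatenated speech can sound unnatural.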
Many of us are familiar with speech recognition systems that present a set of options and ask us to say one of them. Here the recognition task is restricted to recognising one of a small set of options. Call steering, on the other hand, attempts to recognise more open requests by asking callers ‘What would you like to do?’ or ‘Tell me what you’re calling about’. The responses vary from ‘pay a bill’ to ‘I want to pay my bill thanks’.
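Once the caller's words have been recognised, the request still has to be mapped to a destination. The sketch below does this by counting keyword matches. The destinations, the keyword lists, and the fallback to an operator are hypothetical, and a deployed system would more likely rely on a trained statistical classifier than on hand-written keywords.

```python
import re

# Hypothetical destinations and the keywords that suggest each one.
ROUTES = {
    "billing": {"pay", "bill", "payment", "invoice"},
    "faults": {"broken", "fault", "outage", "working"},
}

def steer(utterance, default="operator"):
    """Pick the destination whose keywords best match the caller's words."""
    words = set(re.findall(r"[a-z']+", utterance.lower()))
    best_destination, best_score = default, 0
    for destination, keywords in ROUTES.items():
        score = len(words & keywords)  # number of distinct keywords heard
        if score > best_score:
            best_destination, best_score = destination, score
    return best_destination

# Short and long phrasings of the same request reach the same destination.
print(steer("pay a bill"))
print(steer("I want to pay my bill thanks"))
# A request with no known keywords falls back to the default destination.
print(steer("hello there"))
```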
Call steering is ideal for organisations that provide a variety of services and have call centres with specially trained staff. Typically these organisations would otherwise have a complicated menu, which may or may not have options that meet the caller's needs. This can lead to callers going to the wrong destination and being transferred, or to callers hanging up. Call steering applications can address these problems and improve customer interaction and satisfaction. Listen to our demo.