Audio-visual speech recognition

Audio-visual speech recognition (AVSR) is a technique that uses the image processing capabilities of lip reading to aid speech recognition systems in recognizing non-deterministic phones, or in giving preponderance to one decision among several of nearly equal probability.

The lip-reading and speech recognition systems each work separately, and their results are then combined at the feature-fusion stage. As the name suggests, an AVSR system has two parts: an audio part and a visual part. In the audio part, features such as log-mel spectrograms or MFCCs are extracted from the raw audio samples, and a model is built to produce a feature vector from them. In the visual part, some variant of a convolutional neural network is generally used to compress the image into a feature vector. The two vectors (audio and visual) are then concatenated and used to predict the target.
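
The concatenation step described above can be illustrated with a minimal sketch in Python. It is not taken from any particular AVSR system: the vector sizes, the random stand-in feature vectors, and the untrained linear classifier are all assumptions made for illustration. In a real system the audio vector would come from a model over log-mel or MFCC features and the visual vector from a convolutional network.

```python
import math
import random


def fuse(audio_vec, visual_vec):
    """Feature-level fusion: concatenate the two modality vectors."""
    return list(audio_vec) + list(visual_vec)


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(fused, weights, biases):
    """Apply one linear layer followed by softmax to the fused vector."""
    scores = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)


rng = random.Random(0)

# Hypothetical sizes, chosen only to keep the example small.
AUDIO_DIM, VISUAL_DIM, NUM_CLASSES = 4, 3, 5

# Random stand-ins for the outputs of the audio model and the visual CNN.
audio_vec = [rng.gauss(0, 1) for _ in range(AUDIO_DIM)]
visual_vec = [rng.gauss(0, 1) for _ in range(VISUAL_DIM)]

fused = fuse(audio_vec, visual_vec)

# Untrained (random) classifier parameters; a real system would learn these.
weights = [[rng.gauss(0, 0.1) for _ in range(len(fused))]
           for _ in range(NUM_CLASSES)]
biases = [0.0] * NUM_CLASSES

probs = classify(fused, weights, biases)
predicted = max(range(NUM_CLASSES), key=probs.__getitem__)
print(len(fused), predicted)
```

The fused vector has length AUDIO_DIM + VISUAL_DIM, so the classifier sees both modalities at once and can let visual evidence separate classes that the audio features alone leave ambiguous.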

External links

IBM Research - Audio Visual Speech Technologies
Looking to listen at cocktail party
Google AI blog

