
Visual Features for Context-Aware Speech Recognition - 2016

Offered By: Center for Language & Speech Processing (CLSP), JHU via YouTube

Tags

Speech Recognition Courses, Machine Learning Courses, Computer Vision Courses, Deep Neural Networks Courses

Course Description

Overview

Explore techniques for improving automatic speech recognition (ASR) on challenging multimedia content in this lecture by Florian Metze of Carnegie Mellon University. Delve into methods for adapting acoustic and language models using visual context from video, such as detected objects and scenes. Learn about experiments on "how-to" videos in which incorporating visual information reduces word error rates. Examine approaches for handling speech variability, estimating speaker-microphone distance, and fusing audio and visual information. Gain insights into applications in robotics, human-computer interaction, and large-scale multimedia indexing. Discover how this research aims to bridge the gap between the video-to-text and speech-to-text communities.

Syllabus

Intro
Outline
Automatic Speech Recognition
Speech Variability (Spectral)
Decoding Procedure
Experimental Setup
Simple Extensions
Performance on Switchboard
IARPA "Aladdin" Project
Speaker Microphone Distance (SMD)
Training SMD Extractors
Training SMD Descriptors
SMD Results
SMD Analysis
Audio-Visual ASR
Speaker Attributes
Speaker Actions
Semantic Indexing CNN Features
Fusion of Approaches
Analysis "indoor" vs "outdoor"
Summary


Taught by

Center for Language & Speech Processing (CLSP), JHU

Related Courses

Machine Learning Capstone: An Intelligent Application with Deep Learning - University of Washington via Coursera
Elaborazione del linguaggio naturale - University of Naples Federico II via Federica
Deep Learning for Natural Language Processing - University of Oxford via Independent
Deep Learning Summer School - Independent
Sequence Models - DeepLearning.AI via Coursera