This work was carried out at the University of Oxford Computer Science Department by Yannis Assael, Brendan Shillingford, Prof Shimon Whiteson and Prof Nando de Freitas. We thank Google DeepMind, CIFAR, and NVIDIA for financial support. We also thank the University of Sheffield, Jon Barker, Martin Cooke, Stuart Cunningham and Xu Shao for the GRID corpus dataset; Aine Jackson, Brittany Klug and Samantha Pugh for helping us measure the experienced lipreader baseline; Mitko Sabev for his phonetics guidance; Odysseas Votsis for his video production help; and Alex Graves and Oiwi Parker Jones for helpful comments.
LipNet performs lipreading using machine learning. It aims to help those who are hard of hearing and could revolutionise speech recognition.
LipNet: End-to-End Sentence-level Lipreading
[Yannis M. Assael, Brendan Shillingford], Shimon Whiteson, Nando de Freitas
Abstract:
Lipreading is the task of decoding text from the movement of a speaker’s mouth. Traditional approaches separated the problem into two stages: designing or learning visual features, and prediction. More recent deep lipreading approaches are end-to-end trainable (Wand et al., 2016; Chung & Zisserman, 2016a). However, existing work on models trained end-to-end performs only word classification, rather than sentence-level sequence prediction. Studies have shown that human lipreading performance increases for longer words (Easton & Basala, 1982), indicating the importance of features capturing temporal context in an ambiguous communication channel. Motivated by this observation, we present LipNet, a model that maps a variable-length sequence of video frames to text, making use of spatiotemporal convolutions, a recurrent network, and the connectionist temporal classification loss, trained entirely end-to-end. To the best of our knowledge, LipNet is the first end-to-end sentence-level lipreading model that simultaneously learns spatiotemporal visual features and a sequence model. On the GRID corpus, LipNet achieves 95.2% accuracy on the sentence-level, overlapped speaker split task, outperforming experienced human lipreaders and the previous 86.4% word-level state-of-the-art accuracy (Gergen et al., 2016).
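To make the connectionist temporal classification (CTC) step more concrete, here is a minimal sketch of how a per-frame label sequence can be collapsed into text: consecutive repeats are merged, then the blank label is dropped. This is only an illustration of the general CTC collapsing rule, not the authors' decoder. The abstract does not say how LipNet decodes its output, and the function name `ctc_collapse`, the `_` blank symbol, and the example frame labels are all assumptions made for this sketch.

```python
from itertools import groupby

BLANK = "_"  # hypothetical symbol standing in for the CTC blank label


def ctc_collapse(frame_labels, blank=BLANK):
    """Collapse per-frame labels into text: merge repeats, then drop blanks."""
    # groupby yields one key per run of identical consecutive labels
    merged = [label for label, _ in groupby(frame_labels)]
    # blanks only separate runs, so they are removed after merging
    return "".join(label for label in merged if label != blank)


# One label per video frame (made-up values for illustration).
print(ctc_collapse("bb_ii_nn"))       # a run of frames collapsing to one word
print(ctc_collapse("hh_e_ll_ll_oo"))  # the blank keeps the double "l" intact
```

The blank label is what lets a variable-length frame sequence map to a shorter text: without it, a genuinely doubled letter would be merged into one.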