• Publications
State-of-the-Art Speech Recognition with Sequence-to-Sequence Models
TLDR
A variety of structural and optimization improvements to the Listen, Attend, and Spell model are explored, which significantly improve performance, and a multi-head attention architecture is introduced that offers improvements over the commonly used single-head attention.
Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling
TLDR
This document outlines the underlying design of Lingvo and serves as an introduction to the various pieces of the framework, while also offering examples of advanced features that showcase its capabilities.
From Audio to Semantics: Approaches to End-to-End Spoken Language Understanding
TLDR
This paper formulates audio-to-semantic understanding as a sequence-to-sequence problem, and proposes and compares various encoder-decoder based approaches that optimize both modules jointly, in an end-to-end manner.
Acoustic Modeling for Google Home
TLDR
The technical and system building advances made to the Google Home multichannel speech recognition system, which was launched in November 2016, result in a reduction of WER of 8-28% relative to the current production system.
Unsupervised language model adaptation
TLDR
Unsupervised language model adaptation from ASR transcripts shows an absolute error rate reduction of 3.9% over the unadapted baseline, from 28% to 24.1%, using 17 hours of unsupervised adaptation material.
Restoring punctuation and capitalization in transcribed speech
TLDR
The results show that using larger training data sets consistently improves performance, while increasing the n-gram order does not help nearly as much.
Multichannel Signal Processing With Deep Neural Networks for Automatic Speech Recognition
TLDR
This paper introduces a neural network architecture that performs multichannel filtering in the first layer of the network, and shows that this network learns to be robust to varying target speaker direction of arrival, performing as well as a model given oracle knowledge of the true target speaker direction.
SCANMail: a voicemail interface that makes speech browsable, readable and searchable
TLDR
A novel principle for the design of UIs to speech data: What You See Is Almost What You Hear (WYSIAWYH), in which a transcript of the speech serves as a visual analogue to the underlying audio, allowing users to visually scan, read, annotate and search these transcripts.
Fast vocabulary-independent audio search using path-based graph indexing
TLDR
A fast vocabulary-independent audio search approach that operates on phonetic lattices and is suitable for any query, inspired by a general graph indexing method that defines an automatic procedure for selecting a small number of paths as indexing features, keeping the index small while allowing fast retrieval of the lattices matching a given query.
...