Publications
Audio Set: An ontology and human-labeled dataset for audio events
TLDR
This paper describes the creation of Audio Set, a large-scale dataset of manually annotated audio events that aims to bridge the gap in data availability between image and audio research and to substantially stimulate the development of high-performance audio event recognizers.
CNN architectures for large-scale audio classification
TLDR
This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, and investigates varying the size of both the training set and the label vocabulary, finding that analogs of the CNNs used in image classification do well on this audio classification task and that larger training and label sets help up to a point.
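As a rough illustration of the setup this paper studies, a CNN classifying log-mel spectrogram patches into video-level labels, here is a minimal sketch. The layer sizes, input shape, and class count are illustrative assumptions, not the paper's architectures (which are analogs of standard image CNNs, and whose label vocabulary has 30,871 entries).

```python
# Minimal sketch: a tiny CNN over log-mel spectrogram patches producing
# multi-label logits. All shapes and sizes here are illustrative assumptions.
import torch
import torch.nn as nn

class TinyAudioCNN(nn.Module):
    def __init__(self, num_classes=100):  # illustrative; the paper's vocabulary has 30,871 labels
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(64 * 24 * 16, num_classes)

    def forward(self, x):                      # x: (batch, 1, 96 frames, 64 mel bands)
        h = self.features(x)                   # -> (batch, 64, 24, 16)
        return self.classifier(h.flatten(1))   # video-level multi-label logits

logits = TinyAudioCNN()(torch.randn(2, 1, 96, 64))
print(logits.shape)  # torch.Size([2, 100])
```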
The zero resource speech challenge 2017
TLDR
This paper describes a new challenge aimed at discovering subword and word units from raw speech and constructing systems that generalize across languages and adapt to new speakers.
Towards Learning a Universal Non-Semantic Representation of Speech
TLDR
This paper proposes a benchmark for comparing speech representations on non-semantic tasks, along with a representation based on an unsupervised triplet-loss objective that outperforms other representations on the benchmark and even exceeds state-of-the-art performance on a number of transfer learning tasks.
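For intuition, the core of a triplet-loss objective is that an anchor example should embed closer to a "positive" example than to a "negative" one by some margin. The sketch below is a hedged illustration of that hinge loss only; the embedding function, sampling scheme, and margin are assumptions, not the paper's model.

```python
# Minimal NumPy sketch of a triplet hinge loss: pull anchor and positive
# together, push anchor and negative apart by at least `margin`.
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge loss: want d(anchor, positive) + margin <= d(anchor, negative)."""
    d_pos = np.linalg.norm(anchor - positive, axis=-1)
    d_neg = np.linalg.norm(anchor - negative, axis=-1)
    return np.maximum(0.0, d_pos - d_neg + margin).mean()

rng = np.random.default_rng(0)
emb = rng.normal(size=(3, 128))   # stand-ins for embeddings of three audio frames
print(triplet_loss(emb[0], emb[1], emb[2]))
```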
Efficient spoken term discovery using randomized algorithms
TLDR
This paper investigates the use of randomized algorithms that operate directly on the raw acoustic features to produce sparse approximate similarity matrices in O(n) space and O(n log n) time, and demonstrates that these techniques enable spoken term discovery performance capable of outperforming a model-based strategy in the zero-resource setting.
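To see why randomization keeps the similarity structure sparse, consider sign-hashing with random hyperplanes: frames that land in the same hash bucket are plausible matches, and only those pairs are ever compared exactly. The sketch below is an illustrative assumption of this idea, not the paper's exact algorithm; the feature dimension and bit count are arbitrary.

```python
# Illustrative sketch: random-projection sign hashing to build a sparse set of
# candidate frame pairs instead of a full n x n similarity matrix.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
n, d, bits = 1000, 39, 8                  # n frames of 39-dim acoustic features (illustrative)
X = rng.normal(size=(n, d))

planes = rng.normal(size=(d, bits))       # random hyperplanes for sign hashing
codes = (X @ planes > 0)                  # (n, bits) boolean signatures

buckets = defaultdict(list)               # group frames with identical signatures
for i, code in enumerate(codes):
    buckets[code.tobytes()].append(i)

sparse_pairs = {}                         # only score pairs that share a bucket
for idx in buckets.values():
    for a in idx:
        for b in idx:
            if a < b:
                cos = X[a] @ X[b] / (np.linalg.norm(X[a]) * np.linalg.norm(X[b]))
                sparse_pairs[(a, b)] = cos
print(len(sparse_pairs), "candidate pairs instead of", n * (n - 1) // 2)
```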
Towards Unsupervised Training of Speaker Independent Acoustic Models
TLDR
This paper investigates the feasibility of using repeated spoken terms, discovered automatically without a recognizer in a number of recent efforts, as constraints for unsupervised acoustic model training, starting from a relatively small set of word types.
Fixed-dimensional acoustic embeddings of variable-length segments in low-resource settings
TLDR
This paper explores several supervised and unsupervised approaches to embedding speech segments of arbitrary length into fixed-dimensional spaces in which simple distances serve as a proxy for linguistically meaningful (phonetic, lexical, etc.) dissimilarities.
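One simple unsupervised way to get such a fixed-dimensional embedding is to resample a segment's variable-length feature sequence to a fixed number of frames and flatten it, so ordinary vector distances can stand in for segment dissimilarity. The sketch below illustrates that downsampling idea under assumed shapes; it is not presented as the paper's best-performing method.

```python
# Minimal sketch: embed a variable-length (T, d) feature sequence as a fixed
# (k*d,) vector by picking k evenly spaced frames and flattening them.
import numpy as np

def downsample_embed(segment, k=10):
    """segment: (T, d) acoustic features with variable T -> (k*d,) vector."""
    T, d = segment.shape
    idx = np.linspace(0, T - 1, k).round().astype(int)    # k evenly spaced frame indices
    return segment[idx].reshape(-1)

rng = np.random.default_rng(0)
a = downsample_embed(rng.normal(size=(53, 13)))            # two segments of different length
b = downsample_embed(rng.normal(size=(87, 13)))
print(a.shape, np.linalg.norm(a - b))                      # both embeddings are 130-dim
```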
Towards spoken term discovery at scale with zero resources
TLDR
This work finds that long (∼1 s) repetitions tend to be contentful phrases and proposes an algorithm that searches for these long repetitions without first recognizing the speech, taking advantage of sparse feature representations and the inherently low occurrence frequency of long content terms to achieve orders-of-magnitude speedups relative to the prior art.
...