Publications
Audio Set: An ontology and human-labeled dataset for audio events
TLDR
Describes the creation of Audio Set, a large-scale dataset of manually annotated audio events that aims to bridge the gap in data availability between image and audio research and to substantially stimulate the development of high-performance audio event recognizers.
Self-Supervised GANs via Auxiliary Rotation Loss
TLDR
This work allows the networks to collaborate on the task of representation learning, while being adversarial with respect to the classic GAN game, and takes a step towards bridging the gap between conditional and unconditional GANs.
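A minimal sketch of the auxiliary rotation loss idea named in the title: the discriminator gets an extra head that predicts which of four rotations (0°, 90°, 180°, 270°) was applied to an image, and that cross-entropy term is added to the usual GAN objectives. PyTorch, the function names, and the loss weighting are assumptions for illustration, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def rotate_batch(images):
    """Return the NCHW batch rotated by 0/90/180/270 degrees plus rotation labels."""
    rotated, labels = [], []
    for k in range(4):
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

def rotation_loss(backbone, rotation_head, images):
    """Auxiliary self-supervision loss: classify which rotation was applied."""
    rotated, labels = rotate_batch(images)
    feats = backbone(rotated)          # shared discriminator backbone (assumed callable)
    logits = rotation_head(feats)      # 4-way rotation classifier head
    return F.cross_entropy(logits, labels)

# Sketch of how the term would be combined with the GAN losses (weights illustrative):
#   d_loss_total = d_loss_gan + alpha * rotation_loss(backbone, head, real_images)
#   g_loss_total = g_loss_gan + beta  * rotation_loss(backbone, head, fake_images)
```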
High-Fidelity Image Generation With Fewer Labels
TLDR
This work demonstrates how one can benefit from recent work on self- and semi-supervised learning to outperform the state of the art on unsupervised ImageNet synthesis as well as in the conditional setting.
Self-Supervised Generative Adversarial Networks
TLDR
This work exploits two popular unsupervised learning techniques, adversarial training and self-supervision, to close the gap between conditional and unconditional GANs, allowing the networks to collaborate on the task of representation learning while remaining adversarial with respect to the classic GAN game.
Now Playing: Continuous low-power music recognition
TLDR
Presents a low-power music recognizer that passively recognizes a wide range of music without user interaction and respects user privacy by running entirely on the mobile device.
Continental-Scale Building Detection from High Resolution Satellite Imagery
TLDR
Describes a model training pipeline for detecting buildings across the entire continent of Africa from 50 cm satellite imagery, starting from the U-Net model and reporting novel methods for improving building detection performance with this type of model.
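For readers unfamiliar with the U-Net architecture the pipeline starts from, here is a minimal sketch of its encoder-decoder-with-skip-connection structure. PyTorch, the class name, and the layer widths are assumptions for illustration; they are not taken from the paper.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    """Two-level U-Net: encoder, bottleneck, and decoder joined by a skip connection."""
    def __init__(self, in_ch=3, num_classes=1):
        super().__init__()
        self.enc = conv_block(in_ch, 32)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(32, 64)
        self.up = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)
        self.dec = conv_block(64, 32)          # 64 = upsampled 32 + skip 32
        self.head = nn.Conv2d(32, num_classes, kernel_size=1)

    def forward(self, x):
        e = self.enc(x)                                   # full-resolution features
        b = self.bottleneck(self.pool(e))                 # downsampled context
        d = self.dec(torch.cat([self.up(b), e], dim=1))   # skip connection
        return self.head(d)                               # per-pixel building logits
```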
Representation learning from videos in-the-wild: An object-centric approach
We propose a method to learn image representations from uncurated videos. We combine a supervised loss from off-the-shelf object detectors and self-supervised losses which naturally arise from the
Self-Supervised Learning of Video-Induced Visual Invariances
TLDR
Models trained with different variants of the proposed framework on videos from the YouTube-8M (YT8M) dataset obtain state-of-the-art self-supervised transfer learning results on the 19 diverse downstream tasks of the Visual Task Adaptation Benchmark (VTAB), using only 1000 labels per task.
Scaling Up Models and Data with t5x and seqio
TLDR
Two software libraries are presented: t5x simplifies the process of building and training large language models at scale while maintaining ease of use, and seqio provides a task-based API for simple creation of fast and reproducible training data and evaluation pipelines.
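A sketch of seqio's task-based API, following the registration pattern shown in the seqio documentation: a task is registered once with a data source, preprocessors, and output features, and can then be materialized as a data pipeline. The dataset name, key mapping, and vocabulary path below are placeholders, and exact argument details may differ between seqio versions.

```python
import functools
import seqio

# Placeholder vocabulary; a real task would point at an actual SentencePiece model.
vocab = seqio.SentencePieceVocabulary("/path/to/sentencepiece.model")

seqio.TaskRegistry.add(
    "example_translation_task",
    source=seqio.TfdsDataSource(tfds_name="wmt19_translate/de-en:1.0.0"),
    preprocessors=[
        functools.partial(seqio.preprocessors.rekey,
                          key_map={"inputs": "de", "targets": "en"}),
        seqio.preprocessors.tokenize,
        seqio.preprocessors.append_eos,
    ],
    output_features={
        "inputs": seqio.Feature(vocabulary=vocab),
        "targets": seqio.Feature(vocabulary=vocab),
    },
)

# A registered task can then be turned into a dataset, e.g.:
# ds = seqio.get_mixture_or_task("example_translation_task").get_dataset(
#     sequence_length={"inputs": 512, "targets": 512}, split="train")
```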
Training Deep Neural Networks for Reverberation Robust Speech Recognition
TLDR
It is shown that a simple approach using room impulse responses (RIRs) can be used to train systems that are more robust to reverberation, reducing the word error rate (WER) from 59.7% to 41.9%.
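A minimal sketch of RIR-based data augmentation, assuming NumPy/SciPy and 1-D waveforms: clean training utterances are convolved with measured or simulated room impulse responses so the acoustic model sees reverberant speech during training. The function name, truncation, and energy normalization are illustrative choices, not taken from the paper.

```python
import numpy as np
from scipy.signal import fftconvolve

def reverberate(clean_waveform: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve a clean utterance with a room impulse response."""
    reverberant = fftconvolve(clean_waveform, rir, mode="full")
    reverberant = reverberant[: len(clean_waveform)]   # keep the original length
    # Rescale so the augmented utterance has roughly the original energy.
    scale = np.sqrt(np.sum(clean_waveform ** 2) / (np.sum(reverberant ** 2) + 1e-12))
    return reverberant * scale
```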