Publications
Learning Transferable Visual Models From Natural Language Supervision
TLDR
It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
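A minimal sketch of the symmetric contrastive objective the TLDR refers to, assuming a batch of already-computed image and text embeddings (function and variable names are illustrative, not the paper's code):

```python
import torch
import torch.nn.functional as F

def contrastive_image_text_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over image-text similarities: the i-th image
    in the batch should be matched with the i-th caption, and vice versa."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature      # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i = F.cross_entropy(logits, targets)            # image -> caption direction
    loss_t = F.cross_entropy(logits.t(), targets)        # caption -> image direction
    return (loss_i + loss_t) / 2
```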
Crepe: A Convolutional Representation for Pitch Estimation
TLDR
This paper proposes a data-driven pitch tracking algorithm, CREPE, which is based on a deep convolutional neural network that operates directly on the time-domain waveform, and evaluates the model's generalizability in terms of noise robustness.
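A hedged sketch of the kind of model the TLDR describes: a 1-D convolutional network that consumes a short raw-waveform frame and outputs activations over a grid of pitch bins. Layer counts and sizes here are illustrative, not CREPE's exact configuration.

```python
import torch
import torch.nn as nn

class TimeDomainPitchNet(nn.Module):
    """Toy pitch estimator operating directly on 1024-sample audio frames;
    the 360 outputs stand in for a fine-grained pitch-bin grid."""
    def __init__(self, n_bins=360):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=64, stride=4), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=16, stride=2), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=8, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Linear(128, n_bins)

    def forward(self, frames):                  # frames: (batch, 1024) raw samples
        x = frames.unsqueeze(1)                 # (batch, 1, 1024)
        x = self.conv(x).squeeze(-1)            # (batch, 128)
        return torch.sigmoid(self.head(x))      # per-bin pitch activations
```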
Jukebox: A Generative Model for Music
TLDR
It is shown that the combined model at scale can generate high-fidelity and diverse songs with coherence up to multiple minutes, and can condition on artist and genre to steer the musical and vocal style, and on unaligned lyrics to make the singing more controllable.
Robust fine-tuning of zero-shot models
TLDR
This work introduces a simple and effective method for improving robustness while fine-tuning: ensembling the weights of the zero-shot and fine-tuned models (WiSE-FT), providing large accuracy improvements under distribution shift, while preserving high accuracy on the target distribution.
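The weight-space ensembling described above amounts to a linear interpolation of the two models' parameters; a minimal sketch, with the mixing coefficient chosen arbitrarily:

```python
import torch

def wise_ft_interpolate(zero_shot_state, fine_tuned_state, alpha=0.5):
    """Interpolate between zero-shot and fine-tuned weights, key by key.
    alpha=0 recovers the zero-shot model, alpha=1 the fine-tuned one."""
    return {
        key: (1 - alpha) * zero_shot_state[key] + alpha * fine_tuned_state[key]
        for key in zero_shot_state
    }

# Usage (assuming two models with identical architectures):
# model.load_state_dict(wise_ft_interpolate(zero_shot.state_dict(), fine_tuned.state_dict()))
```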
Adversarial Learning for Improved Onsets and Frames Music Transcription
TLDR
This work introduces an adversarial training scheme that operates directly on the time-frequency representations and makes the output distribution closer to the ground-truth, and shows that adversarial learning can significantly reduce the error rate while increasing the confidence of the model estimations.
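One way such an adversarial term could be wired into training, shown only as a hedged sketch: a discriminator scores predicted transcriptions against ground truth, and its feedback is added to the usual reconstruction loss (module names and the loss weight are assumptions, not the paper's exact formulation).

```python
import torch
import torch.nn.functional as F

def generator_step(transcriber, discriminator, spectrogram, target_roll, adv_weight=0.1):
    """Standard transcription loss plus a GAN-style term that pushes the
    predicted output distribution toward that of real transcriptions."""
    pred_roll = transcriber(spectrogram)                     # predicted piano roll in [0, 1]
    recon = F.binary_cross_entropy(pred_roll, target_roll)   # supervised transcription loss
    d_out = discriminator(pred_roll)                         # probability the roll is "real"
    adv = F.binary_cross_entropy(d_out, torch.ones_like(d_out))
    return recon + adv_weight * adv
```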
Text and Code Embeddings by Contrastive Pre-Training
TLDR
It is shown that contrastive pre-training on unsupervised data at scale leads to high quality vector representations of text and code.
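The training objective is essentially the same symmetric contrastive loss sketched above for image-text pairs; downstream, the resulting vectors are typically used for similarity search. A small illustrative example of cosine-similarity retrieval over such embeddings (the embeddings themselves are assumed given, no specific API is implied):

```python
import numpy as np

def cosine_search(query_vec, corpus_vecs, top_k=3):
    """Rank corpus embeddings by cosine similarity to a query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]
```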
Open-Source Practices for Music Signal Processing Research: Recommendations for Transparent, Sustainable, and Reproducible Audio Research
TLDR
Because of an increased abundance of methods, the proliferation of software toolkits, the explosion of machine learning, and a focus shift toward more realistic problem settings, modern research systems are substantially more complex than their predecessors.
Neural Music Synthesis for Flexible Timbre Control
TLDR
A neural music synthesis model with flexible timbre controls, which consists of a recurrent neural network conditioned on a learned instrument embedding followed by a WaveNet vocoder, is described.
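A hedged sketch of the conditioning structure the TLDR describes: a recurrent network predicts acoustic frames from note input concatenated with a learned instrument embedding, and a WaveNet-style vocoder (not shown) would then synthesize the waveform. All names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class TimbreConditionedSynth(nn.Module):
    """Note features + learned instrument embedding -> LSTM -> mel frames."""
    def __init__(self, n_instruments, note_dim=88, embed_dim=16, hidden=256, n_mels=80):
        super().__init__()
        self.instrument_embedding = nn.Embedding(n_instruments, embed_dim)
        self.rnn = nn.LSTM(note_dim + embed_dim, hidden, batch_first=True)
        self.to_mel = nn.Linear(hidden, n_mels)

    def forward(self, notes, instrument_id):    # notes: (batch, time, note_dim)
        emb = self.instrument_embedding(instrument_id)         # (batch, embed_dim)
        emb = emb.unsqueeze(1).expand(-1, notes.size(1), -1)   # broadcast over time
        out, _ = self.rnn(torch.cat([notes, emb], dim=-1))
        return self.to_mel(out)                                # predicted mel frames
```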
Evaluating CLIP: Towards Characterization of Broader Capabilities and Downstream Implications
TLDR
Through some preliminary probes, it is found that CLIP can inherit biases found in prior computer vision systems, which adds evidence to the growing body of work calling for a change in the notion of a ‘better’ model.