End-to-end Audio Visual Scene-aware Dialog Using Multimodal Attention-based Video Features
TLDR
This paper introduces a new data set of dialogs about videos of human behaviors, as well as an end-to-end Audio Visual Scene-Aware Dialog (AVSD) model, trained using this new data set, that generates responses in a dialog about a video.
WHAM!: Extending Speech Separation to Noisy Environments
TLDR
The WSJ0 Hipster Ambient Mixtures (WHAM!) dataset is created, consisting of two-speaker mixtures from the wsj0-2mix dataset combined with real ambient noise samples, and is used to benchmark various speech separation architectures and objective functions for their robustness to noise.
WHAMR!: Noisy and Reverberant Single-Channel Speech Separation
TLDR
WHAMR!, an augmented version of WHAM! with synthetic reverberated sources, is introduced, and a thorough baseline analysis of current techniques as well as novel cascaded architectures on the newly introduced conditions is provided.
Phasebook and Friends: Leveraging Discrete Representations for Source Separation
TLDR
These methods are evaluated on the wsj0-2mix dataset, a well-studied corpus for single-channel speaker-independent speaker separation, matching the performance of state-of-the-art mask-based approaches without requiring additional phase reconstruction steps.
Cutting Music Source Separation Some Slakh: A Dataset to Study the Impact of Training Data Quality and Quantity
TLDR
It is shown that the synthesized Lakh dataset (Slakh) can be used to effectively augment existing datasets for musical instrument separation, while opening the door to a wide array of data-intensive music signal analysis tasks.
Finding Strength in Weakness: Learning to Separate Sounds With Weak Supervision
TLDR
This work proposes objective functions and network architectures that enable training a source separation system with weak labels and benchmarks the performance of the algorithm using synthetic mixtures of overlapping events created from a database of sounds recorded in urban environments.
The Phasebook: Building Complex Masks via Discrete Representations for Source Separation
TLDR
This work proposes to estimate phase using "phasebook", a new type of layer based on a discrete representation of the phase difference between the mixture and the target, and introduces "combook", a similar type of layer that directly estimates a complex mask.
Attentive Neural Processes and Batch Bayesian Optimization for Scalable Calibration of Physics-Informed Digital Twins
TLDR
To handle large-scale calibration of digital twins without exorbitant simulations, this work proposes ANP-BBO: a scalable and parallelizable batch-wise Bayesian optimization (BBO) methodology that leverages attentive neural processes (ANPs).
Segmentation, Indexing, and Retrieval for Environmental and Natural Sounds
TLDR
A dynamic Bayesian network (DBN) is presented that jointly infers onsets and end times of the most prominent sound events in the space, along with an extension of the algorithm for covering large spaces with distributed microphone arrays.
Class-conditional Embeddings for Music Source Separation
TLDR
This work proposes using a common embedding space, inspired by deep clustering and deep attractor networks, for the time-frequency bins of all instruments in a mixture, and outperforms a mask-inference baseline on the MUSDB-18 dataset.
...