• Publications
A Comparison of Five Multiple Instance Learning Pooling Functions for Sound Event Detection with Weak Labeling
TLDR
This paper builds a neural network called TALNet, the first system to reach state-of-the-art audio tagging performance on Audio Set while simultaneously exhibiting strong localization performance on the DCASE 2017 challenge.
Tracking changes in continuous emotion states using body language and prosodic cues
TLDR
The emotional content of body language cues describing a participant's posture, relative position, and approach/withdraw behaviors during improvised affective interactions is examined, and it is shown that these cues reflect changes in the participant's activation and dominance levels.
A first attempt at polyphonic sound event detection using connectionist temporal classification
TLDR
This paper presents a first attempt at using connectionist temporal classification (CTC) for sound event detection, and shows that CTC is able to locate the boundaries of sound events in a very noisy corpus of consumer-generated content given only rough hints about their positions.
Multiple Instance Deep Learning for Weakly Supervised Small-Footprint Audio Event Detection
TLDR
This paper proposes a small-footprint multiple instance learning (MIL) framework for multi-class audio event detection (AED) using weakly annotated labels, and shows that audio embeddings extracted by convolutional neural networks significantly boost the performance of all MIL models.
An in-depth comparison of keyword specific thresholding and sum-to-one score normalization
TLDR
This paper compares two widely used thresholding algorithms, keyword-specific thresholding (KST) and sum-to-one score normalization (STO), analyzes the difference in their performance in detail, and recommends the use of the "estimated KST" algorithm.
Polyphonic Sound Event Detection with Weak Labeling
TLDR
This thesis proposes to train deep learning models for sound event detection (SED) using various levels of weak labeling, and shows that sound events can be learned and localized by a recurrent neural network (RNN) with a connectionist temporal classification (CTC) output layer, which is well suited for sequential supervision.
Connectionist Temporal Localization for Sound Event Detection with Sequential Labeling
TLDR
Evaluation on a subset of Audio Set shows that CTL closes a third of the gap between presence/absence labeling and strong labeling, demonstrating the usefulness of the extra temporal information in sequential labeling.
Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks
TLDR
This paper combines the well-known wav2vec 2.0 framework, which has shown success in self-supervised learning for speech tasks, with parameter-efficient Conformer architectures, and proposes a self-supervised audio representation learning method that achieves a new state-of-the-art score on the AudioSet benchmark.
Audio-based multimedia event detection using deep recurrent neural networks
TLDR
This paper introduces longer-range temporal information with deep recurrent neural networks (RNNs) for both stages of multimedia event detection, and observes improvements in both frame-level and clip-level performance compared to SVM and feed-forward neural network baselines.
The ACLEW DiViMe: An Easy-to-use Diarization Tool
TLDR
This paper introduces the set of tools included in DiViMe and the ongoing work, which is focused on making minimal assumptions regarding users' technical skills, and shows how the current DiViMe tools fare against three sets of challenging data.
...