The Influence of Dataset Partitioning on Dysfluency Detection Systems

Authors: S. P. Bayerl, Dominik Wagner, Elmar Nöth, Tobias Bocklet, Korbinian Riedhammer
Published in: International Conference on Text, Speech and Dialogue
This paper empirically investigates the influence of different data splits and splitting strategies on the performance of dysfluency detection systems. For this, we perform experiments using wav2vec 2.0 models with a classification head as well as support vector machines (SVM) in conjunction with the features extracted from the wav2vec 2.0 model to detect dysfluencies. We train and evaluate the systems with different non-speaker-exclusive and speaker-exclusive splits of the Stuttering Events in…
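
The difference between the two splitting strategies named in the abstract can be illustrated with a minimal sketch. It is not the authors' partitioning procedure: the function names, the `(clip_id, speaker_id)` input format, and the 20% test fraction are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def speaker_exclusive_split(clips, test_fraction=0.2, seed=0):
    """Split clips so that no speaker appears in both train and test.

    `clips` is a list of (clip_id, speaker_id) pairs. This input format is
    hypothetical and not taken from the paper.
    """
    by_speaker = defaultdict(list)
    for clip_id, speaker_id in clips:
        by_speaker[speaker_id].append(clip_id)

    # Shuffle speakers (not clips), then hold out whole speakers for testing.
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)
    n_test = max(1, round(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])

    train = [c for s in speakers if s not in test_speakers for c in by_speaker[s]]
    test = [c for s in speakers if s in test_speakers for c in by_speaker[s]]
    return train, test, test_speakers


def random_split(clips, test_fraction=0.2, seed=0):
    """Non-speaker-exclusive baseline: shuffle clips and ignore speakers."""
    ids = [clip_id for clip_id, _ in clips]
    random.Random(seed).shuffle(ids)
    n_test = max(1, round(len(ids) * test_fraction))
    return ids[n_test:], ids[:n_test]


if __name__ == "__main__":
    # Toy data: 50 clips spread evenly over 5 hypothetical speakers.
    toy = [(f"clip{i}", f"spk{i % 5}") for i in range(50)]
    train, test, held_out = speaker_exclusive_split(toy)
    print(len(train), len(test), sorted(held_out))
```

With the random split, clips from the same speaker can land on both sides, so a model may be rewarded for recognizing a voice rather than a dysfluency. The speaker-exclusive variant rules that out, at the cost of test sets whose size depends on how many clips the held-out speakers contributed.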

Dysfluencies Seldom Come Alone - Detection as a Multi-Label Problem

Specially adapted speech recognition models are necessary to handle stuttered speech. For these to be used in a targeted manner, stuttered speech must be reliably detected. Recent works have treated

Detecting Dysfluencies in Stuttering Therapy Using wav2vec 2.0

Stuttering is a varied speech disorder that harms an individual’s communication ability. Persons who stutter (PWS) often use speech therapy to cope with their condition. Improving speech recognition

Explore wav2vec 2.0 for Mispronunciation Detection

This paper presents an initial attempt to use self-supervised learning for Mispronunciation Detection, and outperforms existing methods on the public L2-ARCTIC dataset with an F1 value of 0 .

SEP-28k: A Dataset for Stuttering Event Detection from Podcasts with People Who Stutter

This work introduces Stuttering Events in Podcasts (SEP-28k), a dataset containing over 28k clips labeled with five event types (blocks, prolongations, sound repetitions, word repetitions, and interjections), and benchmarks a set of acoustic models on SEP-28k and the public FluencyBank dataset.

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being

X-Vectors: Robust DNN Embeddings for Speaker Recognition

This paper uses data augmentation, consisting of added noise and reverberation, as an inexpensive method to multiply the amount of training data and improve robustness of deep neural network embeddings for speaker recognition.

SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

This work presents SpecAugment, a simple data augmentation method for speech recognition that is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients) and achieves state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work.

VoxCeleb: A Large-Scale Speaker Identification Dataset

This paper proposes a fully automated pipeline based on computer vision techniques to create a large scale text-independent speaker identification dataset collected 'in the wild', and shows that a CNN based architecture obtains the best performance for both identification and verification.

ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification

The proposed ECAPA-TDNN architecture significantly outperforms state-of-the-art TDNN based systems on the VoxCeleb test sets and the 2019 VoxCeleb Speaker Recognition Challenge.

Emotion Recognition from Speech Using Wav2vec 2.0 Embeddings

This work proposes a transfer learning method for speech emotion recognition where features extracted from pre-trained wav2vec 2.0 models are modeled using simple neural networks, showing superior performance compared to results in the literature.

Librispeech: An ASR corpus based on public domain audio books

It is shown that acoustic models trained on LibriSpeech give a lower error rate on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.

Attention is All you Need

This work proposes the Transformer, a new simple network architecture based solely on attention mechanisms that dispenses with recurrence and convolutions entirely, and shows that it generalizes well to other tasks by applying it successfully to English constituency parsing with both large and limited training data.