MeWEHV: Mel and Wave Embeddings for Human Voice Tasks

Andrés Vasco-Carofilis, Laura Fernández-Robles, Enrique Alegre, Eduardo Fidalgo
A recent trend in speech processing is the use of embeddings created through machine learning models trained on a specific task with large datasets. By leveraging the knowledge already acquired, these models can be reused in new tasks where the amount of available data is small. This paper proposes a pipeline to create a new model, called Mel and Wave Embeddings for Human Voice Tasks (MeWEHV), capable of generating robust embeddings for speech processing. MeWEHV combines the embeddings generated… 

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being

PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition

This paper proposes pretrained audio neural networks (PANNs) trained on the large-scale AudioSet dataset, and investigates the performance and computational complexity of PANNs modeled by a variety of convolutional neural networks.

Spoken Language Identification using ConvNets

A new attention-based model for language identification that uses log-Mel spectrogram images as input is proposed, and the effectiveness of raw waveforms as features for neural network models on language identification (LI) tasks is presented.

HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units

The Hidden-Unit BERT (HuBERT) approach for self-supervised speech representation learning, which utilizes an offline clustering step to provide aligned target labels for a BERT-like prediction loss.

Multi-representation knowledge distillation for audio classification

A novel end-to-end collaborative training framework that takes multiple representations as inputs to train the networks jointly with a knowledge distillation method that significantly promotes the performance of networks without increasing the computational overhead in the inference stage.

Audio Tagging by Cross Filtering Noisy Labels

This article presents a novel framework, named CrossFilter, to combat the noisy-label problem in audio tagging. It achieves state-of-the-art performance and even surpasses ensemble models on the FSDKaggle2018 dataset.

Common Voice: A Massively-Multilingual Speech Corpus

This work presents speech recognition experiments using Mozilla’s DeepSpeech Speech-to-Text toolkit and finds an average Character Error Rate improvement for twelve target languages. For most of these languages, these are the first published results on end-to-end Automatic Speech Recognition.

WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing

A new pre-trained model, WavLM, is proposed to solve full-stack downstream speech tasks. It achieves state-of-the-art performance on the SUPERB benchmark and brings significant improvements for various speech processing tasks on their representative benchmarks.

Towards Learning a Universal Non-Semantic Representation of Speech

This paper proposes a benchmark for comparing speech representations on non-semantic tasks, together with a representation based on an unsupervised triplet-loss objective. The representation outperforms other representations on the benchmark and even exceeds state-of-the-art performance on a number of transfer learning tasks.

Time delay deep neural network-based universal background models for speaker recognition

This study investigates a lightweight alternative in which a supervised GMM is derived from the TDNN posteriors, which maintains the speed of the traditional unsupervised-GMM, but achieves a 20% relative improvement in EER.