MeWEHV: Mel and Wave Embeddings for Human Voice Tasks
@article{VascoCarofilis2022MeWEHVMA,
  title   = {MeWEHV: Mel and Wave Embeddings for Human Voice Tasks},
  author  = {Andrés Vasco-Carofilis and Laura Fernández-Robles and Enrique Alegre and Eduardo Fidalgo},
  journal = {ArXiv},
  year    = {2022},
  volume  = {abs/2209.14078}
}
A recent trend in speech processing is the use of embeddings created through machine learning models trained on a specific task with large datasets. By leveraging the knowledge already acquired, these models can be reused in new tasks where the amount of available data is small. This paper proposes a pipeline to create a new model, called Mel and Wave Embeddings for Human Voice Tasks (MeWEHV), capable of generating robust embeddings for speech processing. MeWEHV combines the embeddings generated…
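To make the idea concrete, the sketch below fuses frame-level embeddings from a pre-trained raw-waveform encoder with log-Mel features by pooling each view over time and concatenating them. This is a minimal sketch of the general approach described in the abstract, not the authors' exact architecture: the wav2vec 2.0 checkpoint, the pooling strategy, and the feature sizes are illustrative assumptions.

```python
# Hedged sketch: fuse pre-trained waveform embeddings with Mel features.
# Checkpoint, pooling and feature sizes are illustrative assumptions,
# not the exact MeWEHV configuration.
import torch
import torchaudio
from transformers import Wav2Vec2Model

def fused_embedding(waveform: torch.Tensor, sample_rate: int = 16000) -> torch.Tensor:
    """Return a single utterance-level vector built from two views of the audio."""
    # View 1: embeddings from a pre-trained raw-waveform encoder (wav2vec 2.0).
    encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
    with torch.no_grad():
        wave_emb = encoder(waveform).last_hidden_state   # (1, frames, 768)
    wave_emb = wave_emb.mean(dim=1)                      # temporal mean pooling

    # View 2: log-Mel spectrogram features.
    mel = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate, n_mels=64)(waveform)
    log_mel = torchaudio.transforms.AmplitudeToDB()(mel) # (1, 64, frames)
    mel_emb = log_mel.mean(dim=-1)                       # (1, 64)

    # Fuse both views into one embedding for a downstream task head.
    return torch.cat([wave_emb, mel_emb], dim=-1)        # (1, 768 + 64)

# Usage: a 1-second dummy waveform at 16 kHz.
emb = fused_embedding(torch.randn(1, 16000))
print(emb.shape)  # torch.Size([1, 832])
```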
References
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
- Computer Science · NeurIPS
- 2020
We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being…
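As context for this reference, the snippet below shows the standard way to run a publicly released, fine-tuned wav2vec 2.0 checkpoint for speech recognition with the Hugging Face transformers library; the checkpoint name is just one of the released models, not something specified by the paper above.

```python
# Hedged example: greedy CTC transcription with a fine-tuned wav2vec 2.0 checkpoint.
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

audio = torch.randn(16000).numpy()                    # dummy 1 s of 16 kHz audio
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits                   # (1, frames, vocab)
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids))               # greedy CTC transcription
```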
PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition
- Computer Science · IEEE/ACM Transactions on Audio, Speech, and Language Processing
- 2020
This paper proposes pretrained audio neural networks (PANNs) trained on the large-scale AudioSet dataset, and investigates the performance and computational complexity of PANNs modeled by a variety of convolutional neural networks.
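The general pattern behind PANNs is a convolutional network over log-Mel spectrograms trained for multi-label AudioSet tagging. The toy model below illustrates that pattern only; the layer sizes are assumptions and do not reproduce the published CNN14 architecture.

```python
# Hedged sketch of the PANN pattern: a CNN over log-Mel spectrograms producing
# multi-label AudioSet-style tag probabilities (527 classes). Sizes are illustrative.
import torch
import torch.nn as nn

class TinyAudioTagger(nn.Module):
    def __init__(self, n_classes: int = 527):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),              # global pooling over time and frequency
        )
        self.head = nn.Linear(64, n_classes)

    def forward(self, log_mel: torch.Tensor) -> torch.Tensor:
        # log_mel: (batch, 1, n_mels, frames)
        x = self.conv(log_mel).flatten(1)
        return torch.sigmoid(self.head(x))        # multi-label tag probabilities

probs = TinyAudioTagger()(torch.randn(2, 1, 64, 101))
print(probs.shape)                                # torch.Size([2, 527])
```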
Spoken Language Identification using ConvNets
- Computer Science, Linguistics · AmI
- 2019
A new attention-based model for language identification that uses log-Mel spectrogram images as input is proposed, and the effectiveness of raw waveforms as input features to neural network models for LI tasks is examined.
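For reference, this is how an utterance is typically turned into the kind of log-Mel spectrogram "image" fed to a ConvNet; the STFT and Mel parameters below are common defaults chosen for illustration, not the paper's settings.

```python
# Hedged example: log-Mel spectrogram features for a ConvNet-based classifier.
import numpy as np
import librosa

sr = 16000
y = np.random.randn(sr).astype(np.float32)        # dummy 1 s waveform at 16 kHz
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=64)
log_mel = librosa.power_to_db(mel, ref=np.max)    # (64, frames), dB-scaled

print(log_mel.shape)   # a fixed-size crop of this matrix would be one input "image"
```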
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
- Computer Science · IEEE/ACM Transactions on Audio, Speech, and Language Processing
- 2021
The Hidden-Unit BERT (HuBERT) approach for self-supervised speech representation learning utilizes an offline clustering step to provide aligned target labels for a BERT-like prediction loss.
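The offline clustering step can be pictured as below: frame-level acoustic features are clustered with k-means, and the resulting cluster IDs serve as the per-frame pseudo-labels predicted at masked positions. This is only a sketch of the target-generation idea; the feature type, cluster count, and library choices are assumptions.

```python
# Hedged sketch of HuBERT-style target generation: cluster frame-level MFCCs
# offline and use the cluster IDs as pseudo-labels for masked prediction.
import numpy as np
import librosa
from sklearn.cluster import KMeans

sr = 16000
y = np.random.randn(sr * 5).astype(np.float32)            # dummy 5 s waveform
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T      # (frames, 13)

kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(mfcc)
pseudo_labels = kmeans.labels_                             # one hidden-unit ID per frame
print(pseudo_labels[:10])
```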
Multi-representation knowledge distillation for audio classification
- Computer Science · Multimedia Tools and Applications
- 2022
A novel end-to-end collaborative training framework takes multiple representations as inputs and trains the networks jointly with a knowledge distillation method, significantly improving network performance without increasing computational overhead at the inference stage.
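The distillation term in such setups is usually a soft-target loss between temperature-softened output distributions. The snippet below is a generic sketch of that loss, not the specific objective used in the cited paper; the temperature value is an illustrative choice.

```python
# Hedged sketch of a standard knowledge-distillation loss:
# KL divergence between temperature-softened student and teacher logits.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 4.0):
    """Soft-target loss; gradients flow only through the student."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits.detach() / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

loss = distillation_loss(torch.randn(8, 50), torch.randn(8, 50))
print(loss.item())
```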
Audio Tagging by Cross Filtering Noisy Labels
- Computer Science · IEEE/ACM Transactions on Audio, Speech, and Language Processing
- 2020
This article presents a novel framework, named CrossFilter, to combat the noisy-label problem in audio tagging; it achieves state-of-the-art performance and even surpasses ensemble models on the FSDKaggle2018 dataset.
Common Voice: A Massively-Multilingual Speech Corpus
- Computer Science · LREC
- 2020
This work presents speech recognition experiments using Mozilla's DeepSpeech Speech-to-Text toolkit and reports an average Character Error Rate improvement across twelve target languages; for most of these languages, these are the first published results on end-to-end Automatic Speech Recognition.
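For readers unfamiliar with the metric mentioned above, Character Error Rate is the character-level edit distance between hypothesis and reference, normalised by reference length. A self-contained worked example:

```python
# Character Error Rate: Levenshtein edit distance over characters,
# divided by the reference length.
def character_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = list(reference), list(hypothesis)
    # Dynamic-programming edit distance (substitutions, insertions, deletions).
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[len(ref)][len(hyp)] / max(len(ref), 1)

print(character_error_rate("common voice", "comon voise"))  # 2 edits / 12 chars = 0.1666...
```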
WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing
- Computer Science · IEEE Journal of Selected Topics in Signal Processing
- 2022
A new pre-trained model, WavLM, is proposed to solve full-stack downstream speech tasks; it achieves state-of-the-art performance on the SUPERB benchmark and brings significant improvements to various speech processing tasks on their representative benchmarks.
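A common way to consume such a "full stack" model downstream (as in SUPERB) is to take a learnable weighted sum over all hidden layers rather than just the final one. The sketch below illustrates that pattern with a public WavLM checkpoint; the checkpoint name and the uniform initial weighting are assumptions for illustration.

```python
# Hedged sketch: weighted sum over all WavLM hidden layers for a downstream head.
import torch
from transformers import WavLMModel

model = WavLMModel.from_pretrained("microsoft/wavlm-base")
waveform = torch.randn(1, 16000)                          # dummy 16 kHz audio

with torch.no_grad():
    hidden = model(waveform, output_hidden_states=True).hidden_states  # tuple of (1, T, 768)

layer_weights = torch.softmax(torch.zeros(len(hidden)), dim=0)         # learnable in practice
stacked = torch.stack(hidden, dim=0)                                   # (layers, 1, T, 768)
fused = (layer_weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)         # (1, T, 768)
print(fused.shape)
```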
Towards Learning a Universal Non-Semantic Representation of Speech
- Computer Science · INTERSPEECH
- 2020
This paper proposes a benchmark for comparing speech representations on non-semantic tasks, along with a representation based on an unsupervised triplet-loss objective that outperforms other representations on the benchmark and even exceeds state-of-the-art performance on a number of transfer learning tasks.
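The triplet objective behind such representations pulls an anchor clip towards a positive clip (e.g. a nearby segment of the same audio) and pushes it away from a negative clip drawn from different audio. The snippet below is a generic illustration of that loss, not the cited paper's exact training setup.

```python
# Hedged sketch of a triplet-margin objective over audio embeddings.
import torch
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=1.0)
anchor, positive, negative = torch.randn(3, 8, 128)   # 8 embeddings of dim 128 each
loss = triplet(anchor, positive, negative)            # anchor near positive, far from negative
print(loss.item())
```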
Time delay deep neural network-based universal background models for speaker recognition
- Computer Science · 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)
- 2015
This study investigates a lightweight alternative in which a supervised GMM is derived from the TDNN posteriors; this maintains the speed of the traditional unsupervised GMM while achieving a 20% relative improvement in EER.
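A TDNN layer of the kind used in these speaker-recognition systems is, in practice, a dilated 1-D convolution over frame-level features; stacking such layers widens the temporal context each output frame sees. The sketch below shows one such layer with illustrative sizes, not the cited system's configuration.

```python
# Hedged sketch: a single TDNN layer as a 1-D convolution over acoustic frames.
import torch
import torch.nn as nn

tdnn_layer = nn.Sequential(
    nn.Conv1d(in_channels=24, out_channels=512, kernel_size=5, dilation=1),
    nn.ReLU(),
    nn.BatchNorm1d(512),
)

features = torch.randn(4, 24, 300)      # (batch, feature dims, frames)
context = tdnn_layer(features)          # (4, 512, 296): each output frame sees 5 input frames
print(context.shape)
```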