RawNet: Advanced end-to-end deep neural network using raw waveforms for text-independent speaker verification

@inproceedings{Jung2019RawNetAE,
  title={RawNet: Advanced end-to-end deep neural network using raw waveforms for text-independent speaker verification},
  author={Jee-weon Jung and Hee-Soo Heo and Ju-ho Kim and Hye-jin Shim and Ha-Jin Yu},
  booktitle={INTERSPEECH},
  year={2019}
}
Recently, direct modeling of raw waveforms using deep neural networks has been widely studied for a number of tasks in audio domains. In speaker verification, however, utilization of raw waveforms is in its preliminary phase, requiring further investigation. In this study, we explore end-to-end deep neural networks that input raw waveforms to improve various aspects: front-end speaker embedding extraction including model architecture, pre-training scheme, additional objective functions, and…
Raw-x-vector: Multi-scale Time Domain Speaker Embedding Network
This paper proposes a new speaker embedding called raw-x-vector for speaker verification in the time domain, combining a multi-scale waveform encoder and an x-vector network architecture, and shows that the proposed approach outperforms existing raw-waveform-based speaker verification systems by a large margin.
Improved RawNet with Feature Map Scaling for Text-Independent Speaker Verification Using Raw Waveforms
This study improves RawNet by scaling feature maps using various methods built on a scale vector that adopts a sigmoid non-linear function, and investigates replacing the first convolution layer with the sinc-convolution layer of SincNet.
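The sigmoid scale vector described above can be illustrated with a minimal numpy sketch. This is a hypothetical simplification, not the paper's code: a per-channel scale in (0, 1) is computed from the time-averaged feature map and multiplied back onto it (the function name and parameter shapes are assumptions for illustration).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feature_map_scaling(x, W, b):
    """Multiplicative feature-map scaling (illustrative sketch).

    x: feature map of shape (channels, time)
    W, b: parameters of a linear layer applied to the time-averaged
          feature map, producing one scale per channel.
    Returns x scaled channel-wise by a sigmoid-gated vector in (0, 1).
    """
    pooled = x.mean(axis=1)          # (channels,) global average over time
    s = sigmoid(W @ pooled + b)      # scale vector, each entry in (0, 1)
    return x * s[:, None]            # broadcast scaling over the time axis

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 10))
W = np.eye(4)                        # toy parameters for the demo
b = np.zeros(4)
y = feature_map_scaling(x, W, b)
```

Because the sigmoid output is strictly between 0 and 1, the scaled map never exceeds the original in magnitude; the various methods mentioned in the summary differ mainly in whether the scale is applied multiplicatively, additively, or both.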
FDN: Finite Difference Network with Hierarchical Convolutional Features for Text-independent Speaker Verification
Recently, directly utilizing raw waveforms as input has been widely explored for speaker verification systems. For example, RawNet [1] and RawNet2 [2] extract feature embeddings from raw waveforms, which…
Learning the Front-End Speech Feature with Raw Waveform for End-to-End Speaker Recognition
This paper presents an end-to-end speaker recognition system combining a front-end raw waveform feature extractor, a back-end speaker embedding classifier, and an angle-based loss optimizer, and details the superiority of the raw waveform feature extractor.
Improved RawNet with Filter-wise Rescaling for Text-independent Speaker Verification using Raw Waveforms
This study improves RawNet by rescaling feature maps using various methods and investigates replacing the first convolution layer with the sinc-convolution layer of SincNet.
Selective Deep Speaker Embedding Enhancement for Speaker Verification
This study proposes two frameworks for deep speaker embedding enhancement, focusing on distant utterances; the frameworks input speaker embeddings extracted from front-end systems, including deep neural network-based systems, which widens the range of applications.
Wav2Spk: A Simple DNN Architecture for Learning Speaker Embeddings from Waveforms
This paper shows that MFCC, VAD, and CMVN can be replaced by tools available in standard deep learning toolboxes, such as a stack of strided convolutions, temporal gating, and instance normalization, and that directly learning speaker embeddings from waveforms outperforms an x-vector network that uses MFCC or filter-bank output as features.
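The CMVN replacement mentioned above can be sketched as per-utterance instance normalization: each channel of a feature map is standardized over time, just as CMVN standardizes cepstral coefficients. A minimal numpy sketch (not the paper's implementation; shapes and names are assumptions):

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Per-utterance instance normalization over the time axis.

    x: feature map of shape (channels, time). Each channel is shifted
    and scaled so its mean is 0 and its variance is ~1, mirroring what
    CMVN does for cepstral features.
    """
    mu = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(1)
# Simulated conv features with a large offset and scale, as raw-waveform
# encoders typically produce before normalization.
feats = instance_norm(rng.standard_normal((4, 200)) * 3.0 + 5.0)
```

Unlike batch normalization, the statistics here come from the single utterance being processed, so the operation behaves identically at training and test time.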
Orthogonality Regularizations for End-to-End Speaker Verification
This paper introduces two orthogonality regularizers to end-to-end speaker verification systems: the first is based on the Frobenius norm, and the second utilizes the restricted isometry property; both can be handily incorporated into end-to-end training.
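The Frobenius-norm regularizer mentioned above typically penalizes the distance between the Gram matrix of a weight matrix and the identity. A minimal numpy sketch of that penalty (an illustration of the general technique, not the paper's exact formulation):

```python
import numpy as np

def frobenius_orth_penalty(W):
    """Squared Frobenius norm ||W^T W - I||_F^2.

    Zero exactly when the columns of W are orthonormal; added to the
    training loss, it pushes a weight matrix toward orthogonality.
    """
    k = W.shape[1]
    gram = W.T @ W                       # (k, k) Gram matrix of columns
    return float(np.sum((gram - np.eye(k)) ** 2))
```

For an orthonormal matrix the penalty vanishes, while correlated columns are penalized in proportion to their off-diagonal Gram entries.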
SVSNet: An End-to-end Speaker Voice Similarity Assessment Model
SVSNet is proposed, the first end-to-end neural network model to assess the speaker voice similarity between natural speech and synthesized speech; it notably outperforms well-known baseline systems in the assessment of speaker similarity at the utterance and system levels.
Time-Domain Target-Speaker Speech Separation with Waveform-Based Speaker Embedding
This work proves the feasibility of WaveFilter for separating the target speaker's voice from multi-speaker voice mixtures without knowing the exact number of speakers in advance, which in turn demonstrates the readiness of the method for real-world applications.

References

SHOWING 1-10 OF 34 REFERENCES
End-to-end losses based on speaker basis vectors and all-speaker hard negative mining for speaker verification
This study proposes two end-to-end loss functions for speaker verification using the concept of speaker bases, which are trainable parameters that enable hard negative mining and calculation of between-speaker variations with consideration of all speakers.
A Complete End-to-End Speaker Verification System Using Deep Neural Networks: From Raw Signals to Verification Result
A complete end-to-end speaker verification system is presented, which inputs raw audio signals and outputs the verification results; a pre-processing layer and the embedded speaker feature extraction models are mainly investigated.
Avoiding Speaker Overfitting in End-to-End DNNs Using Raw Waveform for Text-Independent Speaker Verification
This paper investigated regularization techniques, a multi-step training scheme, and a residual connection with pooling layers from the perspective of mitigating speaker overfitting, which led to considerable performance improvements.
Sample-level Deep Convolutional Neural Networks for Music Auto-tagging Using Raw Waveforms
The experiments show how deep architectures with sample-level filters improve the accuracy in music auto-tagging and provide results comparable to previous state-of-the-art performances for the MagnaTagATune dataset and the Million Song Dataset.
Frame-Level Speaker Embeddings for Text-Independent Speaker Recognition and Analysis of End-to-End Model
A Convolutional Neural Network (CNN) based speaker recognition model for extracting robust speaker embeddings is proposed, and it is found that the networks are better at discriminating broad phonetic classes than individual phonemes.
X-Vectors: Robust DNN Embeddings for Speaker Recognition
This paper uses data augmentation, consisting of added noise and reverberation, as an inexpensive method to multiply the amount of training data and improve robustness of deep neural network embeddings for speaker recognition.
Speaker Recognition from Raw Waveform with SincNet
This paper proposes a novel CNN architecture, called SincNet, that encourages the first convolutional layer to discover more meaningful filters, based on parametrized sinc functions, which implement band-pass filters.
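The parametrized sinc filters described above can be sketched in a few lines: each band-pass filter is fully determined by a low and a high cutoff frequency, and its impulse response is the difference of two windowed low-pass sinc filters. A minimal numpy sketch under those assumptions (filter length, window choice, and normalization are illustrative, not the paper's exact settings):

```python
import numpy as np

def sinc_bandpass(f1, f2, length=101, fs=16000):
    """Band-pass FIR filter built from two cutoff frequencies.

    f1, f2: lower and upper cutoffs in Hz (f1 < f2), the only learnable
    parameters per filter in a sinc-convolution layer. The impulse
    response is the difference of two low-pass sinc filters, windowed
    to reduce ripple.
    """
    # Centred time axis in seconds so the filter is symmetric (linear phase).
    t = (np.arange(length) - (length - 1) / 2) / fs
    low = 2 * f1 * np.sinc(2 * f1 * t)    # low-pass with cutoff f1
    high = 2 * f2 * np.sinc(2 * f2 * t)   # low-pass with cutoff f2
    return (high - low) * np.hamming(length)

h = sinc_bandpass(300.0, 3400.0)          # e.g. a telephone-band filter
```

Because only the two cutoffs are learned, a whole bank of such filters has far fewer parameters than a free first convolution layer, which is the main appeal of this design.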
Joint Training of Expanded End-to-End DNN for Text-Dependent Speaker Verification
The main contribution of this paper is that, instead of using DNNs as parts of the system trained independently, the whole system is trained jointly with a fine-tuning cost after pre-training each part.
Speech acoustic modeling from raw multichannel waveforms
A convolutional neural network - deep neural network (CNN-DNN) acoustic model is presented which takes raw multichannel waveforms as input and learns a similar feature representation through supervised training; it outperforms a DNN that uses log-mel filterbank magnitude features under noisy and reverberant conditions.
Learning the speech front-end with raw waveform CLDNNs
It is shown that raw waveform features match the performance of log-mel filterbank energies when used with a state-of-the-art CLDNN acoustic model trained on over 2,000 hours of speech.