Neural Predictive Coding Using Convolutional Neural Networks Toward Unsupervised Learning of Speaker Characteristics

  • Arindam Jati, P. Georgiou
  • Published 2019
  • Computer Science, Engineering
  • IEEE/ACM Transactions on Audio, Speech, and Language Processing
Learning speaker-specific features is vital in many applications like speaker recognition, diarization, and speech recognition. [...] Key Method: We train a convolutional deep siamese network to produce “speaker embeddings” by learning to separate “same” versus “different” speaker pairs generated from unlabeled audio streams. Two sets of experiments are conducted in different scenarios to evaluate the strength of NPC embeddings and compare them with state-of-the-art in-domain supervised methods.
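The pair generation described above can be sketched under the NPC working assumption that temporally adjacent windows within one stream tend to share a speaker, while windows from different streams tend not to. This is a minimal illustration; the function and parameter names are ours, not the paper's API.

```python
import random

def make_pairs(streams, window=20, n_pairs=4, seed=0):
    """Generate ("same", "different") training pairs from unlabeled streams.

    Assumption (NPC-style): adjacent windows in one stream -> label 1 ("same"),
    windows from two distinct streams -> label 0 ("different").
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        # positive pair: two back-to-back windows from a single stream
        s = rng.choice(streams)
        start = rng.randrange(0, len(s) - 2 * window)
        pairs.append((s[start:start + window],
                      s[start + window:start + 2 * window], 1))
        # negative pair: one window each from two different streams
        a, b = rng.sample(streams, 2)
        i = rng.randrange(0, len(a) - window)
        j = rng.randrange(0, len(b) - window)
        pairs.append((a[i:i + window], b[j:j + window], 0))
    return pairs
```

The resulting labeled pairs would then feed a siamese encoder trained with a binary same/different objective.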
An Unsupervised Neural Prediction Framework for Learning Speaker Embeddings Using Recurrent Neural Networks
An unsupervised training framework for learning a speaker-specific embedding using a Neural Predictive Coding (NPC) technique that outperforms the MFCC baseline for speaker change detection, and both MFCC and i-vector baselines for speaker classification.
Learning Speaker Representations with Mutual Information
This work learns representations that capture speaker identities by maximizing the mutual information between the encoded representations of chunks of speech randomly sampled from the same sentence.
Multimodal Speaker Segmentation and Diarization using Lexical and Acoustic Cues via Sequence to Sequence Neural Networks
The experimental results show that a sequence-to-sequence system trained on both word sequences and MFCCs can improve speaker diarization results compared to a system that relies only on the lexical modality or the baseline MFCC-based system.
Contrastive Self-Supervised Learning for Text-Independent Speaker Verification
  • Haoran Zhang, Yuexian Zou, Helin Wang
  • Computer Science
  • ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
This work exploits a contrastive self-supervised learning (CSSL) approach for the text-independent speaker verification task and proposes a channel-invariant loss to prevent the network from encoding undesired channel information into the speaker representation.
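A generic contrastive objective of the kind such CSSL systems build on can be sketched as an InfoNCE-style loss over cosine similarities. This is a minimal illustration only, not the paper's exact loss or its channel-invariant term.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: pull the anchor toward its positive and
    push it away from the negatives (illustrative sketch)."""
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    logits = [s / temperature for s in sims]
    m = max(logits)  # subtract the max for numerical stability
    denom = sum(math.exp(l - m) for l in logits)
    # negative log-probability that the positive is ranked first
    return -(logits[0] - m - math.log(denom))
```

A well-aligned anchor/positive pair yields a loss near zero; a misaligned pair yields a large loss.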
Multiview Shared Subspace Learning Across Speakers and Speech Commands
This paper learns speech representations in a multiview paradigm by constraining the views to known modes of variability such as speakers or spoken words: observations from one mode of variability are treated as multiple parallel views, and the learned representations are shown to be discriminative with respect to the other mode.
Unspeech: Unsupervised Speech Context Embeddings
This work uses a Siamese convolutional neural network architecture to train Unspeech embeddings and evaluates them on speaker comparison, utterance clustering and as a context feature in TDNN-HMM acoustic models trained on TED-LIUM, comparing it to i-vector baselines.
Unsupervised Representation Learning for Speaker Recognition via Contrastive Equilibrium Learning
Experimental results showed that the proposed CEL significantly outperforms state-of-the-art unsupervised speaker verification systems on the VoxCeleb1 and VOiCES evaluation sets, with the best performing model achieving 8.01% EER on VoxCeleb1.
Active Speakers in Context
The Active Speaker Context is introduced, a novel representation that models relationships between multiple speakers over long time horizons; it improves the state of the art on the AVA-ActiveSpeaker dataset, achieving an mAP of 87.1%.
Towards an Unsupervised Entrainment Distance in Conversational Speech using Deep Neural Networks
This work presents a turn-level distance measure obtained in an unsupervised manner using a Deep Neural Network model, called the Neural Entrainment Distance (NED), and establishes a framework that learns an embedding from the population-wide entrainment in an unlabeled training corpus.
MAAS: Multi-modal Assignation for Active Speaker Detection
A novel approach to active speaker detection is presented that directly addresses the multi-modal nature of the problem, and provides a straightforward strategy where independent visual features from potential speakers in the scene are assigned to a previously detected speech event. Expand


Speaker2Vec: Unsupervised Learning and Adaptation of a Speaker Manifold Using Deep Neural Networks with an Evaluation on Speaker Segmentation
The proposed Speaker2Vec represents a speaker-characteristics manifold learned in an unsupervised manner that outperforms the state-of-the-art speaker segmentation algorithms and MFCC based baseline methods on four evaluation datasets, while it allows for further improvements by employing this embedding into supervised training methods.
Learning Speaker-Specific Characteristics With a Deep Neural Architecture
  • A. Salman
  • Computer Science, Medicine
  • IEEE Transactions on Neural Networks
  • 2011
A novel deep neural architecture especially for learning speaker-specific characteristics from mel-frequency cepstral coefficients, an acoustic representation commonly used in both speech recognition and SR, which results in a speaker-specific overcomplete representation.
Deep Neural Network Embeddings for Text-Independent Speaker Verification
It is found that the embeddings outperform i-vectors for short speech segments and are competitive on long-duration test conditions; these are the best results reported for speaker-discriminative neural networks trained and tested on publicly available corpora.
Speaker adaptation of neural network acoustic models using i-vectors
This work proposes to adapt deep neural network acoustic models to a target speaker by supplying speaker identity vectors (i-vectors) as input features to the network in parallel with the regular acoustic features for ASR; the adapted models are comparable in performance to DNNs trained on speaker-adapted features, with the advantage that only one decoding pass is needed.
Deep Speaker: an End-to-End Neural Speaker Embedding System
Results are presented suggesting that adapting from a model trained on Mandarin can improve accuracy for English speaker recognition, and that Deep Speaker outperforms a DNN-based i-vector baseline.
Extracting Speaker-Specific Information with a Regularized Siamese Deep Network
A multi-objective loss function is proposed for learning speaker-specific characteristics, with regularization that normalizes interference from non-speaker-related information and avoids information loss.
Application of convolutional neural networks to speaker recognition in noisy conditions
This paper applies a convolutional neural network (CNN) trained for automatic speech recognition (ASR) to the task of speaker identification (SID). In the CNN/i-vector front end, [...]
X-Vectors: Robust DNN Embeddings for Speaker Recognition
This paper uses data augmentation, consisting of added noise and reverberation, as an inexpensive method to multiply the amount of training data and improve the robustness of deep neural network embeddings for speaker recognition.
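The additive-noise half of this augmentation can be sketched as mixing noise into speech at a controlled signal-to-noise ratio. This is a minimal illustration with names of our choosing; the actual x-vector recipes also apply reverberation via room impulse responses and draw noise from corpora such as MUSAN.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Additive-noise augmentation sketch: scale the noise so that the
    mixture has the requested signal-to-noise ratio, then add it in."""
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    # choose gain so that p_speech / (gain^2 * p_noise) == 10^(snr_db / 10)
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

Each mixed copy counts as a new training example, which is what multiplies the effective amount of training data.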
Deep neural network-based speaker embeddings for end-to-end speaker verification
It is shown that, given a large number of training speakers, the proposed system outperforms an i-vector baseline in equal error rate (EER) and at low miss rates.
Speaker diarization through speaker embeddings
This paper proposes to learn a set of high-level feature representations through deep learning, referred to as Speaker Embeddings, for speaker diarization. [...]