A Comparison of Self-Supervised Speech Representations As Input Features For Unsupervised Acoustic Word Embeddings

@inproceedings{Staden2021ACO,
  title={A Comparison of Self-Supervised Speech Representations As Input Features For Unsupervised Acoustic Word Embeddings},
  author={Lisa van Staden and Herman Kamper},
  booktitle={2021 IEEE Spoken Language Technology Workshop (SLT)},
  year={2021},
  pages={927--934}
}
Many speech processing tasks involve measuring the acoustic similarity between speech segments. Acoustic word embeddings (AWE) allow for efficient comparisons by mapping speech segments of arbitrary duration to fixed-dimensional vectors. For zero-resource speech processing, where unlabelled speech is the only available resource, some of the best AWE approaches rely on weak top-down constraints in the form of automatically discovered word-like segments. Rather than learning embeddings at the… 
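For concreteness, here is a minimal sketch of the AWE idea, not the authors' model: a recurrent encoder maps a variable-length sequence of acoustic frames to one fixed-dimensional vector, so segments of different durations can be compared with an ordinary vector distance. The GRU choice and all dimensions are illustrative assumptions.

# Minimal AWE sketch (assumed architecture and dimensions).
import torch
import torch.nn as nn

class AweEncoder(nn.Module):
    def __init__(self, feat_dim=13, embed_dim=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, embed_dim, batch_first=True)

    def forward(self, frames):          # frames: (1, T, feat_dim), T varies
        _, h = self.rnn(frames)
        return h[-1]                    # (1, embed_dim), fixed for any T

encoder = AweEncoder()
seg_a = torch.randn(1, 50, 13)          # e.g. a short word: 50 frames
seg_b = torch.randn(1, 80, 13)          # a longer word: 80 frames
emb_a, emb_b = encoder(seg_a), encoder(seg_b)
print(torch.cosine_similarity(emb_a, emb_b).item())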

Citations

The effectiveness of self-supervised representation learning in zero-resource subword modeling
TLDR
Two representative SSL algorithms, contrastive predictive coding (CPC) and autoregressive predictive coding (APC), are compared as the front-end of a recently proposed, state-of-the-art two-stage approach in which a learned representation serves as input to a back-end cross-lingual DNN; CPC proves more effective than APC in this role.
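For reference, a rough sketch of the InfoNCE objective behind CPC, using in-batch negatives; the shapes and the single prediction step are assumptions, not the cited paper's implementation.

# CPC's InfoNCE objective, sketched with in-batch negatives.
import torch
import torch.nn.functional as F

def info_nce(context, future_latents, proj):
    """context: (B, C) summaries at time t; future_latents: (B, Z) the true
    z_{t+k}; proj: (C, Z) prediction matrix W_k. Each row's positive is its
    own future latent; the other rows in the batch act as negatives."""
    pred = context @ proj                   # (B, Z) predicted future latents
    logits = pred @ future_latents.t()      # (B, B) similarity scores
    targets = torch.arange(logits.size(0))  # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(32, 256), torch.randn(32, 128),
                torch.randn(256, 128))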
Self-Supervised Speech Representation Learning: A Review
TLDR
This review presents approaches for self-supervised speech representation learning and their connection to other research areas, and reviews recent efforts on benchmarking learned representations to extend the application beyond speech recognition.
Language Recognition Based on Unsupervised Pretrained Models
TLDR
It is discovered that unsupervised pretrained models capture expressive, highly linearly separable features that help language recognition perform well even when the classifiers are relatively simple or only a small amount of labeled data is available.
Speech Sequence Embeddings using Nearest Neighbors Contrastive Learning
TLDR
A simple neural encoder architecture that can be trained with an unsupervised contrastive learning objective whose positive samples come from data-augmented k-Nearest Neighbors search; the method can be applied iteratively and yields competitive speech sequence embeddings (SSE) on two evaluation tasks.
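A loose sketch of the positive-mining step this summary describes: each query's positive for the contrastive loss is its nearest neighbour in a bank of (data-augmented) segment embeddings. The brute-force search and all sizes are assumptions.

# kNN positive mining for a contrastive objective (sketch).
import torch
import torch.nn.functional as F

def knn_positives(queries, bank):
    """queries: (B, D) current embeddings; bank: (N, D) candidate
    embeddings. Returns the nearest bank entry for each query."""
    sims = F.normalize(queries, dim=1) @ F.normalize(bank, dim=1).t()
    return bank[sims.argmax(dim=1)]          # (B, D) mined positives

positives = knn_positives(torch.randn(8, 64), torch.randn(1000, 64))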
The Effectiveness of Unsupervised Subword Modeling With Autoregressive and Cross-Lingual Phone-Aware Networks
TLDR
Comprehensive and systematic analyses at the phoneme and articulatory feature (AF) levels showed that the proposed approach was better at capturing diphthong than monophthong vowel information, and that the amount of information captured differed across consonant types.
Paralinguistic Privacy Protection at the Edge
TLDR
EDGY, a new lightweight disentangled representation learning model that transforms and filters high-dimensional voice data to remove sensitive attributes at the edge prior to offloading to the cloud, is introduced.
Self-supervised representation learning from 12-lead ECG data

References

Showing 1-10 of 54 references
Truly Unsupervised Acoustic Word Embeddings Using Weak Top-down Constraints in Encoder-decoder Models
  • H. Kamper
  • 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
TLDR
The encoder-decoder correspondence autoencoder (EncDec-CAE) is proposed, which, instead of true word segments, uses automatically discovered segments: an unsupervised term discovery system finds pairs of words of the same unknown type, and the EncDec-CAE is trained to reconstruct one word given the other as input.
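A minimal sketch of this correspondence training signal: given a discovered pair (x, y) of the same unknown word type, an encoder-decoder reconstructs y from x. Repeating the embedding at every decoder step is a simplification of the cited model, and all dimensions are assumptions.

# Correspondence encoder-decoder training step (sketch).
import torch
import torch.nn as nn

class EncDec(nn.Module):
    def __init__(self, feat_dim=13, embed_dim=128):
        super().__init__()
        self.enc = nn.GRU(feat_dim, embed_dim, batch_first=True)
        self.dec = nn.GRU(embed_dim, feat_dim, batch_first=True)

    def forward(self, x, out_len):
        _, h = self.enc(x)                        # embed the input word
        z = h[-1].unsqueeze(1).repeat(1, out_len, 1)
        y_hat, _ = self.dec(z)                    # decode the paired word
        return y_hat

model = EncDec()
x = torch.randn(1, 60, 13)   # one instance of a discovered pair
y = torch.randn(1, 70, 13)   # the other instance, a different length
loss = nn.functional.mse_loss(model(x, y.size(1)), y)
loss.backward()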
Improving Unsupervised Acoustic Word Embeddings using Speaker and Gender Information
TLDR
This work investigates how to improve the invariance of unsupervised acoustic embeddings to speaker and gender characteristics on Xitsonga, and considers two different methods for normalising out these factors: speaker and gender conditioning, and adversarial training.
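A rough sketch of the adversarial option mentioned above: a gradient reversal layer lets a speaker (or gender) classifier train normally while pushing the feature extractor to discard that information. The layer sizes and ten-speaker setup are assumptions.

# Gradient reversal for adversarial speaker normalisation (sketch).
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x                      # identity on the forward pass
    @staticmethod
    def backward(ctx, grad):
        return -grad                  # flip gradients flowing upstream

features = torch.randn(8, 128, requires_grad=True)   # extractor outputs
classifier = torch.nn.Linear(128, 10)                # speaker classifier
logits = classifier(GradReverse.apply(features))
loss = torch.nn.functional.cross_entropy(
    logits, torch.randint(0, 10, (8,)))
loss.backward()   # features.grad now opposes speaker prediction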
Unsupervised Feature Learning for Speech Using Correspondence and Siamese Networks
TLDR
It is shown that a new hybrid correspondence-Triamese approach (CTriamese) consistently outperforms both the CAE and Triamese models in terms of average precision and ABX error rates on both English and Xitsonga evaluation data.
Evaluating the reliability of acoustic speech embeddings
TLDR
This work systematically compares two popular metrics, ABX discrimination and Mean Average Precision, on 5 languages across 17 embedding methods, ranging from supervised to fully unsupervised, and using different loss functions (autoencoder, correspondence autoencoder, siamese).
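For context, a tiny sketch of the same-different average precision (AP) computation that Mean Average Precision evaluations of embeddings build on: rank all pairs by distance, then average the precision at each same-word hit. This is a toy illustration, not the cited protocol.

# Same-different average precision (toy sketch).
import numpy as np

def average_precision(distances, same):
    """distances: (N,) pair distances; same: (N,) True if the pair shares
    a word type. Smaller distance should indicate 'same'."""
    hits = same[np.argsort(distances)]
    precision = np.cumsum(hits) / (np.arange(len(hits)) + 1)
    return precision[hits].mean()

dists = np.array([0.1, 0.9, 0.2, 0.8, 0.5])
same = np.array([True, False, True, False, False])
print(average_precision(dists, same))   # 1.0: both same pairs rank first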
Fixed-dimensional acoustic embeddings of variable-length segments in low-resource settings
TLDR
This work explores several supervised and unsupervised approaches to embedding speech segments of arbitrary length into fixed-dimensional spaces in which simple distances serve as a proxy for linguistically meaningful (phonetic, lexical, etc.) dissimilarities.
Audio Word2Vec: Unsupervised Learning of Audio Segment Representations Using Sequence-to-Sequence Autoencoder
TLDR
This paper proposes unsupervised learning of Audio Word2Vec from audio data without human annotation using a Sequence-to-Sequence Autoencoder (SA), which significantly outperformed conventional Dynamic Time Warping (DTW) based approaches at much lower computational cost.
Speech2Vec: A Sequence-to-Sequence Framework for Learning Word Embeddings from Speech
TLDR
The proposed Speech2Vec model, a novel deep neural network architecture for learning fixed-length vector representations of audio segments excised from a speech corpus, is based on an RNN Encoder-Decoder framework and borrows the methodology of skipgrams or continuous bag-of-words for training.
Learning Word Embeddings: Unsupervised Methods for Fixed-size Representations of Variable-length Speech Segments
TLDR
It is shown that a simple downsampling method supplemented with length information can outperform the variable-length input feature representation on both evaluations, while learned embeddings from an unsupervised LSTM autoencoder yield even better results at the expense of increased computational complexity.
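A sketch of the simple downsampling baseline this summary describes: keep a fixed number of uniformly spaced frames, flatten them, and append the segment length, giving a fixed-size vector. The frame count and feature dimension are assumptions.

# Downsampling embedding with length information (sketch).
import numpy as np

def downsample_embed(frames, n_keep=10):
    """frames: (T, D) acoustic features. Returns (n_keep * D + 1,)."""
    T = frames.shape[0]
    idx = np.linspace(0, T - 1, n_keep).round().astype(int)
    return np.concatenate([frames[idx].ravel(), [T]])   # plus length info

emb = downsample_embed(np.random.randn(57, 13))
print(emb.shape)   # (131,): 10 frames x 13 dims + 1 length feature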
Deep convolutional acoustic word embeddings using word-pair side information
TLDR
This work uses side information in the form of known word pairs to train a Siamese convolutional neural network (CNN): a pair of tied networks that take two speech segments as input and produce their embeddings, trained with a hinge loss that separates same-word pairs and different-word pairs by some margin.
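A compact sketch of a margin-based objective of the kind this summary refers to: the distance to a different-word example must exceed the distance to a same-word example by at least a margin. The cosine distance, margin value, and batch shapes are assumptions, not the cited paper's exact loss.

# Margin-based hinge loss over word pairs (sketch).
import torch
import torch.nn.functional as F

def triplet_hinge(anchor, same, diff, margin=0.5):
    """anchor, same, diff: (B, D) embeddings; `same` shares the anchor's
    word type and `diff` does not."""
    d_pos = 1.0 - F.cosine_similarity(anchor, same)
    d_neg = 1.0 - F.cosine_similarity(anchor, diff)
    return F.relu(margin + d_pos - d_neg).mean()

loss = triplet_hinge(torch.randn(16, 64), torch.randn(16, 64),
                     torch.randn(16, 64))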
Rapid Evaluation of Speech Representations for Spoken Term Discovery
TLDR
This work presents a dynamic time warping-based framework for quantifying how well a representation can associate words of the same type spoken by different speakers and benchmarks the quality of a wide range of speech representations.
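A bare-bones dynamic time warping (DTW) routine of the kind such frameworks build on: align two variable-length feature sequences and return a length-normalised alignment cost. The Euclidean local cost is an assumption.

# Classic DTW alignment cost (sketch).
import numpy as np

def dtw_cost(a, b):
    """a: (Ta, D) and b: (Tb, D) feature sequences."""
    Ta, Tb = len(a), len(b)
    D = np.full((Ta + 1, Tb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[Ta, Tb] / (Ta + Tb)

print(dtw_cost(np.random.randn(40, 13), np.random.randn(55, 13)))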
...