Leveraging Pre-Trained Representations to Improve Access to Untranscribed Speech from Endangered Languages

@article{San2021LeveragingPR,
  title={Leveraging Pre-Trained Representations to Improve Access to Untranscribed Speech from Endangered Languages},
  author={Nay San and Martijn Bartelds and Mitchell Browne and Lily Clifford and Fiona Gibson and John Mansfield and David Nash and Jane Simpson and Myfany Turpin and Maria Vollmer and Sasha Wilmoth and Dan Jurafsky},
  journal={2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  year={2021},
  pages={1094-1101}
}
Pre-trained speech representations like wav2vec 2.0 are a powerful tool for automatic speech recognition (ASR). Yet many endangered languages lack sufficient data for pre-training such models, or are predominantly oral vernaculars without a standardised writing system, precluding fine-tuning. Query-by-example spoken term detection (QbE-STD) offers an alternative for iteratively indexing untranscribed speech corpora by locating spoken query terms. Using data from 7 Australian Aboriginal… 
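
A minimal sketch of that QbE-STD setup, assuming torchaudio's bundled wav2vec 2.0 checkpoint and hypothetical file names (query.wav, utterance.wav); the paper's exact layer choice and matching details may differ:

import numpy as np
import torch
import torchaudio

# Pre-trained wav2vec 2.0 (torchaudio's bundled checkpoint).
bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

def features(path):
    # Contextual representations for one recording: (frames, dim).
    wav, sr = torchaudio.load(path)
    wav = wav.mean(0, keepdim=True)  # downmix to mono
    wav = torchaudio.functional.resample(wav, sr, bundle.sample_rate)
    with torch.inference_mode():
        layers, _ = model.extract_features(wav)
    return layers[-1].squeeze(0).numpy()

query, search = features("query.wav"), features("utterance.wav")  # hypothetical files

# Cosine distance between every query frame and every search frame.
q = query / np.linalg.norm(query, axis=1, keepdims=True)
s = search / np.linalg.norm(search, axis=1, keepdims=True)
dist = 1.0 - q @ s.T

def subsequence_dtw(dist):
    # The query may start and end anywhere in the search utterance.
    n, m = dist.shape
    acc = np.full((n, m), np.inf)
    acc[0, :] = dist[0, :]                       # free start along the search axis
    for i in range(1, n):
        acc[i, 0] = acc[i - 1, 0] + dist[i, 0]
        for j in range(1, m):
            acc[i, j] = dist[i, j] + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[-1, :].min() / n                  # length-normalised detection score

print(f"detection score (lower = better): {subsequence_dtw(dist):.3f}")

Ranking recordings by this score yields an index into an untranscribed corpus: the low-scoring recordings are the ones likely to contain the query term.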

Citations

Automated speech tools for helping communities process restricted-access corpora for language revival efforts

A privacy-preserving workflow that widens both bottlenecks: one for recordings where speech in the endangered language is intermixed with a more widely used language such as English for meta-linguistic commentary and questions, and one for recordings with access constraints.

Analyzing the Representational Geometry of Acoustic Word Embeddings

A closer analytical look at AWEs learned from English speech, studying how the choice of learning objective and architecture shapes their representational profile; the learning objective plays a more prominent role in shaping the profile than the model architecture does.

Bottom-up discovery of structure and variation in response tokens ('backchannels') across diverse languages

This work uses sequential context and recurrence of turn formats to identify candidate response tokens in a language-agnostic way across diverse conversational corpora, then applies UMAP clustering directly to speech signals to represent their structure and variation.
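
A hedged sketch of the UMAP step, assuming MFCC features via librosa and a hypothetical directory of candidate-token clips (tokens/*.wav); the authors' exact features and settings may differ:

from glob import glob
import numpy as np
import librosa
import umap

def token_vector(path):
    # Mean-pooled MFCCs: one crude fixed-length vector per candidate token.
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)

paths = sorted(glob("tokens/*.wav"))   # hypothetical candidate response tokens
X = np.stack([token_vector(p) for p in paths])

# 2-D projection: structure and variation in the tokens appear as clusters.
coords = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(X)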

References


Unsupervised Cross-lingual Representation Learning for Speech Recognition

XLSR learns cross-lingual speech representations by pretraining a single model on the raw waveform of speech in multiple languages, enabling one multilingual speech recognition model that is competitive with strong individual monolingual models.

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

We show for the first time that learning powerful representations from speech audio alone, followed by fine-tuning on transcribed speech, can outperform the best semi-supervised methods while being conceptually simpler.

Almost-unsupervised Speech Recognition with Close-to-zero Resource Based on Phonetic Structures Learned from Very Small Unpaired Speech and Text Data

Preliminary work in this paper begins to map audio embeddings to text word embeddings, yielding word-level ASR, by aligning a small number of spoken words with their corresponding text words in the two embedding spaces.
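
One way to picture that alignment step is an ordinary least-squares map between the two embedding spaces; this is an illustrative stand-in, not the paper's actual procedure, and all arrays below are random placeholders:

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 128))    # 50 spoken-word (audio) embeddings, dim 128
T = rng.normal(size=(50, 300))    # the matching text-word embeddings, dim 300

# Linear map W with A @ W ~ T, fitted on the small set of paired words.
W, *_ = np.linalg.lstsq(A, T, rcond=None)

def recognise(audio_vec, text_vocab):
    # Word-level ASR as nearest-neighbour search in the text space.
    t = audio_vec @ W
    sims = text_vocab @ t / (np.linalg.norm(text_vocab, axis=1) * np.linalg.norm(t))
    return int(np.argmax(sims))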

Feature Exploration for Almost Zero-Resource ASR-Free Keyword Spotting Using a Multilingual Bottleneck Extractor and Correspondence Autoencoders

It is concluded that integrating multilingual bottleneck features (BNFs) with the correspondence autoencoder (CAE) allows both large out-of-domain and sparse in-domain resources to be exploited for improved ASR-free keyword spotting.
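
A minimal PyTorch sketch of the correspondence autoencoder idea, assuming aligned frame pairs from two different spoken instances of the same word are already available; dimensions are illustrative:

import torch
import torch.nn as nn

class CAE(nn.Module):
    # Trained to reconstruct one instance of a word from a *different*
    # instance of the same word, discarding speaker/channel detail.
    def __init__(self, in_dim=39, hid_dim=100):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.Tanh())
        self.decoder = nn.Linear(hid_dim, in_dim)

    def forward(self, x):
        return self.decoder(self.encoder(x))

model, loss_fn = CAE(), nn.MSELoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# x_a, x_b: aligned feature frames from two instances of the same word
# (random stand-ins here; BNFs would be used in the paper's setting).
x_a, x_b = torch.randn(32, 39), torch.randn(32, 39)
loss = loss_fn(model(x_a), x_b)   # input: instance A, target: instance B
opt.zero_grad(); loss.backward(); opt.step()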

Neural Network Based End-to-End Query by Example Spoken Term Detection

This article shows that CNN-based matching also outperforms DTW-based matching on bottleneck features, and proposes integrating the two stages into a fully neural end-to-end learning framework so that both can be optimized jointly.

wav2vec: Unsupervised Pre-training for Speech Recognition

Wav2vec is trained on large amounts of unlabeled audio data, and the resulting representations are used to improve acoustic model training; it outperforms Deep Speech 2, the best reported character-based system in the literature, while using two orders of magnitude less labeled training data.

CNN Based Query by Example Spoken Term Detection

This work addresses query-by-example spoken term detection (QbE-STD) in a zero-resource scenario: posteriors from a spoken query and a test utterance are used to compute a matrix of frame-level similarities, and a convolutional neural network is trained to recognise the matching pattern and decide whether the query occurs in the utterance.
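
A hedged PyTorch sketch of that two-stage recipe (shapes and architecture are illustrative, not the authors' exact model): build a frame-level similarity matrix from query and test posteriors, then classify it with a small CNN:

import torch
import torch.nn as nn
import torch.nn.functional as F

def similarity_matrix(query, test):
    # query: (Tq, D) and test: (Tt, D) frame posteriors -> (1, 1, Tq, Tt).
    q, t = F.normalize(query, dim=1), F.normalize(test, dim=1)
    return (q @ t.T).unsqueeze(0).unsqueeze(0)

classifier = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d((8, 8)),     # copes with variable-size matrices
    nn.Flatten(),
    nn.Linear(16 * 8 * 8, 1),         # one logit: does the query occur?
)

sim = similarity_matrix(torch.randn(40, 128), torch.randn(300, 128))
prob = torch.sigmoid(classifier(sim))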

Using Radio Archives for Low-Resource Speech Recognition: Towards an Intelligent Virtual Assistant for Illiterate Users

This paper investigates the effectiveness of unsupervised speech representation learning on noisy radio broadcasting archives, which are abundant even for low-resource languages, and shares the first-ever speech recognition models for Maninka, Pular and Susu, languages spoken in over seven countries.

BERT Rediscovers the Classical NLP Pipeline

This work finds that the model represents the steps of the traditional NLP pipeline in an interpretable and localizable way, and that the regions responsible for each step appear in the expected sequence: POS tagging, parsing, NER, semantic roles, then coreference.

Neural Representations for Modeling Variation in English Speech

This work shows that Transformer-based speech representations lead to significant performance gains over the use of phonetic transcriptions, and finds that feature-based use of Transformer models is most effective with one or more middle layers instead of the final layer.
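
A small sketch of that feature-based use with torchaudio's wav2vec 2.0 bundle; the specific middle layer and the mean-pooling are assumptions, not the paper's exact choices:

import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

wav, sr = torchaudio.load("speaker.wav")    # hypothetical recording
wav = torchaudio.functional.resample(wav.mean(0, keepdim=True), sr, bundle.sample_rate)

with torch.inference_mode():
    layers, _ = model.extract_features(wav)  # one tensor per Transformer layer

# A middle layer, mean-pooled over time, as the pronunciation representation.
vector = layers[len(layers) // 2].mean(dim=1).squeeze(0)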