Quantifying Language Variation Acoustically with Few Resources

Martijn Bartelds and Martijn Wieling
Deep acoustic models represent linguistic information based on massive amounts of data. Unfortunately, for regional languages and dialects such resources are mostly not available. However, deep acoustic models might have learned linguistic information that transfers to low-resource languages. In this study, we evaluate whether this is the case through the task of distinguishing low-resource (Dutch) regional varieties. By extracting embeddings from the hidden layers of various wav2vec 2.0 models… 
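The abstract describes comparing hidden-layer embeddings across regional varieties; since spoken pronunciations differ in duration, such comparisons typically require aligning variable-length frame sequences, for example with dynamic time warping. Below is a minimal NumPy sketch of a DTW distance between two embedding sequences; the function name, normalisation choice, and toy data are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Dynamic time warping cost between two embedding sequences.

    a has shape (n, d), b has shape (m, d); the frame-level cost is
    the Euclidean distance between embedding vectors.
    """
    n, m = len(a), len(b)
    # Pairwise frame distances, shape (n, m).
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Accumulated-cost matrix with an extra row/column of infinities
    # so the boundary conditions fall out of the recurrence.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j],      # insertion
                acc[i, j - 1],      # deletion
                acc[i - 1, j - 1],  # match
            )
    # Normalise by the combined length so sequences of different
    # durations yield comparable scores (one common convention).
    return acc[n, m] / (n + m)

# Hypothetical 1-D "embeddings": y is a time-warped copy of x,
# so the aligned cost is zero despite the length mismatch.
x = np.array([[0.0], [1.0], [2.0]])
y = np.array([[0.0], [1.0], [1.0], [2.0]])
print(dtw_distance(x, y))  # → 0.0
```

Averaging such pairwise distances over many word pronunciations is one way to turn frame-level embeddings into a single distance between two varieties.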



References

Ethnologue: Languages of the World
Neural representations for modeling variation in speech
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
Shows for the first time that learning powerful representations from speech audio alone, followed by fine-tuning on transcribed speech, can outperform the best semi-supervised methods while being conceptually simpler.
vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations
Quantizes dense speech representations using a Gumbel softmax or online k-means clustering; experiments show that subsequent BERT pre-training on these discrete units achieves a new state of the art on TIMIT phoneme classification and WSJ speech recognition.
Layer-Wise Analysis of a Self-Supervised Speech Representation Model
Examines a recent and successful pre-trained model (wav2vec 2.0) through its intermediate representation vectors, using a suite of analysis tools to characterize how information evolves across model layers and how fine-tuning the model for automatic speech recognition (ASR) affects these observations.
Unsupervised Cross-lingual Representation Learning for Speech Recognition
Presents XLSR, which learns cross-lingual speech representations by pretraining a single model on the raw waveform of speech in multiple languages, enabling one multilingual speech recognition model that is competitive with strong individual models.
Common Voice: A Massively-Multilingual Speech Corpus
Presents speech recognition experiments using Mozilla's DeepSpeech speech-to-text toolkit and finds an average character error rate improvement for twelve target languages; for most of these languages, these are the first published end-to-end automatic speech recognition results.
Information retrieval for music and motion
Covers analysis and retrieval techniques for music data, including dynamic time warping, the SyncPlayer advanced audio player, and relational features with adaptive segmentation.
Adapting Monolingual Models: Data can be Scarce when Language Similarity is High
Retrains the lexical layers of four BERT-based models using data from two low-resource target language varieties, while the Transformer layers are independently fine-tuned on a POS-tagging task in the model's source language.
Toward Progress in Theories of Language Sound Structure