WAVPROMPT: Towards Few-Shot Spoken Language Understanding with Frozen Language Models

@article{Gao2022WAVPROMPTTF,
  title={WAVPROMPT: Towards Few-Shot Spoken Language Understanding with Frozen Language Models},
  author={Heting Gao and Junrui Ni and Kaizhi Qian and Yang Zhang and Shiyu Chang and Mark A. Hasegawa-Johnson},
  journal={ArXiv},
  year={2022},
  volume={abs/2203.15863}
}
Large-scale auto-regressive language models pretrained on massive text have demonstrated their impressive ability to perform new natural language tasks with only a few text examples, without the need for fine-tuning. Recent studies further show that such a few-shot learning ability can be extended to the text-image setting by training an encoder to encode images into embeddings functioning like the text embeddings of the language model. Interested in exploring the possibility of…
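The recipe described in the abstract, a trained encoder whose outputs stand in for the text embeddings of a frozen language model, can be illustrated with a short sketch. The snippet below is a minimal illustration only, assuming a wav2vec 2.0 encoder and GPT-2 from Hugging Face transformers plus a hypothetical linear projection into the LM's embedding space; it is not the authors' released implementation, and the prompt text and projection are placeholders.

```python
# Minimal sketch: prepend audio-derived embeddings to a text prompt and query a
# frozen GPT-2. Assumes wav2vec 2.0 and GPT-2 from Hugging Face transformers;
# the projection layer and prompt are illustrative, not the paper's actual code.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model, GPT2Tokenizer, GPT2LMHeadModel

audio_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2")
lm.requires_grad_(False)  # the language model stays frozen

# Hypothetical projection mapping audio features into GPT-2's embedding space;
# in this setup only the audio encoder and this layer would be trained.
to_lm_space = nn.Linear(audio_encoder.config.hidden_size, lm.config.n_embd)

def audio_prompt_logits(waveform: torch.Tensor, question: str) -> torch.Tensor:
    """Encode audio, splice it in front of a text prompt, and run the frozen LM."""
    audio_feats = audio_encoder(waveform).last_hidden_state   # (1, T, hidden)
    audio_embeds = to_lm_space(audio_feats)                    # (1, T, n_embd)
    text_ids = tokenizer(question, return_tensors="pt").input_ids
    text_embeds = lm.transformer.wte(text_ids)                 # (1, L, n_embd)
    inputs_embeds = torch.cat([audio_embeds, text_embeds], dim=1)
    return lm(inputs_embeds=inputs_embeds).logits              # next-token scores

# Example: one second of (dummy) 16 kHz audio followed by a text question.
logits = audio_prompt_logits(torch.zeros(1, 16000), "Question: what was said? Answer:")
```

The next-token scores over candidate answer words would then serve as the few-shot prediction, without updating any language-model weights.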

