speechocean762: An Open-Source Non-native English Speech Corpus For Pronunciation Assessment

Junbo Zhang, Zhiwen Zhang, Yongqing Wang, Zhiyong Yan, Qiong Song, Yukai Huang, Ke Li, Daniel Povey, Yujun Wang
This paper introduces a new open-source speech corpus named “speechocean762”, designed for pronunciation assessment and consisting of 5000 English utterances from 250 non-native speakers, half of whom are children. Five experts annotated each utterance at the sentence, word and phoneme levels. A baseline system is released as open source to illustrate the phoneme-level pronunciation assessment workflow on this corpus. This corpus is allowed to be used freely for…


Self-Supervised Pre-Trained Speech Representation Based End-to-End Mispronunciation Detection and Diagnosis of Mandarin

This work extended the end-to-end MDD system based on CTC/Attention hybrid architecture and Transformer architecture, using features extracted from self-supervised pre-training speech representation models such as Wav2Vec 2.0 and WavLM to replace conventional speech features like MFCC and Fbank.

Automatic Pronunciation Assessment using Self-Supervised Speech Representation Learning

A novel automatic pronunciation assessment method based on SSL models is proposed; it outperforms the baselines, in terms of the Pearson correlation coefficient, on datasets of Korean ESL learner children and speechocean762.
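As a rough illustration of the metric named above, the Pearson correlation coefficient between human-assigned and model-predicted scores can be computed as follows. This is a minimal sketch; the function name `pearson` and the sample score lists are made up for illustration and do not come from any of the listed papers.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores: the model tracks the human ratings on a different scale.
human_scores = [2.0, 4.0, 6.0, 8.0]
model_scores = [1.0, 2.0, 3.0, 4.0]
print(pearson(human_scores, model_scores))
```

Because the coefficient is invariant to linear rescaling, a model whose scores are a scaled copy of the human ratings still reaches the maximum correlation.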

SpeechBlender: Speech Augmentation Framework for Mispronunciation Data Generation

SpeechBlender uses a variety of masks to target different regions of a phonetic unit, and uses mixing factors to linearly interpolate raw speech signals while generating erroneous pronunciation instances, thus producing more effective samples than the ‘Cut/Paste’ method.
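The general idea of masked linear interpolation described above can be sketched as below. This is an assumption-laden toy, not the SpeechBlender implementation: the function `blend_region`, the single contiguous mask, and the sample values are all hypothetical.

```python
def blend_region(good, bad, start, end, lam):
    """Blend two equal-length sample sequences inside the region [start, end).

    Inside the region each output sample is lam * good + (1 - lam) * bad;
    outside the region the `good` samples are kept unchanged.
    """
    if len(good) != len(bad):
        raise ValueError("signals must have the same length")
    if not 0 <= start <= end <= len(good):
        raise ValueError("region must lie inside the signal")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("mixing factor must be in [0, 1]")
    out = list(good)
    for i in range(start, end):
        out[i] = lam * good[i] + (1.0 - lam) * bad[i]
    return out


# Toy signals standing in for two aligned phone segments (hypothetical values).
good = [1.0, 1.0, 1.0, 1.0]
bad = [0.0, 0.0, 0.0, 0.0]
print(blend_region(good, bad, start=1, end=3, lam=0.25))
```

A smaller mixing factor pushes the masked region further toward the second signal, which is one way to vary how strongly the blended sample deviates from the original.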

Variations of multi-task learning for spoken language assessment

Experiments on the speechocean762 dataset suggest that jointly learning from phone and word-level scores yields significant performance gains for the sentence-level score prediction task, and jointly learning from different score types can also be mutually beneficial.

Transformer-Based Multi-Aspect Multi-Granularity Non-Native English Speaker Pronunciation Assessment

This work trains a Goodness Of Pronunciation feature-based Transformer (GOPT) with multi-task learning and shows that GOPT achieves the best results on speechocean762 with a public automatic speech recognition (ASR) acoustic model trained on Librispeech.

CoCA-MDD: A Coupled Cross-Attention based Framework for Streaming Mispronunciation Detection and Diagnosis

A streaming end-to-end MDD model called CoCA-MDD is proposed, which uses a conv-transformer structure to encode input speech in a streaming manner and a coupled cross-attention (CoCA) mechanism to integrate frame-level acoustic features with encoded reference linguistic features.

3M: An Effective Multi-view, Multi-granularity, and Multi-aspect Modeling Approach to English Pronunciation Assessment

This paper integrates multiple prosodic and phonological features to provide a multi-view, multi-granularity, and multi-aspect pronunciation modeling and develops a vowel/consonant positional embedding for a more phonology-aware automatic pronunciation assessment.

L2-GEN: A Neural Phoneme Paraphrasing Approach to L2 Speech Synthesis for Mispronunciation Diagnosis

L2-GEN is proposed, a new data augmentation framework that generates L2 phoneme sequences capturing realistic mispronunciation patterns by means of a unique machine translation-based sequence paraphrasing model.

Improving Non-native Word-level Pronunciation Scoring with Phone-level Mixup Data Augmentation and Multi-source Information

Phone-level mixup, a simple yet effective data augmentation method, is proposed to improve the performance of word-level pronunciation scoring, and multi-source information is used to further improve the scoring system.

Hierarchical Pronunciation Assessment with Multi-Aspect Attention

A Hierarchical Pronunciation Assessment with Multi-aspect Attention (HiPAMA) model is proposed, which hierarchically represents the granularity levels to directly capture their linguistic structures and introduces multi-aspect attention that reflects associations across aspects at the same level to create more connotative representations.



The ISLE Corpus of Non-Native Spoken English

A corpus of non-native speech data has been collected, which consists of almost 18 hours of annotated speech signals spoken by Italian and German learners of English, to highlight pronunciation errors such as phone realisation problems and misplaced word stress assignments.

Sell-corpus: an Open Source Multiple Accented Chinese-english Speech Corpus for L2 English Learning Assessment

  • Yu Chen, Jun Hu, Xinyu Zhang
  • Linguistics
    ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
To the best of the authors' knowledge, this work is the first open-source English speech corpus that accounts for the accents of all major Chinese regional dialects, and it provides a baseline for a Chinese multi-accent automatic speech recognition system.

Development of Japanese Speech Database Read by Non-native Speakers for Constructing CALL System

This paper describes the construction and evaluation of a Japanese speech database read by non-native speakers, built by a research project in order to develop CALL systems. The project has been organized

L2-ARCTIC: A Non-native English Speech Corpus

L2-ARCTIC is introduced, a speech corpus of non-native English that is intended for research in voice conversion, accent conversion, and mispronunciation detection, and is publicly accessible at https://psi.tamu.edu/l2-arctic-corpus/.

TLT-school: a Corpus of Non Native Children Speech

A corpus of speech utterances collected in schools of northern Italy for assessing the performance of students learning both English and German is described, along with results obtained using an automatic speech recognition system developed by the authors.

iCALL corpus: Mandarin Chinese spoken by non-native speakers of European descent

We present iCALL, a speech corpus designed to evaluate Mandarin Chinese pronunciation patterns of non-native speakers of European descent, developed at the Institute for Infocomm Research (I²R) in

Using the HTK speech recogniser to analyse prosody in a corpus of German spoken learner's English

Intonation is important in human communication to help the listener to understand the meaning and attitude of the speaker (Brown, 1977; O’Connor, 1970). Language students and teachers see intonation

The Automatic Assessment of Non-native Prosody: Combining Classical Prosodic Analysis with Acoustic Modelling

This paper combines a large prosodic feature vector with features derived from a Gaussian Mixture Model used as Universal Background Model and an open-source toolkit for extracting acoustic features to assess the quality of L2 learner’s utterances with respect to sentence melody and rhythm.

Design and Collection of an L2 English Corpus with a Suprasegmental Focus for Chinese Learners of English

A corpus is designed, collected and annotated to elicit suprasegmental information from over 200 Chinese learners of English, focusing on lexical stress, utterance level stress, intonation, reduction of function words, as well as prosodic disambiguation.

Intonation classification for L2 English speech using multi-distribution deep neural networks