Open Source MagicData-RAMC: A Rich Annotated Mandarin Conversational (RAMC) Speech Dataset

@article{Yang2022OpenSM,
  title={Open Source MagicData-RAMC: A Rich Annotated Mandarin Conversational(RAMC) Speech Dataset},
  author={Zehui Yang and Yifan Chen and Lei Luo and Runyan Yang and Lingxuan Ye and Gaofeng Cheng and Ji Xu and Yaohui Jin and Qingqing Zhang and Pengyuan Zhang and Lei Xie and Yonghong Yan},
  journal={ArXiv},
  year={2022},
  volume={abs/2203.16844}
}
This paper introduces a high-quality rich annotated Mandarin conversational (RAMC) speech dataset called MagicData-RAMC. The MagicData-RAMC corpus contains 180 hours of conversational speech recorded from native speakers of Mandarin Chinese over mobile phones at a sampling rate of 16 kHz. The dialogs in MagicData-RAMC are classified into 15 diversified domains, ranging from science and technology to ordinary life, and tagged with topic labels. Accurate transcription and precise speaker…
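
Since the dataset is described as speaker-labeled, time-aligned conversational speech organized by topic, a small script can illustrate how such annotations are typically consumed. The directory layout, file naming, and tab-separated column format below are assumptions for illustration only, not the corpus's documented schema.

```python
# Minimal sketch of iterating over a MagicData-RAMC-style corpus.
# The layout (one UTF-8 .txt transcript per dialog, tab-separated
# "start<TAB>end<TAB>speaker<TAB>text" lines) is a hypothetical
# stand-in for the real annotation format.
from collections import defaultdict
from pathlib import Path


def parse_transcript(path: Path):
    """Yield (start_sec, end_sec, speaker, text) for one dialog."""
    for line in path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        start, end, speaker, text = line.split("\t", 3)
        yield float(start), float(end), speaker, text


def speech_hours_per_speaker(corpus_dir: str) -> dict:
    """Sum annotated speech time (hours) per speaker across all dialogs."""
    totals = defaultdict(float)
    for txt in Path(corpus_dir).glob("**/*.txt"):
        for start, end, speaker, _ in parse_transcript(txt):
            totals[speaker] += (end - start) / 3600.0
    return dict(totals)


if __name__ == "__main__":
    print(speech_hours_per_speaker("ramc_corpus"))  # hypothetical root dir
```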

References

Showing 1-10 of 37 references
HKUST/MTS: A Very Large Scale Mandarin Telephone Speech Corpus
TLDR
The paper describes the design, collection, transcription, and analysis of 200 hours of the HKUST Mandarin Telephone Speech Corpus (HKUST/MTS), the first and largest of its kind for Mandarin conversational telephone speech, providing abundant and diversified samples for Mandarin speech recognition and other application-dependent tasks.
WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition
TLDR
WenetSpeech is currently the largest open-source Mandarin speech corpus with transcriptions, which benefits research on production-level speech recognition and is provided for cross-validation purposes in training and evaluation.
GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio
This paper introduces GigaSpeech, an evolving, multi-domain English speech recognition corpus with 10,000 hours of high-quality labeled audio suitable for supervised training, and 33,000 hours of total audio suitable for semi-supervised and unsupervised training.
AISHELL-4: An Open Source Dataset for Speech Enhancement, Separation, Recognition and Speaker Diarization in Conference Scenario
TLDR
AISHELL-4, a sizable real-recorded Mandarin speech dataset collected with an 8-channel circular microphone array for speech processing in conference scenarios, is presented; it is the only Mandarin dataset for conversational speech, providing additional value for data diversity in the speech community.
AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline
  • Hui Bu, Jiayu Du, X. Na, Bengu Wu, Hao Zheng
  • Physics, Computer Science
    2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA)
  • 2017
An open-source Mandarin speech corpus called AISHELL-1 is released. It is by far the largest corpus suitable for conducting speech recognition research and building speech recognition systems for Mandarin.
Librispeech: An ASR corpus based on public domain audio books
TLDR
It is shown that acoustic models trained on LibriSpeech give a lower error rate on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.
CN-Celeb: A Challenging Chinese Speaker Recognition Dataset
  • Yue Fan, Jiawen Kang, Dong Wang
  • Computer Science
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
TLDR
CN-Celeb is presented, a large-scale speaker recognition dataset collected ‘in the wild’ that contains more than 130,000 utterances from 1,000 Chinese celebrities and covers 11 different real-world genres.
Listen, attend and spell: A neural network for large vocabulary conversational speech recognition
We present Listen, Attend and Spell (LAS), a neural speech recognizer that transcribes speech utterances directly to characters without pronunciation models, HMMs, or other components of traditional speech recognizers.
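
As a rough illustration of the LAS idea, here is a minimal attention-based listener/speller in PyTorch. The layer sizes, the plain (non-pyramidal) BLSTM listener, and the dot-product attention are simplifications chosen for brevity, not the published LAS configuration.

```python
# Schematic Listen-Attend-Spell-style model: a BLSTM "listener" over
# acoustic features and an attention-based LSTM "speller" that emits
# character logits. Sizes and the single-layer attention are
# illustrative assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn


class TinyLAS(nn.Module):
    def __init__(self, n_mels=80, hidden=256, n_chars=30):
        super().__init__()
        self.hidden = hidden
        self.listener = nn.LSTM(n_mels, hidden, num_layers=2,
                                batch_first=True, bidirectional=True)
        self.char_embed = nn.Embedding(n_chars, hidden)
        self.attn_query = nn.Linear(hidden, 2 * hidden)  # dot-product attention
        self.speller = nn.LSTMCell(hidden + 2 * hidden, hidden)
        self.out = nn.Linear(hidden, n_chars)

    def forward(self, feats, chars):
        enc, _ = self.listener(feats)                     # (B, T, 2H)
        h = feats.new_zeros(feats.size(0), self.hidden)   # decoder state
        c = torch.zeros_like(h)
        logits = []
        for t in range(chars.size(1)):                    # teacher forcing
            q = self.attn_query(h).unsqueeze(1)           # (B, 1, 2H)
            w = torch.softmax((q * enc).sum(-1), dim=-1)  # (B, T) weights
            ctx = (w.unsqueeze(-1) * enc).sum(dim=1)      # (B, 2H) context
            step_in = torch.cat([self.char_embed(chars[:, t]), ctx], dim=-1)
            h, c = self.speller(step_in, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                 # (B, U, n_chars)


model = TinyLAS()
x = torch.randn(2, 50, 80)           # e.g. log-mel filterbank features
y = torch.randint(0, 30, (2, 10))    # previous-character ids (teacher forcing)
print(model(x, y).shape)             # torch.Size([2, 10, 30])
```
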
The Design for the Wall Street Journal-based CSR Corpus
TLDR
This paper presents the motivating goals, acoustic data design, text processing steps, lexicons, and testing paradigms incorporated into the multi-faceted WSJ CSR Corpus, a corpus containing significant quantities of both speech data and text data.
History Utterance Embedding Transformer LM for Speech Recognition
TLDR
The history utterance embedding Transformer LM (HTLM) comprises an embedding generation network that extracts contextual information from history utterances and a main Transformer LM for current prediction; two-stage attention (TSA) is proposed to encode richer contextual information into the history utterance embeddings while supporting GPU-parallel training.
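
To make the conditioning idea concrete, below is a minimal sketch of a Transformer LM that prepends a pooled embedding of history utterances to the current utterance. The mean-pooled two-layer history encoder and the single context token are illustrative assumptions; the paper's TSA mechanism is not reproduced here.

```python
# Schematic history-conditioned Transformer LM in the spirit of HTLM.
# A small encoder pools history tokens into one context vector, which is
# prepended to the current utterance before a causal Transformer LM.
# Dimensions and pooling are assumptions; positional encodings are
# omitted for brevity and TSA itself is not implemented.
import torch
import torch.nn as nn


class HistoryConditionedLM(nn.Module):
    def __init__(self, vocab=5000, d_model=256, nhead=4, layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.history_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lm = nn.TransformerEncoder(layer, num_layers=layers)
        self.out = nn.Linear(d_model, vocab)

    def forward(self, history_ids, current_ids):
        # Pool the encoded history into a single context "token".
        hist = self.history_encoder(self.embed(history_ids))
        ctx = hist.mean(dim=1, keepdim=True)              # (B, 1, D)
        x = torch.cat([ctx, self.embed(current_ids)], dim=1)
        # Causal mask: each position sees only the context and its past.
        L = x.size(1)
        mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        h = self.lm(x, mask=mask)
        return self.out(h[:, 1:])                         # drop context position


model = HistoryConditionedLM()
logits = model(torch.randint(0, 5000, (2, 20)),           # history tokens
               torch.randint(0, 5000, (2, 12)))           # current tokens
print(logits.shape)                                       # torch.Size([2, 12, 5000])
```
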