Corpus ID: 227231123

100,000 Podcasts: A Spoken English Document Corpus

@inproceedings{Clifton20201000P,
  title={100,000 Podcasts: A Spoken English Document Corpus},
  author={A. Clifton and S. Reddy and Yongze Yu and A. Pappu and R. Rezapour and Hamed Bonab and Maria Eskevich and G. Jones and Jussi Karlgren and Ben Carterette and R. Jones},
  booktitle={COLING},
  year={2020}
}
  • A. Clifton, S. Reddy, +8 authors R. Jones
  • Published in COLING 2020
  • Computer Science
  • Podcasts are a large and growing repository of spoken audio. As an audio format, podcasts are more varied in style and production type than broadcast news, contain more genres than typically studied in video data, and are more varied in style and format than previous corpora of conversations. When transcribed with automatic speech recognition they represent a noisy but fascinating collection of documents which can be studied through the lens of natural language processing, information retrieval…
