100,000 Podcasts: A Spoken English Document Corpus
@inproceedings{Clifton20201000P, title={100,000 Podcasts: A Spoken English Document Corpus}, author={A. Clifton and S. Reddy and Yongze Yu and A. Pappu and R. Rezapour and Hamed Bonab and Maria Eskevich and G. Jones and Jussi Karlgren and Ben Carterette and R. Jones}, booktitle={COLING}, year={2020} }
Podcasts are a large and growing repository of spoken audio. As an audio format, podcasts are more varied in style and production type than broadcast news, contain more genres than typically studied in video data, and are more varied in style and format than previous corpora of conversations. When transcribed with automatic speech recognition they represent a noisy but fascinating collection of documents which can be studied through the lens of natural language processing, information retrieval…