On Development of Consistently Punctuated Speech Corpora


Punctuating automatically recognized speech improves the readability of transcripts and aids downstream NLP. This paper addresses the issues involved in developing training and test corpora for automatic punctuation systems. Annotating punctuation in speech transcripts is difficult because there are numerous cases for which no standard punctuation rules exist. To address this, special punctuation annotation guidelines tailored to spoken language were developed. Using these guidelines, trained annotators punctuated almost 100 hours of broadcast news and conversation data in English and French. Measures of inter-annotator agreement are provided for both languages; differences between languages and genres are analyzed and discussed, along with some of the most frequent disagreements between annotators. Overall, use of the guidelines significantly improved annotation consistency.

