Building a Large Annotated Corpus of English: The Penn Treebank

@article{Marcus1993BuildingAL,
  title={Building a Large Annotated Corpus of English: The Penn Treebank},
  author={Mitchell P. Marcus and Beatrice Santorini and Mary Ann Marcinkiewicz},
  journal={Comput. Linguistics},
  year={1993},
  volume={19},
  pages={313--330}
}
Abstract: As a result of this grant, the researchers have now published on CD-ROM a corpus of over 4 million words of running text annotated with part-of-speech (POS) tags, with over 3 million words of that material assigned skeletal grammatical structure. This material now includes a fully hand-parsed version of the classic Brown corpus. About one half of the papers at the ACL Workshop on Using Large Text Corpora this past summer were based on the materials generated by this grant.
The Penn Treebank: An Overview
TLDR
The design of the three annotation schemes used by the Treebank (POS tagging, syntactic bracketing, and disfluency annotation) and the methodology employed in production are described.
Machine Translation of the Penn Treebank to Spanish
In this work we explored the problem of translating the Penn Treebank corpus to Spanish. For this problem, we considered Phrase-based Machine Translation techniques. Given that there not exist
Parsing Early Modern English for Linguistic Search
TLDR
This work trains a part-of-speech tagger and parser on a corpus of historical English, using ELMo embeddings trained over a billion words of similar text, to investigate the question of whether advances in NLP make it possible to vastly increase the size of data usable for research in historical syntax.
The Penn Discourse Treebank
TLDR
A preliminary analysis of inter-annotator agreement is presented, covering both the level of agreement and the types of inter-annotator variation.
Building a Treebank for French
TLDR
A treebank project for French has annotated a newspaper corpus of 1 Million words with part of speech, inflection, compounds, lemmas and constituency and presents some uses of the corpus.
Facilitating Treebank Annotation Using a Statistical Parser
Corpora of phrase-structure-annotated text, or treebanks, are useful for supervised training of statistical models for natural language processing, as well as for corpus linguistics. Their primary
An all-words sense annotated Turkish corpus
  • Sinan Akcakaya, O. T. Yildiz
  • Computer Science
    2018 2nd International Conference on Natural Language and Speech Processing (ICNLSP)
  • 2018
This paper reports our efforts in constructing a sense-labeled Turkish corpus with respect to Turkish Language Institution's dictionary, using the traditional method of manual tagging. We tagged a
Parallel treebank from word-aligned bilingual corpus. Language engineering for phrasal alignments
  • M. Colhon
  • Computer Science
    15th International Conference on System Theory, Control and Computing
  • 2011
TLDR
A mechanism for parallel treebank generation between an intensively studied language (i.e. English) and a less studied language, like Romanian, which is induced from the corresponding constituents of the English part, taking into account the word alignments of the corpus.
An Annotated Corpus and a Grammar Model of Theorem Description
TLDR
A syntactically annotated corpus of theorem descriptions is built, using a book of set theory, and a grammar model of theorems is extracted from the obtained corpus, as the first step to understanding mathematical documents by computer.
Bank of English and Beyond
TLDR
A new practical parsing system, the Functional Dependency Grammar parser, is presented, developed from the Constraint Grammar system, and its suitability for treebank annotation is discussed.

References

SHOWING 1-10 OF 77 REFERENCES
Deducing Linguistic Structure from the Statistics of Large Corpora
Two experiments that strongly suggest that largely distributional techniques might be developed to automatically provide both a set of part of speech tags for English and a skeletal parsing of free
Part-of-Speech Tagging Guidelines for the Penn Treebank Project (3rd Revision)
TLDR
This manual addresses the linguistic issues that arise in connection with annotating texts by part of speech ("tagging") and discusses parts of speech that are easily confused and gives guidelines on how to tag such cases.
Deducing linguistic structure from the statistics of large corpora
TLDR
The success of approaches using both stochastic and symbolic techniques suggests that much of the grammatical structure of language may be derived automatically through distributional analysis, an approach attempted and abandoned in the 1950s.
A stochastic parts program and noun phrase parser for unrestricted text
  • Kenneth Ward Church
  • Computer Science
    International Conference on Acoustics, Speech, and Signal Processing
  • 1989
TLDR
A program that tags each word in an input sentence with the most likely part of speech has been written and performance is encouraging; a 400-word sample is presented and is judged to be 99.5% correct.
Partial Parsing: A Report on Work in Progress
This paper reports a handful of experiments designed to test the feasibility of applying well-known partial parsing techniques to the problem of automatic data base update from an open-ended source
Inside-Outside Reestimation From Partially Bracketed Corpora
TLDR
The inside-outside algorithm for inferring the parameters of a stochastic context-free grammar is extended to take advantage of constituent information in a partially parsed corpus to achieve faster convergence and better modelling of hierarchical structure than the original one.
Studies in Part of Speech Labelling
TLDR
This paper reports experiments in three important areas: handling unknown words, limiting the size of the training set, and returning a set of the most likely tags for each word rather than a single tag.
Parsing a Natural Language Using Mutual Information Statistics
TLDR
The generalized mutual information statistic is derived, the parsing algorithm is described, and results and sample output from the parser are presented.
Acquiring Disambiguation Rules from Text
TLDR
An effective procedure for automatically acquiring a new set of disambiguation rules for an existing deterministic parser on the basis of tagged text is presented and suggests a path toward more robust and comprehensive syntactic analyzers.
Probabilistic Parse Scoring Based on Prosodic Phrasing
TLDR
A decision tree is designed to predict prosodic phrase structure for a given syntactic parse, and the tree is used to compute a parse score, which now is the probability of the recognized break sequence.