Corpus ID: 9561000

The Penn Chinese TreeBank: Phrase structure annotation of a large corpus

@inproceedings{RTAPAL2005ThePC,
  title={The Penn Chinese TreeBank: Phrase structure annotation of a large corpus},
  author={Nianwen Xue and Fei Xia and Fu-Dong Chiou and Martha Palmer},
  year={2005}
}
With growing interest in Chinese Language Processing, numerous NLP tools (e.g., word segmenters, part-of-speech taggers, and parsers) for Chinese have been developed all over the world. However, since no large-scale bracketed corpora are available to the public, these tools are trained on corpora with different segmentation criteria, part-of-speech tagsets and bracketing guidelines, and therefore, comparisons are difficult. As a first step towards addressing this issue, we have been preparing a…

Automatic Adaptation of Annotation Standards: Chinese Word Segmentation and POS Tagging – A Case Study
TLDR
Experiments show that adaptation from the much larger People's Daily corpus to the smaller but more popular Penn Chinese Treebank results in significant improvements in both segmentation and tagging accuracies, which in turn helps improve Chinese parsing accuracy.
TBL-Improved Non-Deterministic Segmentation and POS Tagging for a Chinese Parser
TLDR
This paper presents an experiment in improving the output of an off-the-shelf module that performs segmentation and tagging, the tokenizer-tagger from Beijing University (PKU), based on transformation-based learning (TBL).
Extending and Scaling up the Chinese Treebank Annotation
TLDR
To address the bottleneck, a procedure is implemented that decomposes the treebanking process into five self-contained steps and increases throughput by 30%, and it is shown that the disfluencies can be characterized into a finite set of categories.
Towards Robust Linguistic Analysis using OntoNotes
TLDR
An analysis of the performance of publicly available, state-of-the-art tools on all layers and languages in the OntoNotes v5.0 corpus should set the benchmark for future development of various NLP components in syntax and semantics, and possibly encourage research towards an integrated system that makes use of the various layers jointly to improve overall performance.
Augmenting Part-of-speech Tagging with Syntactic Information for Vietnamese and Chinese
TLDR
This paper implements a neural model for joint word segmentation and part-of-speech tagging in Vietnamese by employing a simplified constituency parser that replaces all constituent labels with a single label indicating phrases, to reduce the complexity of parsing.
Chinese Statistical Parsing
This chapter describes several issues that are fundamental to achieving accurate Chinese parsing given available Chinese resources and the challenges of the Gale processing pipeline. For Gale, our…
Using annotated discourse information of a RST Spanish-Chinese treebank for translation and language learning tasks
TLDR
This PhD study aims to partially fill a knowledge gap in the study between Spanish and Chinese by annotating discourse similarities and differences under the theoretical framework of Rhetorical Structure Theory (RST) by Mann and Thompson (1988).
Automatic Semantic Role Labeling for Chinese Verbs
TLDR
The results using hand-crafted parses are slightly higher than the results reported for the state-of-the-art semantic role labeling systems for English using the Penn English Proposition Bank data, even though the Chinese Proposition Bank is smaller in size.
Joint Chinese Word Segmentation and Part-of-speech Tagging via Two-way Attentions of Auto-analyzed Knowledge
TLDR
A neural model named TwASP is proposed for joint CWS and POS tagging following the character-based sequence labeling paradigm, where a two-way attention mechanism is used to incorporate both context features and their corresponding syntactic knowledge for each input character.
A Comparative Corpus Analysis of PP Ordering in English and Chinese
We present a comparative analysis of PP ordering in English and (Mandarin) Chinese, two languages with distinct typological word order characteristics. Previous work on PP orderings has mainly…

References

Is it Harder to Parse Chinese, or the Chinese Treebank?
TLDR
A factored-model statistical parser for the Penn Chinese Treebank is developed, showing the implications of gross statistical differences between WSJ and Chinese Treebanks for the most general methods of parser adaptation, and a detailed analysis of the major sources of statistical parse errors.
Automatic annotation of the Penn-treebank with LFG f-structure information
TLDR
A new method that scales and has been applied to a complete treebank, in this case the WSJ section of Penn-II (Marcus et al., 1994), with more than 1,000,000 words in about 50,000 sentences is presented.
Building a Large Annotated Corpus of English: The Penn Treebank
TLDR
As a result of this grant, the researchers have now published on CDROM a corpus of over 4 million words of running text annotated with part-of-speech (POS) tags, which includes a fully hand-parsed version of the classic Brown corpus.
A Stochastic Finite-State Word-Segmentation Algorithm for Chinese
TLDR
This paper presents a stochastic finite-state model wherein the basic workhorse is the weighted finite-state transducer and the model segments Chinese text into dictionary entries and words derived by various productive lexical processes, and provides pronunciations for these words.
Building a Large Chinese Corpus Annotated with Semantic Dependency
TLDR
This paper attempts to build a large corpus and annotate semantic knowledge with dependency grammar, and congruence is defined to measure the consistency of the tagged corpus.
Facilitating Treebank Annotation Using a Statistical Parser
Corpora of phrase-structure-annotated text, or treebanks, are useful for supervised training of statistical models for natural language processing, as well as for corpus linguistics. Their primary…
Discriminative Reranking for Natural Language Parsing
TLDR
The boosting approach to ranking problems described in Freund et al. (1998) is applied to parsing the Wall Street Journal treebank, and it is argued that the method is an appealing alternative, in terms of both simplicity and efficiency, to work on feature selection methods within log-linear (maximum-entropy) models.
Maximum entropy models for natural language ambiguity resolution
This thesis demonstrates that several important kinds of natural language ambiguities can be resolved to state-of-the-art accuracies using a single statistical modeling technique based on the…
Annotating the Propositions in the Penn Chinese Treebank
TLDR
It is described how diathesis alternation patterns can be used to make coarse sense distinctions for Chinese verbs as a necessary step in annotating the predicate-structure of Chinese verbs.
A Maximum-Entropy-Inspired Parser
TLDR
A new parser for parsing down to Penn tree-bank style parse trees that achieves 90.1% average precision/recall for sentences of length 40 and less and 89.5% when trained and tested on the previously established sections of the Wall Street Journal treebank is presented.