• Corpus ID: 16364992

Bootstrapping the Development of an HPSG-based Treebank for Persian

@article{Ghayoomi2012BootstrappingTD,
  title={Bootstrapping the Development of an HPSG-based Treebank for Persian},
  author={Masood Ghayoomi},
  journal={Linguistic Issues in Language Technology},
  year={2012},
  volume={7}
}
  • Masood Ghayoomi
  • Published 6 January 2012
  • Computer Science
  • Linguistic Issues in Language Technology
In this paper, we describe ongoing research to develop an HPSG-based treebank for Persian. To this end, we use a bootstrapping approach for the data annotation. In the first step, a set of seed rules is defined as regular expressions in the CLaRK system. The data is then shallow-processed with this set of rules. In the next step, a human annotator completes the annotation of the sentences manually. To increase automatic annotation, we extract the manually applied rules and iteratively augment…
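
The bootstrapping loop outlined in the abstract can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the paper's implementation: it uses Python's `re` module rather than the CLaRK system, treats rules as regular expressions over space-separated POS tags, and simulates the human annotator with a callback. The abstract is truncated before it says how extracted rules are fed back, so adding them to the rule set here is an assumption. The names `compile_rule`, `shallow_annotate`, `bootstrap`, and the tag set are hypothetical.

```python
import re


def compile_rule(tag_pattern):
    # Anchor the pattern so it only matches whole tags in a
    # space-separated tag string (e.g. "V" must not match inside "ADV").
    return re.compile(r"(?<!\S)(?:" + tag_pattern + r")(?!\S)")


def shallow_annotate(tags, rules):
    """Apply regex rules to a tag sequence; return sorted (start, end, label) chunks."""
    text = " ".join(tags)
    covered = set()
    chunks = []
    for pattern, label in rules:
        for m in compile_rule(pattern).finditer(text):
            # Map character offsets back to token indices by counting spaces.
            start = text[:m.start()].count(" ")
            end = start + m.group().count(" ") + 1
            span = range(start, end)
            if covered.isdisjoint(span):  # earlier rules take precedence
                covered.update(span)
                chunks.append((start, end, label))
    return sorted(chunks)


def bootstrap(sentences, seed_rules, manual_annotate, rounds=2):
    """Iteratively grow the rule set from manually added chunks (assumed feedback step)."""
    rules = list(seed_rules)
    for _ in range(rounds):
        for tags in sentences:
            auto = shallow_annotate(tags, rules)
            gold = manual_annotate(tags, auto)  # stands in for the human annotator
            for start, end, label in gold:
                if (start, end, label) not in auto:
                    # Turn the manually annotated span into a literal tag-sequence rule.
                    pattern = " ".join(re.escape(t) for t in tags[start:end])
                    if (pattern, label) not in rules:
                        rules.append((pattern, label))
    return rules


if __name__ == "__main__":
    # Hypothetical seed rule and toy tag sequences.
    seed = [("DET N", "NP")]
    data = [["DET", "N", "V"], ["P", "DET", "N"]]

    def annotator(tags, auto):
        # Toy stand-in: the "human" marks every verb as a VP chunk.
        return list(auto) + [(i, i + 1, "VP") for i, t in enumerate(tags) if t == "V"]

    print(bootstrap(data, seed, annotator))
```

In this sketch each round re-annotates the data with the enlarged rule set, so chunks that required manual work in one round are found automatically in the next.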


Toward a Multi-Representation Persian Treebank
TLDR
The treebank is built using a bootstrapping approach in which a dependency structure tree is converted to a phrase structure tree and the annotations are corrected manually; it has two syntactic representations: phrase structure and dependency structure.
Development of a Persian Syntactic Dependency Treebank
TLDR
The annotation process and linguistic properties of the Persian syntactic dependency treebank, which consists of approximately 30,000 sentences annotated with syntactic roles in addition to morpho-syntactic features, are described.
A Persian Treebank with Stanford Typed Dependencies
TLDR
The Uppsala Persian Dependency Treebank (UPDT) is presented with a syntactic annotation scheme based on Stanford Typed Dependencies, along with open source tools for automatic analysis of Persian: a text normalizer, a sentence segmenter and tokenizers, a part-of-speech tagger, and a parser.
Converting an HPSG-based Treebank into its Parallel Dependency-based Treebank
TLDR
With this converter, a new language resource can be created automatically from an existing treebank developed based on a grammar formalism, and both projective and non-projective dependency trees can be created.
A New DOP Model for Phrase-structure Parsing of Persian Sentences
TLDR
The accuracy of Double-DOP is well within the range of state-of-the-art parsers currently used in other NLP-tasks, while offering the additional benefits of a simple generative probability model and an explicit representation of grammatical constructions.
Constituency Parsing of Bulgarian: Word- vs Class-based Parsing
TLDR
This paper proposes using Brown word clustering to cluster words off-line and map the words in the treebank to create a class-based treebank, and shows that when the classes outnumber the POS tags, the results are better.
A Basic Language Resource Kit for Persian
TLDR
This work describes the Uppsala PErsian Corpus (UPEC), a modified version of the Bijankhan corpus with additional sentence segmentation and consistent tokenization modified for more appropriate syntactic annotation, and develops open source resources such as corpora and treebanks, and tools for data-driven linguistic analysis of Persian.
Converting Dependency Structure Into Persian Phrase Structure
TLDR
This article proposes a method for converting a dependency structure into a phrase structure by enriching the trainable model of a former hybrid-strategy approach with a classifier and a postprocessing modification, and reports a reduced error rate and improved conversion quality.
Word Clustering for Persian Statistical Parsing
TLDR
A word-clustering approach using the Brown algorithm is presented, together with an extension in which the POS tags of the words are also taken into consideration during clustering; adding this information is shown to improve clustering performance, especially for homographs.
A new hybrid stemming method for Persian language
TLDR
A new hybrid stemming method for the Persian language is proposed, based on a combination of affix stripping and statistical techniques; it combines cues from orthography, word frequency, and syntactic distributions to induce the stemming rules.

References

SHOWING 1-10 OF 27 REFERENCES
‘An HPSG-based Syntactic Treebank of Bulgarian (BulTreeBank)’
The aim of this volume is to showcase the range of corpus-based linguistic research currently being carried out on languages other than English. The papers included report on work carried out on
Automatic annotation of the Penn-treebank with LFG f-structure information
TLDR
A new method is presented that scales and has been applied to a complete treebank, in this case the WSJ section of Penn-II (Marcus et al., 1994), with more than 1,000,000 words in about 50,000 sentences.
CCGbank: A Corpus of CCG Derivations and Dependency Structures Extracted from the Penn Treebank
TLDR
This article presents an algorithm for translating the Penn Treebank into a corpus of Combinatory Categorial Grammar (CCG) derivations augmented with local and long-range word-word dependencies, and discusses the implications of the findings for the extraction of other linguistically expressive grammars from the Treebank, and for the design of future treebanks.
PerGram: A TRALE implementation of an HPSG fragment of Persian
  • Stefan Müller, Masood Ghayoomi
  • Linguistics, Computer Science
    Proceedings of the International Multiconference on Computer Science and Information Technology
  • 2010
TLDR
An HPSG grammar of Persian (PerGram) that is implemented in the TRALE system is discussed and a test suite with positive and negative examples from the linguistic literature is developed to test the coverage of the grammar with respect to naturally occurring sentences.
Lessons from building a Persian written corpus: Peykare
TLDR
This paper addresses some of the lessons learned while building a written language resource, called ‘Peykare’, for contemporary Persian, with special attention to the Ezafe construction and homographs, which are important in Persian text analysis.
LinGO Redwoods
TLDR
The Linguistic Grammars On-Line (LinGO) Redwoods initiative is presented, a seed activity in the design and development of a new type of treebank, rich in nature and dynamic both in the ways linguistic data can be retrieved from the treebank at varying granularity and in the constant evolution and regular updating of the treebank itself.
Corpus-based Analysis for Multi-token Units in Persian
TLDR
Defining the multi-token unit templates for these categories is one of the important results of this research and can be input to the segmentation module of the Persian Treebank generator system.
Unsupervised Parse Selection for HPSG
TLDR
This work shows that, by taking advantage of the constrained nature of these HPSG grammars, a discriminative parse selection model can be learned from raw text in a purely unsupervised fashion, which makes it possible to bootstrap the treebanking process and provide better parsers faster and with fewer resources.
Persian complex predicates and the limits of inheritance-based analyses
TLDR
It is shown that theories that rely exclusively on the classification of patterns in inheritance hierarchies cannot account for the facts in an insightful way unless they are augmented by transformations or some similar device and that a lexical account together with appropriate grammar rules and an argument composition analysis of the future auxiliary has none of the shortcomings that classification-based analyses have.
CLaRK - an XML-based System for Corpora Development
TLDR
The architecture and the intended applications of the CLaRK system are described; an XML editor is the main interface to the system, and several uses are envisaged, including dictionary compilation for human users and document management, storing and querying.