Bootstrapping the Development of an HPSG-based Treebank for Persian
@article{Ghayoomi2012BootstrappingTD,
  title={Bootstrapping the Development of an HPSG-based Treebank for Persian},
  author={Masood Ghayoomi},
  journal={Linguistic Issues in Language Technology},
  year={2012},
  volume={7}
}
In this paper, we describe ongoing research to develop an HPSG-based treebank for Persian. To this aim, we use a bootstrapping approach for the data annotation. In the first step, a set of seed rules is defined as regular expressions in the CLaRK system. Then, the data is shallow-processed with this set of rules. In the next step, a human annotator completes the annotation of sentences manually. To increase automatic annotation, we extract the manually applied rules and iteratively augment…
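The bootstrapping loop outlined in the abstract can be pictured roughly as follows. This is only a minimal sketch under stated assumptions: it uses plain Python regular expressions rather than the CLaRK system, the POS tags (`DET`, `ADJ`, `N`, `P`, `V`), the chunk labels, the seed rule, and all function names are hypothetical, and the rule-extraction step (turning a manually added span into a literal tag-sequence pattern) is a simple stand-in, not the method used in the paper.

```python
import re

# Hypothetical seed rule set: chunk label -> regex over a POS-tag string.
# Every tag in the string is followed by one space, so patterns are written
# with a trailing space after each tag name.
SEED_RULES = {"NP": r"(?:DET )?(?:ADJ )*N "}


def tag_string(tags):
    """Join tags so that each one is followed by exactly one space."""
    return "".join(t + " " for t in tags)


def shallow_annotate(tags, rules):
    """Apply the regex rules; return sorted (start, end, label) token spans."""
    text = tag_string(tags)
    spans = []
    for label, pattern in rules.items():
        # \b keeps a match from starting in the middle of a tag name.
        for m in re.finditer(r"\b(?:" + pattern + ")", text):
            # Token index = number of spaces before the character offset.
            start = text[:m.start()].count(" ")
            end = text[:m.end()].count(" ")
            spans.append((start, end, label))
    return sorted(spans)


def extract_rule(tags, span):
    """Turn one manually annotated span into a literal tag-sequence pattern."""
    start, end, label = span
    return label, "".join(re.escape(t) + " " for t in tags[start:end])


def bootstrap(sentences, rules, manual_annotate):
    """Auto-annotate, let the annotator complete, then augment the rules.

    `manual_annotate(tags, auto_spans)` stands in for the human annotator
    and returns the complete list of spans for the sentence.
    """
    rules = dict(rules)  # do not mutate the caller's rule set
    for tags in sentences:
        auto = shallow_annotate(tags, rules)
        gold = manual_annotate(tags, auto)
        for span in gold:
            if span not in auto:
                label, pattern = extract_rule(tags, span)
                # Add the new pattern as an alternative for this label.
                if label in rules:
                    rules[label] = rules[label] + "|" + pattern
                else:
                    rules[label] = pattern
    return rules
```

Calling `bootstrap` repeatedly on new batches of sentences, each time with the rule set returned by the previous call, mirrors the iterative augmentation the abstract describes: spans the annotator had to add by hand in one round become automatic matches in the next.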
16 Citations
Toward a Multi-Representation Persian Treebank
- Computer Science
- 2018 9th International Symposium on Telecommunications (IST)
- 2018
The treebank is built using a bootstrapping approach that converts a dependency structure tree to a phrase structure tree, after which the annotations are corrected manually. It has two syntactic representations: phrase structure and dependency structure.
Development of a Persian Syntactic Dependency Treebank
- Linguistics
- NAACL 2013
- 2013
The annotation process and linguistic properties of the Persian syntactic dependency treebank, which consists of approximately 30,000 sentences annotated with syntactic roles in addition to morpho-syntactic features, are described.
A Persian Treebank with Stanford Typed Dependencies
- Computer Science
- LREC
- 2014
The Uppsala Persian Dependency Treebank (UPDT) is presented, with a syntactic annotation scheme based on Stanford Typed Dependencies, along with open source tools for automatic analysis of Persian: a text normalizer, a sentence segmenter and tokenizers, a part-of-speech tagger, and a parser.
Converting an HPSG-based Treebank into its Parallel Dependency-based Treebank
- Computer Science
- LREC
- 2014
With this converter, a new language resource can be created automatically from an existing treebank developed on the basis of a grammar formalism; the converter is able to create both projective and non-projective dependency trees.
A New DOP Model for Phrase-structure Parsing of Persian Sentences
- Computer Science
- ALR@COLING
- 2012
The accuracy of Double-DOP is well within the range of state-of-the-art parsers currently used in other NLP-tasks, while offering the additional benefits of a simple generative probability model and an explicit representation of grammatical constructions.
Constituency Parsing of Bulgarian: Word- vs Class-based Parsing
- Computer Science
- LREC
- 2014
This paper proposes using the Brown word clustering to do an off-line clustering and map the words in the treebank to create a class-based treebank, and shows that when the classes outnumber the POS tags, the results are better.
A Basic Language Resource Kit for Persian
- Computer Science
- LREC
- 2012
This work describes the Uppsala PErsian Corpus (UPEC), a modified version of the Bijankhan corpus with additional sentence segmentation and consistent tokenization modified for more appropriate syntactic annotation, and develops open source resources such as corpora and treebanks, and tools for data-driven linguistic analysis of Persian.
Converting Dependency Structure Into Persian Phrase Structure
- Computer Science
- ACM Trans. Asian Low Resour. Lang. Inf. Process.
- 2019
This article proposes a method to convert a dependency structure into a phrase structure by enriching the trainable model of an earlier hybrid-strategy approach with a classifier and a postprocessing modification, and reports results on error-rate reduction and conversion quality.
Word Clustering for Persian Statistical Parsing
- Computer Science
- JapTAL
- 2012
A word-clustering approach using the Brown algorithm is presented, along with an extension in which the POS tags of the words are also taken into consideration during clustering; adding this information is shown to improve clustering performance, especially for homographs.
A new hybrid stemming method for Persian language
- Computer Science
- Digit. Scholarsh. Humanit.
- 2017
A new hybrid stemming method based on a combination of affix stripping and statistical techniques for Persian language is proposed, which combines cues from the orthography, word frequency, and syntactic distributions to induce the stemming rules.
References
SHOWING 1-10 OF 27 REFERENCES
An HPSG-based Syntactic Treebank of Bulgarian (BulTreeBank)
- Linguistics
- 2002
The aim of this volume is to showcase the range of corpus-based linguistic research currently being carried out on languages other than English. The papers included report on work carried out on…
Automatic annotation of the Penn-treebank with LFG f-structure information
- Computer Science
- 2002
A new method is presented that scales and has been applied to a complete treebank, in this case the WSJ section of Penn-II (Marcus et al., 1994), with more than 1,000,000 words in about 50,000 sentences.
CCGbank: A Corpus of CCG Derivations and Dependency Structures Extracted from the Penn Treebank
- Computer Science
- CL
- 2007
This article presents an algorithm for translating the Penn Treebank into a corpus of Combinatory Categorial Grammar (CCG) derivations augmented with local and long-range word-word dependencies, and discusses the implications of the findings for the extraction of other linguistically expressive grammars from the Treebank, and for the design of future treebanks.
PerGram: A TRALE implementation of an HPSG fragment of Persian
- Linguistics, Computer Science
- Proceedings of the International Multiconference on Computer Science and Information Technology
- 2010
An HPSG grammar of Persian (PerGram) that is implemented in the TRALE system is discussed and a test suite with positive and negative examples from the linguistic literature is developed to test the coverage of the grammar with respect to naturally occurring sentences.
Lessons from building a Persian written corpus: Peykare
- Linguistics
- Lang. Resour. Evaluation
- 2011
This paper addresses some of the lessons learned while building a written language resource for contemporary Persian, called ‘Peykare’, with special attention to the Ezafe construction and homographs, which are important in Persian text analysis.
LinGO Redwoods
- Computer Science
- 2004
The Linguistic Grammars On-Line (LinGO) Redwoods initiative is presented, a seed activity in the design and development of a new type of treebank, rich in nature and dynamic both in the ways linguistic data can be retrieved from the treebank at varying granularity and in the constant evolution and regular updating of the treebank itself.
Corpus-based Analysis for Multi-token Units in Persian
- Linguistics
- MTSUMMIT
- 2009
Defining the multi-token unit templates for these categories is one of the important results of this research and can be input to the segmentation module of the Persian Treebank generator system.
Unsupervised Parse Selection for HPSG
- Computer Science
- EMNLP
- 2010
This work shows that, by taking advantage of the constrained nature of these HPSG grammars, a discriminative parse selection model can be learned from raw text in a purely unsupervised fashion, which makes it possible to bootstrap the treebanking process and provide better parsers faster and with fewer resources.
Persian complex predicates and the limits of inheritance-based analyses
- Linguistics
- 2010
It is shown that theories that rely exclusively on the classification of patterns in inheritance hierarchies cannot account for the facts in an insightful way unless they are augmented by transformations or some similar device and that a lexical account together with appropriate grammar rules and an argument composition analysis of the future auxiliary has none of the shortcomings that classification-based analyses have.
CLaRK - an XML-based System for Corpora Development
- Computer Science
- 2001
The architecture and the intended applications of the CLaRK system are described. Its main interface is an XML editor, and several uses are envisaged, including dictionary compilation for human users and document management, storing, and querying.