Parsing the Arabic Treebank: Analysis and Improvements

Abstract

Previous work has demonstrated that the performance of current parsers on Arabic is far below their performance on English or even Chinese, which in turn harms performance on NLP tasks that take parser output as input. This paper explores some of the issues behind this difference. We focus on the Collins parsing model [3] as implemented in the Bikel parser [1]. The corpus used for the experiments is the Arabic Treebank [6] (ATB). We group these issues into three areas. First, any comparison of Arabic parsing performance with that of other languages must be a fair one; we therefore begin with some issues around evaluation and show that current Arabic parsing performance is not quite as bad as previously thought. Second, we present some modifications to the parser that provide modest increases in performance. Finally, we explore deeper differences between the Arabic Treebank and the Penn Treebank and offer some speculations as to why parsers have difficulty with Arabic.

Cite this paper

@inproceedings{Kulick2006ParsingTA, title={Parsing the Arabic Treebank: Analysis and Improvements}, author={Seth Kulick and Ryan Gabbard and Mitchell P. Marcus}, year={2006} }