Corpus ID: 27432725

Bridge-Language Capitalization Inference in Western Iranian: Sorani, Kurmanji, Zazaki, and Tajik

Patrick Littell, David R. Mortensen, Kartik Goyal, Chris Dyer, Lori S. Levin
In Sorani Kurdish, one of the most useful orthographic features in named-entity recognition – capitalization – is absent, as the language’s Perso-Arabic script does not make a distinction between uppercase and lowercase letters. We describe a system for deriving an inferred capitalization value from closely related languages by phonological similarity, and illustrate the system using several related Western Iranian languages. 
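The idea in the abstract can be illustrated with a minimal sketch. This is not the authors' system: it assumes the Sorani word has already been romanized, and it swaps in plain character-level edit distance as a stand-in for the phonological similarity measure the abstract mentions. The function names, the `max_distance` threshold, and the toy lexicon entries and rates are all hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance with unit costs (a stand-in for phonological similarity)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def infer_capitalization(word, bridge_lexicon, max_distance=1):
    """Return the capitalization rate of the closest bridge-language word.

    bridge_lexicon maps a lowercased bridge-language word to the fraction of
    its corpus occurrences that were capitalized. Returns None when no entry
    lies within max_distance edits of the query.
    """
    best = None
    for bridge_word, cap_rate in bridge_lexicon.items():
        d = edit_distance(word, bridge_word)
        if d <= max_distance and (best is None or d < best[0]):
            best = (d, cap_rate)
    return None if best is None else best[1]


# Hypothetical toy lexicon; the rates are invented for illustration only.
toy_lexicon = {"bexda": 0.97, "bajar": 0.02}

near_name = infer_capitalization("begda", toy_lexicon)   # one edit from "bexda"
near_noun = infer_capitalization("bajer", toy_lexicon)   # one edit from "bajar"
no_match = infer_capitalization("kteb", toy_lexicon)     # nothing within one edit
```

The inferred value could then serve as one feature for a named-entity tagger, standing in for the capitalization cue that the Perso-Arabic script does not provide.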
Named Entity Recognition for Linguistic Rapid Response in Low-Resource Languages: Sorani Kurdish and Tajik
Construction of named-entity recognition systems in two Western Iranian languages, Sorani Kurdish and Tajik, as part of a pilot study of “Linguistic Rapid Response” to potential emergency humanitarian relief situations finds that exploiting distributional regularities in monolingual data is effective.
Building a Corpus for the Zaza–Gorani Language Family
This paper presents an effort to collect a corpus of the Zazaki and Gorani languages containing over 1.6M and 194k word tokens, respectively; the corpus is publicly available.
PanPhon: A Resource for Mapping IPA Segments to Articulatory Feature Vectors
PanPhon is a database relating over 5,000 IPA segments to 21 subsegmental articulatory features. It is shown that phonological features outperform character-based models and that the database boosts performance in various NER-related tasks.
Phonologically Aware Neural Model for Named Entity Recognition in Low Resource Transfer Settings
An attentional neural model is introduced that uses only language-universal phonological character representations with word embeddings, achieves state-of-the-art performance in a supervised monolingual setting, and can quickly adapt to a new language with minimal or no data.
Polyglot Neural Language Models: A Case Study in Cross-Lingual Phonetic Representation Learning
We introduce polyglot language models, recurrent neural network models trained to predict symbol sequences in many different languages using shared representations of symbols and conditioning on…


Sorani Kurdish versus Kurmanji Kurdish: An Empirical Comparison
This paper aims at building a text corpus for Sorani and Kurmanji, the two main dialects of Kurdish, and highlighting some of the orthographic, phonological, and morphological differences between these two dialects from statistical and rule-based perspectives.
Developing a Large-Scale Lexicon for a Less-Resourced Language: General Methodology and Preliminary Experiments on Sorani Kurdish
A general methodology is described for developing a large-scale lexicon for a less-resourced language, i.e., a language for which raw internet-based corpora and general-purpose grammars are virtually the only existing resources.
Building a Test Collection for Sorani Kurdish
The experimental results show that normalization and, to a lesser extent, stemming can greatly improve the performance of Sorani IR systems.
A Python Toolkit for Universal Transliteration
ScriptTranscriber is described: an open-source toolkit for extracting transliterations in comparable corpora from languages written in different scripts. It allows for ready incorporation of more sophisticated modules, e.g., a trained transliteration model for a particular language pair.
cdec: A Decoder, Alignment, and Learning Framework for Finite-State and Context-Free Translation Models
We present cdec, an open source framework for decoding, aligning with, and training a number of statistical machine translation models, including word-based models, phrase-based models, and models…
Fast string correction with Levenshtein automata
  • K. Schulz, S. Mihov
  • Computer Science
  • International Journal on Document Analysis and Recognition
  • 2002
This work shows how to compute, for any fixed bound n and any input word W, a deterministic Levenshtein automaton of degree n for W in time linear in the length of W, which leads to a very fast method for correcting corrupted input words of unrestricted text using large electronic dictionaries.
SRILM - an extensible language modeling toolkit
The functionality of the SRILM toolkit is summarized and its design and implementation are discussed, highlighting ease of rapid prototyping, reusability, and combinability of tools.
Sorani Kurdish reference grammar with selected readings
  • Ms.
  • 2006
Building a Kurdish language corpus: An overview of the technical problems
  • Proceedings of ICEMCO.
  • 1998