• Corpus ID: 27432725

Bridge-Language Capitalization Inference in Western Iranian: Sorani, Kurmanji, Zazaki, and Tajik

Patrick Littell, David R. Mortensen, Kartik Goyal, Chris Dyer, Lori S. Levin
International Conference on Language Resources and Evaluation

In Sorani Kurdish, one of the most useful orthographic features in named-entity recognition – capitalization – is absent, as the language’s Perso-Arabic script does not make a distinction between uppercase and lowercase letters. We describe a system for deriving an inferred capitalization value from closely related languages by phonological similarity, and illustrate the system using several related Western Iranian languages. 
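
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' system: it assumes the Sorani word has already been transliterated into the same Latin-based representation as the bridge language, and it uses plain Levenshtein distance as a crude stand-in for phonological similarity. The function names, the `max_dist` parameter, and the toy token list are all hypothetical.

```python
from collections import defaultdict


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance with unit costs (a stand-in for phonological distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def capitalization_rates(tokens):
    """Map each lowercased bridge-language word to the share of its occurrences that were capitalized."""
    caps = defaultdict(int)
    total = defaultdict(int)
    for tok in tokens:
        key = tok.lower()
        total[key] += 1
        caps[key] += tok[:1].isupper()
    return {w: caps[w] / total[w] for w in total}


def inferred_capitalization(word, rates, max_dist=1):
    """Average capitalization rate of bridge words within max_dist edits of word, or None if there are none."""
    near = [rate for w, rate in rates.items() if edit_distance(word, w) <= max_dist]
    return sum(near) / len(near) if near else None


# Toy bridge-language tokens, invented for illustration only.
bridge_tokens = ["Kurdistan", "Kurdistan", "Bajar", "bajar", "bajar", "bajar", "av"]
rates = capitalization_rates(bridge_tokens)

print(inferred_capitalization("kurdistan", rates))
print(inferred_capitalization("bajar", rates))
```

The resulting value could then be fed to a named-entity recognizer as a soft feature in place of the missing orthographic capitalization. A closer approximation of the described approach would replace `edit_distance` with a distance computed over phonological features.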

Named Entity Recognition for Linguistic Rapid Response in Low-Resource Languages: Sorani Kurdish and Tajik

A pilot study of “Linguistic Rapid Response” to potential emergency humanitarian relief situations, in which named-entity recognition systems were built for two Western Iranian languages, Sorani Kurdish and Tajik, finds that exploiting distributional regularities in monolingual data is effective.

Building a Corpus for the Zaza–Gorani Language Family

This paper presents an effort to collect a corpus of the Zazaki and Gorani languages containing over 1.6M and 194k word tokens, respectively; the corpus is publicly available.

PanPhon: A Resource for Mapping IPA Segments to Articulatory Feature Vectors

PanPhon is a database relating over 5,000 IPA segments to 21 subsegmental articulatory features. Phonological features are shown to outperform character-based models, and the database boosts performance in various NER-related tasks.

Phonologically Aware Neural Model for Named Entity Recognition in Low Resource Transfer Settings

Introduces an attentional neural model that uses only language-universal phonological character representations together with word embeddings; it achieves state-of-the-art performance in a supervised monolingual setting and can quickly adapt to a new language with minimal or no data.

Polyglot Neural Language Models: A Case Study in Cross-Lingual Phonetic Representation Learning

We introduce polyglot language models, recurrent neural network models trained to predict symbol sequences in many different languages using shared representations of symbols and conditioning on



Sorani Kurdish versus Kurmanji Kurdish: An Empirical Comparison

This paper aims at building a text corpus for Sorani and Kurmanji, the two main dialects of Kurdish, and highlighting some of the orthographic, phonological, and morphological differences between these two dialects from statistical and rule-based perspectives.

Developing a Large-Scale Lexicon for a Less-Resourced Language: General Methodology and Preliminary Experiments on Sorani Kurdish

Describes a general methodology for developing a large-scale lexicon for a less-resourced language, i.e., a language for which raw internet-based corpora and general-purpose grammars are virtually the only existing resources.

Building a Test Collection for Sorani Kurdish

The experimental results show that normalization and, to a lesser extent, stemming can greatly improve the performance of Sorani IR systems.

A Python Toolkit for Universal Transliteration

Describes ScriptTranscriber, an open-source toolkit for extracting transliterations from comparable corpora in languages written in different scripts; it allows ready incorporation of more sophisticated modules, e.g. a trained transliteration model for a particular language pair.

cdec: A Decoder, Alignment, and Learning Framework for Finite-State and Context-Free Translation Models

We present cdec, an open source framework for decoding, aligning with, and training a number of statistical machine translation models, including word-based models, phrase-based models, and models

SRILM - an extensible language modeling toolkit

The functionality of the SRILM toolkit is summarized and its design and implementation are discussed, highlighting ease of rapid prototyping, reusability, and combinability of tools.

Fast string correction with Levenshtein automata

  • K. Schulz, S. Mihov
  • Computer Science
    International Journal on Document Analysis and Recognition
  • 2002
This work shows how to compute, for any fixed bound n and any input word W, a deterministic Levenshtein automaton of degree n for W in time linear in the length of W, which leads to a very fast method for correcting corrupted input words of unrestricted text using large electronic dictionaries.

Sorani Kurdish reference grammar with selected readings

  • Ms.
  • 2006

Building a Kurdish language corpus: An overview of the technical problems

  • Proceedings of ICEMCO.
  • 1998

Ethnologue: Languages of the world, Eighteenth edition

  • 2015