• Corpus ID: 226307957

The_Illiterati: Part-of-Speech Tagging for Magahi and Bhojpuri without even knowing the alphabet

Authors: Thomas Proisl, Peter Uhrig, Andreas Blombach, Natalie Dykes, Philipp Heinrich, Besim Kabashi, Sefora Mammarella
In this paper, we describe the part-of-speech tagging experiments for Magahi and Bhojpuri that we conducted for our participation in the NSURL 2019 shared tasks 9 and 10 (Low-level NLP Tools for (Magahi|Bhojpuri) Language). We experiment with three different part-of-speech taggers and evaluate the impact of additional resources such as Brown clusters, word embeddings and transfer learning from additional tagged corpora in related languages. In a 10-fold cross-validation on the training data, our…


KLUMSy @ KIPoS: Experiments on Part-of-Speech Tagging of Spoken Italian
This paper describes experiments on part-of-speech tagging of spoken Italian that were conducted in the context of the EVALITA 2020 KIPoS shared task and documents the approach and results in the shared task along with a statistical analysis of the factors which impact performance the most.
On State-of-the-art of POS Tagger, ‘Sandhi’ Splitter, ‘Alankaar’ Finder and ‘Samaas’ Finder for Indo-Aryan and Dravidian Languages
Analysis shows that the Rule Based Approach (RBA) and the Hidden Markov Model (HMM) are frequently used for POS tagging, RBA is most frequently used for the “Sandhi” Splitter, general Human Intelligence (HI) is used for the “Alankaar” Finder, and no “Samaas” Finder technique is available for any Indian language.
A Deep Learning Based Method for Structuring the Chinese Pathological Reports of Lung Specimen
Experimental results on the self-constructed datasets indicated that the proposed structured processing method can be beneficial for structuring pathology reports of lung specimen and obtained state-of-the-art results.


SoMeWeTa: A Part-of-Speech Tagger for German Social Media and Web Texts
SoMeWeTa, a part-of-speech tagger based on the averaged structured perceptron, is described; it is capable of domain adaptation, can use various external resources, and substantially improves on the state of the art for both the web and the social media data sets.
Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network
A new part-of-speech tagger is presented that demonstrates the following ideas: explicit use of both preceding and following tag contexts via a dependency network representation, broad use of lexical features, and effective use of priors in conditional loglinear models.
Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger
This paper presents results for a maximum-entropy-based part of speech tagger, which achieves superior performance principally by enriching the information sources used for tagging by incorporating these features: more extensive treatment of capitalization for unknown words, and features for the disambiguation of the tense forms of verbs.
A Named Entity Recognition Shootout for German
This work asks how to practically build a model for German named entity recognition (NER) that performs at the state of the art for both contemporary and historical texts, i.e., in a big-data and a small-data scenario, and pits the two best-performing model families against each other to observe the trade-off between expressiveness and data requirements.
The Hindi/Urdu Treebank Project
The goal of the Hindi/Urdu treebanking project is to build multi-layered treebanks that will provide both syntactic and semantic annotations and that cover two standardized registers that are often considered separate languages: Hindi and Urdu.
Semi-Supervised Learning for Natural Language
This thesis focuses on two segmentation tasks, named-entity recognition and Chinese word segmentation, and shows that features derived from unlabeled data substantially improve performance, both in terms of reducing the amount of labeled data needed to achieve a certain performance level and in terms of reducing the error using a fixed amount of labeled data.
Class-Based n-gram Models of Natural Language
This work addresses the problem of predicting a word from previous words in a sample of text and discusses n-gram models based on classes of words, finding that these models are able to extract classes that have the flavor of either syntactically based groupings or semantically based groupings, depending on the nature of the underlying statistics.
The Indo-Aryan Languages
1. Introduction 2. The modern Indo-Aryan languages and dialects 3. The historical context and development of Indo-Aryan 4. The nature of the New Indo-Aryan lexicon 5. NIA descriptive phonology 6.
Hindi Syntax: Annotating Dependency, Lexical Predicate-Argument Structure, and Phrase Structure
A treebanking project for Hindi/Urdu is annotating dependency syntax, lexical predicate-argument structure, and phrase structure syntax in a coordinated and partly automated manner.
Training & evaluation of POS taggers in Indo-Aryan languages: a case of Hindi, Odia and Bhojpuri
  • Semi-supervised learning for natural language. Master’s thesis, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science
  • 2015