• Corpus ID: 21704644

Albanian Part-of-Speech Tagging: Gold Standard and Evaluation

Besim Kabashi and Thomas Proisl
In this paper, we present a gold standard corpus for Albanian part-of-speech tagging and perform evaluation experiments with different statistical taggers. We provide mappings from the full tagset to both the original Google Universal Part-of-Speech Tags and the variant used in the Universal Dependencies project. We perform experiments with different taggers on the full tagset as well as on the coarser tagsets and achieve accuracies of up to 95.10%.
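The idea of scoring a tagger on the full tagset and again on coarser tagsets can be illustrated with a minimal sketch. The fine-grained tag names below (`NounProper`, `VerbFinite`, and so on) are hypothetical placeholders and are not the paper's tagset. The helper names `to_coarse` and `accuracy` are likewise assumptions made for illustration. Only the coarse target labels follow the two universal inventories: the original Google set has no separate proper-noun tag, while the Universal Dependencies variant distinguishes `PROPN` from `NOUN`.

```python
from typing import Dict, List

# Hypothetical fine-grained tags (illustrative only, not the paper's tagset)
# mapped to the original Google universal tags, which merge proper and
# common nouns and use "." for punctuation.
FINE_TO_GOOGLE: Dict[str, str] = {
    "NounCommon": "NOUN",
    "NounProper": "NOUN",
    "VerbFinite": "VERB",
    "Punct": ".",
}

# The same hypothetical tags mapped to the Universal Dependencies variant,
# which keeps proper nouns (PROPN) apart from common nouns (NOUN).
FINE_TO_UD: Dict[str, str] = {
    "NounCommon": "NOUN",
    "NounProper": "PROPN",
    "VerbFinite": "VERB",
    "Punct": "PUNCT",
}


def to_coarse(tags: List[str], mapping: Dict[str, str], default: str = "X") -> List[str]:
    """Map each fine-grained tag to its coarse tag; unknown tags become `default`."""
    return [mapping.get(tag, default) for tag in tags]


def accuracy(gold: List[str], predicted: List[str]) -> float:
    """Token-level accuracy: the share of positions where both tags agree."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Toy data: the tagger confuses a proper noun with a common noun.
gold_fine = ["NounProper", "VerbFinite", "NounCommon", "Punct"]
pred_fine = ["NounCommon", "VerbFinite", "NounCommon", "Punct"]

fine_acc = accuracy(gold_fine, pred_fine)
ud_acc = accuracy(to_coarse(gold_fine, FINE_TO_UD), to_coarse(pred_fine, FINE_TO_UD))
google_acc = accuracy(
    to_coarse(gold_fine, FINE_TO_GOOGLE), to_coarse(pred_fine, FINE_TO_GOOGLE)
)

print(f"full tagset: {fine_acc:.2%}, UD: {ud_acc:.2%}, Google: {google_acc:.2%}")
```

In this toy example the proper/common noun confusion counts as an error on the full tagset and under the UD mapping, but disappears under the Google mapping, which is why accuracy figures on coarser tagsets tend to be higher than on the full tagset.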


Morphological Tagging and Lemmatization of Albanian: A Manually Annotated Corpus and Neural Models
This paper has created an Albanian part-of-speech corpus based on the Universal Dependencies schema for morphological annotation, containing about 118,000 tokens of naturally occurring text collected from different text sources, with an addition of 67,000 tokens of artificially created simple sentences used only in training.
KLUMSy @ KIPoS: Experiments on Part-of-Speech Tagging of Spoken Italian
This paper describes experiments on part-of-speech tagging of spoken Italian that were conducted in the context of the EVALITA 2020 KIPoS shared task and documents the approach and results in the shared task along with a statistical analysis of the factors which impact performance the most.
Building Dictionaries for Low Resource Languages: Challenges of Unsupervised Learning
According to this study, the total expected frequency as a means for correctly tagging words has been proven effective for populating the Albanian language dictionary.
Universal Dependencies for Albanian
In this paper, we introduce the first Universal Dependencies (UD) treebank for standard Albanian, consisting of 60 sentences collected from the Albanian Wikipedia, annotated with lemmas, universal
A lexicon of Albanian for natural language processing
A lexicon that can be used for the Albanian language and aims to cover basic information for most frequent tasks of natural language processing is presented below.
Collecting Collocations for the Albanian Language
The collecting of data from different sources to build a collocation data set with the aim of compiling the first contemporary collocation dictionary for the Albanian language is described, based on the analysis of empirical data, i.
Albanian fake news detection
This paper presents a new public data set of labeled true and fake news in Albanian, and performs an extensive analysis of machine learning methods for fake news detection, exploring the Albanian language related feature categories such as the lexical, syntactic, lying-detection, and psycho-linguistic features.


A Proposal for a Part-of-Speech Tagset for the Albanian Language
The Albanian language has some properties that pose difficulties for the creation of a part-of-speech tagset that can adequately represent the underlying linguistic phenomena, and this paper presents a proposal for that tagset.
A Universal Part-of-Speech Tagset
This work proposes a tagset that consists of twelve universal part-of-speech categories and develops a mapping from 25 different treebank tagsets to this universal set, which when combined with the original treebank data produces a dataset consisting of common parts-of-speech for 22 different languages.
TnT - A Statistical Part-of-Speech Tagger
Contrary to claims found elsewhere in the literature, it is argued that a tagger based on Markov models performs at least as well as other current approaches, including the Maximum Entropy framework.
SoMeWeTa: A Part-of-Speech Tagger for German Social Media and Web Texts
SoMeWeTa is a part-of-speech tagger based on the averaged structured perceptron that is capable of domain adaptation and can use various external resources; it substantially improves on the state of the art for both the web and the social media data sets.
Improvements in Part-of-Speech Tagging with an Application to German
This paper presents a meta-modelling system that automates the very labor-intensive and therefore time-heavy and expensive process of manually tagging part-of-speech content in a variety of languages.
NLTK tagger for Albanian using iterative approach
A. Kadriu. Proceedings of the ITI 2013 35th International Conference on Information Technology Interfaces, 2013.
A tagging model for Albanian texts built with the NLTK toolkit: a cascade of three taggers with backoff, supported by a dictionary of around 32,000 words and a set of regular-expression rules.
Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network
A new part-of-speech tagger is presented that demonstrates the following ideas: explicit use of both preceding and following tag contexts via a dependency network representation, broad use of lexical features, and effective use of priors in conditional loglinear models.
A morphological Analyzer for Standard Albanian
A morphological analyzer for standard Albanian intended as a component of an annotation tool in the context of the Albanian Corpus Initiative and a complete tagset for Albanian and full form lexica for pronouns and irregular open-class elements are presented.
HunPos: an open source trigram tagger
HunPos is presented, a free and open source (LGPL-licensed) alternative, which can be tuned by the user to fully utilize the potential of HMM architectures, offering performance comparable to more complex models, but preserving the ease and speed of the training and tagging process.
Probabilistic part-of-speech tagging using decision trees
In this paper, a new probabilistic tagging method is presented which avoids problems that Markov Model based taggers face, when they have to estimate transition probabilities from sparse data. In