• Corpus ID: 2554659

A Proposal for a Part-of-Speech Tagset for the Albanian Language

@inproceedings{Kabashi2016APF,
  title={A Proposal for a Part-of-Speech Tagset for the Albanian Language},
  author={Besim Kabashi and Thomas Proisl},
  booktitle={LREC},
  year={2016}
}
Part-of-speech tagging is a basic step in Natural Language Processing that is often essential. Labeling the word forms of a text with fine-grained word-class information adds new value to it and can be a prerequisite for downstream processes like a dependency parser. Corpus linguists and lexicographers also benefit greatly from the improved search options that are available with tagged data. The Albanian language has some properties that pose difficulties for the creation of a part-of-speech… 
Albanian Part-of-Speech Tagging: Gold Standard and Evaluation
TLDR
This paper provides mappings from the full tagset to both the original Google Universal Part-of-Speech Tags and the variant used in the Universal Dependencies project and achieves accuracies of up to 95.10%.
A lexicon of Albanian for natural language processing
TLDR
A lexicon that can be used for the Albanian language and aims to cover basic information for most frequent tasks of natural language processing is presented below.
Building Dictionaries for Low Resource Languages: Challenges of Unsupervised Learning
TLDR
According to this study, the total expected frequency as a means for correctly tagging words has been proven effective for populating the Albanian language dictionary.
Morphological Tagging and Lemmatization of Albanian: A Manually Annotated Corpus and Neural Models
TLDR
This paper has created an Albanian part-of-speech corpus based on the Universal Dependencies schema for morphological annotation, containing about 118,000 tokens of naturally occurring text collected from different text sources, plus an additional 67,000 tokens of artificially created simple sentences used only in training.
Universal Dependencies for Albanian
In this paper, we introduce the first Universal Dependencies (UD) treebank for standard Albanian, consisting of 60 sentences collected from the Albanian Wikipedia, annotated with lemmas, universal

References

Part-of-Speech Tagging Guidelines for the Penn Treebank Project (3rd Revision)
TLDR
This manual addresses the linguistic issues that arise in connection with annotating texts by part of speech ("tagging") and discusses parts of speech that are easily confused and gives guidelines on how to tag such cases.
NLTK tagger for Albanian using iterative approach
  • A. Kadriu
  • Environmental Science
    Proceedings of the ITI 2013 35th International Conference on Information Technology Interfaces
  • 2013
TLDR
A tagging model for Albanian texts built with the NLTK toolkit, cascading three taggers with backoff and using a dictionary of around 32,000 words together with a set of regular-expression rules.
A morphological Analyzer for Standard Albanian
TLDR
A morphological analyzer for standard Albanian, intended as a component of an annotation tool in the context of the Albanian Corpus Initiative, is presented, along with a complete tagset for Albanian and full-form lexica for pronouns and irregular open-class elements.
Electronic Dictionaries and Transducers for Automatic Processing of the Albanian Language
TLDR
The problem of unknown words in a recently reformed language and the evolution of features in the dictionaries are taken into consideration, and finite-state transducers (FSTs) are used for their dynamic treatment.
Morphological study of Albanian words, and processing with NooJ
TLDR
The authors are developing electronic dictionaries and transducers for the automatic processing of the Albanian Language and found that morphemes are frequently concatenated or simply juxtaposed or contracted.
Syntactic Wordclass Tagging
TLDR
This paper presents a meta-modelling framework for tagging that automates the labor-intensive, time-consuming, and expensive process of manually selecting and operating tagsets.
Standards for Tagsets.
TLDR
In the interests of interchangeability and re-usability of annotated corpora, it is important to avoid a ‘free-for-all’, or a ‘reinvention of the wheel’ every time a new project begins.
Part-of-speech tagging guidelines for the penn treebank project
A Part of Speech Tagging Model for Albanian
  • Lambert Academic Publishing, Saarbrücken.
  • 2012
Albanische Grammatik