Joint Diacritization, Lemmatization, Normalization, and Fine-Grained Morphological Tagging

Nasser Zalmout and Nizar Habash
The written forms of Semitic languages are both highly ambiguous and morphologically rich: a word can have multiple interpretations and is one of many inflected forms of the same concept or lemma. This is further exacerbated for dialectal content, which is more prone to noise and lacks a standard orthography. The morphological features can be lexicalized, like lemmas and diacritized forms, or non-lexicalized, like gender, number, and part-of-speech tags, among others. Joint modeling of the… 

A Multitask Learning Approach for Diacritic Restoration
This work investigates the use of multi-task learning to jointly optimize diacritic restoration with related NLP tasks, namely word segmentation, part-of-speech tagging, and syntactic diacritization.
Morphosyntactic Tagging with Pre-trained Language Models for Arabic and its Dialects
The results show that strategic fine-tuning using datasets from other high-resource dialects is beneficial for a low-resource dialect, and that high-quality morphological analyzers as external linguistic resources are beneficial, especially in low-resource settings.
Improving Arabic Diacritization by Learning to Diacritize and Translate
A novel multitask learning method is proposed that trains a model to both diacritize and translate, with applications in text-to-speech, speech-to-speech translation, and other NLP tasks.
Deep Diacritization: Efficient Hierarchical Recurrence for Improved Arabic Diacritization
We propose a novel architecture for labelling character sequences that achieves state-of-the-art results on the Tashkeela Arabic diacritization benchmark; the core is a two-level recurrence hierarchy.
Morphological Analysis and Disambiguation for Gulf Arabic: The Interplay between Resources and Methods
This paper uses an existing state-of-the-art morphological disambiguation system to investigate the effects of different data sizes and different combinations of morphological analyzers for Modern Standard Arabic, Egyptian Arabic, and Gulf Arabic, and finds that in very low-resource settings, morphological analyzers help boost the performance of the full morphological disambiguation task.
Building a User-Generated Content North-African Arabizi Treebank: Tackling Hell
This is the first time that enough unlabeled and annotated data has been provided for an emerging user-generated content dialectal language with rich morphology and code-switching, making it a challenging test-bed for recent NLP approaches.
Neural Morphology Dataset and Models for Multiple Languages, from the Large to the Endangered
A method for automatically extracting substantial amounts of training data from FSTs for 22 languages, 17 of which are endangered, is presented.
Multi-Task Sequence Prediction For Tunisian Arabizi Multi-Level Annotation
A multi-task sequence prediction system based on recurrent neural networks, used to annotate a Tunisian Arabizi corpus on multiple levels, is developed for the Fairseq framework, which allows fast and easy use for any other sequence prediction problem.
Improving Arabic Diacritization with Regularized Decoding and Adversarial Training
Experimental results on two benchmark datasets show that the proposed regularized decoding and adversarial training model can still learn adequate diacritics and outperform all previous studies on both datasets.


The Role of Context in Neural Morphological Disambiguation
This paper addresses the problem of using context in morphological disambiguation by presenting several LSTM-based neural architectures that encode long-range surface-level and analysis-level contextual dependencies, and applies this approach to Turkish, Russian, and Arabic to compare effectiveness.
A Simple Joint Model for Improved Contextual Neural Lemmatization
A simple joint neural model for lemmatization and morphological tagging that achieves state-of-the-art results on 20 languages from the Universal Dependencies corpora is presented.
Noise-Robust Morphological Disambiguation for Dialectal Arabic
This work presents a neural morphological tagging and disambiguation model for Egyptian Arabic, with various extensions to handle noisy and inconsistent content.
Scaling character-based morphological tagging to fourteen languages
This paper investigates neural character-based morphological tagging for languages with complex morphology and large tag sets and shows consistent gains over a state-of-the-art morphological tagger across all languages except for English and French, where the state of the art is matched.
Combining Stochastic and Rule-Based Methods for Disambiguation in Agglutinative Languages
The results of combining stochastic and rule-based disambiguation methods applied to Basque suggest that this combined method can achieve good results and would be appropriate for other agglutinative languages.
Joint Lemmatization and Morphological Tagging with Lemming
LEMMING sets the new state of the art in token-based statistical lemmatization on six languages and reduces the error by 60%, and gives empirical evidence that jointly modeling morphological tags and lemmata is mutually beneficial.
Joint Prediction of Morphosyntactic Categories for Fine-Grained Arabic Part-of-Speech Tagging Exploiting Tag Dictionary Information
This paper proposes an approach that utilizes this information by jointly modeling multiple morphosyntactic tagging tasks with a multi-task learning framework and proposes a method of incorporating tag dictionary information into the authors' neural models by combining word representations with representations of the sets of possible tags.
Morphological Analysis and Disambiguation for Dialectal Arabic
This paper retargets an existing state-of-the-art MSA morphological tagger to Egyptian Arabic (ARZ), and demonstrates that the ARZ morphological tagger outperforms its MSA variant on ARZ input in terms of accuracy in part-of-speech tagging, diacritization, lemmatization, and tokenization, and in terms of utility for ARZ-to-English statistical machine translation.
Highly Effective Arabic Diacritization using Sequence to Sequence Modeling
This work presents a unified character level sequence-to-sequence deep learning model that recovers both types of diacritics without the use of explicit feature engineering and outperforms all previous state-of-the-art systems.
Context Sensitive Neural Lemmatization with Lematus
Lematus, a lemmatizer based on a standard encoder-decoder architecture, which incorporates character-level sentence context, is introduced, and it is shown that including context significantly improves results against a context-free version of the model.