Joint Diacritization, Lemmatization, Normalization, and Fine-Grained Morphological Tagging
@article{Zalmout2020JointDL, title={Joint Diacritization, Lemmatization, Normalization, and Fine-Grained Morphological Tagging}, author={Nasser Zalmout and Nizar Habash}, journal={ArXiv}, year={2020}, volume={abs/1910.02267} }
The written forms of Semitic languages are both highly ambiguous and morphologically rich: a word can have multiple interpretations and is one of many inflected forms of the same concept or lemma. This is further exacerbated for dialectal content, which is more prone to noise and lacks a standard orthography. The morphological features can be lexicalized, like lemmas and diacritized forms, or non-lexicalized, like gender, number, and part-of-speech tags, among others. Joint modeling of the…
10 Citations
A Multitask Learning Approach for Diacritic Restoration
- Computer Science
- ACL
- 2020
This work investigates the use of multi-task learning to jointly optimize diacritic restoration with related NLP problems, namely word segmentation, part-of-speech tagging, and syntactic diacritization.
Morphosyntactic Tagging with Pre-trained Language Models for Arabic and its Dialects
- Computer Science
- FINDINGS
- 2022
The results show that strategic fine-tuning using datasets from other high-resource dialects is beneficial for a low-resource dialect and that high-quality morphological analyzers as external linguistic resources are beneficial, especially in low-resource settings.
Improving Arabic Diacritization by Learning to Diacritize and Translate
- Computer Science
- IWSLT
- 2022
A novel multitask learning method that trains a model to both diacritize and translate is proposed; it has applications in text-to-speech, speech-to-speech translation, and other NLP tasks.
Deep Diacritization: Efficient Hierarchical Recurrence for Improved Arabic Diacritization
- Computer Science
- WANLP
- 2020
We propose a novel architecture for labelling character sequences that achieves state-of-the-art results on the Tashkeela Arabic diacritization benchmark. The core is a two-level recurrence hierarchy…
Morphological Analysis and Disambiguation for Gulf Arabic: The Interplay between Resources and Methods
- Computer Science
- LREC
- 2020
This paper uses an existing state-of-the-art morphological disambiguation system to investigate the effects of different data sizes and different combinations of morphological analyzers for Modern Standard Arabic, Egyptian Arabic, and Gulf Arabic, and finds that in very low-resource settings, morphological analyzers help boost the performance of the full morphological disambiguation task.
Building a User-Generated Content North-African Arabizi Treebank: Tackling Hell
- Computer Science
- ACL
- 2020
This is the first time that enough unlabeled and annotated data is provided for an emerging user-generated content dialectal language with rich morphology and code switching, making it a challenging test-bed for most recent NLP approaches.
Neural Morphology Dataset and Models for Multiple Languages, from the Large to the Endangered
- Computer Science
- NODALIDA
- 2021
A method for automatically extracting a substantially large amount of training data from FSTs for 22 languages, of which 17 are endangered, is presented.
Multi-Task Sequence Prediction For Tunisian Arabizi Multi-Level Annotation
- Computer Science
- WANLP
- 2020
A multi-task sequence prediction system based on recurrent neural networks is used to annotate an Arabizi Tunisian corpus on multiple levels; it is developed for the Fairseq framework, which allows fast and easy use for any other sequence prediction problem.
Improving Arabic Diacritization with Regularized Decoding and Adversarial Training
- Computer Science
- ACL
- 2021
Experimental results on two benchmark datasets show that the proposed regularized decoding and adversarial training model can still learn adequate diacritics and outperforms all previous studies on both datasets.
References
SHOWING 1-10 OF 58 REFERENCES
The Role of Context in Neural Morphological Disambiguation
- Computer Science, Linguistics
- COLING
- 2016
This paper addresses the problem of using context in morphological disambiguation by presenting several LSTM-based neural architectures that encode long-range surface-level and analysis-level contextual dependencies, and applies this approach to Turkish, Russian, and Arabic to compare effectiveness.
A Simple Joint Model for Improved Contextual Neural Lemmatization
- Linguistics
- NAACL
- 2019
A simple joint neural model for lemmatization and morphological tagging that achieves state-of-the-art results on 20 languages from the Universal Dependencies corpora is presented.
Noise-Robust Morphological Disambiguation for Dialectal Arabic
- Computer Science
- NAACL
- 2018
This work presents a neural morphological tagging and disambiguation model for Egyptian Arabic, with various extensions to handle noisy and inconsistent content.
Scaling character-based morphological tagging to fourteen languages
- Computer Science
- 2016 IEEE International Conference on Big Data (Big Data)
- 2016
This paper investigates neural character-based morphological tagging for languages with complex morphology and large tag sets and shows consistent gains over a state-of-the-art morphological tagger across all languages except for English and French, where the state of the art is matched.
Combining Stochastic and Rule-Based Methods for Disambiguation in Agglutinative Languages
- Computer Science
- ACL
- 1998
The results of combining stochastic and rule-based disambiguation methods applied to the Basque language are presented; the authors think that this combined method can achieve good results and that it would be appropriate for other agglutinative languages.
Joint Lemmatization and Morphological Tagging with Lemming
- Computer Science
- EMNLP
- 2015
LEMMING sets the new state of the art in token-based statistical lemmatization on six languages and reduces the error by 60%, and gives empirical evidence that jointly modeling morphological tags and lemmata is mutually beneficial.
Joint Prediction of Morphosyntactic Categories for Fine-Grained Arabic Part-of-Speech Tagging Exploiting Tag Dictionary Information
- Computer Science
- CoNLL
- 2017
This paper proposes an approach that utilizes this information by jointly modeling multiple morphosyntactic tagging tasks with a multi-task learning framework and proposes a method of incorporating tag dictionary information into the authors' neural models by combining word representations with representations of the sets of possible tags.
Morphological Analysis and Disambiguation for Dialectal Arabic
- Computer Science
- NAACL
- 2013
This paper retargets an existing state-of-the-art MSA morphological tagger to Egyptian Arabic (ARZ), and demonstrates that the ARZ morphology tagger outperforms its MSA variant on ARZ input in terms of accuracy in part-of-speech tagging, diacritization, lemmatization, and tokenization, and in terms of utility for ARZ-to-English statistical machine translation.
Highly Effective Arabic Diacritization using Sequence to Sequence Modeling
- Computer Science
- NAACL
- 2019
This work presents a unified character level sequence-to-sequence deep learning model that recovers both types of diacritics without the use of explicit feature engineering and outperforms all previous state-of-the-art systems.
Context Sensitive Neural Lemmatization with Lematus
- Computer Science
- NAACL
- 2018
Lematus, a lemmatizer based on a standard encoder-decoder architecture, which incorporates character-level sentence context, is introduced, and it is shown that including context significantly improves results against a context-free version of the model.