Corpus ID: 43981019

From Arabic user-generated content to machine translation: integrating automatic error correction

Haithem Afli, Walid Aransa, Pintu Lohar, Andy Way. CICLing 2016.

With the spread of social media and online forums, individual users have been able to actively participate in the generation of online content in different languages and dialects. Arabic is one of the fastest-growing languages used on the Internet, but dialects (such as Egyptian and Saudi Arabian) account for a large share of Arabic online content. There are many differences between Dialectal Arabic and Modern Standard Arabic, which pose many challenges for Machine Translation of informal…


An Empirical Analysis of Moroccan Dialectal User-Generated Text

This paper investigates online written text generated by Moroccan users in social media, with an emphasis on Moroccan Dialectal Arabic, and finds code switching, the use of multiple scripts, and a low amount of words in Moroccan user-generated text (UGT).

Machine Translation and Vernacular: Interpreting the Informal

The history of MT evolved from Rule-Based MT (RBMT) to Statistical MT (SMT), which contends with challenges stemming from linguistic, semantic, and contextual complexity, as well as the necessity of robust training data in the source and target languages.



CMUQ@QALB-2014: An SMT-based System for Automatic Arabic Error Correction

The CMUQ system combines rule-based linguistic techniques with statistical language modeling techniques and machine translation-based methods and reaches an F-score of 65.42% on the test set of the QALB corpus, ranking it 3rd in the competition.

Morphological Analysis and Disambiguation for Dialectal Arabic

This paper retargets an existing state-of-the-art MSA morphological tagger to Egyptian Arabic (ARZ), and demonstrates that the ARZ morphology tagger outperforms its MSA variant on ARZ input in terms of accuracy in part-of-speech tagging, diacritization, lemmatization and tokenization, and in terms of utility for ARZ-to-English statistical machine translation.

Improved Spelling Error Detection and Correction for Arabic

This work semi-automatically develops a dictionary of 9.3 million fully inflected Arabic words using a morphological transducer and a large corpus and improves the error model and language model.

Dudley North visits North London: Learning When to Transliterate to Arabic

A classification-based framework is constructed to automate the transliteration decision of named entities for English to Arabic machine translation, and a reduction of translation error and an improvement in the performance of an English-to-Arabic machine translation system are demonstrated.

COLABA: Arabic Dialect Annotation and Processing

This paper describes COLABA, a large effort to create resources and processing tools for Dialectal Arabic blogs, and sketches how these resources and tools are put together to create DIRA, a term-expansion tool for information retrieval over dialectal Arabic collections using Modern Standard Arabic queries.

Standard language variety conversion for content localisation via SMT

It is shown that the SMT baseline already constitutes a strong system, which the authors fail to improve upon in a number of experiments; they conjecture that bilingual dictionaries mined from client data would help if more heterogeneous training data were to be added.

Using SMT for OCR Error Correction of Historical Texts

This paper performs a qualitative and quantitative comparison of several error-correction techniques for historical French documents and shows that the Machine Translation for Error Correction method is superior to other Language Modelling correction techniques.

Large Scale Arabic Error Annotation: Guidelines and Framework

We present annotation guidelines and a web-based annotation framework developed as part of an effort to create a manually annotated Arabic corpus of errors and corrections for various text types.

Arabic OCR Error Correction Using Character Segment Correction, Language Modeling, and Shallow Morphology

Experimentation shows that character-segment-based correction is superior to single-character correction and that language modeling boosts correction by improving the ranking of candidate corrections, while shallow morphology has a small adverse effect.

MADAMIRA: A Fast, Comprehensive Tool for Morphological Analysis and Disambiguation of Arabic

MADAMIRA is a system for morphological analysis and disambiguation of Arabic that combines some of the best aspects of two previously commonly used systems for Arabic processing with a more streamlined Java implementation that is more robust, portable, and extensible, and faster than its ancestors by more than an order of magnitude.