A Novel Challenge Set for Hebrew Morphological Disambiguation and Diacritics Restoration

Avi Shmidman, Joshua Guedalia, Shaltiel Shmidman, Moshe Koppel, Reut Tsarfaty
One of the primary tasks of morphological parsers is the disambiguation of homographs. Particularly difficult are cases of unbalanced ambiguity, where one of the possible analyses is far more frequent than the others. In such cases there may not be enough examples of the minority analyses to properly evaluate performance, nor to train effective classifiers. In this paper we address the issue of unbalanced morphological ambiguities in Hebrew. We offer a challenge set for Hebrew…
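The imbalance problem described above can be made concrete with a small sketch: a majority-class baseline on a skewed homograph already scores high overall while getting every minority analysis wrong. The counts and tag labels below are invented purely for illustration, not drawn from the paper.

```python
from collections import Counter

# Hypothetical toy distribution for a single ambiguous surface form:
# one dominant analysis and two rare ones (counts are invented).
analyses = ["NOUN"] * 970 + ["VERB"] * 25 + ["ADJ"] * 5

counts = Counter(analyses)
majority, majority_count = counts.most_common(1)[0]

# A classifier that always outputs the majority analysis scores very
# high overall, even though it misses every minority case.
baseline_accuracy = majority_count / len(analyses)
minority_recall = 0.0  # the baseline never predicts VERB or ADJ

print(majority, round(baseline_accuracy, 3), minority_recall)
```

This is why a challenge set with deliberately sampled minority analyses is needed: aggregate accuracy alone cannot distinguish such a baseline from a model that actually resolves the hard cases.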

What do we really know about State of the Art NER?

A broad evaluation of NER is performed using a popular dataset, taking into consideration the various text genres and sources that constitute the dataset at hand, and some useful reporting practices are recommended that could help NER researchers better understand a SOTA model's performance in the future.



What’s Wrong with Hebrew NLP? And How to Make it Right

The design and use of the ONLP suite, a joint morpho-syntactic infrastructure for processing Modern Hebrew texts, is described; it provides rich and expressive annotations that already serve diverse academic and commercial needs.

Disambiguation by short contexts

This paper describes a technique, disambiguation by short contexts, that is of great help in many text-processing situations, and reports on an experiment recently conducted to test its validity and scope.

A Challenge Set and Methods for Noun-Verb Ambiguity

A new dataset of over 30,000 naturally occurring, non-trivial examples of noun-verb ambiguity is created, yielding a 28% reduction in error over the prior best learned model for homograph disambiguation for text-to-speech synthesis.

Noun Homograph Disambiguation Using Local Context in Large Text Corpora

An accurate, relatively inexpensive method for the disambiguation of noun homographs in large text corpora is presented, drawing on both machine-readable dictionaries and unrestricted text; the use of training instances is determined to be a crucial difference.
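The core idea of local-context homograph disambiguation can be sketched in a few lines: count the context words observed with each sense in training instances, then assign a new occurrence to the sense whose context profile overlaps most. The tiny hand-made training set below (the English homograph "bass") is invented for illustration; the method described above operates on large corpora and machine-readable dictionaries.

```python
from collections import Counter, defaultdict

# Invented toy training instances: the homograph "bass" with two senses.
TRAIN = [
    ("he caught a huge bass in the lake", "fish"),
    ("the bass swam near the river bank", "fish"),
    ("she played a bass line on her guitar", "music"),
    ("the bass drum set the rhythm of the song", "music"),
]

def context(words, target="bass", window=3):
    # Local context: up to `window` words on each side of the target.
    i = words.index(target)
    return words[max(0, i - window):i] + words[i + 1:i + 1 + window]

# Count context words seen with each sense in the training instances.
profiles = defaultdict(Counter)
for sentence, sense in TRAIN:
    profiles[sense].update(context(sentence.split()))

def disambiguate(sentence):
    ctx = context(sentence.split())
    # Pick the sense whose training contexts overlap most with this one.
    return max(profiles, key=lambda s: sum(profiles[s][w] for w in ctx))

print(disambiguate("a bass swam in the lake"))
print(disambiguate("she plays bass guitar on stage"))
```

A real system would weight context words (e.g. filter stopwords, use mutual information) rather than raw counts, but the local-context intuition is the same.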

MADA+TOKAN: A Toolkit for Arabic Tokenization, Diacritization, Morphological Disambiguation, POS Tagging, Stemming and Lemmatization

We describe the MADA+TOKAN toolkit, a versatile and freely available system that can derive extensive morphological and contextual information from raw Arabic text, and then use this information for tokenization, diacritization, morphological disambiguation, POS tagging, stemming, and lemmatization.

Don’t Forget the Long Tail! A Comprehensive Analysis of Morphological Generalization in Bilingual Lexicon Induction

It is demonstrated that the performance of state-of-the-art models drops considerably when evaluated on infrequent morphological inflections, and that adding a simple morphological constraint at training time improves performance, showing that bilingual lexicon inducers can benefit from better encoding of morphology.

Supertagging: An Approach to Almost Parsing

Novel methods for robust parsing that integrate the flexibility of linguistically motivated lexical descriptions with the robustness of statistical techniques are proposed.

A Fully Unsupervised Word Sense Disambiguation Method Using Dependency Knowledge

This paper presents a fully unsupervised word sense disambiguation method that requires only a dictionary and unannotated text as input. It overcomes the brittleness suffered by many existing methods and makes broad-coverage word sense disambiguation feasible in practice.

Handling Homographs in Neural Machine Translation

Empirical evidence is provided that existing NMT systems still have significant problems properly translating ambiguous words, and methods are described that model the context of the input word with context-aware word embeddings, helping to differentiate the word sense before it is fed into the encoder.

Getting the ##life out of living: How Adequate Are Word-Pieces for Modelling Complex Morphology?

The results show that, while models trained to predict multi-tags for complete words outperform models tuned to predict the distinct tags of WPs, WP tag prediction can be improved by purposefully constraining the word-pieces to reflect their internal functions.
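The segmentation scheme behind titles like "##life out of living" can be illustrated with a minimal greedy longest-match word-piece tokenizer: continuation pieces carry a "##" marker, and each step takes the longest vocabulary entry matching at the current position. The tiny vocabulary below is invented for illustration, not taken from any trained tokenizer.

```python
# Invented toy word-piece vocabulary; "##" marks word-internal pieces.
VOCAB = {"liv", "living", "un", "help", "li", "##ing", "##ful", "##ving"}

def wordpiece(word, vocab=VOCAB):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Greedily take the longest vocabulary entry matching here.
        while end > start:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand  # continuation pieces get the marker
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches: whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

print(wordpiece("living"))   # whole word is in the vocabulary
print(wordpiece("helpful"))  # split into a stem piece and a ## piece
```

The point the paper probes is visible even here: the resulting pieces ("help" + "##ful") need not align with morpheme boundaries or carry the word's morphological tags, which is why constraining pieces to reflect internal functions helps.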