A reproduction of Apple’s bi-directional LSTM models for language identification in short strings

@article{Toftrup2021ARO,
  title={A reproduction of Apple’s bi-directional LSTM models for language identification in short strings},
  author={Mads Bech Toftrup and Søren Asger Sørensen and Manuel R. Ciosici and Ira Assent},
  journal={ArXiv},
  year={2021},
  volume={abs/2102.06282}
}
Language Identification is the task of identifying a document’s language. For applications like automatic spell checker selection, language identification must use very short strings such as text message fragments. In this work, we reproduce a language identification architecture that Apple briefly sketched in a blog post. We confirm the bi-LSTM model’s performance and find that it outperforms current open-source language identifiers. We further find that its language identification mistakes… 
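As a rough illustration of the architecture being reproduced, the sketch below embeds a short string as a sequence of character IDs, runs it through a bi-directional LSTM, and projects the concatenated final hidden states onto per-language logits. The class name, layer sizes, and vocabulary handling are illustrative assumptions and do not reflect the authors' exact configuration.

```python
# Minimal character-level bi-LSTM language identifier (illustrative sketch).
import torch
import torch.nn as nn

class BiLSTMLanguageID(nn.Module):
    def __init__(self, vocab_size: int, num_languages: int,
                 embed_dim: int = 150, hidden_dim: int = 150):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_languages)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (batch, seq_len) integer-encoded characters
        embedded = self.embed(char_ids)
        _, (hidden, _) = self.lstm(embedded)
        # Concatenate the final forward and backward hidden states.
        features = torch.cat([hidden[0], hidden[1]], dim=-1)
        return self.classifier(features)  # (batch, num_languages) logits

# Toy usage over a hypothetical 256-symbol character inventory and 10 languages.
model = BiLSTMLanguageID(vocab_size=256, num_languages=10)
batch = torch.randint(1, 256, (2, 12))  # two short, randomly encoded strings
print(model(batch).argmax(dim=-1))
```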

Citations

Language Identification using Ensembled Deep Neural Networks

This work examines an integrated ensemble of bidirectional LSTMs with varying feature extraction techniques for language identification on a monolingual benchmark dataset.
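The combination step can be pictured as simple soft voting over the members' predicted distributions; the snippet below is a generic sketch with dummy linear members, not the cited paper's exact ensembling scheme.

```python
# Generic soft-voting ensemble: average the members' softmax distributions.
import torch
import torch.nn as nn

def ensemble_predict(models, features: torch.Tensor) -> torch.Tensor:
    """Average the members' softmax outputs and return the argmax label."""
    with torch.no_grad():
        probs = torch.stack([m(features).softmax(dim=-1) for m in models])
    return probs.mean(dim=0).argmax(dim=-1)

# Toy demo with three dummy members over 10 candidate languages.
members = [nn.Linear(32, 10) for _ in range(3)]
print(ensemble_predict(members, torch.randn(2, 32)))
```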

Language Models for Code-switch Detection of te reo Māori and English in a Low-resource Setting

It is shown that a BiLSTM with bilingual sub-word embeddings outperforms large-scale contextual language models such as BERT on downstream tasks of detecting the Māori language.

Adapter-based fine-tuning of pre-trained multilingual language models for code-mixed and code-switched text classification

This paper explores adapter-based fine-tuning of PMLMs for CMCS text classification and presents a newly annotated dataset for the classification of Sinhala–English code-mixed and code-switched text data, where Sinhala is a low-resourced language.
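Adapter-based fine-tuning typically inserts small bottleneck modules into a frozen pre-trained model and trains only those modules; the sketch below shows one such block with illustrative dimensions, not the paper's exact setup.

```python
# Minimal bottleneck adapter block (down-project, nonlinearity, up-project, residual).
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.ReLU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen model's representation intact.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

adapter = Adapter()
print(adapter(torch.randn(2, 16, 768)).shape)  # torch.Size([2, 16, 768])
```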

Discriminating Between Similar Nordic Languages

This paper presents a machine learning approach for automatic language identification for the Nordic languages, which often suffer miscategorisation by existing state-of-the-art tools.

Query Language Identification with Weak Supervision and Noisy Label Pruning

This work proposes a learning framework that combines weak supervision with noisy label pruning, using Convolutional Neural Network (CNN) based models to carry out the combination.
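One common pruning heuristic, shown here as a generic sketch rather than the cited paper's procedure, is to drop training examples where a preliminary model confidently disagrees with the weak (noisy) label.

```python
# Keep an example unless the model is confident and contradicts the weak label.
import torch

def prune_noisy_labels(logits: torch.Tensor, weak_labels: torch.Tensor,
                       confidence: float = 0.9) -> torch.Tensor:
    """Return a boolean mask over examples: True = keep, False = prune."""
    probs = logits.softmax(dim=-1)
    top_prob, top_label = probs.max(dim=-1)
    confident_disagreement = (top_prob > confidence) & (top_label != weak_labels)
    return ~confident_disagreement

logits = torch.tensor([[4.0, 0.1], [0.2, 0.1], [0.1, 5.0]])
weak = torch.tensor([1, 1, 1])
print(prune_noisy_labels(logits, weak))  # tensor([False,  True,  True])
```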

HeLI-OTS, Off-the-shelf Language Identifier for Text

This paper introduces HeLI-OTS, an off-the-shelf text language identification tool using the HeLI language identification method, and compares its performance with that of fastText on two different data sets, showing that fastText favors the recall of common languages, whereas HeLI-OTS reaches both high recall and high precision for all languages.

Modernizing Open-Set Speech Language Identification

This work tackles the open-set task by adapting two modern-day state-of-the-art approaches to closed-set language identification: the first using a CRNN with attention and the second using a TDNN.

References

Showing 1-10 of 17 references

LanideNN: Multilingual Language Identification on Character Window

This work proposes a method for textual language identification where languages can change arbitrarily and the goal is to identify the spans of each of the languages.

Automatic Language Identification in Texts: A Survey

This survey introduces a unified notation for evaluation methods, covers applications as well as off-the-shelf LI systems that do not require training by the end user, and proposes future directions for research in LI.

Automatic Language Identification for Romance Languages Using Stop Words and Diacritics

This paper presents a statistical method for automatic language identification of written text using dictionaries containing stop words and diacritics, and proposes different approaches that combine the two dictionaries to accurately determine the language of textual corpora.
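A toy version of the dictionary idea: score each candidate language by how many tokens appear in its stop-word list and how many characters carry its diacritics, then pick the best-scoring language. The word lists below are tiny illustrative samples, not the paper's dictionaries.

```python
# Dictionary-based scoring with stop words and diacritics (toy data).
STOP_WORDS = {
    "es": {"el", "la", "de", "que", "y"},
    "fr": {"le", "la", "de", "et", "que"},
    "ro": {"și", "de", "la", "un", "cu"},
}
DIACRITICS = {
    "es": set("áéíóúñ"),
    "fr": set("àâçéèêëîïôûù"),
    "ro": set("ăâîșț"),
}

def identify(text: str) -> str:
    tokens = text.lower().split()
    scores = {}
    for lang in STOP_WORDS:
        stop_hits = sum(tok in STOP_WORDS[lang] for tok in tokens)
        diacritic_hits = sum(ch in DIACRITICS[lang] for ch in text.lower())
        scores[lang] = stop_hits + diacritic_hits
    return max(scores, key=scores.get)

print(identify("la casa de la señora"))  # es
```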

Natural Language Identification using Corpus-Based Models

Three approaches to the task of automatically identifying the language a text is written in are described and experiments are conducted to compare the success of each approach in identifying languages from a set of texts.

langid.py: An Off-the-shelf Language Identification Tool

It is found that langid.py maintains consistently high accuracy across all domains, making it ideal for end users who need language identification but do not want to invest in preparing in-domain training data.
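For reference, typical off-the-shelf usage of langid.py (installed via pip install langid); no training is required, and the candidate label set can optionally be restricted.

```python
import langid

# classify() returns a (language code, score) pair.
print(langid.classify("Dette er en kort besked"))

# Optionally restrict predictions to a known subset of languages.
langid.set_languages(["da", "no", "sv"])
print(langid.classify("Dette er en kort besked"))
```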

Language Identification from Text Using N-gram Based Cumulative Frequency Addition

The preliminary results of an efficient language classifier using an ad-hoc Cumulative Frequency Addition of N-grams are described, which is simpler than the conventional Naive Bayesian classification method but performs similarly in speed overall and better in accuracy on short input strings.
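A compact sketch of the cumulative frequency addition idea: build normalized n-gram frequency profiles per language, sum each profile's frequencies over the input's n-grams, and pick the largest total. The training strings below are toy stand-ins for real corpora.

```python
# N-gram profiles with cumulative frequency addition (toy training data).
from collections import Counter

def ngrams(text: str, n: int = 3):
    text = f" {text.lower()} "
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_profile(corpus: str, n: int = 3):
    counts = Counter(ngrams(corpus, n))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}

PROFILES = {
    "en": build_profile("the quick brown fox jumps over the lazy dog"),
    "de": build_profile("der schnelle braune fuchs springt über den faulen hund"),
}

def identify(text: str) -> str:
    grams = ngrams(text)
    scores = {lang: sum(profile.get(g, 0.0) for g in grams)
              for lang, profile in PROFILES.items()}
    return max(scores, key=scores.get)

print(identify("the dog"))   # en
print(identify("der hund"))  # de
```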

Microblogs as Parallel Corpora

This work extracts over 1M Chinese-English parallel segments from Sina Weibo (the Chinese counterpart of Twitter) using only its public APIs, and the automatically extracted parallel data yields substantial translation quality improvements when translating microblog text and modest improvements when translating edited news commentary.

Enriching Word Vectors with Subword Information

A new approach based on the skipgram model, in which each word is represented as a bag of character n-grams and the word vector is the sum of these n-gram representations, achieving state-of-the-art performance on word similarity and analogy tasks.
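The subword idea can be pictured as follows: a word vector is the sum of the vectors of its character n-grams (with boundary markers), so unseen words still receive representations. The hashing and dimensions below are simplified relative to the actual fastText implementation.

```python
# Word vectors as sums of character n-gram vectors (simplified illustration).
import numpy as np

DIM, BUCKETS = 8, 2000
rng = np.random.default_rng(0)
ngram_vectors = rng.normal(size=(BUCKETS, DIM))  # stand-in for learned vectors

def char_ngrams(word: str, n_min: int = 3, n_max: int = 6):
    marked = f"<{word}>"
    return [marked[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]

def word_vector(word: str) -> np.ndarray:
    return sum(ngram_vectors[hash(g) % BUCKETS] for g in char_ngrams(word))

print(char_ngrams("where")[:4])    # ['<wh', 'whe', 'her', 'ere']
print(word_vector("where").shape)  # (8,)
```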

Discriminating Between Similar Nordic Languages

This paper presents a machine learning approach for automatic language identification for the Nordic languages, which often suffer miscategorisation by existing state-of-the-art tools.

Cross-domain Feature Selection for Language Identification

We show that transductive (cross-domain) learning is an important consideration in building a general-purpose language identification system, and develop a feature selection method that generalizes across domains.