Automatic Detection and Language Identification of Multilingual Documents

Marco Lui, Jey Han Lau, Timothy Baldwin
Transactions of the Association for Computational Linguistics
Language identification is the task of automatically detecting the language(s) present in a document based on the content of the document. In this work, we address the problem of detecting documents that contain text from more than one language (multilingual documents). We introduce a method that is able to detect that a document is multilingual, identify the languages present, and estimate their relative proportions. We demonstrate the effectiveness of our method over synthetic data, as well… 
Automatic Language Identification from Written Texts - An Overview
A brief overview of the challenges involved in automatic language identification, existing methodologies and some of the tools available for language identification is presented.
Language Set Identification in Noisy Synthetic Multilingual Documents
This paper applies a previously developed language identifier for monolingual documents to the multilingual documents of the WikipediaMulti dataset published in a recent study, and outperforms previous methods tested on the same data.
Automatic Detection of Multilingual Dictionaries on the Web
This paper presents an approach to query construction to detect multilingual dictionaries for predetermined language combinations on the web, based on the identification of terms which are likely to
Evaluation of language identification methods using 285 languages
This paper evaluates seven language identification methods in tests covering 285 languages with an out-of-domain test set, and shows that a method that performs well with a small number of languages does not necessarily scale to a large number of languages.
Language Identification for Multilingual Machine Translation
  • A. Babhulgaonkar, S. Sonavane
  • Computer Science, Linguistics
    2020 International Conference on Communication and Signal Processing (ICCSP)
  • 2020
N-gram based and machine learning based language identifiers are trained and used to identify three Indian languages present in a document given for machine translation. The support vector machine based language identifier is observed to be more accurate than any other technique, achieving 89% accuracy, which is 18% more than the traditional n-gram based approach.
Automatic Language Identification in Texts: A Survey
A unified notation is introduced for evaluation methods, applications, as well as off-the-shelf LI systems that do not require training by the end user, to propose future directions for research in LI.
Language identification in texts
This work investigates the task of identifying the language of digitally encoded text by taking a detailed look at the research so far conducted in the field and presenting the methods for language identification developed while participating in shared tasks from 2015 to 2017.
LanideNN: Multilingual Language Identification on Character Window
This work proposes a method for textual language identification where languages can change arbitrarily and the goal is to identify the spans of each of the languages.
Language Independent and Multilingual Language Identification using Infinity Ngram Approach
An approach is proposed that can resolve unsolved issues of language identification without a language barrier, and that is capable of identifying the language of a text at any text unit in both monolingual and multilingual settings.
Language Lexicons for Hindi-English Multilingual Text Processing
Language lexicons are proposed, a novel kind of lexical database that augments several bilingual language processing tasks and possesses condensed quantitative characteristics that reflect its linguistic strength with respect to the Hindi and English languages.


Linguini: language identification for multilingual documents
  • J. Prager
  • Computer Science
    Proceedings of the 32nd Annual Hawaii International Conference on Systems Sciences. 1999. HICSS-32. Abstracts and CD-ROM of Full Papers
  • 1999
Linguini could identify the language of documents as short as 5-10% of the size of average Web documents with 100% accuracy, and can be applied to subject categorization systems to distinguish between cases where, when the system recommends two or more categories, the document belongs strongly to all or really to none.
Language Identification: The Long and the Short of the Matter
It is demonstrated that the task becomes increasingly difficult as the authors increase the number of languages, reduce the amount of training data and reduce the length of documents, and it is shown that it is possible to perform language identification without having to perform explicit character encoding detection.
Language Identification on the Web: Extending the Dictionary Method
A new method is proposed and evaluated, which constructs language models based on word relevance and addresses the limitations of existing approaches when applied to real-world web pages.
Language identification in web pages
The language "guessing" software uses a well-known n-gram based algorithm, complemented with heuristics and a new similarity measure, and achieves very high accuracy in discriminating different languages on Web pages.
Mining the Web for Bilingual Text
The preliminary STRAND results are extended by adding automatic language identification, scaling up by orders of magnitude, and formally evaluating performance.
Text Segmentation by Language Using Minimum Description Length
The problem addressed in this paper is to segment a given multilingual document into segments for each language and then identify the language of each segment through dynamic programming.
Automatic language identification of written texts
Efficient and easily extensible solutions to the problem of identifying the language of written texts based on closed grammatical classes are described.
Language Identification Strategies for Cross Language Information Retrieval
This work experimented with the identification of the natural language used in the queries of the European Library (TEL) logs by combining together different strategies: corpus based, character model based and a priori hypothesis.
Language Identification of Search Engine Queries
This work proposes a method to automatically generate a data set, which uses click-through logs of the Yahoo! Search Engine to derive the language of a query indirectly from the language of the documents clicked by the users, and uses this data set to train two decision tree classifiers.
Cross-language information retrieval based on parallel texts and automatic mining of parallel texts from the Web
It is shown that, using a probabilistic model, it is possible to obtain performance close to that of an MT system, and the possibility of automatically gathering parallel texts from the Web in an attempt to construct a reasonable training corpus is investigated.