• Corpus ID: 1428364

Reconsidering Language Identification for Written Language Resources

@inproceedings{Hughes2006ReconsideringLI,
  title={Reconsidering Language Identification for Written Language Resources},
  author={Baden Hughes and Timothy Baldwin and Steven Bird and Jeremy Nicholson and Andrew D. MacKinlay},
  booktitle={LREC},
  year={2006}
}
The task of identifying the language in which a given document (ranging from a sentence to thousands of pages) is written has been relatively well studied over several decades. Automated approaches to written language identification are used widely throughout research and industrial contexts, over both oral and written source materials. Despite this widespread acceptance, a review of previous research in written language identification reveals a number of questions which remain open and ripe for…
Language Identification: The Long and the Short of the Matter
TLDR
It is demonstrated that the task becomes increasingly difficult as the authors increase the number of languages, reduce the amount of training data and reduce the length of documents, and it is shown that it is possible to perform language identification without having to perform explicit character encoding detection.
Generalized language identification
TLDR
This thesis argues that a characterization of language identification as a supervised machine learning problem is inadequate, and develops a method that allows for language identification of multilingual documents, i.e. documents that contain text in more than one language.
Language ID for a Thousand Languages
TLDR
It is demonstrated that a coreference approach to the language ID task significantly outperforms existing algorithms as it provides an elegant solution to the unseen language problem.
Language Set Identification in Noisy Synthetic Multilingual Documents
TLDR
This paper uses a previously developed language identifier for monolingual documents with the multilingual documents from the WikipediaMulti dataset published in a recent study, and outperforms previous methods tested with the same data.
Automatic Language Identification in Texts: A Survey
TLDR
A unified notation is introduced for evaluation methods, applications, as well as off-the-shelf LI systems that do not require training by the end user, to propose future directions for research in LI.
Language Identification for Text Chats
TLDR
This work aims to classify the language of typed messages in a text chat system used by language learners from unlabeled data and obtains over 95% accuracy on the classification of messages that are unambiguously in one language.
Language identification in texts
TLDR
This work investigates the task of identifying the language of digitally encoded text by taking a detailed look at the research so far conducted in the field and presenting the methods for language identification developed while participating in shared tasks from 2015 to 2017.
Addressing challenges in automatic Language Identification of Romanized Text
TLDR
A Romanized text language identification system (RoLI) that uses an n-gram based approach and also exploits sound based similarity of words that achieves an average accuracy of 98.3%, despite the spelling variations as well as sound variations in Indian languages.
Automatic Detection and Language Identification of Multilingual Documents
TLDR
This work introduces a method that is able to detect that a document is multilingual, identify the languages present, and estimate their relative proportions.
LanideNN: Multilingual Language Identification on Character Window
TLDR
This work proposes a method for textual language identification where languages can change arbitrarily and the goal is to identify the spans of each of the languages.

References

SHOWING 1-10 OF 24 REFERENCES
Automatic language identification of written texts
TLDR
Efficient and easily extensible solutions to the problem of identifying the language of written texts based on closed grammatical classes are described.
Language Identifier: A Computer Program for Automatic Natural-Language Identification of On-line Tex
TLDR
A generalized approach to language identification of on-line text based on techniques of cryptanalysis is outlined, and the results are promising.
Language Determination: Natural Language Processing from Scanned Document Images
TLDR
This paper describes a method for converting a document image into character shape codes and word shape tokens, which it is shown is sufficient for determining which of 23 languages the document is written in, using only a small number of features.
Character N-Gram Tokenization for European Language Text Retrieval
TLDR
It is demonstrated empirically how overlapping character n-gram tokenization can provide retrieval accuracy that rivals the best current language-specific approaches for European languages and is a good choice for those languages, and the increased storage and time requirements of the technique.
Gauging Similarity with n-Grams: Language-Independent Categorization of Text
TLDR
A language-independent means of gauging topical similarity in unrestricted text by combining information derived from n-grams with a simple vector-space technique that makes sorting, categorization, and retrieval feasible in a large multilingual collection of documents.
Language identification based on string kernels
TLDR
This paper provides empirical evidence that applying the string kernels to the language identification problem yields an impressive performance using two different kernel classifiers: the kernelized version of the centroid-based method and the support vector machines.
Applying Monte Carlo Techniques to Language Identification
TLDR
A new language identification technique based on Monte Carlo sampling is introduced that, by determining the language of a large enough number of random features, can determine the document language to be the language which results most often from these features.
N-gram-based text categorization
TLDR
An N-gram-based approach to text categorization that is tolerant of textual errors is described, which worked very well for language classification and worked reasonably well for classifying articles from a number of different computer-oriented newsgroups according to subject.
Language Identification With Confidence Limits
TLDR
The results show that some of the problems of other language identification techniques can be avoided, and illustrate a more important point: that a statistical language process can be used to provide feedback about its own success rate.
The OGI multi-language telephone speech corpus
TLDR
The recording protocol, data collection procedure, ongoing corpus development, preliminary results of the statistical evaluation of the 10 languages, and plans to provide orthographic transcriptions of the speech are described.