Language Identification from Text Using N-gram Based Cumulative Frequency Addition
@inproceedings{Ahmed2004LanguageIF,
  title={Language Identification from Text Using N-gram Based Cumulative Frequency Addition},
  author={Bashir Ahmed and Sung-Hyuk Cha and Charles C. Tappert},
  year={2004}
}
This paper describes the preliminary results of an efficient language classifier using an ad-hoc Cumulative Frequency Addition of N-grams. The new classification technique is simpler than the conventional Naive Bayesian classification method, but it performs similarly in speed overall and better in accuracy on short input strings. The classifier is also 5-10 times faster than N-gram based rank-order statistical classifiers. Language classification using N-gram based rank-order statistics has…
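The abstract names the technique but does not spell out the scoring rule, so the following is only a minimal sketch of one plausible reading. It assumes that each language profile stores the relative frequency of every character n-gram seen in training, and that a test string is scored by adding up the profile frequencies of its own n-grams, with the highest sum winning. The function names, the n-gram range of 1 to 3, the space padding, and the toy training sentences are all assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Return all character n-grams of the lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def train_profile(text):
    """Map each n-gram to its relative frequency in the training text."""
    counts = Counter(char_ngrams(text))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def cfa_classify(text, profiles):
    """Score each language by summing the profile frequencies of the
    n-grams in `text` (unseen n-grams add 0) and return the best label
    together with all scores."""
    grams = char_ngrams(text)
    scores = {
        lang: sum(profile.get(gram, 0.0) for gram in grams)
        for lang, profile in profiles.items()
    }
    return max(scores, key=scores.get), scores


# Toy training data (hypothetical; a real system would use large corpora).
profiles = {
    "en": train_profile("the quick brown fox jumps over the lazy dog and the cat"),
    "es": train_profile("el rápido zorro marrón salta sobre el perro perezoso y el gato"),
}

label, scores = cfa_classify("the dog and the cat", profiles)
print(label, scores)
```

Under this reading, the contrast with a Naive Bayesian classifier is that the per-n-gram frequencies are added rather than multiplied, so an n-gram missing from a profile contributes nothing instead of forcing the whole score to zero, and no smoothing step is needed.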
56 Citations
Comparing Neural Network Approach With N-Gram Approach For Text Categorization
- Computer Science
- 2010
It is demonstrated that the identification rate of neural networks is similar to that of the corresponding N-gram approach but with much less judging time, and that classification speed is a crucial factor for a classifier in a high-volume categorization environment.
The textcat Package for n-Gram Based Text Categorization in R
- Computer Science
- 2013
A multi-lingual corpus obtained from the Wikipedia pages available on a selection of topics is used to illustrate the functionality of the R extension package textcat for n-gram based text categorization and the performance of the provided language identification methods.
Selecting and Weighting N-Grams to Identify 1100 Languages
- Computer Science
- TSD
- 2013
This paper presents a language identification algorithm using cosine similarity against a filtered and weighted subset of the most frequent n-grams in training data with optional inter-string score…
Index-based n-gram extraction from large document collections
- Computer Science
- 2011 Sixth International Conference on Digital Information Management
- 2011
An index-based method for n-gram extraction from large collections, using common data structures such as the B+-tree and hash table, is presented, and its scalability is demonstrated through experiments on gigabyte-sized collections.
Comparison of Language Identification Techniques
- Computer Science
- 2015
The results of the present work show that, for all datasets used, the frequent words approach outperforms the short words approach and works better with the cumulative frequency addition classifier than with other classifiers.
Text-based language identification for the South African languages
- Computer Science
- 2008
We investigate the performance of text-based language identification systems on the 11 official languages of South Africa, when n-gram statistics are used as features for classification. In…
Automatic Language Identification in Texts: A Survey
- Computer Science
- J. Artif. Intell. Res.
- 2019
A unified notation is introduced for evaluation methods, applications, as well as off-the-shelf LI systems that do not require training by the end user, to propose future directions for research in LI.
On Frequency-Based Approaches to Learning Stopwords and the Reliability of Existing Resources - A Study on Italian Language
- Computer Science
- IRCDL
- 2018
This paper proposes an experimental study on the frequency behavior of stopwords, aimed at providing useful information for the development of automatic techniques for the compilation of stopword lists from a corpus of documents.
Language Identification Strategies for Cross Language Information Retrieval
- Computer Science
- CLEF
- 2010
This work experimented with identifying the natural language used in the queries of the European Library (TEL) logs by combining different strategies: corpus based, character model based, and a priori hypothesis.
N-gram based algorithm for distinguishing between Hindi and Sanskrit texts
- Computer Science, Linguistics
- 2013 Fourth International Conference on Computing, Communications and Networking Technologies (ICCCNT)
- 2013
This paper presents an N-gram based method of language identification for documents written in Hindi and Sanskrit, which share the same script, and reports the results.
References
N-gram-based text categorization
- Computer Science
- 1994
An N-gram-based approach to text categorization that is tolerant of textual errors is described, which worked very well for language classification and worked reasonably well for classifying articles from a number of different computer-oriented newsgroups according to subject.
High-quality text-to-speech synthesis : an overview
- Computer Science
- 2004
This paper tries to give a comprehensive introduction to state-of-the-art Text-To-Speech (TTS) synthesis by highlighting its Digital Signal Processing (DSP) and Natural Language Processing (NLP)…
Multilingual text analysis for text-to-speech synthesis
- Linguistics
- Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96
- 1996
We present a model of text analysis for text-to-speech (TTS) synthesis based on weighted finite state transducers, which serves as the text-analysis module of the multilingual Bell Labs TTS system.…
Mixed-lingual text analysis for polyglot TTS synthesis
- Computer Science, Linguistics
- INTERSPEECH
- 2003
It is shown how an analyzer for mixed-lingual texts can be realized for a set of languages, starting from a corresponding set of monolingual analyzers which are based on DCGs and chart parsing.
From multilingual to polyglot speech synthesis
- Linguistics
- EUROSPEECH
- 1999
A distinction is proposed between existing multilingual synthesis systems and mixed-lingual or polyglot synthesis systems, which should be capable of synthesising, with the same voice, utterances that contain foreign language words or word groups.
Multilingual Sentence Categorization according to Language
- Computer Science
- ArXiv
- 1995
An approach to sentence categorization is described whose originality lies in being based on natural properties of languages with no training set dependency; it is fast, small, robust, and tolerant of textual errors.
Statistical Identification of Languages
- 1994
N-gram based Text Categorization
- Symposium on Document Analysis and Information Retrieval
- 1994
The Stakes of Multilinguality: Multilingual Text Tokenization in Natural Language Diagnosis
- Proceedings of the 4th International Conference on Artificial Intelligence Workshop
- 1996