• Corpus ID: 17221043

Language Identification from Text Using N-gram Based Cumulative Frequency Addition

Bashir Ahmed, Sung-Hyuk Cha, Charles C. Tappert
This paper describes the preliminary results of an efficient language classifier using an ad-hoc Cumulative Frequency Addition of N-grams. The new classification technique is simpler than the conventional Naive Bayesian classification method, but it performs similarly in speed overall and better in accuracy on short input strings. The classifier is also 5-10 times faster than N-gram based rank-order statistical classifiers. Language classification using N-gram based rank-order statistics has… 
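
The abstract names the technique but does not spell it out here. As a rough illustration only, the sketch below shows one plausible reading of cumulative frequency addition: each language profile maps character n-grams to relative frequencies, and an input string is scored by summing the frequencies of its n-grams under each profile. The function names, the n-gram range (1 to 3), the normalization, and the toy training sentences are all assumptions made for this example and are not taken from the paper.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Yield every character n-gram of length n_min..n_max (assumed range)."""
    text = text.lower()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def train_profile(text):
    """Map each n-gram to its relative frequency in the training text.

    Normalizing by the total n-gram count of one language is an assumption;
    other normalizations are possible.
    """
    counts = Counter(char_ngrams(text))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def classify(text, profiles):
    """Sum the profile frequencies of the input's n-grams per language.

    Returns the best-scoring language and the full score table.
    N-grams missing from a profile contribute 0 to that language's sum.
    """
    scores = {
        lang: sum(profile.get(gram, 0.0) for gram in char_ngrams(text))
        for lang, profile in profiles.items()
    }
    return max(scores, key=scores.get), scores


# Toy training data, far too small for real use (hypothetical).
profiles = {
    "en": train_profile("the quick brown fox jumps over the lazy dog and the cat"),
    "es": train_profile("el rápido zorro marrón salta sobre el perro perezoso y el gato"),
}

best_en, scores_en = classify("the dog", profiles)
best_es, scores_es = classify("el perro", profiles)
print(best_en, best_es)
```

Because the score is a plain sum rather than a product of probabilities, an unseen n-gram simply adds nothing instead of zeroing the score, which is one reason a scheme like this can behave reasonably on short inputs without smoothing.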


Comparing Neural Network Approach With N-Gram Approach For Text Categorization

It is demonstrated that the identification rate of neural networks is similar to that of the corresponding N-gram approach but with much less judging time; classification speed is also a crucial factor for a classifier in a high-volume categorization environment.

The textcat Package for n-Gram Based Text Categorization in R

A multi-lingual corpus obtained from the Wikipedia pages available on a selection of topics is used to illustrate the functionality of the R extension package textcat for n-gram based text categorization and the performance of the provided language identification methods.

Selecting and Weighting N-Grams to Identify 1100 Languages

This paper presents a language identification algorithm using cosine similarity against a filtered and weighted subset of the most frequent n-grams in training data with optional inter-string score

Index-based n-gram extraction from large document collections

An index-based method for n-gram extraction from large collections, using common data structures such as the B+-tree and the hash table, is presented, and its scalability is demonstrated through experiments on a gigabyte-scale collection.

Comparison of Language Identification Techniques

The results of the present work show that, for all datasets used, the frequent words approach outperforms the short words approach and works better with the cumulative frequency addition classifier than with other classifiers.

Text-based language identification for the South African languages

We investigate the performance of text-based language identification systems on the 11 official languages of South Africa, when n-gram statistics are used as features for classification. In

Automatic Language Identification in Texts: A Survey

A unified notation is introduced for evaluation methods, applications, as well as off-the-shelf LI systems that do not require training by the end user, to propose future directions for research in LI.

On Frequency-Based Approaches to Learning Stopwords and the Reliability of Existing Resources - A Study on Italian Language

This paper proposes an experimental study on the frequency behavior of stopwords, aimed at providing useful information for the development of automatic techniques for the compilation of stopword lists from a corpus of documents.

Language Identification Strategies for Cross Language Information Retrieval

This work experimented with identifying the natural language used in the queries of the European Library (TEL) logs by combining different strategies: corpus based, character model based and a priori hypothesis.

N-gram based algorithm for distinguishing between Hindi and Sanskrit texts

  • C. Sreejith, M. Indu, P. Raj
  • Computer Science, Linguistics
    2013 Fourth International Conference on Computing, Communications and Networking Technologies (ICCCNT)
  • 2013
This paper presents an N-gram based method of language identification for documents written in Hindi and Sanskrit, which share the same script, and reports the results.



N-gram-based text categorization

An N-gram-based approach to text categorization that is tolerant of textual errors is described, which worked very well for language classification and worked reasonably well for classifying articles from a number of different computer-oriented newsgroups according to subject.

High-quality text-to-speech synthesis : an overview

This paper tries to give a comprehensive introduction to state-of-the-art Text-To-Speech (TTS) synthesis by highlighting its Digital Signal Processing (DSP) and Natural Language Processing (NLP)

Multilingual text analysis for text-to-speech synthesis

  • R. Sproat
  • Linguistics
    Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96
  • 1996
We present a model of text analysis for text-to-speech (TTS) synthesis based on weighted finite state transducers, which serves as the text-analysis module of the multilingual Bell Labs TTS system.

Mixed-lingual text analysis for polyglot TTS synthesis

It is shown how an analyzer for mixed-lingual texts can be realized for a set of languages, starting from a corresponding set of monolingual analyzers which are based on DCGs and chart parsing.

From multilingual to polyglot speech synthesis

A distinction between existing multilingual synthesis systems and mixed-lingual or polyglot synthesis systems that should be capable of synthesising with the same voice utterances which contain foreign language words or word groups is proposed.

Multilingual Sentence Categorization according to Language

An approach to sentence categorization is described whose originality lies in being based on natural properties of languages, with no training set dependency; it is fast, small, robust and tolerant of textual errors.

Statistical Identification of Languages

  • 1994

N-gram based Text Categorization, Symposium on Document Analysis and Information Retrieval

  • 1994

The Stakes of Multilinguality: Multilingual Text Tokenization in Natural Language Diagnosis

  • Proceedings of the 4th International Conference on Artificial Intelligence Workshop
  • 1996