• Corpus ID: 17221043

Language Identification from Text Using N-gram Based Cumulative Frequency Addition

Bashir Ahmed, Sung-Hyuk Cha, Charles C. Tappert
This paper describes the preliminary results of an efficient language classifier using an ad-hoc Cumulative Frequency Addition of N-grams. The new classification technique is simpler than the conventional Naive Bayesian classification method, but it performs similarly in speed overall and better in accuracy on short input strings. The classifier is also 5-10 times faster than N-gram based rank-order statistical classifiers. Language classification using N-gram based rank-order statistics has… 
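The abstract is truncated here, so the following is only a minimal sketch of what a cumulative frequency addition classifier might look like, not the authors' implementation. It assumes character n-grams of length 1 to 3, per-language normalization of the n-gram counts, and toy training strings; the function names `char_ngrams`, `train_profile`, and `classify` are hypothetical. Where a Naive Bayesian classifier would multiply per-n-gram probabilities, this sketch simply adds the normalized frequencies of every n-gram found in the input and picks the language with the largest sum.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Return all character n-grams of the space-padded, lowercased text."""
    padded = f" {text.lower()} "
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def train_profile(text):
    """Map each n-gram to its relative frequency in the training text.

    Assumption: frequencies are normalized per language; the paper's
    exact normalization may differ.
    """
    counts = Counter(char_ngrams(text))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def classify(text, profiles):
    """Sum the normalized frequencies of the input's n-grams per language.

    N-grams unseen in a language contribute 0 to that language's score.
    Returns the best-scoring language and the full score table.
    """
    grams = char_ngrams(text)
    scores = {
        lang: sum(profile.get(gram, 0.0) for gram in grams)
        for lang, profile in profiles.items()
    }
    return max(scores, key=scores.get), scores


# Toy training data (illustrative only, far too small for real use).
profiles = {
    "en": train_profile("the quick brown fox jumps over the lazy dog and the cat"),
    "es": train_profile("el rápido zorro marrón salta sobre el perro perezoso y el gato"),
}

best, scores = classify("the dog", profiles)
print(best, scores)
```

Because the score is a plain sum, an n-gram that is missing from a language profile needs no smoothing: it just adds nothing, which is one plausible reason the method is simpler than a product-based Naive Bayesian classifier.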

Comparing Neural Network Approach With N-Gram Approach For Text Categorization

It is demonstrated that the identification rate of neural networks is similar to that of the corresponding N-gram approach but with much less judging time; classification speed is also a crucial factor for a classifier in high-volume categorization environments.

The textcat Package for n-Gram Based Text Categorization in R

A multi-lingual corpus obtained from the Wikipedia pages available on a selection of topics is used to illustrate the functionality of the R extension package textcat for n-gram based text categorization and the performance of the provided language identification methods.

Selecting and Weighting N-Grams to Identify 1100 Languages

This paper presents a language identification algorithm using cosine similarity against a filtered and weighted subset of the most frequent n-grams in training data with optional inter-string score

Index-based n-gram extraction from large document collections

An index-based method for n-gram extraction from large collections, using common data structures such as B+-trees and hash tables, is presented, and its scalability is demonstrated in experiments on a gigabyte-scale collection.

Comparison of Language Identification Techniques

The results of the present work show that, for all datasets used, the frequent words approach outperforms the short words approach and works better with the cumulative frequency addition classifier than with other classifiers.

Language identification in texts

This work investigates the task of identifying the language of digitally encoded text by taking a detailed look at the research so far conducted in the field and presenting the methods for language identification developed while participating in shared tasks from 2015 to 2017.

Text-based language identification for the South African languages

We investigate the performance of text-based language identification systems on the 11 official languages of South Africa, when n-gram statistics are used as features for classification. In

Text Based Language Identification System for Indian Languages Following Devanagiri Script

This paper investigates the performance of statistical measures for text-based language identification, with an emphasis on five languages used in India that are written in the Devanagiri script: Hindi, Sanskrit, Marathi, Nepali and Bhojpuri.

Automatic Language Identification in Texts: A Survey

A unified notation is introduced for evaluation methods, applications, as well as off-the-shelf LI systems that do not require training by the end user, to propose future directions for research in LI.

On Frequency-Based Approaches to Learning Stopwords and the Reliability of Existing Resources - A Study on Italian Language

This paper proposes an experimental study on the frequency behavior of stopwords, aimed at providing useful information for the development of automatic techniques for the compilation of stopword lists from a corpus of documents.



N-gram-based text categorization

An N-gram-based approach to text categorization that is tolerant of textual errors is described, which worked very well for language classification and worked reasonably well for classifying articles from a number of different computer-oriented newsgroups according to subject.

Multilingual Sentence Categorization according to Language

An approach to sentence categorization is described whose originality lies in being based on natural properties of languages, with no dependency on a training set; it is fast, small, robust, and tolerant of textual errors.

High-quality text-to-speech synthesis : an overview

This paper tries to give a comprehensive introduction to state-of-the-art Text-To-Speech (TTS) synthesis by highlighting its Digital Signal Processing (DSP) and Natural Language Processing (NLP)

Multilingual text analysis for text-to-speech synthesis

  • R. Sproat
  • Linguistics
    Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96
  • 1996
We present a model of text analysis for text-to-speech (TTS) synthesis based on weighted finite state transducers, which serves as the text-analysis module of the multilingual Bell Labs TTS system.

Mixed-lingual text analysis for polyglot TTS synthesis

It is shown how an analyzer for mixed-lingual texts can be realized for a set of languages, starting from a corresponding set of monolingual analyzers which are based on DCGs and chart parsing.

From multilingual to polyglot speech synthesis

A distinction between existing multilingual synthesis systems and mixed-lingual or polyglot synthesis systems that should be capable of synthesising with the same voice utterances which contain foreign language words or word groups is proposed.

Statistical Identification of Languages

  • 1994

N-gram based Text Categorization, Symposium on Document Analysis and Information Retrieval

  • 1994

The Stakes of Multilinguality: Multilingual Text Tokenization in Natural Language Diagnosis

  • Proceedings of the 4th International Conference on Artificial Intelligence Workshop
  • 1996