Gauging Similarity with n-Grams: Language-Independent Categorization of Text

@article{Damashek1995GaugingSW,
  title={Gauging Similarity with n-Grams: Language-Independent Categorization of Text},
  author={Marc Damashek},
  journal={Science},
  year={1995},
  volume={267},
  pages={843--848}
}
  • M. Damashek
  • Published 10 February 1995
  • Computer Science
  • Science
A language-independent means of gauging topical similarity in unrestricted text is described. The method combines information derived from n-grams (consecutive sequences of n characters) with a simple vector-space technique that makes sorting, categorization, and retrieval feasible in a large multilingual collection of documents. No prior information about document content or language is required. Context, as it applies to document similarity, can be accommodated by a well-defined procedure… 
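The core of the method, overlapping character n-grams compared with a vector-space cosine measure, can be sketched as follows. This is a minimal illustration only: Damashek's full procedure also subtracts a centroid vector to accommodate context, and the choice of n=4 and the sample strings here are assumptions.

```python
from collections import Counter
from math import sqrt

def ngram_vector(text, n=4):
    """Count the overlapping character n-grams of a string."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc1 = ngram_vector("language-independent categorization of text")
doc2 = ngram_vector("categorizing text independently of language")
doc3 = ngram_vector("n-gram statistics for spelling correction")
# The topically closer pair shares far more 4-grams and scores higher.
print(cosine_similarity(doc1, doc2) > cosine_similarity(doc1, doc3))
```

Because no tokenization or dictionary is involved, the same code applies unchanged to any language, which is the point of the approach.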
Linguini: language identification for multilingual documents
  • J. Prager
  • Computer Science
    Proceedings of the 32nd Annual Hawaii International Conference on Systems Sciences. 1999. HICSS-32. Abstracts and CD-ROM of Full Papers
  • 1999
TLDR
Linguini could identify the language of documents as short as 5-10% of the size of average Web documents with 100% accuracy, and can be applied to subject categorization systems to distinguish between cases where, when the system recommends two or more categories, the document belongs strongly to all or really to none.
Stemming and n-grams in Spanish: an evaluation of their impact on information retrieval
TLDR
Tests carried out on documents in Spanish are described, applying stemming techniques widely used for English as well as n-grams, and the results are compared.
The HAIRCUT information retrieval system
TLDR
Through extensive empirical evaluation on multiple internationally developed test sets, it is demonstrated that the knowledge-light, language-neutral approach used in HAIRCUT can achieve state-of-the-art retrieval performance.
Evaluation of a language identification system for mono- and multilingual text documents
TLDR
It could be shown that n-gram-based approaches outperform word-based algorithms for short texts, while for longer texts the performance is comparable.
A Variant of N-Gram Based Language Classification
TLDR
This work addresses rapid classification of documents using a simple n-gram-based technique, a variant within this family of methods that proves very robust and successful, even for 20-fold classification and even for short text strings.
From Words to Corpora: Recognizing Translation
TLDR
This paper presents a technique for discovering translationally equivalent texts, comprising a matching algorithm applied at two levels of analysis and a well-founded similarity score that adapts to varying levels of multilingual resource availability.
Language and Task Independent Text Categorization with Simple Language Models
TLDR
This work presents a simple method for language independent and task independent text categorization learning, based on character-level n-gram language models, which achieves effective performance across a variety of languages and tasks without requiring feature selection or extensive pre-processing.
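The idea of scoring a document against per-category character-level n-gram language models can be sketched roughly as follows. This is a hedged illustration using add-one smoothing; the paper's actual smoothing scheme and model order may differ, and the trigram order, vocabulary size, and toy training strings are assumptions.

```python
from collections import Counter
from math import log

class CharNgramLM:
    """Add-one-smoothed character n-gram language model (a minimal sketch)."""

    def __init__(self, n=3):
        self.n = n
        self.ngrams = Counter()    # counts of full n-grams
        self.contexts = Counter()  # counts of (n-1)-gram contexts

    def train(self, text):
        padded = " " * (self.n - 1) + text.lower()
        for i in range(len(padded) - self.n + 1):
            gram = padded[i:i + self.n]
            self.ngrams[gram] += 1
            self.contexts[gram[:-1]] += 1

    def log_prob(self, text, vocab_size=256):
        """Smoothed log-likelihood of a string under this model."""
        padded = " " * (self.n - 1) + text.lower()
        total = 0.0
        for i in range(len(padded) - self.n + 1):
            gram = padded[i:i + self.n]
            total += log((self.ngrams[gram] + 1) /
                         (self.contexts[gram[:-1]] + vocab_size))
        return total

def classify(text, models):
    """Pick the category whose model assigns the highest log-likelihood."""
    return max(models, key=lambda label: models[label].log_prob(text))

models = {"english": CharNgramLM(), "spanish": CharNgramLM()}
models["english"].train("the quick brown fox jumps over the lazy dog and then the dog sleeps")
models["spanish"].train("el rapido zorro marron salta sobre el perro perezoso y luego duerme")
print(classify("the fox and the dog", models))
```

Working at the character level means no feature selection, tokenization, or stemming is required, which is what makes the method language- and task-independent.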

References

SHOWING 1-10 OF 41 REFERENCES
Highlights: language- and domain-independent automatic indexing terms for abstracting
TLDR
A method of drawing index terms from text using n-gram counts, achieving a function similar to, but more general than, a stemmer.
N-gram-based text categorization
TLDR
An N-gram-based approach to text categorization that is tolerant of textual errors is described, which worked very well for language classification and worked reasonably well for classifying articles from a number of different computer-oriented newsgroups according to subject.
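The profile-ranking scheme behind this approach can be sketched as follows: build a frequency-ranked n-gram profile for each category, then score a document by the total "out-of-place" rank displacement against each profile. The restriction to 1- to 3-grams, the 300-gram cutoff, and the toy training strings are simplifying assumptions for illustration.

```python
from collections import Counter

def profile(text, max_rank=300):
    """Frequency-ranked profile of character 1- to 3-grams."""
    counts = Counter()
    padded = f" {text.lower()} "
    for n in (1, 2, 3):
        counts.update(padded[i:i + n] for i in range(len(padded) - n + 1))
    ranked = [gram for gram, _ in counts.most_common(max_rank)]
    return {gram: rank for rank, gram in enumerate(ranked)}

def out_of_place(doc_profile, cat_profile, max_rank=300):
    """Sum of rank displacements; n-grams absent from the category
    profile receive the maximum penalty."""
    return sum(abs(rank - cat_profile.get(gram, max_rank))
               for gram, rank in doc_profile.items())

categories = {
    "english": profile("the cat sat on the mat and the dog ran in the park"),
    "spanish": profile("el gato se sento en la alfombra y el perro corrio en el parque"),
}
doc = profile("the dog sat in the park")
print(min(categories, key=lambda c: out_of_place(doc, categories[c])))
```

Because rank order, not raw counts, drives the distance, the measure degrades gracefully on noisy or misspelled text, which is the tolerance to textual errors noted above.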
n-Gram Statistics for Natural Language Understanding and Text Processing
  • C. Suen
  • Linguistics
    IEEE Transactions on Pattern Analysis and Machine Intelligence
  • 1979
TLDR
The positional distributions of n-grams obtained in the present study are discussed and statistical studies on word length and trends of n-gram frequencies versus vocabulary are presented.
Global Text Matching for Information Retrieval
TLDR
An approach is outlined for the retrieval of natural language texts in response to available search requests and for the recognition of content similarities between text excerpts that appears to outperform other currently available methods.
Document Retrieval Experiments Using Indexing Vocabularies of Varying Size. II. Hashing, Truncation, Digram and Trigram Encoding of Index Terms
TLDR
Experiments with the Cranfield test collection show that trigram encoding of words performs noticeably better than the use of digrams; however, use of the least frequent digram in each term produces more acceptable results.
Automatic Analysis, Theme Generation, and Summarization of Machine-Readable Texts
TLDR
Methods are given for determining text themes, traversing texts selectively, and extracting summary statements that reflect text content in arbitrary subject areas in accordance with user needs.
Automatic detection and correction of spelling errors in a large data base
TLDR
The techniques used to detect and correct spelling errors in the data base of Chemical Abstracts Service are described, which achieves a high level of performance using hashing techniques for dictionary look-up and compression.