Gauging Similarity with n-Grams: Language-Independent Categorization of Text
@article{Damashek1995GaugingSW,
title={Gauging Similarity with n-Grams: Language-Independent Categorization of Text},
author={Marc Damashek},
journal={Science},
year={1995},
volume={267},
pages={843 - 848}
}
A language-independent means of gauging topical similarity in unrestricted text is described. The method combines information derived from n-grams (consecutive sequences of n characters) with a simple vector-space technique that makes sorting, categorization, and retrieval feasible in a large multilingual collection of documents. No prior information about document content or language is required. Context, as it applies to document similarity, can be accommodated by a well-defined procedure…
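The combination of character n-grams with a vector-space comparison can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `ngram_profile` and `cosine_similarity`, the default `n = 5`, the lowercasing and whitespace normalization, and the use of plain cosine similarity are all assumptions made for illustration, and the context-handling procedure mentioned in the abstract is not reproduced here.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text, n=5):
    """Return relative frequencies of overlapping character n-grams.

    Lowercasing and collapsing whitespace are assumptions of this sketch,
    not steps taken from the paper.
    """
    cleaned = " ".join(text.lower().split())
    # Slide a window of width n over the text, one character at a time.
    counts = Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))
    total = sum(counts.values())
    if total == 0:
        # Text shorter than n characters yields an empty profile.
        return {}
    return {gram: count / total for gram, count in counts.items()}


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse n-gram frequency vectors."""
    dot = sum(weight * b.get(gram, 0.0) for gram, weight in a.items())
    norm_a = sqrt(sum(weight * weight for weight in a.values()))
    norm_b = sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    doc_a = "The method combines n-grams with a vector-space technique."
    doc_b = "A vector-space technique is combined with character n-grams."
    doc_c = "El método combina n-gramas con una técnica de espacio vectorial."
    profile_a = ngram_profile(doc_a)
    print("a vs b:", cosine_similarity(profile_a, ngram_profile(doc_b)))
    print("a vs c:", cosine_similarity(profile_a, ngram_profile(doc_c)))
```

Because the profiles are built from raw character sequences, the sketch needs no tokenizer, stemmer, or word list, which is the sense in which such an approach requires no prior information about a document's language.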
615 Citations
Linguini: language identification for multilingual documents
- Computer Science
- Proceedings of the 32nd Annual Hawaii International Conference on Systems Sciences. 1999. HICSS-32. Abstracts and CD-ROM of Full Papers
- 1999
Linguini could identify the language of documents as short as 5-10% of the size of average Web documents with 100% accuracy, and can be applied to subject categorization systems to distinguish between cases where, when the system recommends two or more categories, the document belongs strongly to all or really to none.
Stemming and n-grams in Spanish: an evaluation of their impact on information retrieval
- Computer Science
- J. Inf. Sci.
- 2000
A description is given of tests carried out for documents in Spanish, which involved some stemming techniques widely used in English, as well as the application of n-grams, and the results are compared.
The HAIRCUT information retrieval system
- Computer Science
- 2005
Through extensive empirical evaluation on multiple internationally developed test sets, it is demonstrated that the knowledge-light, language-neutral approach used in HAIRCUT can achieve state-of-the-art retrieval performance.
Evaluation of a language identification system for mono- and multilingual text documents
- Computer Science
- SAC
- 2006
It is shown that n-gram-based approaches outperform word-based algorithms for short texts, while for longer texts the performance is comparable.
A Variant of N-Gram Based Language Classification
- Computer Science
- AI*IA
- 2007
This work addresses the problem of rapid document classification with a simple n-gram-based technique, a variation on other techniques of this family, which is very robust and successful, even for 20-fold classification and even for short text strings.
From Words to Corpora: Recognizing Translation
- Computer Science
- EMNLP
- 2002
This paper presents a technique for discovering translationally equivalent texts. It applies a matching algorithm at two different levels of analysis and uses a well-founded similarity score that is adaptable to varying levels of multilingual resource availability.
On the feasibility of character n-grams pseudo-translation for Cross-Language Information Retrieval tasks
- Computer Science
- Comput. Speech Lang.
- 2016
Which Granularity to Bootstrap a Multilingual Method of Document Alignment: Character N-grams or Word N-grams?
- Computer Science, Linguistics
- 2013
Language and Task Independent Text Categorization with Simple Language Models
- Computer Science
- NAACL
- 2003
This work presents a simple method for language independent and task independent text categorization learning, based on character-level n-gram language models, which achieves effective performance across a variety of languages and tasks without requiring feature selection or extensive pre-processing.
References
SHOWING 1-10 OF 41 REFERENCES
Highlights: language- and domain-independent automatic indexing terms for abstracting
- Computer Science
- 1995
A method of drawing index terms from text using n‐gram counts, achieving a function similar to, but more general than, a stemmer.
N-gram-based text categorization
- Computer Science
- 1994
An N-gram-based approach to text categorization that is tolerant of textual errors is described, which worked very well for language classification and worked reasonably well for classifying articles from a number of different computer-oriented newsgroups according to subject.
n-Gram Statistics for Natural Language Understanding and Text Processing
- Linguistics
- IEEE Transactions on Pattern Analysis and Machine Intelligence
- 1979
The positional distributions of n-grams obtained in the present study are discussed and statistical studies on word length and trends of n-gram frequencies versus vocabulary are presented.
Global Text Matching for Information Retrieval
- Computer Science
- Science
- 1991
An approach is outlined for the retrieval of natural language texts in response to available search requests and for the recognition of content similarities between text excerpts that appears to outperform other currently available methods.
Document Retrieval Experiments Using Indexing Vocabularies of Varying Size. II. Hashing, Truncation, Digram and Trigram Encoding of Index Terms
- Computer Science
- J. Documentation
- 1979
Experiments with the Cranfield test collection show that trigram encoding of words performs noticeably better than the use of digrams; however, use of the least frequent digram in each term produces more acceptable results.
Automatic Spelling Correction Using a Trigram Similarity Measure
- Education
- Inf. Process. Manag.
- 1983
Automatic Analysis, Theme Generation, and Summarization of Machine-Readable Texts
- Computer Science
- Science
- 1994
Methods are given for determining text themes, traversing texts selectively, and extracting summary statements that reflect text content in arbitrary subject areas in accordance with user needs.
Automatic detection and correction of spelling errors in a large data base
- Computer Science
- J. Am. Soc. Inf. Sci.
- 1980
The techniques used to detect and correct spelling errors in the database of Chemical Abstracts Service are described; the approach achieves a high level of performance using hashing techniques for dictionary look-up and compression.
A re-examination of relevance: toward a dynamic, situational definition
- Computer Science
- Inf. Process. Manag.
- 1990
The generation and use of text fragments for data compression
- Computer Science
- Inf. Process. Manag.
- 1982