Gauging Similarity with n-Grams: Language-Independent Categorization of Text.


A language-independent means of gauging topical similarity in unrestricted text is described. The method combines information derived from n-grams (consecutive sequences of n characters) with a simple vector-space technique that makes sorting, categorization, and retrieval feasible in a large multilingual collection of documents. No prior information about…
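To illustrate the general idea of pairing character n-grams with a vector-space comparison, here is a minimal Python sketch. It is not the paper's exact procedure, which the truncated abstract does not spell out. The function names `ngram_profile` and `cosine_similarity`, the default `n = 5`, the lowercasing and whitespace normalization, and the choice of cosine similarity as the vector-space measure are all assumptions made for this example.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text: str, n: int = 5) -> Counter:
    """Count overlapping character n-grams in `text`.

    Assumption: text is lowercased and runs of whitespace are collapsed
    to one space; the source abstract does not specify any preprocessing.
    """
    cleaned = " ".join(text.lower().split())
    # Slide a window of width n across the string, one character at a time.
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse n-gram count vectors."""
    # Counter returns 0 for missing keys, so unseen n-grams contribute nothing.
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile (text shorter than n) matches nothing
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    doc_a = "The committee approved the new budget for public schools."
    doc_b = "Public school budgets were approved by the committee."
    doc_c = "El comité aprobó el nuevo presupuesto para las escuelas."
    print(cosine_similarity(ngram_profile(doc_a), ngram_profile(doc_b)))
    print(cosine_similarity(ngram_profile(doc_a), ngram_profile(doc_c)))
```

Because the profiles are built from raw character windows, the sketch needs no tokenizer, stemmer, or dictionary, which is what makes this style of comparison usable across languages.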





Semantic Scholar estimates that this publication has 569 citations based on the available data.

