• Corpus ID: 170740

N-gram-based text categorization

@inproceedings{Cavnar1994NgrambasedTC,
  title={N-gram-based text categorization},
  author={William B. Cavnar and John M. Trenkle},
  year={1994}
}
Text categorization is a fundamental task in document processing, allowing the automated handling of enormous streams of documents in electronic form. One difficulty in handling some classes of documents is the presence of different kinds of textual errors, such as spelling and grammatical errors in email, and character recognition errors in documents that come through OCR. Text categorization must work reliably on all input, and thus must tolerate some level of these kinds of problems. We… 

Text Categorization in R: A Reduced N-Gram Approach
TLDR
This contribution shows how to produce language and document profiles using a reduced version of Cavnar and Trenkle’s original algorithm and presents the R package textcat, which lets the user generate language profile databases and document profiles and perform text classification according to both the original and the reduced N-gram approach.
Multilingual Sentence Categorization According to Language
TLDR
An approach to sentence categorization is described whose originality lies in being based on natural properties of languages, with no dependency on a training set; it is fast, small, robust, and tolerant of textual errors.
PAPER ON ALGORITHMS USED FOR TEXT CLASSIFICATION
TLDR
The aim of this paper is to highlight the important algorithms employed in text document classification, while also raising awareness of some of the interesting challenges that remain to be solved.
Text Representation for Automatic Text Categorization
TLDR
Today’s learning-based ATC systems are able to reach nearly human-level performance; the basic idea of the model is to induce an automatic classification function by learning the properties of categories from manually labelled documents, instead of codifying classification rules by hand.
A REVIEW PAPER ON ALGORITHMS USED FOR TEXT CLASSIFICATION
TLDR
The aim of this paper is to highlight the important algorithms employed in text document classification, while also raising awareness of some of the interesting challenges that remain to be solved.
Refinement of Feature Terms and Improvement of Classification Accuracy on Multilingual Text Categorization Using Character N-gram
TLDR
The proposed method is language-independent because, by using character N-grams, it does not depend on grammatical knowledge specific to any one language, and it can classify multilingual text into categories using a single program.
Language Identification from Text Using N-gram Based Cumulative Frequency Addition
TLDR
The preliminary results of an efficient language classifier using an ad-hoc Cumulative Frequency Addition of N-grams are described, which is simpler than the conventional Naive Bayesian classification method but performs similarly in speed overall and better in accuracy on short input strings.
Text Categorization Using n-Gram Based Language Independent Technique
TLDR
Comparisons between results obtained by the presented technique and results obtained by other n-gram based and traditional "bag of words" text categorization techniques demonstrate that this technique is sound and promising.
N-gram Based Text Categorization Method for Improved Data Mining
TLDR
The simple modification is able to improve the performance of Naive Bayes for text classification significantly and it is shown that it can be solved by modeling text data differently using N-Grams.

References

N-Gram-Based Text Filtering For TREC-2
TLDR
An experimental text filtering system that uses N-gram-based matching for document retrieval and routing tasks, pointing the way for several types of enhancements, both for speed and effectiveness.
n-Gram Statistics for Natural Language Understanding and Text Processing
  • C. Suen
  • Linguistics
    IEEE Transactions on Pattern Analysis and Machine Intelligence
  • 1979
TLDR
The positional distributions of n-grams obtained in the present study are discussed and statistical studies on word length and trends of n-gram frequencies versus vocabulary are presented.
Using Superimposed Coding Of N-Gram Lists For Efficient Inexact Matching
TLDR
An extension of the superimposed coding idea which encodes every N-gram with an ensemble of bit vectors in such a way as to yield even greater space savings.
Human Behaviour and the Principle of Least Effort: an Introduction to Human Ecology
Searching for text? Send an N-gram!
Zipf, G. K., Human Behavior and the Principle of Least Effort, an Introduction to Human Ecology, Addison-Wesley
  • 1949