How Short is a Piece of String? : The Impact of Text Length and Text Augmentation on Short-text Classification

Austin Mccartney, Svetlana Hensman, Luca Longo
Recent increases in the use and availability of short messages have created opportunities to harvest vast amounts of information through machine-based classification. However, traditional classification methods have failed to yield accuracies comparable to those achieved on longer texts. Several approaches have previously been employed to extend traditional methods to overcome this problem, including the enhancement of the original texts through the construction of associations with…


Short Text Classification: A Survey
The characteristics of short text and the difficulty of short text classification are discussed, and the existing popular works on short text classifiers and models are introduced, including short text classification using semantic analysis, semi-supervised short text classification, ensemble short text classification, and real-time classification.
Concept-based Short Text Classification and Ranking
This paper proposes using "Bag-of-Concepts" in short text representation, aiming to avoid surface mismatching and to handle the synonymy and polysemy problems, and proposes a novel framework for lightweight short text classification applications.
Enhancing naive bayes with various smoothing methods for short text classification
The experimental results on a large set of real question data show that the smoothing methods are able to significantly improve the question classification performance of Naive Bayes.
Short text classification using very few words
This work proposes a simple, scalable, and non-parametric approach for short text classification that mimics the human labeling process for a piece of short text and achieves classification accuracy comparable to that of the baseline Maximum Entropy classifier.
Overcoming the Brittleness Bottleneck using Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge
It is proposed to enrich document representation through automatic use of a vast compendium of human knowledge, an encyclopedia, and empirical results confirm that this knowledge-intensive representation brings text categorization to a qualitatively new level of performance across a diverse collection of datasets.
The Unreasonable Effectiveness of Data
A trillion-word corpus - along with other Web-derived corpora of millions, billions, or trillions of links, videos, images, tables, and user interactions - captures even very rare aspects of human behavior.
Tackling the Poor Assumptions of Naive Bayes Text Classifiers
This paper proposes simple, heuristic solutions to some of the problems with Naive Bayes classifiers, addressing both systemic issues and problems that arise because text is not actually generated according to a multinomial model.
Some Effective Techniques for Naive Bayes Text Classification
This paper proposes two empirical heuristics, per-document text normalization and a feature weighting method, which perform very well on the standard benchmark collections, competing with state-of-the-art text classifiers based on a highly complex learning method such as SVM.
Scaling to Very Very Large Corpora for Natural Language Disambiguation
This paper examines methods for effectively exploiting very large corpora when labeled data comes at a cost, and evaluates the performance of different learning methods on a prototypical natural language disambiguation task, confusion set disambiguation.
Text Categorization with Support Vector Machines: Learning with Many Relevant Features
This paper explores the use of Support Vector Machines (SVMs) for learning text classifiers from examples. It analyzes the particular properties of learning with text data and identifies why SVMs are