Compression and machine learning: a new perspective on feature space vectors

@inproceedings{Sculley2006CompressionAM,
  title={Compression and machine learning: a new perspective on feature space vectors},
  author={D. Sculley and C. Brodley},
  booktitle={Data Compression Conference (DCC'06)},
  year={2006},
  pages={332--341}
}
The use of compression algorithms in machine learning tasks such as clustering and classification has appeared in a variety of fields, sometimes with the promise of reducing problems of explicit feature selection. [...] To underscore this point, we find theoretical and empirical connections between traditional machine learning vector models and compression, encouraging cross-fertilization in future work.
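The connection the abstract draws between compression and feature-space similarity can be illustrated with a minimal compression-based nearest-neighbor classifier. This is a hedged sketch using Python's `zlib` as the compressor, not the authors' implementation:

```python
import zlib

def clen(s: bytes) -> int:
    # Compressed length under zlib; any real compressor could stand in.
    return len(zlib.compress(s, 9))

def ncd(x: bytes, y: bytes) -> float:
    # Normalized Compression Distance: a computable stand-in for the
    # (noncomputable) normalized information distance.
    cx, cy = clen(x), clen(y)
    return (clen(x + y) - min(cx, cy)) / max(cx, cy)

def classify(sample: bytes, labeled: dict) -> str:
    # 1-nearest-neighbor under NCD: pick the label of the closest document.
    best_label, best_dist = None, float("inf")
    for label, docs in labeled.items():
        for doc in docs:
            d = ncd(sample, doc)
            if d < best_dist:
                best_label, best_dist = label, d
    return best_label
```

Note that no explicit features are ever extracted: the compressor's internal model plays the role of the feature space, which is exactly the trade-off the paper examines.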
An investigation of implicit features in compression-based learning for comparing webpages
This work performs feature selection in the feature space induced by a well-known compression algorithm and finds that a subset of the features is sufficient for near-perfect classification of the webpages studied.
Text Mining Using Data Compression Models
A compression-based method for instance selection that extracts a diverse subset of documents representative of a larger collection, useful both for initializing k-means clustering and as a pool-based active learning strategy for supervised training of text classifiers.
Compression-Based Data Mining
Compression-based data mining is a universal approach to clustering, classification, dimensionality reduction, and anomaly detection. It is motivated by results in bioinformatics, learning, and [...]
Compressive Feature Learning
This paper addresses unsupervised feature learning for text data by using a dictionary-based compression scheme to extract a succinct feature set: a set of word k-grams that minimizes the cost of reconstructing the text losslessly.
An Efficient Algorithm for Large Scale Compressive Feature Learning
The recently proposed Compressive Feature Learning (CFL) framework is expanded: the authors show that CFL is NP-complete and provide a novel, efficient approximation algorithm based on a homotopy that transforms a convex relaxation of CFL into the original problem.
Text Classification Using Compression-Based Dissimilarity Measures
Experimental evaluation of the proposed text classification methods, based on information-theoretic dissimilarity measures, reveals that they approximate, and sometimes outperform, previous state-of-the-art techniques despite being much simpler, in the sense that they require no text pre-processing or feature engineering.
Text Classification with Compression Algorithms
A kernel function is defined that estimates the similarity between two objects from their compressed lengths; this is notable because compression algorithms can detect arbitrarily long dependencies within text strings.
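One way such a kernel can be built is to turn a compression distance into a similarity score. The following is an illustrative sketch, not the paper's exact construction, with `zlib` standing in for the compressor:

```python
import math
import zlib

def clen(s: bytes) -> int:
    # Length of the zlib-compressed string, used as a complexity estimate.
    return len(zlib.compress(s, 9))

def compression_similarity(x: bytes, y: bytes, gamma: float = 1.0) -> float:
    # Map the normalized compression distance into a similarity: near-identical
    # strings score close to 1, unrelated strings close to exp(-gamma).
    cx, cy = clen(x), clen(y)
    ncd = (clen(x + y) - min(cx, cy)) / max(cx, cy)
    return math.exp(-gamma * ncd)
```

Because the compressor sees the raw byte stream, long-range repetitions contribute to the score without any explicit k-gram features.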
Verification based on Compression-Models
Compression models represent an interesting approach for different classification tasks and have been used widely across many research fields. We adapt compression models to the field of authorship verification.
PyLZJD: An Easy to Use Tool for Machine Learning
PyLZJD is introduced: a library that implements LZJD in a manner meant to be easy to use and apply for novice practitioners, followed by examples of how to use it on problems with disparate data types.
Construction of Efficient V-Gram Dictionary for Sequential Data Analysis
A new method for constructing an optimal feature set from sequential data, which builds a dictionary of n-grams of variable length based on the minimum description length principle and shows competitive results on standard text classification collections without using the text structure.

References

Showing 1-10 of 35 references.
Clustering by compression
Reports evidence of successful application in areas as diverse as genomics, virology, languages, literature, music, handwritten digits, astronomy, and combinations of objects from completely different domains, using statistical, dictionary, and block-sorting compressors.
Text categorization using compression models
Text categorization is the assignment of natural language texts to predefined categories; compression models form an overall judgement on the document as a whole, rather than discarding information by pre-selecting features.
The similarity metric
A new "normalized information distance" is proposed, based on the noncomputable notion of Kolmogorov complexity; it is demonstrated to be a metric and is called the similarity metric.
Introduction to Information Theory and Data Compression
This pioneering textbook serves two independent courses, in information theory and in data compression, and also proves valuable for independent study and as a reference.
Spam Filtering Using Compression Models
Spam filtering poses a special problem in text categorization, whose defining characteristic is that filters face an active adversary constantly attempting to evade filtering. Since spam [...]
Text mining: a new frontier for lossless compression
This paper aims to promote text compression as a key technology for text mining, allowing databases to be created from formatted tables such as stock-market information on Web pages.
Kernel Methods for Pattern Analysis
This book provides an easy introduction for students and researchers to the growing field of kernel-based pattern analysis, demonstrating with examples how to handcraft an algorithm or a kernel for a new specific application, and covering all the necessary conceptual and mathematical tools to do so.
A repetition based measure for verification of text collections and for text categorization
The results show that the method outperforms SVM at multi-class categorization and, interestingly, that its results correlate strongly with those of compression-based methods.
Towards parameter-free data mining
This work shows that recent results in bioinformatics and computational theory hold great promise for a parameter-free data-mining paradigm, and that the approach is competitive with or superior to state-of-the-art approaches in anomaly/interestingness detection, classification, and clustering, with empirical tests on time series, DNA, text, and video datasets.
Data Compression Using Adaptive Coding and Partial String Matching
This paper describes how the conflict can be resolved with partial string matching and reports experimental results showing that mixed-case English text can be coded in as little as 2.2 bits/character with no prior knowledge of the source.