Corpus ID: 237635224

Text Ranking and Classification using Data Compression

@inproceedings{Kasturi2021TextRA,
  title={Text Ranking and Classification using Data Compression},
  author={Nitya Kasturi and Igor L. Markov},
  booktitle={ICBINB@NeurIPS},
  year={2021}
}
A well-known but rarely used approach to text categorization uses conditional entropy estimates computed using data compression tools. Text affinity scores derived from compressed sizes can be used for classification and ranking tasks, but their success depends on the compression tools used. We use the Zstandard compressor and strengthen these ideas in several ways, calling the resulting language-agnostic technique Zest. In applications, this approach simplifies configuration, avoiding careful…
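To make the approach concrete, below is a minimal sketch of compression-based affinity scoring using the zstandard Python bindings. The function names and toy corpora are illustrative assumptions, not the authors' Zest implementation, which strengthens this baseline in several ways.

import zstandard as zstd

def compressed_size(data: bytes, level: int = 19) -> int:
    # Size of the input after Zstandard compression, in bytes.
    return len(zstd.ZstdCompressor(level=level).compress(data))

def affinity(class_corpus: str, doc: str) -> int:
    # Extra compressed bytes the document costs once the class corpus has
    # been seen: a rough conditional-entropy estimate. Lower means closer.
    base = compressed_size(class_corpus.encode("utf-8"))
    joint = compressed_size((class_corpus + "\n" + doc).encode("utf-8"))
    return joint - base

def classify(doc: str, corpora: dict) -> str:
    # Assign the document to the class whose corpus compresses it best.
    return min(corpora, key=lambda label: affinity(corpora[label], doc))

# Toy usage; real class corpora would be far larger.
corpora = {
    "sports": "the team won the match after scoring two late goals",
    "finance": "shares rallied after the quarterly earnings report",
}
print(classify("the striker scored in the final minute", corpora))

The same affinity score can also rank candidate documents against a query corpus by sorting them on it.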

References

Showing 1-10 of 17 references

Tweet classification by data compression

A compression-based tweet classification method called CTC, which uses the Deflate algorithm (as in gzip), achieved higher precision and recall than state-of-the-art online learning algorithms in empirical evaluations.
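As a point of comparison, the same affinity measurement can be taken with Deflate through Python's standard gzip module, roughly in the spirit of the CTC method summarized above (the paper's exact scoring is not reproduced here). As the Zest abstract notes, results depend on the compressor chosen.

import gzip

def deflate_affinity(class_corpus: str, tweet: str) -> int:
    # Extra compressed bytes for the tweet given the class corpus,
    # using Deflate (as in gzip) rather than Zstandard.
    base = len(gzip.compress(class_corpus.encode("utf-8")))
    joint = len(gzip.compress((class_corpus + "\n" + tweet).encode("utf-8")))
    return joint - base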

A Compressed Sensing View of Unsupervised Text Embeddings, Bag-of-n-Grams, and LSTMs

Using compressed sensing theory, it is shown that representations combining the constituent word vectors can be information-preserving linear measurements of Bag-of-n-Grams (BonG) representations of text, leading to a new theoretical result about LSTMs: embeddings derived from a low-memory LSTM are provably at least as powerful on classification tasks as a linear classifier over BonG vectors.

Supervised Learning of Universal Sentence Representations from Natural Language Inference Data

It is shown how universal sentence representations trained using the supervised data of the Stanford Natural Language Inference datasets can consistently outperform unsupervised methods like SkipThought vectors on a wide range of transfer tasks.

A Simple but Tough-to-Beat Baseline for Sentence Embeddings

Language trees and zipping.

A very general method, based on data-compression techniques, for extracting information from a generic string of characters (e.g., a text, a DNA sequence, or a time series), yielding highly accurate results for language recognition, authorship attribution, and language classification.

On J. Goodman's comment to "Language Trees and Zipping"

Motivated by the recent submission to the cond-mat archives by J. Goodman (cond-mat/0202383), whose results apparently discredit the approach we have proposed in a recent paper (Phys. Rev. Lett., 88, …

NewsWeeder: Learning to Filter Netnews

Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank

A Sentiment Treebank that includes fine-grained sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences, presenting new challenges for sentiment compositionality; the Recursive Neural Tensor Network is introduced to address them.

Speech and language processing - an introduction to natural language processing, computational linguistics, and speech recognition

This book takes an empirical approach to language processing, based on applying statistical and other machine-learning algorithms to large corpora, to demonstrate how the same algorithm can be used for speech recognition and word-sense disambiguation.

Speech and language processing: an introduction to natural language processing

The first of its kind to thoroughly cover language technology at all levels and with all modern technologies, this text takes an empirical approach to the subject, based on applying statistical and other machine-learning algorithms to large corpora.