Corpus ID: 231846515

Additive Feature Hashing

  • M. Andrecut
  • Published 7 February 2021
  • Computer Science
  • ArXiv
The hashing trick is a machine learning technique used to encode categorical features into a numerical vector representation of pre-defined fixed length. It works by using the categorical hash values as vector indices, and updating the vector values at those indices. Here we discuss a different approach based on additive hashing and the "almost orthogonal" property of high-dimensional random vectors. That is, we show that additive feature hashing can be performed directly by adding the hash…
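To make the contrast concrete, here is a minimal numpy sketch of both schemes. This is not the paper's implementation: the md5-based seeding, the signed-index variant, and the dimension are illustrative choices.

```python
import hashlib

import numpy as np


def feature_hash(tokens, dim=1024):
    """Classic hashing trick: each token's hash picks an index into a
    fixed-length vector, and the value at that index is updated."""
    v = np.zeros(dim)
    for t in tokens:
        h = int(hashlib.md5(t.encode()).hexdigest(), 16)
        sign = 1.0 if (h >> 1) % 2 == 0 else -1.0  # signed update reduces collision bias
        v[h % dim] += sign
    return v


def additive_feature_hash(tokens, dim=1024):
    """Additive variant: map each token to a pseudo-random +/-1 vector
    (deterministically seeded by its hash) and simply sum the vectors.
    In high dimensions these random vectors are almost orthogonal."""
    v = np.zeros(dim)
    for t in tokens:
        seed = int(hashlib.md5(t.encode()).hexdigest(), 16) % (2**32)
        rng = np.random.default_rng(seed)
        v += rng.choice([-1.0, 1.0], size=dim)
    return v
```

Because the per-token vectors are deterministic, the same token always contributes the same vector, while vectors for distinct tokens have a normalized inner product near zero.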

Feature hashing for large scale multitask learning
This paper provides exponential tail bounds for feature hashing, shows that the interaction between random subspaces is negligible with high probability, and demonstrates the feasibility of the approach with experimental results for a new use case: multitask learning.
A New Paradigm for Collision-Free Hashing: Incrementality at Reduced Cost
A simple, new paradigm for the design of collision-free hash functions. Any function emanating from this paradigm is incremental: rather than re-computing the hash of x′ from scratch, one can quickly "update" the old hash value to the new one, in time proportional to the amount of modification made to x to obtain x′.
Incremental Cryptography: The Case of Hashing and Signing
The idea is that having once applied the transformation to some document M, the time to update the result upon modification of M should be "proportional" to the "amount of modification" done to M.
High-Dimensional Vector Semantics
This paper shows that the intriguing "almost orthogonal" property of high-dimensional random vectors can be used to "memorize" random vectors by simply adding them, and provides an efficient probabilistic solution to the set membership problem.
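The "memorize by adding" idea described above can be demonstrated in a few lines of numpy. This is a sketch under assumed parameters (the dimension, ±1 vectors, and the 0.5 decision threshold are illustrative, not taken from the paper):

```python
import numpy as np

DIM = 10_000  # high dimension makes independent random vectors nearly orthogonal


def random_vector(rng):
    """A random +/-1 vector; two such vectors have inner product ~ sqrt(DIM)."""
    return rng.choice([-1.0, 1.0], size=DIM)


rng = np.random.default_rng(0)
items = {name: random_vector(rng) for name in ["alpha", "beta", "gamma"]}
memory = sum(items.values())  # "memorize" the whole set by simple addition


def contains(memory, vec, threshold=0.5):
    # A member contributes DIM to the dot product; every other vector only
    # adds ~sqrt(DIM) of noise, so the normalized score separates cleanly.
    return np.dot(memory, vec) / DIM > threshold


probe = random_vector(rng)  # a fresh vector that was never added to memory
```

Probing `memory` with a stored vector yields a normalized score near 1, while a fresh random vector scores near 0, which is the probabilistic set-membership test the summary refers to.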
There is substantial deviation in users' notions of what constitutes spam and ham, and these realities make it extremely difficult to assemble a single, global spam classifier.
An Introduction to Random Indexing
The Random Indexing word space approach is introduced, which presents an efficient, scalable and incremental alternative to standard word space methods.
Random indexing of text samples for latent semantic analysis
Pentti Kanerva, Jan Kristoferson, Anders Holst; RWCP Theoretical Foundation SICS Laboratory, Swedish Institute of Computer Science, Box 1263, SE-16429 Kista, Sweden. Latent Semantic Analysis is a method of computing vectors that capture word meaning from a words-by-contexts matrix built over a text corpus.
Contributions to the study of SMS spam filtering: new collection and results
A new real, public and non-encoded SMS spam collection that is the largest one as far as the authors know is offered and the performance achieved by several established machine learning methods is compared. Expand
The WiLI benchmark dataset for written language identification
This paper describes the WiLI-2018 benchmark dataset for monolingual written natural language identification. WiLI-2018 is a publicly available, free-of-charge dataset of short text extracts from Wikipedia.
Introduction to Information Retrieval
  • R. Larson
  • Computer Science
  • J. Assoc. Inf. Sci. Technol.
  • 2010