• Corpus ID: 231846515

Additive Feature Hashing

  @article{andrecut2021additive,
    title={Additive Feature Hashing},
    author={Mircea Andrecut},
    journal={ArXiv},
    year={2021}
  }
  • M. Andrecut
  • Published 7 February 2021
  • Computer Science
  • ArXiv
The hashing trick is a machine learning technique used to encode categorical features into a numerical vector representation of pre-defined fixed length. It works by using the categorical hash values as vector indices, and updating the vector values at those indices. Here we discuss a different approach based on additive-hashing and the "almost orthogonal" property of high-dimensional random vectors. That is, we show that additive feature hashing can be performed directly by adding the hash… 
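The two encodings contrasted in the abstract can be sketched as follows. This is an illustrative reading, not the paper's exact algorithm: the MD5-based hash, the dimensions, and the pseudo-random ±1 token vectors are all assumptions made for the sketch.

```python
import hashlib
import numpy as np

def _hash(token: str) -> int:
    # Deterministic 64-bit hash of a token. (Python's built-in hash() is
    # salted per process, so a fixed digest is used here instead.)
    return int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")

def hashing_trick(tokens, dim=32):
    """Classic hashing trick: the token's hash value, taken modulo the
    fixed vector length, gives the index to update."""
    v = np.zeros(dim)
    for t in tokens:
        v[_hash(t) % dim] += 1.0
    return v

def additive_hashing(tokens, dim=1024):
    """Additive variant in the spirit of the abstract: each token is mapped
    to a high-dimensional pseudo-random +/-1 vector seeded by its hash, and
    the feature vector is simply the sum of those vectors."""
    v = np.zeros(dim)
    for t in tokens:
        rng = np.random.default_rng(_hash(t))
        v += rng.choice([-1.0, 1.0], size=dim)
    return v

def contains(v, token, dim=1024, threshold=0.5):
    """Probabilistic membership test: because independent random vectors in
    high dimensions are 'almost orthogonal', the normalized dot product with
    a token's vector is close to 1 if it was added and close to 0 otherwise."""
    rng = np.random.default_rng(_hash(token))
    return float(v @ rng.choice([-1.0, 1.0], size=dim)) / dim > threshold
```

With `dim=1024`, the cross-talk between a few summed token vectors has magnitude on the order of `1/sqrt(dim)` after normalization, so the 0.5 threshold separates members from non-members with high probability.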

Feature hashing for large scale multitask learning
This paper provides exponential tail bounds for feature hashing, shows that the interaction between random subspaces is negligible with high probability, and demonstrates the feasibility of the approach with experimental results for a new use case: multitask learning.
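The key property behind that result is that *signed* feature hashing approximately preserves inner products between the original sparse vectors. A minimal sketch, assuming MD5-derived index and sign hashes (the salting scheme and dimension are illustrative, not taken from the cited paper):

```python
import hashlib
import numpy as np

def _h(token: str, salt: str) -> int:
    # Deterministic hash; the salt derives an independent second hash
    # for the sign bit.
    data = f"{salt}:{token}".encode()
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")

def hashed_features(tokens, dim=4096):
    """Signed feature hashing: index = hash mod dim, sign = a second,
    independent hash.  The random sign makes the inner product of two
    hashed vectors an unbiased estimate of the original inner product,
    with collision noise shrinking as dim grows."""
    v = np.zeros(dim)
    for t in tokens:
        idx = _h(t, "index") % dim
        sign = 1.0 if _h(t, "sign") % 2 == 0 else -1.0
        v[idx] += sign
    return v
```

For example, two bags of words sharing two tokens, such as `["x", "y", "z"]` and `["y", "z", "w"]`, will typically have `hashed_features(a) @ hashed_features(b)` close to the true overlap of 2, up to rare index collisions.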
A New Paradigm for Collision-Free Hashing: Incrementality at Reduced Cost
A simple, new paradigm for the design of collision-free hash functions, where any function emanating from the paradigm is incremental: rather than re-computing the hash of x′ from scratch, one can quickly "update" the old hash value to the new one, in time proportional to the amount of modification made to x to obtain x′.
Incremental Cryptography: The Case of Hashing and Signing
The idea is that having once applied the transformation to some document M, the time to update the result upon modification of M should be "proportional" to the "amount of modification" done to M.
High-Dimensional Vector Semantics
This paper shows that the "almost orthogonal" property of high-dimensional random vectors can be used to "memorize" random vectors by simply adding them, and provides an efficient probabilistic solution to the set membership problem.
There is substantial deviation in users' notions of what constitutes spam and ham, and these realities make it extremely difficult to assemble a single, global spam classifier.
An Introduction to Random Indexing
The Random Indexing word space approach is introduced, which presents an efficient, scalable and incremental alternative to standard word space methods.
Random indexing of text samples for latent semantic analysis
Pentti Kanerva, Jan Kristoferson and Anders Holst (RWCP Theoretical Foundation SICS Laboratory, Swedish Institute of Computer Science). Latent Semantic Analysis is a method of computing vectors that capture word meaning from a words-by-contexts matrix built over a text corpus; Random Indexing of text samples offers an efficient way to obtain such vectors.
Contributions to the study of SMS spam filtering: new collection and results
A new real, public and non-encoded SMS spam collection that is the largest one as far as the authors know is offered and the performance achieved by several established machine learning methods is compared.
The WiLI benchmark dataset for written language identification
This paper describes the WiLI-2018 benchmark dataset for monolingual written natural language identification. WiLI-2018 is a publicly available, free-of-charge dataset of short text extracts from Wikipedia.
Introduction to Information Retrieval
  • R. Larson
  • Computer Science
    J. Assoc. Inf. Sci. Technol.
  • 2010