Performance in Practice of String Hashing Functions

@inproceedings{Ramakrishna1997PerformanceIP,
  title={Performance in Practice of String Hashing Functions},
  author={M. V. Ramakrishna and Justin Zobel},
  booktitle={DASFAA},
  year={1997}
}
String hashing is a fundamental operation, used in countless applications where fast access to distinct strings is required. In this paper we describe a class of string hashing functions and explore its performance. In particular, using experiments with both small sets of keys and a large key set from a text database, we show that it is possible to achieve performance close to that theoretically predicted for hashing functions. We also consider criteria for choosing a hashing function and use… 
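As a rough illustration of the kind of function the abstract refers to, here is a minimal sketch of a seeded shift-add-xor string hash. The truncated abstract does not name the class it studies, so the function name, the shift amounts (5 and 2), the 32-bit mask, and the default table size are all assumptions chosen for illustration, not details taken from this page.

```python
MASK32 = 0xFFFFFFFF  # assumption: keep the running value within 32 bits


def shift_add_xor_hash(key: str, seed: int = 0, left: int = 5,
                       right: int = 2, table_size: int = 1024) -> int:
    """Hash a string to a slot in [0, table_size).

    Hypothetical sketch: each byte is mixed into the running value h by
    XOR-ing h with (h << left) + (h >> right) + byte. Varying the seed
    gives a different member of the same family of functions.
    """
    h = seed & MASK32
    for byte in key.encode("utf-8"):
        h ^= ((h << left) + (h >> right) + byte) & MASK32
    return h % table_size


if __name__ == "__main__":
    # Map a few example keys to table slots using an arbitrary seed.
    for word in ("hash", "hashing", "string"):
        print(word, shift_add_xor_hash(word, seed=31))
```

Changing the seed selects a different function from the same family, which is one simple way to rehash a key set when a particular function distributes it poorly.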
Choosing Best Hashing Strategies and Hash Functions
  • Mahim Singh, D. Garg
  • Computer Science, Mathematics
    2009 IEEE International Advance Computing Conference
  • 2009
TLDR
The paper gives guidelines for choosing the most suitable hashing method and hash function for a particular problem, and presents six classes of hash functions that cover most problems.
Strongly Universal String Hashing is Fast
TLDR
Fast strongly universal string hashing families are presented: they can process data at a rate of 0.2 CPU cycle per byte and it is found that these families—though they require a large buffer of random numbers—are often faster than popular hash functions with weaker theoretical guarantees.
The universality of iterated hashing over variable-length strings
  • D. Lemire
  • Computer Science, Mathematics
    Discret. Appl. Math.
  • 2012
Fast and Compact Hash Tables for Integer Keys
TLDR
This paper explains how to efficiently implement an array hash table for integers and demonstrates, through careful experimental evaluations, which hash table offers the best performance for maintaining a large dictionary of integers in-memory, on a current cache-oriented processor.
Cache-Conscious Collision Resolution in String Hash Tables
TLDR
Two alternatives to the standard representation of string hash tables are explored: the simple expedient of including the string in its node, and the more drastic step of replacing each list of nodes by a contiguous array of characters.
Redesigning the string hash table, burst trie, and BST to exploit cache
TLDR
Two alternatives to the standard representation of strings are explored: the simple expedient of including the string in its node, and, for linked lists, the more drastic step of replacing each list of nodes by a contiguous array of characters.
Performance of Data Structures for Small Sets of Strings
TLDR
This paper tests the performance of the same data structures on small sets of strings, in the context of document processing for index construction, and shows that the new structures, in particular the burst trie, are the most efficient choice for this task.
Coding schemes variation and its impact on string hashing
  • S. Mustafa
  • Computer Science
    Comput. Stand. Interfaces
  • 2002
String hashing for collection-based compression
TLDR
A CBC system, cobald, was developed which employs a two-step scheme: a preliminary long-range delta encoding step using the fingerprint index, followed by a compression of the delta file by a standard compression utility.
Burst tries: a fast, efficient data structure for string keys
TLDR
These experiments show that the burst trie is particularly effective for the skewed frequency distributions common in text collections, and dramatically outperforms all other data structures for the task of managing strings while maintaining sort order.

References

SHOWING 1-10 OF 26 REFERENCES
Distribution-dependent hashing functions and their characteristics
TLDR
A study of the performance measures obtained during tests of "Distribution-dependent" hashing functions indicates that in certain cases, distribution-dependent methods perform better than the division method.
Hashing practice: analysis of hashing and universal hashing
TLDR
This paper considers the problem of achieving analytical performance of hashing techniques in practice with reference to successful search lengths, unsuccessful search lengths and the expected worst case performance (expected length of the longest probe sequence).
Selecting a hashing algorithm
TLDR
The results of investigations into the performance of some widely used hashing algorithms are presented and it is shown that some of these algorithms are far from optimal.
File organization using composite perfect hashing
TLDR
This work proposes and analyzes a composite perfect hashing scheme for large external files that guarantees retrieval of any record in a single disk access and supports efficient range searches in addition to being a completely dynamic file organization scheme.
Expected Worst-Case Performance of Hash Files
The following problem is studied: consider a hash file and the longest probe sequence that occurs when retrieving a record. How long is this probe sequence expected to be? The approach taken differs
Expected Length of the Longest Probe Sequence in Hash Code Searching
TLDR
An investigation is made of the expected value of the maximum number of accesses needed to locate any element in a hashing file under various collision resolution schemes, showing that the actual behavior of the worst case in hash tables is quite good on the average.
Phonetic string matching: lessons from information retrieval
TLDR
The parallels between information retrieval and phonetic matching are explained, and the new phonetic matching techniques described are compared with existing techniques to demonstrate that the new techniques are superior.
Practical performance of Bloom filters and parallel free-text searching
TLDR
The performance of hash transformations with reference to the filter error rate is the focus of this article.
General performance analysis of key-to-address transformation methods using an abstract file concept
  • V. Lum
  • Computer Science
    CACM
  • 1973
This paper presents a new approach to the analysis of performance of the various key-to-address transformation methods. In this approach the keys in a file are assumed to have been selected from the
Algorithms in C
TLDR
Algorithms in C is a comprehensive repository of algorithms, complete with code, with extensive treatment of searching and advanced data structures, sorting, string processing, computational geometry, graph problems, and mathematical algorithms.