Max-hashing fragments for large data sets detection

@article{David2013MaxhashingFF,
  title={Max-hashing fragments for large data sets detection},
  author={J. David},
  journal={2013 International Conference on Reconfigurable Computing and FPGAs (ReConFig)},
  year={2013},
  pages={1-6}
}
The standard way to detect known digital objects inside a stream of bytes is to use a string matching algorithm initialized with a dictionary containing the objects to detect. Depending on the application, the algorithm may be implemented in software or in dedicated hardware to speed up the processing. Nevertheless, such an approach requires an automaton whose complexity is linear in the size of the dictionary. Large dictionaries result in large automata that must be stored in…
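
For concreteness, the sketch below shows the baseline the abstract describes: an Aho-Corasick-style dictionary automaton, whose state count grows linearly with the total size of the dictionary, which is exactly the scaling problem max-hashing targets. This is an illustrative Python sketch with a toy dictionary, not the paper's implementation.

    # Minimal Aho-Corasick dictionary matcher: the number of states grows
    # linearly with the summed length of the dictionary entries (the
    # scaling problem described in the abstract). Illustrative sketch only.
    from collections import deque

    def build_automaton(patterns):
        goto, fail, out = [{}], [0], [set()]
        for p in patterns:
            s = 0
            for c in p:                      # one new state per new byte
                if c not in goto[s]:
                    goto.append({}); fail.append(0); out.append(set())
                    goto[s][c] = len(goto) - 1
                s = goto[s][c]
            out[s].add(p)
        queue = deque(goto[0].values())      # depth-1 states keep fail = 0
        while queue:
            s = queue.popleft()
            for c, t in goto[s].items():
                queue.append(t)
                f = fail[s]
                while f and c not in goto[f]:
                    f = fail[f]
                fail[t] = goto[f].get(c, 0)
                out[t] |= out[fail[t]]       # inherit matches of the suffix
        return goto, fail, out

    def scan(stream, tables):
        """Yield (offset, pattern) for every dictionary hit in the stream."""
        goto, fail, out = tables
        s = 0
        for i, c in enumerate(stream):
            while s and c not in goto[s]:
                s = fail[s]
            s = goto[s].get(c, 0)
            for p in out[s]:
                yield i - len(p) + 1, p

    tables = build_automaton([b"virus", b"rus"])
    print(list(scan(b"known virus bytes", tables)))  # hits at offsets 6 and 8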
Citations

Detecting very large sets of referenced files at 40/100 GbE, especially MP4 files
This work proposes a parallel implementation of the max-hashing algorithm that enables the detection of millions of referenced files by deep packet inspection over high-bandwidth connections, together with a method to extract high-entropy signatures from MP4 files compatible with the max-hashing algorithm in order to keep false positive rates low.
Application de l'algorithme de Max-hashing pour le référencement de fichiers vidéo et la détection de contenus et de flux connus à haute vitesse sur GPU (Application of the max-hashing algorithm to the referencing of video files and the high-speed detection of known content and streams on GPU)
Abstract: The growing number of Internet users and the development of communication technology have been accompanied by the emergence of illegal behavior on the network. …

References

Identifying almost identical files using context triggered piecewise hashing
A new technique is introduced for constructing hash signatures by combining a number of traditional hashes whose boundaries are determined by the context of the input, making it possible to identify modified versions of known files even if data has been inserted, modified, or deleted in the new files.
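
As a rough illustration of that idea, the sketch below uses a toy rolling hash over a small window to decide where each piece ends, so a local edit only disturbs the pieces around it. The window size, block size, and hash choices are placeholders, not ssdeep's actual parameters.

    # Context-triggered piecewise hashing, heavily simplified: a rolling
    # hash over the last WINDOW bytes triggers piece boundaries, and each
    # piece contributes a short hash to the signature. All parameters are
    # illustrative.
    import hashlib

    WINDOW, BLOCK = 7, 64   # assumed parameters for this sketch

    def ctph(data: bytes) -> str:
        signature, piece = [], bytearray()
        for i, byte in enumerate(data):
            piece.append(byte)
            rolling = sum(data[max(0, i - WINDOW + 1):i + 1])  # toy rolling hash
            if rolling % BLOCK == BLOCK - 1:    # context says: cut a piece here
                signature.append(hashlib.md5(bytes(piece)).hexdigest()[:2])
                piece.clear()
        if piece:
            signature.append(hashlib.md5(bytes(piece)).hexdigest()[:2])
        return "".join(signature)

Because boundaries depend only on local context, inserting bytes changes the pieces near the edit but leaves the rest of the signature comparable, for example by edit distance.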
An Algorithm for Differential File Comparison
The program diff reports differences between two files, expressed as a minimal list of line changes to bring either file into agreement with the other, based on ideas from several sources.
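
To make the notion concrete, here is the same idea exercised through Python's standard difflib module rather than the original Hunt-McIlroy implementation the paper describes:

    # Report a list of line changes between two "files", using Python's
    # standard difflib (not the paper's own algorithm).
    import difflib

    old = ["the quick fox", "jumps over", "the dog"]
    new = ["the quick brown fox", "jumps over", "the lazy dog"]
    for line in difflib.unified_diff(old, new, fromfile="a", tofile="b", lineterm=""):
        print(line)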
Data Fingerprinting with Similarity Digests
A new statistical approach is proposed that relies on entropy estimates and a sizeable empirical study to pick out the features that are most likely to be unique to a data object and, therefore, least likely to trigger false positives.
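
The sketch below gives a rough sense of entropy-driven feature selection; the window size and threshold are illustrative placeholders, and sdhash's actual selection procedure is considerably more elaborate.

    # Score fixed-size windows by Shannon entropy and keep the high-entropy
    # ones as fingerprint features; low-entropy windows are common across
    # unrelated files and would trigger false positives. Parameters are
    # illustrative, not sdhash's.
    import math
    from collections import Counter

    def shannon_entropy(window: bytes) -> float:
        n = len(window)
        return -sum(c / n * math.log2(c / n) for c in Counter(window).values())

    def select_features(data: bytes, size: int = 64, threshold: float = 5.5):
        for off in range(0, len(data) - size + 1, size):
            window = data[off:off + size]
            if shannon_entropy(window) >= threshold:   # likely unique content
                yield off, window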
Copy detection mechanisms for digital documents
This paper proposes a system for registering documents and then detecting copies, either complete or partial, and describes algorithms for such detection as well as the metrics required for evaluating detection mechanisms, covering accuracy, efficiency, and security.
Using purpose-built functions and block hashes to enable small block and sub-file forensics
Techniques are presented for improved detection of JPEG, MPEG, and compressed data; for rapidly classifying the forensic contents of a drive using random sampling; and for carving data based on sector hashes.
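
A minimal sketch of the sector-hash idea, assuming a 4096-byte block size and a precomputed reference set of block hashes:

    # Hash a drive image in fixed-size blocks and look each hash up in a
    # reference set, so a fragment of a known file can be recognized even
    # without file system metadata. The block size is an assumption.
    import hashlib

    BLOCK = 4096

    def known_block_offsets(image: bytes, known_hashes: set) -> list:
        hits = []
        for off in range(0, len(image), BLOCK):
            digest = hashlib.sha256(image[off:off + BLOCK]).hexdigest()
            if digest in known_hashes:
                hits.append(off)
        return hits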
Winnowing: local algorithms for document fingerprinting
The class of local document fingerprinting algorithms is introduced, which seems to capture an essential property of any fingerprinting technique guaranteed to detect copies, and a novel lower bound on the performance of any local algorithm is proved.
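
A minimal sketch of the winnowing selection rule, with illustrative k-gram and window sizes: hash every k-gram, then keep the rightmost minimum hash in each window of w consecutive hashes, which guarantees that any match of length at least w + k - 1 shares a fingerprint.

    # Winnowing: hash all k-grams, then record the rightmost minimum in
    # each window of w consecutive hashes as a fingerprint. k and w are
    # illustrative parameters.
    def winnow(text: str, k: int = 5, w: int = 4) -> set:
        hashes = [hash(text[i:i + k]) for i in range(len(text) - k + 1)]
        fingerprints = set()
        for i in range(len(hashes) - w + 1):
            window = hashes[i:i + w]
            m = min(window)
            j = max(x for x in range(w) if window[x] == m)  # rightmost minimum
            fingerprints.add((i + j, window[j]))
        return fingerprints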
GPU-based NFA implementation for memory efficient high speed regular expression matching
This work proposes effective methods for fitting NFAs into the GPU architecture through proper data structure and parallel programming design, so that the GPU's parallel processing power can be better utilized to achieve high-speed regular expression matching.
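
The GPU data layout is that paper's contribution; the sketch below only shows the underlying NFA state-set simulation that such matchers parallelize, on a hand-built NFA for the full-match pattern "ab*c":

    # NFA simulation by state sets: advance every active state on each
    # input symbol. The transition table is hand-built for "ab*c"; real
    # engines compile it from the regular expression.
    NFA = {
        0: {"a": {1}},
        1: {"b": {1}, "c": {2}},
        2: {},            # accepting state
    }
    ACCEPT = {2}

    def nfa_match(text: str) -> bool:
        states = {0}
        for ch in text:
            states = {t for s in states for t in NFA[s].get(ch, ())}
            if not states:
                return False
        return bool(states & ACCEPT)

    print(nfa_match("abbbc"), nfa_match("ac"), nfa_match("abx"))  # True True False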
On the resemblance and containment of documents
  • A. Broder
  • Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171)
  • 1997
The basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that can be done independently for each document.
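
A minimal sketch of that sampling process (the MinHash estimator), with illustrative shingle size and sketch length; both documents must use the same hash functions, here derived from a shared seed:

    # Broder's resemblance via random sampling: reduce each document to a
    # set of k-shingles, then record, for many independent hash functions,
    # the minimum value the set attains; the fraction of agreeing minima
    # estimates the Jaccard resemblance. Parameters are illustrative.
    import random

    def shingles(text: str, k: int = 4) -> set:
        return {text[i:i + k] for i in range(len(text) - k + 1)}

    def minhash(items: set, num_hashes: int = 128, seed: int = 0) -> list:
        rng = random.Random(seed)
        masks = [rng.getrandbits(64) for _ in range(num_hashes)]
        return [min(hash(x) ^ mask for x in items) for mask in masks]

    def resemblance(a: str, b: str) -> float:
        sa, sb = minhash(shingles(a)), minhash(shingles(b))
        return sum(x == y for x, y in zip(sa, sb)) / len(sa)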
Govdocs1m
  • 109,282 JPEG pictures from the govdocs1m corpus, available online at http://digitalcorpora.org/corpora/files
  • 2012