A multi-layer bloom filter for duplicated URL detection

@article{Zhiwang2010AMB,
  title={A multi-layer bloom filter for duplicated URL detection},
  author={Cen Zhiwang and Xu Jungang and Sun Jian},
  journal={2010 3rd International Conference on Advanced Computer Theory and Engineering (ICACTE)},
  year={2010},
  volume={1},
  pages={V1-586--V1-591}
}
  • Cen Zhiwang, Xu Jungang, Sun Jian
  • Published 20 September 2010
  • Computer Science
  • 2010 3rd International Conference on Advanced Computer Theory and Engineering (ICACTE)
It is of great significance to improve the speed of data collection and updating in a web crawler because there are a large number of web pages on the Internet. [...] Key result: The experimental results show that the false positive rate of the multi-layer Bloom filter algorithm is significantly lower than that of the classical Bloom filter algorithm, while the efficiency of the former is almost the same as that of the latter.
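
For orientation, the classical Bloom filter that the abstract uses as its baseline can be sketched as follows. This is a minimal illustration only, not the paper's multi-layer algorithm, which the excerpt above does not describe. The class name `BloomFilter`, the helper `crawl_once`, the choice of SHA-256 with double hashing, and the example sizes are all assumptions made for this sketch.

```python
import hashlib


class BloomFilter:
    """Classical Bloom filter with num_bits bits and num_hashes probes (illustrative)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive all probe positions from two 64-bit slices
        # of a single SHA-256 digest (an assumption, not the paper's hashes).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force a nonzero step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def crawl_once(urls, seen: BloomFilter):
    """Return the URLs not yet recorded in `seen`, recording each as it is met."""
    fresh = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh


seen = BloomFilter(num_bits=1024, num_hashes=3)
queue = ["http://a.example/", "http://b.example/", "http://a.example/"]
fresh = crawl_once(queue, seen)
```

A URL that was added is always reported as present, so a duplicate is never crawled twice. The cost is that a URL never seen before may occasionally be reported as present and skipped, which is the false positive rate the paper sets out to reduce.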

An Improved Bloom Filter in Distributed Crawler
  • Weipeng Zhou, Pan Wang, Xuejiao Chen, Feng Ye
  • Computer Science
    2018 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI)
  • 2018
TLDR
The MD5 algorithm is used to preprocess the URL, and an improved multi-dimensional Bloom filter algorithm is proposed, which effectively reduces the false positive rate and improves the efficiency of the distributed crawler.
Application of Bloom Filter for Duplicate URL Detection in a Web Crawler
  • Aveksha Kapoor, V. Arora
  • Computer Science
    2016 IEEE 2nd International Conference on Collaboration and Internet Computing (CIC)
  • 2016
TLDR
The paper attempts to improve the time efficiency of a MapReduce-based crawler, Apache Nutch, by using a Bloom filter; comparisons against different parameters indicate the efficiency and effectiveness of the proposed approach.
A Survey on Duplicate Data Filtering Methods in Big Data
TLDR
This article reviews different filtering methods and algorithms used for duplicate elimination, such as the Bloom filter, Stable Bloom Filter, multi-layer Bloom filter, and Counting Bloom Filter, along with disadvantages such as false positives and false negatives.
A parallel bloom filter string searching algorithm on a many-core processor
TLDR
The underlying architecture of a serial Bloom filter string searching algorithm is analyzed to identify the performance impact of this algorithm for large datasets, and a many-core-driven parallel Bloom filter algorithm is proposed using the Compute Unified Device Architecture (CUDA) parallel computing platform.
Malicious URL detection with feature extraction based on machine learning
Many web applications suffer from various web attacks due to the lack of awareness concerning security. Therefore, it is necessary to improve the reliability of web applications by accurately
Discovery protocol for data distribution service in naval warships using extended counting bloom filters
TLDR
The delay time for filter construction and the total discovery time needed in a naval warship network topology are presented, and the proposed method yields a low delay time and zero false positive probability.
Linklets - Formal Function Description and Permission Model
TLDR
The advantages of the Linklet concept provide a way to enhance and monetize the value of the semantic web and show the limits of OWL's open-world assumption.
A Web Scraper For Forums : Navigation and text extraction methods
TLDR
Web forums are a popular way of exchanging information and discussing various topics and usually have a special structure, divided into boards, threads and posts.
Improving sequence analysis with probabilistic data structures and algorithms

References

Research and Implementation of Distributed and Multi-topic Web Crawler System
TLDR
This paper proposes an architecture of distributed Web crawler system based on data-trapper that implements a multi-topic schema based on classics-label, and designs a two-tiered weighted task partition algorithm that realizes target-guided URL configuration based on Agents’ load while providing better dynamic scalability.
On Distributed Web Crawler: Architecture, Algorithms and Strategy
TLDR
The experiments show that Igloo can quickly crawl high-quality Web pages while delivering high performance, and a new URL repository access method based on a "delayed merging" strategy is proposed to enable high-speed crawling.
Designing a Bloom filter for differential file access
TLDR
The design process for a Bloom filter for an on-line student database is described, and it is shown that a very effective filter can be constructed with a modest expenditure of system resources.
Distributed High-performance Web Crawlers : A Survey of the State of the Art
Web crawlers (also called Web spiders or robots) are programs used to download documents from the Internet. Simple crawlers can be used by individuals to copy an entire web site to their hard drive
Automatic Compilation Framework for Bloom Filter Based Intrusion Detection
TLDR
The results show that a single engine tailored for handling virus signatures of length eight bytes can achieve a throughput of 18.6 Gbps while occupying only 8% of the FPGA area.
Design and implementation of a high-performance distributed Web crawler
TLDR
This paper describes the design and implementation of a distributed Web crawler that runs on a network of workstations that scales to several hundred pages per second, is resilient against system crashes and other events, and can be adapted to various crawling applications.
Compressed bloom filters
A Bloom filter is a simple space-efficient randomized data structure for representing a set in order to support membership queries. Although Bloom filters allow false positives, for many applications
Split Bloom Filter
TLDR
A new kind of Bloom filter, the Split Bloom Filter, which uses an s×m bit matrix to represent a set, is presented, together with its space/time/error-rate tradeoffs; it is proved that the Split Bloom Filter can efficiently solve or weaken the two problems it addresses.
Spectral bloom filters
TLDR
The Spectral Bloom Filter is introduced, an extension of the original Bloom Filter to multi-sets, allowing the filtering of elements whose multiplicities are below a threshold given at query time.