One widely used mechanism for representing membership of a set of items is the Bloom filter, a simple, space-efficient randomized data structure. Yet Bloom filters are not entirely suitable for many newer network applications and services that must represent and query items with multiple attributes rather than a single attribute …
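The following minimal Python sketch illustrates the kind of single-attribute set membership a standard Bloom filter provides; the bit-array size, the number of hash functions, and the hashlib-based hashing are illustrative choices rather than details from the paper, and the multi-attribute extension it argues for is not shown.

import hashlib

class BloomFilter:
    def __init__(self, m_bits=1024, k_hashes=4):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits)   # one byte per bit, for clarity

    def _positions(self, item):
        # Derive k bit positions by hashing the item with k different salts.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # May return a false positive, but never a false negative.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("fileA")
print("fileA" in bf, "fileB" in bf)   # True, almost certainly False

The one-sided error (false positives only) is what makes the structure so compact compared with storing the items themselves.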
Data deduplication is becoming increasingly popular in storage systems as a space-efficient approach to data backup and archiving. Most existing state-of-the-art deduplication methods are either locality based or similarity based, which, according to our analysis, do not work adequately in many situations. While the former produces poor deduplication …
Data deduplication is a scalable compression technique used in large-scale backup systems. Traditional compression compresses a piece of data (e.g., a small file) at byte granularity. Data deduplication compresses the entire storage system at chunk granularity.
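To make the contrast concrete, here is a toy Python sketch of chunk-granularity deduplication; the fixed-size chunks and SHA-1 fingerprints are purely illustrative assumptions, whereas real backup systems typically use content-defined chunking and more elaborate fingerprint indexes.

import hashlib

def deduplicate(data, chunk_size=4096):
    store = {}          # fingerprint -> unique chunk contents
    recipe = []         # ordered fingerprints needed to rebuild the data
    for off in range(0, len(data), chunk_size):
        chunk = data[off:off + chunk_size]
        fp = hashlib.sha1(chunk).hexdigest()
        store.setdefault(fp, chunk)   # keep each distinct chunk only once
        recipe.append(fp)
    return store, recipe

def restore(store, recipe):
    return b"".join(store[fp] for fp in recipe)

data = b"A" * 8192 + b"B" * 4096 + b"A" * 8192     # repeated content
store, recipe = deduplicate(data)
assert restore(store, recipe) == data
print(len(data), "bytes reduced to", sum(len(c) for c in store.values()), "unique bytes")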
When replication forks stall at damaged bases or upon nucleotide depletion, the intra-S phase checkpoint ensures they are stabilized and can restart. In intra-S checkpoint-deficient budding yeast, stalling forks collapse, and ∼10% form pathogenic chicken foot structures, contributing to incomplete replication and cell death (Lopes et al., 2001; Sogo et al., …)
Data deduplication has gained increasing attention and popularity as a space-efficient approach in backup storage systems. One of the main challenges for centralized data deduplication is the scalability of fingerprint-index search. In this paper, we propose SiLo, a near-exact and scalable deduplication system that effectively and complementarily exploits …
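As a rough illustration only, and not SiLo's actual algorithm, the Python sketch below shows one way a similarity sample (a representative fingerprint per segment) and stream locality (grouping co-occurring segments into a block that is fetched as a whole) could be combined to keep the in-memory index small; the class, block, and segment names are made up.

import hashlib

def fingerprint(chunk):
    return hashlib.sha1(chunk).hexdigest()

class SampledIndex:
    def __init__(self):
        self.representatives = {}     # representative fingerprint -> block id
        self.blocks = {}              # block id -> set of all fingerprints in that block

    def insert_segment(self, block_id, chunk_fps):
        # Similarity: keep only one representative (the minimum fingerprint) in memory.
        self.representatives[min(chunk_fps)] = block_id
        # Locality: segments written together land in the same block.
        self.blocks.setdefault(block_id, set()).update(chunk_fps)

    def lookup_segment(self, chunk_fps):
        # Probe the small representative index first; fetch the full block
        # index only on a hit, exploiting the locality of backup streams.
        block_id = self.representatives.get(min(chunk_fps))
        if block_id is None:
            return set()
        return set(chunk_fps) & self.blocks[block_id]

idx = SampledIndex()
seg = [fingerprint(bytes([i])) for i in range(8)]
idx.insert_segment("block-0", seg)
print(len(idx.lookup_segment(seg)), "duplicate chunks detected")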
This paper presents a scalable and adaptive decentralized metadata lookup scheme for ultra-large-scale file systems (≥ petabytes or even exabytes). Our scheme logically organizes metadata servers (MDSs) into a multi-layered query hierarchy and exploits grouped Bloom filters to efficiently route metadata requests to the desired MDS through the hierarchy. This …
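The Python sketch below suggests how Bloom-filter-guided routing of metadata requests could look in the simplest flat case; the per-group filters, group names, and filter parameters are assumptions for illustration, not the paper's actual multi-layered hierarchy.

import hashlib

def bloom_positions(item, m=1 << 16, k=4):
    return [int(hashlib.sha256(f"{i}:{item}".encode()).hexdigest(), 16) % m
            for i in range(k)]

class MDSGroup:
    def __init__(self, name, m=1 << 16):
        self.name = name
        self.bits = bytearray(m)      # Bloom filter over the paths this group stores
        self.files = {}

    def store(self, path, metadata):
        self.files[path] = metadata
        for pos in bloom_positions(path, len(self.bits)):
            self.bits[pos] = 1

    def may_contain(self, path):
        return all(self.bits[pos] for pos in bloom_positions(path, len(self.bits)))

def route(groups, path):
    # Forward the lookup only to groups whose Bloom filter reports a possible hit.
    for g in groups:
        if g.may_contain(path) and path in g.files:
            return g.name, g.files[path]
    return None

g1, g2 = MDSGroup("group-1"), MDSGroup("group-2")
g1.store("/home/alice/a.txt", {"size": 42})
print(route([g1, g2], "/home/alice/a.txt"))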
Data deduplication has become a standard component in modern backup systems. In order to understand the fundamental tradeoffs in each of its design choices (such as prefetching and sampling), we disassemble data deduplication into a large N-dimensional parameter space. Each point in the space corresponds to a particular combination of parameter settings and represents a tradeoff among …
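A small Python sketch of the parameter-space view follows; the dimensions and values (chunking method, sampling ratio, prefetch size) are illustrative stand-ins, not the paper's actual parameter set.

from itertools import product

space = {
    "chunking":       ["fixed", "content-defined"],
    "sampling_ratio": [1, 16, 64],        # 1 = full fingerprint index in memory
    "prefetch_size":  [0, 32, 128],       # fingerprints prefetched per index hit
}

# Every point in the space is one complete configuration; a study would measure
# deduplication ratio, throughput, and memory footprint at each point.
for point in product(*space.values()):
    config = dict(zip(space.keys(), point))
    print(config)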
Unsupervised learning of units (phonemes, words, phrases, etc.) is important to the design of statistical speech and NLP systems. This paper presents a general source-coding framework for inducing words from natural language text without word boundaries. An efficient search algorithm is developed to optimize the minimum description length (MDL) induction …
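The following toy Python function sketches a two-part MDL score for a candidate segmentation, charging bits for spelling out the lexicon plus bits for encoding the corpus under the induced unigram word distribution; the exact coding scheme and the paper's search algorithm are not reproduced, and the alphabet size is an assumption.

import math
from collections import Counter

def description_length(segmented_corpus, alphabet_size=27):
    counts = Counter(w for sent in segmented_corpus for w in sent)
    total = sum(counts.values())
    # Part 1: lexicon cost, roughly log2(alphabet_size) bits per character.
    lexicon_bits = sum(len(w) * math.log2(alphabet_size) for w in counts)
    # Part 2: corpus cost under the unigram word probabilities.
    corpus_bits = -sum(c * math.log2(c / total) for c in counts.values())
    return lexicon_bits + corpus_bits

good = [["the", "dog", "ran"], ["the", "cat", "ran"]]
bad  = [["th", "edo", "gran"], ["th", "eca", "tran"]]
print(description_length(good), description_length(bad))

A segmentation that reuses lexicon entries yields a shorter total description length than an arbitrary one, which is the signal an MDL-driven search optimizes.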
Multidimensional data indexing has received much research attention recently in centralized systems. However, providing an integrated structure for multiple queries on multidimensional data in a distributed environment remains a nascent area of research. In this paper, we propose a new data structure, called BR-tree (Bloom-filter-based R-tree), and …
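A simplified Python sketch of the BR-tree idea is shown below: each node carries both a bounding rectangle (serving range queries) and a Bloom filter over item keys (serving point, i.e. membership, queries). Tree construction, node splitting, and the remaining query types are omitted, and the filter parameters are illustrative assumptions.

import hashlib

def bloom_positions(item, m=4096, k=3):
    return [int(hashlib.md5(f"{i}:{item}".encode()).hexdigest(), 16) % m
            for i in range(k)]

class BRNode:
    def __init__(self):
        self.rect = None                  # (min_x, min_y, max_x, max_y)
        self.bits = bytearray(4096)       # Bloom filter over item keys
        self.items = []                   # (key, point) pairs held in this leaf

    def insert(self, key, point):
        x, y = point
        if self.rect is None:
            self.rect = (x, y, x, y)
        else:
            mnx, mny, mxx, mxy = self.rect
            self.rect = (min(mnx, x), min(mny, y), max(mxx, x), max(mxy, y))
        for pos in bloom_positions(key, len(self.bits)):
            self.bits[pos] = 1
        self.items.append((key, point))

    def may_contain(self, key):
        # Point (membership) query answered from the Bloom filter alone.
        return all(self.bits[pos] for pos in bloom_positions(key, len(self.bits)))

    def range_query(self, rect):
        # Range query answered by filtering stored points against the rectangle.
        qx1, qy1, qx2, qy2 = rect
        return [k for k, (x, y) in self.items if qx1 <= x <= qx2 and qy1 <= y <= qy2]

node = BRNode()
node.insert("obj-1", (3, 4))
node.insert("obj-2", (7, 1))
print(node.may_contain("obj-1"), node.range_query((0, 0, 5, 5)))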
Fast and flexible metadata retrieval is critical in next-generation data storage systems. As storage capacity approaches the exabyte level and the number of stored files reaches the billions, the directory-tree-based metadata management widely deployed in conventional file systems can no longer meet the requirements of scalability and …