DDFlasks: Deduplicated Very Large Scale Data Store

Francisco Maia, João Paulo, Fábio Coelho, Francisco Neves, J. Pereira, and R. Oliveira
With the increasing number of connected devices, it becomes essential to find novel data management solutions that can leverage their computational and storage capabilities. However, developing very large scale data management systems requires tackling a number of interesting distributed systems challenges, namely continuous failures and high levels of node churn. In this context, epidemic-based protocols have proved suitable and effective, and have been successfully used to build DataFlasks, an…


DATAFLASKS: Epidemic Store for Massive Scale Systems
This paper proposes a novel data store solely based on epidemic (or gossip-based) protocols that leverages the capacity of these protocols to provide data persistence guarantees even in highly dynamic, massive scale systems.
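As a rough illustration of the epidemic approach these stores build on (a toy simulation, not DataFlasks' actual protocol), a push-style gossip round can be sketched in a few lines: every node that already holds an update forwards it to a few peers chosen uniformly at random, and coverage of all nodes is reached in a number of rounds that grows only logarithmically with system size.

```python
import random

def gossip_rounds(n_nodes, fanout, seed=0):
    """Simulate push-style epidemic dissemination: each round, every
    node that already holds the update forwards it to `fanout` peers
    chosen uniformly at random. Returns rounds until full coverage."""
    rng = random.Random(seed)
    informed = {0}          # node 0 starts with the update
    rounds = 0
    while len(informed) < n_nodes:
        for node in list(informed):
            for _ in range(fanout):
                informed.add(rng.randrange(n_nodes))
        rounds += 1
    return rounds

# With fanout 2, a 1000-node system is typically fully covered
# in roughly O(log n) rounds.
print(gossip_rounds(1000, fanout=2))
```

The appeal for highly dynamic systems is that no node needs a global view: each round only requires a small random sample of peers, so the protocol tolerates churn and failures gracefully.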
Tradeoffs in Scalable Data Routing for Deduplication Clusters
A cluster-based deduplication system is presented that can deduplicate with high throughput, support deduplication ratios comparable to those of a single system, and maintain low variation in the storage utilization of individual nodes.
Probabilistic deduplication for cluster-based storage systems
Produck is proposed, a stateful, yet light-weight cluster-based backup system that provides deduplication rates close to those of a single-node system at a very low computational cost and with minimal memory overhead, and provides two main contributions: a lightweight probabilistic node-assignment mechanism and a new bucket-based load-balancing strategy.
A Scalable Inline Cluster Deduplication Framework for Big Data Protection
Cluster deduplication has become a widely deployed technology in data protection services for Big Data, driven by service-level agreement (SLA) requirements, yet it still faces significant challenges at scale.
Cassandra: a decentralized structured storage system
Cassandra is a distributed storage system for managing very large amounts of structured data spread out across many commodity servers, while providing highly available service with no single point of failure.
High Availability, Scalable Storage, Dynamic Peer Networks: Pick Two
This work uses a simple resource usage model to measured behavior from the Gnutella file-sharing network to argue that large-scale cooperative storage is limited by likely dynamics and cross-system bandwidth -- not by local disk space.
HYDRAstor: A Scalable Secondary Storage
This paper concentrates on the back-end, which is, to the authors' knowledge, the first commercial implementation of a scalable, high-performance, content-addressable secondary storage system delivering global duplicate elimination, per-block user-selectable failure resiliency, and self-maintenance, including automatic recovery from failures with data and network overlay rebuilding.
Extreme Binning: Scalable, parallel deduplication for chunk-based file backup
Extreme Binning is presented, a scalable deduplication technique for non-traditional backup workloads that are made up of individual files with no locality among consecutive files in a given window of time.
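A recurring idea in the routing-based deduplication systems listed here is to direct each file (or superchunk) to a node using a representative chunk fingerprint, so that similar files land on the same node without any shared index. The sketch below is a simplified illustration of that general pattern with hypothetical helper names, not Extreme Binning's actual implementation; it also uses fixed-size chunking for brevity where real systems use content-defined chunking.

```python
import hashlib

def chunks(data: bytes, size: int = 8):
    # Fixed-size chunking for brevity; production systems use
    # content-defined chunking (e.g. Rabin fingerprints) so that
    # insertions do not shift all subsequent chunk boundaries.
    return [data[i:i + size] for i in range(0, len(data), size)]

def fingerprint(chunk: bytes) -> int:
    # Truncated SHA-1 as a chunk fingerprint.
    return int.from_bytes(hashlib.sha1(chunk).digest()[:8], "big")

def route(data: bytes, n_nodes: int) -> int:
    """Route a whole file by its minimum chunk fingerprint: files
    sharing many chunks tend to share the minimum fingerprint and
    thus reach the same node, with no global index required."""
    return min(fingerprint(c) for c in chunks(data)) % n_nodes

a = b"hello world, hello dedup!"
b = b"hello world, hello again"   # shares leading chunks with `a`
print(route(a, 4), route(b, 4))
```

Because routing depends only on the file's own content, it is stateless and scales with the number of nodes; the cost is that deduplication becomes probabilistic across nodes rather than exact.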
On the Expressiveness and Trade-Offs of Large Scale Tuple Stores
DataDroplets is introduced, a novel tuple store that shifts the current trade-off towards the needs of common business users, providing additional consistency guarantees and higher level data processing primitives smoothing the migration path for existing applications.
Bigtable: A Distributed Storage System for Structured Data
The simple data model provided by Bigtable is described, which gives clients dynamic control over data layout and format, and the design and implementation of Bigtable are described.