Bigtable: A Distributed Storage System for Structured Data

@article{Chang2008BigtableAD,
  title={Bigtable: A Distributed Storage System for Structured Data},
  author={Fay W. Chang and Jeffrey Dean and Sanjay Ghemawat and Wilson C. Hsieh and Deborah A. Wallach and Michael Burrows and Tushar Chandra and Andrew Fikes and Robert E. Gruber},
  journal={ACM Trans. Comput. Syst.},
  year={2008},
  volume={26},
  pages={4:1-4:26}
}
Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving… 

Developing a Working Data Hub
TLDR
The simple data model provided by Bigtable is described, which gives clients dynamic control over data layout and format, and the design and implementation of Bigtable are described.
Cassandra: a decentralized structured storage system
Cassandra is a distributed storage system for managing very large amounts of structured data spread out across many commodity servers, while providing highly available service with no single point of
Cassandra-A Decentralized Structured Storage System
Cassandra is a distributed storage system for managing very large amounts of structured data spread out across many commodity servers, while providing highly available service with no single point of
Moving from Relational Data Storage to Decentralized Structured Storage System
TLDR
This research is a step towards moving relational data storage to a decentralized structured storage system (Cassandra), in order to meet users' high-availability demands for any type of data (structured and unstructured) with zero fault tolerance.
GLORY-DB: A Distributed Data Management System for Large Scale High-Dimensional Data
TLDR
The design of a highly available and scalable distributed data management system is presented, which provides content-based retrieval using a hybrid spill tree with local signature files, along with how it can be used to find nearest neighbors in cluster environments.
Scalable Storage for Data-Intensive Computing
Persistent storage is a fundamental abstraction in computing. It consists of a named set of data items that come into existence through explicit creation, persist through temporary failures of the
Clouder: a flexible large scale decentralized object store: architecture overview
TLDR
Preliminary ideas for the architecture of a flexible, efficient and dependable fully decentralized object store able to manage very large sets of variable size objects and to coordinate in place processing are presented.
An Efficient and Performance-Aware Big Data Storage System
TLDR
An in-depth analysis of the key features of future big data storage services for both unstructured and semi-structured data, and of how such services should be constructed and deployed, with a particular focus on data de-duplication for enterprises and private organisations.
Understanding query performance in Accumulo
TLDR
An Apache Accumulo-based big data system designed for a network situational awareness application is studied and its storage schema and data retrieval requirements are analyzed, and the corresponding Accumulo performance bottlenecks are characterized.
Dynamic Table: A Scalable Storage Structure in the Cloud
TLDR
A new NF2 scalable storage structure named “Dynamic Table”, based on key-value storage, is proposed, and the formal definition of the dynamic table and its implementation on HDFS are introduced.

References

SHOWING 1-10 OF 54 REFERENCES
The Google file system
TLDR
This paper presents file system interface extensions designed to support distributed applications, discusses many aspects of the design, and reports measurements from both micro-benchmarks and real world use.
MapReduce: simplified data processing on large clusters
TLDR
This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
Interpreting the data: Parallel analysis with Sawzall
TLDR
The design -- including the separation into two phases, the form of the programming language, and the properties of the aggregators -- exploits the parallelism inherent in having data and computation distributed across many machines.
Implementation techniques for main memory database systems
TLDR
This paper considers the changes necessary to permit a relational database system to take advantage of large amounts of main memory, evaluates AVL vs B+-tree access methods and hash-based query processing strategies vs sort-merge, and studies recovery issues when most or all of the database fits in main memory.
Mariposa: a new architecture for distributed data
TLDR
The design of Mariposa is described, an experimental distributed data management system that provides high performance in an environment of high data mobility and heterogeneous host capabilities and a general, flexible platform for the development of new algorithms for distributed query optimization, storage management, and scalable data storage structures.
Boxwood: Abstractions as the Foundation for Storage Infrastructure
TLDR
This paper has built a system called Boxwood to explore the feasibility and utility of providing high-level abstractions or data structures as the fundamental storage infrastructure, and has implemented an NFSv2 file service that demonstrates the promise of this approach.
Weaving Relations for Cache Performance
TLDR
This paper proposes a new data organization model called PAX (Partition Attributes Across), that significantly improves cache performance by grouping together all values of each attribute within each page, and demonstrates that in-page data placement is the key to high cache performance.
Integrating compression and execution in column-oriented database systems
TLDR
This paper shows how compression schemes not traditionally used in row-oriented DBMSs can be applied to column-oriented systems and evaluates a set of compression schemes and shows that the best scheme depends not only on the properties of the data but also on the nature of the query workload.
Chord: A scalable peer-to-peer lookup service for internet applications
TLDR
Results from theoretical analysis, simulations, and experiments show that Chord is scalable, with communication cost and the state maintained by each node scaling logarithmically with the number of Chord nodes.
DB2 Parallel Edition
TLDR
The DB2® Parallel Edition product is described, a commercial parallel database system that evolved from a prototype developed at IBM Research in Hawthorne, New York, and now is being jointly developed with the IBM Toronto laboratory.