Sector and Sphere: the design and implementation of a high-performance data cloud

@article{Gu2009SectorAS,
  title={Sector and Sphere: the design and implementation of a high-performance data cloud},
  author={Yunhong Gu and Robert L. Grossman},
  journal={Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences},
  year={2009},
  volume={367},
  pages={2429--2445}
}
  • Yunhong Gu, Robert L. Grossman
  • Published 28 June 2009
  • Computer Science
  • Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences
Cloud computing has demonstrated that processing very large datasets over commodity clusters can be done simply, given the right programming model and infrastructure. In this paper, we describe the design and implementation of the Sector storage cloud and the Sphere compute cloud. By contrast with the existing storage and compute clouds, Sector can manage data not only within a data centre, but also across geographically distributed data centres. Similarly, the Sphere compute cloud supports… 
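
At its core, the Sphere model the abstract alludes to is the parallel application of a user-defined function (UDF) to the segments of a Sector-managed dataset. Below is a minimal, hypothetical Python sketch of that contract only; the real system exposes a C++ API and runs UDFs on the nodes that already hold each segment, and the names `apply_udf` and `grep_udf` are illustrative, not part of Sector/Sphere.

```python
# Hypothetical rendering of Sphere's UDF-over-segments contract in Python.
# The real Sphere API is C++, and UDFs run on the nodes holding each segment.
from concurrent.futures import ProcessPoolExecutor

def apply_udf(udf, segments, workers=4):
    """Apply a user-defined function to every data segment in parallel."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(udf, segments))

def grep_udf(segment):
    """Example UDF: keep the records in a segment that contain 'ATGC'."""
    return [record for record in segment if "ATGC" in record]

if __name__ == "__main__":
    # Toy stand-in for two Sector-managed file segments.
    segments = [["ATGCATT", "CCGTA"], ["TTATGC", "GGGG"]]
    print(apply_udf(grep_udf, segments))  # [['ATGCATT'], ['TTATGC']]
```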

Citations

Data Mining for Data Cloud and Compute Cloud

The design of the Sector storage cloud is described, along with how it provides the storage services required by the Sphere compute cloud, and a distributed data mining application developed using Sector and Sphere is presented.

High-Performance Big Data Management Across Cloud Data Centers

This thesis presents a transfer service architecture that enables configurable cost-performance optimizations for inter-site transfers and investigates the viability of leveraging this data movement solution as a cloud-provided service, following a Transfer-as-a-Service paradigm based on a flexible pricing scheme.

Efficient Management of Geographically Distributed Big Data on Clouds

This report introduces a uniform data management system for disseminating scientific data across geographically distributed sites; the system is environment-aware, as it monitors and models the global cloud infrastructure, and offers predictable data-handling performance for transfer cost and time.

High Performance Parallel Computing with Clouds and Cloud Technologies

This paper first discusses large-scale data analysis using different MapReduce implementations and then presents a performance analysis of high-performance parallel applications on virtualized resources.

MapReduce in the Cloud: Data-Location-Aware VM Scheduling

The inefficiency of MapReduce in the cloud is studied, its causes are identified, and a solution is proposed that can significantly improve performance when running different applications.

Performance Evaluation of Data Intensive Computing In the Cloud

This study compares the Amazon Elastic Compute Cloud (Amazon EC2) and Google Compute Engine (GCE) clouds using multiple benchmarks and shows that GCE is more efficient than EC2 for data-intensive applications.

A case for MapReduce over the internet

This paper investigates real-world scenarios in which the MapReduce programming model, and specifically the Hadoop framework, could be used for processing large-scale, geographically scattered datasets, and proposes and evaluates extensions to Hadoop's MapReduce framework that improve its performance in such environments.

Managing Data-Intensive Workloads in a Cloud

A taxonomy is presented for workload management of data-intensive computing in the cloud, and the taxonomy is used to classify and evaluate current workload management mechanisms.

Toward Efficient and Simplified Distributed Data Intensive Computing

This paper presents the design and implementation of a distributed file system called Sector and an associated programming framework called Sphere, which processes the data managed by Sector in parallel and is designed so that data can be processed in place whenever possible; a toy sketch of this placement policy follows.
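
The "processing in place" principle amounts to locality-aware task placement. As a hedged illustration only, the hypothetical `schedule` function below greedily assigns each segment to a node that already stores one of its replicas, falling back to the least-loaded node; it is a stand-in for, not a reproduction of, Sphere's actual scheduling logic.

```python
# Hypothetical greedy locality-aware scheduler: run each task on a node that
# already stores its segment when possible, else on the least-loaded node.
def schedule(segment_locations, nodes):
    load = {node: 0 for node in nodes}
    assignment = {}
    for segment, replicas in segment_locations.items():
        candidates = [n for n in replicas if n in load] or nodes
        chosen = min(candidates, key=lambda n: load[n])  # least-loaded candidate
        assignment[segment] = chosen
        load[chosen] += 1
    return assignment

locations = {"seg1": ["A", "B"], "seg2": ["B"], "seg3": ["C"], "seg4": ["B", "C"]}
print(schedule(locations, ["A", "B", "C"]))
# {'seg1': 'A', 'seg2': 'B', 'seg3': 'C', 'seg4': 'B'}
```
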
...

References

SHOWING 1-10 OF 24 REFERENCES

Data mining using high performance data clouds: experimental studies using sector and sphere

The design and implementation of a high-performance cloud that is used to archive, analyze and mine large distributed datasets is described, along with a distributed data mining application developed using Sector and Sphere.

Bigtable: A Distributed Storage System for Structured Data

The simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, is described, along with the design and implementation of Bigtable.
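
For context, the Bigtable paper defines this data model as a sparse, distributed, sorted map from (row key, column key, timestamp) to an uninterpreted byte string. The toy Python sketch below renders that map using the paper's own `com.cnn.www` example row; the helpers `put` and `get_latest` are hypothetical, and nothing here reflects Bigtable's actual storage engine.

```python
# Toy rendering of Bigtable's data model: a sparse map from
# (row key, column key, timestamp) to an uninterpreted byte string.
from collections import defaultdict

table = defaultdict(dict)  # (row, column) -> {timestamp: value}

def put(row, column, timestamp, value):
    table[(row, column)][timestamp] = value

def get_latest(row, column):
    """Return the most recent version of a cell, or None if it is empty."""
    versions = table.get((row, column), {})
    return versions[max(versions)] if versions else None

# The paper's example row: a web page keyed by its reversed URL.
put("com.cnn.www", "contents:", 3, b"<html>v1</html>")
put("com.cnn.www", "contents:", 5, b"<html>v2</html>")
put("com.cnn.www", "anchor:cnnsi.com", 9, b"CNN")
print(get_latest("com.cnn.www", "contents:"))  # b'<html>v2</html>'
```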

GATES: a grid-based middleware for processing distributed data streams

  • Liang Chen, K. Reddy, G. Agrawal
  • Computer Science
    Proceedings. 13th IEEE International Symposium on High Performance Distributed Computing, 2004.
  • 2004
The system is designed to use existing grid standards and tools to the extent possible, and it flexibly achieves the best accuracy possible while maintaining the real-time constraint on the analysis.

The Google file system

This paper presents file system interface extensions designed to support distributed applications, discusses many aspects of the design, and reports measurements from both micro-benchmarks and real world use.

MapReduce: simplified data processing on large clusters

This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
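
The contract that this runtime parallelizes is just a pair of user-supplied functions: map emits intermediate key-value pairs and reduce combines all values sharing a key. Below is the canonical word-count example collapsed into a single Python process; the distributed shuffle, fault handling, and scheduling credited to the runtime are deliberately absent, and `map_fn`, `reduce_fn`, and `run_mapreduce` are illustrative names.

```python
# Canonical MapReduce word count in one process; the real runtime shards map
# tasks across machines, shuffles by key, and retries failed workers, but the
# map/reduce contract is the same.
from collections import defaultdict

def map_fn(document):
    for word in document.split():
        yield word, 1                       # emit (key, value) pairs

def reduce_fn(word, counts):
    return word, sum(counts)                # combine all values for a key

def run_mapreduce(documents):
    shuffled = defaultdict(list)
    for doc in documents:                   # "map" phase
        for key, value in map_fn(doc):
            shuffled[key].append(value)     # "shuffle": group values by key
    return dict(reduce_fn(k, v) for k, v in shuffled.items())  # "reduce"

print(run_mapreduce(["the cat sat", "the dog sat"]))
# {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}
```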

The Hadoop Distributed File System

The architecture of HDFS is described and experience using HDFS to manage 25 petabytes of enterprise data at Yahoo! is reported on.

Distributing the Sloan Digital Sky Survey Using UDT and Sector

A peer-to-peer storage system called Sector, designed to access and transport large datasets over wide-area, high-performance networks, is described and used to distribute the Sloan Digital Sky Survey BESTDR4 catalog data.

DataCutter: Middleware for Filtering Very Large Scientific Datasets on Archival Storage Systems

A middleware infrastructure, called DataCutter, that enables processing of scientific datasets stored in archival storage systems across a wide-area network, with support for subsetting datasets through multidimensional range queries and for application-specific aggregation.

Globally Distributed Content Delivery

The Akamai system has since evolved to distribute dynamically generated pages and even applications to the network's edge, providing customers with on-demand bandwidth and computing capacity; this reduces content providers' infrastructure requirements and lets them deploy or expand services more quickly and easily.

GPUTeraSort: high performance graphics co-processor sorting for large database management

Overall, the results indicate that using a GPU as a co-processor can significantly improve the performance of sorting algorithms on large databases.