Overview of sciDB: large scale array storage, processing and analysis

@article{Brown2010OverviewOS,
  title={Overview of sciDB: large scale array storage, processing and analysis},
  author={Paul G. Brown},
  journal={Proceedings of the 2010 ACM SIGMOD International Conference on Management of data},
  year={2010}
}
  • P. Brown
  • Published 6 June 2010
  • Computer Science
  • Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
SciDB [4, 3] is a new open-source data management system intended primarily for use in application domains that involve very large (petabyte) scale array data; for example, scientific applications such as astronomy, remote sensing and climate modeling, bio-science information management, risk management systems in financial applications, and the analysis of web log data. In this talk we will describe our set of motivating examples and use them to explain the features of SciDB. We then briefly… 

SciQL: array data processing inside an RDBMS

This demo presents a proof-of-concept implementation of SciQL in the relational database system MonetDB, and demonstrates the storage of arrays in MonetDB as first-class citizens and the execution of a comprehensive set of basic operations on arrays.

Parallel query evaluation as a Scientific Data Service

This paper introduces the design and implementation of one such service, the parallel querying service, which achieves 22X, 55X, and 62X speedups over the conventional full-scan approach to sifting through data when answering three queries from a plasma physics analysis application.

SciQL, a query language for science applications

SciQL provides a seamless symbiosis of array-, set-, and sequence-interpretation using a clear separation of the mathematical object from its underlying implementation, and leads to a generalization of window-based query processing with wide applicability in science domains.

Array Database Scalability: Intercontinental Queries on Petabyte Datasets

This demonstration aims to showcase the capabilities of rasdaman by allowing users to execute queries that combine petabyte datasets stored at two institutions on different continents.

FASTDB: An Array Database System for Efficient Storing and Analyzing Massive Scientific Data

FASTDB is presented, a distributed array database system that is optimized for massive scientific data management, provides shared-nothing, parallel array processing and analysis, and can be significantly faster than the traditional database-based SkyServer in many typical analytical scenarios.

Parallel Query Service for Object-centric Data Management Systems

This paper introduces a parallel query service, called PDC-Query, for object data management systems (ODMS) on HPC systems, which operates on partitioned objects in parallel and provides several optimization strategies for fast query evaluation.

Selective Scan for Filter Operator of SciDB

It is demonstrated that the implementation of the filter operator significantly reduces the processing time of a selection query and enables SciDB to handle a massive amount of scientific data in a more scalable manner.

Scalable parallel data loading in SciDB

This work streamlines the conversion process and modifies the distribution method in the loading stages of SciDB to reduce overhead and eliminate two heavy-duty steps, namely sort and redistribution, which account for a dominant portion of the redimensioning cost.

DataJoint: managing big scientific data using MATLAB or Python

DataJoint is described, an open-source toolbox designed for manipulating and processing scientific data under the relational data model that facilitates multiuser access, efficient queries, and distributed computing.

SAGA: array storage as a DB with support for structural aggregations

This paper presents algorithms, different partitioning strategies, and an analytical model for supporting structural (grid, sliding, hierarchical, and circular) aggregations over native array storage, and describes implementation of this approach in a system referred to as S-AGgregations over A-rray storage (SAGA).

References

MapReduce: simplified data processing on large clusters

This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.

A Demonstration of SciDB: A Science-Oriented DBMS

An overview of SciDB's key features is presented, along with a demonstration of the first version of SciDB on data and operations from one of the authors' lighthouse users, the Large Synoptic Survey Telescope (LSST).

C-Store: A Column-oriented DBMS

Preliminary performance data on a subset of TPC-H is presented and it is shown that the system the team is building, C-Store, is substantially faster than popular commercial products.

The design of POSTGRES

The main design goals of the new system are to provide better support for complex objects, provide user extendibility for data types, operators and access methods, and provide facilities for active databases and inferencing, including forward- and backward-chaining.

Breaking the memory wall in MonetDB

This paper reports how research around the MonetDB database system has led to a redesign of database architecture in order to take advantage of modern hardware, and in particular to avoid hitting the memory wall.

Additional Authors

  • Medicine
  • 2011

The SciDB Development Team are

Hubble space telescope servicing mission 4 fact sheet

  • 2007

Zetics)

Readers interested in learning more, or volunteering to help