Automated partitioning design in parallel database systems

@inproceedings{Nehme2011AutomatedPD,
  title={Automated partitioning design in parallel database systems},
  author={Rimma V. Nehme and Nicolas Bruno},
  booktitle={SIGMOD '11},
  year={2011}
}
In recent years, Massively Parallel Processors (MPPs) have gained ground, enabling vast amounts of data processing. […] Our tool recommends which tables should be replicated (i.e., copied into every compute node) and which ones should be distributed according to specific column(s) so that the cost of evaluating similar workloads is minimized. In contrast to previous work, our techniques are deeply integrated with the underlying parallel query optimizer, which results in more accurate recommendations…
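To make the replicate-versus-distribute decision concrete, here is a minimal sketch of the general idea. It is not the authors' algorithm: it enumerates every design by brute force and scores it with a toy cost model, whereas the paper relies on the parallel query optimizer for costing. The names `recommend` and `design_cost`, the `storage_weight` parameter, and the example tables are all hypothetical.

```python
from itertools import product

REPLICATE = "REPLICATE"  # marker meaning "copy the table to every compute node"


def design_cost(design, sizes, joins, nodes, storage_weight=0.1):
    """Toy cost (an assumption): data moved by joins plus a replication penalty."""
    cost = 0.0
    for table, choice in design.items():
        if choice == REPLICATE:
            # Each replicated table costs (nodes - 1) extra copies.
            cost += storage_weight * sizes[table] * (nodes - 1)
    for left, lcol, right, rcol, freq in joins:
        # A join needs no data movement if one side is replicated or if
        # both sides are distributed on their respective join columns.
        local = (
            design[left] == REPLICATE
            or design[right] == REPLICATE
            or (design[left] == lcol and design[right] == rcol)
        )
        if not local:
            # Assume the smaller input is reshuffled across the network.
            cost += freq * min(sizes[left], sizes[right])
    return cost


def recommend(sizes, columns, joins, nodes):
    """Return (cost, design) for the cheapest design under the toy model."""
    tables = sorted(sizes)
    options = [[REPLICATE] + sorted(columns[t]) for t in tables]
    best = None
    for combo in product(*options):
        design = dict(zip(tables, combo))
        cost = design_cost(design, sizes, joins, nodes)
        if best is None or cost < best[0]:
            best = (cost, design)
    return best


# Hypothetical workload: (left table, left column, right table, right column, frequency).
sizes = {"sales": 4000, "orders": 1000, "region": 1}
columns = {
    "sales": ["order_id"],
    "orders": ["order_id", "region_id"],
    "region": ["region_id"],
}
joins = [
    ("sales", "order_id", "orders", "order_id", 10),
    ("orders", "region_id", "region", "region_id", 5),
]
best_cost, best_design = recommend(sizes, columns, joins, nodes=4)
```

Under these assumed numbers the search co-distributes the two large tables on their shared join column and replicates the tiny one. Exhaustive enumeration grows exponentially with the number of tables, so it only serves to illustrate the shape of the search space.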
Locality-aware Partitioning in Parallel Database Systems
TLDR
This paper presents a novel partitioning scheme called predicate-based reference partition (or PREF) that co-partitions sets of tables based on given join predicates, which effectively reduces the runtime of queries under a given workload compared to existing partitioning approaches.
Resource Bricolage for Parallel Database Systems
TLDR
The approach quantifies the performance differences among machines with various resources as they process workloads with diverse resource requirements, formalizes the problem of minimizing workload execution time as an optimization problem, and employs linear programming to obtain a recommended data partitioning scheme.
Resource Bricolage for Parallel DBMSs on Heterogeneous Clusters
TLDR
This work introduces a technique it calls resource bricolage that improves database performance in heterogeneous environments and quantifies the performance differences among machines with various resources as they process workloads with diverse resource requirements.
AdaptDB: Adaptive Partitioning for Distributed Joins
TLDR
This paper presents AdaptDB, an adaptive storage manager for analytical database workloads in a distributed setting. It partitions datasets across a cluster, incrementally refines the data partitioning as queries are run, and introduces a novel hyper-join.
Clay: Fine-Grained Adaptive Partitioning for General Database Schemas
TLDR
A new on-line partitioning approach, called Clay, that supports both tree-based schemas and more complex "general" schemas with arbitrary foreign key relationships is presented and it is shown that it can generate partitioning schemes that enable the system to achieve up to 15× better throughput and 99% lower latency than existing approaches.
Dynamic Workload-Based Partitioning Algorithms for Continuously Growing Databases
TLDR
DynPart and DynPartGroup are proposed, two dynamic partitioning algorithms for continuously growing databases that efficiently adapt the data partitioning to the arrival of new data elements by taking into account the affinity of new data with queries and fragments.
On Scalable Transaction Execution in Partitioned Main Memory Database Management Systems
TLDR
This dissertation presents the design of H-Store, a distributed, main memory DBMS that is optimized for short-lived, write-heavy transactional workloads, and presents a Markov model-based approach for automatically selecting which optimizations to enable at run time for new transaction requests based on their most likely behavior.
Skew-aware automatic database partitioning in shared-nothing, parallel OLTP systems
TLDR
A novel approach to automatically partitioning databases for enterprise-class OLTP systems is presented that significantly extends the state of the art by minimizing the number of distributed transactions while concurrently mitigating the effects of temporal skew in both the data distribution and accesses.
Resource bricolage and resource selection for parallel database systems
TLDR
These approaches quantify the performance differences among machines with various resources as they process workloads with diverse resource requirements and introduce techniques the authors call resource bricolage and resource selection that improve database performance in heterogeneous environments.

References

Automating physical database design in a parallel database
TLDR
This work seeks to automate the process of data partitioning in a shared-nothing parallel database system by using the query optimizer itself both to recommend candidate partitions for each table that will benefit each query in the workload, and to evaluate various combinations of these candidates.
Physical database design decision algorithms and concurrent reorganization for parallel database systems
TLDR
The studies indicate that giving the reorganization process a low priority relative to the workload processes is often, but not always, best. A method to estimate the cost of executing a reorganization alongside a workload is developed, along with some decision algorithms.
HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads
TLDR
This paper explores the feasibility of building a hybrid system that takes the best features from both technologies; the prototype built approaches parallel databases in performance and efficiency, yet still yields the scalability, fault tolerance, and flexibility of MapReduce-based systems.
SCOPE: easy and efficient parallel processing of massive data sets
TLDR
This paper presents SCOPE (Structured Computations Optimized for Parallel Execution), a new declarative and extensible scripting language targeted for this type of massive data analysis. It is designed for ease of use with no explicit parallelism, while being amenable to efficient parallel execution on large clusters.
Two techniques for on-line index modification in shared nothing parallel databases
TLDR
It is shown that BULK is an order of magnitude faster than OAT in terms of the impact on transaction performance during reorganization: when the number of indexes to be modified is either one or two, OAT has a lesser impact on the transaction performance degradation; however, when the number of indexes is greater than two, both techniques have the same impact on transaction performance.
Integrating vertical and horizontal partitioning into automated physical database design
TLDR
This paper presents novel techniques for designing a scalable solution to this integrated physical design problem that takes both performance and manageability into account and implements it on Microsoft SQL Server.
Schism: a Workload-Driven Approach to Database Replication and Partitioning
TLDR
Schism consistently outperforms simple partitioning schemes, and in some cases proves superior to the best known manual partitioning, reducing the cost of distributed transactions by up to 30%.
Map-reduce-merge: simplified relational data processing on large clusters
TLDR
A Merge phase is added to Map-Reduce that can efficiently merge data already partitioned and sorted by map and reduce modules, and it is demonstrated that this new model can express relational algebra operators as well as implement several join algorithms.
Parallel database systems: the future of high performance database systems
TLDR
Over the last decade, Teradata, Tandem, and a host of startup companies have successfully developed and marketed highly parallel machines, refuting a 1983 paper predicting the demise of database machines.
Distributed query evaluation with performance guarantees
TLDR
This paper provides evaluation algorithms and optimizations for generic XPath queries in the same distributed and fragmented setting that explore parallelism and retain the performance guarantees of their counterpart for Boolean queries, regardless of how the tree is fragmented and distributed.