State-space optimization of ETL workflows

@article{Simitsis2005StatespaceOO,
  title={State-space optimization of ETL workflows},
  author={Alkis Simitsis and Panos Vassiliadis and Timos K. Sellis},
  journal={IEEE Transactions on Knowledge and Data Engineering},
  year={2005},
  volume={17},
  pages={1404-1419}
}
Extraction-transformation-loading (ETL) tools are pieces of software responsible for the extraction of data from several sources, their cleansing, customization, and insertion into a data warehouse. In this paper, we delve into the logical optimization of ETL processes, modeling it as a state-space search problem. We consider each ETL workflow as a state and fabricate the state space through a set of correct state transitions. Moreover, we provide an exhaustive and two heuristic algorithms…
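To make the state-space framing in the abstract concrete, the following is a minimal sketch and not the paper's algorithms or cost model. It assumes a purely linear workflow, a single transition type (swapping two adjacent activities), a toy correctness rule based on read/written attributes, and a simple rows-times-unit-cost estimate. The names `Activity`, `workflow_cost`, `can_swap`, `neighbors`, and `exhaustive_search`, as well as the example activities and their numbers, are hypothetical and introduced only for illustration.

```python
from collections import deque
from typing import NamedTuple


class Activity(NamedTuple):
    name: str
    selectivity: float    # fraction of input rows passed on (assumed value)
    cost_per_row: float   # processing cost per input row (assumed value)
    reads: frozenset      # attributes the activity reads
    writes: frozenset     # attributes the activity creates or modifies


def workflow_cost(state, input_rows):
    """Toy cost model: each activity pays cost_per_row for every row it receives."""
    total, rows = 0.0, float(input_rows)
    for act in state:
        total += rows * act.cost_per_row
        rows *= act.selectivity
    return total


def can_swap(a, b):
    """Assumed correctness rule: neither activity touches what the other writes."""
    return not (a.writes & (b.reads | b.writes)) and not (b.writes & a.reads)


def neighbors(state):
    """Yield every state reachable by one legal swap of adjacent activities."""
    for i in range(len(state) - 1):
        a, b = state[i], state[i + 1]
        if can_swap(a, b):
            yield state[:i] + (b, a) + state[i + 2:]


def exhaustive_search(initial, input_rows):
    """Breadth-first enumeration of all reachable states, keeping the cheapest."""
    best, best_cost = initial, workflow_cost(initial, input_rows)
    seen = {initial}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        for nxt in neighbors(state):
            if nxt in seen:
                continue
            seen.add(nxt)
            queue.append(nxt)
            cost = workflow_cost(nxt, input_rows)
            if cost < best_cost:
                best, best_cost = nxt, cost
    return best, best_cost, len(seen)


# Hypothetical three-activity workflow; all figures are made up for the example.
convert = Activity("convert_currency", 1.0, 2.0,
                   frozenset({"amount"}), frozenset({"amount"}))
not_null = Activity("filter_not_null", 0.9, 1.0,
                    frozenset({"customer_id"}), frozenset())
select = Activity("filter_amount", 0.1, 1.0,
                  frozenset({"amount"}), frozenset())

best, cost, explored = exhaustive_search((convert, not_null, select), 1000)
print([a.name for a in best], cost, explored)
```

In this sketch the selective filter on `amount` cannot move ahead of the currency conversion because it reads the attribute the conversion rewrites, so the search only visits orderings that respect that dependency. Exhaustive enumeration is workable for a handful of activities; the state space grows quickly with workflow size, which is the usual motivation for heuristic variants.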
Macro-level Scheduling of ETL Workflows
TLDR
This paper experimentally shows that the use of different scheduling policies can improve ETL performance in terms of memory consumption and execution time.
An Efficient Heuristic for Logical Optimization of ETL Workflows
TLDR
This paper identifies activities that can be transferred between linear segments in an ETL flow and uses the re-orderings of the linear segments to obtain a cost-optimal semantically equivalent flow for a given complex flow.
From conceptual design to performance optimization of ETL workflows: current state of research and open problems
TLDR
This paper explains the existing techniques for constructing a conceptual and a logical model of an ETL workflow, its corresponding physical implementation, and its optimization, and proposes a theoretical ETL framework for ETL optimization.
Deciding the physical implementation of ETL workflows
TLDR
The problem of determining the best possible physical implementation of an ETL workflow, given its logical-level description and an appropriate cost model as inputs, is formulated as a state-space problem and a suitable solution is provided.
Scheduling strategies for efficient ETL execution
Novel approach in ETL
  • A. Prema, A. Pethalakshmi
  • Computer Science
    2013 International Conference on Pattern Recognition, Informatics and Mobile Engineering
  • 2013
TLDR
A new ETL, titled Hyper-ETL, is designed to increase the efficiency of the ETL process; it allows the integration of XML document files and an Oracle data warehouse to reduce execution time and to remove the mismanagement of metadata in an existing ETL process.
ETL Workflows: From Formal Specification to Optimization
TLDR
The goal of this research was to facilitate, manage, and optimize the design and implementation of ETL workflows, both during the initial design and deployment stage and during the continuous evolution of a data warehouse.
Data warehouse ETL+Q auto-scale framework
TLDR
This paper proposes parallelisation solutions, called AScale, for each part of the ETL+Q, that is, an approach that enables the automatic scalability and freshness of any data warehouse and ETL+Q process.
Determining Essential Statistics for Cost Based Optimization of an ETL Workflow
TLDR
This paper proposes an optimization framework to choose a set of statistics to collect for a given workflow, using which the optimizer can estimate the cost of any alternative plan for the workflow, and experimentally demonstrates the effectiveness and efficiency of the proposed algorithms.
ETL Workflow Analysis and Verification Using Backwards Constraint Propagation
TLDR
A novel approach, Backwards Constraint Propagation (BCP), is proposed that automatically analyzes ETL workflows and verifies the target-end restrictions at their earliest points and supports most relational algebra operators and data transformation functions.

References

SHOWING 1-10 OF 31 REFERENCES
Optimizing ETL processes in data warehouses
TLDR
This paper delves into the logical optimization of ETL processes, modeling it as a state-space search problem and provides algorithms towards the minimization of the execution cost of an ETL workflow.
Modeling ETL activities as graphs
TLDR
This paper focuses on the logical design of the ETL scenario of a data warehouse, which is based on a formal logical model that includes the data stores, activities and their constituent parts as a graph, which is called the Architecture Graph.
A Framework for the Design of ETL Scenarios
TLDR
This paper describes a framework for the declarative specification of ETL scenarios with two main characteristics: genericity and customization and presents a palette of several templates, representing frequently used ETL activities along with their semantics and their interconnection.
Lineage tracing for general data warehouse transformations
TLDR
This work formally defines the lineage tracing problem in the presence of general data warehouse transformations, and presents algorithms for lineage tracing in this environment, and can be used as the basis for a lineage tracing tool in a general warehousing setting.
Efficient resumption of interrupted warehouse loads
TLDR
This work develops a resumption algorithm called DR that imposes no overhead and relies only on the high-level properties of the transformations of the data, and shows, through experiments using commercial software, that DR can lead to a ten-fold reduction in resumption time.
Data Transformation Services
TLDR
OLE DB provides an infrastructure that allows developers to connect to multiple, unrelated data sources using identical code; Microsoft developed OLE DB as a component of Universal Data Access, its new strategy for accessing information across the enterprise.
Data Warehouse Configuration
In the data warehousing approach to the integration of data from multiple information sources, selected information is extracted in advance and stored in a repository. A data warehouse (DW) can
Query Optimization in Database Systems
TLDR
These methods are presented in the framework of a general query evaluation procedure using the relational calculus representation of queries, and nonstandard query optimization issues such as higher level query evaluation, query optimization in distributed databases, and use of database machines are addressed.
Data Warehouse Population Platform
TLDR
A generalised platform for the population of data warehouses, named Data Warehouse Population Platform (DWPP): a set of modules whose aim is to resolve typical issues arising during the transformation and loading of vast amounts of data into a data warehouse.
AJAX: an extensible data cleaning tool
TLDR
The AJAX system is applied to two real-world problems: the consolidation of a telecommunication database, and the conversion of a dirty database of bibliographic references into a set of clean, normalized, and redundancy-free relational tables maintaining the same data.