Hamake: A Data Flow Approach to Data Processing in Hadoop

Vadim Zaliva and Vladimir Orlov
Most non-trivial data processing scenarios using Hadoop involve launching more than one MapReduce job. Usually, such processing is data-driven, with the data funneled through a sequence of jobs. The processing model can be expressed in terms of dataflow programming, represented as a directed graph with datasets as vertices. Using fuzzy timestamps to detect which datasets need to be updated, we can calculate a sequence in which Hadoop jobs should be launched to bring all…
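The abstract's core idea can be sketched in a few lines: treat datasets as vertices of a dependency graph, mark a dataset stale when an input is newer than it by more than some tolerance (standing in for the paper's fuzzy timestamps), and use a topological order of the graph as the job launch sequence. This is a minimal illustration with hypothetical dataset names, timestamps, and a `FUZZ` tolerance, not Hamake's actual implementation:

```python
from graphlib import TopologicalSorter

# Hypothetical dataset timestamps (epoch seconds) and dependency edges.
timestamps = {"raw": 100, "cleaned": 90, "report": 95}
# deps maps each derived dataset to the datasets it is built from.
deps = {"cleaned": {"raw"}, "report": {"cleaned"}}

FUZZ = 5  # tolerance: timestamp differences within FUZZ are treated as equal

def is_stale(ds):
    """A dataset is stale if some input is newer by more than FUZZ seconds."""
    return any(timestamps[src] - timestamps[ds] > FUZZ for src in deps.get(ds, ()))

# A topological order over the dataset graph yields a valid launch sequence;
# a rebuild also propagates to every downstream dataset.
order = list(TopologicalSorter(deps).static_order())
to_run = []
for ds in order:
    if ds in deps and (is_stale(ds) or any(s in to_run for s in deps[ds])):
        to_run.append(ds)

print(to_run)  # the jobs to launch, in dependency order
```

Here `cleaned` is stale relative to `raw`, so both `cleaned` and the downstream `report` are scheduled, in that order.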


PROB: A tool for Tracking Provenance and Reproducibility of Big Data Experiments
This work proposes a tool that helps researchers improve the reproducibility of their experiments through automated keeping of provenance records.
An extensible graph reduction machine for the implementation of workflows
Graph reduction machines are traditionally used in the implementation of programming languages. They allow programs (represented as graphs) to be executed through the application


Nova: continuous Pig/Hadoop workflows
Describes Nova, a workflow manager developed and deployed at Yahoo, which pushes continually arriving data through graphs of Pig programs executing on Hadoop clusters; this model is a good fit for a large fraction of Yahoo's data processing use cases.
Efficient clustering of high-dimensional data sets with application to reference matching
This work presents a new technique for clustering large datasets, using a cheap, approximate distance measure to efficiently divide the data into overlapping subsets the authors call canopies, and presents experimental results on grouping bibliographic citations from the reference sections of research papers.
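The canopy technique described above can be illustrated with a short sketch: pick a random point as a canopy center, add every point within a loose threshold `t1` to the canopy, and remove from further consideration every point within a tighter threshold `t2`, so that canopies may overlap. This is a simplified illustration of the general idea with hypothetical parameter names, not the paper's exact procedure:

```python
import random

def canopy_clustering(points, t1, t2, distance):
    """Group points into (possibly overlapping) canopies.

    t1 > t2: points within t2 of a chosen center are removed from the
    candidate pool; points within t1 join the canopy and may also
    appear in later canopies.
    """
    assert t1 > t2
    pool = list(points)
    canopies = []
    while pool:
        center = pool.pop(random.randrange(len(pool)))
        canopy = [center]
        remaining = []
        for p in pool:
            d = distance(center, p)
            if d < t1:
                canopy.append(p)
            if d >= t2:
                remaining.append(p)
        pool = remaining
        canopies.append(canopy)
    return canopies

# Two well-separated 1-D groups fall into two canopies.
groups = canopy_clustering([0, 1, 2, 10, 11, 12], t1=5, t2=3,
                           distance=lambda a, b: abs(a - b))
```

Because the cheap distance only gates canopy membership, an expensive exact metric can then be applied within each canopy rather than across the whole dataset.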
Kangaroo: Reliable Execution of Scientific Applications with DAG Programming Model
The implementation of the Kangaroo system is described, designs for scheduling and fault tolerance are discussed, and performance is evaluated with a dense matrix inversion program; the results demonstrate that scheduling policies have a strong effect on program performance.
Topological sorting of large networks
Hamake syntax reference (2010). Retrieved February 06, 2012.
Cascading (2010). Retrieved February 08, 2012.
Hadoop at Twitter (2010). http://engineering.twitter.com/2010/04/hadoop-at-twitter.html. Retrieved February 06, 2012.
Benchmarking and optimizing Hadoop (2010). Retrieved February 06, 2012.
Topological sorting of large networks (1962). Communications of the ACM.