Hamake: A Data Flow Approach to Data Processing in Hadoop

Vadim Zaliva and Vladimir Orlov
Most non-trivial data processing scenarios using Hadoop involve launching more than one MapReduce job. Usually, such processing is data-driven, with the data funneled through a sequence of jobs. The processing model can be expressed in terms of dataflow programming, represented as a directed graph with datasets as vertices. Using fuzzy timestamps as a way to detect which dataset needs to be updated, we can calculate a sequence in which Hadoop jobs should be launched to bring all…
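
To make the idea in the abstract concrete, the following is a minimal sketch of timestamp-driven job planning over a dataflow graph. It is not Hamake's implementation: the function `plan_jobs`, the job and dataset names, and the `tolerance` parameter (one possible reading of "fuzzy" timestamp comparison) are all assumptions made for illustration. Each job is described by the datasets it reads and writes, jobs are ordered topologically, and a job is scheduled when one of its outputs is missing, older than its newest input, or downstream of a job that is already scheduled.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def plan_jobs(jobs, mtimes, tolerance=0.0):
    """Return the names of jobs to run, in a valid execution order.

    jobs      -- dict: job name -> (list of input datasets, list of output datasets)
    mtimes    -- dict: dataset -> modification timestamp (absent if the dataset is missing)
    tolerance -- hypothetical slack, in the same units as the timestamps, so that
                 small clock differences do not trigger a rebuild
    """
    # Map each dataset to the job that produces it.
    producer = {out: name for name, (_, outs) in jobs.items() for out in outs}

    # A job depends on the producers of its inputs.
    deps = {
        name: {producer[i] for i in ins if i in producer}
        for name, (ins, _) in jobs.items()
    }

    stale = set()  # datasets that a scheduled job will rewrite
    plan = []
    for name in TopologicalSorter(deps).static_order():
        ins, outs = jobs[name]

        # Rerun if any input is about to be regenerated upstream.
        needs_run = any(i in stale for i in ins)

        if not needs_run:
            newest_in = max(
                (mtimes[i] for i in ins if i in mtimes), default=float("-inf")
            )
            # Rerun if an output is missing or older than the newest input.
            needs_run = any(
                o not in mtimes or mtimes[o] + tolerance < newest_in for o in outs
            )

        if needs_run:
            plan.append(name)
            stale.update(outs)
    return plan


# Hypothetical two-step pipeline: raw -> clean -> report.
jobs = {
    "extract": (["raw"], ["clean"]),
    "aggregate": (["clean"], ["report"]),
}
```

With `mtimes = {"raw": 100, "clean": 50, "report": 200}`, the `clean` dataset is older than `raw`, so both jobs are scheduled; if only `report` is out of date, only `aggregate` is. Propagating staleness through the `stale` set is a design choice of this sketch: it avoids rereading timestamps after each job at the cost of assuming every scheduled job rewrites all of its outputs.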

