• Publications
The Stratosphere platform for big data analytics
We present Stratosphere, an open-source software stack for parallel data analysis. Stratosphere brings together a unique set of features that allow the expressive, easy, and efficient programming of ...
  • 397
  • 24
Apache Flink: Stream Analytics at Scale
Summary form only given. Apache Flink is an open source system for expressive, declarative, fast, and efficient data analysis on both historical (batch) and real-time (streaming) data. Flink combines ...
  • 33
  • 7
Collaborative Filtering with Apache Mahout
Apache Mahout [1] is an Apache-licensed, open source library for scalable machine learning. It is well known for algorithm implementations that run in parallel on a cluster of machines using the ...
  • 38
  • 6
Automatically Tracking Metadata and Provenance of Machine Learning Experiments
We present a lightweight system to extract, store and manage metadata and provenance information of common artifacts in machine learning (ML) experiments: datasets, models, predictions, evaluations ...
  • 48
  • 5
"Deep" Learning for Missing Value Imputation in Tables with Non-Numerical Data
The success of applications that process data critically depends on the quality of the ingested data. Completeness of a data source is essential in many cases. Yet, most missing value imputation ...
  • 23
  • 4
Distributed matrix factorization with mapreduce using a series of broadcast-joins
The efficient, distributed factorization of large matrices on clusters of commodity machines is crucial to applying latent factor models in industrial-scale recommender systems. We propose an ...
  • 48
  • 3
"All roads lead to Rome": optimistic recovery for distributed iterative data processing
Executing data-parallel iterative algorithms on large datasets is crucial for many advanced analytical applications in the fields of data mining and machine learning. Current systems for executing ...
  • 54
  • 3
Scalable similarity-based neighborhood methods with MapReduce
Similarity-based neighborhood methods, a simple and popular approach to collaborative filtering, infer their predictions by finding users with similar taste or items that have been similarly rated. ...
  • 48
  • 3
Factorbird - a Parameter Server Approach to Distributed Matrix Factorization
We present Factorbird, a prototype of a parameter server approach for factorizing large matrices with Stochastic Gradient Descent-based algorithms. We designed Factorbird to meet the following ...
  • 27
  • 3
On Challenges in Machine Learning Model Management
The training, maintenance, deployment, monitoring, organization and documentation of machine learning (ML) models – in short model management – is a critical task in virtually all production ML use ...
  • 30
  • 3