Helix: Accelerating Human-in-the-loop Machine Learning

@article{Xin2018HelixAH,
  title={Helix: Accelerating Human-in-the-loop Machine Learning},
  author={Doris Xin and Litian Ma and Jialin Liu and Stephen Macke and Shuchen Song and Aditya G. Parameswaran},
  journal={Proc. VLDB Endow.},
  year={2018},
  volume={11},
  pages={1958--1961}
}
Data application developers and data scientists spend an inordinate amount of time iterating on machine learning (ML) workflows -- by modifying the data pre-processing, model training, and post-processing steps -- via trial and error to achieve the desired model performance. Existing work on accelerating machine learning focuses on speeding up one-shot execution of workflows, failing to address the incremental and dynamic nature of typical ML development. We propose Helix, a declarative machine…
Citations

Helix: Holistic Optimization for Accelerating Iterative Machine Learning
Empirical evaluation shows that Helix is not only able to handle a wide variety of use cases in one unified workflow but also much faster, providing run time reductions of up to 19x over state-of-the-art systems such as DeepDive or KeystoneML.
Data Civilizer 2.0: A Holistic Framework for Data Preparation and Analytics
This paper introduces Data Civilizer 2.0, an end-to-end workflow system satisfying the requirements of both data cleaning and machine learning development, and shows how this system was used to help scientists at the Massachusetts General Hospital build their cleaning and machine learning pipeline on their 30TB brain activity dataset.
A Human-in-the-loop Perspective on AutoML: Milestones and the Road Ahead
The vision for a Mixed-Initiative machine Learning Environment (MILE) is outlined, rethinking the role that automation and human supervision play across the ML development lifecycle, to enable a better user experience and system optimizations that both leverage human input and are tailored to interaction with a human in the loop.
RASL: Relational Algebra in Scikit-Learn Pipelines
Integrating data preparation with machine-learning (ML) pipelines has been a longstanding challenge. Prior work tried to solve it by building new data processing platforms such as MapReduce or Spark, …
Optimizing Machine Learning Workloads in Collaborative Environments
This paper presents a system to optimize the execution of ML workloads in collaborative environments by reusing previously performed operations and their results, and devises a linear-time reuse algorithm to find the optimal execution plan for incoming ML workloads.
Debugging large-scale data science pipelines using dagger
Dagger is introduced, an end-to-end system to debug and mitigate data-centric errors in data pipelines, such as a data transformation gone wrong or a classifier underperforming due to noisy training data.
Dagger: A Data (not code) Debugger
Dagger (Data Debugger) is an end-to-end data debugger that abstracts key data-centric primitives to enable users to quickly identify and mitigate data-related problems in a given pipeline.
Technical Report on Data Integration and Preparation
This report focuses on a survey of state-of-the-art data integration tools and techniques, a deep dive into an exemplar data integration tool, and a deep dive into the evolving field of knowledge graphs.
Pipeline Combinators for Gradual AutoML
Automated machine learning (AutoML) can make data scientists more productive. But if machine learning is totally automated, that leaves no room for data scientists to apply their intuition. Hence, …
Active Reinforcement Learning for Data Preparation: Learn2Clean with Human-In-The-Loop
Learn2Clean+HIL is presented, a novel contribution enhancing Learn2Clean with the "human-in-the-loop": given a dataset, an ML model, and a preselected quality performance metric, it selects, with the help of the user, the optimal sequence of preprocessing tasks such that the quality metric is maximized.

References

Showing 1-10 of 16 references
Helix: Holistic Optimization for Accelerating Iterative Machine Learning
Empirical evaluation shows that Helix is not only able to handle a wide variety of use cases in one unified workflow but also much faster, providing run time reductions of up to 19x over state-of-the-art systems such as DeepDive or KeystoneML.
MLbase: A Distributed Machine-learning System
This work presents the vision for MLbase, a novel system harnessing the power of machine learning for both end-users and ML researchers, which provides a simple declarative way to specify ML tasks and a novel optimizer to select and dynamically adapt the choice of learning algorithm.
MLlib: Machine Learning in Apache Spark
MLlib is presented, Spark's open-source distributed machine learning library that provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives.
TuPAQ: An Efficient Planner for Large-scale Predictive Analytic Queries
TuPAQ, a component of the MLbase system, is proposed, which solves the PAQ planning problem with comparable quality to exhaustive strategies but an order of magnitude more efficiently than the standard baseline approach, and can scale to models trained on terabytes of data across hundreds of machines.
KeystoneML: Optimizing Pipelines for Large-Scale Advanced Analytics
KeystoneML is presented, a system that captures and optimizes end-to-end large-scale machine learning applications for high-throughput training in a distributed environment, with a high-level API that offers increased ease of use and higher performance over existing systems for large-scale learning.
Scikit-learn: Machine Learning in Python
Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing …
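The composable estimator API this entry describes can be shown with a minimal sketch (synthetic data; assumes scikit-learn is installed):

```python
# A two-stage scikit-learn pipeline: feature scaling followed by a
# linear classifier, trained on a tiny synthetic dataset.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [3.0, 3.0]]
y = [0, 0, 1, 1]

pipe = Pipeline([
    ("scale", StandardScaler()),    # zero-mean, unit-variance features
    ("clf", LogisticRegression()),  # linear classifier on scaled features
])
pipe.fit(X, y)
preds = pipe.predict(X)
```

Chaining steps in a single `Pipeline` object is what lets systems like Helix or KeystoneML treat a whole workflow as one optimizable unit.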
Towards a unified architecture for in-RDBMS analytics
This work proposes a unified architecture for in-database analytics that requires changes to only a few dozen lines of code to integrate a new statistical technique, and demonstrates the feasibility of this architecture by integrating several popular analytics techniques into two commercial and one open-source RDBMS.
Spark SQL: Relational Data Processing in Spark
Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API, and includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language.
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Resilient Distributed Datasets (RDDs) are presented: a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner, implemented in a system called Spark, which is evaluated through a variety of user applications and benchmarks.
The Stanford CoreNLP Natural Language Processing Toolkit
The design and use of the Stanford CoreNLP toolkit, an extensible pipeline that provides core natural language analysis, is described; its adoption is attributed to a simple, approachable design, straightforward interfaces, the inclusion of robust and good-quality analysis components, and not requiring a large amount of associated baggage.