Accelerating Human-in-the-loop Machine Learning: Challenges and Opportunities

@article{Xin2018AcceleratingHM,
  title={Accelerating Human-in-the-loop Machine Learning: Challenges and Opportunities},
  author={Doris Xin and Litian Ma and Jialin Liu and Stephen Macke and Shuchen Song and Aditya G. Parameswaran},
  journal={Proceedings of the Second Workshop on Data Management for End-To-End Machine Learning},
  year={2018}
}
Development of machine learning (ML) workflows is a tedious process of iterative experimentation: developers repeatedly make changes to workflows until the desired accuracy is attained. We describe our vision for a "human-in-the-loop" ML system that accelerates this process: by intelligently tracking changes and intermediate results over time, such a system can enable rapid iteration, quick responsive feedback, introspection and debugging, and background execution and automation. We finally… 

Helix: Holistic Optimization for Accelerating Iterative Machine Learning
TLDR
Empirical evaluation shows that Helix not only handles a wide variety of use cases in one unified workflow but is also much faster, providing run time reductions of up to 19x over state-of-the-art systems such as DeepDive or KeystoneML.
Exploring the Role of Machine Learning in Scientific Workflows: Opportunities and Challenges
TLDR
This survey discusses the challenges of executing scientific workflows as well as existing Machine Learning (ML) techniques to alleviate those challenges and provides suggestions for improving the performance of their execution using ML techniques.
Alpine Meadow: A System for Interactive AutoML
TLDR
Alpine Meadow is presented, a first Interactive Automated Machine Learning tool that significantly outperforms the other AutoML systems while, in contrast to those systems, providing interactive latencies; it also outperforms expert solutions in 80% of the cases on data sets the authors have never seen before.
Towards a Human-in-the-Loop Library for Tracking Hyperparameter Tuning in Deep Learning Development
TLDR
This paper presents DL-Steer, the first prototype to help trainers fine-tune hyperparameters and to track trainer steering actions, which are stored in a relational database for online and post-hoc data analyses.
The Design of Reciprocal Learning Between Human and Artificial Intelligence
TLDR
This paper presents a new abstract configuration of Human-Machine Learning that focuses on reciprocal learning and incorporates software to support combining human and artificial intelligences, and describes the development of a system called Fusion that supports human-machine reciprocal learning.
MARS: Assisting Human with Information Processing Tasks Using Machine Learning
TLDR
This article studies the problem of automated information processing from large volumes of unstructured, heterogeneous, and sometimes untrustworthy data sources with a novel framework called Machine Assisted Record Selection (MARS), which learns the optimal record selection via an online learning algorithm.
Supporting User Steering In Large-Scale Workflows With Provenance Data. (Support des actions de pilotage dans les workflows à grande échelle avec données de provenance)
TLDR
Using real use cases in the Oil and Gas industry, the experiments show that the proposed approach enables users to understand how their actions directly affect the workflow results at runtime and that the system design principles were essential to add negligible overhead to the HPC workflows.
Democratizing Data Science through Interactive Curation of ML Pipelines
TLDR
Alpine Meadow significantly outperforms the other AutoML systems while, in contrast to those systems, providing interactive latencies; it also outperforms expert solutions in 80% of the cases on data sets the authors have never seen before.
NOAH: Creating Data Integration Pipelines over Continuously Extracted Web Data
TLDR
Noah, an ongoing research project aiming at developing a system for semi-automatically creating end-to-end Web data processing pipelines, is presented, based on a novel hybrid human-machine learning approach in which the same type of questions can be interchangeably posed both to human crowd workers and to automatic responders based on machine learning models.

References

SHOWING 1-10 OF 17 REFERENCES
Helix: Holistic Optimization for Accelerating Iterative Machine Learning
TLDR
Empirical evaluation shows that Helix not only handles a wide variety of use cases in one unified workflow but is also much faster, providing run time reductions of up to 19x over state-of-the-art systems such as DeepDive or KeystoneML.
Supporting Fast Iteration in Model Building
TLDR
The need for a system to support the entire modeling process by making it cheap to run and track experiments is argued, and the desiderata for such a system spanning systems optimizations to visual interfaces are described.
MLlib: Machine Learning in Apache Spark
TLDR
MLlib is presented, Spark's open-source distributed machine learning library that provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives.
KeystoneML: Optimizing Pipelines for Large-Scale Advanced Analytics
TLDR
KeystoneML is presented, a system that captures and optimizes end-to-end large-scale machine learning applications for high-throughput training in a distributed environment, with a high-level API that offers increased ease of use and higher performance over existing systems for large-scale learning.
Towards Unified Data and Lifecycle Management for Deep Learning
TLDR
A high-level domain specific language (DSL) is proposed, inspired by SQL, to raise the abstraction level and thereby accelerate the modeling process and to manage the variety of data artifacts, especially the large amount of checkpointed float parameters.
TuPAQ: An Efficient Planner for Large-scale Predictive Analytic Queries
TLDR
TuPAQ, a component of the MLbase system, is proposed, which solves the PAQ planning problem with comparable quality to exhaustive strategies but an order of magnitude more efficiently than the standard baseline approach, and can scale to models trained on terabytes of data across hundreds of machines.
SystemML: Declarative machine learning on MapReduce
TLDR
This paper proposes SystemML, in which ML algorithms are expressed in a higher-level language and are compiled and executed in a MapReduce environment, and describes and empirically evaluates a number of optimization strategies for efficiently executing these algorithms on Hadoop, an open-source MapReduce implementation.
TensorFlow: A system for large-scale machine learning
TLDR
The TensorFlow dataflow model is described and the compelling performance that TensorFlow achieves for several real-world applications is demonstrated.
Scikit-learn: Machine Learning in Python
Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing
Materialization Optimizations for Feature Selection Workloads
TLDR
It is argued that managing the feature selection process is a pressing data management challenge, and it is shown that it is possible to build a simple cost-based optimizer to automatically select a near-optimal execution plan for feature selection.