Accelerating Human-in-the-loop Machine Learning: Challenges and Opportunities
@article{Xin2018AcceleratingHM,
  title={Accelerating Human-in-the-loop Machine Learning: Challenges and Opportunities},
  author={Doris Xin and Litian Ma and Jialin Liu and Stephen Macke and Shuchen Song and Aditya G. Parameswaran},
  journal={Proceedings of the Second Workshop on Data Management for End-To-End Machine Learning},
  year={2018}
}
Development of machine learning (ML) workflows is a tedious process of iterative experimentation: developers repeatedly make changes to workflows until the desired accuracy is attained. We describe our vision for a "human-in-the-loop" ML system that accelerates this process: by intelligently tracking changes and intermediate results over time, such a system can enable rapid iteration, quick responsive feedback, introspection and debugging, and background execution and automation. We finally…
55 Citations
Helix: Holistic Optimization for Accelerating Iterative Machine Learning
- Computer Science, Proc. VLDB Endow.
- 2018
Empirical evaluation shows that Helix not only handles a wide variety of use cases in one unified workflow but is also much faster, providing run time reductions of up to 19x over state-of-the-art systems, such as DeepDive or KeystoneML.
Exploring the Role of Machine Learning in Scientific Workflows: Opportunities and Challenges
- Computer Science, ArXiv
- 2021
This survey discusses the challenges of executing scientific workflows as well as existing Machine Learning (ML) techniques to alleviate those challenges and provides suggestions for improving the performance of their execution using ML techniques.
Alpine Meadow: A System for Interactive AutoML
- Computer Science
- 2019
Alpine Meadow is presented, a first Interactive Automated Machine Learning tool that significantly outperforms other AutoML systems while, in contrast to those systems, providing interactive latencies; it also outperforms expert solutions in 80% of cases on data sets the authors had never seen before.
A Survey of Human-in-the-loop for Machine Learning
- Computer Science, Future Generation Computer Systems
- 2022
Towards a Human-in-the-Loop Library for Tracking Hyperparameter Tuning in Deep Learning Development
- Computer Science, LADaS@VLDB
- 2018
This paper presents DL-Steer, the first prototype to help trainers fine-tune hyperparameters and to track trainer steering actions, which are stored in a relational database for online and post-hoc data analyses.
The Design of Reciprocal Learning Between Human and Artificial Intelligence
- Computer Science, Proc. ACM Hum. Comput. Interact.
- 2021
This paper proposes a new abstract configuration of human-machine learning that focuses on reciprocal learning and incorporates software to support combining human and artificial intelligence, and describes the development of a system called Fusion that supports human-machine reciprocal learning.
MARS: Assisting Human with Information Processing Tasks Using Machine Learning
- Computer Science, ACM Trans. Comput. Heal.
- 2022
This article studies the problem of automated information processing from large volumes of unstructured, heterogeneous, and sometimes untrustworthy data sources with a novel framework called Machine Assisted Record Selection (MARS), which learns the optimal record selection via an online learning algorithm.
Supporting User Steering In Large-Scale Workflows With Provenance Data. (Support des actions de pilotage dans les workflows à grande échelle avec données de provenance)
- Computer Science
- 2019
Using real use cases in the Oil and Gas industry, the experiments show that the proposed approach enables users to understand how their actions directly affect the workflow results at runtime and that the system design principles were essential to add negligible overhead to the HPC workflows.
Democratizing Data Science through Interactive Curation of ML Pipelines
- Computer Science, SIGMOD Conference
- 2019
Alpine Meadow significantly outperforms other AutoML systems while, in contrast to those systems, providing interactive latencies; it also outperforms expert solutions in 80% of cases on data sets the authors had never seen before.
NOAH: Creating Data Integration Pipelines over Continuously Extracted Web Data
- Computer Science, EDBT/ICDT Workshops
- 2021
Noah, an ongoing research project aiming at developing a system for semi-automatically creating end-to-end Web data processing pipelines, is presented, based on a novel hybrid human-machine learning approach in which the same type of questions can be interchangeably posed both to human crowd workers and to automatic responders based on machine learning models.
References
SHOWING 1-10 OF 17 REFERENCES
Helix: Holistic Optimization for Accelerating Iterative Machine Learning
- Computer Science, Proc. VLDB Endow.
- 2018
Empirical evaluation shows that Helix not only handles a wide variety of use cases in one unified workflow but is also much faster, providing run time reductions of up to 19x over state-of-the-art systems, such as DeepDive or KeystoneML.
Supporting Fast Iteration in Model Building
- Computer Science
- 2015
The need for a system to support the entire modeling process by making it cheap to run and track experiments is argued, and the desiderata for such a system spanning systems optimizations to visual interfaces are described.
MLlib: Machine Learning in Apache Spark
- Computer Science, J. Mach. Learn. Res.
- 2016
MLlib is presented, Spark's open-source distributed machine learning library that provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives.
KeystoneML: Optimizing Pipelines for Large-Scale Advanced Analytics
- Computer Science, 2017 IEEE 33rd International Conference on Data Engineering (ICDE)
- 2017
KeystoneML is presented, a system that captures and optimizes the end-to-end large-scale machine learning applications for high-throughput training in a distributed environment with a high-level API that offers increased ease of use and higher performance over existing systems for large scale learning.
Towards Unified Data and Lifecycle Management for Deep Learning
- Computer Science, 2017 IEEE 33rd International Conference on Data Engineering (ICDE)
- 2017
A high-level domain specific language (DSL) is proposed, inspired by SQL, to raise the abstraction level and thereby accelerate the modeling process and to manage the variety of data artifacts, especially the large amount of checkpointed float parameters.
TuPAQ: An Efficient Planner for Large-scale Predictive Analytic Queries
- Computer Science, ArXiv
- 2015
TuPAQ, a component of the MLbase system, is proposed, which solves the PAQ planning problem with comparable quality to exhaustive strategies but an order of magnitude more efficiently than the standard baseline approach, and can scale to models trained on terabytes of data across hundreds of machines.
SystemML: Declarative machine learning on MapReduce
- Computer Science, 2011 IEEE 27th International Conference on Data Engineering
- 2011
This paper proposes SystemML, in which ML algorithms are expressed in a higher-level language and are compiled and executed in a MapReduce environment, and describes and empirically evaluates a number of optimization strategies for efficiently executing these algorithms on Hadoop, an open-source MapReduce implementation.
TensorFlow: A system for large-scale machine learning
- Computer Science, OSDI
- 2016
The TensorFlow dataflow model is described and the compelling performance that TensorFlow achieves for several real-world applications is demonstrated.
Scikit-learn: Machine Learning in Python
- Computer Science, J. Mach. Learn. Res.
- 2011
Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing…
Materialization Optimizations for Feature Selection Workloads
- Computer Science, TODS
- 2016
It is argued that managing the feature selection process is a pressing data management challenge, and it is shown that it is possible to build a simple cost-based optimizer to automatically select a near-optimal execution plan for feature selection.