Identifying and Harnessing the Building Blocks of Machine Learning Pipelines for Sensible Initialization of a Data Science Automation Tool

Randal S. Olson and Jason H. Moore
As data science continues to grow in popularity, there will be an increasing need to make data science tools more scalable, flexible, and accessible. […] Further, we analyze a large database of pipelines that were previously used to solve various supervised classification problems and identify 100 short series of machine learning operations that appear the most frequently, which we call the building blocks of machine learning pipelines. We harness these building blocks to initialize TPOT with…
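
The frequency analysis described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each pipeline can be written as a flat list of operator names and that a "short series" means a contiguous run of at most `max_len` operators. The function names, the `max_len` parameter, and the toy pipelines are all hypothetical.

```python
from collections import Counter


def count_building_blocks(pipelines, max_len=3):
    """Count every contiguous run of 1..max_len operators across all pipelines.

    Assumption: a pipeline is a flat list of operator names. The excerpt
    does not say how pipelines are stored or how long a "short series" is.
    """
    counts = Counter()
    for ops in pipelines:
        for n in range(1, max_len + 1):
            for start in range(len(ops) - n + 1):
                counts[tuple(ops[start:start + n])] += 1
    return counts


def top_building_blocks(pipelines, k=100, max_len=3):
    """Return the k most frequent operator runs (k=100 mirrors the abstract)."""
    counts = count_building_blocks(pipelines, max_len)
    return [block for block, _ in counts.most_common(k)]


# Hypothetical toy database of three pipelines.
pipelines = [
    ["StandardScaler", "PCA", "RandomForest"],
    ["StandardScaler", "PCA", "LogisticRegression"],
    ["PCA", "RandomForest"],
]

counts = count_building_blocks(pipelines)
most_frequent = top_building_blocks(pipelines, k=1)
```

On this toy input the single operator `PCA` is the most frequent run, since it appears in all three pipelines. The resulting blocks could then seed an initial population in place of purely random pipelines, which is the use the abstract points to.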

Scaling tree-based automated machine learning to biomedical big data with a dataset selector

TPOT-DS corroborates that one of the modules is largely predictive of the clinical diagnosis of each individual, and develops two novel features for TPOT, Dataset Selector and Template, that leverage domain knowledge, greatly reduce the computational expense and flexibly extend TPOT’s application to biomedical big data analysis.

Scaling tree-based automated machine learning to biomedical big data with a feature set selector

Two new features implemented in TPOT, Feature Set Selector (FSS) and Template, help increase the system’s scalability, reduce TPOT computation time, and may provide more interpretable results.

Toward the automated analysis of complex diseases in genome-wide association studies using genetic programming

An open source pipeline optimization tool (TPOT-MDR) that uses genetic programming to automatically design machine learning pipelines for bioinformatics studies and significantly outperforms modern machine learning methods such as logistic regression and eXtreme Gradient Boosting.

Workflow Recommendation for Text Classification Problems

Ever since writing was invented, text has been used to communicate across the boundaries of time and space. Text classification is the task of categorising documents according to pre-defined labels. This

Machine learning for metabolic engineering: A review.

A System for Accessible Artificial Intelligence

An ongoing project is presented whose ultimate goal is to deliver an open source, user-friendly AI system that is specialized for machine learning analysis of complex data in the biomedical and health care domains.

Random Forest vs Logistic Regression: Binary Classification for Heterogeneous Datasets

A model evaluation tool is developed that can simulate classifier models for these dataset characteristics and report performance metrics such as true positive rate, false positive rate, and accuracy under specific conditions. It finds that as the variance in the explanatory and noise variables increases, logistic regression consistently achieves higher overall accuracy than random forest.

Solving test case based problems with fuzzy dominance

A new fuzzy selection operator that takes into account the statistical nature of machine learning problems based on test cases, and computes a probability of Pareto optimality through covariance estimation and Markov chain Monte Carlo simulation.

Population-based Ensemble Learning with Tree Structures for Classification

This thesis addresses two of the major limitations of existing ensemble learning, i.e. the complex construction process and the black-box models that are often difficult to interpret.



Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science

This paper implements an open source Tree-based Pipeline Optimization Tool (TPOT) in Python and shows that TPOT can design machine learning pipelines that provide a significant improvement over a basic machine learning analysis while requiring little to no input nor prior knowledge from the user.

Efficient and Robust Automated Machine Learning

This work introduces a robust new AutoML system based on scikit-learn, which improves on existing AutoML methods by automatically taking into account past performance on similar datasets, and by constructing ensembles from the models evaluated during the optimization.

Automating Biomedical Data Science Through Tree-Based Pipeline Optimization

This work implements a Tree-based Pipeline Optimization Tool (TPOT) and shows that TPOT can build machine learning pipelines that achieve competitive classification accuracy and discover novel pipeline operators—such as synthetic feature constructors—that significantly improve classification accuracy on these data sets.

Deep feature synthesis: Towards automating data science endeavors

This paper proposes and develops the Deep Feature Synthesis algorithm for automatically generating features for relational datasets, and implements a generalizable machine learning pipeline and tunes it using a novel Gaussian Copula process based approach.

Practical Bayesian Optimization of Machine Learning Algorithms

This work describes new algorithms that take into account the variable cost of learning algorithm experiments and that can leverage the presence of multiple cores for parallel experimentation and shows that these proposed algorithms improve on previous automatic procedures and can reach or surpass human expert-level optimization for many algorithms.

Scikit-learn: Machine Learning in Python

Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing

Multiple Objective Vector-Based Genetic Programming Using Human-Derived Primitives

The GTMOEP framework was applied to the problem of predicting how long an emergency responder can remain in a hazmat suit before the effects of heat stress cause the user to become unsafe, resulting in a safer algorithm for emergency responders to determine operating times in harsh environments.

Initializing Bayesian Hyperparameter Optimization via Meta-Learning

This paper mimics a strategy human domain experts use: speed up optimization by starting from promising configurations that performed well on similar datasets, and substantially improves the state of the art for the more complex combined algorithm selection and hyperparameter optimization problem.

Beyond Manual Tuning of Hyperparameters

This work discusses two strategies towards making machine learning algorithms more autonomous: automated optimization of hyperparameters (including mechanisms for feature selection, preprocessing, model selection, etc.) and the development of algorithms with reduced sets of hyperparameters.

The Elements of Statistical Learning: Data Mining, Inference, and Prediction

This book is a valuable resource, both for the statistician needing an introduction to machine learning and related fields and for the computer scientist wishing to learn more about statistics, and statisticians will especially appreciate that it is written in their own language.