Corpus ID: 53039735

Preprocessor Selection for Machine Learning Pipelines

Brandon Schoenfeld, Christophe G. Giraud-Carrier, Mason Poggemann, Jarom Christensen, Kevin D. Seppi

Much of the work in metalearning has focused on classifier selection, combined more recently with hyperparameter optimization, with little concern for data preprocessing. Yet, it is generally well accepted that machine learning applications require not only model building, but also data preprocessing. In other words, practical solutions consist of pipelines of machine learning operators rather than single algorithms. Interestingly, our experiments suggest that, on average, data preprocessing… 

Benchmark and Survey of Automated Machine Learning Frameworks

This paper combines a survey of current AutoML methods with a benchmark of popular AutoML frameworks on real data sets, summarizing and reviewing important AutoML techniques and methods for every step in building an ML pipeline.

Survey on Automated Machine Learning

This survey summarizes recent developments in academia and industry regarding AutoML, introduces a holistic problem formulation and approaches for solving various subproblems of AutoML, and provides an extensive empirical evaluation of the presented approaches on synthetic and real data.

Extended Pre-Processing Pipeline For Text Classification: On the Role of Meta-Features, Sparsification and Selective Sampling

This Master's thesis introduces three new steps into the traditional pre-processing phase of pipelines for text classification: 1) Meta-Features Generation; 2) Sparsification; and 3) Selective Sampling.

AutonoML: Towards an Integrated Framework for Autonomous Machine Learning

This review seeks to motivate a more expansive perspective on what constitutes an automated/autonomous ML system, alongside consideration of how best to consolidate those elements, and develops a conceptual framework to illustrate one possible way of fusing high-level mechanisms into an autonomous ML system.

Novel authorship verification model for social media accounts compromised by a human

An authorship verification model that uses XGBoost as a preprocessor to discover functional features of the text message, which are then ranked using MCDM methods to build a classification model.

Using learning analytics to support students’ engineering design: the angle of prediction

This research presents novel, scalable approaches that can be used to improve the quality of teaching and learning in the rapidly changing environment of online education.

Correction to: Novel authorship verification model for social media accounts compromised by a human

A Correction to this paper has been published: https://doi.org/10.1007/s11042-021-10617-5

A universal information theoretic approach to the identification of stopwords

This work formulates an information theoretic framework that automatically identifies uninformative words in a corpus and shows that it not only outperforms other stopword heuristics, but also allows for a substantial reduction of document size in applications of topic modelling.



Layered TPOT: Speeding up Tree-based Pipeline Optimization

This work introduces Layered TPOT, a modification to TPOT that aims to create pipelines as good as those of the original, but in significantly less time, using a modified evolutionary algorithm.

Efficient and Robust Automated Machine Learning

This work introduces a robust new AutoML system based on scikit-learn, which improves on existing AutoML methods by automatically taking into account past performance on similar datasets, and by constructing ensembles from the models evaluated during the optimization.

Metalearning - Applications to Data Mining

This book discusses several approaches to obtaining knowledge concerning the performance of machine learning and data mining algorithms and shows how this knowledge can be reused to select, combine, compose and adapt both algorithms and models to yield faster, more effective solutions to data mining problems.

Scikit-learn: Machine Learning in Python

Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing

A Comprehensive Dataset for Evaluating Approaches of Various Meta-learning Tasks

This paper presents a novel, publicly available dataset for meta-learning built from 83 datasets, six classification algorithms, and 49 meta-features, with target variables such as classifier accuracy and training time, as well as parameter-dependent measures, included as ground-truth information.

OpenML: networked science in machine learning

This paper introduces OpenML, a place for machine learning researchers to share and organize data in fine detail, so that they can work more effectively, be more visible, and collaborate with others to tackle harder problems.

Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA

This paper describes the new version of Auto-WEKA, a system designed to help novice users by automatically searching through the joint space of WEKA's learning algorithms and their respective hyperparameter settings to maximize performance, using a state-of-the-art Bayesian optimization method.