Corpus ID: 236924652

How to avoid machine learning pitfalls: a guide for academic researchers

  • M. Lones
  • Published 2021
  • Computer Science
  • ArXiv
This document gives a concise outline of some of the common mistakes that occur when using machine learning techniques, and what can be done to avoid them. It is intended primarily as a guide for research students, and focuses on issues that are of particular concern within academic research, such as the need to do rigorous comparisons and reach valid conclusions. It covers five stages of the machine learning process: what to do before model building, how to reliably build models, how to…
1 Citation
Small data problems in political research: a critical replication study
It is argued that A&W’s conclusions regarding the automated classification of organizational reputation tweets – either substantive or methodological – cannot be maintained, and that a larger training data set and more careful validation are required.


References

Hidden Technical Debt in Machine Learning Systems
It is found that massive ongoing maintenance costs are common in real-world ML systems, and several ML-specific risk factors to account for in system design are explored.
On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach
  • S. Salzberg
  • Computer Science
  • Data Mining and Knowledge Discovery
  • 2004
Several phenomena that can, if ignored, invalidate an experimental comparison and the conclusions that follow apply not only to classification, but to computational experiments in almost any aspect of data mining.
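One way to avoid the comparison pitfalls Salzberg describes is to compare classifiers on the same cross-validation folds rather than on single accuracy numbers. A minimal sketch of a paired t-statistic over per-fold scores is shown below (one common, if imperfect, approach; the fold scores are hypothetical, and the test's independence assumptions are only approximately met by CV folds):

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """Paired t-statistic over per-fold scores of two classifiers.

    A large |t| suggests the observed difference is unlikely to be
    fold-to-fold noise alone; compare against a t-distribution with
    n - 1 degrees of freedom before claiming significance.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation
    return mean / (sd / math.sqrt(n))

# Hypothetical per-fold accuracies from 10-fold cross-validation:
clf_a = [0.91, 0.89, 0.92, 0.90, 0.88, 0.93, 0.91, 0.90, 0.89, 0.92]
clf_b = [0.88, 0.87, 0.90, 0.88, 0.86, 0.91, 0.89, 0.88, 0.87, 0.90]
print(round(paired_t_statistic(clf_a, clf_b), 2))  # → 21.0
```

Because the paired statistic is computed fold by fold, consistent small wins accumulate into a large t even when the raw accuracy gap looks modest.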
Scikit-learn: Machine Learning Without Learning the Machinery
A quick introduction to scikit-learn, as well as to machine-learning basics, is given.
Data and its (dis)contents: A survey of dataset development and use in machine learning research
The many concerns raised about the way data is collected and used in machine learning are surveyed, and a more cautious and thorough understanding of data is advocated as necessary to address several of the practical and ethical issues of the field.
Sustainable MLOps: Trends and Challenges
  • D. Tamburri
  • Computer Science
  • 2020 22nd International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC)
  • 2020
The more these platforms penetrate the day-to-day activities of software operations, the greater the risk of AI software becoming unsustainable from a social, technical, or organisational perspective.
Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning
Different flavors of the bootstrap technique are introduced for estimating the uncertainty of performance estimates, as an alternative to confidence intervals via normal approximation if bootstrapping is computationally feasible.
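The percentile bootstrap mentioned above is simple to sketch: resample the per-example outcomes with replacement, recompute the metric each time, and read off quantiles. The sketch below assumes a hypothetical test set of 100 predictions with 85 correct; it is an illustration, not the cited paper's exact procedure:

```python
import random

def bootstrap_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for accuracy.

    `correct` is a list of 0/1 indicators, one per test example.
    """
    rng = random.Random(seed)
    n = len(correct)
    stats = []
    for _ in range(n_resamples):
        resample = [correct[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(resample) / n)
    stats.sort()
    lo = stats[int(alpha / 2 * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical test set: 85 correct out of 100 predictions.
outcomes = [1] * 85 + [0] * 15
low, high = bootstrap_ci(outcomes)
```

The interval brackets the point estimate of 0.85 and makes no normality assumption, which is the technique's main appeal when the metric's sampling distribution is skewed.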
Guidelines for Developing and Reporting Machine Learning Predictive Models in Biomedical Research: A Multidisciplinary View
A set of guidelines was generated to enable correct application of machine learning models and consistent reporting of model specifications and results in biomedical research. It is believed that such guidelines will accelerate the adoption of big data analysis, particularly with machine learning methods, in the biomedical research community.
A Survey of Data-driven and Knowledge-aware eXplainable AI
A survey reviewing and taxonomizing existing efforts from the viewpoint of DKE, summarizing their contribution, technical essence and comparative characteristics, and categorizing methods into data-driven methods, where the explanation comes from the task-related data, and knowledge-aware methods, where extraneous knowledge is incorporated.
A critical analysis of metrics used for measuring progress in artificial intelligence
The results suggest that the large majority of metrics currently used to evaluate classification AI benchmark tasks have properties that may result in an inadequate reflection of a classifier's performance, especially when used with imbalanced datasets.
Learning from class-imbalanced data: Review of methods and applications
An in-depth review of rare event detection from an imbalanced learning perspective and a comprehensive taxonomy of the existing application domains of imbalanced learning are provided.
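Among the methods such reviews cover, random oversampling is the simplest baseline: duplicate minority-class examples until the classes are balanced. A minimal stdlib-only sketch (the toy data is hypothetical, and real use would oversample only the training split to avoid leakage):

```python
import random

def random_oversample(X, y, seed=0):
    """Duplicate minority-class examples until classes are balanced."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(items) for items in by_class.values())
    X_out, y_out = [], []
    for label, items in by_class.items():
        X_out.extend(items)
        y_out.extend([label] * len(items))
        for _ in range(target - len(items)):  # top up minority classes
            X_out.append(rng.choice(items))
            y_out.append(label)
    return X_out, y_out

X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]
X_bal, y_bal = random_oversample(X, y)
print(y_bal.count(0), y_bal.count(1))  # → 4 4
```

Duplication adds no new information, so more sophisticated approaches (synthetic sampling, cost-sensitive losses) are usually preferred; this sketch only shows the mechanics.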