Code Smells for Machine Learning Applications

  • Haiyin Zhang, Luis Cruz, Arie van Deursen
  • Published 25 March 2022
  • Computer Science
  • 2022 IEEE/ACM 1st International Conference on AI Engineering – Software Engineering for AI (CAIN)
The popularity of machine learning has grown rapidly in recent years. Machine learning techniques have been studied intensively in academia and applied in industry to create business value. However, there is a lack of guidelines for code quality in machine learning applications. In particular, code smells have rarely been studied in this domain. Although machine learning code is usually integrated as a small part of an overarching system, it often plays an important role in its core…
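As an illustration of the kind of ML-specific code smell the abstract refers to, the hypothetical snippet below (our own example, not taken from the paper's catalogue) shows a common pandas pitfall: comparing values against `np.nan` with `==`, which silently always yields False because NaN compares unequal to everything, including itself.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25.0, np.nan, 40.0]})

# Smell: equality comparison with NaN is always False,
# so this mask never detects any missing values.
smelly = df["age"] == np.nan   # [False, False, False] — silently wrong

# Preferred: use the dedicated missing-value check.
clean = df["age"].isna()       # [False, True, False]
```

The smelly version type-checks and runs without warnings, which is exactly why such defects are easy to miss without dedicated static analysis.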

Figures and Tables from this paper

MLSmellHound: A Context-Aware Code Analysis Tool

This work addresses the problem of context in code analysis by drawing on i) the purpose of the source code, ii) the technical domain, iii) the problem domain, iv) team norms, v) the operational environment, and vi) the development lifecycle stage to provide contextualised error reporting.

AI lifecycle models need to be revised

The work shows that the real challenges of applying machine learning go far beyond sophisticated learning algorithms: more focus is needed on the entire lifecycle, and existing development tools for machine learning still do not meet the particularities of this field.

The Prevalence of Code Smells in Machine Learning projects

Manual analysis of code smells in open-source ML projects showed that code duplication is widespread and that the PEP 8 identifier-naming convention may not always apply to ML code because of its resemblance to mathematical notation; several major obstacles to the maintainability and reproducibility of ML projects were also found.
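The naming tension mentioned above can be sketched with a short, hypothetical snippet (our own illustration, not drawn from any of the analysed projects): identifiers such as `X`, `y`, and `lam` mirror the ridge-regression formula w = (XᵀX + λI)⁻¹Xᵀy, even though PEP 8 would prefer lowercase descriptive names such as `features` and `targets`.

```python
import numpy as np

def fit_ridge(X, y, lam=1.0):
    """Closed-form ridge regression.

    X, y, and lam deliberately follow the mathematical notation
    w = (X^T X + lam * I)^{-1} X^T y, violating PEP 8's lowercase
    naming convention for arguments.
    """
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
w = fit_ridge(X, y, lam=0.0)   # w ≈ [1.0, 2.0]
```

Whether such names count as a smell is context-dependent, which is exactly the argument the cited study makes: to a reader fluent in the underlying mathematics, `X` is clearer than a verbose alternative.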

A large-scale comparative analysis of Coding Standard conformance in Open-Source Data Science projects

This study investigates the extent to which Data Science projects follow coding standards and indicates that Data Science codebases are distinct from traditional software codebases and do not follow traditional software engineering conventions.

The State of the ML-universe: 10 Years of Artificial Intelligence & Machine Learning Software Development on GitHub

A large-scale empirical study of AI & ML tool and application repositories hosted on GitHub identifies unique properties, development patterns, and trends, complemented by a detailed study of developer workflows that measures collaboration and autonomy within a repository.

Code Smells and Refactoring: A Tertiary Systematic Review of Challenges and Observations

A tertiary systematic literature review of previous surveys, secondary systematic literature reviews, and systematic mappings shows that code smells and refactoring have a strong relationship with quality attributes, i.e., with understandability, maintainability, testability, complexity, functionality, and reusability.

Pitfalls Analyzer: Quality Control for Model-Driven Data Science Pipelines

This paper presents a prototype of the Pitfalls Analyzer for KNIME, one of the most popular data science pipeline tools; the prototype is itself model-driven, since pitfall detection is accomplished using pipelines created from KNIME building blocks.

Taxonomy of Real Faults in Deep Learning Systems

This paper introduces a large taxonomy of faults in deep learning (DL) systems, built by manually analysing 1,059 artefacts gathered from GitHub commits and issues of projects that use the most popular DL frameworks, and from related Stack Overflow posts.