An Empirical Study on Program Failures of Deep Learning Jobs

@article{Zhang2020AnES,
  title={An Empirical Study on Program Failures of Deep Learning Jobs},
  author={Ru Zhang and Wencong Xiao and Hongyu Zhang and Yu Liu and Haoxiang Lin and Mao Yang},
  journal={2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE)},
  year={2020},
  pages={1159-1170}
}
  • Ru Zhang, Wencong Xiao, Hongyu Zhang, Yu Liu, Haoxiang Lin, Mao Yang
  • Published 27 June 2020
  • Computer Science
  • 2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE)
Deep learning has made significant achievements in many application areas. To train and test models more efficiently, enterprise developers submit and run their deep learning programs on a shared, multi-tenant platform. However, some of the programs fail after a long execution time due to code/script defects, which reduces development productivity and wastes expensive resources such as GPU, storage, and network I/O. This paper presents the first comprehensive empirical study on program… 

Impact of programming languages on machine learning bugs

This paper proposes the first empirical study on the impact of programming languages on bugs in ML programs; it plans to analyze software from GitHub and related discussions in GitHub issues and Stack Overflow for bug distributions, aiming to identify correlations with the chosen programming language, its features, and the application domain.

Towards Demystifying the Impact of Dependency Structures on Bug Locations in Deep Learning Libraries

This work presents a large set of benchmarks and a prototype toolkit that automatically detects various forms of dependency structures in deep learning libraries, and demonstrates the significant differences among syntactic, history, and semantic structures as well as their vastly different impacts on bug locations.

gDefects4DL: A Dataset of General Real-World Deep Learning Program Defects

This work presents gDefects4DL, a dataset of general bugs of deep learning programs, which contains 64 bugs falling into 6 categories (i.e., API Misuse, Shape Mismatch, Number Error, Type Mismatch, Violation of Architecture Convention, and Performance Bug).
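To make one of these categories concrete, below is a minimal, hypothetical PyTorch snippet illustrating a Shape Mismatch defect; it is not drawn from the gDefects4DL dataset itself, and the layer and image sizes are illustrative assumptions only.

```python
import torch
import torch.nn as nn

# Hypothetical Shape Mismatch defect: the linear layer expects 784 input
# features (28x28 images), but the batch holds 32x32 images, so flattening
# yields 1024 features and the forward pass fails at run time.
model = nn.Linear(in_features=784, out_features=10)
batch = torch.randn(64, 32, 32)

try:
    logits = model(batch.view(64, -1))  # 64x1024 fed into a 784->10 layer
except RuntimeError as err:
    print(f"shape mismatch: {err}")
```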

Characterizing Performance Bugs in Deep Learning Systems

This paper presents the first comprehensive study characterizing the symptoms, root causes, and introducing and exposing stages of performance bugs (PBs) in DL systems developed with TensorFlow and Keras, based on a total of 238 PBs collected from 225 Stack Overflow posts.

Comparative analysis of real bugs in open-source Machine Learning projects - A Registered Report

This registered report investigates whether there is a discrepancy in the distribution of resolution time between ML and non-ML issues, and whether certain categories of ML issues require a longer time to resolve, based on real issue reports in open-source applied ML projects.

Demystifying Developers' Issues in Distributed Training of Deep Learning Software

This paper extracts and analyzes 1,054 real-world developers’ issues in distributed training from Stack Overflow and GitHub, constructs a fine-grained taxonomy consisting of 30 categories regarding the fault symptoms, and summarizes common fix patterns for different symptoms.

Detecting TensorFlow Program Bugs in Real-World Industrial Environment

  • Chen Liu, Jie Lu, Jingling Xue
  • Computer Science
    2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE)
  • 2021
An extensive empirical study on 12,289 failed TensorFlow jobs is reported, showing that existing static tools can effectively detect 72.55% of the top three types of Python bugs in industrial TensorFlow programs; a constraint-based approach for detecting TensorFlow shape-related errors is also proposed, together with an associated tool, ShapeTracer.
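As a rough sketch of the general idea behind constraint-based shape checking (not ShapeTracer's actual algorithm or API, which this summary does not detail), symbolic shapes can be propagated through operators and inner-dimension constraints checked before the job ever runs:

```python
# Minimal, hypothetical sketch of constraint-based shape checking: each
# matmul emits the constraint that the inner dimensions must agree, so a
# shape-related error is reported before any cluster time is spent.

def check_matmul(a_shape, b_shape):
    """Return the output shape of a 2-D matmul, or raise if the
    constraint a_shape[1] == b_shape[0] is violated."""
    if a_shape[1] != b_shape[0]:
        raise ValueError(
            f"shape constraint violated: {a_shape} x {b_shape} "
            f"requires {a_shape[1]} == {b_shape[0]}"
        )
    return (a_shape[0], b_shape[1])

# A toy "program": input batch -> hidden layer -> output layer.
batch = (64, 784)
w1 = (784, 256)
w2 = (128, 10)    # defect: should be (256, 10)

hidden = check_matmul(batch, w1)   # OK: (64, 256)
output = check_matmul(hidden, w2)  # raises ValueError, flagging the defect
```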

On Reporting Performance and Accuracy Bugs for Deep Learning Frameworks: An Exploratory Study from GitHub

An exploratory study on the nature of reporting performance and accuracy bugs for DL frameworks, aiming to improve knowledge on this topic, finds that low speed is the primary reason that a performance-related bug report is submitted, while there is no consistent pattern for accuracy-related ones.

On the Variability of Software Engineering Needs for Deep Learning: Stages, Trends, and Application Types

This work distills several actionable insights for SE4DL research, practice, and education, such as better support for using trained models, application-type-specific tools, and teaching materials.

Characterizing and Detecting Bugs in WeChat Mini-Programs

  • Tao Wang, Qi Xu, Tao Huang
  • Computer Science
    2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE)
  • 2022
The first empirical study on 83 WeChat Mini-Program bugs is conducted, and an in-depth analysis of their root causes, impacts, and fixes is performed, resulting in many interesting findings that can open up new research directions for combating WeChat Mini-Program bugs.

References


A comprehensive study on deep learning bug characteristics

The key findings of this study include: data bugs and logic bugs are the most severe bug types in deep learning software, appearing more than 48% of the time, and the major root causes of these bugs are Incorrect Model Parameter or Structure (IPS) and Structural Inefficiency (SI), showing up more than 43% of the time.

An Empirical Study on Real Bugs for Machine Learning Programs

An empirical study on real machine learning bugs is conducted to examine their patterns and how they evolve over time; it shows that there are seven categories of bugs in machine learning programs.

A characteristic study on failures of production distributed data-parallel programs

The study results provide valuable guidelines for future development of data-parallel programs, and it is believed that these guidelines are not limited to SCOPE, but can also be generalized to other similar data-parallel platforms.

DeepXplore: Automated Whitebox Testing of Deep Learning Systems

DeepXplore efficiently finds thousands of incorrect corner case behaviors in state-of-the-art DL models with thousands of neurons trained on five popular datasets including ImageNet and Udacity self-driving challenge data.

An Empirical Study of Bugs in Machine Learning Systems

This study analyzes three machine learning systems, Apache Mahout, Lucene, and OpenNLP, which are data mining, information retrieval, and natural language processing tools respectively; it looks into their bug databases and code repositories, analyzes a sample set of bugs and corresponding fixes, and classifies the bugs into various categories.

PyTorch: An Imperative Style, High-Performance Deep Learning Library

This paper details the principles that drove the implementation of PyTorch and how they are reflected in its architecture, and explains how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance.

Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads

A detailed workload characterization of a two-month-long trace from a multi-tenant GPU cluster in a large enterprise is presented, and design guidelines pertaining to next-generation cluster schedulers for DNN training workloads are provided.

Scalable Multi-Framework Multi-Tenant Lifecycle Management of Deep Learning Training Jobs

A deep learning stack specifically designed for on-demand cloud environments is presented, real usage data from internal users is examined, and performance experiments that illustrate the scalability of the system are discussed.

An empirical study on TensorFlow program bugs

This work studied deep learning applications built on top of TensorFlow and collected program bugs related to TensorFlow from Stack Overflow Q&A pages and GitHub projects to examine the root causes and symptoms of coding defects in TensorFlow programs.

Optimus: an efficient dynamic resource scheduler for deep learning clusters

Optimus is proposed, a customized job scheduler for deep learning clusters, which minimizes job training time based on online resource-performance models, and sets up performance models to accurately estimate training speed as a function of allocated resources in each job.
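To illustrate what an online resource-performance model can look like in the simplest case (an assumption-laden toy, not Optimus's actual model or scheduler), one can fit training speed as a saturating function of the number of allocated workers from samples observed while the job runs, then use the fitted curve to weigh further allocations:

```python
import numpy as np
from scipy.optimize import curve_fit

def speed_model(workers, a, b):
    # Diminishing-returns curve: speed grows with workers but saturates.
    return workers / (a + b * workers)

# Hypothetical (workers, steps/second) samples observed during training.
observed_workers = np.array([1.0, 2.0, 4.0, 8.0])
observed_speed = np.array([10.0, 18.5, 31.0, 46.0])

(a, b), _ = curve_fit(speed_model, observed_workers, observed_speed, p0=[0.1, 0.01])

# Predict the marginal benefit of adding workers before allocating them.
for w in (8, 16, 32):
    print(f"{w} workers -> predicted {speed_model(w, a, b):.1f} steps/s")
```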