Revisiting Dockerfiles in Open Source Software Over Time

  • Kalvin Eng, Abram Hindle
  • Published 23 March 2021
  • Computer Science
  • 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR)
Docker is becoming ubiquitous with containerization for developing and deploying applications. Previous studies have analyzed Dockerfiles that are used to create container images in order to better understand how to improve Docker tooling. These studies obtain Dockerfiles using either Docker Hub or GitHub. In this paper, we revisit the findings of previous studies using the largest set of Dockerfiles known to date with over 9.4 million unique Dockerfiles found in the World of Code…


A Large-scale Data Set and an Empirical Study of Docker Images Hosted on Docker Hub
The results demonstrate the maturity of the Docker ecosystem: more reliance on ready-to-use language and application base images as opposed to yet-to-be-configured OS images, a downward trend in Docker image sizes reflecting the adoption of the best practice of keeping images small, and a declining trend in the number of smells suggesting a general improvement in quality.
An Empirical Analysis of the Docker Container Ecosystem on GitHub
An exploratory empirical study on the Docker ecosystem, prevalent quality issues, and the evolution of Dockerfiles finds that most quality issues arise from missing version pinning, and proposes to introduce an abstraction that could deal with the intricacies of different package managers and could improve migration to more light-weight images.
A clustering-based approach for mining dockerfile evolutionary trajectories
The potential to implement the best practices through the analysis of the dockerfile evolutionary trajectories motivated this work.
An Insight Into the Impact of Dockerfile Evolutionary Trajectories on Quality and Latency
An empirical study on a large dataset of 2,840 projects sheds light on the impact of Dockerfile evolutionary trajectories on quality and latency in Docker-based containerization and derives a number of suggestions for practitioners.
Characterizing the Occurrence of Dockerfile Smells in Open-Source Software: An Empirical Study
An empirical study on a large dataset of 6,334 projects to help developers gain insight into the occurrence of Dockerfile smells, including their coverage, distribution, co-occurrence, and correlation with project characteristics.
Learning from, Understanding, and Supporting DevOps Artifacts for Docker
A toolset, binnacle, is introduced that enabled the authors to ingest 900,000 GitHub repositories and learn rules, along with an analyzer that can be used to aid developers in the IDE when creating Dockerfiles and, in a post-hoc fashion, to identify issues in and improve existing Dockerfiles.
World of Code: An Infrastructure for Mining the Universe of Open Source VCS Data
A very large and frequently updated collection of version control data for FLOSS projects named World of Code (WoC), which is capable of supporting trend evaluation, ecosystem measurement, and the determination of package usage, and is expected to spur investigation into global properties of OSS development leading to increased resiliency of the entire OSS ecosystem.
A Complete Set of Related Git Repositories Identified via Community Detection Approaches Based on Shared Commits
The approach successfully reduces the size of the megacluster, with the largest group of highly interconnected projects containing under 400K repositories. The authors expect that the resulting map of related projects, as well as the tools and methods for handling the very large graph, will serve as a reference set for mining software projects and for other applications.
Curating GitHub for engineered software projects
This work proposes a framework, and presents a reference implementation of it as a tool called reaper, to enable researchers to select GitHub repositories that contain evidence of an engineered software project. It also identifies software engineering practices (called dimensions) and proposes means for validating their existence in a GitHub repository.
Determining sample size.
  • E. Fess
  • Mathematics
    Journal of hand therapy : official journal of the American Society of Hand Therapists
  • 1995