Better software analytics via “DUO”: Data mining algorithms using/used-by optimizers

@article{Agrawal2020BetterSA,
  title={Better software analytics via “DUO”: Data mining algorithms using/used-by optimizers},
  author={Amritanshu Agrawal and Tim Menzies and Leandro L. Minku and Markus Wagner and Zhe Yu},
  journal={Empirical Software Engineering},
  year={2020},
  volume={25},
  pages={2099-2136}
}
This paper claims that a new field of empirical software engineering research and practice is emerging: data mining using/used-by optimizers for empirical studies, or DUO. For example, data miners can generate models that are explored by optimizers. Also, optimizers can advise how best to adjust the control parameters of a data miner. This combined approach acts like an agent leaning over the shoulder of an analyst that advises “ask this question next” or “ignore that problem, it is not…
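
To make the DUO idea concrete, here is a minimal sketch of the “optimizers tuning data miners” direction: scipy's differential evolution searches the hyperparameter space of a decision-tree learner. The dataset, the chosen hyperparameters, and their ranges are illustrative assumptions, not taken from the paper.

```python
# Sketch of DUO's "optimizers tuning data miners" direction.
# The dataset and parameter ranges below are illustrative assumptions.
from scipy.optimize import differential_evolution
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=1)

def loss(params):
    """Score one hyperparameter setting; the optimizer minimizes this."""
    max_depth, min_leaf = int(params[0]), int(params[1])
    model = DecisionTreeClassifier(max_depth=max_depth,
                                   min_samples_leaf=min_leaf,
                                   random_state=1)
    # Negate accuracy so that lower loss means better predictions.
    return -cross_val_score(model, X, y, cv=3).mean()

# Assumed ranges: max_depth in [1, 20], min_samples_leaf in [1, 20].
result = differential_evolution(loss, bounds=[(1, 20), (1, 20)],
                                seed=1, maxiter=10)
print("best params:", result.x, "cv accuracy:", -result.fun)
```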

When, and Why, Simple Methods Fail. Lessons Learned from Hyperparameter Tuning in Software Analytics (and Elsewhere)

TLDR
The conclusion is that these special properties of SE data can be exploited to great effect to find better optimizations for SE tasks via a tactic called "dodging" (explained in this paper).

How to “DODGE” Complex Software Analytics

TLDR
By ignoring redundant tunings, DODGE, a tuning tool, runs orders of magnitude faster while also generating learners with more accurate predictions than prior state-of-the-art approaches.
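
A minimal sketch of the "dodging" tactic as this TLDR describes it: regions of the option space whose settings keep producing results within epsilon of results already seen are deprecated and sampled less often. The toy objective, the binning, and the weight updates are all assumptions for illustration, not DODGE's actual design.

```python
# Sketch of "dodging": deprecate regions of the option space whose
# settings yield results within EPSILON of results already seen.
import random

random.seed(1)
EPSILON = 0.05
BINS = 10
weights = [1.0] * BINS            # sampling weight per region of option space
seen = []                         # distinct results observed so far

def objective(x):
    """Toy stand-in for a learner's performance at setting x in [0, 1)."""
    return (x - 0.3) ** 2

for _ in range(50):
    b = random.choices(range(BINS), weights=weights)[0]
    x = (b + random.random()) / BINS
    s = objective(x)
    if any(abs(s - old) < EPSILON for old in seen):
        weights[b] *= 0.5         # dodge: deprecate this redundant region
    else:
        seen.append(s)
        weights[b] *= 1.1         # endorse regions that reveal new results

print(f"{len(seen)} distinct results; best objective = {min(seen):.4f}")
```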

An Empirical Study of Model-Agnostic Techniques for Defect Prediction Models

TLDR
It is concluded that model-agnostic techniques are needed to explain individual predictions of defect models, and that more than half of the surveyed practitioners perceive contrastive explanations as necessary and useful for understanding those predictions.

Predicting health indicators for open source projects (using hyperparameter optimization)

TLDR
This is the largest study yet conducted using recent data to predict multiple health indicators of open-source projects; it finds that traditional estimation algorithms make many mistakes.

Is AI different for SE?

TLDR
Standard AI tools work best when the target is relatively more frequent, and a new kind of SE research is needed for developing new AI tools that are more suited to SE problems.

A Pragmatic Approach for Hyper-Parameter Tuning in Search-based Test Case Generation

TLDR
A new metric, “Tuning Gain”, is proposed to estimate how cost-effective tuning a particular class is; using a tuning approach called Meta-GA, the study shows that for a low tuning budget, prioritizing classes outperforms the alternatives in terms of extra covered branches.
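
A minimal sketch of prioritizing classes by a gain-per-cost score in the spirit of “Tuning Gain”. The summary above does not give the paper's formula, so the ratio below (expected extra covered branches divided by tuning cost) is purely an assumption.

```python
# Sketch of class prioritization under a low tuning budget.
# The gain formula is an assumption; the paper's metric may differ.
def tuning_gain(expected_extra_branches, tuning_cost):
    """Higher values mean tuning this class pays off more per unit cost."""
    return expected_extra_branches / tuning_cost

classes = {  # hypothetical classes in a system under test
    "Parser":    tuning_gain(expected_extra_branches=40, tuning_cost=10),
    "Logger":    tuning_gain(expected_extra_branches=5,  tuning_cost=8),
    "Scheduler": tuning_gain(expected_extra_branches=25, tuning_cost=5),
}
# With a low tuning budget, spend it on the highest-gain classes first.
for name, gain in sorted(classes.items(), key=lambda kv: -kv[1]):
    print(name, round(gain, 2))
```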

What makes a good Node.js package? Investigating Users, Contributors, and Runnability

TLDR
This study conducts a survey asking Node.js developers to evaluate the importance of 30 features derived from existing work, including GitHub activity, software usability, and properties of the repository and documentation, and finds that predicting the runnability of packages is viable.

VEER: A Fast and Disagreement-Free Multi-objective Configuration Optimizer

TLDR
This paper shows that model disagreement can be mitigated via VEER, a one-dimensional approximation to the N-objective space, which is recommended as a very fast method to solve complex configuration problems, while at the same time avoiding model disagreement.
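
A minimal sketch of collapsing an N-objective space into one dimension, in the spirit of VEER's one-dimensional approximation. The exact mapping VEER learns is not given in this summary; the normalize-then-combine scheme below is an assumption.

```python
# Sketch: map N objectives (all minimized) to one scalar per candidate,
# so a single ranking replaces potentially disagreeing models.
def to_one_dimension(rows):
    """rows: list of objective vectors, all to be minimized."""
    n = len(rows[0])
    lo = [min(r[i] for r in rows) for i in range(n)]
    hi = [max(r[i] for r in rows) for i in range(n)]
    def scalar(r):
        # Normalized distance to the ideal ("heaven") point, a common
        # single-number summary of multi-objective quality.
        return sum(((r[i] - lo[i]) / (hi[i] - lo[i] or 1)) ** 2
                   for i in range(n)) ** 0.5
    return [scalar(r) for r in rows]

# Three configurations measured on (runtime, memory), both minimized.
configs = [(120, 4.0), (90, 6.5), (150, 3.2)]
scores = to_one_dimension(configs)
print("best config:", sorted(zip(scores, configs))[0][1])
```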

Predicting Good Configurations for GitHub and Stack Overflow Topic Models

TLDR
It is found that popular rules of thumb for topic-modelling parameter configuration do not apply to the corpora used in the experiments, that corpora sampled from GitHub and Stack Overflow have different characteristics and require different configurations to achieve good model fit, and that good configurations for unseen corpora can be predicted reliably.

References

Showing 1-10 of 143 references

Tuning for Software Analytics: Is it Really Necessary?

Data-Driven Search-Based Software Engineering

TLDR
It is argued that combining these two fields is useful for situations which require learning from a large data source or when optimizers need to know the lay of the land to find better solutions, faster.

Perspectives on Data Science for Software Engineering

Which models of the past are relevant to the present? A software effort estimation approach to exploiting useful past models

TLDR
Dynamic Cross-company Learning (DCL) is proposed to dynamically identify which within-company (WC) or cross-company (CC) past models are most useful for making predictions for a given company at present, and to automatically emphasize the predictions of those models in order to improve predictive performance.
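
A minimal sketch of emphasizing the most useful past models, in the spirit of this DCL summary. The actual weighting rule is not given here, so the multiplicative down-weighting of erring models below is an assumption.

```python
# Sketch: keep a weight per past effort model, shrink weights of models
# that erred badly on the latest project, predict with a weighted mean.
def weighted_estimate(models, weights, project):
    """Combine past models' effort estimates, emphasizing reliable ones."""
    preds = [m(project) for m in models]
    return sum(w * p for w, p in zip(weights, preds)) / sum(weights)

def update_weights(models, weights, project, actual, beta=0.5):
    """Halve the weight of models that erred badly on the latest project."""
    for i, m in enumerate(models):
        error = abs(m(project) - actual) / max(actual, 1e-9)
        if error > 0.25:              # assumed tolerance for "useful"
            weights[i] *= beta
    return weights

# Two toy effort models: one within-company (WC), one cross-company (CC).
models = [lambda p: 10 * p["size"], lambda p: 14 * p["size"]]
weights = [1.0, 1.0]
weights = update_weights(models, weights, {"size": 3}, actual=31)
print(weighted_estimate(models, weights, {"size": 5}))
```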

What is wrong with topic modeling? And how to fix it using search-based software engineering

Building Better Quality Predictors Using "ε-Dominance"

TLDR
DART, an algorithm especially selected to succeed on large-ε software quality prediction problems, is explored; it dramatically outperforms three sets of state-of-the-art defect prediction methods.
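
For reference, a minimal sketch of a standard additive ε-dominance test (the summary does not show how DART uses ε internally): with all objectives minimized, a solution ε-dominates another when it is at least as good everywhere within ε and strictly better somewhere.

```python
# Sketch of an additive epsilon-dominance predicate, objectives minimized.
def eps_dominates(a, b, eps):
    at_least_as_good = all(ai <= bi + eps for ai, bi in zip(a, b))
    strictly_better = any(ai < bi - eps for ai, bi in zip(a, b))
    return at_least_as_good and strictly_better

# Two defect predictors scored on (false alarms, missed defects).
print(eps_dominates((0.10, 0.20), (0.30, 0.40), eps=0.05))  # True
print(eps_dominates((0.10, 0.20), (0.12, 0.21), eps=0.05))  # False: within eps
```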

On the value of user preferences in search-based software engineering: A case study in software product lines

TLDR
The conclusion is that search-based software engineering methods need to change, particularly when studying complex decision spaces, since methods in widespread use perform much worse than IBEA (Indicator-Based Evolutionary Algorithm).

...