How to effectively use topic models for software engineering tasks? An approach based on Genetic Algorithms

@inproceedings{Panichella2013HowTE,
  title={How to effectively use topic models for software engineering tasks? An approach based on Genetic Algorithms},
  author={Annibale Panichella and Bogdan Dit and Rocco Oliveto and Massimiliano Di Penta and Denys Poshyvanyk and Andrea De Lucia},
  booktitle={2013 35th International Conference on Software Engineering (ICSE)},
  year={2013},
  pages={522--531}
}
Information Retrieval (IR) methods, and in particular topic models, have recently been used to support essential software engineering (SE) tasks, by enabling software textual retrieval and analysis. In all these approaches, topic models have been used on software artifacts in a similar manner as they were used on natural language documents (e.g., using the same settings and parameters) because the underlying assumption was that source code and natural language documents are similar. However… 

A survey on the use of topic models when mining software repositories

TLDR
This paper surveys 167 articles from the software engineering literature that make use of topic models and provides a starting point for new researchers who are interested in using topic models, and may help new researchers and practitioners determine how to best apply topic models to a particular software engineering task.

Semantic topic models for source code analysis

TLDR
The results show that the proposed approach for topic modeling designed for source code produces stable, more interpretable, and more expressive topics than classical topic modeling techniques without the necessity for extensive parameter calibration.

Toward Optimal Selection of Information Retrieval Models for Software Engineering Tasks

TLDR
A generalized framework, SRCH, is proposed to automatically select the most favorable IR model(s) for a given SE task, and a preliminary user study shows that SRCH's intelligent adaption of the IR models to the task at hand not only improves precision and recall for SE tasks but may also improve users' satisfaction.

A Mechanism for Automatically Summarizing Software Functionality from Source Code

TLDR
A vectorizer is used to extract information from variable/method names and comments, and Latent Dirichlet Allocation is applied to cluster the source code files of a project into different semantic topics.

Comparison of Data Preprocessing Techniques on Software Sources for Topic Modeling

TLDR
An experiment compares four data preprocessing techniques; the results suggest only minor differences among them, which implies that software source code can be used as-is for topic modeling.

Modeling the evolution of development topics using Dynamic Topic Models

TLDR
Dynamic Topic Models (DTM) are used to analyze commit messages across a project's lifetime, capturing strength and content evolution simultaneously and helping developers better understand how their projects evolve.

A Systematic Comparison of Search Algorithms for Topic Modelling - A Study on Duplicate Bug Report Identification

TLDR
A systematic comparison of five different meta-heuristics used to configure LDA in the context of duplicate bug report identification shows that no master algorithm outperforms the others for all software projects, and that random search and PSO are the least effective meta-heuristics.
...

References

Validating the Use of Topic Models for Software Evolution

TLDR
This paper performs a qualitative case study on 12 releases of JHotDraw, a well-studied and documented system, and finds that topic evolutions can be characterized through spikes and drops in their metric values; the results encourage the use of topic models as a tool for analyzing the evolution of software.

Using IR methods for labeling source code artifacts: Is it worthwhile?

TLDR
Results indicate that clustering-based approaches (LSI and LDA) are much more worthwhile on source code artifacts with high verbosity, as well as on artifacts that require more effort to label manually.

Bug localization using latent Dirichlet allocation

Configuring latent Dirichlet allocation based feature location

TLDR
The key findings are that exclusion of comments and literals from the corpus lowers accuracy and that heuristics for selecting LDA parameter values in the natural language context are suboptimal in the source code context.

Assigning change requests to software developers

TLDR
The paper presents an approach to recommend a ranked list of expert developers to assist in the implementation of software change requests, and shows that the presented approach outperforms the alternatives it is compared against by a substantial margin.

What's hot and what's not: Windowed developer topic analysis

TLDR
This paper proposes windowing the topic analysis to give a more nuanced view of the system's evolution and demonstrates that windowed topic analysis offers advantages over topic analysis applied to a project's lifetime because many topics are quite local.

Information Retrieval Applications in Software Maintenance and Evolution

TLDR
The techniques described pertain to the maintenance and evolution phase of the software life cycle and focus on such problems as feature location and impact analysis, highlighting the bright future that IR brings to addressing software engineering problems.

Using Relational Topic Models to capture coupling among classes in object-oriented software systems

TLDR
A new coupling metric for object-oriented software systems is proposed, namely Relational Topic based Coupling (RTC) of classes, which uses Relational Topic Models (RTM), a generative probabilistic model, to capture latent topics in source code classes and the relationships among them.

A theory of aspects as latent topics

TLDR
This work provides not only a concrete approach for identifying aspects at several scales in an unsupervised manner but, more importantly, a formulation of AOP grounded in information theory.

An information retrieval process to aid in the analysis of code clones

TLDR
LSI is used to cluster clone classes that have been identified initially by a clone detection tool to detect trends and associations among the clustered clone classes and determine if they provide further comprehension to assist in the maintenance of clones.