Natural Language Processing Based Detection of Duplicate Defect Patterns

Abstract

A defect pattern repository collects different kinds of defect patterns, which are general descriptions of the characteristics of commonly occurring software code defects. Defect patterns can be used by programmers, static defect analysis tools, and even runtime verification. Following the idea of Web 2.0, defect pattern repositories allow these users to submit the defect patterns they have found. However, the submission of duplicate patterns leads to redundancy in the repository. This paper introduces an approach that suggests potential duplicates based on natural language processing. Our approach first computes field similarities using the Vector Space Model, then employs Information Entropy to determine the importance of each field, and finally combines the field similarities into an overall defect pattern similarity. Two strategies are introduced to make the approach adaptive to special situations. Groups of duplicates are then obtained by Hierarchical Clustering. Our evaluation indicates that the approach detects most of the actual duplicates in the repository (72% in our experiment).
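The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the field names (`name`, `description`), the smoothed TF-IDF weighting, the use of each field's token entropy as its weight, the single-linkage criterion, and the 0.5 threshold are all assumptions chosen for illustration, since the abstract does not specify them. The two adaptive strategies mentioned in the abstract are not modeled.

```python
import math
from collections import Counter
from itertools import combinations

def tokenize(text):
    return text.lower().split()

def tfidf_vectors(texts):
    """Vector Space Model: one sparse TF-IDF vector (term -> weight) per text."""
    docs = [Counter(tokenize(t)) for t in texts]
    n = len(docs)
    df = Counter(term for d in docs for term in d)
    # Smoothed IDF (an assumption) so terms present in every text keep some weight.
    return [{t: tf * (math.log((1 + n) / (1 + df[t])) + 1) for t, tf in d.items()}
            for d in docs]

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def field_entropy(texts):
    """Shannon entropy of a field's token distribution over the repository."""
    counts = Counter(tok for t in texts for tok in tokenize(t))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values()) if total else 0.0

def pattern_similarity(patterns, fields):
    """Combine per-field cosine similarities using entropy-derived weights."""
    vecs = {f: tfidf_vectors([p[f] for p in patterns]) for f in fields}
    ent = {f: field_entropy([p[f] for p in patterns]) for f in fields}
    total = sum(ent.values()) or 1.0
    weights = {f: ent[f] / total for f in fields}  # assumed weighting scheme
    return {(i, j): sum(weights[f] * cosine(vecs[f][i], vecs[f][j]) for f in fields)
            for i, j in combinations(range(len(patterns)), 2)}

def cluster(n, sim, threshold):
    """Agglomerative (single-linkage) clustering; stops below the threshold."""
    clusters = [{i} for i in range(n)]

    def link(a, b):
        return max(sim[(min(i, j), max(i, j))] for i in a for j in b)

    while len(clusters) > 1:
        a, b = max(combinations(range(len(clusters)), 2),
                   key=lambda ab: link(clusters[ab[0]], clusters[ab[1]]))
        if link(clusters[a], clusters[b]) < threshold:
            break
        clusters[a] |= clusters[b]  # a < b, so deleting b keeps index a valid
        del clusters[b]
    return clusters

# Hypothetical repository entries, invented for this example.
patterns = [
    {"name": "null pointer dereference", "description": "object used without null check"},
    {"name": "null pointer dereference", "description": "object used without a null check"},
    {"name": "resource leak", "description": "stream opened but never closed"},
]
sim = pattern_similarity(patterns, ["name", "description"])
groups = cluster(len(patterns), sim, threshold=0.5)
print(groups)
```

Each group with more than one member is a candidate set of duplicates to show to a submitter or repository maintainer. In practice the threshold would need tuning against labeled duplicates.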

DOI: 10.1109/COMPSACW.2010.45

Cite this paper

@article{Wu2010NaturalLP, title={Natural Language Processing Based Detection of Duplicate Defect Patterns}, author={Qian Wu and Qianxiang Wang}, journal={2010 IEEE 34th Annual Computer Software and Applications Conference Workshops}, year={2010}, pages={220-225} }