Semantic clustering: Identifying topics in source code

@article{Kuhn2007SemanticCI,
  title={Semantic clustering: Identifying topics in source code},
  author={Adrian Kuhn and St{\'e}phane Ducasse and Tudor G{\^i}rba},
  journal={Inf. Softw. Technol.},
  year={2007},
  volume={49},
  pages={230--243}
}
Discrete Characterization of Domain Using Semantic Clustering
TLDR
This paper proposes mapping the domain to the code by applying information retrieval techniques to linguistic information, such as identifier names and comments in source code, in order to understand software as a whole.
Extracting High-Level Concepts from Open-Source Systems
TLDR
This paper extracts topic models from the textual content of source code by conducting a case study on the source code of Java-based open-source systems, ArgoUML, Checkstyle, JHotDraw and jEdit, and investigates the effectiveness of LDA in comprehending large open-source software systems.
On the Effect of Semantically Enriched Context Models on Software Modularization
TLDR
The proposed approach, which introduces a context model for source code identifiers, paves the way for building tools that support developers in program comprehension tasks such as application and domain concept location, software modularization and topic analysis.
Identifying domain expertise of developers from source code
TLDR
The analysis first derives documents from source code by discarding all the programming language constructs, and KMeans clustering is further used to cluster documents and extract closely related concepts.
Topic modeling of public repositories at scale using names in source code
TLDR
The goal of this paper is to apply topic modeling to names used in over 13.6 million repositories and to examine the inferred topics through data analysis, together with open access to the source code, tools and datasets.
Investigating the use of lexical information for software system clustering
TLDR
This paper explores the contribution of the combined use of six different dictionaries corresponding to the six parts of the source code where programmers introduce lexical information, namely: class, attribute, method and parameter names, comments, and source code statements.
Estimating Semantic Relatedness in Source Code
TLDR
Normalized Software Distance (nsd), an information-theoretic method that captures semantic relatedness in source code by exploiting the distributional cues of code terms across the system, is proposed.
Identifying Semantic Outliers of Source Code Artifacts and Their Application to Software Architecture Recovery
TLDR
A novel measure Conceptual Conformity (CC) is proposed, which computes the similarity between two latent topic distributions obtained from both the source code and its package, and is used to identify source code that is not relevant to the package’s semantic context and define it as a semantic outlier.
Supporting program comprehension with program summarization
TLDR
This paper proposes using latent semantic indexing and clustering to group source artifacts with similar vocabulary in order to analyze the composition of each package in the program, and employs Minipar, a natural language parser, to help generate the summaries.
Using Developers Contributions on Software Vocabularies to Identify Experts
TLDR
Results confirm that similarity between vocabularies can be used to point out code experts and, for orphaned entities, to recommend the current team member whose vocabulary is closest to the entity.

References

Semantic Clustering: Making Use of Linguistic Information to Reveal Concepts in So
TLDR
Semantic Clustering is introduced, an algorithm to group source artifacts based on how they use similar terms, which works at the source code textual level which makes it language independent.
Enriching reverse engineering with semantic clustering
TLDR
This paper analyzes how the semantics of the source code are spread over the source artifacts using latent semantic indexing, an information retrieval technique that clusters artifacts that use similar terms, and reveals the most relevant terms for the computed clusters.
Using latent semantic analysis to identify similarities in source code to support program understanding
  • J. Maletic, A. Marcus
  • Computer Science
    Proceedings 12th IEEE International Conference on Tools with Artificial Intelligence. ICTAI 2000
  • 2000
TLDR
The paper describes the results of applying Latent Semantic Analysis (LSA), an advanced information retrieval method, to program source code and associated documentation to assist in the understanding of a nontrivial software system, namely a version of Mosaic.
Recovering Traceability Links between Code and Documentation
TLDR
A probabilistic and a vector space information retrieval model are applied in two case studies, tracing C++ source code onto manual pages and Java code to functional requirements, to recover traceability links between source code and free text documents.
Identification of high-level concept clones in source code
  • A. Marcus, J. Maletic
  • Computer Science
    Proceedings 16th Annual International Conference on Automated Software Engineering (ASE 2001)
  • 2001
TLDR
The intention of the approach is to enhance and augment existing clone detection methods that are based on structural analysis and improve the quality of clone detection.
Extracting concepts from file names; a new file clustering criterion
TLDR
This work discusses techniques for extracting concepts (abbreviations) from a more informal source of information: file names and shows by experiment that the techniques proposed allow about 90% of the abbreviations to be found automatically.
MUDABlue: an automatic categorization system for open source repositories
Recovering documentation-to-source-code traceability links using latent semantic indexing
  • A. Marcus, J. Maletic
  • Computer Science
    25th International Conference on Software Engineering, 2003. Proceedings.
  • 2003
TLDR
The method presented gives good results by comparison and is additionally a low-cost, highly flexible method to apply with regard to preprocessing and/or parsing of the source code and documentation.
An information retrieval approach to concept location in source code
TLDR
This work addresses the problem of concept location using an advanced information retrieval method, Latent Semantic Indexing (LSI), used to map concepts expressed in natural language by the programmer to the relevant parts of the source code.
The conceptual cohesion of classes
  • A. Marcus, D. Poshyvanyk
  • Computer Science
    21st IEEE International Conference on Software Maintenance (ICSM'05)
  • 2005
TLDR
A new set of measures for the cohesion of individual classes within an OO software system is proposed, based on the analysis of the semantic information embedded in the source code, such as comments and identifiers.