From Word Embeddings to Document Similarities for Improved Information Retrieval in Software Engineering
@article{Ye2016FromWE,
  title={From Word Embeddings to Document Similarities for Improved Information Retrieval in Software Engineering},
  author={Xin Ye and Hui Shen and Xiao Ma and Razvan C. Bunescu and Chang Liu},
  journal={2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE)},
  year={2016},
  pages={404-415}
}
The application of information retrieval techniques to search tasks in software engineering is made difficult by the lexical gap between search queries, usually expressed in natural language (e.g. English), and retrieved documents, usually expressed in code (e.g. programming languages). […] In the proposed architecture, word embeddings are first trained on API documents, tutorials, and reference documents, and then aggregated in order to estimate semantic similarities between documents. Empirical…
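The aggregation step described in the abstract can be illustrated with a minimal sketch. It assumes one common aggregation scheme (average best-match cosine similarity in each direction, then the mean of the two directions); the abstract does not specify the exact formula, so this is not necessarily the authors' method. The function names and the toy 3-dimensional vectors in `EMB` are hypothetical and hand-made for illustration; real embeddings would be trained on a corpus such as API documents and tutorials.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return float(np.dot(u, v) / (nu * nv))


def directed_similarity(source, target, emb):
    """Average, over known words in `source`, of the best cosine match in `target`."""
    src = [w for w in source if w in emb]
    tgt = [w for w in target if w in emb]
    if not src or not tgt:
        return 0.0
    best = [max(cosine(emb[s], emb[t]) for t in tgt) for s in src]
    return sum(best) / len(src)


def document_similarity(doc_a, doc_b, emb):
    """Symmetric score: mean of the two directed similarities."""
    return 0.5 * (
        directed_similarity(doc_a, doc_b, emb)
        + directed_similarity(doc_b, doc_a, emb)
    )


# Hypothetical toy embeddings (assumption: hand-made, not trained on any corpus).
EMB = {
    "open": np.array([1.0, 0.0, 0.0]),
    "fopen": np.array([0.9, 0.1, 0.0]),
    "file": np.array([0.0, 1.0, 0.0]),
    "path": np.array([0.1, 0.9, 0.0]),
    "color": np.array([0.0, 0.0, 1.0]),
}

query = ["open", "file"]      # natural-language side
code_doc = ["fopen", "path"]  # code side, no exact word overlap with the query
unrelated = ["color"]

print(document_similarity(query, code_doc, EMB))
print(document_similarity(query, unrelated, EMB))
```

In this toy setup the query and the code document share no tokens, so an exact-match model would score them as unrelated, while the embedding-based score ranks the code document above the unrelated one. That is the lexical-gap effect the abstract refers to.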
209 Citations
Improving software text retrieval using conceptual knowledge in source code
- Computer Science
- 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE)
- 2017
A novel approach for improving the retrieval of API learning resources through leveraging software-specific conceptual knowledge in software source code, which leads to at least 13.77% improvement with respect to mean average precision (MAP).
EmbSE: A Word Embeddings Model Oriented Towards Software Engineering Domain
- Computer Science
- SBES
- 2019
The results are promising, showing a 48% improvement in mAP for EmbSE over the model trained on the generic corpus, which reinforces the hypothesis that a model of this nature can bring significant improvements in the classification of texts in the area.
Crawling Wikipedia Pages to Train Word Embeddings Model for Software Engineering Domain
- Computer Science
- ISEC
- 2021
A pre-trained word embeddings model for SE which captures and reflects the domain-specific meanings of SE-related words and can outperform the state-of-the-art word embeddings model trained on Google News in terms of its representational power.
Complementing global and local contexts in representing API descriptions to improve API retrieval tasks
- Computer Science
- ESEC/SIGSOFT FSE
- 2018
D2Vec is a neural network model that considers two complementary contexts to better capture the semantics of API documentation and demonstrates the usefulness and good performance in three applications: API code search, API tutorial fragment search, and mining API mappings between software libraries (code-to-code retrieval).
SCOR: Source Code Retrieval with Semantics and Order
- Computer Science
- 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)
- 2019
This work demonstrates that by combining word2vec with the power of MRF, it is possible to achieve improvements between 6% and 30% in retrieval accuracy over the best results that can be obtained with the more traditional applications of MRF to representations based on term and term-term frequencies.
Mapping Bug Reports to Relevant Source Code Files Based on the Vector Space Model and Word Embedding
- Computer Science
- IEEE Access
- 2019
This paper proposes a new method that reduces the time required for bug localization by using the word vector to address the lexical gap between the programming language and natural language and shows that it outperforms classical IR-based methods in locating relevant source code files based on several indicators.
Helping developers search and locate task-relevant information in natural language documents
- Computer Science
- ESEC/SIGSOFT FSE
- 2019
It is hypothesized that it is possible to design a more generalizable approach that can identify, for a particular task, relevant text across different artifact types establishing relationships between them and facilitating how developers search and locate task-relevant information.
Bridging Semantic Gaps between Natural Languages and APIs with Word Embedding
- Computer Science
- IEEE Transactions on Software Engineering
- 2020
This study proposes Word2API to effectively estimate relatedness of words and APIs, and a shuffling strategy is used to transform related words and API into tuples to address the alignment challenge.
Poster: Which Similarity Metric to Use for Software Documents?: A Study on Information Retrieval Based Software Engineering Tasks
- Computer Science
- 2018 IEEE/ACM 40th International Conference on Software Engineering: Companion (ICSE-Companion)
- 2018
This work analyzes the performance of different similarity metrics on various SE documents and observes that, in general, the context-aware IR models achieve better performance on textual artifacts and simple keyword-based bag-of-words models perform better in code artifacts.
Combining Word Embedding with Information Retrieval to Recommend Similar Bug Reports
- Computer Science
- 2016 IEEE 27th International Symposium on Software Reliability Engineering (ISSRE)
- 2016
The approach combines a traditional information retrieval technique and a word embedding technique, and takes bug titles and descriptions as well as bug product and component information into consideration, and improves the performance of NextBug statistically significantly and substantially for both projects.
References
Automatically mining software-based, semantically-similar words from comment-code mappings
- Computer Science
- 2013 10th Working Conference on Mining Software Repositories (MSR)
- 2013
This paper presents an automatic technique to mine semantically similar words, particularly in the software context, and leverages the role of leading comments for methods and programmer conventions in writing them.
Automated construction of a software-specific word similarity database
- Computer Science
- 2014 Software Evolution Week - IEEE Conference on Software Maintenance, Reengineering, and Reverse Engineering (CSMR-WCRE)
- 2014
An automated approach is proposed that builds a software-specific WordNet like resource, named WordSimSEDB, by leveraging the textual contents of posts in StackOverflow by computing the similarities of the weighted co-occurrences of these words with three types of words in the textual corpus.
Inferring semantically related software terms and their taxonomy by leveraging collaborative tagging
- Computer Science
- 2012 28th IEEE International Conference on Software Maintenance (ICSM)
- 2012
A similarity metric to infer semantically related terms, each of which is a tag, is proposed and a taxonomy that could further describe the relationships among these terms is built, which is reasonably good.
Inferring semantically related words from software context
- Computer Science
- 2012 9th IEEE Working Conference on Mining Software Repositories (MSR)
- 2012
This paper proposes a simple and general technique to automatically infer semantically related words in software by leveraging the context of words in comments and code and achieves a reasonable accuracy in seven large and popular code bases written in C and Java.
Automatic query reformulations for text retrieval in software engineering
- Computer Science
- 2013 35th International Conference on Software Engineering (ICSE)
- 2013
A recommender (called Refoqus) based on machine learning is proposed, which is trained with a sample of queries and relevant results and automatically recommends a reformulation strategy that should improve its performance, based on the properties of the query.
SEWordSim: software-specific word similarity database
- Computer Science
- ICSE Companion
- 2014
In recent work, a word similarity resource based on information collected automatically from StackOverflow is proposed, and it is found that the results are given scores on a 3-point Likert scale that are over 50% higher than the results of a resource based on WordNet.
Retrieval from software libraries for bug localization: a comparative study of generic and composite text models
- Computer Science
- MSR '11
- 2011
A major conclusion of this comparative study is that simple text models such as UM and VSM are more effective at correctly retrieving the relevant files from a library than more sophisticated models such as LDA.
Linguistic Regularities in Continuous Space Word Representations
- Computer Science
- NAACL
- 2013
The vector-space word representations that are implicitly learned by the input-layer weights are found to be surprisingly good at capturing syntactic and semantic regularities in language, and that each relationship is characterized by a relation-specific vector offset.
Leveraging usage similarity for effective retrieval of examples in code repositories
- Computer Science
- FSE '10
- 2010
This paper presents Structural Semantic Indexing (SSI), a technique to associate words to source code entities based on similarities of API usage to show that entities that show similar uses of APIs are semantically related because they do similar things.
Corpus-based and Knowledge-based Measures of Text Semantic Similarity
- Computer Science
- AAAI
- 2006
This paper shows that the semantic similarity method out-performs methods based on simple lexical matching, resulting in up to 13% error rate reduction with respect to the traditional vector-based similarity metric.