From Word Embeddings to Document Similarities for Improved Information Retrieval in Software Engineering

@article{Ye2016FromWE,
  title={From Word Embeddings to Document Similarities for Improved Information Retrieval in Software Engineering},
  author={Xin Ye and Hui Shen and Xiao Ma and Razvan C. Bunescu and Chang Liu},
  journal={2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE)},
  year={2016},
  pages={404-415}
}
  • Published 14 May 2016
The application of information retrieval techniques to search tasks in software engineering is made difficult by the lexical gap between search queries, usually expressed in natural language (e.g. English), and retrieved documents, usually expressed in code (e.g. programming languages). […] Key Method: In the proposed architecture, word embeddings are first trained on API documents, tutorials, and reference documents, and then aggregated in order to estimate semantic similarities between documents. Empirical…
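The key method described above, aggregating word embeddings into document vectors and comparing them, can be sketched minimally as follows. The tiny 3-dimensional vectors and vocabulary here are purely illustrative assumptions; in the paper, embeddings are trained on API documents and tutorials.

```python
import numpy as np

# Toy word embeddings (illustrative values only; real embeddings are
# trained on API documents, tutorials, and reference documents).
embeddings = {
    "read":  np.array([0.90, 0.10, 0.00]),
    "file":  np.array([0.80, 0.20, 0.10]),
    "open":  np.array([0.85, 0.15, 0.05]),
    "paint": np.array([0.00, 0.90, 0.30]),
    "color": np.array([0.10, 0.80, 0.40]),
}

def doc_vector(tokens):
    """Aggregate word embeddings into one document vector by averaging."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# A natural-language query is compared against two "documents":
query   = doc_vector(["open", "file"])
doc_io  = doc_vector(["read", "file"])
doc_gui = doc_vector(["paint", "color"])

# The I/O document is semantically closer to the query than the GUI one,
# even though the query and doc_io share only one literal token.
print(cosine(query, doc_io) > cosine(query, doc_gui))  # True
```

Averaging is the simplest aggregation; the paper explores richer ways of combining word-level similarities into document-level ones, but the lexical-gap-bridging effect is already visible here.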
Improving software text retrieval using conceptual knowledge in source code
TLDR
A novel approach for improving the retrieval of API learning resources through leveraging software-specific conceptual knowledge in software source code, which leads to at least 13.77% improvement with respect to mean average precision (MAP).
EmbSE: A Word Embeddings Model Oriented Towards Software Engineering Domain
TLDR
The results are promising, presenting a 48% improvement in mAP for EmbSE relative to a model trained on a generic corpus, which reinforces the hypothesis that a model of this nature can bring significant improvements to text classification in the area.
Crawling Wikipedia Pages to Train Word Embeddings Model for Software Engineering Domain
TLDR
A pre-trained word embeddings model for SE which captures and reflects the domain-specific meanings of SE-related words and can outperform the state-of-the-art word embeddings model trained on Google News in terms of its representational power.
Complementing global and local contexts in representing API descriptions to improve API retrieval tasks
TLDR
D2Vec is a neural network model that considers two complementary contexts to better capture the semantics of API documentation and demonstrates the usefulness and good performance in three applications: API code search, API tutorial fragment search, and mining API mappings between software libraries (code-to-code retrieval).
SCOR: Source Code Retrieval with Semantics and Order
  • Shayan A. Akbar, A. Kak
  • Computer Science
    2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)
  • 2019
TLDR
This work demonstrates that by combining word2vec with the power of MRF, it is possible to achieve improvements between 6% and 30% in retrieval accuracy over the best results obtained with more traditional applications of MRF to representations based on term and term-term frequencies.
Mapping Bug Reports to Relevant Source Code Files Based on the Vector Space Model and Word Embedding
TLDR
This paper proposes a new method that reduces the time required for bug localization by using the word vector to address the lexical gap between the programming language and natural language and shows that it outperforms classical IR-based methods in locating relevant source code files based on several indicators.
Helping developers search and locate task-relevant information in natural language documents
TLDR
It is hypothesized that it is possible to design a more generalizable approach that can identify, for a particular task, relevant text across different artifact types, establishing relationships between them and facilitating how developers search and locate task-relevant information.
Bridging Semantic Gaps between Natural Languages and APIs with Word Embedding
TLDR
This study proposes Word2API to effectively estimate the relatedness of words and APIs; a shuffling strategy is used to transform related words and APIs into tuples to address the alignment challenge.
Poster: Which Similarity Metric to Use for Software Documents?: A Study on Information Retrieval Based Software Engineering Tasks
TLDR
This work analyzes the performance of different similarity metrics on various SE documents and observes that, in general, context-aware IR models achieve better performance on textual artifacts, while simple keyword-based bag-of-words models perform better on code artifacts.
Combining Word Embedding with Information Retrieval to Recommend Similar Bug Reports
TLDR
The approach combines a traditional information retrieval technique and a word embedding technique, and takes bug titles and descriptions as well as bug product and component information into consideration, and improves the performance of NextBug statistically significantly and substantially for both projects.

References

SHOWING 1-10 OF 49 REFERENCES
Automatically mining software-based, semantically-similar words from comment-code mappings
TLDR
This paper presents an automatic technique to mine semantically similar words, particularly in the software context, and leverages the role of leading comments for methods and programmer conventions in writing them.
Automated construction of a software-specific word similarity database
  • Yuan Tian, D. Lo, J. Lawall
  • Computer Science
    2014 Software Evolution Week - IEEE Conference on Software Maintenance, Reengineering, and Reverse Engineering (CSMR-WCRE)
  • 2014
TLDR
An automated approach is proposed that builds a software-specific WordNet-like resource, named WordSimSEDB, by leveraging the textual content of posts on StackOverflow, computing the similarities of the weighted co-occurrences of these words with three types of words in the textual corpus.
Inferring semantically related software terms and their taxonomy by leveraging collaborative tagging
TLDR
A similarity metric is proposed to infer semantically related terms, each of which is a tag, and a taxonomy that further describes the relationships among these terms is built; the resulting taxonomy is reasonably good.
Inferring semantically related words from software context
  • Jinqiu Yang, Lin Tan
  • Computer Science
    2012 9th IEEE Working Conference on Mining Software Repositories (MSR)
  • 2012
TLDR
This paper proposes a simple and general technique to automatically infer semantically related words in software by leveraging the context of words in comments and code and achieves a reasonable accuracy in seven large and popular code bases written in C and Java.
Automatic query reformulations for text retrieval in software engineering
TLDR
A recommender (called Refoqus) based on machine learning is proposed, which is trained with a sample of queries and relevant results and automatically recommends a reformulation strategy that should improve its performance, based on the properties of the query.
SEWordSim: software-specific word similarity database
TLDR
In recent work, a word similarity resource based on information collected automatically from StackOverflow is proposed, and it is found that its results are given scores on a 3-point Likert scale that are over 50% higher than the results of a resource based on WordNet.
Retrieval from software libraries for bug localization: a comparative study of generic and composite text models
TLDR
A major conclusion of this comparative study is that simple text models such as UM and VSM are more effective at correctly retrieving the relevant files from a library as compared to more sophisticated models such as LDA.
Linguistic Regularities in Continuous Space Word Representations
TLDR
The vector-space word representations that are implicitly learned by the input-layer weights are found to be surprisingly good at capturing syntactic and semantic regularities in language, and that each relationship is characterized by a relation-specific vector offset.
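The relation-specific vector offset described above is the familiar analogy property of word embeddings. A minimal sketch, using tiny hand-made 2-dimensional vectors (the values are assumptions for illustration; real models learn such vectors from large corpora):

```python
import numpy as np

# Hand-made vectors arranged so that the "gender" relation is a
# near-constant offset along the second dimension (illustrative only).
vecs = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, 0.2]),
    "man":   np.array([0.1, 0.8]),
    "woman": np.array([0.1, 0.2]),
}

# king - man + woman should land near queen if the offset is consistent.
target = vecs["king"] - vecs["man"] + vecs["woman"]

def nearest(v, exclude):
    """Return the vocabulary word closest to v, skipping the query words."""
    return min((w for w in vecs if w not in exclude),
               key=lambda w: np.linalg.norm(vecs[w] - v))

print(nearest(target, {"king", "man", "woman"}))  # queen
```

In trained models the offsets are only approximately constant, so the nearest neighbor is found in a high-dimensional space over a full vocabulary; the mechanism is the same.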
Leveraging usage similarity for effective retrieval of examples in code repositories
TLDR
This paper presents Structural Semantic Indexing (SSI), a technique to associate words to source code entities based on similarities of API usage to show that entities that show similar uses of APIs are semantically related because they do similar things.
Corpus-based and Knowledge-based Measures of Text Semantic Similarity
TLDR
This paper shows that the semantic similarity method outperforms methods based on simple lexical matching, resulting in up to a 13% error rate reduction with respect to the traditional vector-based similarity metric.