Corpus ID: 233289449

BERT2Code: Can Pretrained Language Models be Leveraged for Code Search?

Authors: Abdullah Al Ishtiaq, Masum Hasan, Md. Mahim Anjum Haque, Kazi Sajeed Mehrab, Tanveer Muttaqueen, Tahmid Hasan, Anindya Iqbal, Rifat Shahriyar
Millions of repetitive code snippets are submitted to code repositories every day. The ability to search these large codebases using simple natural language queries would allow programmers to ideate, prototype, and develop more easily and quickly. Although existing methods perform well when the natural language description contains keywords from the code [21], they still fall short at searching code based on the semantic meaning of the natural language query and…
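As a toy illustration of the embedding-based retrieval idea behind this line of work, the sketch below ranks code snippets against a natural language query by cosine similarity. A bag-of-words counter stands in for a real neural encoder such as BERT or CodeBERT, and all snippet and function names are hypothetical; this is a minimal sketch, not the paper's actual pipeline.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; a real system would use a neural
    # encoder such as BERT/CodeBERT here.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(query, snippets):
    # Rank snippets by similarity between query and snippet embeddings.
    q = embed(query)
    return sorted(snippets, key=lambda s: cosine(q, embed(s)), reverse=True)

snippets = [
    "def reverse_string(s): return s[::-1]",
    "def read_lines(path): return open(path).readlines()",
    "def sort_ascending(xs): return sorted(xs)",
]
print(search("reverse a string", snippets)[0])  # the slicing one-liner ranks first
```

In a real system the lexical `embed` above is exactly what gets replaced by a learned encoder, so that "reverse a string" can match code sharing no keywords with the query.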

References

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Sentence-BERT (SBERT), a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity, is presented.
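The siamese/triplet training objective mentioned above can be sketched in a few lines. The example below is a minimal, framework-free illustration of a triplet margin loss on toy 2-D vectors; actual SBERT training applies this kind of objective to BERT-encoded sentences, and the values here are purely illustrative.

```python
def triplet_loss(anchor, positive, negative, margin=1.0):
    # Hinge-style triplet objective: the positive example should sit
    # closer to the anchor than the negative, by at least `margin`.
    dist = lambda u, v: sum((a - b) ** 2 for a, b in zip(u, v))
    return max(0.0, dist(anchor, positive) - dist(anchor, negative) + margin)

# Toy 2-D "sentence embeddings": the positive is near the anchor.
print(triplet_loss([0.0, 0.0], [0.1, 0.0], [2.0, 0.0]))  # -> 0.0 (constraint satisfied)
print(triplet_loss([0.0, 0.0], [2.0, 0.0], [0.1, 0.0]))  # positive loss: order violated
```

Minimizing this loss over many (anchor, positive, negative) sentence triples is what pushes semantically similar sentences together in the embedding space.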
CodeSearchNet Challenge: Evaluating the State of Semantic Code Search
The methodology used to obtain the corpus and expert labels is described, along with a number of simple baseline solutions for the task.
code2vec: learning distributed representations of code
A neural model for representing snippets of code as continuous distributed vectors: a snippet is represented as a single fixed-length code vector that can be used to predict semantic properties of the snippet, making this the first model to successfully predict method names over a large, cross-project corpus.
Billion-Scale Similarity Search with GPUs
This paper proposes a novel design for k-selection on GPUs that enables the construction of high-accuracy brute-force, approximate, and compressed-domain search based on product quantization, and applies it in different similarity search scenarios.
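For context, the exact baseline that such GPU libraries accelerate and approximate is plain exhaustive nearest-neighbour search. The sketch below (pure Python, toy scale, hypothetical names) returns the k closest vectors by squared Euclidean distance; Faiss-style indexes trade a little accuracy for orders-of-magnitude speed at billion scale.

```python
import heapq
import random

def top_k(query, index, k=3):
    # Brute-force k-nearest-neighbour search by squared L2 distance;
    # this is the exact-but-slow baseline that approximate indexes replace.
    dists = ((sum((q - x) ** 2 for q, x in zip(query, vec)), i)
             for i, vec in enumerate(index))
    return heapq.nsmallest(k, dists)

random.seed(0)
index = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(1000)]
hits = top_k(index[42], index, k=3)
print(hits[0][1])  # -> 42: querying with a stored vector returns it first (distance 0)
```

The generator keeps memory flat, but the scan is still O(n) per query, which is why compressed and partitioned indexes matter once n reaches the billions.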
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
When deep learning met code search
This paper assembled implementations of state-of-the-art techniques to run on a common platform with shared training and evaluation corpora, and introduced a new design point: a minimal-supervision extension to an existing unsupervised technique.
Retrieval on source code: a neural code search
This paper investigates the use of natural language processing and information retrieval techniques to carry out natural language search directly over source code, i.e. without having a curated Q&A forum such as Stack Overflow at hand.
CodeBERT: A Pre-Trained Model for Programming and Natural Languages
This work develops CodeBERT with a Transformer-based neural architecture and trains it with a hybrid objective function that incorporates the pre-training task of replaced token detection, i.e., detecting plausible alternatives sampled from generators.
Semantic code search using Code2Vec: A bag-of-paths model
This thesis uses code2vec, a model that learns distributed representations of source code called code embeddings, for the task of semantically searching code snippets, and creates a hybrid model that outperforms previous baseline models from the CodeSearchNet challenge.
The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing Machine Learning One Concept at a Time, http:// (2018)