Corpus ID: 239009824

Cascaded Fast and Slow Models for Efficient Semantic Code Search

Akhilesh Deepak Gotmare, Junnan Li, Shafiq R. Joty, Steven C. H. Hoi

The goal of natural language semantic code search is to retrieve a semantically relevant code snippet from a fixed set of candidates using a natural language query. Existing approaches are neither effective nor efficient enough for a practical semantic code search system. In this paper, we propose an efficient and accurate semantic code search framework with cascaded fast and slow models, in which a fast transformer encoder model is learned to optimize a scalable index for fast retrieval…
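
The abstract is cut off before it describes the slow model, so the following is only a minimal sketch of a generic fast-then-slow retrieval cascade, not the authors' implementation. It assumes the fast stage ranks precomputed snippet embeddings by cosine similarity and that a slower scorer then reranks the resulting shortlist. The names `cascaded_search`, `slow_score`, and `top_k`, along with the toy embeddings and the toy scorer, are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cascaded_search(
    query_vec: Sequence[float],
    index: List[Tuple[str, Sequence[float]]],
    slow_score: Callable[[str], float],
    top_k: int = 2,
) -> List[str]:
    """Hypothetical two-stage search: cheap shortlist first, expensive rerank second."""
    # Stage 1 (fast): score every candidate against the query using
    # precomputed embeddings, and keep only the top_k most similar snippets.
    shortlist = sorted(
        index, key=lambda item: cosine(query_vec, item[1]), reverse=True
    )[:top_k]
    # Stage 2 (slow): apply the expensive scorer to the shortlist only.
    reranked = sorted(shortlist, key=lambda item: slow_score(item[0]), reverse=True)
    return [snippet for snippet, _ in reranked]


# Toy data: (code snippet, made-up 2-d embedding).
index = [
    ("def add(a, b): return a + b", [1.0, 0.0]),
    ("def read_file(p): return open(p).read()", [0.0, 1.0]),
    ("def sum_list(xs): return sum(xs)", [0.9, 0.1]),
]

# Stand-in for a slow model: prefers snippets that mention "sum".
results = cascaded_search(
    query_vec=[1.0, 0.0],
    index=index,
    slow_score=lambda code: 1.0 if "sum" in code else 0.0,
    top_k=2,
)
print(results)
```

The point of the split is cost: the fast stage touches every candidate but does only a vector comparison, while the slow scorer runs on just `top_k` items, so `top_k` trades accuracy against latency.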

Deep Code Search
A novel deep neural network named CODEnn (Code-Description Embedding Neural Network) is proposed, which jointly embeds code snippets and natural language descriptions into a high-dimensional vector space, in such a way that a code snippet and its corresponding description have similar vectors.
When deep learning met code search
This paper assembled implementations of state-of-the-art techniques to run on a common platform, training and evaluation corpora, and introduced a new design point that is a minimal supervision extension to an existing unsupervised technique.
Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers
This work equips transformer-based models with a new fine-grained cross-attention architecture, providing significant improvements in retrieval accuracy whilst preserving scalability, and introduces a generic approach for combining a Fast dual encoder model with a Slow but accurate transformer-based model via distillation and reranking.
CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation
Comprehensive experiments show that CodeT5 significantly outperforms prior methods on understanding tasks such as code defect detection and clone detection, and generation tasks across various directions including PL-NL, NL-PL, and PL-PL.
SynCoBERT: Syntax-Guided Multi-Modal Contrastive Pre-Training for Code Representation
This paper proposes SYNCOBERT, a syntax-guided multi-modal contrastive pre-training approach for better code representations, and designs two novel pre-training objectives originating from the symbolic and syntactic properties of source code, i.e., Identifier Prediction (IP) and AST Edge Prediction (TEP).
Retrieval on source code: a neural code search
This paper investigates the use of natural language processing and information retrieval techniques to carry out natural language search directly over source code, i.e. without having a curated Q&A forum such as Stack Overflow at hand.
CoSQA: 20,000+ Web Queries for Code Search and Question Answering
This paper introduces the CoSQA dataset, which includes 20,604 labels for pairs of natural language queries and codes, each annotated by at least 3 human annotators, along with a contrastive learning method dubbed CoCLR to enhance query-code matching.
Attention is All you Need
A new simple network architecture, the Transformer, is proposed; it is based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
From Word Embeddings to Document Similarities for Improved Information Retrieval in Software Engineering
This paper proposes bridging the lexical gap by projecting natural language statements and code snippets as meaning vectors in a shared representation space and shows that the learned vector space embeddings lead to improvements in a previously explored bug localization task and a newly introduced task of linking API documents to computer programming questions.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A new language representation model, BERT, is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.