Multimodal Representation for Neural Code Search

  Jian Gu, Zimin Chen, Martin Monperrus. 2021 IEEE International Conference on Software Maintenance and Evolution (ICSME).
Semantic code search is about finding semantically relevant code snippets for a given natural language query. In state-of-the-art approaches, the semantic similarity between code and query is quantified as the distance between their representations in a shared vector space. In this paper, to improve the vector space, we introduce tree-serialization methods on a simplified form of the AST and build a multimodal representation for the code data. We conduct extensive experiments using a single… 
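The shared-vector-space retrieval described above can be sketched as follows; the toy 3-d vectors stand in for encoder outputs and are purely illustrative:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_snippets(query_vec, code_vecs):
    """Return snippet indices sorted by decreasing similarity to the query."""
    sims = [cosine_similarity(query_vec, c) for c in code_vecs]
    return sorted(range(len(code_vecs)), key=lambda i: sims[i], reverse=True)

# Toy 3-d embeddings standing in for encoder outputs.
query = np.array([1.0, 0.0, 1.0])
snippets = [np.array([0.9, 0.1, 1.1]),   # close to the query
            np.array([-1.0, 1.0, 0.0])]  # far from the query
print(rank_snippets(query, snippets))  # -> [0, 1]
```

At retrieval time, only the query must be embedded; snippet vectors can be precomputed and indexed.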

Cross-Modal Contrastive Learning for Code Search

This paper proposes CrossCS, a cross-modal contrastive learning method for code search to improve the representations of code and query by explicit fine-grained contrastive objectives and designs a novel and effective contrastive objective that considers not only the similarity between modalities, but also the similarity within modalities.
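A generic symmetric InfoNCE objective over paired (code, query) embeddings illustrates the cross-modal part of such a contrastive objective; this is a sketch, not CrossCS's actual loss, which additionally considers similarity within modalities:

```python
import numpy as np

def cross_modal_infonce(code_emb, query_emb, tau=0.07):
    """Symmetric InfoNCE over a batch of paired (code, query) embeddings.

    Matched pairs sit on the diagonal of the similarity matrix; the other
    entries in each row/column act as in-batch negatives.
    """
    code = code_emb / np.linalg.norm(code_emb, axis=1, keepdims=True)
    query = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    logits = code @ query.T / tau  # cross-modal similarities, pairs on diagonal

    def nll_diag(m):
        m = m - m.max(axis=1, keepdims=True)  # numerical stability
        logp = m - np.log(np.exp(m).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # average the code->query and query->code directions
    return (nll_diag(logits) + nll_diag(logits.T)) / 2
```

With well-separated matched pairs the loss approaches zero; hard in-batch negatives drive it up.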

Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation

This paper presents a new approach with multimodal contrastive learning and soft data augmentation for code search: it dynamically masks and replaces tokens in code sequences to generate code snippets that are similar, but not necessarily semantics-preserving, as positive samples for paired queries.
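A minimal sketch of this kind of soft augmentation; the mask/replace probabilities and replacement vocabulary here are hypothetical, not the paper's settings:

```python
import random

MASK = "<mask>"

def soft_augment(tokens, vocab, mask_prob=0.1, replace_prob=0.1, seed=None):
    """Dynamically mask or replace code tokens to build a positive sample.

    The output is similar to the input but not guaranteed to preserve
    semantics, mirroring the soft-augmentation idea described above.
    """
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        r = rng.random()
        if r < mask_prob:
            out.append(MASK)           # mask this token
        elif r < mask_prob + replace_prob:
            out.append(rng.choice(vocab))  # replace with a random vocab token
        else:
            out.append(tok)            # keep the original token
    return out
```

Each training step re-samples the masks, so the same snippet yields different positives over epochs.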

On the Effectiveness of Pretrained Models for API Learning

This work uses a dataset of 7 million annotations collected from GitHub to evaluate the effectiveness of recent pre-trained Transformer-based models (PTMs) on the API learning task, and identifies two different tokenization approaches that contribute a significant boost in PTMs' performance on the API sequence generation task.

On the Effectiveness of Transfer Learning for Code Search

It is demonstrated that natural language processing models based on the Transformer architecture can be directly applied to source code analysis tasks, such as code search, and the combined use of an information retrieval-based approach followed by a Transformer leads to the best results overall.

How to better utilize code graphs in semantic code search?

By converting code graphs into lossless sequences, G2SC makes it possible to address the small-graph learning problem with sequence feature learning and to capture both the edge and node attribute information of code graphs, so that the effectiveness of code search can be greatly improved.
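One simple lossless graph-to-sequence encoding (an illustration of the general idea, not G2SC's actual traversal) emits every node and labeled edge exactly once, so the original graph is recoverable from the sequence:

```python
def graph_to_sequence(nodes, edges):
    """Serialize a labeled code graph into a token sequence losslessly.

    nodes: {node_id: label}; edges: [(src, edge_label, dst)].
    Each node and each labeled edge is emitted exactly once, so the
    original graph can be reconstructed from the sequence.
    """
    seq = []
    for nid in sorted(nodes):
        seq += ["NODE", str(nid), nodes[nid]]
    for src, lbl, dst in edges:
        seq += ["EDGE", str(src), lbl, str(dst)]
    return seq
```

The resulting token sequence can then be fed to any sequence encoder, which is the point of such conversions.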

Exploring Representation-Level Augmentation for Code Search

This paper explores augmentation methods that augment data at the representation level, which requires no additional data processing or training, and proposes a general format of representation-level augmentation that unifies existing methods.
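Two common representation-level operators, mixup-style interpolation and Gaussian perturbation, illustrate what "augmenting at the representation level" can look like; the exact operators and hyperparameters here are assumptions, not the paper's:

```python
import numpy as np

def interpolate(reps, alpha=0.9, seed=0):
    """Mix each representation with a randomly chosen peer (mixup-style)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(reps))
    return alpha * reps + (1 - alpha) * reps[idx]

def perturb(reps, sigma=0.01, seed=0):
    """Add small Gaussian noise to each representation vector."""
    rng = np.random.default_rng(seed)
    return reps + rng.normal(0.0, sigma, reps.shape)
```

Both operate on already-computed embedding vectors, so no extra encoder passes over raw data are needed.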


Results from automated as well as human evaluation suggest that the inclusion of code context in search significantly improves the retrieval of the correct code snippet but slightly impairs ranking quality among code snippets.

Assemble Foundation Models for Automatic Code Summarization

This work assembles available foundation models, such as CodeBERT and GPT-2, into a single model named AdaMo, and uses Gaussian noise to simulate contextual information in order to optimize the latent representation.

Is neural machine translation approach accurate enough for coding assistance?

This paper proposes transcompiler-based back-translation, a data augmentation method that generates parallel corpora from numerous source code repositories; the resulting BLEU scores indicate that the proposed model is accurate enough to enable coding assistance in the future.

A Closer Look into Transformer-Based Code Intelligence Through Code Transformation: Challenges and Opportunities

The results suggest that a Transformer based on abstract syntax trees (ASTs) is more robust than a model based only on the code token sequence under most code transformations, and that the design of the positional encoding can impact the robustness of the Transformer under code transformation.

Deep Code Search

A novel deep neural network named CODEnn (Code-Description Embedding Neural Network) is proposed, which jointly embeds code snippets and natural language descriptions into a high-dimensional vector space such that a code snippet and its corresponding description have similar vectors.

Improving Code Search with Co-Attentive Representation Learning

Experimental results show that the proposed co-attentive representation learning model, CARLCS-CNN, significantly outperforms DeepCS by 26.72% in terms of MRR (mean reciprocal rank) and is five times faster than DeepCS in model training and four times faster in testing.
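MRR, the metric quoted above, averages the reciprocal rank of the first correct snippet across queries; a minimal sketch:

```python
def mean_reciprocal_rank(ranks):
    """MRR over a set of queries.

    `ranks` holds the 1-based rank of the first correct code snippet
    returned for each query.
    """
    return sum(1.0 / r for r in ranks) / len(ranks)

# Correct answers ranked 1st, 2nd, and 4th for three queries:
print(mean_reciprocal_rank([1, 2, 4]))  # (1 + 0.5 + 0.25) / 3 ≈ 0.5833
```

An MRR of 1.0 means every query's first result is correct, which is why relative MRR gains are the standard comparison in these papers.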

Multi-modal Attention Network Learning for Semantic Source Code Retrieval

Comprehensive experiments and analysis on a large-scale real-world dataset show that the proposed MMAN model can accurately retrieve code snippets and outperforms the state-of-the-art methods.

Retrieval on source code: a neural code search

This paper investigates the use of natural language processing and information retrieval techniques to carry out natural language search directly over source code, i.e. without having a curated Q&A forum such as Stack Overflow at hand.

When deep learning met code search

This paper assembles implementations of state-of-the-art techniques on a common platform with shared training and evaluation corpora, and introduces a new design point: a minimal-supervision extension to an existing unsupervised technique.

code2vec: learning distributed representations of code

A neural model that represents a code snippet as a single fixed-length continuous vector, which can be used to predict semantic properties of the snippet; it is the first to successfully predict method names based on a large, cross-project corpus.
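The aggregation step of such a model can be sketched as a softmax-weighted sum of path-context vectors into one fixed-length code vector; this is a simplification, since code2vec learns the attention weights and context embeddings jointly:

```python
import numpy as np

def aggregate_path_contexts(context_vecs, attn_scores):
    """Combine path-context vectors into one fixed-length code vector.

    context_vecs: (n_contexts, dim) array of path-context embeddings.
    attn_scores:  (n_contexts,) raw attention scores (here given, in
    code2vec computed against a learned attention vector).
    """
    w = np.exp(attn_scores - attn_scores.max())
    w = w / w.sum()                          # softmax over contexts
    return (w[:, None] * context_vecs).sum(axis=0)
```

The resulting vector has the same dimension regardless of how many path contexts the snippet contains, which is what makes it usable for downstream prediction.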

Learning Code-Query Interaction for Enhancing Code Searches

CQIL learns code-query interactions and uses a CNN (convolutional neural network) to compute semantic correlations between queries and code snippets, addressing the out-of-vocabulary (OOV), independent-similarity-matching, and small-training-dataset problems.

Two-Stage Attention-Based Model for Code Search with Textual and Structural Features

  • Ling Xu, Huanhuan Yang, Zhou Xu
  • 2021 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), 2021
A code search model, TabCS (Two-stage Attention-Based model for Code Search), is proposed; it leverages attention mechanisms to extract semantics from code and queries while accounting for their semantic gap, and learns better code/query representations.

Summarizing Source Code using a Neural Attention Model

This paper presents the first completely data-driven approach for generating high-level summaries of source code; it uses Long Short-Term Memory (LSTM) networks with attention to produce sentences that describe C# code snippets and SQL queries.