CodeAttack: Code-based Adversarial Attacks for Pre-Trained Programming Language Models

Akshita Jha and Chandan K. Reddy
Pre-trained programming language (PL) models (such as CodeT5, CodeBERT, and GraphCodeBERT) have the potential to automate software engineering tasks involving code understanding and code generation. However, these models operate in the natural channel of code, i.e., they are primarily concerned with the human understanding of the code. They are not robust to changes in the input and are thus potentially susceptible to adversarial attacks in the natural channel. We propose CodeAttack, a…


Stealthy Backdoor Attack for Code Models

The proposed AFRAIDOOR exposes security weaknesses in code models under stealthy backdoor attacks and shows that the state-of-the-art defense method cannot provide sufficient protection.

Transformers Meet Directed Graphs

This work proposes two direction- and structure-aware positional encodings for directed graphs: the eigenvectors of the Magnetic Laplacian, a direction-aware generalization of the combinatorial Laplacian, and directional random walk encodings.
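The Magnetic Laplacian encoding can be illustrated on a toy directed graph. A minimal sketch, assuming a dense 0/1 adjacency matrix and a hand-picked potential `q` (a hyperparameter the paper tunes); the phase term encodes edge direction while keeping the matrix Hermitian, so its eigenvalues stay real:

```python
import numpy as np

def magnetic_laplacian(A, q=0.25):
    """A: dense directed adjacency matrix (n x n, 0/1 entries)."""
    A_sym = 0.5 * (A + A.T)              # symmetrized adjacency
    D = np.diag(A_sym.sum(axis=1))       # degrees of the symmetrized graph
    theta = 2.0 * np.pi * q * (A - A.T)  # antisymmetric phase: encodes direction
    H = A_sym * np.exp(1j * theta)       # Hermitian "magnetic" adjacency
    return D - H                         # Magnetic Laplacian (Hermitian)

# Directed 3-cycle: 0 -> 1 -> 2 -> 0
A = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
L = magnetic_laplacian(A)
# eigh is for Hermitian matrices: real eigenvalues, complex eigenvectors,
# which serve as the direction-aware positional encodings.
eigvals, eigvecs = np.linalg.eigh(L)
```

Reversing an edge changes the phases and hence the eigenvectors, which is exactly what makes the encoding direction-aware.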

ReCode: Robustness Evaluation of Code Generation Models

This paper proposes ReCode, a comprehensive robustness evaluation benchmark for code generation models, with over 30 customizable transformations for code covering docstrings, function and variable names, code syntax, and code format, providing a multifaceted assessment of a model’s robustness.



Generating Adversarial Examples for Holding Robustness of Source Code Processing Models

A Metropolis-Hastings sampling-based identifier renaming technique is proposed, which generates adversarial examples for DL models specialized for source code processing, and the experiments confirm the usefulness of the method for future fully automated source code processing.
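The core idea of Metropolis-Hastings identifier renaming can be sketched in a few lines. This is an illustrative simplification, not the paper's implementation: `score` stands in for the victim model's loss on the target label (higher means more adversarial), and the candidate pool and step count are made up for the example:

```python
import random
import re

def mh_rename(code, identifier, candidates, score, steps=50, seed=0):
    """Metropolis-Hastings search over semantics-preserving identifier renames."""
    rng = random.Random(seed)
    current = identifier
    current_score = score(code)
    for _ in range(steps):
        proposal = rng.choice(candidates)
        # Whole-word substitution keeps the program's semantics intact.
        new_code = re.sub(rf"\b{re.escape(current)}\b", proposal, code)
        new_score = score(new_code)
        # Metropolis acceptance: always take improvements, occasionally
        # accept worse proposals to escape local optima.
        ratio = new_score / max(current_score, 1e-9)
        if rng.random() < min(1.0, ratio):
            code, current, current_score = new_code, proposal, new_score
    return code

# Toy usage: the stand-in "model" simply dislikes long programs.
snippet = "def add(x, y):\n    return x + y"
adv = mh_rename(snippet, "x", ["tmp", "value", "accumulator"],
                score=lambda c: len(c))
```

Because only a bound identifier is renamed consistently, the adversarial snippet still compiles and computes the same function as the original.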

Natural Attack for Pre-trained Models of Code

This paper proposes ALERT (Naturalness Aware Attack), a black-box attack that adversarially transforms inputs to make victim models produce wrong outputs and investigates the value of the generated adversarial examples to harden victim models through an adversarial fine-tuning procedure.

Semantic Robustness of Models of Source Code

This work defines a powerful and generic adversary that can employ sequences of parametric, semantics-preserving program transformations to facilitate training robust models, and explores how one can train models that are robust to adversarial program transformations.

NatGen: generative pre-training by “naturalizing” source code

This paper proposes a new pre-training objective, “naturalizing” source code, which exploits code’s bimodal, dual-channel (formal and natural) nature: six classes of semantics-preserving transformations introduce unnatural forms of code, and the model is forced to reproduce the more natural original programs written by developers.
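One way to picture a semantics-preserving "de-naturalizing" transformation in the spirit of this objective: rewrite an idiomatic for-loop into an equivalent but less natural while-loop. The concrete pair below is an invented example, not one of the paper's transformations; the pre-training input would be the unnatural form and the target the natural original:

```python
# Natural, idiomatic form (the pre-training target).
natural = """\
total = 0
for i in range(n):
    total += i
"""

# Unnatural but equivalent form (the pre-training input).
unnatural = """\
total = 0
i = 0
while i < n:
    total += i
    i += 1
"""

def run(src, n):
    """Execute a snippet with a given n and return the computed total."""
    env = {"n": n}
    exec(src, env)
    return env["total"]

# Both forms compute the same result, so the transformation preserves semantics.
assert run(natural, 5) == run(unnatural, 5) == 10
```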

Adversarial Robustness of Deep Code Comment Generation

This paper proposes ACCENT (Adversarial Code Comment gENeraTor), an identifier substitution approach to craft adversarial code snippets, which are syntactically correct and semantically close to the original code snippet, but may mislead the DNNs to produce completely irrelevant code comments.

CodeBERT: A Pre-Trained Model for Programming and Natural Languages

This work develops CodeBERT with Transformer-based neural architecture, and trains it with a hybrid objective function that incorporates the pre-training task of replaced token detection, which is to detect plausible alternatives sampled from generators.
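The replaced-token-detection objective can be sketched as a data-construction step: corrupt some positions and label each token as kept or replaced. A minimal sketch only, with a uniform random "generator" standing in for CodeBERT's learned generator, and a toy vocabulary:

```python
import random

def make_rtd_example(tokens, vocab, replace_prob=0.3, seed=0):
    """Build a (corrupted tokens, per-token labels) pair for RTD training."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < replace_prob:
            # A plausible alternative that differs from the original token.
            alt = rng.choice([v for v in vocab if v != tok])
            corrupted.append(alt)
            labels.append(1)   # replaced
        else:
            corrupted.append(tok)
            labels.append(0)   # original
    return corrupted, labels

tokens = ["def", "add", "(", "x", ",", "y", ")", ":"]
vocab = tokens + ["sub", "z", "[", "]"]
corrupted, labels = make_rtd_example(tokens, vocab)
```

The discriminator is then trained to predict `labels` from `corrupted`, a binary decision at every position rather than only at masked ones.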

DOBF: A Deobfuscation Pre-Training Objective for Programming Languages

A new pre-training objective, DOBF, is introduced that leverages the structural aspect of programming languages and pre-trains a model to recover the original version of obfuscated source code and shows that models pre-trained with DOBF outperform existing approaches on multiple downstream tasks.
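The deobfuscation setup can be illustrated with a whole-word renaming pass: user-defined names become uninformative placeholders, and the training target is the dictionary of original names the model must recover. A simplified sketch, assuming the identifiers to obfuscate are given (DOBF extracts them from the parse tree) and using a single `VAR_i` scheme:

```python
import re

def obfuscate(code, names):
    """Replace each user-defined name with VAR_i; return code + recovery target."""
    mapping = {}
    for i, name in enumerate(names):
        placeholder = f"VAR_{i}"
        mapping[placeholder] = name
        # Whole-word substitution so builtins like sum/len are untouched.
        code = re.sub(rf"\b{re.escape(name)}\b", placeholder, code)
    return code, mapping

src = "def mean(values):\n    return sum(values) / len(values)"
obfuscated, target = obfuscate(src, ["mean", "values"])
# obfuscated: "def VAR_0(VAR_1):\n    return sum(VAR_1) / len(VAR_1)"
# target:     {"VAR_0": "mean", "VAR_1": "values"}
```

Recovering `target` from `obfuscated` forces the model to infer what a function does from its structure alone, since the informative names are gone.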

GraphCodeBERT: Pre-training Code Representations with Data Flow

Results show that code structure and the newly introduced pre-training tasks improve GraphCodeBERT, which achieves state-of-the-art performance on the four downstream tasks, and that the model prefers structure-level attention over token-level attention in the task of code search.

StructCoder: Structure-Aware Transformer for Code Generation

This work develops an encoder-decoder Transformer model where both the encoder and decoder are trained to recognize the syntax and data flow in the source and target codes, respectively, and achieves state-of-the-art performance on code translation and text-to-code generation tasks in the CodeXGLUE benchmark.

CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation

Comprehensive experiments show that CodeT5 significantly outperforms prior methods on understanding tasks such as code defect detection and clone detection, and generation tasks across various directions including PL-NL, NL-PL, and PL-PL.