Understanding neural code intelligence through program simplification

@article{Rabin2021UnderstandingNC,
  title={Understanding neural code intelligence through program simplification},
  author={Md Rafiqul Islam Rabin and Vincent J. Hellendoorn and Mohammad Amin Alipour},
  journal={Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering},
  year={2021}
}
A wide range of code intelligence (CI) tools, powered by deep neural networks, have been developed recently to improve programming productivity and perform program analysis. To reliably use such tools, developers often need to reason about the behavior of the underlying models and the factors that affect them. This is especially challenging for tools backed by deep neural networks. Various methods have tried to reduce this opacity in the vein of "transparent/interpretable-AI". However, these… 
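The program simplification the paper describes builds on delta debugging: repeatedly remove chunks of the input program while the model's prediction is preserved, until a 1-minimal program remains. A minimal sketch of the classic ddmin loop, assuming a hypothetical predicate `holds` that stands in for "the CI model still makes the same prediction on this token sequence":

```python
def ddmin(items, holds):
    """Minimal delta-debugging (ddmin) sketch: shrink `items` (e.g. program
    tokens) while the predicate `holds` stays true.  `holds` is a hypothetical
    stand-in for "the model's prediction is unchanged on this input"."""
    n = 2  # current granularity: number of chunks to split into
    while len(items) >= 2:
        chunk = max(1, len(items) // n)
        subsets = [items[i:i + chunk] for i in range(0, len(items), chunk)]
        reduced = False
        for i in range(len(subsets)):
            # Try deleting one chunk: keep the complement of subset i.
            complement = [x for j, s in enumerate(subsets) if j != i for x in s]
            if holds(complement):  # removal preserves the behavior of interest
                items = complement
                n = max(n - 1, 2)
                reduced = True
                break
        if not reduced:
            if n >= len(items):  # cannot refine granularity further: 1-minimal
                break
            n = min(len(items), n * 2)
    return items
```

For example, `ddmin(list("xaybz"), lambda s: 'a' in s and 'b' in s)` reduces the input to `['a', 'b']`, the smallest subsequence still satisfying the predicate. Applied to code intelligence models, the predicate is the model's (expensive) prediction, so the number of queries rather than asymptotic complexity dominates the cost.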
Extracting Label-specific Key Input Features for Neural Code Intelligence Models
TLDR
Extracting key input features from reduced programs reveals that syntax-guided reduced programs contain more label-specific key input features, which may help explain the reasoning behind models’ predictions from different perspectives and increase trust in the correct classifications given by CI models.
Syntax-guided program reduction for understanding neural code intelligence models
TLDR
A syntax-guided program reduction technique that considers the grammar of the input programs during reduction is applied; it is faster and yields smaller sets of key tokens in the reduced programs.
Learning to Represent Programs with Code Hierarchies
TLDR
A novel network architecture, HIRGAST, is designed, which combines the strengths of Heterogeneous Graph Transformer Networks and Tree-based Convolutional Neural Networks to learn Abstract Syntax Trees enriched with code dependency information and a novel pretraining objective called Missing Subtree Prediction is proposed.
Memorization and Generalization in Neural Code Intelligence Models
TLDR
This work evaluates the memorization and generalization tendencies in neural code intelligence models through a case study across several benchmarks and model families by leveraging established approaches from other fields that use DNNs, such as introducing targeted noise into the training dataset.
Counterfactual Explanations for Models of Code
TLDR
This paper integrates counterfactual explanation generation to models of source code in a real-world setting and investigates the efficacy of the approach on three different models, each based on a BERT-like architecture operating over source code.
Data-Driven AI Model Signal-Awareness Enhancement and Introspection
TLDR
This paper combines the SE concept of code complexity with the AI technique of curriculum learning, and incorporates SE assistance into AI models by customizing Delta Debugging to generate simplified signal-preserving programs, augmenting them to the training dataset.
Towards Reliable AI for Source Code Understanding
TLDR
This work highlights the need for concerted efforts from the research community to ensure credibility, accountability, and traceability for AI-for-code, and outlines three stages of an AI pipeline: data collection, model training, and prediction analysis.
Code2Snapshot: Using Code Snapshots for Learning Representations of Source Code
TLDR
This paper investigates Code2Snapshot, a novel representation of source code based on snapshots of input programs, and evaluates several variations of this representation, comparing its performance with state-of-the-art representations that utilize the rich syntactic and semantic features of input programs.
Encoding Program as Image: Evaluating Visual Representation of Source Code
TLDR
This paper investigates Code2Snapshot, a novel representation of source code based on snapshots of input programs, and evaluates several variations of this representation, comparing its performance with state-of-the-art representations that utilize the rich syntactic and semantic features of input programs.
Data-Driven and SE-assisted AI Model Signal-Awareness Enhancement and Introspection
TLDR
This paper combines the SE concept of code complexity with the AI technique of curriculum learning, and incorporates SE assistance into AI models by customizing Delta Debugging to generate simplified signal-preserving programs, augmenting them to the training dataset.
...

References

Showing 1-10 of 45 references
Testing Neural Program Analyzers
TLDR
In a preliminary experiment on a neural model recently proposed in the literature, it is found that the model is very brittle, and simple perturbations in the input can cause the model to make mistakes in its prediction.
Evaluation of Generalizability of Neural Program Analyzers under Semantic-Preserving Transformations
TLDR
A large-scale evaluation of the generalizability of two popular neural program analyzers using seven semantically equivalent transformations of programs, providing initial stepping stones for quantifying robustness in neural program analyzers.
Neural Program Repair by Jointly Learning to Localize and Repair
TLDR
It is beneficial to train a model that jointly and directly localizes and repairs variable-misuse bugs, and the experimental results show that the joint model significantly outperforms an enumerative solution that uses a pointer based model for repair alone.
Learning to Represent Programs with Graphs
TLDR
This work proposes to use graphs to represent both the syntactic and semantic structure of code and use graph-based deep learning methods to learn to reason over program structures, and suggests that these models learn to infer meaningful names and to solve the VarMisuse task in many cases.
On the "naturalness" of buggy code
TLDR
It is found that code with bugs tends to be more entropic (i.e. unnatural), becoming less so as bugs are fixed, suggesting that entropy may be a valid, simple way to complement the effectiveness of PMD or FindBugs, and that search-based bug-fixing methods may benefit from using entropy both for fault-localization and searching for fixes.
Toward Deep Learning Software Repositories
TLDR
This work motivates deep learning for software language modeling, highlighting fundamental differences between state-of-the-practice software language models and connectionist models, and proposes avenues for future work where deep learning can be brought to bear to support model-based testing, improve software lexicons, and conceptualize software artifacts.
A Survey of Machine Learning for Big Code and Naturalness
TLDR
This article presents a taxonomy based on the underlying design principles of each model and uses it to navigate the literature and discuss cross-cutting and application-specific challenges and opportunities.
AutoFocus: Interpreting Attention-Based Neural Networks by Code Perturbation
TLDR
Based on evaluation on more than 1000 programs for 10 different sorting algorithms, it is observed that the attention scores are highly correlated to the effects of the perturbed code elements, which provides a strong basis for the uses of attention scores to interpret the relations between code elements and the algorithm classification results of a neural network.
Cause reduction: delta debugging, even without bugs
TLDR
Suites produced by cause reduction provide effective quick tests for real-world programs, including improving seeded symbolic execution, where using reduced tests can often double the number of additional branches explored.
...