Shellcode_IA32: A Dataset for Automatic Shellcode Generation

Pietro Liguori, Erfan Al-Hossami, Domenico Cotroneo, Roberto Natella, Bojan Cukic, Samira Shaikh
We take the first step to address the task of automatically generating shellcodes, i.e., small pieces of code used as a payload in the exploitation of a software vulnerability, starting from natural language comments. We assemble and release a novel dataset (Shellcode_IA32), consisting of challenging but common assembly instructions with their natural language descriptions. We experiment with standard methods in neural machine translation (NMT) to establish baseline performance levels on this… 
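To make the task concrete, a dataset of this kind pairs a natural-language intent with its assembly translation. The pairs below are generic IA-32 illustrations of that format, not necessarily verbatim samples from Shellcode_IA32:

```python
# Illustrative (hypothetical) sample format: each entry pairs a natural
# language description with an IA-32 assembly snippet.
samples = [
    {"intent": "zero out the eax register", "snippet": "xor eax, eax"},
    {"intent": "push the contents of eax onto the stack", "snippet": "push eax"},
    {"intent": "jump to the label l1", "snippet": "jmp l1"},
]

# An NMT model for this task is trained to map each intent to its snippet.
for s in samples:
    print(f"{s['intent']!r} -> {s['snippet']!r}")
```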


A Survey on Artificial Intelligence for Source Code: A Dialogue Systems Perspective

This survey paper overviews the major deep learning methods applied to Natural Language Processing (NLP) and source code over the last 35 years, and presents a software-engineering-centered taxonomy for CI, placing each of the works into a category describing how it best assists the software development cycle.

DualSC: Automatic Generation and Summarization of Shellcode via Transformer and Dual Learning

This study formalizes automatic shellcode generation and summarization as dual tasks, uses a shallow Transformer for model construction, designs a normalization method, Adjust_QKNorm, to adapt to these low-resource tasks, and proposes a rule-based repair component to improve the performance of automatic shellcode generation.

Can we generate shellcodes via natural language? An empirical study

The empirical analysis shows that NMT can generate assembly code snippets from natural language with high accuracy, and that in many cases it can generate entire shellcodes with no errors.

Textual Query Translation into Python Source Code using Transformers

Neural Machine Translation (NMT) can be used to convert a specific query from English into equivalent Python code; this work achieves English-to-Python source code generation using Transformers.

BashExplainer: Retrieval-Augmented Bash Code Comment Generation based on Fine-tuned CodeBERT

This study studies the Bash code comment generation problem and proposes an automatic method, BashExplainer, based on a two-stage training strategy, which outperforms all baselines by at least 8.75%, 9.29%, 4.77% and 3.86%.

Can NMT Understand Me? Towards Perturbation-based Evaluation of NMT Models for Code Generation

This work identifies a set of perturbations and metrics tailored for the robustness assessment of NMT models, and presents a preliminary experimental evaluation showing which types of perturbations affect the model the most, deriving useful insights for future directions.

Recent Advances in Neural Text Generation: A Task-Agnostic Survey

A task-agnostic survey of recent advances in neural text generation is presented, grouped under the following four headings: data construction, neural frameworks, training and inference strategies, and evaluation metrics.

EVIL: Exploiting Software via Natural Language

This work proposes an approach (EVIL) to automatically generate exploits in assembly/Python language from descriptions in natural language, which leverages Neural Machine Translation techniques and a dataset that was developed for this work.

Translation Quality Assessment: A Brief Survey on Manual and Automatic Methods

This work presents a high-level and concise survey of TQA methods, including both manual judgement criteria and automated evaluation metrics, which it classifies into further detailed sub-categories; the authors hope this work will be an asset for both translation model researchers and quality assessment researchers.

NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System

This work presents new data and semantic parsing methods for the problem of mapping English sentences to Bash commands (NL2Bash), and takes a first step in enabling any user to perform operations by simply stating their goals in English.

Neural Machine Translation by Jointly Learning to Align and Translate

It is conjectured that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture; the authors propose to extend it by allowing the model to automatically (soft-)search for the parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
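The soft-search described above is the additive attention mechanism. As a minimal sketch (a toy dense implementation in pure Python; the weight matrices W_q, W_k and vector v below are arbitrary illustrative values, not learned parameters): each source annotation is scored against the decoder query, scores are normalized with a softmax into attention weights, and the context vector is the weighted average of the annotations.

```python
import math

def additive_attention(query, keys, w_q, w_k, v):
    # score_i = v . tanh(W_q q + W_k k_i): how relevant source position i
    # is to predicting the current target word.
    scores = []
    for k in keys:
        hidden = [math.tanh(sum(wq_r[j] * query[j] for j in range(len(query))) +
                            sum(wk_r[j] * k[j] for j in range(len(k))))
                  for wq_r, wk_r in zip(w_q, w_k)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    # Softmax turns scores into attention weights that sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: weighted average of the source annotations,
    # replacing the single fixed-length sentence vector.
    context = [sum(w * k[j] for w, k in zip(weights, keys))
               for j in range(len(keys[0]))]
    return weights, context

# Toy example: three 2-dimensional source annotations, one query.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.0]
w_q = [[0.5, 0.0], [0.0, 0.5]]
w_k = [[0.5, 0.0], [0.0, 0.5]]
v = [1.0, 1.0]
weights, context = additive_attention(query, keys, w_q, w_k, v)
```

Because the weights are a distribution over source positions, the decoder attends most to the annotations whose scores are highest, rather than compressing the whole sentence into one vector.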

Assembly language step-by-step : programming with DOS and Linux

(Chapter titles from the book's table of contents: Another Pleasant Valley Saturday; Alien Bases; Lifting the Hood; The Right to Assemble; NASM-IDE: A Place to Stand; An Uneasy Alliance; Following Your Instructions; Our Object All Sublime; Dividing and …)

A syntactic neural model for general-purpose code generation

  • CoRR, abs/1704.01696, 2017.

Latent Predictor Networks for Code Generation

A novel neural network architecture is presented which generates an output sequence conditioned on an arbitrary number of input functions and allows both the choice of conditioning context and the granularity of generation, for example characters or tokens, to be marginalised, thus permitting scalable and effective training.

Incorporating External Knowledge through Pre-training for Natural Language to Code Generation

Evaluations show that combining the two sources with data augmentation and retrieval-based data re-sampling improves the current state-of-the-art by up to 2.2% absolute BLEU score on the code generation testbed CoNaLa.
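The BLEU score referred to above is the standard n-gram overlap metric for generated text. As a rough, simplified sketch (sentence-level, single reference, without the smoothing usually applied in practice): BLEU combines clipped n-gram precisions by geometric mean and applies a brevity penalty for short candidates.

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    # Modified n-gram precision: candidate n-gram counts are clipped
    # by their counts in the reference, for n = 1..max_n.
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i+n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i+n]) for i in range(len(reference) - n + 1))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0  # unsmoothed: any empty precision zeroes the score
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(candidate) > len(reference) else \
        math.exp(1 - len(reference) / max(len(candidate), 1))
    return bp * geo_mean

score = bleu("mov eax , 1".split(), "mov eax , 1".split())  # identical -> 1.0
```

A "2.2% absolute" gain means the metric value itself rises by 0.022 on this 0-to-1 scale (often quoted as 0-100).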

Reranking for Neural Semantic Parsing

This paper presents a simple approach to quickly iterate and improve the performance of an existing neural semantic parser by reranking an n-best list of predicted MRs, using features that are designed to fix observed problems with baseline models.

Assembly Language Step-by-Step: Programming with Linux

The eagerly anticipated new edition of the bestselling introduction to x86 assembly language has been completely rewritten to focus on 32-bit protected-mode Linux and the free NASM assembler and is tailored for use by programming beginners.

Machine Translation Evaluation Resources and Methods: A Survey.

This Machine Translation (MT) evaluation survey covers both manual and automatic evaluation methods, the different classifications from manual to automatic evaluation measures, and recent quality estimation (QE) tasks in MT, with a concise organization of the content.