• Corpus ID: 227127587

Revisiting Binary Code Similarity Analysis using Interpretable Feature Engineering and Lessons Learned

Dongkwan Kim, Eunsoo Kim, Sang Kil Cha, Sooel Son, Yongdae Kim
Binary code similarity analysis (BCSA) is widely used for diverse security applications such as plagiarism detection, software license violation detection, and vulnerability discovery. Despite the surging research interest in BCSA, it is significantly challenging to perform new research in this field for several reasons. First, most existing approaches focus only on the end results, namely, increasing the success rate of BCSA, by adopting uninterpretable machine learning. Moreover, they utilize… 
One-to-One or One-to-many? What function inlining brings to binary2source similarity analysis
The extent of function inlining, the factors affecting function inlining, and the impact of function inlining on existing binary2source similarity methods are investigated.
How Machine Learning Is Solving the Binary Function Similarity Problem
This paper sets out to perform the first measurement study on the state of the art of binary code similarity, systematizing the existing body of research and identifying a number of relevant approaches, which are representative of a wide range of solutions recently proposed by three different research communities.
XFL: eXtreme Function Labeling
EXtreme Function Labeling (XFL), an extreme multi-label learning approach to selecting appropriate labels for binary functions, is introduced together with DEXTER, a novel function embedding that combines static analysis-based features with local context from the call graph. XFL outperforms state-of-the-art approaches to function labeling on a dataset of over 10,000 binaries from the Debian project.
iCallee: Recovering Call Graphs for Binaries
This paper proposes a new solution, ICALLEE, based on the Siamese Neural Network and inspired by advances in question-answering applications. It applies ICALLEE to two specific applications, binary code similarity detection and binary program hardening, and finds that it could greatly improve state-of-the-art solutions.
1-to-1 or 1-to-n? Investigating the effect of function inlining on binary similarity analysis
By precisely recovering when function inlining happens, it is discovered that inlining is usually cumulative when optimization increases, and conditional inlining and incremental inlining are suggested to design a low-cost and high-coverage inlining-simulation strategy.
When Similarity Digest Meets Vector Management System: A Survey on Similarity Hash Function
A systematic survey of existing well-known similarity hash functions that identifies the satisfactory ones and concludes that the similarity hash functions MinHash and Nilsimsa can be directly integrated into a similarity analysis pipeline built on a vector management system.
ReSIL: Revivifying Function Signature Inference using Deep Learning with Domain-Specific Knowledge
This paper performs a systematic study to quantify the extent to which compiler optimizations (negatively) impact the accuracy of existing deep learning techniques for function signature recovery and proposes an enhanced deep learning approach named ReSIL to incorporate compiler-optimization-specific domain knowledge into the learning process.
PERFUME: Programmatic Extraction and Refinement for Usability of Mathematical Expression
This paper presents PERFUME, a framework that extracts symbolic math expressions from low-level binary representations of an algorithm by translating a symbolic output representation of a binary function to a high-level mathematical expression.


Binary Similarity Detection Using Machine Learning
This paper presents a cross-compiler-and-architecture approach for detecting similarity between binary procedures, which achieves both high accuracy and peerless throughput.
Asm2Vec: Boosting Static Representation Robustness for Binary Clone Search against Code Obfuscation and Compiler Optimization
An assembly code representation learning model that can find and incorporate rich semantic relationships among tokens appearing in assembly code and significantly outperforms existing methods against changes introduced by obfuscation and optimizations is developed.
Leveraging semantic signatures for bug search in binary programs
A method to automatically identify binary code regions that are "similar" to code regions containing a reference bug to find bugs both in the same binary as the reference bug and in completely unrelated binaries (even compiled for different operating systems).
Learning Program-Wide Code Representations for Binary Diffing
A novel learning-based code representation generation approach to solve the binary diffing problem that relies only on code semantic information and program-wide control flow structural information to generate block embeddings, without the support of any debug information.
VulSeeker: A Semantic Learning Based Vulnerability Seeker for Cross-Platform Binary
VulSeeker is a semantic learning based vulnerability seeker for cross-platform binary that outperforms the state-of-the-art approaches in terms of accuracy and embedding vector.
In-memory fuzzing for binary code similarity analysis
  • Shuai Wang, Dinghao Wu
  • Computer Science
    2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE)
  • 2017
This paper presents a novel method that leverages in-memory fuzzing for binary code similarity analysis, and shows that IMF-SIM notably outperforms existing tools with higher accuracy and broader application scopes.
$\alpha$ Diff: Cross-Version Binary Code Similarity Detection with DNN
  • Bingchang Liu, Wei Huo, Wei Zou
  • Computer Science
    2018 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE)
  • 2018
This paper proposes a solution, employing three semantic features, to address the cross-version BCSD challenge, and shows that $\alpha$ Diff outperforms state-of-the-art static solutions by over 10 percentage points on average in different BCSD settings.
BinGo: cross-architecture cross-OS binary search
The experimental results show that BINGO can find semantically similar functions across architecture and OS boundaries in a scalable manner, even in the presence of program structure distortion; function filtering is proposed to dramatically reduce the irrelevant target functions.
Accurate and Scalable Cross-Architecture Cross-OS Binary Code Search with Emulation
This study empirically compares the tool, BinGo-E, with the previous tool BinGo and the available state-of-the-art tools of binary code search in terms of search accuracy and performance, and proposes to incorporate features from different categories (e.g., structural features and high-level semantic features) for accuracy improvement and emulation for efficiency improvement.
A Survey of Binary Code Similarity
This article examines 70 binary code similarity approaches and analyzes them on four aspects: the applications they enable, their approach characteristics, how the approaches are implemented, and the benchmarks and methodologies used to evaluate them.