Extracting compiler provenance from program binaries


We present a novel technique that identifies the source compiler of program binaries, an important element of program provenance. Program provenance answers fundamental questions of malware analysis and software forensics, such as whether programs are generated by similar tool chains; it can also enable the development of debugging, performance analysis, and instrumentation tools specific to particular compilers. We formulate compiler identification as a structured learning problem, automatically building models to recognize sequences of binary code generated by particular compilers. We evaluate our techniques on a large set of real-world test binaries, showing that our models identify the source compiler of binary code with over 90% accuracy, even in the presence of interleaved code from multiple compilers. A case study demonstrates the use of inferred compiler provenance to augment stripped binary parsing, reducing parsing errors by 18%.

DOI: 10.1145/1806672.1806678

8 Figures and Tables

Cite this paper

@inproceedings{Rosenblum2010ExtractingCP,
  title={Extracting compiler provenance from program binaries},
  author={Nathan E. Rosenblum and Barton P. Miller and Xiaojin Zhu},
  booktitle={PASTE},
  year={2010}
}