Automatic protocol feature word construction based on machine learning

Abstract

Automatic protocol reverse engineering of application protocols is becoming increasingly important for applications such as protocol analyzers, penetration testing, and intrusion prevention and detection. Unfortunately, many techniques for extracting the message format specifications of unknown applications are limited by a lack of a priori information or by long running times. Protocol feature words are byte subsequences within traffic payloads that can help distinguish application protocols. In this paper, a new approach for extracting the protocol message format specifications of unknown applications is proposed, based on the Latent Dirichlet Allocation (LDA) model and the Huffman Tree Support Vector Machine (HT-SVM). First, keywords are extracted using the LDA model, a machine learning technique that extracts the thematic structure, called topics, of a document collection. Second, the HT-SVM method is applied to construct feature words from the extracted keywords. The proposed approach is implemented and evaluated by inferring the message format specifications of the SMTP binary protocol. Experimental results show that the approach accurately parses and infers the SMTP protocol with a high recall rate.
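
The keyword-extraction step described in the abstract can be illustrated with a minimal sketch. It assumes that each payload is treated as a "document" whose "words" are fixed-length byte n-grams, and that topics are fitted with a plain collapsed Gibbs sampler; the n-gram length, the hyperparameters, the function names, and the sample payloads are all assumptions made for illustration, not details taken from the paper. The HT-SVM stage is not sketched here.

```python
import random
from collections import Counter


def byte_ngrams(payload: bytes, n: int = 4) -> list:
    """Slide an n-byte window over the payload; each window is one 'word'."""
    return [payload[i:i + n] for i in range(len(payload) - n + 1)]


def lda_gibbs(docs, num_topics=2, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return per-topic word counts.

    alpha and beta are symmetric Dirichlet priors (assumed values).
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    topic_word = [Counter() for _ in range(num_topics)]  # count of word w in topic k
    topic_total = [0] * num_topics                       # tokens assigned to topic k
    doc_topic = [[0] * num_topics for _ in docs]         # tokens of doc d in topic k
    assign = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            topic_word[k][w] += 1
            topic_total[k] += 1
            doc_topic[d][k] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                doc_topic[d][k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                topic_word[k][w] += 1
                topic_total[k] += 1
                doc_topic[d][k] += 1
    return topic_word


def top_keywords(topic_word, k=3):
    """Return up to k most frequent n-grams per topic as candidate keywords."""
    return [[w for w, c in tw.most_common(k) if c > 0] for tw in topic_word]


if __name__ == "__main__":
    # Hypothetical sample payloads, for illustration only.
    payloads = [
        b"HELO example.com\r\n",
        b"MAIL FROM:<a@example.com>\r\n",
        b"RCPT TO:<b@example.com>\r\n",
    ]
    docs = [byte_ngrams(p) for p in payloads]
    for topic in top_keywords(lda_gibbs(docs)):
        print(topic)
```

The n-grams that dominate a topic are only candidate keywords; in the approach summarized above, a later stage is responsible for assembling them into feature words.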

Cite this paper

@article{Li2015AutomaticPF,
  title={Automatic protocol feature word construction based on machine learning},
  author={Haifeng Li and Bin Zhang and Bo Shuai and Jian Wang and Chaojing Tang},
  journal={2015 IEEE International Conference on Progress in Informatics and Computing (PIC)},
  year={2015},
  pages={93-97}
}