Network intrusion detection based on heterogeneous distance." Cyberspace Technology
- Du, Hongle
Automatic protocol reverse engineering for application protocol is becoming more and more important for many applications such as application protocol analyzer, penetration testing, intrusion prevention and detection. Unfortunately, many techniques for extracting the protocol message format specifications of unknown applications often have some limitations for few priori information or the time-consuming problem. Protocol feature words are byte subsequences within traffic payload that could help distinguish application protocols. In this paper, a new approach is proposed for extracting the protocol message format specifications of unknown applications which is based on the Latent Dirichlet Allocation (LDA) model and Huffman Tree Support Vector Machine (HT-SVM). Firstly, the key words are extracted by utilizing the LDA model, which is a kind of machine learning in document library to extract the theme structure named topic. Secondly, the HT-SVM method is applied to constructing the feature words based on the above process. The proposed approach is implemented and evaluated to infer message format specifications of SMTP binary protocol. Experimental results show that the approach accurately parses and infers SMTP protocol with highly recall rate.