Int J Performability Eng, 2019, Vol. 15, Issue (3): 1053-1060. doi: 10.23940/ijpe.19.03.p35.10531060

Return Instruction Identification in Binary Code with Machine Learning

Jing Qiu and Guanglu Sun*   

  1. School of Computer Science and Technology, Harbin University of Science and Technology, Harbin, 150080, China
  • Contact: sunguanglu@hrbust.edu.cn
  • About the authors:
    Jing Qiu received his B.S. degree from Harbin Institute of Technology in 2005 and his M.S. degree from Harbin University of Science and Technology in 2009. He received his Ph.D. from Harbin Institute of Technology in 2015. He currently works at Harbin University of Science and Technology. His main interests include binary code analysis and binary code deobfuscation.
    Guanglu Sun received his bachelor's degree, master's degree, and Ph.D. from the School of Computer Science and Technology at Harbin Institute of Technology. He was an assistant researcher at the Post-doctoral Mobile Station in Tsinghua University's Computer Science Department from 2008 to 2011. He was a visiting scholar at Northwestern University from 2014 to 2015. He is currently a professor in the School of Computer Science and Technology and the director of the Center of Information Security and Intelligent Technology at Harbin University of Science and Technology. He is also a senior member of the China Computer Federation and a member of IEEE. His current research interests include computer networks and security, machine learning, and intelligent information processing.

Abstract: Binary code analysis is the main method for malware analysis. In this paper, we start the analysis by identifying return instructions as a first step toward disassembling binary code. The return instruction identification problem is converted into a binary classification problem: is a given byte in binary code the first byte of a return instruction? The 32 bytes surrounding a byte are used as the features of that byte. A multilayer perceptron is employed to build the classification model, which is trained on 1,383 binaries from Windows XP SP3. Evaluation on several open-source programs shows that the approach is feasible and achieves high accuracy.

Key words: return instruction, binary code analysis, machine learning, reverse engineering
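
The feature construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not say how the 32 surrounding bytes are split, so the sketch assumes 16 bytes before and 16 bytes after the candidate, excludes the candidate byte itself, pads with zeros past the ends of the buffer, and scales values to [0, 1]. Restricting candidates to the x86 return opcodes (C3, C2, CB, CA) is also an assumption, and the names `window_features` and `candidate_offsets` are hypothetical.

```python
from typing import List

# x86 opcodes that can begin a return instruction:
# C3 (ret), C2 (ret imm16), CB (retf), CA (retf imm16).
RET_OPCODES = {0xC3, 0xC2, 0xCB, 0xCA}


def candidate_offsets(code: bytes) -> List[int]:
    """Offsets whose byte value matches a return opcode (assumed pre-filter)."""
    return [i for i, b in enumerate(code) if b in RET_OPCODES]


def window_features(code: bytes, offset: int,
                    before: int = 16, after: int = 16,
                    pad: int = 0) -> List[float]:
    """Bytes surrounding code[offset], scaled to [0, 1].

    Assumptions (not stated in the abstract): a 16/16 split, the
    candidate byte excluded, zero padding outside the buffer.
    """
    if not 0 <= offset < len(code):
        raise IndexError("offset outside code")
    feats = []
    for i in range(offset - before, offset + after + 1):
        if i == offset:
            continue  # skip the candidate byte itself
        value = code[i] if 0 <= i < len(code) else pad
        feats.append(value / 255.0)
    return feats


# mov eax, 0xC3 ; ret  -> the first C3 is immediate data, the second is a real return
code = bytes.fromhex("b8c3000000c3")
for off in candidate_offsets(code):
    print(off, len(window_features(code, off)))
```

Both offsets in the example hold the byte C3, but only the last one starts a return instruction. Telling the two apart from their surrounding bytes is the binary classification task that the multilayer perceptron is trained on.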