Int J Performability Eng, 2019, Vol. 15, Issue (10): 2683-2691. doi: 10.23940/ijpe.19.10.p14.26832691

• Original Article

Code Similarity Detection using AST and Textual Information

Wu Wen (a), Xiaobo Xue (b), Ya Li (a,*), Peng Gu (a), and Jianfeng Xu (b)

  a. Guangzhou University, Guangzhou, 510006, China
  b. Nanjing Mooctest Information and Technology Co. Ltd, Nanjing, 210000, China
  • Contact: Li Ya
  • About author:

    * Corresponding author. E-mail address: liya@gzhu.edu.cn

  • Supported by:
    This work was partially supported by the Ministry of Education - Nanjing Mooctest Industry-University Cooperation, Collaborative Education (No. 201702083003)

Abstract:

The teaching of computer language courses must be supplemented with a large amount of programming lab work. Students sometimes copy code from one another, which seriously lowers the teaching quality of these courses and makes it difficult to improve students' programming abilities. To address this problem, this paper proposes a novel code similarity detection algorithm based on both the code text and the abstract syntax tree (AST). First, the code text is normalized by "cleaning" steps such as removing comments and blank characters. Word segmentation, word frequency statistics, weight calculation, and related operations are then carried out, and a code fingerprint is obtained with the Simhash algorithm. Second, lexical analysis and syntax analysis are performed according to the grammar of the programming language to extract the AST, from which redundant information is eliminated. The edit distance between ASTs is calculated with the Zhang-Shasha algorithm and used to compare them. Finally, the overall similarity is calculated from the text similarity and the AST similarity. To verify the effectiveness of the method, Python is taken as an example: code from an open source programming platform and from LeetCode is used to build a test data set according to common code plagiarism methods. Experimental results show that the method is capable of detecting several common means of plagiarism and that it yields low similarity for unrelated and non-plagiarized code. We therefore believe that the algorithm can be used effectively to detect code similarity in the experimental code of computer language courses.

Key words: code plagiarism, code similarity, text similarity, abstract syntax tree
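
The text-fingerprint step summarized in the abstract (cleaning, word segmentation, word frequency weighting, Simhash) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `code_tokens`, `simhash`, and `text_similarity` are hypothetical, using raw term frequency as the weight and MD5 as the per-token hash are assumptions, and the AST and Zhang-Shasha stage is not shown.

```python
import hashlib
import io
import tokenize
from collections import Counter

# Token types that carry no content: comments and layout-only tokens.
SKIP = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
        tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}


def code_tokens(source):
    """Split Python source into tokens, dropping comments and whitespace."""
    reader = io.StringIO(source).readline
    return [tok.string for tok in tokenize.generate_tokens(reader)
            if tok.type not in SKIP]


def simhash(tokens, bits=64):
    """Return a Simhash fingerprint of a token list (bits must be <= 128)."""
    weights = Counter(tokens)  # assumption: weight = term frequency
    acc = [0] * bits
    for token, weight in weights.items():
        # Assumption: MD5 truncated to `bits` bits as the per-token hash.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            acc[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, value in enumerate(acc):
        if value > 0:
            fingerprint |= 1 << i
    return fingerprint


def text_similarity(src_a, src_b, bits=64):
    """Map the Hamming distance of two fingerprints to a score in [0, 1]."""
    fa = simhash(code_tokens(src_a), bits)
    fb = simhash(code_tokens(src_b), bits)
    hamming = bin(fa ^ fb).count("1")
    return 1.0 - hamming / bits


original = "def add(a, b):\n    return a + b\n"
disguised = "# my own work\ndef add(a, b):\n\n    return a + b  # sum\n"
print(text_similarity(original, disguised))
```

In this sketch, adding comments or blank lines leaves the token list unchanged, so the two fingerprints coincide. Edits such as identifier renaming or statement reordering change the token stream, which is the kind of case the AST comparison described in the abstract is meant to cover.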