Code Similarity Detection using AST and Textual Information

doi:10.23940/ijpe.19.10.p14.26832691

Int J Performability Eng, 2019, Vol. 15, Issue (10): 2683-2691. doi: 10.23940/ijpe.19.10.p14.26832691

Original Article

Wu Wen^a, Xiaobo Xue^b, Ya Li^a^*, Peng Gu^a, and Jianfeng Xu^b

^aGuangzhou University, Guangzhou, 510006, China
^bNanjing Mooctest Information and Technology Co. Ltd, Nanjing, 210000, China

Contact: Li Ya
About author:
* Corresponding author. E-mail address: liya@gzhu.edu.cn
Supported by:
This work was partially supported by the Ministry of Education - Nanjing Mooctest Industry-University Cooperation, Collaborative Education (No. 201702083003)

Abstract

The teaching of computer language courses needs to be supplemented with a large amount of programming experiments. Students sometimes copy code from each other, which seriously reduces the teaching quality of these courses and makes it difficult to improve students' programming abilities. To solve this problem, this paper proposes a novel code similarity detection algorithm based on code text and the abstract syntax tree (AST). On the text side, the code is first "cleaned" by removing comments, blank characters, and similar noise to obtain normalized code text. Word segmentation, word frequency statistics, weight calculation, and other operations are then carried out, and a code fingerprint is obtained using the Simhash algorithm. On the structural side, lexical analysis and syntax analysis are performed according to the grammar of the programming language to extract the AST, and redundant information is eliminated. The edit distance between ASTs is then calculated with the Zhang-Shasha algorithm and used to compare the ASTs. Finally, the overall similarity is calculated from the text similarity and the AST similarity. To verify the effectiveness of this method, Python is taken as an example: code from an open source programming platform and LeetCode is used to build a test data set according to common code plagiarism methods. Experimental results show that the method is capable of detecting several common means of plagiarism, and that it yields low similarity for unrelated and non-plagiarized code. Therefore, we believe that this algorithm can be used effectively for similarity detection of experimental code in computer language courses.
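The two-sided pipeline described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and every name in it (`tokens_of`, `simhash`, `text_similarity`, `ast_similarity`, `combined_similarity`) is hypothetical. Several choices are assumptions, since the abstract does not specify them: plain term frequency as the token weight, MD5 as the per-token hash, a 64-bit fingerprint, and an equal-weight average as the final combination. To stay short, the AST side swaps in `difflib.SequenceMatcher` over node-type sequences in place of the Zhang-Shasha tree edit distance, so it only approximates the structural comparison.

```python
import ast
import hashlib
import io
import tokenize
from collections import Counter
from difflib import SequenceMatcher

# Token types dropped during "cleaning": comments, line breaks, indentation.
_SKIP = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
         tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}


def tokens_of(source):
    """Normalize Python source into a list of token strings."""
    reader = io.StringIO(source).readline
    return [tok.string for tok in tokenize.generate_tokens(reader)
            if tok.type not in _SKIP]


def simhash(tokens, bits=64):
    """Simhash fingerprint; the weight is term frequency (an assumption)."""
    acc = [0] * bits
    for token, weight in Counter(tokens).items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            # Each set bit votes +weight, each clear bit votes -weight.
            acc[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i in range(bits) if acc[i] > 0)


def text_similarity(a, b, bits=64):
    """1 minus the normalized Hamming distance between fingerprints."""
    diff = simhash(tokens_of(a), bits) ^ simhash(tokens_of(b), bits)
    return 1 - bin(diff).count("1") / bits


def ast_similarity(a, b):
    """Compare AST node-type sequences (a stand-in for Zhang-Shasha)."""
    # Node types ignore identifiers and literal values, which plays the
    # role of eliminating redundant information from the tree.
    seq_a = [type(node).__name__ for node in ast.walk(ast.parse(a))]
    seq_b = [type(node).__name__ for node in ast.walk(ast.parse(b))]
    return SequenceMatcher(None, seq_a, seq_b).ratio()


def combined_similarity(a, b, text_weight=0.5):
    """Weighted average of both scores; the 0.5 weight is an assumption."""
    return (text_weight * text_similarity(a, b)
            + (1 - text_weight) * ast_similarity(a, b))
```

In this sketch, changing only comments or blank lines leaves the text fingerprint untouched, while renaming identifiers leaves the node-type sequence untouched, which is why combining the two views can catch disguises that either view alone might miss. A full tree edit distance would additionally account for where in the tree a change occurs.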

Key words: code plagiarism, code similarity, text similarity, abstract syntax tree

Wu Wen, Xiaobo Xue, Ya Li, Peng Gu, and Jianfeng Xu. Code Similarity Detection using AST and Textual Information [J]. Int J Performability Eng, 2019, 15(10): 2683-2691.

References

1.	G. Cosma and M. Joy, “Source-Code Plagiarism: A UK Academic Perspective,” in Research Report RR-422, Department of Computer Science, University of Warwick, 2006
2.	J. Sheard, M. Dick, S. Markham, L. MacDonald, and M. Walsh, “Cheating and Plagiarism: Perception and Practices of First Year IT Students,” in Proceedings of the 7th Annual SIGCSE Conference on Innovation and Technology in Computer Science Education, pp. 183-187, ACM Press, New York, 2002
3.	J. A. W. Faidhi and S. K. Robinson, “An Empirical Approach for Detecting Program Similarity and Plagiarism within a University Programming Environment,” Computers and Education, pp. 11-19, 1987
4.	M. H. Halstead, “Elements of Software Science (Operating and Programming Systems Series),” 1978
5.	K. J. Ottenstein, “An Algorithmic Approach to the Detection and Prevention of Plagiarism,” ACM SIGCSE Bulletin, Vol. 8, No. 4, pp. 30-41, 1976
6.	M. J. Wise, “YAP3: Improved Detection of Similarities in Computer Program and Other Texts,” ACM SIGCSE Bulletin, pp. 130-134, 1996
7.	P. Clough, “Plagiarism in Natural and Programming Languages: An Overview of Current Tools and Technologies,” Research Memoranda Cs, 2000
8.	G. S. Manku, A. Jain, and A. D. Sarma, “Detecting Near-Duplicates for Web Crawling,” in Proceedings of the 16th International Conference on World Wide Web, pp. 141-150, ACM, New York, 2007
9.	D. C. Atkinson and W. G. Griswold, “Effective Pattern Matching of Source Code using Abstract Syntax Patterns,” Software Practice and Experience, Vol. 36, No. 4, pp. 413-447, 2006
10.	K. Zhang and D. Shasha, “Simple Fast Algorithms for the Editing Distance Between Trees and Related Problems,” in SIAM Journal on Computing, pp. 1245-1262, 1989
11.	K. L. Verco and M. J. Wise, “Software for Detecting Suspected Plagiarism: Comparing Structure and Attribute-Counting Systems,” in Computer Science, University of Sydney, pp. 3-5, 1996
