Int J Performability Eng ›› 2022, Vol. 18 ›› Issue (4): 251-262. doi: 10.23940/ijpe.22.04.p3.251262


Automatic Categorization of Software with Document Clustering Methods and Voting Mechanism

Kai-Wen Chen and Chin-Yu Huang*   

  1. Department of Computer Science, National Tsing Hua University, Hsinchu, 300, Taiwan
  • Contact: * E-mail address: cyhuang@cs.nthu.edu.tw

Abstract: Manual software categorization requires software managers to understand each piece of software well enough to categorize it according to certain criteria (e.g., functionality). Unfortunately, the rapid growth of software makes manual categorization expensive and almost infeasible, so automatic software categorization has become necessary. In this study, we used three unsupervised document clustering methods, namely k-means, non-negative matrix factorization (NMF), and spectral clustering, to analyse source code and implement automatic software categorization. For evaluation, we selected a well-known unsupervised model, LACT, as our comparison candidate. In general, our proposed methods required at most approximately 1/4 of the execution time of LACT, and the fastest method was hundreds of times faster than LACT. Compared with LACT, they achieved up to 26% better performance on the BCubed F1-measure and up to 100% better performance on the Adjusted Rand Index. We also proposed a voting mechanism inspired by N-version programming and ensemble learning, in which certain states of NMF and spectral clustering serve as referees to improve the average performance of k-means. Our results indicate that combining different clustering techniques to achieve better results is feasible.

Key words: automatic software categorization, document clustering, k-means, non-negative matrix factorization, spectral clustering
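
The abstract reports results on two clustering criteria, the BCubed F1-measure and the Adjusted Rand Index. As a rough illustration of what those scores measure, the sketch below computes both from a list of reference categories and a list of predicted cluster labels, using the standard textbook definitions. It is not the authors' evaluation code: the function names `adjusted_rand_index` and `bcubed_f1` and the toy labels are assumptions made for this example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, pred):
    """Adjusted Rand Index between two labelings of the same items."""
    n = len(truth)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: treat the labelings as identical
    # Pairs of items that share both the category and the cluster.
    same_both = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    # Pairs that share a category, and pairs that share a cluster.
    same_truth = sum(comb(c, 2) for c in Counter(truth).values())
    same_pred = sum(comb(c, 2) for c in Counter(pred).values())
    expected = same_truth * same_pred / total_pairs
    maximum = (same_truth + same_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. every item in one group in both
    return (same_both - expected) / (maximum - expected)


def bcubed_f1(truth, pred):
    """BCubed F1: harmonic mean of item-averaged precision and recall."""
    n = len(truth)
    if n == 0:
        raise ValueError("at least one item is required")
    cell = Counter(zip(truth, pred))   # items per (category, cluster) pair
    category_size = Counter(truth)
    cluster_size = Counter(pred)
    # Per item: share of its cluster that has its category (precision),
    # and share of its category that landed in its cluster (recall).
    precision = sum(cell[(t, p)] / cluster_size[p] for t, p in zip(truth, pred)) / n
    recall = sum(cell[(t, p)] / category_size[t] for t, p in zip(truth, pred)) / n
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical toy data: four projects, two reference categories.
    reference = ["editor", "editor", "game", "game"]
    clusters = [0, 0, 1, 2]
    print("ARI:", adjusted_rand_index(reference, clusters))
    print("BCubed F1:", bcubed_f1(reference, clusters))
```

Both functions depend only on which items are grouped together, so cluster labels can be arbitrary hashable values and need not match the category names. The Adjusted Rand Index is corrected for chance and can be negative, whereas the BCubed F1-measure always lies in (0, 1].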