Int J Performability Eng, 2019, Vol. 15, Issue (4): 1171-1180. doi: 10.23940/ijpe.19.04.p12.11711180

Similarity based on the Importance of Common Features in Random Forest

Xiao Chen (a, b, *), Li Han (a), Meng Leng (a), and Xiao Pan (c)

  a Network Technology Center, Hebei Normal University of Science and Technology, Qinhuangdao, 066004, China;
  b Qianan College, North China University of Science and Technology, Qianan, 064400, China;
  c College of Economic and Management, Shijiazhuang Tiedao University, Shijiazhuang, 050043, China
  • Contact (* corresponding author): chenxiao0604@163.com
  • About the authors: Xiao Chen graduated from the College of Information Science and Engineering at Yanshan University with a master’s degree and Ph.D. She is a lecturer at Hebei Normal University of Science and Technology. Her research interests include graph mining, social network analysis, and machine learning. Li Han graduated from the College of Information Science and Engineering at Yanshan University with a bachelor’s degree, master’s degree, and Ph.D. She is a lecturer at Hebei Normal University of Science and Technology. Her research interests include wireless sensor networks and machine learning. Meng Leng graduated from the College of Information Science and Engineering at Yanshan University, China, with a master’s degree. He is an associate lecturer at Hebei Normal University of Science and Technology. His research interests include wireless sensor networks and machine learning. Xiao Pan is an associate professor at Shijiazhuang Tiedao University. She was a visiting scholar in the Department of Computer Science at the University of Illinois. She received her Ph.D. in computer science from Renmin University of China in 2010. Her research interests include data management on moving objects, location-based social networks, and privacy-aware computing.

Abstract: Existing methods for calculating the similarity between samples in a random forest consider only the case where different samples fall on the same leaf node of a decision tree. They neglect both the fact that leaf nodes occupy different positions in the tree and the case where samples fall on different leaf nodes, which reduces the accuracy of the similarity. In this paper, first, to account for the differing positions of leaf nodes in the decision tree, the importance of the sample features associated with each leaf node is used as one attribute to describe the similarity. Second, for samples that fall on different leaf nodes, the features the samples have in common are taken as another attribute to describe the similarity. On this basis, the measure SICF (similarity between samples based on the importance of common features) is proposed. Finally, SICF is applied to the K-nearest neighbor classification algorithm, and its validity and correctness are verified with the out-of-bag (OOB) index. Experimental results on UCI data show that SICF achieves better classification results than two classical methods.

Key words: random forest, similarity between samples, sample feature, feature importance, k-nearest neighbor, classification
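
For orientation, the baseline that the abstract contrasts with SICF can be sketched in a few lines. The sketch below is not the SICF measure itself, whose formula is not given on this page. It shows only the classical same-leaf proximity (the fraction of trees in which two samples reach the same leaf), followed by a similarity-weighted K-nearest neighbor vote. The names `leaf_proximity`, `knn_predict`, and the `leaf_ids` matrix are hypothetical; in practice the leaf indices would come from a trained forest (for example, scikit-learn's `RandomForestClassifier.apply` returns such a matrix).

```python
from collections import Counter


def leaf_proximity(leaf_ids):
    """Classical random forest proximity.

    leaf_ids[i][t] is the leaf index that sample i reaches in tree t
    (hypothetical input; normally produced by a trained forest).
    Returns an n x n matrix whose (i, j) entry is the fraction of trees
    in which samples i and j land in the same leaf.
    """
    n = len(leaf_ids)
    n_trees = len(leaf_ids[0])
    prox = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            shared = sum(1 for a, b in zip(leaf_ids[i], leaf_ids[j]) if a == b)
            prox[i][j] = prox[j][i] = shared / n_trees
    return prox


def knn_predict(prox_row, labels, k, exclude=None):
    """Majority vote among the k samples most similar to the query.

    prox_row holds the query's similarity to every labeled sample;
    `exclude` skips the query's own index when it is part of the data.
    """
    candidates = [j for j in range(len(labels)) if j != exclude]
    candidates.sort(key=lambda j: prox_row[j], reverse=True)
    votes = Counter(labels[j] for j in candidates[:k])
    return votes.most_common(1)[0][0]


# Toy example: 4 samples, 3 trees (leaf indices are made up for illustration).
leaf_ids = [
    [0, 2, 1],
    [0, 2, 3],
    [4, 2, 1],
    [5, 6, 3],
]
labels = ["A", "A", "A", "B"]

prox = leaf_proximity(leaf_ids)
predicted = knn_predict(prox[0], labels, k=2, exclude=0)
```

Because this baseline counts only exact leaf matches, two samples that separate at the last split of a tree contribute nothing to the similarity, even if they share most of their path. That is the gap the abstract says SICF addresses by using feature importance and common features.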