Similarity based on the Importance of Common Features in Random Forest

doi:10.23940/ijpe.19.04.p12.11711180

Int J Performability Eng, 2019, Vol. 15, Issue 4: 1171-1180. doi: 10.23940/ijpe.19.04.p12.11711180


Xiao Chen^{a, b, *}, Li Han^a, Meng Leng^a, and Xiao Pan^c

a Network Technology Center, Hebei Normal University of Science and Technology, Qinhuangdao, 066004, China;
b Qianan College, North China University of Science and Technology, Qianan, 064400, China;
c College of Economic and Management, Shijiazhuang Tiedao University, Shijiazhuang, 050043, China

Contact: E-mail address: chenxiao0604@163.com
About the authors:

Xiao Chen received her master’s degree and Ph.D. from the College of Information Science and Engineering at Yanshan University. She is a lecturer at Hebei Normal University of Science and Technology. Her research interests include graph mining, social network analysis, and machine learning.

Li Han received her bachelor’s degree, master’s degree, and Ph.D. from the College of Information Science and Engineering at Yanshan University. She is a lecturer at Hebei Normal University of Science and Technology. Her research interests include wireless sensor networks and machine learning.

Meng Leng received his master’s degree from the College of Information Science and Engineering at Yanshan University, China. He is an associate lecturer at Hebei Normal University of Science and Technology. His research interests include wireless sensor networks and machine learning.

Xiao Pan is an associate professor at Shijiazhuang Tiedao University. She was a visiting scholar in the Department of Computer Science at the University of Illinois. She received her Ph.D. in computer science from Renmin University of China in 2010. Her research interests include data management on moving objects, location-based social networks, and privacy-aware computing.

Abstract

Existing methods for calculating the similarity between samples in a random forest consider only one case: two samples falling on the same leaf node of a decision tree. They ignore the fact that leaf nodes sit at different positions in the tree, and they ignore the case where two samples fall on different leaves, both of which reduce the accuracy of the similarity. This paper addresses the two gaps in turn. First, because leaf nodes differ in their position in the decision tree, the importance of the sample features associated with a leaf node is used as one attribute for describing similarity. Second, for samples that fall on different leaf nodes, the features the samples have in common are used as another attribute for describing similarity. Combining the two gives the proposed measure, SICF (similarity between samples based on the importance of common features). Finally, SICF is applied to the K-nearest neighbor classification algorithm, and the validity and correctness of the similarity are verified with the out-of-bag (OOB) index. Experimental results on UCI data show that SICF achieves better classification results than two classical methods.

Key words: random forest, similarity between samples, sample feature, feature importance, k-nearest neighbor, classification
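
The baseline that the abstract describes as the existing approach (two samples count as similar in a tree only when they reach the same leaf) can be sketched as follows. This is a minimal illustration of that classical leaf co-occurrence similarity and of using it inside a nearest-neighbor vote; it is not the SICF measure, whose formula is not given on this page. The function names `leaf_proximity` and `knn_predict`, the input layout `leaf_ids[i][t]` (leaf reached by sample i in tree t), and the toy data are assumptions made for the example.

```python
from collections import Counter
from typing import List, Sequence


def leaf_proximity(leaf_ids: Sequence[Sequence[int]]) -> List[List[float]]:
    """Classical random-forest similarity (hypothetical helper).

    leaf_ids[i][t] is the leaf that sample i reaches in tree t.
    The similarity of samples i and j is the fraction of trees in
    which both samples land on the same leaf.
    """
    n = len(leaf_ids)
    if n == 0:
        return []
    n_trees = len(leaf_ids[0])
    if n_trees == 0:
        raise ValueError("at least one tree is required")
    prox = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            shared = sum(1 for a, b in zip(leaf_ids[i], leaf_ids[j]) if a == b)
            prox[i][j] = prox[j][i] = shared / n_trees
    return prox


def knn_predict(prox: List[List[float]], labels: Sequence[str],
                query: int, k: int) -> str:
    """Majority vote among the k samples most similar to `query`."""
    others = [j for j in range(len(labels)) if j != query]
    # Sort by descending similarity; sorted() is stable, so ties keep index order.
    nearest = sorted(others, key=lambda j: -prox[query][j])[:k]
    votes = Counter(labels[j] for j in nearest)
    return votes.most_common(1)[0][0]


# Toy data (assumed): 4 samples, 3 trees.
leaf_ids = [[1, 4, 2], [1, 4, 3], [2, 5, 3], [2, 5, 2]]
labels = ["a", "a", "b", "b"]
prox = leaf_proximity(leaf_ids)
prediction = knn_predict(prox, labels, query=0, k=1)
```

In this toy data, samples 0 and 2 never share a leaf, so their similarity is zero even if they were routed through many of the same splits. That is the kind of information the abstract says the leaf-only measure discards.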

Xiao Chen, Li Han, Meng Leng, and Xiao Pan. Similarity based on the Importance of Common Features in Random Forest [J]. Int J Performability Eng, 2019, 15(4): 1171-1180.


