Int J Performability Eng ›› 2019, Vol. 15 ›› Issue (3): 756-762. doi: 10.23940/ijpe.19.03.p4.756762


Clustering Algorithm of Ethnic Cultural Resources based on Spark

Ming Lei a,b, Bin Wen a,*, Jianhou Gan b, and Jun Wang b

  a School of Information Science and Technology, Yunnan Normal University, Kunming, 650500, China;
  b Key Laboratory of Educational Informatization for Nationalities of Ministry of Education, Yunnan Normal University, Kunming, 650500, China
  • Contact: wenbin@ynnu.edu.cn
  • About author: Ming Lei is a Master's student in the School of Information Science and Technology at Yunnan Normal University. His research interests include machine learning and data mining. Bin Wen received his Ph.D. in computer application technology from China University of Mining & Technology in 2013. In 2005, he was a faculty member at Yunnan Normal University. Currently, he is an associate professor at Yunnan Normal University. His research interests cover intelligent information processing and emergency management. Jianhou Gan received his Ph.D. in metallurgical physical chemistry from Kunming University of Science and Technology in 2016. In 1998, he was a faculty member at Yunnan Normal University. Currently, he is a professor at Yunnan Normal University. His research interests cover education informatization for nationalities, Semantic Web, databases, and intelligent information processing. Jun Wang received his Master's degree in modern education technology from Yunnan Normal University in 2012. In 2013, he was a faculty member at Yunnan Normal University. Currently, he is an assistant research fellow at Yunnan Normal University. His research interests cover education informatization for nationalities and knowledge engineering.

Abstract: Extracting valuable information is central to current data mining research on ethnic cultural resources. The K-means algorithm can process large-scale data sets effectively because its iterative calculations are simple and efficient, but the uncertainty of the k-value affects its efficiency and accuracy. The particle swarm optimization (PSO) algorithm, with its global coarse-grained search, can quickly determine the k-value of the cluster centers, but its search efficiency is low. To address both the choice of initial cluster centers in the K-means algorithm and the low efficiency of the PSO algorithm, this paper proposes a Spark-based PSO-k-means algorithm. The algorithm first imports ethnic cultural text resources into the Hadoop Distributed File System (HDFS) and segments the text into words with Han Language Processing (HanLP). The Term Frequency-Inverse Document Frequency (TF-IDF) algorithm then generates the word frequency vectors. Finally, the PSO algorithm pre-clusters the data set to obtain the cluster center k for the K-means algorithm, and K-means cluster analysis produces the final classification result. The experimental results show that the clustering accuracy and stability of the PSO-k-means algorithm are better than those of the existing serial, stand-alone K-means algorithm.

Key words: ethnic culture, particle swarm optimization algorithm, K-means clustering
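
The pipeline outlined in the abstract (segmented text, TF-IDF vectors, PSO pre-clustering, K-means refinement) can be illustrated with a minimal single-machine sketch. It is not the authors' Spark implementation. Several things are assumptions made for illustration: the documents are already tokenized (standing in for HanLP output), HDFS and Spark are omitted, k is fixed in advance, PSO is used only to pick the initial centers by minimizing the within-cluster squared error, and all function names and parameter values (`tfidf`, `pso_centers`, `kmeans`, `w`, `c1`, `c2`) are hypothetical.

```python
import math
import random
from collections import Counter

def tfidf(docs):
    """Turn token lists into dense TF-IDF vectors (smoothed IDF, an assumed variant)."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append([tf[t] / len(d) * idf[t] for t in vocab])
    return vocab, vectors

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def sse(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

def pso_centers(points, k, n_particles=10, iters=30, w=0.7, c1=1.5, c2=1.5, seed=0):
    """PSO pre-clustering: each particle encodes k candidate centers; fitness is SSE."""
    rng = random.Random(seed)
    dim = len(points[0])

    def unflat(pos):
        return [pos[i * dim:(i + 1) * dim] for i in range(k)]

    # Initialize every particle from k randomly sampled data points.
    swarm = [[x for p in rng.sample(points, k) for x in p] for _ in range(n_particles)]
    vel = [[0.0] * (k * dim) for _ in range(n_particles)]
    pbest = [s[:] for s in swarm]
    pbest_f = [sse(points, unflat(s)) for s in swarm]
    g = min(range(n_particles), key=pbest_f.__getitem__)
    gbest, gbest_f = pbest[g][:], pbest_f[g]
    for _ in range(iters):
        for i in range(n_particles):
            for j in range(k * dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][j] = (w * vel[i][j]
                             + c1 * r1 * (pbest[i][j] - swarm[i][j])
                             + c2 * r2 * (gbest[j] - swarm[i][j]))
                swarm[i][j] += vel[i][j]
            f = sse(points, unflat(swarm[i]))
            if f < pbest_f[i]:
                pbest[i], pbest_f[i] = swarm[i][:], f
                if f < gbest_f:
                    gbest, gbest_f = swarm[i][:], f
    return unflat(gbest)

def assign(points, centers):
    return [min(range(len(centers)), key=lambda c: sq_dist(p, centers[c])) for p in points]

def kmeans(points, centers, iters=20):
    """Lloyd's iterations starting from the PSO-selected centers."""
    for _ in range(iters):
        labels = assign(points, centers)
        new = []
        for c in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == c]
            # Keep the old center if a cluster ends up empty.
            new.append([sum(col) / len(members) for col in zip(*members)] if members else centers[c])
        if new == centers:
            break
        centers = new
    return assign(points, centers), centers

# Toy, pre-segmented documents (hypothetical tokens, two obvious topics).
docs = [["torch", "fire", "torch"], ["torch", "fire", "fire"],
        ["water", "splash", "water"], ["water", "splash", "splash"]]
vocab, vectors = tfidf(docs)
labels, centers = kmeans(vectors, pso_centers(vectors, k=2))
print(labels)
```

In a distributed setting, the fitness evaluation and the assignment step are the parts that would most naturally be parallelized over the data. How the paper maps these steps onto Spark is not stated in the abstract.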