Clustering Algorithm of Ethnic Cultural Resources based on Spark

Volume 15, Number 3, March 2019, pp. 756-762
DOI: 10.23940/ijpe.19.03.p4.756762

Ming Lei (a,b), Bin Wen (a), Jianhou Gan (b), and Jun Wang (b)

(a) School of Information Science and Technology, Yunnan Normal University, Kunming, 650500, China
(b) Key Laboratory of Educational Informatization for Nationalities of Ministry of Education, Yunnan Normal University, Kunming, 650500, China

(Submitted on October 19, 2018; Revised on November 21, 2018; Accepted on December 23, 2018)


Extracting valuable information from ethnic cultural resources is a key problem in current data mining research on these resources. The K-means algorithm can process large-scale data sets effectively because its iterative calculations are simple and efficient, but the uncertainty of the k-value affects both its efficiency and its accuracy. The particle swarm optimization (PSO) algorithm, with global coarse-grained search, can quickly determine the k-value of the cluster centers, but its retrieval efficiency is low. To address the K-means algorithm's dependence on the initial cluster centers and the low efficiency of the PSO algorithm, this paper proposes a Spark-based PSO-k-means algorithm. The algorithm first imports ethnic cultural text resources into the Hadoop Distributed File System (HDFS), segments the text with Han Language Processing (HanLP), and generates word frequency vectors with the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm. The PSO algorithm then pre-clusters the data set to obtain the cluster centers k for the K-means algorithm, and K-means cluster analysis produces the final classification result. The experimental results show that the clustering accuracy and stability of the PSO-k-means algorithm are better than those of the existing serial, stand-alone K-means algorithm.
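
The pipeline described in the abstract (TF-IDF vectors, PSO pre-clustering, then K-means) can be illustrated with a minimal single-machine sketch. This is not the authors' Spark implementation: it uses only the Python standard library, assumes the documents are already segmented into tokens (the HanLP and HDFS steps are omitted), fixes k in advance, and uses hypothetical function names and PSO parameters (`particles`, `w`, `c1`, `c2`) chosen for illustration only.

```python
import math
import random
from collections import Counter

def tfidf(docs):
    """Turn pre-segmented token lists into dense TF-IDF vectors."""
    vocab = sorted({t for d in docs for t in d})
    df = Counter(t for d in docs for t in set(d))
    n = len(docs)
    idf = {t: math.log((n + 1) / (df[t] + 1)) + 1.0 for t in vocab}  # smoothed IDF
    return [[Counter(d)[t] / len(d) * idf[t] for t in vocab] for d in docs]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]

def sse(points, centers):
    """Sum of squared distances to the nearest center (the PSO fitness)."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

def pso_centers(points, k, particles=10, iters=30, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Pre-clustering: each particle encodes k candidate centers as one flat list."""
    rng = random.Random(seed)
    dim = len(points[0])
    unflat = lambda pos: [pos[i * dim:(i + 1) * dim] for i in range(k)]
    pos = [[x for p in rng.sample(points, k) for x in p] for _ in range(particles)]
    vel = [[0.0] * (k * dim) for _ in range(particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [sse(points, unflat(p)) for p in pos]
    g = min(range(particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for _ in range(iters):
        for i in range(particles):
            for d in range(k * dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            fit = sse(points, unflat(pos[i]))
            if fit < pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit < gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return unflat(gbest)

def kmeans(points, centers, iters=20):
    """Standard Lloyd iterations starting from the given centers."""
    for _ in range(iters):
        labels = assign(points, centers)
        new_centers = []
        for j, c in enumerate(centers):
            members = [p for p, l in zip(points, labels) if l == j]
            # Keep the old center if a cluster ends up empty.
            new_centers.append([sum(col) / len(members) for col in zip(*members)]
                               if members else c)
        if new_centers == centers:
            break
        centers = new_centers
    return assign(points, centers), centers

# Toy corpus (made-up tokens, standing in for segmented text).
docs = [["dance", "song", "festival"], ["song", "dance", "costume"],
        ["festival", "dance", "song"], ["spark", "hdfs", "cluster"],
        ["cluster", "spark", "node"], ["hdfs", "spark", "cluster"]]
vecs = tfidf(docs)
seeds = pso_centers(vecs, k=2)          # PSO supplies the initial centers
labels, centers = kmeans(vecs, seeds)   # K-means refines them
```

In this sketch the PSO stage only replaces the random initialization of K-means; a distributed version would presumably parallelize the fitness evaluation and the assignment step across Spark partitions, which is outside the scope of this illustration.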



