Int J Performability Eng, 2019, Vol. 15, Issue (6): 1518-1527. doi: 10.23940/ijpe.19.06.p3.15181527


Feature Dimension Reduction Optimization Algorithm for Massive Micro-Blog Data based on Hadoop

Haodong Zhu*, Wenqi Li, and Hongchan Li   

  1. School of Computer and Communication Engineering, Zhengzhou University of Light Industry, Zhengzhou, 450002, China
  • Contact: * E-mail address: zhuhaodong80@163.com
  • About author: Haodong Zhu received his B.S. degree from Lanzhou Jiaotong University in 2004, M.S. degree from Sichuan University of Science & Engineering in 2008, and Ph.D. from the Graduate University of Chinese Academy of Sciences in 2011. As a postdoctoral scholar, he studied image big data processing in the Postdoctoral Mobile Station of Computer Science and Technology at Tongji University from 2014 to 2016. As a visiting scholar, he studied micro-blog big data processing at Griffith University from 2017 to 2018. Since 2010, he has been an associate professor and master's tutor in the School of Computer and Communication Engineering at Zhengzhou University of Light Industry. His major research interests include cloud computation, intelligence information processing, computing intelligence, and data mining; Wenqi Li received her B.S. degree from Zhengzhou University of Light Industry in 2017. Since 2017, she has been studying for her master's degree in the Computer and Communication Engineering College at Zhengzhou University of Light Industry. Her main research directions are intelligence information processing, micro-blog sentiment analysis, cloud computing, and data mining; Hongchan Li received her B.S. degree from Heilongjiang Bayi Agricultural University in 2007 and her M.S. degree from Sichuan University of Science & Engineering in 2010. Since 2010, she has been a lecturer in the School of Computer and Communication Engineering at Zhengzhou University of Light Industry. Her major research interests include cloud computation, intelligence information processing, computing intelligence, and data mining.
  • Supported by:
    This work was supported in part by the Key Science Research Project of Colleges and Universities in Henan Province of China (No. 19A520009) and the National Science Foundation of China (No. 81501548).

Abstract: In micro-blog sentiment analysis under big data environments, the "dimension disaster" caused by the continuous growth of text data poses a great challenge. To address this problem, this paper proposes an improved DF-MI-CHI fusion algorithm that combines the advantages of three traditional feature dimensionality reduction algorithms: document frequency (DF), mutual information (MI), and the chi-square test (CHI). First, a document frequency factor is added to the MI algorithm to address its defect with low-frequency words. Then, a standard score factor is added to the CHI algorithm to solve its negative correlation problem. Finally, the average value is calculated to fuse the advantages of the three algorithms. Simulation results show that after micro-blog data are processed with this algorithm, the accuracy of sentiment analysis is improved and maintained at 95%, the recall rate is more than 90%, and the F value is maintained between 92% and 94%. These values are higher than those of other improved algorithms and tend to be stable, which indicates that the algorithm can effectively improve the accuracy and efficiency of micro-blog sentiment analysis when dealing with massive micro-blog text data.

Key words: feature dimension reduction, micro-blogging emotion, feature selection, Hadoop, HDFS
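
The three base measures named in the abstract (DF, MI, and CHI) can be illustrated with a minimal Python sketch. It uses the textbook definitions over a 2x2 term/class contingency table and fuses them by averaging min-max normalized scores. The abstract does not give the paper's document frequency factor, standard score factor, or exact averaging formula, so the normalization step and all function names here (`contingency`, `mi_score`, `chi_score`, `fused_scores`) are assumptions made for illustration, not the authors' improved algorithm. It is also a single-machine sketch and does not use Hadoop.

```python
import math


def contingency(docs, labels, term, cls):
    """Count the 2x2 table for a term and a class.

    a: docs in cls containing term      b: docs outside cls containing term
    c: docs in cls without term         d: docs outside cls without term
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        if label == cls:
            if has_term:
                a += 1
            else:
                c += 1
        else:
            if has_term:
                b += 1
            else:
                d += 1
    return a, b, c, d


def df_score(a, b, c, d):
    """Document frequency: number of documents containing the term."""
    return a + b


def mi_score(a, b, c, d):
    """Textbook mutual information: log(A*N / ((A+C)*(A+B)))."""
    n = a + b + c + d
    if a == 0:
        return 0.0  # assumption: treat log(0) as 0 to keep the sketch simple
    return math.log(a * n / ((a + c) * (a + b)))


def chi_score(a, b, c, d):
    """Textbook chi-square statistic for the 2x2 table."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def min_max(values):
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def fused_scores(docs, labels, cls):
    """Average the normalized DF, MI, and CHI scores (assumed fusion rule)."""
    vocab = sorted({t for doc in docs for t in doc})
    tables = [contingency(docs, labels, t, cls) for t in vocab]
    df = min_max([df_score(*t) for t in tables])
    mi = min_max([mi_score(*t) for t in tables])
    chi = min_max([chi_score(*t) for t in tables])
    return {t: (x + y + z) / 3 for t, x, y, z in zip(vocab, df, mi, chi)}


if __name__ == "__main__":
    docs = [{"good", "day"}, {"good", "film"}, {"bad", "day"}, {"bad", "film"}]
    labels = ["pos", "pos", "neg", "neg"]
    ranked = sorted(fused_scores(docs, labels, "pos").items(),
                    key=lambda kv: kv[1], reverse=True)
    for term, score in ranked:
        print(f"{term}: {score:.3f}")
```

In this toy corpus, "bad" never appears in the "pos" class yet receives the same chi-square value as "good", because the plain statistic squares the difference and so cannot tell positive from negative correlation. This is the kind of negative correlation problem that the abstract says the standard score factor is meant to address.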