Int J Performability Eng, 2019, Vol. 15, Issue 11: 3031-3041. doi: 10.23940/ijpe.19.11.p23.30313041


An Improved Optimal Method for Classification Problem

Wei Huang (a), Xiao Dong (b), Wenqian Shang (b,*), Weiguo Lin (b), and Menghan Yan (b)

  (a) Division of Scientific Research, Communication University of China, Beijing, 100000, China
  (b) School of Computer Science and Cybersecurity, Communication University of China, Beijing, 100000, China
  • Contact: * E-mail address: shangwenqian@163.com

Abstract: To better mine and analyze the massive data generated by search engine companies, this paper proposes a search traffic classification and dimension reduction method based on the logistic regression algorithm. Using distributed Hadoop technology, a text classification model is designed and implemented through data research, data analysis, and comparative experiments. During feature extraction at the word-unit level, a feature combination method is used, and auxiliary information such as the URL is introduced as an additional signal; the model is also optimized to address the low quality of the training samples. The experimental results show that this optimization effectively improves the quality of the training set. Adding auxiliary information to the training set alleviates under-fitting to a certain extent and improves the classification effect. The accuracy and other indicators of the search traffic classification method reach a range that is acceptable under manual evaluation.

Key words: search traffic, text categorization, feature reduction, feature extraction, data analysis
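
The approach summarized in the abstract (word-unit features, feature combination, and a URL-derived auxiliary signal fed to logistic regression) can be illustrated with a minimal single-machine sketch. This is not the authors' Hadoop implementation. The choice of adjacent word pairs as the "combination", the use of the URL host as the auxiliary feature, the toy queries, and the hyperparameters are all assumptions made for illustration.

```python
import math
from urllib.parse import urlparse

BIAS = "__bias__"  # hypothetical name for the intercept term

def extract_features(query, url=None):
    """Return a set of binary features for one search record."""
    words = query.lower().split()
    feats = {f"w={w}" for w in words}                       # word-unit features
    # Assumed form of feature combination: adjacent word pairs.
    feats.update(f"w2={a}_{b}" for a, b in zip(words, words[1:]))
    if url:
        host = urlparse(url).netloc.lower()
        if host:
            feats.add(f"host={host}")                       # auxiliary URL signal
    return feats

def predict_proba(weights, feats):
    """Logistic regression probability of the positive class."""
    z = weights.get(BIAS, 0.0) + sum(weights.get(f, 0.0) for f in feats)
    z = max(-30.0, min(30.0, z))                            # avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, epochs=200, lr=0.5):
    """Fit weights by stochastic gradient ascent on the log-likelihood."""
    weights = {}
    for _ in range(epochs):
        for feats, label in samples:
            error = label - predict_proba(weights, feats)
            weights[BIAS] = weights.get(BIAS, 0.0) + lr * error
            for f in feats:
                weights[f] = weights.get(f, 0.0) + lr * error
    return weights

# Hypothetical toy data: label 1 = travel traffic, label 0 = programming traffic.
records = [
    ("cheap flight tickets", "https://travel.example.com/deals", 1),
    ("flight booking beijing", "https://travel.example.com/book", 1),
    ("python list sort", "https://docs.example.org/list", 0),
    ("sort algorithm tutorial", "https://docs.example.org/sort", 0),
]
samples = [(extract_features(q, u), y) for q, u, y in records]
weights = train(samples)

# A query made only of unseen words is still classified via the URL host.
p_travel = predict_proba(weights, extract_features("jaguar", "https://travel.example.com/x"))
p_docs = predict_proba(weights, extract_features("jaguar", "https://docs.example.org/y"))
print(p_travel > 0.5, p_docs < 0.5)
```

In this sketch the URL host lets the classifier separate records whose query words never appeared in training, which is one plausible reading of how auxiliary information could help with under-fitting on sparse word features.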