Int J Performability Eng, 2018, Vol. 14, Issue (5): 1078-1087. doi: 10.23940/ijpe.18.05.p27.10781087

• Original articles

A Mining Model of Network Log Data based on Hadoop

Yun Wu (a), Xin Ma (a), Guangqian Kong (a), Bin Wang (b, c), and Xinwei Niu (c)

  a. College of Computer Science and Technology, Guizhou University, Guiyang, 550025, China
  b. School of Mechanical Engineering, Yancheng Institute of Technology, Yancheng, 224000, China
  c. School of Engineering, Penn State Behrend, Erie, Pennsylvania, 19019, United States

Abstract:

With the growing volume of data in the information age, traditional Web log data mining methods can no longer handle large-scale text data. To address this problem, we design a highly reliable Web log data mining scheme and propose a text similarity simulation detection model based on Hadoop. First, we design a data mining scheme for user behavior logs that takes into account the heterogeneity, diversity, and complexity of network log data. The platform is divided into three layers: a data storage layer, a business logic layer, and an application layer. In this part, we design the data cleaning algorithm and the key performance indicators (KPIs), and then use Hive to complete the mining. Second, we propose a Hadoop-based similarity mining model for text log data and design its algorithm. The model comprises MapReduce designs of the Shingling algorithm and the NewMinhash algorithm. An improved Shingling algorithm based on the MapReduce programming model converts each document into a set; the distributed NewMinhash algorithm then computes the signature matrix, and Jaccard coefficients are used to calculate the similarity. We conduct an experimental analysis on the SogouCS data set. The experimental results demonstrate the effectiveness of the NewMinhash algorithm and show that the model not only finds text similarity accurately but also adapts well to the distributed platform and scales well.
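The shingling, signature, and Jaccard steps named in the abstract can be illustrated with a minimal single-process sketch. It uses the textbook MinHash construction, not the paper's NewMinhash variant or its MapReduce design, neither of which the abstract specifies. The function names, the shingle length `k=3`, and the signature length `num_hashes=128` are all assumptions made for illustration.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-character shingles of text (k is an assumed default)."""
    text = " ".join(text.split())  # collapse runs of whitespace
    if len(text) < k:
        return {text} if text else set()
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard coefficient: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(shingle, seed):
    """Deterministic 64-bit hash of a shingle; each seed acts as one hash function."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(shingle_set, num_hashes=128):
    """One column of the signature matrix: the minimum hash value per seed."""
    if not shingle_set:
        raise ValueError("cannot sign an empty shingle set")
    return [min(_hash(s, seed) for s in shingle_set) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature positions."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

In this sketch, two log texts would each be passed through `shingles` and `minhash_signature`, and `estimated_jaccard` on the two signatures approximates what `jaccard` returns on the full shingle sets. In a distributed setting the per-document shingling and signature steps are natural candidates for a map phase, though how the paper partitions the work is not stated in the abstract.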


Submitted on January 25, 2018; Revised on March 13, 2018; Accepted on April 17, 2018
References: 15