A Mining Model of Network Log Data based on Hadoop

Volume 14, Number 5, May 2018, pp. 1078-1087
DOI: 10.23940/ijpe.18.05.p27.10781087

Yun Wu (a), Xin Ma (a), Guangqian Kong (a), Bin Wang (b, c), and Xinwei Niu (c)

(a) College of Computer Science and Technology, Guizhou University, Guiyang, 550025, China
(b) School of Mechanical Engineering, Yancheng Institute of Technology, Yancheng, 224000, China
(c) School of Engineering, Penn State Behrend, Erie, Pennsylvania, 19019, United States

(Submitted on January 25, 2018; Revised on March 13, 2018; Accepted on April 17, 2018)


As the volume of data grows in the information age, traditional Web log data mining methods can no longer handle large-scale text data. To address this problem, we design a highly reliable Web log data mining scheme and propose a text similarity detection model based on Hadoop. First, we design a data mining scheme for user behavior logs that takes into account the heterogeneity, diversity, and complexity of network log data. The platform is divided into three layers: a data storage layer, a business logic layer, and an application layer. In this part, we design the data cleaning algorithm and the key performance indicators (KPIs), and then use Hive to complete the mining. Second, we propose a Hadoop-based similarity mining model for text log data and design its algorithm. The model comprises a Shingling algorithm and a NewMinhash algorithm, both designed for MapReduce. An improved Shingling algorithm based on the MapReduce programming model converts each document into a set. The distributed NewMinhash algorithm then computes the signature matrix, and Jaccard coefficients are used to calculate the similarity. We conduct an experimental analysis on the SogouCS data set. The experimental results show the effectiveness of the NewMinhash algorithm and indicate that the model not only finds text similarity accurately but also adapts well to the distributed platform and scales well.
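
The pipeline described above (document to shingle set, shingle set to signature, signature comparison as an estimate of the Jaccard coefficient) can be illustrated with a minimal single-machine sketch. This is the textbook Shingling and MinHash technique, not the paper's improved Shingling or NewMinhash algorithms, and it does not use MapReduce. The function names, the character-level shingle length `k`, the number of hash functions, and the use of seeded MD5 as a hash family are all assumptions made for illustration.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-character shingles of text (k is an assumed default)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard coefficient |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(seed, shingle):
    # Hypothetical hash family: seeded MD5, truncated to 64 bits.
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash_signature(shingle_set, num_hashes=128):
    """One column of the signature matrix: the minimum hash value per hash function.

    Assumes shingle_set is non-empty.
    """
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(num_hashes)]


def estimate_similarity(sig_a, sig_b):
    """Fraction of positions where two signatures agree; estimates the Jaccard coefficient."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


if __name__ == "__main__":
    doc_a = shingles("the quick brown fox jumps over the lazy dog")
    doc_b = shingles("the quick brown fox jumped over a lazy dog")
    print("exact Jaccard:   ", jaccard(doc_a, doc_b))
    print("MinHash estimate:", estimate_similarity(
        minhash_signature(doc_a), minhash_signature(doc_b)))
```

In a distributed setting, the shingling step and the per-document signature computation are independent across documents, which is what makes them natural candidates for map tasks; how the paper partitions this work is not stated in the abstract.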

