Int J Performability Eng ›› 2019, Vol. 15 ›› Issue (10): 2645-2656. doi: 10.23940/ijpe.19.10.p10.26452656

• Original Article

An Improved Focused Web Crawler based on Hybrid Similarity

Songtao Shang*, Huaiguang Wu, and Jiangtao Ma   

  1. School of Computer and Communication Engineering, Zhengzhou University of Light Industry, Zhengzhou, 450002, China
  • Contact: Shang Songtao
  • About author:

    * Corresponding author. E-mail address: songtao.shang@zzuli.edu.cn

  • Supported by:
    This work is supported by the Research Fund for the Doctoral Program of Zhengzhou University of Light Industry (Nos. 2017BSJJ046 and 2018BSJJ039), the Second Education Fund for Industry and Education Project "Digital Science and Technology, Wisdom for the Future" (No. 2018A01094), and the Henan Province Educational Committee (No. 17A520064).

Abstract:

Web crawlers are an efficient means of automatically downloading data from the Internet. A focused web crawler is a special kind of web crawler that retrieves topic-specific information from webpages and makes it available to users. The central problem for a focused web crawler is determining the similarity between target webpages and the given topics. This paper therefore proposes an improved focused web crawler algorithm, called hybrid similarity, whose similarity calculation draws on three aspects of a webpage: anchor text, content, and structure. If the anchor text similarity exceeds a threshold, the target webpage is downloaded directly; otherwise, its similarity is analyzed using the TF-Gini feature weighting algorithm and an improved cosine similarity algorithm. Experimental results show that the hybrid similarity algorithm is more effective than the traditional algorithm, improving precision by nearly 10%.

Key words: focused web crawler, TF-Gini, similarity, hybrid similarity
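
The two-stage decision described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the TF-Gini weighting or the improved cosine formula, so plain cosine similarity over raw term counts stands in for both, and the structure-based similarity is omitted. The function names and both threshold values are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Plain cosine similarity over raw term counts.

    Stand-in only: the paper uses TF-Gini weights and an improved
    cosine measure, neither of which is specified in the abstract.
    """
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_relevant(topic: str, anchor_text: str, page_text: str,
                anchor_threshold: float = 0.5,
                content_threshold: float = 0.3) -> bool:
    """Two-stage relevance check (threshold values are assumptions)."""
    # Stage 1: accept the link directly if its anchor text is close
    # enough to the topic.
    if cosine_similarity(topic, anchor_text) > anchor_threshold:
        return True
    # Stage 2: otherwise fall back to analyzing the page content.
    return cosine_similarity(topic, page_text) > content_threshold
```

For example, `is_relevant("focused web crawler", "click here", "recipes for dinner")` returns `False`, because neither the anchor text nor the page content shares any term with the topic.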