Int J Performability Eng ›› 2018, Vol. 14 ›› Issue (10): 2449-2460. doi: 10.23940/ijpe.18.10.p21.24492460

• Original articles •

An Automatic Web Data Extraction Approach based on Path Index Trees

Yan Wen (a), Qingtian Zeng (b), Hua Duan (c), Feng Zhang (a), and Xin Chen (a)

    (a) College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao, 266590, China
    (b) College of Electronic, Communication and Physics, Shandong University of Science and Technology, Qingdao, 266590, China
    (c) College of Mathematics and System Science, Shandong University of Science and Technology, Qingdao, 266590, China

Abstract:

This paper proposes ITE, a novel approach for extracting web data records fully automatically. The approach exploits the tag index information in different layers of the HTML DOM tree and introduces the concept of an index tree, together with its repetitiveness and consecutiveness, to characterize the key structural information of a web page. Repetitiveness captures the structural similarity among data records, and consecutiveness captures the sequential arrangement of multiple records. Based on these concepts, a complex DOM tree can be compressed into a set of index trees. We also provide a series of properties as theoretical support. The extraction process has three steps: repetitiveness discovery, consecutiveness discovery, and index tree merging. To handle missing data fields, multiple record roots, and other complicated situations, we propose a similarity measure for digital sequences and a hierarchical clustering approach to find the repeating patterns. Data records are then identified by the consecutiveness discovery method, and the data blocks containing full data records are restored by merging the index trees. Experiments demonstrate the effectiveness and efficiency of the proposed approach: it outperforms existing classic approaches in accuracy and has a satisfactory execution time, which makes it applicable to large datasets. The time complexity is linear in the number of leaf nodes in the DOM tree of a web page.
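
To make the idea of tag index information more concrete, the following Python sketch records, for each text leaf of a small page, its tag path and the sibling index at every layer, then groups leaves that share a tag path and checks whether their indices run consecutively at the first layer where they differ. This is an illustrative assumption about how repetitiveness and consecutiveness might be observed, not the ITE algorithm described in the paper. The names `LeafPathCollector`, `repeating_groups`, `extract_groups`, and `SAMPLE` are hypothetical, and the similarity measure, hierarchical clustering, and index tree merging steps are not modeled.

```python
from collections import defaultdict
from html.parser import HTMLParser

# HTML elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class LeafPathCollector(HTMLParser):
    """Record the tag path and sibling-index path of every non-empty text leaf."""

    def __init__(self):
        super().__init__()
        self.stack = []         # open elements: [tag, index among siblings, children seen]
        self.root_children = 0  # number of top-level elements seen so far
        self.leaves = []        # (tag_path, index_path, text)

    def handle_starttag(self, tag, attrs):
        if self.stack:
            parent = self.stack[-1]
            index = parent[2]
            parent[2] += 1
        else:
            index = self.root_children
            self.root_children += 1
        if tag not in VOID_TAGS:
            self.stack.append([tag, index, 0])

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            tags = tuple(entry[0] for entry in self.stack)
            idxs = tuple(entry[1] for entry in self.stack)
            self.leaves.append((tags, idxs, text))


def repeating_groups(leaves):
    """Group leaves by tag path and describe the first layer where their indices differ."""
    groups = defaultdict(list)
    for tags, idxs, text in leaves:
        groups[tags].append((idxs, text))

    result = {}
    for tags, members in groups.items():
        if len(members) < 2:
            continue  # a single leaf cannot repeat
        index_paths = [idxs for idxs, _ in members]
        depth = next((d for d in range(len(tags))
                      if len({p[d] for p in index_paths}) > 1), None)
        if depth is None:
            continue  # several text nodes inside the same element
        values = sorted({p[depth] for p in index_paths})
        result[tags] = {
            "depth": depth,
            "indices": values,
            "consecutive": values == list(range(values[0], values[-1] + 1)),
            "texts": [text for _, text in members],
        }
    return result


def extract_groups(html):
    """Parse an HTML string and return its repeating leaf groups."""
    collector = LeafPathCollector()
    collector.feed(html)
    collector.close()
    return repeating_groups(collector.leaves)


# Hypothetical sample page with three similarly structured records.
SAMPLE = """
<html><body>
<h1>Books</h1>
<ul>
  <li><b>Title A</b><i>10</i></li>
  <li><b>Title B</b><i>12</i></li>
  <li><b>Title C</b><i>9</i></li>
</ul>
</body></html>
"""
```

Calling `extract_groups(SAMPLE)` groups the three titles under one tag path and the three prices under another. In both groups the indices vary at the `li` layer, which suggests that each `li` element is the root of one record in this toy page.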


Submitted on July 11, 2018; Revised on August 15, 2018; Accepted on September 16, 2018
References: 12