

An Automatic Web Data Extraction Approach based on Path Index Trees

Volume 14, Number 10, October 2018, pp. 2449-2460
DOI: 10.23940/ijpe.18.10.p21.24492460

Yan Wen (a), Qingtian Zeng (b), Hua Duan (c), Feng Zhang (a), and Xin Chen (a)

(a) College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao, 266590, China
(b) College of Electronic, Communication and Physics, Shandong University of Science and Technology, Qingdao, 266590, China
(c) College of Mathematics and System Science, Shandong University of Science and Technology, Qingdao, 266590, China

(Submitted on July 11, 2018; Revised on August 15, 2018; Accepted on September 16, 2018)


This paper proposes ITE, a novel approach that extracts web data records fully automatically. The approach exploits the tag index information at different layers of the HTML DOM tree and introduces the concept of an index tree, together with its repetitiveness and consecutiveness, to characterize the key structural information in a web page. Repetitiveness captures the structural similarity among data records, and consecutiveness captures the sequential arrangement of multiple records. Based on these concepts, the complex DOM tree can be compressed into a set of index trees. We also provide a series of properties as theoretical support. The extraction process consists of three steps: repetitiveness discovery, consecutiveness discovery, and index tree merging. To handle missing data fields, multiple record roots, and other complicated situations, we propose a digital sequence similarity measure and a hierarchical clustering approach to find the repeating patterns. Data records are then identified by the consecutiveness discovery method, and the data blocks containing full data records are restored by merging the index trees. Experiments demonstrate the effectiveness and efficiency of the proposed approach: it outperforms existing classic work in accuracy, and its execution time is satisfactory, which makes it applicable to large datasets. The time complexity is linear in the number of leaf nodes in the DOM tree of a web page.
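
To make the notions of tag index, repetitiveness, and consecutiveness more concrete, the following is a minimal Python sketch; it is not the ITE algorithm itself. It assumes that the "index" of a node is its position among its siblings, that leaves sharing the same tag path are candidates for a repeating pattern, and that records are consecutive when exactly one layer of their index paths increases by one from leaf to leaf. The names IndexPathCollector, group_by_tag_path, sequence_similarity, and consecutive_layer are hypothetical, and the similarity function is a simple stand-in for the paper's digital sequence similarity measure, which the abstract does not define.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements without a closing tag; they count as children but are never pushed.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class IndexPathCollector(HTMLParser):
    """Record (tag path, index path, text) for every non-empty text leaf."""

    def __init__(self):
        super().__init__()
        self.stack = []           # open elements as (tag, index among siblings)
        self.child_counts = [0]   # children seen so far at each open level
        self.leaves = []

    def handle_starttag(self, tag, attrs):
        index = self.child_counts[-1]
        self.child_counts[-1] += 1
        if tag in VOID_TAGS:
            return
        self.stack.append((tag, index))
        self.child_counts.append(0)

    def handle_endtag(self, tag):
        # Assumes well-nested markup; unmatched end tags are ignored.
        if self.stack and self.stack[-1][0] == tag:
            self.stack.pop()
            self.child_counts.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            tag_path = "/".join(t for t, _ in self.stack)
            index_path = tuple(i for _, i in self.stack)
            self.leaves.append((tag_path, index_path, text))


def group_by_tag_path(html):
    """Group text leaves by tag path; each group is a candidate repeating pattern."""
    collector = IndexPathCollector()
    collector.feed(html)
    groups = defaultdict(list)
    for tag_path, index_path, text in collector.leaves:
        groups[tag_path].append((index_path, text))
    return dict(groups)


def sequence_similarity(a, b):
    """Illustrative measure: share of aligned positions where two index paths agree."""
    if not a and not b:
        return 1.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def consecutive_layer(index_paths):
    """Return the single layer whose index grows by one per path, else None."""
    if len(index_paths) < 2 or len({len(p) for p in index_paths}) != 1:
        return None
    varying = [d for d in range(len(index_paths[0]))
               if len({p[d] for p in index_paths}) > 1]
    if len(varying) != 1:
        return None
    d = varying[0]
    values = [p[d] for p in index_paths]
    return d if all(b - a == 1 for a, b in zip(values, values[1:])) else None


PAGE = """
<ul>
  <li><b>Book A</b><i>10</i></li>
  <li><b>Book B</b><i>12</i></li>
  <li><b>Book C</b><i>9</i></li>
</ul>
"""

if __name__ == "__main__":
    for tag_path, leaves in group_by_tag_path(PAGE).items():
        paths = [p for p, _ in leaves]
        print(tag_path, paths, "record layer:", consecutive_layer(paths))
```

In this toy page, the leaves under ul/li/b share one tag path and differ only in the index of the li layer, which suggests that each li element is the root of one data record. Handling missing fields or multiple record roots, as the abstract describes, would require the clustering step that this sketch omits.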



