Web data extraction system based on label library

Authors:
Shoubiao Tan;Chao Xu;Yuan Jiang
Affiliations:
School of Electronic Science and Technology, Anhui University, Hefei;School of Electronic Science and Technology, Anhui University, Hefei;School of Electronic Science and Technology, Anhui University, Hefei
Venue:
FSKD'09 Proceedings of the 6th international conference on Fuzzy systems and knowledge discovery - Volume 7
Year:
2009

Citing 8
Cited 0

Automatic information extraction from semi-structured Web pages by pattern discovery. Decision Support Systems - Web retrieval and mining
RoadRunner: Towards Automatic Data Extraction from Large Web Sites. Proceedings of the 27th International Conference on Very Large Data Bases
Data extraction and label assignment for web databases. WWW '03 Proceedings of the 12th international conference on World Wide Web
Extracting structured data from Web pages. Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Mining data records in Web pages. Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Automatic information extraction from large websites. Journal of the ACM (JACM)
A Survey of Web Information Extraction Systems. IEEE Transactions on Knowledge and Data Engineering
Towards domain-independent information extraction from web tables. Proceedings of the 16th international conference on World Wide Web

Abstract

This paper proposes a Web information extraction system based on a label library for extracting information from data-intensive Web pages. The system downloads dynamic Web pages based on a knowledge database, converts them to XML documents after preprocessing, and mines data regions using the MDR repeated-pattern discovery algorithm. It then recognizes the structure of those regions and extracts their data through a novel hierarchical pattern recognition and data extraction algorithm based on the label library, and stores the extracted data in the knowledge database to support further information extraction. Experiments showed that the system has high precision and adapts to Web pages from different domains and with different structures.
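
To make the pipeline in the abstract more concrete, the following is a minimal Python sketch of two of its steps: finding a data region as a run of structurally identical sibling nodes in an XML document, and mapping extracted labels to fields through a label library. It is not the authors' algorithm and not the actual MDR algorithm; exact tag-structure matching stands in here for MDR's approximate comparison of repeated patterns, and the names `LABEL_LIBRARY`, `tag_signature`, `find_data_regions`, `extract_record`, and the sample page are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical label library: known label strings mapped to canonical field names.
LABEL_LIBRARY = {"name": "title", "title": "title", "price": "price", "cost": "price"}


def tag_signature(node):
    """Return the node's tag structure as a nested tuple, ignoring all text."""
    return (node.tag, tuple(tag_signature(child) for child in node))


def find_data_regions(root, min_records=2):
    """Collect runs of adjacent non-leaf siblings that share one tag signature.

    Simplifying assumption: records repeat with exactly the same structure.
    """
    regions = []
    for parent in root.iter():
        run = []
        # The trailing None flushes the last run of each parent.
        for child in list(parent) + [None]:
            same = (child is not None and run
                    and tag_signature(child) == tag_signature(run[0]))
            if not same:
                if len(run) >= min_records:
                    regions.append(run)
                run = []
            # Leaf nodes are treated as fields, not as candidate records.
            if child is not None and len(child) > 0:
                run.append(child)
    return regions


def extract_record(node):
    """Turn 'Label: value' texts into fields named by the label library."""
    record = {}
    for element in node.iter():
        text = (element.text or "").strip()
        if ":" not in text:
            continue
        label, value = text.split(":", 1)
        field = LABEL_LIBRARY.get(label.strip().lower())
        if field:
            record[field] = value.strip()
    return record


def extract(page_xml):
    """Return one list of records per data region found in the page."""
    root = ET.fromstring(page_xml)
    return [[extract_record(node) for node in region]
            for region in find_data_regions(root)]


# Made-up page, already cleaned into well-formed XML.
PAGE = """
<html><body>
  <h1>Catalog</h1>
  <ul>
    <li><b>Name: Lamp</b><i>Price: 12</i></li>
    <li><b>Title: Desk</b><i>Cost: 80</i></li>
  </ul>
</body></html>
""".strip()

if __name__ == "__main__":
    for region in extract(PAGE):
        print(region)
```

In this sketch the two `li` elements form one data region, and the label library lets differently worded labels ("Name" and "Title", "Price" and "Cost") land in the same fields. A real system would need tolerant structure matching and a much richer label library.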