Web data extraction system based on label library

  • Authors:
  • Shoubiao Tan; Chao Xu; Yuan Jiang

  • Affiliations:
  • School of Electronic Science and Technology, Anhui University, Hefei; School of Electronic Science and Technology, Anhui University, Hefei; School of Electronic Science and Technology, Anhui University, Hefei

  • Venue:
  • FSKD'09: Proceedings of the 6th International Conference on Fuzzy Systems and Knowledge Discovery - Volume 7
  • Year:
  • 2009

Abstract

This paper proposes a Web information extraction system based on a label library for extracting information from data-intensive web pages. The system downloads dynamic web pages with the help of a knowledge database, converts them to XML documents after preprocessing, mines data regions using the MDR repeated-pattern discovery algorithm, recognizes their structure and extracts data from them through a novel hierarchical pattern recognition and data extraction algorithm based on the label library, and stores the extracted data in the knowledge database to support further information extraction. Experiments showed that the system achieves high precision and adapts to web pages from different domains and with different structures.
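
The middle stages of the pipeline described in the abstract (XML document, data-region mining, label-based extraction) can be illustrated with a minimal sketch. This is not the authors' algorithm and it is not MDR itself: MDR compares tag strings by edit distance, whereas the sketch below only groups adjacent sibling subtrees whose tag structure is identical. The function names, the sample document, and the format of the label library (a plain mapping from a `class` attribute to a field name) are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def tag_signature(node):
    """Return a string describing the tag structure of a subtree (text ignored)."""
    children = "".join(tag_signature(child) for child in node)
    return f"<{node.tag}>{children}</{node.tag}>"


def find_data_regions(root, min_records=2):
    """Return (parent, records) for each run of adjacent, non-leaf siblings
    that share the same tag signature. A simplified stand-in for MDR."""
    regions = []
    for parent in root.iter():
        children = list(parent)
        start = 0
        while start < len(children):
            sig = tag_signature(children[start])
            end = start + 1
            while end < len(children) and tag_signature(children[end]) == sig:
                end += 1
            # Require nested structure so runs of bare leaf tags are skipped.
            if end - start >= min_records and len(children[start]) > 0:
                regions.append((parent, children[start:end]))
            start = end
    return regions


def extract_records(records, label_library):
    """Map the text of each leaf element to a field name looked up in a
    hypothetical label library, falling back to the tag name."""
    rows = []
    for record in records:
        row = {}
        for leaf in record.iter():
            if len(leaf) == 0 and leaf.text and leaf.text.strip():
                field = label_library.get(leaf.get("class"), leaf.tag)
                row[field] = leaf.text.strip()
        rows.append(row)
    return rows


# Hypothetical, already-preprocessed (well-formed) page and label library.
SAMPLE = """
<html><body>
  <h1>Books</h1>
  <ul>
    <li><span class="t">Book A</span><span class="p">10</span></li>
    <li><span class="t">Book B</span><span class="p">12</span></li>
    <li><span class="t">Book C</span><span class="p">9</span></li>
  </ul>
</body></html>
"""
LABEL_LIBRARY = {"t": "title", "p": "price"}

if __name__ == "__main__":
    root = ET.fromstring(SAMPLE.strip())
    for parent, records in find_data_regions(root):
        print(parent.tag, extract_records(records, LABEL_LIBRARY))
```

In this sketch the three `li` elements form a single data region under `ul`, and each record's fields are named through the label mapping. Exact signature matching is brittle on real pages, where records often differ slightly in structure; that is the case approximate matching, as in MDR, is meant to handle.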