Tag tree template for Web information and schema extraction

  • Authors:
  • Xiangwen Ji;Jianping Zeng;Shiyong Zhang;Chengrong Wu

  • Affiliations:
  • School of Computer Science, Fudan University, Shanghai 200433, China;School of Computer Science, Fudan University, Shanghai 200433, China;School of Computer Science, Fudan University, Shanghai 200433, China;School of Computer Science, Fudan University, Shanghai 200433, China

  • Venue:
  • Expert Systems with Applications: An International Journal
  • Year:
  • 2010

Quantified Score

Hi-index 12.05

Abstract

Extracting information from the Web is both interesting and challenging, and it can support Web search, information retrieval and Web mining. Many sites generate their pages dynamically, filling an HTML template with structured records from a back-end database. To efficiently extract meaningful information, including records and their data schema, from this kind of page, a new method based on tag tree templates is proposed. Web pages from different Web sites are parsed into tag trees, and templates for each site are then generated from the trees using a cost-based tree similarity measure. The exclusive content of each page is then extracted by using the templates to parse the page. Finally, the records in the pages and the schema of those records are extracted from the exclusive content by finding repeating patterns and applying heuristic rules. Extraction experiments on 360 pages from 12 Web sites show that the proposed method is an effective way to extract meaningful information.
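
The abstract does not specify the tree similarity measure or the template format, so the following is only a minimal sketch of the general idea and not the authors' algorithm. It assumes a simple top-down edit cost in which an unmatched subtree costs its node count, and it treats text that appears at the same tag path in another page of the same site as template text. All names (`Node`, `parse`, `tree_cost`, `similarity`, `exclusive_content`) are hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}  # elements with no end tag


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children, self.text = [], ""


class TagTreeBuilder(HTMLParser):
    """Builds a tag tree from HTML, keeping tag names and direct text only."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text += data.strip()


def parse(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def size(node):
    return 1 + sum(size(child) for child in node.children)


def tree_cost(a, b):
    """Assumed cost model: deleting or inserting a subtree costs its size."""
    if a.tag != b.tag:
        return size(a) + size(b)
    m, n = len(a.children), len(b.children)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            ca, cb = a.children[i - 1], b.children[j - 1]
            d[i][j] = min(d[i - 1][j] + size(ca),              # drop subtree of a
                          d[i][j - 1] + size(cb),              # drop subtree of b
                          d[i - 1][j - 1] + tree_cost(ca, cb))  # match the pair
    return d[m][n]


def similarity(a, b):
    """Normalized to [0, 1]; 1.0 means the two tag trees are identical."""
    return 1.0 - tree_cost(a, b) / (size(a) + size(b))


def texts(node, path=""):
    path = path + "/" + node.tag
    found = [(path, node.text)] if node.text else []
    for child in node.children:
        found.extend(texts(child, path))
    return found


def exclusive_content(page, other):
    """Text in `page` that does not occur at the same tag path in `other`."""
    shared = set(texts(other))
    return [text for path, text in texts(page) if (path, text) not in shared]


tree_a = parse("<html><body><h1>Shop</h1><ul><li>Pen</li><li>Ink</li></ul></body></html>")
tree_b = parse("<html><body><h1>Shop</h1><ul><li>Cup</li></ul></body></html>")
print(similarity(tree_a, tree_b))
print(exclusive_content(tree_a, tree_b))
```

In this toy example the two pages share the heading and list skeleton, so their similarity is high, and only the list items survive as exclusive content. Pages whose similarity exceeds some chosen threshold could be grouped under one template; the record and schema extraction step (repeating patterns plus heuristic rules) is not sketched here because the abstract gives no detail about it.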