Tag tree template for Web information and schema extraction

Authors:
Xiangwen Ji; Jianping Zeng; Shiyong Zhang; Chengrong Wu
Affiliations:
School of Computer Science, Fudan University, Shanghai 200433, China (all authors)
Venue:
Expert Systems with Applications: An International Journal
Year:
2010

Abstract

Information extraction from the Web is both interesting and challenging, and it can support Web search, information retrieval, and Web mining. Web pages on many sites are generated dynamically as structured records by filling an HTML template from a back-end database. To efficiently extract meaningful information, including records and their data schema, from this kind of page, a new method based on a Tag tree template is proposed. Web pages from different Web sites are parsed into Tag trees, and templates for each site are then generated from the trees using a cost-based tree similarity measurement. The exclusive content of each page is then extracted by using the templates to parse the page. Finally, the records in the pages and the schema of the records are extracted from the exclusive content by finding repeating patterns and applying heuristic rules. Extraction experiments on 360 pages from 12 Web sites show that the proposed method is an effective way to extract meaningful information.
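The pipeline outlined in the abstract (parse pages into tag trees, derive a site template, extract each page's exclusive content, then look for repeating patterns) can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not specify the cost-based tree similarity measurement or the heuristic rules, so the sketch substitutes two simplifying assumptions. A text node counts as template content only if the same tag path and text occur in every page, and a tag path that repeats within the exclusive content is taken as a candidate record field. All names (`TagTreeBuilder`, `leaves`, `exclusive_content`, `repeated_paths`) are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class TagTreeBuilder(HTMLParser):
    """Collect a (tag path, text) pair for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, root first
        self.leaves = []  # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.leaves.append(("/".join(self.stack), text))


def leaves(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.leaves


def exclusive_content(pages):
    """Assumption: leaves shared by every page form the template.

    Exact matching stands in for the paper's cost-based similarity.
    """
    parsed = [leaves(page) for page in pages]
    template = set(parsed[0]).intersection(*parsed[1:])
    return [[leaf for leaf in page if leaf not in template] for page in parsed]


def repeated_paths(exclusive):
    """Assumption: a tag path seen more than once is a candidate record field."""
    counts = Counter(path for path, _ in exclusive)
    return [path for path, n in counts.items() if n > 1]


page_a = ("<html><body><h1>Shop</h1><ul><li>Pen</li><li>Ink</li></ul>"
          "<p>Contact us</p></body></html>")
page_b = ("<html><body><h1>Shop</h1><ul><li>Cup</li><li>Mug</li></ul>"
          "<p>Contact us</p></body></html>")

per_page = exclusive_content([page_a, page_b])
print(per_page[0])                  # exclusive leaves of the first page
print(repeated_paths(per_page[0]))  # candidate record paths
```

On these two toy pages the shared heading and footer are treated as template, leaving only the list items as exclusive content. A real implementation would need approximate tree matching to cope with pages whose structure differs slightly, which is where a cost-based measure such as the one the abstract mentions would come in.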