Facilitating wrapper generation with page analysis

Authors:
Bo Wu; Xueqi Cheng; Yu Wang; Gang Zhang; Guodong Ding
Affiliations:
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, P. R. China; Institute of Computing Technology, Chinese Academy of Sciences, Beijing, P. R. China; Institute of Computing Technology, Chinese Academy of Sciences, Beijing, P. R. China; Institute of Computing Technology, Chinese Academy of Sciences, Beijing, P. R. China; Institute of Computing Technology, Chinese Academy of Sciences, Beijing, P. R. China
Venue:
ISI'09: Proceedings of the 2009 IEEE International Conference on Intelligence and Security Informatics
Year:
2009


Abstract

Current approaches to generating wrappers for web page extraction require a large number of labeled training pages to produce satisfactory results, while the data extracted by fully automatic methods is of unreliable quality. In this paper, we propose a novel method that facilitates wrapper generation by combining wrapper induction with page analysis. In addition to manually labeled data, the method takes advantage of a set of unlabeled pages to improve the quality of the induced wrappers. Our experiments demonstrate that the system achieves satisfactory results with fewer manually labeled training pages.