Bootstrapping Information Extraction from Semi-structured Web Pages

Authors:
Andrew Carlson; Charles Schafer
Affiliations:
Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA; Google, Inc., Pittsburgh, PA 15213, USA
Venue:
ECML PKDD '08 Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases - Part I
Year:
2008

Abstract

We consider the problem of extracting structured records from semi-structured web pages with no human supervision required for each target web site. Previous work on this problem has either required significant human effort for each target site or used brittle heuristics to identify semantic data types. Our method only requires annotation for a few pages from a few sites in the target domain. Thus, after a tiny investment of human effort, our method allows automatic extraction from potentially thousands of other sites within the same domain. Our approach extends previous methods for detecting data fields in semi-structured web pages by matching those fields to domain schema columns using robust models of data values and contexts. Annotating 2–5 pages for 4–6 web sites yields an extraction accuracy of 83.8% on job offer sites and 91.1% on vacation rental sites. These results significantly outperform a baseline approach.
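
To make the idea of matching detected fields to schema columns concrete, the sketch below shows one simple way such a matcher could look. It is an illustration under stated assumptions, not the authors' model: it compares only bag-of-token distributions of field values (ignoring context features), uses Jensen-Shannon divergence as the similarity measure, and the names `token_distribution`, `js_divergence`, `match_field`, and the example data are all hypothetical.

```python
from collections import Counter
import math


def token_distribution(values):
    """Relative frequency of lowercase whitespace tokens across a list of strings."""
    counts = Counter(tok for v in values for tok in v.lower().split())
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two sparse distributions."""
    keys = set(p) | set(q)
    # Mixture distribution m = (p + q) / 2 over the union of tokens.
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in keys}

    def kl(a):
        # KL(a || m); tokens with zero mass in `a` contribute nothing.
        return sum(pa * math.log2(pa / m[t]) for t, pa in a.items() if pa > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def match_field(field_values, schema_examples):
    """Return the schema column whose annotated values are closest to the field."""
    field_dist = token_distribution(field_values)
    return min(
        schema_examples,
        key=lambda col: js_divergence(
            field_dist, token_distribution(schema_examples[col])
        ),
    )


# Hypothetical annotated values for a job-offer domain (assumption, not paper data).
schema_examples = {
    "title": ["Software Engineer", "Senior Data Engineer", "Sales Manager"],
    "location": ["Pittsburgh, PA", "Austin, TX", "Seattle, WA"],
}

# Values from an unlabeled field detected on a new site.
print(match_field(["Staff Software Engineer", "Account Manager"], schema_examples))
```

A value-only token model like this is deliberately crude; the abstract indicates that the actual approach also models the contexts in which values appear, which this sketch omits.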