Extracting relational data from HTML repositories

Authors:
Ruth Yuee Zhang;Laks V. S. Lakshmanan;Ruben H. Zamar
Affiliations:
Univ. of British Columbia, Vancouver, BC Canada;Univ. of British Columbia, Vancouver, BC Canada;Univ. of British Columbia, Vancouver, BC Canada
Venue:
ACM SIGKDD Explorations Newsletter
Year:
2004

Citing 9
Cited 4

Machine Learning for Information Extraction in Informal Domains

Machine Learning - Special issue on information retrieval
Monadic datalog and the expressive power of languages for web information extraction

Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
A brief survey of web data extraction tools

ACM SIGMOD Record
Supervised Wrapper Generation with Lixto

Proceedings of the 27th International Conference on Very Large Data Bases
Visual Web Information Extraction with Lixto

Proceedings of the 27th International Conference on Very Large Data Bases
RoadRunner: Towards Automatic Data Extraction from Large Web Sites

Proceedings of the 27th International Conference on Very Large Data Bases
Extracting Patterns and Relations from the World Wide Web

WebDB '98 Selected papers from the International Workshop on The World Wide Web and Databases
Extracting structured data from Web pages

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Mining data records in Web pages

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining

Declarative information extraction using datalog with embedded extraction predicates

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Information extraction from syllabi for academic e-Advising

Expert Systems with Applications: An International Journal
Record extraction based on user feedback and document selection

APWeb/WAIM'07 Proceedings of the joint 9th Asia-Pacific web and 8th international conference on web-age information management conference on Advances in data and web management
Extracting XML data from the web

Proceedings of the 12th International Conference on Information Integration and Web-based Applications & Services

Quantified Score

Hi-index	0.00

Visualization

Abstract

There is a vast amount of valuable information in HTML documents, widely distributed across the World Wide Web and across corporate intranets. Unfortunately, HTML is mainly presentation oriented and hard to query. In this paper, we develop a system to extract desired information (records) from thousands of HTML documents, starting from a small set of examples. Duplicates in the result are automatically detected and eliminated. We propose a novel method to estimate the current coverage of results by the system, based on capture-recapture models with unequal capture probabilities. We also propose techniques for estimating the error rate of the extracted information and an interactive the technique for enhancing information quality. To evaluate the method and ideas proposed in this paper, we conducted an extensive set of experiments. Our experimental results validate the effectiveness and utility of our system, and demonstrate interesting tradeoffs between running time of information extraction and coverage of results.