AutoFeed: an unsupervised learning system for generating webfeeds

Authors:
Bora Gazen;Steven Minton
Affiliations:
Fetch Technologies, El Segundo, CA;Fetch Technologies, El Segundo, CA
Venue:
Proceedings of the 3rd international conference on Knowledge capture
Year:
2005

Abstract

Our goal is to automatically extract data from semi-structured web sites. Previously, researchers have developed two types of supervised learning approaches for extracting web data: methods that create precise, site-specific extraction rules and methods that learn less precise, site-independent extraction rules. In either case, significant training is required. In this paper, we describe a third, more ambitious approach, where we use unsupervised learning to analyze sites and discover their structure. Our method relies on a set of heterogeneous "experts", each of which is capable of identifying certain types of generic structure. Each expert represents its discoveries as "hints". Based on these hints, our system clusters the pages and identifies semi-structured data that can be extracted. To identify a good clustering, we use a probabilistic model of the hint-generation process. The paper describes our formulation of the fully automatic web-extraction problem, our clustering approach, and our results on a set of experiments.
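
The expert-and-hint idea in the abstract can be illustrated with a small sketch. This is not the system's implementation: the two experts (`url_expert`, `tag_expert`), the sample `PAGES`, and the `min_votes` threshold are all hypothetical, and a simple vote count stands in for the probabilistic model of hint generation, which the abstract does not specify. Each expert returns hints as sets of pages it believes share structure, and pages are merged into one cluster when enough hints agree.

```python
import re
from collections import defaultdict
from itertools import combinations

# Hypothetical sample site (URL -> HTML); not taken from the paper.
PAGES = {
    "/item/1": "<h1>Item 1</h1><table><tr><td>a</td></tr></table>",
    "/item/2": "<h1>Item 2</h1><table><tr><td>b</td></tr></table>",
    "/about": "<p>About us</p>",
    "/contact": "<form><input></form>",
}

def url_expert(pages):
    """Hypothetical expert: hints that pages sharing a URL prefix belong together."""
    groups = defaultdict(set)
    for url in pages:
        groups[url.rsplit("/", 1)[0]].add(url)
    return [g for g in groups.values() if len(g) > 1]

def tag_expert(pages):
    """Hypothetical expert: hints that pages with the same opening-tag sequence belong together."""
    groups = defaultdict(set)
    for url, html in pages.items():
        tags = tuple(re.findall(r"<([a-zA-Z][a-zA-Z0-9]*)", html))
        groups[tags].add(url)
    return [g for g in groups.values() if len(g) > 1]

def cluster(pages, experts, min_votes=2):
    """Merge pages that at least `min_votes` hints place together.

    A vote count is an assumed stand-in for the probabilistic model.
    """
    votes = defaultdict(int)
    for expert in experts:
        for hint in expert(pages):
            for a, b in combinations(sorted(hint), 2):
                votes[(a, b)] += 1

    # Union-find over pages, linking pairs with enough supporting hints.
    parent = {url: url for url in pages}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), n in votes.items():
        if n >= min_votes:
            parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for url in pages:
        clusters[find(url)].add(url)
    return sorted(sorted(c) for c in clusters.values())

print(cluster(PAGES, [url_expert, tag_expert]))
```

With the default threshold, only the two item pages are grouped, because both experts hint at that pair; the about and contact pages share a URL prefix but not a tag sequence, so they stay separate. Lowering `min_votes` to 1 accepts any single hint and merges them as well.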