Accurately and reliably extracting data from the Web: a machine learning approach

Authors:
Craig A. Knoblock;Kristina Lerman;Steven Minton;Ion Muslea
Affiliations:
University of Southern California, 4676 Admiralty Way, Marina del Rey, CA and Fetch Technologies, 4676 Admiralty Way, Marina del Rey, CA;University of Southern California, 4676 Admiralty Way, Marina del Rey, CA;Fetch Technologies, 4676 Admiralty Way, Marina del Rey, CA;University of Southern California, 4676 Admiralty Way, Marina del Rey, CA
Venue:
Intelligent exploration of the web
Year:
2003

Abstract

A critical problem in developing information agents for the Web is accessing data that is formatted for human use. We have developed a set of tools for extracting data from web sites and transforming it into a structured data format, such as XML. The resulting data can then be used to build new applications without having to deal with unstructured data. The advantages of our wrapping technology over previous work are the ability to learn highly accurate extraction rules, to verify the wrapper to ensure that the correct data continues to be extracted, and to automatically adapt to changes in the sites from which the data is being extracted.