Using Grammatical Inference to Automate Information Extraction from the Web

Authors:
Theodore W. Hong;Keith L. Clark
Affiliations:
-;-
Venue:
PKDD '01 Proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discovery
Year:
2001

Citing 9
Cited 5

Recent advances of grammatical inference

Theoretical Computer Science - Special issue on algorithmic learning theory
A scalable comparison-shopping agent for the World-Wide Web

AGENTS '97 Proceedings of the first international conference on Autonomous agents
Recognizing structure in Web pages using similarity queries

AAAI '99/IAAI '99 Proceedings of the sixteenth national conference on Artificial intelligence and the eleventh Innovative applications of artificial intelligence conference innovative applications of artificial intelligence
Wrapper induction: efficiency and expressiveness

Artificial Intelligence - Special issue on Intelligent internet systems
Learning to construct knowledge bases from the World Wide Web

Artificial Intelligence - Special issue on Intelligent internet systems
Querying Semi-Structured Data

ICDT '97 Proceedings of the 6th International Conference on Database Theory
Semi-Automatic Wrapper Generation for Internet Information Sources

COOPIS '97 Proceedings of the Second IFCIS International Conference on Cooperative Information Systems
Visualizing Real Estate Property Information on the Web

IV '99 Proceedings of the 1999 International Conference on Information Visualisation
Grammatical inference by Hill Climbing

Information Sciences: an International Journal

Information Extraction in Structured Documents Using Tree Automata Induction

PKDD '02 Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery
Fine-grain web site structure discovery

WIDM '03 Proceedings of the 5th ACM international workshop on Web information and data management
Querying parse trees of stochastic context-free grammars

Proceedings of the 13th International Conference on Database Theory
TreeWrapper: automatic data extraction based on tree representation

AI'06 Proceedings of the 19th Australian joint conference on Artificial Intelligence: advances in Artificial Intelligence
Data extraction from web pages based on structural-semantic entropy

Proceedings of the 21st international conference companion on World Wide Web

Quantified Score

Hi-index	0.00

Visualization

Abstract

The World-Wide Web contains a wealth of semistructured information sources that often give partial/overlapping views on the same domains, such as real estate listings or book prices. These partial sources could be used more effectively if integrated into a single view; however, since they are typically formatted in diverse ways for human viewing, extracting their data for integration is a difficult challenge. Existing learning systems for this task generally use hardcoded ad hoc heuristics, are restricted in the domains and structures they can recognize, and/or require manual training. We describe a principled method for automatically generating extraction wrappers using grammatical inference that can recognize general structures and does not rely on manually-labelled examples. Domain-specific knowledge is explicitly separated out in the form of declarative rules. The method is demonstrated in a test setting by extracting real estate listings from web pages and integrating them into an interactive data visualization tool based on dynamic queries.