WRAPPER INFERENCE FOR AMBIGUOUS WEB PAGES

Authors:
Valter Crescenzi;Paolo Merialdo
Affiliations:
Dipartimento di Informatica e Automazione, Università degli Studi Roma Tre, Roma, Italy;Dipartimento di Informatica e Automazione, Università degli Studi Roma Tre, Roma, Italy
Venue:
Applied Artificial Intelligence
Year:
2008


Abstract

Several studies have addressed the generation of wrappers for web data sources. Because wrappers can be naturally described as grammars, results from grammatical inference could play a significant role in this research field. Recent work has identified a new subclass of regular languages, called prefix mark-up languages, that abstracts the structures usually found in the HTML pages of large web sites. This class has been proven to be identifiable in the limit, and a polynomial-time (PTIME) unsupervised learning algorithm for it has already been developed. Unfortunately, many real-life web pages do not fall into this class. In this article we analyze the roots of the problem and propose a technique that transforms pages so as to bring them into the class of prefix mark-up languages. This yields a practical solution without giving up the formal foundations of the grammatical inference framework. We report on experiments conducted on real-life web pages to evaluate the approach; the results demonstrate the effectiveness of the proposed techniques.
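To make the idea of a wrapper described as a grammar concrete, the following minimal sketch expresses a hand-written wrapper as a regular expression over HTML and applies it to a small page. It is only an illustration and not the inference algorithm of the article: the page fragment, the field names `title` and `year`, and the names `BOOK_WRAPPER` and `extract` are assumptions introduced for this example.

```python
import re
from typing import Dict, List, Optional

# Hypothetical wrapper written by hand: the regular expression plays the
# role of the grammar. Literal tags are the page template; named groups
# mark the data fields to extract.
BOOK_WRAPPER = re.compile(
    r"<li><b>(?P<title>[^<]+)</b>"   # mandatory field
    r"(?:<i>(?P<year>[^<]+)</i>)?"   # optional field
    r"</li>"
)


def extract(page: str) -> List[Dict[str, Optional[str]]]:
    """Apply the wrapper to every record in the page; return one dict per record."""
    return [match.groupdict() for match in BOOK_WRAPPER.finditer(page)]


# Invented sample page: two records, the second one without the optional year.
page = "<ul><li><b>Dune</b><i>1965</i></li><li><b>Emma</b></li></ul>"
records = extract(page)
for record in records:
    print(record)
```

In this sketch the repetition of `<li>` records corresponds to an iterated part of the grammar and the `<i>` element to an optional part. Wrapper inference, as discussed in the abstract, aims to learn such an expression from sample pages instead of writing it by hand.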