Interactive Tuples Extraction from Semi-Structured Data

Authors: Remi Gilleron; Patrick Marty; Marc Tommasi; Fabien Torre
Affiliations: INRIA Futurs and Lille University, France (all authors)
Venue: WI '06 Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence
Year: 2006

Abstract

This paper studies, from a machine learning viewpoint, the problem of extracting the tuples of a target n-ary relation from tree-structured data such as XML or XHTML documents. Our system extracts tuples without any post-processing, for all data structures including nested, rotated and cross tables. The wrapper induction algorithm we propose rests on two main ideas. First, it is incremental: partial tuples are extracted in order of increasing length. Second, it relies on a representation-enrichment procedure: partial tuples of length i are encoded using knowledge of the already extracted tuples of length i - 1. The algorithm is embedded in a user-friendly interactive wrapper induction system for Web documents. We evaluate the system on several information extraction tasks over corporate Web sites. It achieves state-of-the-art results on simple data structures and succeeds on complex data structures where previous approaches fail. Experiments also show that the interactive framework significantly reduces the number of user interactions needed to build a wrapper.
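
The incremental idea in the abstract can be illustrated with a small sketch. It is not the authors' algorithm: the hand-written tests `is_name` and `is_price` stand in for the learned classifiers, and the sample document, the function names and the use of the nearest common ancestor as the "enriched" feature are all assumptions made for illustration. The sketch only shows the control flow: tuples grow one component at a time, and the test for component i sees the partial tuple of length i - 1.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (not from the paper).
DOC = """
<table>
  <tr><td class="name">Widget</td><td class="price">4.50</td></tr>
  <tr><td class="name">Gadget</td><td class="price">12.00</td></tr>
</table>
"""

def build_parents(root):
    """Map every element to its parent (ElementTree stores no parent links)."""
    return {child: parent for parent in root.iter() for child in parent}

def ancestors(node, parents):
    """Return the chain node, parent, grandparent, ..., root."""
    chain = [node]
    while node in parents:
        node = parents[node]
        chain.append(node)
    return chain

def nearest_common_ancestor(a, b, parents):
    """Deepest element that is an ancestor-or-self of both a and b."""
    seen = set(ancestors(a, parents))
    for node in ancestors(b, parents):
        if node in seen:
            return node
    return None

# Hand-written stand-ins for learned per-component classifiers (assumption).
def is_name(node, partial, parents):
    return node.get("class") == "name"

def is_price(node, partial, parents):
    # Enriched test: the decision depends on the component already extracted.
    if node.get("class") != "price":
        return False
    nca = nearest_common_ancestor(partial[-1], node, parents)
    return nca is not None and nca.tag == "tr"

def extract_tuples(root, component_tests):
    """Extend partial tuples by one component per pass over the leaves."""
    parents = build_parents(root)
    leaves = [n for n in root.iter() if len(n) == 0]
    partial_tuples = [()]
    for test in component_tests:
        partial_tuples = [
            partial + (node,)
            for partial in partial_tuples
            for node in leaves
            if node not in partial and test(node, partial, parents)
        ]
    return [tuple(n.text for n in t) for t in partial_tuples]

root = ET.fromstring(DOC)
result = extract_tuples(root, [is_name, is_price])
print(result)
```

Without the nearest-common-ancestor check in `is_price`, every name would be paired with every price; conditioning the second test on the first component is what keeps each pair within one row in this toy example.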