Integrating data from the web by machine-learning tree-pattern queries

Authors:
Benjamin Habegger;Denis Debarbieux
Affiliations:
Dipartimento di Informatica e Sistemistica, Università di Roma 1 – “La Sapienza”, Roma, Italy;LIFL, UMR 8022 CNRS, Lille University (France), Mostrare project, RU INRIA Futurs
Venue:
ODBASE'06/OTM'06 Proceedings of the 2006 Confederated international conference on On the Move to Meaningful Internet Systems: CoopIS, DOA, GADA, and ODBASE - Volume Part I
Year:
2006

Abstract

Efficient and reliable integration of web data requires building programs called wrappers. Writing wrappers by hand is tedious and error-prone. Constant changes on the web also imply that wrappers need to be constantly refactored. Machine learning has proven to be useful, but current techniques are either limited in expressivity, require non-intuitive user interaction, or do not allow for n-ary extraction. We study the use of tree-patterns as an n-ary extraction language and propose an algorithm for learning such queries. It calculates the most information-conservative tree-pattern that is a generalization of two input trees. A notable aspect is that the approach allows learning queries containing both child and descendant relationships between nodes. More importantly, the proposed approach does not require any labeling other than the data that the user effectively wants to extract. The experiments reported show the effectiveness of the approach.
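
To make the notion of an n-ary tree-pattern query with child and descendant relationships more concrete, here is a minimal Python sketch. It is not the learning algorithm proposed in the paper; it only illustrates how such a pattern, once obtained, might be evaluated against a document tree to extract tuples. The names `Node`, `Pattern`, `match`, and `extract`, the `"child"`/`"desc"` edge labels, and the sample document are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """A node of a (hypothetical) document tree, e.g. a parsed HTML element."""
    label: str
    children: list["Node"] = field(default_factory=list)
    text: str = ""


@dataclass
class Pattern:
    """A tree-pattern node. Each edge is ("child" | "desc", sub-pattern)."""
    label: str                                   # element name, "*" matches any
    edges: list[tuple[str, "Pattern"]] = field(default_factory=list)
    var: Optional[str] = None                    # bind the matched node's text


def descendants(node):
    """Yield all proper descendants of node in document (pre-)order."""
    for child in node.children:
        yield child
        yield from descendants(child)


def match(pattern, node):
    """Return every variable binding for pattern rooted at node."""
    if pattern.label != "*" and pattern.label != node.label:
        return []
    results = [{pattern.var: node.text}] if pattern.var else [{}]
    for axis, sub in pattern.edges:
        candidates = node.children if axis == "child" else descendants(node)
        sub_matches = [b for c in candidates for b in match(sub, c)]
        # Combine bindings; an unmatched sub-pattern empties the result.
        results = [{**r, **s} for r in results for s in sub_matches]
    return results


def extract(pattern, root):
    """Evaluate the pattern at every node of the tree and collect tuples."""
    nodes = [root, *descendants(root)]
    return [b for n in nodes for b in match(pattern, n)]


# Hypothetical document: a list whose items hold a title and a nested price.
doc = Node("ul", [
    Node("li", [Node("a", text="Book"), Node("div", [Node("span", text="12")])]),
    Node("li", [Node("a", text="Pen"), Node("div", [Node("span", text="3")])]),
])

# Binary query: an li with a child a (title) and a descendant span (price).
query = Pattern("li", [
    ("child", Pattern("a", var="title")),
    ("desc", Pattern("span", var="price")),
])

tuples = extract(query, doc)
```

In this sketch the descendant edge lets the query reach the `span` without spelling out the intermediate `div`, so the same pattern would still match if the nesting depth around the price changed. Embeddings are not required to be injective here, which is a simplification chosen for brevity.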