Sub Node Extraction with Tree Based Wrappers

Authors:
Stefan Raeymaekers;Maurice Bruynooghe
Affiliations:
K.U. Leuven, Dept. of Computer Science, Celestijnenlaan 200A, 3001 Leuven, Belgium, email: stefan.raeymaekers@cs.kuleuven.be;K.U. Leuven, Dept. of Computer Science, Celestijnenlaan 200A, 3001 Leuven, Belgium, email: maurice.bruynooghe@cs.kuleuven.be
Venue:
Proceedings of the 2008 conference on ECAI 2008: 18th European Conference on Artificial Intelligence
Year:
2008

Citing 10
Cited 0

Information extraction from HTML: application of a general machine learning approach

AAAI '98/IAAI '98 Proceedings of the fifteenth national/tenth conference on Artificial intelligence/Innovative applications of artificial intelligence
Generating finite-state transducers for semi-structured data extraction from the Web

Information Systems - Special issue on semistructured data
Learning Information Extraction Rules for Semi-Structured and Free Text

Machine Learning - Special issue on natural language learning
Relational learning of pattern-match rules for information extraction

AAAI '99/IAAI '99 Proceedings of the sixteenth national conference on Artificial intelligence and the eleventh Innovative applications of artificial intelligence conference innovative applications of artificial intelligence
Hierarchical Wrapper Induction for Semistructured Information Sources

Autonomous Agents and Multi-Agent Systems
Boosted Wrapper Induction

Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence
Interactive learning of node selecting tree transducer

Machine Learning
Learning (k,l)-contextual tree languages for information extraction from web pages

Machine Learning
Information extraction from web documents based on local unranked tree automaton inference

IJCAI'03 Proceedings of the 18th international joint conference on Artificial intelligence
Learning (k,l)-contextual tree languages for information extraction

ECML'05 Proceedings of the 16th European conference on Machine Learning

Quantified Score

Hi-index	0.00

Visualization

Abstract

String based as well as tree based methods have been used to learn wrappers for extraction from semi-structured documents (e.g., HTML documents). Previous work has shown that tree based approaches perform better while needing less examples than string based approaches. A disadvantage is that they can only extract complete text nodes, whereas string based approaches can extract within text nodes. This paper proposes a hybrid approach that combines the advantages of both systems and compares it experimentally with a string based approach on some sub node extraction tasks.