Information extraction from web documents based on local unranked tree automaton inference

Authors:
Raymond Kosala;Maurice Bruynooghe;Jan Van Den Bussche;Hendrik Blockeel
Affiliations:
K.U.Leuven, Dept. of Computer Science, Celestijnenlaan, Leuven;K.U.Leuven, Dept. of Computer Science, Celestijnenlaan, Leuven;University of Limburg, Dept. WNI, Diepenbeek;K.U.Leuven, Dept. of Computer Science, Celestijnenlaan, Leuven
Venue:
IJCAI'03 Proceedings of the 18th international joint conference on Artificial intelligence
Year:
2003

Citing 14
Cited 16

Predicting Protein Secondary Structure Using Stochastic Tree Grammars

Machine Learning - Special issue on learning with probabilistic representations
A hierarchical approach to wrapper induction

Proceedings of the third annual conference on Autonomous Agents
Generating finite-state transducers for semi-structured data extraction from the Web

Information Systems - Special issue on semistructured data
Learning Information Extraction Rules for Semi-Structured and Free Text

Machine Learning - Special issue on natural language learning
Machine Learning for Information Extraction in Informal Domains

Machine Learning - Special issue on information retrieval
Wrapper induction: efficiency and expressiveness

Artificial Intelligence - Special issue on Intelligent internet systems
A flexible learning system for wrapping tables and lists in HTML documents

Proceedings of the 11th international conference on World Wide Web
Hierarchical Wrapper Induction for Semistructured Information Sources

Autonomous Agents and Multi-Agent Systems
Probabilistic k-Testable Tree Languages

ICGI '00 Proceedings of the 5th International Colloquium on Grammatical Inference: Algorithms and Applications
Towards a Declarative Query and Transformation Language for XML and Semistructured Data: Simulation Unification

ICLP '02 Proceedings of the 18th International Conference on Logic Programming
Information Extraction in Structured Documents Using Tree Automata Induction

PKDD '02 Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery
Boosted Wrapper Induction

Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence
Information Extraction with HMM Structures Learned by Stochastic Optimization

Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence
Knowledge Discovery from Semistructured Texts

Progress in Discovery Science, Final Report of the Japanese Discovery Science Project

Logic-based web information extraction

ACM SIGMOD Record
The Lixto data extraction project: back and forth between theory and practice

PODS '04 Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Title extraction from bodies of HTML documents and its application to web page retrieval

Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Web data extraction based on structural similarity

Knowledge and Information Systems
Information extraction from structured documents using k-testable tree automaton inference

Data & Knowledge Engineering
Interactive learning of node selecting tree transducer

Machine Learning
Mining key information of web pages: A method and its application

Expert Systems with Applications: An International Journal
Detecting Irrelevant Subtrees to Improve Probabilistic Learning from Tree-structured Data

Fundamenta Informaticae - Advances in Mining Graphs, Trees and Sequences
Web page title extraction and its application

Information Processing and Management: an International Journal
Learning (k,l)-contextual tree languages for information extraction from web pages

Machine Learning
Sub Node Extraction with Tree Based Wrappers

Proceedings of the 2008 conference on ECAI 2008: 18th European Conference on Artificial Intelligence
Learning (k,l)-contextual tree languages for information extraction

ECML'05 Proceedings of the 16th European conference on Machine Learning
Integrating data from the web by machine-learning tree-pattern queries

ODBASE'06/OTM'06 Proceedings of the 2006 Confederated international conference on On the Move to Meaningful Internet Systems: CoopIS, DOA, GADA, and ODBASE - Volume Part I
Learning multiplicity tree automata

ICGI'06 Proceedings of the 8th international conference on Grammatical Inference: algorithms and applications
Detecting Irrelevant Subtrees to Improve Probabilistic Learning from Tree-structured Data

Fundamenta Informaticae - Advances in Mining Graphs, Trees and Sequences
Certain and possible XPath answers

Proceedings of the 16th International Conference on Database Theory

Abstract

Information extraction (IE) aims at extracting specific information from a collection of documents. Much previous work on IE from semi-structured documents (in XML or HTML) uses learning techniques based on strings. Some recent work converts the document to a ranked tree and uses tree automaton induction. This paper introduces an algorithm that uses unranked trees to induce an automaton. Experiments show that this gives the best results obtained so far for learning-based IE from semi-structured documents.
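
The distinction between ranked and unranked trees mentioned in the abstract can be illustrated with a minimal sketch. In an unranked tree a node may have any number of children, as an HTML element does; a ranked tree fixes the number of children per label. The sketch below uses the first-child/next-sibling encoding, one common way to turn an unranked tree into a binary (ranked) one. It is an assumption that this resembles the conversion used in the earlier ranked-tree work; the names `Node`, `encode`, and `row` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Unranked tree node: a label with any number of ordered children."""
    label: str
    children: List["Node"] = field(default_factory=list)


# A binary (ranked) tree is a nested tuple (label, left, right), or None.
# left  = encoding of the node's children
# right = encoding of the node's following siblings
Binary = Optional[Tuple[str, "Binary", "Binary"]]


def encode(siblings: List[Node]) -> Binary:
    """First-child/next-sibling encoding of an ordered list of siblings."""
    if not siblings:
        return None
    head, rest = siblings[0], siblings[1:]
    return (head.label, encode(head.children), encode(rest))


# Hypothetical example: a table row with three cells (unranked: 3 children).
row = Node("tr", [Node("td"), Node("td"), Node("td")])
binary_row = encode([row])
```

In the encoded form the three `td` cells are no longer siblings under `tr` but form a chain of right children, which shows how a ranked encoding can obscure the sibling structure of the original document.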