Information extraction from structured documents using k-testable tree automaton inference

Authors:
Raymond Kosala; Hendrik Blockeel; Maurice Bruynooghe; Jan Van den Bussche
Affiliations:
Department of Computer Science, Katholieke Universiteit Leuven, Leuven, Belgium; Department of Computer Science, Katholieke Universiteit Leuven, Leuven, Belgium; Department of Computer Science, Katholieke Universiteit Leuven, Leuven, Belgium; Department WNI, Limburgs Universitair Centrum, Diepenbeek, Belgium
Venue:
Data & Knowledge Engineering
Year:
2006

Abstract

Information extraction (IE) addresses the problem of extracting specific information from a collection of documents. Much of the previous work on IE from structured documents, such as HTML or XML, uses string-based learning techniques, such as finite automata induction. These methods do not exploit the tree structure of the documents. A natural way to exploit that structure is to induce tree automata, which resemble finite-state automata but operate on trees rather than strings. In this work, we explore the induction of k-testable ranked tree automata from a small set of annotated examples. We describe three variants, which differ in the way they generalize the inferred automaton. Experimental results on a set of benchmark data sets show that our approach compares favorably to string-based approaches. However, the quality of the extraction is still suboptimal.
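To make the idea of inducing a k-testable tree language from positive examples more concrete, the following is a minimal Python sketch. It is not the authors' implementation: it follows the common textbook view in which a tree is characterized by its (k-1)-root, its k-level forks, and its small subtrees, and a new tree is accepted when all of these were observed in training. The tuple encoding of trees, the function names (`height`, `root`, `profile`, `learn`, `accepts`), and the toy `ul`/`li`/`a` documents are assumptions made for illustration. The three generalization variants mentioned in the abstract and the marking of extraction targets are not reproduced here.

```python
# Trees are nested tuples: (label, child, child, ...). Hypothetical encoding.

def height(t):
    """Number of levels in t; a single node has height 1."""
    return 1 + max((height(c) for c in t[1:]), default=0)

def root(t, k):
    """Keep only the top k levels of t (k >= 1)."""
    if k == 1:
        return (t[0],)
    return (t[0],) + tuple(root(c, k - 1) for c in t[1:])

def subtrees(t):
    """Yield t and every subtree below it."""
    yield t
    for c in t[1:]:
        yield from subtrees(c)

def profile(t, k):
    """Return the (roots, forks, small) pattern sets of a single tree."""
    assert k >= 2, "this sketch assumes k >= 2"
    roots = {root(t, k - 1)}                                     # top k-1 levels
    forks = {root(s, k) for s in subtrees(t) if height(s) >= k}  # k-level windows
    small = {s for s in subtrees(t) if height(s) < k}            # shallow subtrees
    return roots, forks, small

def learn(examples, k):
    """Collect the patterns seen in the positive examples."""
    roots, forks, small = set(), set(), set()
    for t in examples:
        r, f, s = profile(t, k)
        roots |= r
        forks |= f
        small |= s
    return roots, forks, small

def accepts(model, t, k):
    """Accept t if all of its patterns were seen during training."""
    roots, forks, small = model
    r, f, s = profile(t, k)
    return r <= roots and f <= forks and s <= small

# Two toy training documents: a flat list and a list nested once.
t1 = ("ul", ("li", ("a",)))
t2 = ("ul", ("li", ("ul", ("li", ("a",)))))
model = learn([t1, t2], k=2)

deeper = ("ul", ("li", ("ul", ("li", ("ul", ("li", ("a",)))))))
odd = ("ul", ("a",))
print(accepts(model, deeper, k=2))  # True: every local pattern was seen
print(accepts(model, odd, k=2))     # False: "ul" directly above "a" was never seen
```

In this sketch, a larger k makes the learned language more specific, because each pattern covers more levels of the tree. Since children are compared position by position, a node with a previously unseen number of children is rejected, which reflects the ranked setting named in the abstract.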