Post-Supervised Template Induction for Information Extraction from Lists and Tables in Dynamic Web Sources

Authors:
Z. Shi; E. Milios; N. Zincir-Heywood
Affiliations:
Faculty of Computer Science, Dalhousie University, Halifax, Canada B3H 1W5 (all authors)
Venue:
Journal of Intelligent Information Systems
Year:
2005

Abstract

Dynamic web sites commonly return information in the form of lists and tables. Hand-crafting an extraction program for a specific template is straightforward but time-consuming, so it is desirable to generate template extraction programs automatically from examples of lists and tables in HTML documents. Supervised approaches have been shown to achieve high accuracy, but they require manual labelling of training examples, which is also time-consuming. Fully unsupervised approaches, which extract rows and columns by detecting regularities in the data, cannot provide sufficient accuracy for practical domains. We describe a novel technique, Post-supervised Learning, which exploits unsupervised learning to avoid the need for training examples, while minimally involving the user to achieve high accuracy. We have developed unsupervised algorithms to determine the number of rows and adopted a dynamic programming algorithm for extracting columns. Our method achieves high performance while requiring minimal user input compared with fully supervised techniques.