Sources of Success for Boosted Wrapper Induction

Authors:
David Kauchak;Joseph Smarr;Charles Elkan
Venue:
The Journal of Machine Learning Research
Year:
2004

Abstract

In this paper, we examine an important recent rule-based information extraction (IE) technique named Boosted Wrapper Induction (BWI) by conducting experiments on a wider variety of tasks than previously studied, including tasks using several collections of natural text documents. We investigate systematically how each algorithmic component of BWI, in particular boosting, contributes to its success. We show that the benefit of boosting arises from the ability to reweight examples to learn specific rules (resulting in high precision) combined with the ability to continue learning rules after all positive examples have been covered (resulting in high recall). As a quantitative indicator of the regularity of an extraction task, we propose a new measure that we call the SWI ratio. We show that this measure is a good predictor of IE success and a useful tool for analyzing IE tasks. Based on these results, we analyze the strengths and limitations of BWI. Specifically, we explain limitations in the information made available, and in the representations used. We also investigate the consequences of the fact that confidence values returned during extraction are not true probabilities. Next, we investigate the benefits of including grammatical and semantic information for natural text documents, as well as parse tree and attribute-value information for XML and HTML documents. We show experimentally that incorporating even limited grammatical information can increase the regularity of natural text extraction tasks, resulting in improved performance. We conclude with proposals for enriching the representational power of BWI and other IE methods to exploit these and other types of regularities.
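The contrast the abstract draws between covering and boosting can be illustrated with a small sketch. It is not the BWI algorithm: the rule pool `RULES`, the examples `EXAMPLES`, the function names, and the `shrink` factor are all hypothetical, and the weight update is a simplified multiplicative downweighting rather than the boosting update BWI actually uses. The only point is that a covering learner stops once every positive example is covered, while a reweighting learner keeps every example in play and keeps selecting rules for a fixed number of rounds.

```python
# Toy illustration only: rules are given as the sets of positive examples they match.
EXAMPLES = ["a", "b", "c", "d", "e"]
RULES = {
    "r1": {"a", "b", "c"},
    "r2": {"d", "e"},
    "r3": {"c", "d"},
}


def covering(rules, examples):
    """Sequential covering: remove covered examples, stop when none are left."""
    uncovered = set(examples)
    chosen = []
    while uncovered:
        # Pick the rule that matches the most still-uncovered examples.
        best = max(rules, key=lambda name: len(rules[name] & uncovered))
        if not rules[best] & uncovered:
            break  # no remaining rule covers anything new
        chosen.append(best)
        uncovered -= rules[best]
    return chosen


def boosting_style(rules, examples, rounds, shrink=0.5):
    """Reweighting: covered examples are downweighted, never removed."""
    weights = {e: 1.0 / len(examples) for e in examples}
    chosen = []
    for _ in range(rounds):
        # Pick the rule whose matched examples carry the most weight.
        best = max(rules, key=lambda name: sum(weights[e] for e in rules[name]))
        chosen.append(best)
        # Simplified update (an assumption): shrink the weight of matched examples.
        for e in rules[best]:
            weights[e] *= shrink
        # Renormalize so the weights remain a distribution.
        total = sum(weights.values())
        weights = {e: w / total for e, w in weights.items()}
    return chosen


print(covering(RULES, EXAMPLES))            # stops after two rules
print(boosting_style(RULES, EXAMPLES, 4))   # runs for all four rounds
```

With a fixed rule pool the extra rounds can only reselect existing rules; in the setting the abstract describes, each round would instead learn a further rule under the shifted weights, which is where the recall benefit is said to come from.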