Using the Web to Reduce Data Sparseness in Pattern-Based Information Extraction

Authors:
Sebastian Blohm; Philipp Cimiano
Affiliations:
Institute AIFB, University of Karlsruhe, Germany (both authors)
Venue:
PKDD 2007 Proceedings of the 11th European conference on Principles and Practice of Knowledge Discovery in Databases
Year:
2007

Abstract

Textual patterns have been used effectively to extract information from large text collections. However, they rely heavily on textual redundancy, in the sense that facts have to be mentioned in a similar manner in order to be generalized to a textual pattern. Data sparseness thus becomes a problem when trying to extract information from sources with little redundancy, such as corporate intranets, encyclopedic works, or scientific databases. We present results on applying a weakly supervised pattern induction algorithm to Wikipedia to extract instances of arbitrary relations. In particular, we apply different configurations of a basic algorithm for pattern induction to seven different datasets. We show that the lack of redundancy creates the need for a large amount of training data, but that integrating Web extraction into the process significantly reduces the amount of training data required while maintaining the accuracy of Wikipedia. In particular, we show that, although the use of the Web can have effects similar to those produced by increasing the number of seeds, it leads overall to better results. Our approach thus makes it possible to combine the advantages of two sources: the high reliability of a closed corpus and the high redundancy of the Web.
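
The idea of combining a sparse closed corpus with a redundant one can be illustrated with a toy sketch. This is not the authors' algorithm: the infix-only pattern representation, the capitalized-word entity matcher, the `min_support` threshold, the function names, and all example sentences are assumptions made for illustration. The sketch induces patterns from sentences that mention a seed pair, keeps only patterns seen often enough, and then applies them to the closed corpus alone.

```python
import re
from collections import Counter

# Assumption: an entity is a single capitalized word (a deliberately crude stand-in).
ENTITY = r"([A-Z][a-z]+)"

def induce_patterns(sentences, seeds, min_support):
    """Collect the text between seed arguments and keep frequent infixes."""
    counts = Counter()
    for sentence in sentences:
        for x, y in seeds:
            match = re.search(re.escape(x) + r"(.+?)" + re.escape(y), sentence)
            if match:
                counts[match.group(1)] += 1
    return {infix for infix, n in counts.items() if n >= min_support}

def apply_patterns(sentences, patterns):
    """Extract (x, y) pairs from sentences that match any induced infix."""
    found = set()
    for infix in patterns:
        regex = re.compile(ENTITY + re.escape(infix) + ENTITY)
        for sentence in sentences:
            found.update(regex.findall(sentence))
    return found

# Hypothetical data: the closed corpus states each fact once, in varying wording.
closed = [
    "Paris is the capital of France.",
    "Berlin, the capital city of Germany, lies on the Spree.",
    "Madrid is the capital of Spain.",
    "Rome is the capital of Italy.",
]
# Hypothetical redundant corpus standing in for the Web.
web = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
]
seeds = {("Paris", "France"), ("Berlin", "Germany")}

# With two seeds, the closed corpus alone gives no pattern enough support.
closed_only = induce_patterns(closed, seeds, min_support=2)

# Adding the redundant corpus lets one pattern reach the threshold.
combined = induce_patterns(closed + web, seeds, min_support=2)

# Patterns are applied to the closed corpus only, to keep its reliability.
extracted = apply_patterns(closed, combined)
print(sorted(extracted))
```

In this toy setting the same two seeds yield no usable pattern on the closed corpus alone, while the extra redundant mentions are enough to induce one that then matches facts outside the seed set. A realistic system would need proper entity recognition, pattern generalization, and confidence scoring, none of which is modeled here.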