Deriving concept hierarchies from text. Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. ICML '01: Proceedings of the Eighteenth International Conference on Machine Learning.
KnowItNow: fast, scalable information extraction from the web. HLT '05: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing.
International Journal on Document Analysis and Recognition.
Automatic Taxonomy Extraction Using Google and Term Dependency. WI '07: Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence.
AAAI'06: Proceedings of the 21st National Conference on Artificial Intelligence - Volume 2.
Unsupervised information extraction approach using graph mutual reinforcement. EMNLP '06: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing.
Learning concept hierarchies from text corpora using formal concept analysis. Journal of Artificial Intelligence Research.
Creating relational data from unstructured and ungrammatical data sources. Journal of Artificial Intelligence Research.
Adaptive information extraction from text by rule induction and generalisation. IJCAI'01: Proceedings of the 17th International Joint Conference on Artificial Intelligence - Volume 2.
Principal components for automatic term hierarchy building. SPIRE '06: Proceedings of the 13th International Conference on String Processing and Information Retrieval.
Data-driven computational linguistics at FaMAF-UNC, Argentina. YIWCALA '10: Proceedings of the NAACL HLT 2010 Young Investigators Workshop on Computational Approaches to Languages of the Americas.
Constructing reference sets from unstructured, ungrammatical text. Journal of Artificial Intelligence Research.
Materializing multi-relational databases from the web using taxonomic queries. Proceedings of the Fourth ACM International Conference on Web Search and Data Mining.
Building a lightweight semantic model for unsupervised information extraction on short listings. EMNLP-CoNLL '12: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.
Previous work on information extraction from unstructured, ungrammatical text (e.g., classified ads) showed that exploiting a set of background knowledge, called a "reference set," greatly improves the precision and recall of the extractions. However, finding a source for this reference set is often difficult, if not impossible, and even when a source is found, it may not overlap well with the text for extraction. In this paper we present an approach to building the reference set directly from the text itself. Our approach eliminates the need to find an external source for the reference set and ensures better overlap between the text and the reference set. Starting with a small amount of background knowledge, our technique constructs tuples representing the entities in the text, and these tuples form the reference set. Our results show that our method outperforms manually constructed reference sets, since hand-built reference sets may not overlap with the entities in the unstructured, ungrammatical text. We also ran experiments comparing our method to the supervised approach of Conditional Random Fields (CRFs) using simple, generic features. Our method achieves an improvement in F1-measure on 6 of 9 attributes and is competitive on the rest, all without training data.
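The core idea in the abstract — seed background knowledge is used to construct entity tuples from the listings themselves, and those tuples then serve as the reference set for extraction — can be sketched as follows. This is a minimal illustration, not the paper's actual algorithm: the listing data, the seed make names, and the "model follows make" adjacency heuristic are all hypothetical simplifications.

```python
# Hedged sketch: seed knowledge -> reference-set tuples -> extraction.
# Data, seeds, and the adjacency heuristic are illustrative assumptions.

def tokenize(listing):
    """Lowercase the listing and strip trailing punctuation from tokens."""
    return [t.strip(",:.;") for t in listing.lower().split()]

# Seed background knowledge: a handful of known attribute values.
SEED_MAKES = {"honda", "ford", "toyota"}

# Unstructured, ungrammatical listings (e.g., classified car ads).
listings = [
    "02 honda civic ex low miles",
    "ford focus 2005 se clean",
    "toyota camry le 2003",
]

def build_reference_set(listings, seed_makes):
    """Construct (make, model) tuples from the text itself, assuming the
    token right after a known make names its model."""
    tuples = set()
    for listing in listings:
        tokens = tokenize(listing)
        for i, tok in enumerate(tokens):
            if tok in seed_makes and i + 1 < len(tokens):
                tuples.add((tok, tokens[i + 1]))
    return tuples

def annotate(listing, reference_set):
    """Label a new listing's attributes by matching the reference set."""
    tokens = tokenize(listing)
    labels = {}
    for make, model in reference_set:
        if make in tokens:
            labels["make"] = make
        if model in tokens:
            labels["model"] = model
    return labels

ref = build_reference_set(listings, SEED_MAKES)
print(annotate("must sell: honda civic, runs great", ref))
# → {'make': 'honda', 'model': 'civic'}
```

Because the reference set is built from the same corpus it annotates, overlap with the listings' vocabulary is guaranteed by construction — the property the abstract argues hand-built reference sets can lack.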