Learning field compatibilities to extract database records from unstructured text

Authors:
Michael Wick; Aron Culotta; Andrew McCallum
Affiliations:
University of Massachusetts, Amherst, MA; University of Massachusetts, Amherst, MA; University of Massachusetts, Amherst, MA
Venue:
EMNLP '06 Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing
Year:
2006

Abstract

Named-entity recognition systems extract entities such as people, organizations, and locations from unstructured text. Rather than extract these mentions in isolation, this paper presents a record extraction system that assembles mentions into records (i.e., database tuples). We construct a probabilistic model of the compatibility between field values, then employ graph partitioning algorithms to cluster fields into cohesive records. We also investigate compatibility functions over sets of fields, rather than simply pairs of fields, to examine how greater representational power can impact performance. We apply our techniques to the task of extracting contact records from faculty and student homepages, demonstrating a 53% error reduction over baseline approaches.
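
The pipeline described in the abstract (score the compatibility of field values, then partition the resulting graph into records) can be illustrated with a minimal sketch. It is not the authors' model: the hand-set `pair_score` function stands in for a learned probabilistic compatibility model, greedy agglomerative merging stands in for whichever graph partitioning algorithm the paper uses, and the mentions, field names, and offsets are all hypothetical.

```python
from itertools import combinations

# Hypothetical extracted mentions: (field_type, value, token_offset).
MENTIONS = [
    ("name", "Jane Doe", 0),
    ("email", "jdoe@example.edu", 6),
    ("phone", "555-0100", 10),
    ("name", "John Roe", 60),
    ("email", "jroe@example.edu", 66),
]


def pair_score(a, b):
    """Hand-set stand-in for a learned compatibility score.

    Positive values suggest the two fields belong to the same record.
    """
    # Assumption: fields that appear close together are more compatible.
    score = 1.0 - abs(a[2] - b[2]) / 20.0
    # Assumption: a contact record holds at most one name.
    if a[0] == b[0] == "name":
        score -= 2.0
    return score


def cluster_score(c1, c2):
    """Average pairwise compatibility between two clusters of fields."""
    pairs = [(a, b) for a in c1 for b in c2]
    return sum(pair_score(a, b) for a, b in pairs) / len(pairs)


def partition(mentions):
    """Greedily merge the most compatible clusters until none score above zero."""
    clusters = [[m] for m in mentions]
    while len(clusters) > 1:
        (i, j), best = max(
            (
                ((i, j), cluster_score(clusters[i], clusters[j]))
                for i, j in combinations(range(len(clusters)), 2)
            ),
            key=lambda item: item[1],
        )
        if best <= 0:
            break  # every remaining merge would join incompatible fields
        clusters[i] += clusters[j]
        del clusters[j]  # safe because combinations guarantees i < j
    return clusters


if __name__ == "__main__":
    for record in partition(MENTIONS):
        print({field: value for field, value, _ in record})
```

Each resulting cluster is read as one candidate record. Scoring whole sets of fields rather than pairs, as the abstract mentions, would amount to replacing the pairwise average in `cluster_score` with a function that looks at the merged cluster as a unit.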