Models and indices for integrating unstructured data with a relational database

Authors:
Sunita Sarawagi
Affiliations:
IIT Bombay
Venue:
KDID'04 Proceedings of the Third international conference on Knowledge Discovery in Inductive Databases
Year:
2004

Citing 6
Cited 0

Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data

ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
Robust and efficient fuzzy match for online data cleaning

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Efficient set joins on similarity predicates

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Exploiting dictionaries in named entity extraction: combining semi-Markov extraction processes and data integration methods

Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Shallow parsing with conditional random fields

NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1
Top-k query evaluation with probabilistic guarantees

VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30

Quantified Score

Hi-index	0.00

Visualization

Abstract

Database systems are islands of structure in a sea of unstructured data sources. Several real-world applications now need to create bridges for smooth integration of semi-structured sources with existing structured databases for seamless querying. This integration requires extracting structured column values from the unstructured source and mapping them to known database entities. Existing methods of data integration do not effectively exploit the wealth of information available in multi-relational entities. We present statistical models for co-reference resolution and information extraction in a database setting. We then go over the performance challenges of training and applying these models efficiently over very large databases. This requires us to break open a black box statistical model and extract predicates over indexable attributes of the database. We show how to extract such predicates for several classification models, including naive Bayes classifiers and support vector machines. We extend these indexing methods for supporting similarity predicates needed during data integration.