Exploiting evidence from unstructured data to enhance master data management

Authors:
Karin Murthy;Prasad M. Deshpande;Atreyee Dey;Ramanujam Halasipuram;Mukesh Mohania;P. Deepak;Jennifer Reed;Scott Schumacher
Affiliations:
IBM Research - India;IBM Research - India;IBM Research - India;IBM Research - India;IBM Research - India;IBM Research - India;IBM Software Group;IBM Software Group
Venue:
Proceedings of the VLDB Endowment
Year:
2012

Abstract

Master data management (MDM) integrates data from multiple structured data sources and builds a consolidated 360-degree view of business entities such as customers and products. Today's MDM systems are not prepared to integrate information from unstructured data sources, such as news reports, emails, call-center transcripts, and chat logs. However, those unstructured data sources may contain valuable information about the same entities known to MDM from the structured data sources. Integrating information from unstructured data into MDM is challenging because textual references to existing MDM entities are often incomplete and imprecise, and because the additional entity information extracted from text must not compromise the trustworthiness of MDM data. In this paper, we present an architecture for making MDM text-aware and showcase its implementation as IBM InfoSphere MDM Extension for Unstructured Text Correlation, an add-on to IBM InfoSphere Master Data Management Standard Edition. We highlight how MDM benefits from additional evidence found in documents when doing entity resolution and relationship discovery. We experimentally demonstrate the feasibility of integrating information from unstructured data sources into MDM.