Database enrichment environment to identify duplicate tuples

Authors:
Juliano Augusto Carreira;Carlos Roberto Valêncio;Rogéria C. Gratão de Souza
Affiliations:
DCCE, IBILCE, UNESP, São José do Rio Preto, SP;DCCE, IBILCE, UNESP, São José do Rio Preto, SP;DCCE, IBILCE, UNESP, São José do Rio Preto, SP
Venue:
FDIA'11 Proceedings of the Fourth BCS-IRSG conference on Future Directions in Information Access
Year:
2011

Abstract

One of the significant problems inherent to current large databases is the incidence of duplicate tuples. This problem refers to the repetition of records that, in most cases, are represented differently in the database but refer to the same real-world entity, which makes identifying those tuples a hard task. Considering that each language has its peculiarities, it is believed that text operation techniques from the area of Information Retrieval can enrich the content of the records for a specific language and thus maximize the number of identified duplicate tuples and/or improve the confidence level of their classification relative to current tools. The main contribution of this paper is a language-independent environment able to approximate the spelling of the records in a database and thus identify duplicate tuples more efficiently than the isolated application of traditional methods. Beyond improving database quality, this tool can also improve the process of Knowledge Discovery in Databases (KDD).
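The general idea of enriching records with text operations before comparing them can be illustrated with a minimal sketch. It is not the authors' environment: the function names (`enrich`, `similarity`, `is_duplicate`), the small stopword set, the 0.9 threshold, and the use of Python's `difflib.SequenceMatcher` as the comparison method are all assumptions chosen for illustration.

```python
import unicodedata
from difflib import SequenceMatcher

# Hypothetical, language-specific stopword list (Portuguese particles),
# used here only to illustrate the idea of a pluggable text operation.
STOPWORDS = {"de", "da", "do", "dos", "das", "e"}


def enrich(text: str) -> str:
    """Approximate the spelling of a field value.

    Lowercases, strips accents and punctuation, and drops stopwords.
    """
    decomposed = unicodedata.normalize("NFKD", text.lower())
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() else " " for ch in no_accents)
    tokens = [tok for tok in cleaned.split() if tok not in STOPWORDS]
    return " ".join(tokens)


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (a stand-in for any traditional method)."""
    return SequenceMatcher(None, a, b).ratio()


def is_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """Classify two values as duplicates after enrichment (threshold is an assumption)."""
    return similarity(enrich(a), enrich(b)) >= threshold


if __name__ == "__main__":
    left = "São José do Rio Preto"
    right = "SAO JOSE DO RIO PRETO."
    print("raw similarity:     ", round(similarity(left, right), 2))
    print("enriched similarity:", round(similarity(enrich(left), enrich(right)), 2))
    print("duplicate?          ", is_duplicate(left, right))
```

In this sketch the two spellings of the same city differ in case, accents, and punctuation, so a direct comparison scores them low, while the enriched forms compare as identical. Swapping the stopword set or adding further text operations would be the place to adapt such a sketch to another language.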