Comparing String Similarity Measures for Reducing Inconsistency in Integrating Data from Different Sources

Authors:
Sergio Luján-Mora; Manuel Palomar
Venue:
WAIM '01 Proceedings of the Second International Conference on Advances in Web-Age Information Management
Year:
2001

Abstract

The Web has dramatically increased the need for efficient and flexible mechanisms that provide integrated views over multiple heterogeneous information sources. When multiple sources are integrated, each source may represent data differently. A common problem is inconsistency in the data: the same term may appear with different values because of misspellings, permuted word order, spelling variants, and so on. In this paper, we present an improvement on our previous work on reducing the inconsistency found in existing databases. The objective of our method is to integrate and standardize the different values that refer to the same term. All values that refer to the same term are clustered by measuring their degree of similarity. Each cluster can then be assigned a common value that is substituted for the original values. The paper describes and compares five similarity measures for clustering and evaluates their performance on real-world data. The method we present may work well in practice, but it is time-consuming.
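
The clustering idea described in the abstract can be illustrated with a minimal sketch. The abstract does not name the five similarity measures, so the example below assumes a normalized edit (Levenshtein) distance as a stand-in, with words sorted beforehand so that a permuted word order does not count as a difference. The function names, the greedy clustering strategy, and the 0.8 threshold are all hypothetical choices made for illustration, not the authors' method.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalize(value: str) -> str:
    """Lowercase and sort the words so a permuted word order compares equal."""
    return " ".join(sorted(value.lower().split()))


def similarity(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1]; 1.0 means identical after normalization."""
    a, b = normalize(a), normalize(b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def cluster(values: List[str], threshold: float = 0.8) -> List[List[str]]:
    """Greedy clustering (an assumption): a value joins the first cluster
    whose first member is at least `threshold` similar to it."""
    clusters: List[List[str]] = []
    for value in values:
        for group in clusters:
            if similarity(value, group[0]) >= threshold:
                group.append(value)
                break
        else:
            clusters.append([value])
    return clusters


def canonical(group: List[str]) -> str:
    """Pick the most frequent spelling; ties go to the earliest value."""
    return max(group, key=group.count)


values = [
    "University of Alicante",
    "Alicante University of",   # permuted word order
    "Universty of Alicante",    # misspelling
    "Oxford University",
]
clusters = cluster(values)
```

Each value is compared against every existing cluster, and each comparison costs time proportional to the product of the two string lengths, which gives a rough sense of why an approach like this can be time-consuming on large databases.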