The Web has dramatically increased the need for efficient and flexible mechanisms that provide integrated views over multiple heterogeneous information sources. When multiple sources are integrated, each source may represent data differently. A common problem is inconsistency in the data: the very same term may appear with different values because of misspellings, permuted word order, spelling variants and so on. In this paper, we present an improvement on our previous work for reducing the inconsistency found in existing databases. The objective of our method is to integrate and standardize the different values that refer to the same term. All the values that refer to the same term are clustered by measuring their degree of similarity. Each cluster can then be assigned a common value that is substituted for the original values. The paper describes and compares five different similarity measures for clustering and evaluates their performance on real-world data. The method we present may work well in practice, but it is time-consuming.
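To make the clustering idea concrete, here is a minimal sketch in Python. It is not the method evaluated in the paper: the abstract does not name the five similarity measures, so this sketch assumes a single one (normalized edit distance over lowercased, word-sorted strings) and a simple greedy clustering. The function names, the `threshold` value of 0.8, and the sample data are all hypothetical choices made for illustration.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance by dynamic programming, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalize(value: str) -> str:
    """Lowercase and sort the words so a permuted word order compares equal."""
    return " ".join(sorted(value.lower().split()))


def similarity(a: str, b: str) -> float:
    """Assumed measure: 1 - edit distance / length of the longer string."""
    na, nb = normalize(a), normalize(b)
    longest = max(len(na), len(nb))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(na, nb) / longest


def cluster_values(values, threshold=0.8):
    """Greedy clustering: join the first cluster whose first member is similar enough."""
    clusters = []
    for value in values:
        for cluster in clusters:
            if similarity(value, cluster[0]) >= threshold:
                cluster.append(value)
                break
        else:
            clusters.append([value])
    return clusters


def canonical(cluster):
    """Pick the most frequent value as the common value (ties: first seen)."""
    return Counter(cluster).most_common(1)[0][0]


# Hypothetical sample data: a misspelling, a permuted word order, an unrelated value.
values = ["University of Chile", "Universty of Chile",
          "Chile University of", "Stanford University"]
clusters = cluster_values(values)
standardized = {v: canonical(c) for c in clusters for v in c}
```

In this sketch every new value is compared against the first member of each existing cluster, and each comparison costs time proportional to the product of the two string lengths. That cost grows quickly with the number of distinct values, which is consistent with the abstract's remark that the approach is time-consuming.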