A major bottleneck in data matching is the rapid deterioration of both running time and accuracy as the amount of data to be processed increases. One reason for this deterioration is the cost that data matching systems incur when comparing data records to determine their similarity (or dissimilarity). Approaches such as blocking and concatenation of data attributes have been used to reduce the comparison cost. In this paper, we present and analyse Keyword and Digram clustering as alternatives for enhancing the performance of data matching systems. We compare these clustering techniques in terms of the potential savings in the number of comparisons performed and their accuracy in correctly clustering similar data. Our results on a sample of a database of London Stock Exchange listed companies show that the clustering techniques can improve accuracy as well as save time in data matching systems.
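To make the comparison-saving idea concrete, the following is a minimal Python sketch of digram-based candidate selection. It is not the algorithm evaluated in the paper, whose details are not given in this abstract. The function names `digrams` and `candidate_pairs`, the `min_shared` threshold, and the sample company names are all assumptions introduced for illustration, and the names are not drawn from the paper's data. The sketch indexes each record by its adjacent character pairs and keeps only record pairs that share enough digrams, so that a full similarity comparison would be needed for those pairs alone.

```python
from collections import defaultdict
from itertools import combinations


def digrams(name):
    """Return the set of adjacent character pairs in a normalised name."""
    # Assumption: lower-casing and dropping non-alphanumerics is the normalisation.
    text = "".join(ch for ch in name.lower() if ch.isalnum())
    return {text[i:i + 2] for i in range(len(text) - 1)}


def candidate_pairs(records, min_shared=3):
    """Return index pairs of records sharing at least `min_shared` digrams."""
    index = defaultdict(set)  # digram -> indices of records containing it
    for i, rec in enumerate(records):
        for dg in digrams(rec):
            index[dg].add(i)

    shared = defaultdict(int)  # (i, j) -> number of digrams in common
    for ids in index.values():
        for pair in combinations(sorted(ids), 2):
            shared[pair] += 1
    return {pair for pair, count in shared.items() if count >= min_shared}


# Hypothetical sample records, for illustration only.
records = [
    "Barclays PLC",
    "Barclays Bank",
    "Vodafone Group",
    "Vodaphone Grp",
    "Tesco",
]
pairs = candidate_pairs(records)
total = len(records) * (len(records) - 1) // 2  # all-pairs comparison count
print(f"{len(pairs)} candidate pairs instead of {total}")
```

On this toy input only the two Barclays records and the two Vodafone spellings survive as candidate pairs, against ten pairs for an exhaustive comparison. The threshold trades savings against the risk of missing true matches, which is the accuracy question the abstract raises.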