Clustering with Proximity Graphs: Exact and Efficient Algorithms

Authors:
Michail Kazimianec; Nikolaus Augsten
Affiliations:
Faculty of Economics, Vilnius University, Vilnius, Lithuania; Faculty of Computer Science, Free University of Bozen-Bolzano, Bozen-Bolzano, Italy
Venue:
International Journal of Knowledge-Based Organizations
Year:
2013

Abstract

Graph Proximity Cleansing (GPC) is a string clustering algorithm that automatically detects cluster borders and has been successfully used for string cleansing. For each potential cluster, a so-called proximity graph is computed, and the cluster border is detected based on this graph. However, computing the proximity graph is expensive, and the state-of-the-art GPC algorithms only approximate it using a sampling technique. Further, the quality of GPC clusters has never been compared to that of standard clustering techniques such as k-means, density-based, or hierarchical clustering. In this article, the authors propose two efficient algorithms, PG-DS and PG-SM, for the exact computation of proximity graphs. The authors show experimentally that their solutions are faster even when the sampling-based algorithms use very small sample sizes. The authors also provide a thorough experimental evaluation of GPC and conclude that it is very efficient and shows good clustering quality in comparison to the standard techniques. These results open a new perspective on string clustering in settings where no knowledge about the input data is available.