Graph Proximity Cleansing (GPC) is a string clustering algorithm that automatically detects cluster borders and has been successfully used for string cleansing. For each potential cluster, a so-called proximity graph is computed, and the cluster border is detected based on this graph. However, computing the proximity graph is expensive, and state-of-the-art GPC algorithms only approximate it using a sampling technique. Further, the quality of GPC clusters has never been compared to that of standard clustering techniques such as k-means, density-based, or hierarchical clustering. In this article, the authors propose two efficient algorithms, PG-DS and PG-SM, for the exact computation of proximity graphs. The authors show experimentally that these solutions are faster than the sampling-based algorithms even when the latter use very small sample sizes. The authors also provide a thorough experimental evaluation of GPC and conclude that it is very efficient and shows good clustering quality in comparison to the standard techniques. These results open a new perspective on string clustering in settings where no knowledge about the input data is available.
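The abstract does not define the proximity graph or the border rule, so the following is only a rough illustration of the general idea, not the authors' PG-DS or PG-SM algorithms. It assumes that similarity is measured by the number of shared padded 2-grams, that the proximity graph records how many strings lie within each similarity threshold of a cluster center, and that the border is placed at the longest plateau of that curve. The function names `qgrams`, `overlap`, `proximity_curve`, and `plateau_border` are hypothetical, as is the sample data.

```python
from collections import Counter


def qgrams(s, q=2):
    """Return the multiset of q-grams of s, padded with '#' on both sides."""
    padded = "#" * (q - 1) + s + "#" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def overlap(a, b, q=2):
    """Number of q-grams shared by a and b (size of the multiset intersection)."""
    return sum((qgrams(a, q) & qgrams(b, q)).values())


def proximity_curve(center, strings, q=2):
    """Assumed proximity graph: for each threshold t, from strictest to loosest,
    count the strings that share at least t q-grams with the center."""
    max_overlap = sum(qgrams(center, q).values())
    overlaps = [overlap(center, s, q) for s in strings]
    return [(t, sum(o >= t for o in overlaps)) for t in range(max_overlap, 0, -1)]


def plateau_border(curve):
    """Hypothetical border rule: return the loosest threshold of the longest
    run of thresholds over which the neighborhood size does not change."""
    best_t, best_len = curve[0][0], 0
    i = 0
    while i < len(curve):
        j = i
        while j + 1 < len(curve) and curve[j + 1][1] == curve[i][1]:
            j += 1
        if j - i + 1 > best_len:
            best_t, best_len = curve[j][0], j - i + 1
        i = j + 1
    return best_t


# Made-up sample data: variants of one name plus two unrelated strings.
strings = ["vivian", "vivien", "viviane", "vivan", "adam", "adan"]
curve = proximity_curve("vivian", strings)
border = plateau_border(curve)
cluster = [s for s in strings if overlap("vivian", s) >= border]
```

An exact curve of this kind compares the center with every string, which hints at why exact computation is costly on large inputs and why earlier GPC algorithms resorted to sampling.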