Parallel algorithms for hierarchical clustering
Parallel Computing
The merge/purge problem for large databases
SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Algorithms on strings, trees, and sequences: computer science and computational biology
Algorithms on strings, trees, and sequences: computer science and computational biology
A scalable comparison-shopping agent for the World-Wide Web
AGENTS '97 Proceedings of the first international conference on Autonomous agents
Proceedings of the third annual conference on Autonomous Agents
Foundations of statistical natural language processing
Foundations of statistical natural language processing
ACM Computing Surveys (CSUR)
Large Margin Classification Using the Perceptron Algorithm
Machine Learning - The Eleventh Annual Conference on computational Learning Theory
Efficient clustering of high-dimensional data sets with application to reference matching
Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
Hardening soft information sources
Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
Modern Information Retrieval
Interactive deduplication using active learning
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Learning domain-independent string transformation weights for high accuracy object identification
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Learning to match and cluster large high-dimensional data sets for data integration
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
A Bayesian decision model for cost optimal record matching
The VLDB Journal — The International Journal on Very Large Data Bases
Robust and efficient fuzzy match for online data cleaning
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Adaptive duplicate detection using learnable string similarity measures
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Learning the Kernel Matrix with Semidefinite Programming
The Journal of Machine Learning Research
Iterative record linkage for cleaning and integration
Proceedings of the 9th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery
Online and batch learning of pseudo-metrics
ICML '04 Proceedings of the twenty-first international conference on Machine learning
EMNLP '02 Proceedings of the ACL-02 conference on Empirical methods in natural language processing - Volume 10
Object identification with attribute-mediated dependences
PKDD'05 Proceedings of the 9th European conference on Principles and Practice of Knowledge Discovery in Databases
Towards domain-independent information extraction from web tables
Proceedings of the 16th international conference on World Wide Web
A two-step classification approach to unsupervised record linkage
AusDM '07 Proceedings of the sixth Australasian conference on Data mining and analytics - Volume 70
Efficient top-k count queries over imprecise duplicates
Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology
Query result clustering for object-level search
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Semantic blocking for Record Linkage
Proceedings of the 2007 conference on Artificial Intelligence Research and Development
Table extraction using spatial reasoning on the CSS2 visual box model
AAAI'06 proceedings of the 21st national conference on Artificial intelligence - Volume 2
An unsupervised approach for product record normalization across different web sites
AAAI'08 Proceedings of the 23rd national conference on Artificial intelligence - Volume 2
Frameworks for entity matching: A comparison
Data & Knowledge Engineering
Learning similarity metrics for event identification in social media
Proceedings of the third ACM international conference on Web search and data mining
Scaling record linkage to non-uniform distributed class sizes
PAKDD'08 Proceedings of the 12th Pacific-Asia conference on Advances in knowledge discovery and data mining
On active learning of record matching packages
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Matching unstructured product offers to structured product specifications
Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Similarity function recommender service using incremental user knowledge acquisition
ICSOC'11 Proceedings of the 9th international conference on Service-Oriented Computing
Aggregate queries on probabilistic record linkages
Proceedings of the 15th International Conference on Extending Database Technology
Matching product titles using web-based enrichment
Proceedings of the 21st ACM international conference on Information and knowledge management
Hybrid entity clustering using crowds and data
The VLDB Journal — The International Journal on Very Large Data Bases
Hi-index | 0.00 |
The problem of record linkage focuses on determining whether two object descriptions refer to the same underlying entity. Addressing this problem effectively has many practical applications, e.g., elimination of duplicate records in databases and citation matching for scholarly articles. In this paper, we consider a new domain where the record linkage problem is manifested: Internet comparison shopping. We address the resulting linkage setting that requires learning a similarity function between record pairs from streaming data. The learned similarity function is subsequently used in clustering to determine which records are co-referent and should be linked. We present an online machine learning method for addressing this problem, where a composite similarity function based on a linear combination of basis functions is learned incrementally. We illustrate the efficacy of this approach on several real-world datasets from an Internet comparison shopping site, and show that our method is able to effectively learn various distance functions for product data with differing characteristics. We also provide experimental results that show the importance of considering multiple performance measures in record linkage evaluation.