Learning to match and cluster large high-dimensional data sets for data integration

Authors:
William W. Cohen; Jacob Richman
Affiliations:
WhizBang Labs, Pittsburgh, PA; WhizBang Labs, Pittsburgh, PA
Venue:
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2002

Abstract

Part of the process of data integration is determining which sets of identifiers refer to the same real-world entities. In integrating databases found on the Web or obtained by using information extraction methods, it is often possible to solve this problem by exploiting similarities in the textual names used for objects in different databases. In this paper we describe techniques for clustering and matching identifier names that are both scalable and adaptive, in the sense that they can be trained to obtain better performance in a particular domain. An experimental evaluation on a number of sample datasets shows that the adaptive method sometimes performs much better than either of two non-adaptive baseline systems, and is nearly always competitive with the best baseline system.
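As a rough illustration of matching identifier names by textual similarity, the sketch below groups names whose word-token overlap is high. It is not the method described in the paper: the Jaccard measure, the fixed threshold of 0.5, the single-link grouping, and the function names `tokens`, `jaccard`, and `cluster_names` are all assumptions chosen for brevity. It is also non-adaptive (nothing is learned from training data) and compares all pairs, so it would not scale to large data sets.

```python
from itertools import combinations


def tokens(name):
    """Lowercase a name and split it into a set of word tokens."""
    return set(name.lower().replace(",", " ").replace(".", " ").split())


def jaccard(a, b):
    """Jaccard similarity between the token sets of two names."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def cluster_names(names, threshold=0.5):
    """Group names linked by pairwise similarity >= threshold (single-link)."""
    # Union-find forest: parent[i] points toward the representative of i.
    parent = list(range(len(names)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Hypothetical fixed threshold; an adaptive system would learn this decision.
    for i, j in combinations(range(len(names)), 2):
        if jaccard(names[i], names[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i, name in enumerate(names):
        clusters.setdefault(find(i), []).append(name)
    return list(clusters.values())


if __name__ == "__main__":
    sample = [
        "WhizBang Labs",
        "WhizBang Labs Inc.",
        "Carnegie Mellon University",
        "Carnegie Mellon Univ.",
    ]
    for group in cluster_names(sample):
        print(group)
```

With the sample names, the two "WhizBang" variants fall into one group and the two "Carnegie Mellon" variants into another. A trained pairwise classifier could stand in for the fixed threshold if an adaptive variant is wanted.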