Many commercial organizations routinely gather large numbers of databases for various marketing and business analysis functions. The task is to correlate information from different databases by identifying distinct individuals who appear in several databases, typically in an inconsistent and often incorrect fashion. The problem we study here is the task of merging data from multiple sources in as efficient a manner as possible while maximizing the accuracy of the result. We call this the merge/purge problem. In this paper we detail the sorted neighborhood method that some use to solve merge/purge and present experimental results demonstrating that this approach may work well in practice, but at great expense. An alternative method based upon clustering is also presented, with a comparative evaluation against the sorted neighborhood method. We show a means of improving the accuracy of the results with a multi-pass approach that computes the transitive closure over the results of independent runs, considering alternative primary key attributes in each pass.
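The multi-pass idea can be illustrated with a minimal Python sketch. It is not the paper's implementation: the field names, the toy matching rule `is_match` (at least two of three fields equal), the window size, and every function name below are assumptions chosen only to make the example small. Each pass sorts the records on one key, compares records that fall within a sliding window, and collects matching pairs; a union-find structure then takes the transitive closure of the pairs found across all passes.

```python
from typing import Callable, Dict, List, Set, Tuple

Record = Dict[str, str]
Pair = Tuple[int, int]


def sorted_neighborhood_pass(
    records: List[Record],
    key: Callable[[Record], str],
    is_match: Callable[[Record, Record], bool],
    window: int,
) -> Set[Pair]:
    """One pass: sort by key, then compare each record with the
    previous (window - 1) records in sorted order."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs: Set[Pair] = set()
    for pos, i in enumerate(order):
        for j in order[max(0, pos - window + 1):pos]:
            if is_match(records[i], records[j]):
                pairs.add((min(i, j), max(i, j)))
    return pairs


def transitive_closure(n: int, pairs: Set[Pair]) -> List[List[int]]:
    """Group record indices into equivalence classes with union-find."""
    parent = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    groups: Dict[int, List[int]] = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def multi_pass_merge_purge(records, keys, is_match, window=2):
    """Run one independent pass per key, then close over all matched pairs."""
    pairs: Set[Pair] = set()
    for key in keys:
        pairs |= sorted_neighborhood_pass(records, key, is_match, window)
    return transitive_closure(len(records), pairs)


# Hypothetical toy data: records 0, 1 and 2 describe the same person.
records = [
    {"last": "Smith", "first": "John", "zip": "10027"},
    {"last": "Xmith", "first": "John", "zip": "10027"},  # typo in last name
    {"last": "Smith", "first": "Jon", "zip": "10027"},   # variant first name
    {"last": "Jones", "first": "Mary", "zip": "94305"},
    {"last": "Taylor", "first": "Ann", "zip": "60614"},
]


def is_match(a: Record, b: Record) -> bool:
    # Toy rule (an assumption): at least two of the three fields agree.
    return sum(a[f] == b[f] for f in ("last", "first", "zip")) >= 2


by_last = lambda r: r["last"]
by_first = lambda r: r["first"]

single = multi_pass_merge_purge(records, [by_last], is_match)
multi = multi_pass_merge_purge(records, [by_last, by_first], is_match)
print(single)
print(multi)
```

In this toy data, a single pass keyed on the last name separates record 1 from the others because the typo moves it far away in sort order. The pass keyed on the first name links records 0 and 1, and the transitive closure then merges records 0, 1, and 2 into one group even though records 1 and 2 never match each other directly.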