The problem of merging multiple databases of information about common entities is frequently encountered in KDD and decision support applications in large commercial and government organizations. The problem we study is often called the Merge/Purge problem and is difficult to solve both in scale and accuracy. Large repositories of data typically have numerous duplicate information entries about the same entities that are difficult to cull together without an intelligent "equational theory" that identifies equivalent items by a complex, domain-dependent matching process. We have developed a system for accomplishing this Data Cleansing task and demonstrate its use for cleansing lists of names of potential customers in a direct marketing-type application. Our results for statistically generated data are shown to be accurate and effective when processing the data multiple times using different keys for sorting on each successive pass. Combining the results of individual passes using transitive closure over the independent results produces far more accurate results at lower cost. The system provides a rule programming module that is easy to program and quite good at finding duplicates, especially in an environment with massive amounts of data. This paper details improvements in our system, and reports on the successful implementation for a real-world database that conclusively validates our results previously achieved for statistically generated data.
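The multi-pass idea in the abstract (sort on a different key in each pass, then combine the per-pass matches by transitive closure) can be illustrated with a small sketch. This is not the authors' system: the sliding-window scan over the sorted list, the window size, the field names, the two sort keys, and the toy "two of three fields agree" rule standing in for the equational theory are all assumptions made for illustration. The transitive closure is computed here with a union-find structure.

```python
def find(parent, x):
    """Return the representative of x's cluster (with path halving)."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def union(parent, a, b):
    """Merge the clusters containing a and b."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[rb] = ra


def one_pass(records, key, is_match, window):
    """Sort by one key and compare each record only with the next
    (window - 1) records in sorted order. Returns matched index pairs."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = []
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            if is_match(records[i], records[j]):
                pairs.append((i, j))
    return pairs


def multi_pass(records, keys, is_match, window=2):
    """Run one pass per sort key, then take the transitive closure of all
    matched pairs. Returns clusters as sorted lists of record indices."""
    parent = list(range(len(records)))
    for key in keys:
        for i, j in one_pass(records, key, is_match, window):
            union(parent, i, j)
    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(parent, i), []).append(i)
    return sorted(clusters.values())


# Hypothetical toy rule standing in for a domain-dependent equational theory:
# two records are equivalent if at least two of the three fields agree.
def is_match(a, b):
    return sum(a[f] == b[f] for f in ("last", "first", "zip")) >= 2


# Hypothetical sort keys, one per pass.
def by_name(r):
    return (r["last"], r["first"])


def by_zip(r):
    return (r["zip"], r["first"])


records = [
    {"last": "Smith", "first": "John", "zip": "10027"},
    {"last": "Smyth", "first": "John", "zip": "10027"},
    {"last": "Smyth", "first": "Jon", "zip": "10027"},
    {"last": "Jones", "first": "Mary", "zip": "94105"},
    {"last": "Amith", "first": "John", "zip": "10027"},
]

if __name__ == "__main__":
    print(multi_pass(records, [by_name], is_match))
    print(multi_pass(records, [by_zip], is_match))
    print(multi_pass(records, [by_name, by_zip], is_match))
```

In this toy data, the name-sorted pass alone misses record 4, because the typo in its first letter sorts it far from the other entries. The zip-sorted pass alone misses record 2. Taking the closure over both passes groups records 0, 1, 2 and 4 together even though each pass used a window of only two records, which is the sense in which several cheap passes can be more accurate than a single pass.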