Robust and efficient fuzzy match for online data cleaning

Authors:
Surajit Chaudhuri;Kris Ganjam;Venkatesh Ganti;Rajeev Motwani
Affiliations:
Microsoft Research;Microsoft Research;Microsoft Research;Stanford University
Venue:
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Year:
2003

Citing 15
Cited 145

Abstract

To ensure high data quality, data warehouses must validate and cleanse incoming data tuples from external sources. In many situations, clean tuples must match acceptable tuples in reference tables. For example, product name and description fields in a sales record from a distributor must match the pre-recorded name and description fields in a product reference relation. A significant challenge in such a scenario is to implement an efficient and accurate fuzzy match operation that can effectively clean an incoming tuple if it fails to match exactly with any tuple in the reference relation. In this paper, we propose a new similarity function which overcomes limitations of commonly used similarity functions, and develop an efficient fuzzy match algorithm. We demonstrate the effectiveness of our techniques by evaluating them on real datasets.
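
To make the fuzzy match operation concrete, the following is a minimal sketch of matching one incoming tuple against a small reference relation. It is not the similarity function or the algorithm proposed in the paper, which the abstract does not specify. As a stand-in it uses the character-level ratio from Python's standard difflib module, averaged over fields, and a naive linear scan of the reference relation. The names field_similarity, tuple_similarity, fuzzy_match, the 0.8 threshold, and the sample product rows are all hypothetical.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Case-insensitive character-level similarity in [0, 1] (stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def tuple_similarity(input_tuple, ref_tuple) -> float:
    """Average field similarity; assumes both tuples have the same fields in order."""
    scores = [field_similarity(x, y) for x, y in zip(input_tuple, ref_tuple)]
    return sum(scores) / len(scores)


def fuzzy_match(input_tuple, reference, threshold=0.8):
    """Return (closest reference tuple, score), or (None, score) if below threshold.

    This scans every reference tuple, so it is only a baseline; an efficient
    implementation would avoid comparing against the whole relation.
    """
    best = max(reference, key=lambda r: tuple_similarity(input_tuple, r))
    score = tuple_similarity(input_tuple, best)
    return (best if score >= threshold else None), score


# Hypothetical product reference relation: (name, description).
reference = [
    ("Widget Pro 2000", "heavy duty widget"),
    ("Gadget Mini", "compact travel gadget"),
]

# A dirty incoming tuple with a typo and a punctuation difference.
match, score = fuzzy_match(("Widgt Pro 2000", "heavy-duty widget"), reference)
print(match, round(score, 3))
```

In this sketch an incoming tuple that has no sufficiently close reference tuple yields None, which a cleaning pipeline could route to manual review rather than silently correcting it.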