Deduplication is a key operation in integrating data from multiple sources. The main challenge in this task is designing a function that can determine whether a pair of records refers to the same entity in spite of various data inconsistencies. Most existing systems use hand-coded functions. One way to overcome the tedium of hand-coding is to train a classifier to distinguish between duplicates and non-duplicates. The success of this method hinges critically on providing a covering and challenging set of training pairs that bring out the subtleties of the deduplication function. This is non-trivial because it requires manually searching for various data inconsistencies between any two records spread apart in large lists.

We present our design of a learning-based deduplication system that uses a novel method of interactively discovering challenging training pairs using active learning. Our experiments on real-life datasets show that active learning significantly reduces the number of instances needed to achieve high accuracy. We investigate various design issues that arise in building a system to provide interactive response, fast convergence, and interpretable output.
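
To make the idea of interactively discovering challenging training pairs concrete, the following is a minimal sketch of an active-learning loop for deduplication. It is not the system described above: the abstract does not specify the classifier, the similarity features, or the selection rule, so this sketch assumes a single string-similarity feature, a one-threshold classifier, and plain uncertainty sampling (ask about the pair closest to the current decision boundary). All function names (`similarity`, `train_threshold`, `most_uncertain`, `active_learn`) and the `oracle` callback standing in for the human labeler are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(pair):
    """Assumed feature: character-level similarity in [0, 1] of two records."""
    a, b = pair
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def train_threshold(labeled):
    """Fit a toy classifier from (score, is_duplicate) examples.

    The decision threshold is the midpoint between the lowest-scoring
    duplicate and the highest-scoring non-duplicate. Assumes at least
    one example of each class is present.
    """
    dup = [score for score, is_dup in labeled if is_dup]
    non = [score for score, is_dup in labeled if not is_dup]
    return (min(dup) + max(non)) / 2


def most_uncertain(pool, threshold):
    """Pick the unlabeled pair whose score lies closest to the threshold."""
    return min(pool, key=lambda pair: abs(similarity(pair) - threshold))


def active_learn(pool, seed, oracle, rounds):
    """Repeatedly ask the oracle (a human, in practice) about the most
    uncertain pair, add the answer to the training set, and retrain."""
    labeled = list(seed)
    pool = list(pool)
    for _ in range(min(rounds, len(pool))):
        threshold = train_threshold(labeled)
        pair = most_uncertain(pool, threshold)
        pool.remove(pair)
        labeled.append((similarity(pair), oracle(pair)))
    return train_threshold(labeled), labeled
```

In use, `seed` would hold a handful of obvious duplicates and non-duplicates, and each round would show one borderline pair to the user rather than requiring a manual search through the full lists. A real system would use many similarity features and a richer classifier; the single threshold here only illustrates the selection loop.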