We consider the problem of learning a record matching package (classifier) in an active learning setting. In active learning, the learning algorithm picks the set of examples to be labeled, unlike in the more traditional passive learning setting, where a user selects the labeled examples. Active learning is important for record matching because manually identifying a suitable set of labeled examples is difficult. Previous algorithms that use active learning for record matching have serious limitations: the packages that they learn lack quality guarantees, and the algorithms do not scale to large input sizes. We present new algorithms for this problem that overcome these limitations. Our algorithms are fundamentally different from traditional active learning approaches and are designed from the ground up to exploit problem characteristics specific to record matching. We include a detailed experimental evaluation on real-world data demonstrating the effectiveness of our algorithms.
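To make the active learning setting concrete, the following is a minimal sketch of a generic uncertainty-sampling loop for record matching. It is not the algorithms presented in this work, which the abstract describes as fundamentally different from traditional approaches. It only illustrates the setting in which the learner, rather than the user, chooses which record pairs to label. The token-level Jaccard similarity, the single-threshold classifier, and all names (`jaccard`, `fit_threshold`, `active_learn`, `PAIRS`, `TRUE_MATCHES`, `oracle`) are assumptions introduced for illustration.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity (an assumed, illustrative measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def fit_threshold(labeled: List[Tuple[float, bool]]) -> float:
    """Place the threshold midway between the highest-scoring labeled
    non-match and the lowest-scoring labeled match."""
    matches = [s for s, is_match in labeled if is_match]
    non_matches = [s for s, is_match in labeled if not is_match]
    low = max(non_matches, default=0.0)
    high = min(matches, default=1.0)
    return (low + high) / 2


def active_learn(pairs: List[Pair],
                 oracle: Callable[[Pair], bool],
                 budget: int) -> Tuple[float, int]:
    """Learner-driven labeling: repeatedly query the pair whose similarity
    is closest to the current threshold (the most uncertain pair)."""
    sims = {p: jaccard(*p) for p in pairs}
    unlabeled = set(pairs)
    labeled: List[Tuple[float, bool]] = []
    threshold = 0.5  # arbitrary starting guess
    for _ in range(min(budget, len(pairs))):
        # Ties are broken by the pair itself so the choice is deterministic.
        query = min(unlabeled, key=lambda p: (abs(sims[p] - threshold), p))
        unlabeled.remove(query)
        labeled.append((sims[query], oracle(query)))  # one label request
        threshold = fit_threshold(labeled)
    return threshold, len(labeled)


# Hypothetical toy data; the oracle stands in for a human labeler.
PAIRS: List[Pair] = [
    ("acme corp", "acme corp"),
    ("acme corp inc", "acme corp"),
    ("john smith", "john a smith"),
    ("john smith", "jane smith"),
    ("acme corp", "apex corp"),
    ("foo bar", "baz qux"),
]
TRUE_MATCHES = set(PAIRS[:3])


def oracle(pair: Pair) -> bool:
    return pair in TRUE_MATCHES


if __name__ == "__main__":
    threshold, labels_used = active_learn(PAIRS, oracle, budget=4)
    print(f"learned threshold={threshold:.3f} using {labels_used} labels")
```

A loop of this kind offers no guarantee on the precision or recall of the learned classifier, and it scores every unlabeled pair on each round. These are the kinds of limitations in quality guarantees and scalability that the abstract attributes to earlier approaches.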