Many data mining tasks require computing similarity between pairs of objects. Pairwise similarity computations are particularly important in record linkage systems, as well as in clustering and schema mapping algorithms. Because the number of object pairs grows quadratically with the size of the dataset, computing similarity between all pairs becomes prohibitive for large datasets and complex similarity functions. Blocking methods alleviate this problem by efficiently selecting approximately similar object pairs for subsequent distance computations and treating all remaining pairs as dissimilar. Previously proposed blocking methods require manually constructing an index-based similarity function or selecting a set of predicates, followed by hand-tuning of parameters. In this paper, we introduce an adaptive framework for automatically learning blocking functions that are both efficient and accurate. We describe two predicate-based formulations of learnable blocking functions and provide learning algorithms for training them. The effectiveness of the proposed techniques is demonstrated on real and simulated datasets, on which they prove to be more accurate than non-adaptive blocking methods.
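To make the idea of predicate-based blocking concrete, the following is a minimal sketch, not the learning algorithm described in the paper. It assumes a hand-picked disjunction of two blocking predicates; the toy records and the helper names `last_token`, `zip_prefix`, `candidate_pairs`, and `all_pairs_count` are all hypothetical. Two records become a candidate pair if at least one predicate assigns them the same key, and only candidate pairs would be passed on to the expensive similarity function.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records; ids and field values are invented for illustration.
RECORDS = {
    1: {"name": "John Smith", "zip": "78701"},
    2: {"name": "Jon Smith", "zip": "78701"},
    3: {"name": "Mary Jones", "zip": "10001"},
    4: {"name": "M. Jones", "zip": "10001"},
    5: {"name": "Alan Turing", "zip": "02139"},
}


def last_token(record):
    """Blocking key: lowercased last token of the name field (an assumed predicate)."""
    return record["name"].split()[-1].lower()


def zip_prefix(record):
    """Blocking key: first three characters of the zip field (an assumed predicate)."""
    return record["zip"][:3]


def candidate_pairs(records, key_functions):
    """Return pairs of record ids that share a key under at least one predicate.

    This is a disjunctive blocking function: each key function builds an
    inverted index from key to record ids, and every pair within a block
    is emitted as a candidate.
    """
    pairs = set()
    for key_fn in key_functions:
        blocks = defaultdict(list)
        for rid, rec in records.items():
            blocks[key_fn(rec)].append(rid)
        for ids in blocks.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs


def all_pairs_count(n):
    """Number of unordered pairs among n objects, which grows quadratically in n."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    selected = candidate_pairs(RECORDS, [last_token, zip_prefix])
    print(f"{len(selected)} candidate pairs out of {all_pairs_count(len(RECORDS))}")
```

In this sketch the predicates are fixed by hand, which is exactly the manual step the abstract argues against; an adaptive approach would instead choose the predicates from labeled examples of matching and non-matching pairs.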