Identifying approximately duplicate records in databases is an essential step in data cleaning and data integration. Most existing approaches have relied on generic or manually tuned distance metrics to estimate the similarity of potential duplicates. In this paper, we present a framework for improving duplicate detection using trainable measures of textual similarity. We propose employing a learnable text distance function for each database field, and show that such measures can adapt to the specific notion of similarity that is appropriate for the field's domain. We present two learnable text similarity measures suitable for this task: an extended variant of learnable string edit distance, and a novel vector-space-based measure that employs a Support Vector Machine (SVM) for training. Experimental results on a range of datasets show that our framework can improve duplicate detection accuracy over traditional techniques.
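
To make the per-field idea concrete, here is a minimal Python sketch of how a pair of records might be turned into one similarity feature per field and per measure. It is an illustration under stated assumptions, not the method of the paper: the edit-distance costs are fixed parameters rather than learned ones, the vector-space measure is plain term-frequency cosine rather than an SVM-trained one, and the names `weighted_edit_distance`, `cosine_similarity`, and `field_features` are hypothetical.

```python
import math
from collections import Counter


def weighted_edit_distance(a, b, sub_cost=1.0, gap_cost=1.0):
    """Edit distance with tunable costs (assumption: in a trainable
    variant, costs like these would be estimated from labeled pairs)."""
    prev = [j * gap_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * gap_cost]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (0.0 if ca == cb else sub_cost)
            curr.append(min(diag, prev[j] + gap_cost, curr[j - 1] + gap_cost))
        prev = curr
    return prev[-1]


def cosine_similarity(a, b):
    """Cosine similarity over whitespace tokens with raw term counts."""
    ta, tb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ta[t] * tb[t] for t in ta)
    na = math.sqrt(sum(v * v for v in ta.values()))
    nb = math.sqrt(sum(v * v for v in tb.values()))
    return dot / (na * nb) if na and nb else 0.0


def field_features(rec_a, rec_b, fields):
    """Two similarity features per field: character-level and token-level.

    The edit-based feature stays in [0, 1] only for the default unit costs.
    """
    feats = []
    for f in fields:
        x, y = rec_a.get(f, ""), rec_b.get(f, "")
        longest = max(len(x), len(y)) or 1
        feats.append(1.0 - weighted_edit_distance(x, y) / longest)
        feats.append(cosine_similarity(x, y))
    return feats


if __name__ == "__main__":
    r1 = {"name": "Fenix at the Argyle", "city": "W. Hollywood"}
    r2 = {"name": "Fenix", "city": "Hollywood"}
    print(field_features(r1, r2, ["name", "city"]))
```

A feature vector like this could then be passed to any binary classifier that labels record pairs as duplicate or non-duplicate; which classifier to use, and how the per-field measures are trained, is left open in this sketch.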