Quality-aware similarity assessment for entity matching in Web data

Authors:
Surender Reddy Yerva; Zoltán Miklós; Karl Aberer
Affiliations:
EPFL IC LSIR, Lausanne, Switzerland; EPFL IC LSIR, Lausanne, Switzerland; EPFL IC LSIR, Lausanne, Switzerland
Venue:
Information Systems
Year:
2012

Abstract

One of the key challenges in realizing automated processing of information on the Web, the central goal of the Semantic Web, is the entity matching problem. A number of tools reliably recognize named entities, such as persons, companies, and geographic locations, in Web documents. The names of these extracted entities are, however, not unique: the same name on different Web pages may or may not refer to the same entity. The entity matching problem consists of identifying which extracted entities refer to the same real-world entity. This problem is very similar to the entity resolution problem studied in relational databases, but there are several differences. Most importantly, Web pages often contain only partial or incomplete information about the entities. Similarity functions try to capture the degree of belief in the equivalence of two entities, and thus play a crucial role in entity matching. The accuracy of a similarity function depends heavily on the assessment technique applied, and also on specific features of the entities. We propose systematic design strategies for combined similarity functions in this context. Our method combines multiple pieces of evidence, using the estimated quality of the individual similarity values and paying particular attention to the missing information that is common in the Web context. We study the effectiveness of our method on two specific instances of the general entity matching problem: person name disambiguation and Twitter message classification. In both cases, using our techniques in a very simple algorithmic framework, we obtained better results than the state-of-the-art methods.
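
To make the idea of a combined similarity function concrete, here is a minimal sketch in Python. It is not the authors' formula: the abstract does not specify how the quality estimates enter the combination, so the quality-weighted average, the function name `combine_similarities`, the `default` fallback value, and the sample numbers are all assumptions made for illustration. Missing evidence is represented as `None` and is simply left out of the average.

```python
from typing import Optional, Sequence


def combine_similarities(
    scores: Sequence[Optional[float]],
    qualities: Sequence[float],
    default: float = 0.5,
) -> float:
    """Quality-weighted average of similarity scores (illustrative assumption).

    scores[i]    -- similarity in [0, 1] from the i-th function, or None when
                    the feature it needs is missing for this pair of entities
    qualities[i] -- estimated quality (used as a weight) of the i-th score
    default      -- hypothetical fallback returned when no evidence is usable
    """
    if len(scores) != len(qualities):
        raise ValueError("scores and qualities must have the same length")

    weighted_sum = 0.0
    total_weight = 0.0
    for score, quality in zip(scores, qualities):
        if score is None:
            # Missing information: skip it rather than treat it as dissimilar.
            continue
        weighted_sum += quality * score
        total_weight += quality

    if total_weight == 0.0:
        # No usable evidence at all; fall back to the assumed neutral value.
        return default
    return weighted_sum / total_weight


# Hypothetical pair of Web pages: three similarity functions, one of which
# could not be evaluated because the corresponding feature was absent.
scores = [0.9, None, 0.4]
qualities = [0.8, 0.6, 0.2]
combined = combine_similarities(scores, qualities)
```

In this sketch the score with the higher estimated quality dominates the result, and the missing score neither raises nor lowers it. Other ways of combining the evidence are equally compatible with the description in the abstract.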