One of the key challenges in realizing automated processing of information on the Web, the central goal of the Semantic Web, is the entity matching problem. A number of tools reliably recognize named entities, such as persons, companies, and geographic locations, in Web documents. The names of these extracted entities are, however, not unique: the same name on different Web pages may or may not refer to the same entity. The entity matching problem consists of identifying the entity mentions that refer to the same real-world entity. This problem is very similar to the entity resolution problem studied in relational databases, but there are several differences. Most importantly, Web pages often contain only partial or incomplete information about the entities. Similarity functions try to capture the degree of belief in the equivalence of two entities, and thus play a crucial role in entity matching. The accuracy of a similarity function depends heavily on the assessment technique applied, and also on specific features of the entities. We propose systematic design strategies for combined similarity functions in this context. Our method combines multiple pieces of evidence, using the estimated quality of the individual similarity values and paying particular attention to the missing information that is common in the Web context. We study the effectiveness of our method on two specific instances of the general entity matching problem: person name disambiguation and Twitter message classification. In both cases, using our techniques in a very simple algorithmic framework, we obtained better results than the state-of-the-art methods.
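
To make the idea of a combined similarity function concrete, the following is a minimal sketch and not the method evaluated in the paper. It assumes that each individual similarity function returns a value in [0, 1] or `None` when the required information is missing from a Web page, and that a quality estimate in [0, 1] is available for each function. The function name `combined_similarity`, the feature names, the weighted-average rule, and the `default` fallback are all hypothetical choices made for illustration.

```python
from typing import Dict, Optional


def combined_similarity(
    scores: Dict[str, Optional[float]],
    quality: Dict[str, float],
    default: float = 0.5,
) -> float:
    """Quality-weighted average of the similarity values that are available.

    scores  -- similarity value per evidence source, or None if missing
    quality -- estimated quality (used as weight) per evidence source
    default -- hypothetical fallback when no usable evidence exists
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for name, score in scores.items():
        if score is None:
            # Missing information: skip this source instead of
            # treating it as evidence of dissimilarity.
            continue
        weight = quality.get(name, 0.0)
        weighted_sum += weight * score
        total_weight += weight
    if total_weight == 0.0:
        # No evidence at all (or only zero-quality evidence).
        return default
    return weighted_sum / total_weight


# Hypothetical example: two person mentions compared on three features,
# one of which (affiliation) is absent from one of the Web pages.
example_scores = {"name": 0.9, "affiliation": None, "coauthors": 0.4}
example_quality = {"name": 0.6, "affiliation": 0.8, "coauthors": 0.9}
example_result = combined_similarity(example_scores, example_quality)
```

In this sketch a missing value simply removes its source from the average, so the remaining weights are renormalized over the evidence that is actually present. Other combination rules are possible, and the abstract does not specify which one the authors use.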