This paper describes scraping web pages for tourism objects and resolving duplicates for the tourism recommender system Sightsplanner. Because we gather information from different web portals, several versions of the same object end up in our database. Duplicates must be identified and merged so that only unique objects are presented to the end user. The main focus of this paper is therefore the deduplication problem. We first implemented a duplicate detection system and tuned its parameters manually, reaching up to 85% accuracy. We then present a machine learning setup that improved the deduplication accuracy for tourism attractions by 13 percentage points, to 98%. All steps in the process are presented, along with the problems we tackled.
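To make the idea of duplicate detection with manually tuned parameters concrete, here is a minimal Python sketch. It is not the system described in the paper: the record fields (`name`, `lat`, `lon`, `source`), the two features (name similarity and geographic distance), the threshold values, and the sample records are all assumptions chosen for illustration. In a machine learning setup, a trained classifier over such features would presumably take the place of the hand-set rule in `is_duplicate`.

```python
import math
from difflib import SequenceMatcher

# Hypothetical thresholds standing in for manually tuned parameters.
MIN_NAME_SIMILARITY = 0.8
MAX_DISTANCE_M = 200.0


def name_similarity(a, b):
    """Return a 0..1 similarity of two names, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def distance_m(a, b):
    """Great-circle (haversine) distance in metres between two records."""
    lat1, lat2 = math.radians(a["lat"]), math.radians(b["lat"])
    dlat = lat2 - lat1
    dlon = math.radians(b["lon"] - a["lon"])
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * 6371000.0 * math.asin(math.sqrt(h))


def is_duplicate(a, b):
    """Hand-set rule: similar names and nearby coordinates mean a duplicate."""
    return (name_similarity(a["name"], b["name"]) >= MIN_NAME_SIMILARITY
            and distance_m(a, b) <= MAX_DISTANCE_M)


def merge_duplicates(records):
    """Greedily group records; each group holds versions of one object."""
    groups = []
    for record in records:
        for group in groups:
            # Compare against the first record of the group only.
            if is_duplicate(group[0], record):
                group.append(record)
                break
        else:
            groups.append([record])
    return groups


# Made-up sample records, as if scraped from two different portals.
records = [
    {"name": "Town Hall Museum", "lat": 59.4370, "lon": 24.7453, "source": "portal_a"},
    {"name": "town hall museum ", "lat": 59.4371, "lon": 24.7454, "source": "portal_b"},
    {"name": "Seaside Park", "lat": 59.4500, "lon": 24.8000, "source": "portal_a"},
]
groups = merge_duplicates(records)
print(len(groups))
```

In this sketch the two "Town Hall Museum" records fall into one group and "Seaside Park" stays on its own, so only one entry per group would be shown to the end user. Comparing each record against only the first member of a group keeps the example short; a real system would need a more careful merging strategy.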