Information retrieval and deduplication for tourism recommender sightsplanner

Authors:
Ago Luberg;Michael Granitzer;Honghan Wu;Priit Järv;Tanel Tammet
Affiliations:
Tallinn University of Technology, Ehitajate tee, Tallinn, Estonia;University of Passau, Passau, Germany;Nanjing University of Information Science & Technology, China;Tallinn University of Technology, Ehitajate tee, Tallinn, Estonia;Tallinn University of Technology, Ehitajate tee, Tallinn, Estonia
Venue:
Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics
Year:
2012

Abstract

This paper addresses scraping web pages for tourism objects and resolving duplicates for the tourism recommender system Sightsplanner. Because we gather information from different web portals, we end up with several versions of the same object in our database. It is essential to identify which objects are duplicates and merge them, so that only unique objects are presented to the end user. The main focus of this paper is therefore on the deduplication problem. We first implemented a duplicate detection system and tuned its parameters manually, reaching up to 85% accuracy. We then present a machine learning setup that improved the deduplication accuracy for tourism attractions by 13 percentage points, to 98%. All steps in the process are presented, along with the problems we tackled.
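
As a rough illustration of the kind of manually tuned duplicate detection mentioned above, the following Python sketch compares two tourism objects by name similarity and geographic distance and keeps only one object per group of duplicates. The features, the threshold values, the record layout (`name`, `lat`, `lon`) and all function names are assumptions made for this example; the abstract does not state which features or parameters the Sightsplanner system uses.

```python
from difflib import SequenceMatcher
from math import asin, cos, radians, sin, sqrt


def name_similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in metres."""
    earth_radius_m = 6371000.0
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    h = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_m * asin(sqrt(h))


def is_duplicate(x, y, min_name_sim=0.8, max_dist_m=200.0):
    """Hypothetical hand-tuned rule: similar names AND nearby locations.

    Both thresholds are illustrative defaults, not values from the paper.
    """
    sim = name_similarity(x["name"], y["name"])
    dist = haversine_m(x["lat"], x["lon"], y["lat"], y["lon"])
    return sim >= min_name_sim and dist <= max_dist_m


def merge_duplicates(objects, **thresholds):
    """Keep the first object of every duplicate group (naive O(n^2) pass)."""
    unique = []
    for obj in objects:
        if not any(is_duplicate(obj, kept, **thresholds) for kept in unique):
            unique.append(obj)
    return unique
```

In a learned setup such as the one the abstract describes, the fixed thresholds in `is_duplicate` would be replaced by a classifier trained on labelled pairs, with pairwise features like `name_similarity` and `haversine_m` as its inputs; that substitution is likewise only a sketch of the general idea.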