Information retrieval and deduplication for tourism recommender sightsplanner

  • Authors:
  • Ago Luberg;Michael Granitzer;Honghan Wu;Priit Järv;Tanel Tammet

  • Affiliations:
  • Tallinn University of Technology, Ehitajate tee, Tallinn, Estonia;University of Passau, Passau, Germany;Nanjing University of Information Science & Technology, China;Tallinn University of Technology, Ehitajate tee, Tallinn, Estonia;Tallinn University of Technology, Ehitajate tee, Tallinn, Estonia

  • Venue:
  • Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics
  • Year:
  • 2012

Abstract

This paper describes scraping web pages for tourism objects and resolving duplicates for the tourism recommender system Sightsplanner. Because we gather information from different web portals, our database ends up containing several versions of the same object. Duplicates must be identified and merged so that only unique objects are presented to the end user. The main focus of this paper is therefore the deduplication problem. We first implemented a duplicate detection system and tuned its parameters manually, reaching up to 85% accuracy. We then present a machine learning setup that improved the deduplication accuracy for tourism attractions by 13 percentage points, to 98%. All steps in the process are presented, along with the problems we tackled.
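
To make the deduplication task concrete, the following is a minimal sketch of pairwise duplicate detection with hand-set parameters, roughly the kind of baseline the abstract describes as manually tuned. It is not the authors' implementation: the features (name similarity and geographic distance), the threshold values, the record fields and all function names are assumptions chosen for illustration. A learned classifier would replace the fixed thresholds in `is_duplicate`.

```python
import math
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Similarity of two names in [0, 1], ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in metres between two coordinates."""
    earth_radius_m = 6371000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_m * math.asin(math.sqrt(h))


def is_duplicate(x, y, min_name_sim=0.8, max_distance_m=200.0):
    """Hypothetical rule: similar names AND nearby locations.

    Both thresholds are illustrative values, not figures from the paper.
    """
    close = distance_m(x["lat"], x["lon"], y["lat"], y["lon"]) <= max_distance_m
    return close and name_similarity(x["name"], y["name"]) >= min_name_sim


def group_duplicates(objects):
    """Greedily group objects; each is compared to the first member of a group."""
    groups = []
    for obj in objects:
        for group in groups:
            if is_duplicate(group[0], obj):
                group.append(obj)
                break
        else:
            # No existing group matched, so this object starts a new one.
            groups.append([obj])
    return groups


# Made-up records standing in for objects scraped from different portals.
scraped = [
    {"name": "Tallinn Town Hall", "lat": 59.4370, "lon": 24.7453},
    {"name": "tallinn town hall ", "lat": 59.4371, "lon": 24.7454},
    {"name": "Kadriorg Palace", "lat": 59.4385, "lon": 24.7910},
]
groups = group_duplicates(scraped)
```

Here the first two records fall into one group and the third stays on its own, so only one representative per group would need to be shown to the end user. Comparing against a single group representative keeps the sketch short; it is a simplification and can miss duplicates that a full pairwise comparison would find.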