Cross-Language high similarity search using a conceptual thesaurus

  • Authors:
  • Parth Gupta; Alberto Barrón-Cedeño; Paolo Rosso

  • Affiliations:
  • Natural Language Engineering Lab. - ELiRF Department of Information Systems and Computation, Universitat Politècnica de València, Spain;Natural Language Engineering Lab. - ELiRF Department of Information Systems and Computation, Universitat Politècnica de València, Spain;Natural Language Engineering Lab. - ELiRF Department of Information Systems and Computation, Universitat Politècnica de València, Spain

  • Venue:
  • CLEF'12 Proceedings of the Third international conference on Information Access Evaluation: multilinguality, multimodality, and visual analytics
  • Year:
  • 2012

Abstract

This work addresses cross-language high similarity and near-duplicate search, where, for a given document, a highly similar one must be identified in a large cross-language collection of documents. We propose a concept-based similarity model for the problem that is light in both computation and memory. We evaluate the model on three corpora of different natures and two language pairs, English-German and English-Spanish, using the Eurovoc conceptual thesaurus. The model is compared with two state-of-the-art models, and we find that, although the proposed model is very generic, it produces competitive results and is significantly stable and consistent across the corpora.
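
The abstract does not spell out the model, so the following is only a minimal sketch of one way a concept-based cross-language comparison could work, not the authors' implementation. It assumes that each document is reduced to counts of language-independent concept identifiers and that documents are compared by cosine similarity. The thesaurus entries, concept IDs, and function names are hypothetical placeholders and are not taken from Eurovoc.

```python
import math
from collections import Counter

# Hypothetical toy thesaurus: surface term -> language-independent concept ID.
# These entries are illustrative only; real Eurovoc labels and IDs are not used.
TOY_THESAURUS = {
    "en": {"fishing": "C1", "agriculture": "C2", "trade": "C3"},
    "es": {"pesca": "C1", "agricultura": "C2", "comercio": "C3"},
}


def concept_vector(text, lang):
    """Count the concept IDs whose terms occur in a whitespace-tokenized text."""
    terms = TOY_THESAURUS[lang]
    tokens = text.lower().split()
    return Counter(terms[t] for t in tokens if t in terms)


def cosine(u, v):
    """Cosine similarity between two sparse concept-count vectors."""
    dot = sum(u[c] * v[c] for c in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(query, query_lang, collection, coll_lang):
    """Return (index, score) of the collection document closest to the query."""
    q = concept_vector(query, query_lang)
    scores = [cosine(q, concept_vector(d, coll_lang)) for d in collection]
    best = max(range(len(collection)), key=scores.__getitem__)
    return best, scores[best]


spanish_docs = ["la agricultura europea", "pesca y comercio exterior"]
# The second document (index 1) shares both concepts with the English query.
best_index, best_score = most_similar(
    "fishing and trade policy", "en", spanish_docs, "es"
)
```

Because both documents are mapped to the same concept space, no translation step is needed at query time, and each document is stored as a short sparse vector. This is consistent with the low computation and memory cost the abstract claims, although the actual weighting and matching used in the paper may differ.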