Bilingual news clustering using named entities and fuzzy similarity

  • Authors:
  • Soto Montalvo;Raquel Martínez;Arantza Casillas;Víctor Fresno

  • Affiliations:
  • GAVAB Group, URJC;NLP&IR Group, UNED;Dpt. Electricidad y Electrónica, UPV-EHU;NLP&IR Group, UNED

  • Venue:
  • TSD'07: Proceedings of the 10th International Conference on Text, Speech and Dialogue
  • Year:
  • 2007

Abstract

This paper focuses on discovering bilingual news clusters in a comparable corpus. In particular, we address how news items are represented and how the similarity between documents is calculated. As representative features of each news item we use the cognate named entities it contains. One of our main goals is to determine whether named entities alone are a good source of knowledge for multilingual news clustering. The vector representation of the news takes the category of each named entity into account. To determine the similarity between two documents, we propose a new approach based on a fuzzy system whose knowledge base tries to incorporate human knowledge about the importance of each named entity category in the news. We compared our approach with a traditional one and obtained better results on a comparable corpus of news in Spanish and English.
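
To make the idea of a category-aware fuzzy similarity more concrete, the following is a minimal illustrative sketch in Python. It is not the authors' system: the abstract does not specify the membership functions, rules, or entity categories, so everything here is an assumption. The category names (`PERSON`, `ORGANIZATION`, `LOCATION`), the rule outputs in `RULE_OUTPUT`, the ramp-shaped `high` membership function, the use of Jaccard overlap, and the zero-order Sugeno-style weighted average are all hypothetical choices, and the entities are assumed to have already been mapped to a shared cognate form across the two languages.

```python
from typing import Dict, Set

# A document is represented only by its named entities, grouped by category.
# Entities are assumed to be already normalized to a shared cognate form.
Doc = Dict[str, Set[str]]

# Hypothetical knowledge base: the similarity level asserted when two
# documents share many entities of a given category (assumed values).
RULE_OUTPUT = {"PERSON": 1.0, "ORGANIZATION": 0.8, "LOCATION": 0.5}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap between two entity sets, in [0, 1]; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def high(x: float) -> float:
    """Membership in the fuzzy set 'high overlap': a ramp reaching 1 at 0.5."""
    return min(1.0, max(0.0, x / 0.5))


def fuzzy_similarity(d1: Doc, d2: Doc) -> float:
    """Zero-order Sugeno-style inference over two rules per category:
    IF overlap is high THEN similarity = RULE_OUTPUT[category]
    IF overlap is low  THEN similarity = 0
    The result is the firing-strength-weighted average of rule outputs."""
    weighted_sum = 0.0
    strength_sum = 0.0
    for category, output in RULE_OUTPUT.items():
        overlap = jaccard(d1.get(category, set()), d2.get(category, set()))
        fire_high = high(overlap)
        fire_low = 1.0 - fire_high
        weighted_sum += fire_high * output + fire_low * 0.0
        strength_sum += fire_high + fire_low
    return weighted_sum / strength_sum


if __name__ == "__main__":
    base = {"PERSON": {"zapatero"}, "LOCATION": {"madrid"}}
    same_person = {"PERSON": {"zapatero"}, "LOCATION": {"london"}}
    same_place = {"PERSON": {"blair"}, "LOCATION": {"madrid"}}
    print(fuzzy_similarity(base, same_person))
    print(fuzzy_similarity(base, same_place))
```

Under these assumed rule outputs, two news items that share a person score higher than two that share only a location, which is the kind of category-dependent weighting the abstract describes. A pairwise score of this kind could then be passed to any standard clustering algorithm.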