Unsupervised Similarity Learning from Textual Data

  • Authors:
  • Andrzej Janusz; Dominik Ślęzak; Hung Son Nguyen

  • Affiliations:
  • (Correspd.: University of Warsaw, Banacha 2, 02-097 Warszawa, Poland) Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Banacha 2, 02-097 Warszawa, Poland, andrzejanusz@gmai ...; (Also works: Infobright Inc., Krzywickiego 34 lok. 219, 02-078 Warsaw, Poland) Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Banacha 2, 02-097 Warszawa, Poland, andrzeja ...; Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Banacha 2, 02-097 Warszawa, Poland, andrzejanusz@gmail.com / slezak@infobright.com / son@mimuw.edu.pl

  • Venue:
  • Fundamenta Informaticae - Concurrency Specification and Programming (CS&P)
  • Year:
  • 2012

Abstract

This paper presents research on the construction of a new unsupervised model for learning a semantic similarity measure from text corpora. The two main components of the model are a semantic interpreter of texts and a similarity function whose properties are derived from data. The semantic interpreter associates particular documents with concepts defined in a knowledge base corresponding to the topics covered by the corpus. It shifts the representation of the meaning of the texts from words, which can be ambiguous, to concepts with predefined semantics. With this new representation, the similarity function is derived from data using a modification of the dynamic rule-based similarity model, adjusted to the unsupervised case. The adjustment is based on a novel notion of an information bireduct, which originates in the theory of rough sets. This extension of classical information reducts is used to find diverse sets of reference documents described by diverse sets of reference concepts that determine different aspects of the similarity. The paper explains the general idea of the approach and gives some implementation guidelines. Additionally, results of preliminary experiments are presented to demonstrate the usefulness of the proposed model.
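
The abstract does not define an information bireduct, so the following is only an illustrative sketch, not the authors' implementation. It assumes the usual rough-set reading: a pair (B, X) of concepts B and documents X such that B tells apart every pair of documents in X, no concept can be dropped from B, and no document can be added to X. The toy document-concept table, the function names, and the ordering-based construction are all hypothetical choices made for the example.

```python
from itertools import combinations

# Toy table (assumed data): rows are documents, columns are binary concept indicators.
DOCS = [
    [1, 0, 1, 0],
    [1, 0, 1, 0],  # identical to document 0 on every concept
    [0, 1, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
]

def discerns(table, attrs, objs):
    """True if every pair of distinct objects in objs differs on some attribute in attrs."""
    return all(
        any(table[u][a] != table[v][a] for a in attrs)
        for u, v in combinations(sorted(objs), 2)
    )

def bireduct_from_order(table, n_attrs, order):
    """Build a pair (attrs, objs) by scanning a mixed ordering of attributes and objects.

    Start from all attributes and no objects. An attribute is dropped if the
    remaining ones still discern the current objects; an object is added if the
    current attributes still discern the enlarged object set.
    """
    attrs, objs = set(range(n_attrs)), set()
    for kind, idx in order:
        if kind == "attr":
            if discerns(table, attrs - {idx}, objs):
                attrs.discard(idx)
        elif discerns(table, attrs, objs | {idx}):
            objs.add(idx)
    return attrs, objs

def is_bireduct(table, attrs, objs):
    """Check discernibility, minimality of attrs, and maximality of objs."""
    if not discerns(table, attrs, objs):
        return False
    if any(discerns(table, attrs - {a}, objs) for a in attrs):
        return False
    outside = set(range(len(table))) - objs
    return not any(discerns(table, attrs, objs | {u}) for u in outside)

# One possible ordering: all documents first, then all concepts.
ORDER = [("obj", i) for i in range(len(DOCS))] + [("attr", j) for j in range(4)]
CONCEPTS, REFERENCE_DOCS = bireduct_from_order(DOCS, 4, ORDER)
```

Different orderings generally yield different pairs, which is one way to obtain the diverse sets of reference documents and reference concepts that the abstract mentions. Documents 0 and 1 are indistinguishable in this table, so they can never appear together in the same pair.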