Unsupervised Similarity Learning from Textual Data

Authors:
Andrzej Janusz;Dominik Ślęzak;Hung Son Nguyen
Affiliations:
Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Banacha 2, 02-097 Warszawa, Poland (corresponding author), andrzejanusz@gmail.com;Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Banacha 2, 02-097 Warszawa, Poland (also with Infobright Inc., Krzywickiego 34 lok. 219, 02-078 Warsaw, Poland), slezak@infobright.com;Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Banacha 2, 02-097 Warszawa, Poland, son@mimuw.edu.pl
Venue:
Fundamenta Informaticae - Concurrency Specification and Programming (CS&P)
Year:
2012

Abstract

This paper presents research on constructing a new unsupervised model for learning a semantic similarity measure from text corpora. The model has two main components: a semantic interpreter of texts and a similarity function whose properties are derived from data. The semantic interpreter associates individual documents with concepts defined in a knowledge base that corresponds to the topics covered by the corpus. It thereby shifts the representation of a text's meaning from words, which can be ambiguous, to concepts with predefined semantics. Given this new representation, the similarity function is derived from data using a modification of the dynamic rule-based similarity model, adjusted to the unsupervised case. The adjustment is based on the novel notion of an information bireduct, which originates in the theory of rough sets. This extension of classical information reducts is used to find diverse sets of reference documents described by diverse sets of reference concepts, which determine different aspects of similarity. The paper explains the general idea of the approach and gives some implementation guidelines. Results of preliminary experiments are also presented to demonstrate the usefulness of the proposed model.
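
The abstract does not define an information bireduct, so the following is only a minimal sketch under a stated assumption: a bireduct is taken to be a pair (B, X) of reference concepts B and reference documents X such that B discerns every pair of documents in X, no concept can be dropped from B, and no document can be added to X without breaking that property. The function names `discerns` and `random_bireduct`, the toy document-concept table, and the random-ordering heuristic are all hypothetical illustrations, not the authors' implementation.

```python
import random
from itertools import combinations


def discerns(table, attrs, objs):
    """True if every pair of objects in objs differs on at least one attribute in attrs."""
    return all(
        any(table[u][a] != table[v][a] for a in attrs)
        for u, v in combinations(objs, 2)
    )


def random_bireduct(table, attributes, seed=None):
    """Greedy heuristic (an assumption of this sketch): walk a random ordering of
    attributes and objects, dropping an attribute or adding an object whenever
    the remaining attributes still discern all selected objects."""
    rng = random.Random(seed)
    steps = [("attr", a) for a in attributes] + [("obj", u) for u in table]
    rng.shuffle(steps)

    attrs, objs = set(attributes), set()  # start from all concepts, no documents
    for kind, item in steps:
        if kind == "attr":
            # Drop the concept if the selected documents stay pairwise discernible.
            if discerns(table, attrs - {item}, objs):
                attrs.discard(item)
        else:
            # Add the document if the current concepts tell it apart from the rest.
            if discerns(table, attrs, objs | {item}):
                objs.add(item)
    return attrs, objs


# Hypothetical toy data: 1 means the document is associated with the concept.
concepts = ["rough sets", "clustering", "text mining"]
docs = {
    "d1": {"rough sets": 1, "clustering": 0, "text mining": 1},
    "d2": {"rough sets": 1, "clustering": 1, "text mining": 0},
    "d3": {"rough sets": 0, "clustering": 1, "text mining": 1},
    "d4": {"rough sets": 1, "clustering": 0, "text mining": 1},  # same profile as d1
}

for seed in range(3):
    ref_concepts, ref_docs = random_bireduct(docs, concepts, seed=seed)
    print(sorted(ref_concepts), sorted(ref_docs))
```

Different seeds yield different (concepts, documents) pairs, which illustrates the "diverse sets of reference documents described by diverse sets of reference concepts" mentioned in the abstract. Some orderings give degenerate pairs (for example, no concepts and a single document), so a practical implementation would likely filter or bias the search; how the paper handles this is not stated in the abstract.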