Semantic clustering of scientific articles using explicit semantic analysis

Authors:
Marcin Szczuka;Andrzej Janusz
Affiliations:
Faculty of Mathematics, Informatics, and Mechanics, The University of Warsaw, Warsaw, Poland;Faculty of Mathematics, Informatics, and Mechanics, The University of Warsaw, Warsaw, Poland
Venue:
Transactions on Rough Sets XVI
Year:
2013



Abstract

This paper summarizes our recent research on semantic clustering of scientific articles. We present a case study focused on the analysis of papers related to Rough Sets theory. The proposed method groups documents on the basis of their content, with the assistance of the DBpedia knowledge base. The text corpus is first processed with Natural Language Processing tools to produce vector representations of the content. In the second step, the articles are matched against a collection of concepts retrieved from DBpedia. As a result, a new representation that better reflects the semantics of the texts is constructed. With this new representation, the documents are hierarchically clustered to partition the papers into semantically related groups. The steps in textual data preparation, the use of DBpedia, and the employed clustering methods are explained and illustrated with experimental results. The quality of the resulting clustering is assessed using feedback from human experts combined with typical cluster quality measures. These results are then discussed in the context of a larger framework that aims to facilitate search and information extraction from large textual repositories.
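
The pipeline outlined in the abstract (term vectors, projection onto knowledge-base concepts, hierarchical clustering) can be illustrated with a minimal sketch. It is not the authors' implementation. The concept descriptions below are hypothetical stand-ins for text retrieved from DBpedia, the tokenizer replaces a real NLP toolchain, and the TF-IDF weighting, cosine distance, and average linkage are assumptions chosen for brevity. All function names are invented for this example.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase whitespace tokenizer; a stand-in for a real NLP pipeline."""
    return [w for w in text.lower().split() if w.isalpha()]

def concept_index(concept_texts):
    """TF-IDF weight of each term within each concept description."""
    bags = [Counter(tokenize(t)) for t in concept_texts]
    n = len(bags)
    df = Counter(term for bag in bags for term in bag)
    return [
        {t: tf * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, tf in bag.items()}
        for bag in bags
    ]

def esa_vector(doc_text, index):
    """Project a document onto concept space: one association score per concept."""
    tf = Counter(tokenize(doc_text))
    return [sum(tf[t] * w for t, w in weights.items()) for weights in index]

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0

def average_linkage(vectors, k):
    """Agglomerative clustering: merge the closest pair of clusters until k remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                d = sum(cosine_distance(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters.pop(b)  # b > a, so index a stays valid after the pop
        clusters[a].extend(merged)
    return clusters

# Hypothetical concept descriptions (NOT actual DBpedia content).
concepts = [
    "rough set approximation lower upper boundary indiscernibility",
    "clustering groups similar objects cluster hierarchy",
    "decision rule condition attribute classification",
]
# Toy documents standing in for article texts.
documents = [
    "rough set approximation of concepts using lower and upper boundary",
    "indiscernibility relation and rough approximation",
    "hierarchical clustering groups similar documents",
    "cluster hierarchy for similar objects",
]

index = concept_index(concepts)
vectors = [esa_vector(d, index) for d in documents]
clusters = average_linkage(vectors, k=2)
print(clusters)
```

In this toy data the last two documents share only one word with each other, yet both map onto the same concept, so they end up in one cluster. This is the kind of effect the concept-based representation is meant to provide. A realistic setup would replace the quadratic pair search with a library implementation and stop merging by a distance threshold or by inspecting the full dendrogram.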