Semantic clustering of scientific articles using explicit semantic analysis

  • Authors:
  • Marcin Szczuka;Andrzej Janusz

  • Affiliations:
  • Faculty of Mathematics, Informatics, and Mechanics, The University of Warsaw, Warsaw, Poland;Faculty of Mathematics, Informatics, and Mechanics, The University of Warsaw, Warsaw, Poland

  • Venue:
  • Transactions on Rough Sets XVI
  • Year:
  • 2013

Abstract

This paper summarizes our recent research on semantic clustering of scientific articles. We present a case study focused on the analysis of papers related to rough set theory. The proposed method groups documents on the basis of their content, with the assistance of the DBpedia knowledge base. The text corpus is first processed with Natural Language Processing tools to produce vector representations of the content. In the second step, the articles are matched against a collection of concepts retrieved from DBpedia. As a result, a new representation that better reflects the semantics of the texts is constructed. With this new representation, the documents are hierarchically clustered to form a partitioning of the papers into semantically related groups. The steps in textual data preparation, the utilization of DBpedia, and the employed clustering methods are explained and illustrated with experimental results. The quality of the resulting clustering is then assessed using feedback from human experts combined with typical cluster quality measures. These results are discussed in the context of a larger framework that aims to facilitate search and information extraction from large textual repositories.
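
The three steps named in the abstract (vector representation, matching against concepts, hierarchical clustering) can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not specify the NLP tools, the term weighting, or the linkage criterion, so TF-IDF weighting, cosine similarity, and average-linkage agglomerative clustering are assumptions made here. The concept descriptions are invented stand-ins rather than text retrieved from DBpedia, and all function names are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    # Crude stand-in for a real NLP pipeline: lowercase, keep alphabetic tokens.
    return [w for w in text.lower().split() if w.isalpha()]

def tfidf(texts):
    # One sparse vector (term -> weight) per text, with smoothed IDF.
    bags = [Counter(tokenize(t)) for t in texts]
    df = Counter(term for bag in bags for term in bag)
    n = len(bags)
    return [{t: tf * (math.log((1 + n) / (1 + df[t])) + 1.0)
             for t, tf in bag.items()} for bag in bags]

def sparse_cosine(u, v):
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def dense_cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def concept_representation(docs, concept_texts):
    # ESA-style step (simplified): each document becomes a vector of its
    # similarities to the concept descriptions, using a shared vocabulary.
    vectors = tfidf(docs + concept_texts)
    doc_vecs, con_vecs = vectors[:len(docs)], vectors[len(docs):]
    return [[sparse_cosine(d, c) for c in con_vecs] for d in doc_vecs]

def agglomerative(vectors, k):
    # Average-linkage agglomerative clustering, stopped at k clusters.
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sims = [dense_cosine(vectors[i], vectors[j])
                        for i in clusters[a] for j in clusters[b]]
                avg = sum(sims) / len(sims)
                if best is None or avg > best[0]:
                    best = (avg, a, b)
        _, a, b = best
        merged = clusters.pop(b)  # b > a, so index a stays valid
        clusters[a].extend(merged)
    return sorted(sorted(c) for c in clusters)

# Invented concept descriptions (assumption: real ones would come from DBpedia).
concepts = {
    "Rough set": "rough set approximation lower upper boundary reduct attribute",
    "Cluster analysis": "cluster clustering group similarity partition hierarchical",
}
# Invented toy documents standing in for article texts.
docs = [
    "attribute reduct computation with rough set approximation",
    "lower and upper approximation of a rough set",
    "hierarchical clustering finds a partition by similarity",
    "group documents by similarity using clustering",
]

concept_vectors = concept_representation(docs, list(concepts.values()))
clusters = agglomerative(concept_vectors, k=2)
print(clusters)
```

On this toy input the two rough-set texts and the two clustering texts fall into separate groups, because each document shares vocabulary with only one concept description. Stopping at a fixed `k` is a simplification; a full dendrogram could instead be cut at a level chosen with a cluster quality measure.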