A graph-based topic extraction method enabling simple interactive customization

  • Authors: Ajitesh Srivastava; Axel J. Soto; Evangelos Milios

  • Affiliations: Birla Institute of Technology and Science, Pilani, India; Dalhousie University, Halifax, NS, Canada; Dalhousie University, Halifax, NS, Canada

  • Venue: Proceedings of the 2013 ACM symposium on Document engineering
  • Year: 2013

Abstract

It is often desirable to identify the concepts that are present in a corpus. A popular approach is to discover clusters of words, or topics, and many algorithms for this exist in the literature. Yet most of these methods lack the interpretability that would let a user unfamiliar with their inner workings interact with them. This paper proposes a graph-based topic extraction algorithm, which can also be viewed as a soft clustering of the words present in a given corpus. Each topic, in the form of a set of words, represents an underlying concept in the corpus. The clustering process is easy to interpret, which makes room for user involvement at various steps. For a quantitative evaluation of the extracted topics, we use them as features to obtain a compact representation of documents for classification tasks. We compare the classification accuracy achieved with the reduced feature set obtained by our method against that of two other topic extraction techniques, namely Latent Dirichlet Allocation and Non-negative Matrix Factorization. While the results of all three algorithms are comparable, the speed and easy interpretability of our algorithm make it more appropriate for interactive use by lay users.
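
The evaluation idea, using topics as features for a compact document representation, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how feature values are computed, so the weighting below (the fraction of a document's tokens that fall in each topic's word set) is an assumption, and the function name `topic_features` and the example topics are hypothetical.

```python
from collections import Counter


def topic_features(doc_tokens, topics):
    """Map a tokenized document to one feature per topic.

    Assumption: each feature is the fraction of the document's tokens
    that belong to that topic's word set. Topics may overlap (soft
    clustering), so one token can contribute to several features.
    """
    counts = Counter(doc_tokens)
    total = sum(counts.values()) or 1  # avoid division by zero on empty documents
    return [sum(counts[w] for w in topic) / total for topic in topics]


# Hypothetical topics; "bank" appears in both to illustrate soft clustering.
topics = [
    {"bank", "loan", "interest"},
    {"river", "bank", "water"},
]
doc = "the bank raised the interest on the loan".split()
features = topic_features(doc, topics)
print(features)
```

The resulting vector has one entry per topic rather than one per vocabulary word, which is what makes the representation compact; such vectors could then be passed to any standard classifier.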