Discovering emerging topics in unlabelled text collections

  • Authors:
  • Rene Schult;Myra Spiliopoulou

  • Affiliations:
  • Institute of Technical and Business Information Systems, Otto-von-Guericke-University Magdeburg, Germany;Institute of Technical and Business Information Systems, Otto-von-Guericke-University Magdeburg, Germany

  • Venue:
  • ADBIS'06 Proceedings of the 10th East European conference on Advances in Databases and Information Systems
  • Year:
  • 2006

Abstract

As document collections accumulate over time, some of the subjects discussed in them fall out of fashion while new ones emerge, so existing classification schemes need to be updated. In this paper, we address the challenge of finding emerging and persistent “themes”, i.e. subjects that live long enough to be incorporated into a taxonomy or ontology describing the document collection. We focus on identifying cluster labels that “survive” changes in the composition of the underlying document population, including changes in the feature space of dominant words, because the terminology of the document archive also changes over time. We report a set of promising experiments on identifying the themes that emerged in section H2.8 of the ACM digital library, and we compare them with the classes that the ACM taxonomy provides for this section.
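The idea of cluster labels that "survive" across time periods can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not specify how labels are matched, so the sketch assumes that each period has already been clustered, that each cluster label is a set of dominant words, and that a label survives into the next period when its Jaccard overlap with some label there reaches a threshold. The names `jaccard`, `persistent_themes`, `min_overlap` and `min_periods`, and the toy data, are all hypothetical.

```python
from typing import FrozenSet, List

Label = FrozenSet[str]  # assumed representation: a label is a set of dominant words


def jaccard(a: Label, b: Label) -> float:
    """Jaccard overlap between two word sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def persistent_themes(
    labels_per_period: List[List[Label]],
    min_overlap: float = 0.5,   # assumed threshold for "the same theme"
    min_periods: int = 2,       # assumed minimum lifetime of a persistent theme
) -> List[List[Label]]:
    """Chain labels across consecutive periods and keep the long-lived chains."""
    active: List[List[Label]] = []    # chains that reached the previous period
    finished: List[List[Label]] = []  # chains that could not be continued

    for labels in labels_per_period:
        next_active: List[List[Label]] = []
        matched = set()
        for chain in active:
            # Most similar label in the current period, if any exists.
            best = max(labels, key=lambda l: jaccard(chain[-1], l), default=None)
            if best is not None and jaccard(chain[-1], best) >= min_overlap:
                next_active.append(chain + [best])
                matched.add(best)
            else:
                finished.append(chain)
        # Labels without a predecessor start new chains (candidate emerging themes).
        for label in labels:
            if label not in matched:
                next_active.append([label])
        active = next_active

    finished.extend(active)
    return [chain for chain in finished if len(chain) >= min_periods]


# Toy data, invented for illustration only.
periods = [
    [frozenset({"mining", "data", "pattern"}), frozenset({"spatial", "index", "query"})],
    [frozenset({"mining", "data", "stream"}), frozenset({"xml", "query", "schema"})],
    [frozenset({"mining", "stream", "data"}), frozenset({"xml", "schema", "web"})],
]
themes = persistent_themes(periods)
```

In this toy run, the "mining" label is tracked through all three periods even though its vocabulary drifts, the "xml" label emerges in the second period and persists, and the "spatial" label disappears after the first. A real setting would also have to deal with cluster splits and merges, which this sketch ignores.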