Efficient system for clustering of dynamic document database

  • Authors:
  • Pawel Foszner;Aleksandra Gruca;Andrzej Polanski

  • Affiliations:
  • Silesian University of Technology, Institute of Informatics, Gliwice, Poland;Silesian University of Technology, Institute of Informatics, Gliwice, Poland;Silesian University of Technology, Institute of Informatics, Gliwice, Poland

  • Venue:
  • CDVE'11 Proceedings of the 8th international conference on Cooperative design, visualization, and engineering
  • Year:
  • 2011

Abstract

In this paper we describe a system that groups, classifies, and finds latent semantic features in a database composed of a large number of documents. The database grows constantly as the users who co-create it add new documents. Users require the system to provide information about both a specific document and the entire set of documents. This information includes statistical data about words in documents, information about the aspects in which these words appear, classification, clustering, etc. To meet these expectations, we propose using methods that search for hidden patterns in multivariate data. We apply machine learning algorithms for data analysis that are useful in identifying local patterns in multivariate data. We consider two algorithms described in the literature: (1) the Probabilistic Latent Semantic Analysis method [2] and (2) the Nonnegative Matrix Factorization algorithm described in [4] and used in the text analysis system [1].
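
As an illustration of the second approach, the sketch below factorizes a small term-document matrix V into nonnegative factors W (terms by latent features) and H (latent features by documents), then assigns each document to its strongest latent feature. It is a minimal sketch, not the system described in the paper: it assumes the widely used multiplicative update rules for the squared-error objective, which may differ from the variant in [4], and the toy matrix, the function names (nmf, cluster_documents) and all parameter values are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - W*H (the reconstruction error)."""
    wh = matmul(w, h)
    return sum(
        (v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0]))
    )


def nmf(v, rank, iterations=500, seed=0, eps=1e-9):
    """Factorize a nonnegative matrix V (terms x documents) as W*H.

    Assumption: multiplicative updates for the squared-error objective;
    rank, iterations, seed and eps are illustrative choices only.
    """
    rng = random.Random(seed)
    n_terms, n_docs = len(v), len(v[0])
    # Strictly positive random start so that no entry is stuck at zero.
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n_terms)]
    h = [[rng.random() + 0.1 for _ in range(n_docs)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [
            [h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_docs)]
            for k in range(rank)
        ]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [
            [w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
            for i in range(n_terms)
        ]
    return w, h


def cluster_documents(h):
    """Assign each document (column of H) to its largest latent feature."""
    return [max(range(len(h)), key=lambda k: h[k][j]) for j in range(len(h[0]))]


# Hypothetical toy data: rows are terms, columns are documents.
# Documents 0-1 share the first two terms, documents 2-3 the last two.
TOY_MATRIX = [
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 3, 2],
    [0, 0, 2, 3],
]

if __name__ == "__main__":
    w, h = nmf(TOY_MATRIX, rank=2)
    print("cluster labels:", cluster_documents(h))
    print("reconstruction error:", frobenius_error(TOY_MATRIX, w, h))
```

In a growing database, a factorization like this would have to be recomputed or updated as new document columns are appended to V; how the described system handles that is not stated in the abstract, so the sketch covers only the static case.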