Concepts and effectiveness of the cover-coefficient-based clustering methodology for text databases

  • Authors: Fazli Can; Esen A. Ozkarahan
  • Affiliations: Miami Univ., Oxford, OH; Pennsylvania State Univ., Erie
  • Venue: ACM Transactions on Database Systems (TODS)
  • Year: 1990

Abstract

A new algorithm for document clustering is introduced. The base concept of the algorithm, the cover coefficient (CC) concept, provides a means of estimating the number of clusters within a document database and relates indexing and clustering analytically. The CC concept is also used to identify the cluster seeds and to form clusters with these seeds. It is shown that the complexity of the clustering process is very low. The retrieval experiments show that the information-retrieval effectiveness of the algorithm is compatible with a very demanding complete linkage clustering method that is known to have good retrieval performance. The experiments also show that the algorithm is 15.1 to 63.5 (with an average of 47.5) percent better than four other clustering algorithms in cluster-based information retrieval. The experiments have validated the indexing-clustering relationships and the complexity of the algorithm and have shown improvements in retrieval effectiveness. Two document databases are used in the experiments: TODS214 and INSPEC. The latter is a common database with 12,684 documents.
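
The abstract does not state the formulas behind the cluster-count estimate, so the following is only a minimal sketch under an assumption: that each document's "decoupling coefficient" is the diagonal of the cover-coefficient matrix, computed from a document-by-term weight matrix as the row-normalized, column-normalized sum of squared weights, and that the estimated number of clusters is the sum of these values. The function names and the list-of-lists matrix layout are hypothetical, chosen for illustration.

```python
def decoupling_coefficients(D):
    """Return one decoupling coefficient per document.

    D is a list of rows (documents), each a list of non-negative term
    weights. Assumed definition (not given in the abstract):
        delta_i = (1 / row_sum_i) * sum_k (d_ik ** 2 / col_sum_k)
    """
    m, n = len(D), len(D[0])
    row_sums = [sum(row) for row in D]
    col_sums = [sum(D[i][k] for i in range(m)) for k in range(n)]
    if any(s == 0 for s in row_sums):
        raise ValueError("every document needs at least one non-zero term")

    deltas = []
    for i in range(m):
        # Terms that occur in no document contribute nothing.
        covered = sum(
            D[i][k] ** 2 / col_sums[k] for k in range(n) if col_sums[k] > 0
        )
        deltas.append(covered / row_sums[i])
    return deltas


def estimate_cluster_count(D):
    """Assumed estimate: the sum of the decoupling coefficients."""
    return sum(decoupling_coefficients(D))


# Two identical documents plus one document with disjoint terms.
example = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]
print(estimate_cluster_count(example))
```

Under this assumed definition, documents that share all their terms split one unit of "cluster mass" between them, while a document with no term overlap contributes a full unit, which is why the toy matrix above yields an estimate of two clusters.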