On Knowledge-Enhanced Document Clustering

  • Authors:
  • Manjeet Rege;Josan Koruthu;Reynold Bailey

  • Affiliations:
  • Rochester Institute of Technology, Rochester, NY, USA;Rochester Institute of Technology, Rochester, NY, USA;Rochester Institute of Technology, Rochester, NY, USA

  • Venue:
  • International Journal of Information Retrieval Research
  • Year:
  • 2012

Abstract

Document clustering plays an important role in text analytics by finding natural groupings of documents based on their similarity, as determined by the words appearing in them. Many of the clustering algorithms accessible through text analytics tools are completely unsupervised. That is, they cannot incorporate any domain knowledge that might be available about the documents to improve clustering accuracy and relevance. The authors present a graph-partitioning-based semi-supervised document clustering algorithm. The user provides knowledge about a few of the documents in the form of "must-link" and "cannot-link" constraints between pairs of documents. A "must-link" constraint between two documents expresses the user's judgment that the two documents must be clustered together irrespective of their dissimilarity. Similarly, a "cannot-link" constraint signifies that the two documents should never be clustered together, no matter how similar they might happen to be. These constraints are then incorporated into a computationally efficient graph-partitioning-based document clustering algorithm. The proposed framework is validated through experiments performed on publicly available text datasets.
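
To make the role of the constraints concrete, the following is a minimal sketch, not the authors' algorithm. It assumes documents are compared by cosine similarity over raw word counts, that a "must-link" is encoded by setting the pair's similarity to 1 and a "cannot-link" by setting it to 0, and that the graph is then split into connected components of edges at or above a threshold. That thresholding step is only a stand-in for the graph partitioning method, which the abstract does not specify, and it treats "cannot-link" softly: two such documents could still be joined through a third. All function names, the sample documents, and the threshold value are hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def constrained_affinity(docs, must_link=(), cannot_link=()):
    """Word-based similarity matrix with user constraints written over it."""
    bags = [Counter(d.lower().split()) for d in docs]
    n = len(bags)
    w = [[cosine(bags[i], bags[j]) for j in range(n)] for i in range(n)]
    for i, j in must_link:      # assumption: encode must-link as maximal similarity
        w[i][j] = w[j][i] = 1.0
    for i, j in cannot_link:    # assumption: encode cannot-link by removing the edge
        w[i][j] = w[j][i] = 0.0
    return w


def threshold_components(w, threshold=0.3):
    """Stand-in partitioner: connected components over edges >= threshold."""
    n = len(w)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            u = stack.pop()
            for v in range(n):
                if labels[v] == -1 and w[u][v] >= threshold:
                    labels[v] = current
                    stack.append(v)
        current += 1
    return labels


# Hypothetical toy corpus: two documents about phones, two about harvests.
docs = [
    "apple releases new phone",
    "apple phone sales rise",
    "apple orchard harvest begins",
    "farmers start fruit harvest",
]

unsupervised = threshold_components(constrained_affinity(docs))
with_must_link = threshold_components(constrained_affinity(docs, must_link=[(2, 3)]))
print(unsupervised, with_must_link)
```

In this toy corpus the two harvest documents share only one word, so the unsupervised run leaves them in separate clusters. Adding a single must-link between them merges them, which is the kind of correction the constraints are meant to supply.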