Combining preference- and content-based approaches for improving document clustering effectiveness

  • Authors:
  • Chih-Ping Wei; Chin-Sheng Yang; Han-Wei Hsiao; Tsang-Hsiang Cheng

  • Affiliations:
  • Department of Information Management, College of Management, National Sun Yat-sen University, Kaohsiung, Taiwan, ROC; Department of Information Management, College of Management, National Sun Yat-sen University, Kaohsiung, Taiwan, ROC; Department of Information Management, National University of Kaohsiung, Kaohsiung, Taiwan, ROC; Department of Business Administration, Southern Taiwan University of Technology, Tainan, Taiwan, ROC

  • Venue:
  • Information Processing and Management: an International Journal
  • Year:
  • 2006

Abstract

E-commerce and knowledge management applications generate and consume tremendous amounts of online information that is typically available as textual documents. To facilitate subsequent access to and use of these textual documents, efficient and effective management of the ever-increasing volume of documents is essential to both organizations and individuals. Document management practices suggest the popularity of using categories (e.g., folders) for organizing, archiving, and accessing documents. Document clustering represents an appealing approach to enable organizations or individuals to create and maintain document categories automatically. Existing document clustering techniques usually group similar documents together on the basis of their textual content similarity. However, such content-based approaches operate at the lexical level and suffer greatly from the word mismatch problem, in which related documents use different terms for the same concept. Therefore, this study aims to address this problem by exploiting users' document grouping preferences, as exhibited in those individuals' folder sets, to support document clustering. Specifically, we propose a hybrid document clustering technique that combines preference- and content-based approaches. Using a traditional content-based technique and a preference/content switching document clustering technique as performance benchmarks, our empirical evaluation results show that the proposed hybrid technique improves clustering effectiveness as measured by both cluster precision and cluster recall.
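The abstract does not say how the preference and content signals are combined, so the following is only an illustrative sketch of the general idea, not the authors' technique. It assumes a simple weighted sum of two pairwise similarities: cosine similarity over raw term counts, and the fraction of users who placed both documents in the same folder. The function names, the `alpha` weight, and the sample data are all hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def preference_similarity(d1, d2, folder_sets):
    """Among users who filed both documents, the share who put them in
    the same folder (an assumed measure, not taken from the paper)."""
    together = both = 0
    for folders in folder_sets:  # one list of folders (sets of doc ids) per user
        f1 = next((i for i, f in enumerate(folders) if d1 in f), None)
        f2 = next((i for i, f in enumerate(folders) if d2 in f), None)
        if f1 is not None and f2 is not None:
            both += 1
            together += (f1 == f2)
    return together / both if both else 0.0

def hybrid_similarity(d1, d2, texts, folder_sets, alpha=0.5):
    """Assumed combination: alpha * preference + (1 - alpha) * content."""
    content = cosine(Counter(texts[d1].lower().split()),
                     Counter(texts[d2].lower().split()))
    pref = preference_similarity(d1, d2, folder_sets)
    return alpha * pref + (1 - alpha) * content

# Hypothetical data: "a" and "b" share no terms (word mismatch),
# yet both users file them in the same folder.
texts = {
    "a": "car engine repair",
    "b": "automobile motor maintenance",
    "c": "car engine parts",
}
folder_sets = [
    [{"a", "b"}, {"c"}],   # user 1
    [{"a", "b", "c"}],     # user 2
]
```

In this toy data the content similarity of "a" and "b" is zero, so a purely content-based clusterer would never group them; the preference term is what lets the hybrid score rise above zero. The resulting pairwise scores could then be passed to any similarity-based clustering algorithm.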