Exploiting concept clusters for content-based information retrieval

  • Authors:
  • Bo-Yeong Kang;Dae-Won Kim;Sang-Jo Lee

  • Affiliations:
  • Department of Computer Engineering, Kyungpook National University, Sangyuk-dong, Puk-gu, Daegu 702-701, Republic of Korea;Department of Computer Science, KAIST, 373-1, Guseong-dong, Yuseong-gu, Daejeon 305-701, Republic of Korea;Department of Computer Engineering, Kyungpook National University, Sangyuk-dong, Puk-gu, Daegu 702-701, Republic of Korea

  • Venue:
  • Information Sciences—Informatics and Computer Science: An International Journal
  • Year:
  • 2005

Abstract

Current approaches to index weighting for information retrieval from texts are based on statistical analysis of the texts' contents. A key shortcoming of these indexing schemes, which consider only the terms in a document, is that they cannot extract indexes that accurately represent the semantic content of a document. To address this issue, we propose a new indexing formalism that considers not only the terms in a document, but also the concepts. In the proposed method, concepts are extracted by exploiting clusters of semantically related terms, referred to as concept clusters. Through experiments on the TREC-2 collection of Wall Street Journal documents, we show that the proposed method outperforms an indexing method based on term frequency (TF), especially for the highest-ranked documents. Moreover, the index term dimension was 53.3% lower for the proposed method than for the TF-based method, which is expected to significantly reduce document search time in a real environment.
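The contrast between term-level and concept-level indexing can be illustrated with a toy sketch. It is not the authors' algorithm: the abstract does not say how concept clusters are built or weighted, so the `RELATED` table, the concept labels, and both function names below are hypothetical stand-ins chosen for illustration. The sketch only shows why grouping semantically related terms can shrink the index term dimension compared with a plain TF index.

```python
from collections import Counter

# Hypothetical hand-made table mapping terms to a shared concept label.
# A real system would derive such clusters from a lexical resource or a
# relatedness measure; this table is an assumption, not from the paper.
RELATED = {
    "stock": "finance",
    "share": "finance",
    "market": "finance",
    "rain": "weather",
    "storm": "weather",
}


def tf_index(tokens):
    """Baseline: one index entry per distinct term, weighted by frequency."""
    return Counter(tokens)


def concept_index(tokens):
    """Toy concept-level index: terms in the same cluster share one entry.

    Terms with no known cluster are kept as their own index entry.
    """
    return Counter(RELATED.get(token, token) for token in tokens)


doc = "stock market share storm market rain".split()
tf = tf_index(doc)
ci = concept_index(doc)

print(len(tf), "term-level index entries")
print(len(ci), "concept-level index entries")
```

In this toy document, related terms collapse into a single weighted entry per concept, so the concept-level index has fewer dimensions than the TF index. The size of the reduction here depends entirely on the invented table and says nothing about the 53.3% figure reported above.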