Phrase-based Document Similarity Based on an Index Graph Model

  • Authors:
  • Khaled M. Hammouda; Mohamed S. Kamel

  • Venue:
  • ICDM '02 Proceedings of the 2002 IEEE International Conference on Data Mining
  • Year:
  • 2002

Abstract

Document clustering techniques mostly rely on single-term analysis of the document data set, such as the Vector Space Model. To better capture the structure of documents, the underlying data model should be able to represent the phrases in the document as well as single terms. We present a novel data model, the Document Index Graph, which indexes web documents based on phrases, rather than single terms only. The semi-structured web documents help in identifying potential phrases that, when matched with other documents, indicate strong similarity between the documents. The Document Index Graph captures this information, and finding significant matching phrases between documents becomes easy and efficient with such a model. The similarity between documents is based on both single-term weights and matching-phrase weights. The combined similarities are used with standard document clustering techniques to test their effect on the clustering quality. Experimental results show that our phrase-based similarity, combined with single-term similarity measures, enhances web document clustering quality significantly.
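
To make the idea in the abstract concrete, the following is a minimal sketch of a phrase-aware index and a combined similarity score. It is not the authors' Document Index Graph or their similarity formula, which the abstract does not spell out. The class name `PhraseIndex`, the function `similarity`, the length-based phrase score, the raw-count cosine, and the blending weight `alpha` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
import math


class PhraseIndex:
    """Toy word graph: each edge is a pair of consecutive words and
    records, per document, the positions where that pair occurs."""

    def __init__(self):
        # (word, next_word) -> doc_id -> set of start positions
        self.edges = defaultdict(lambda: defaultdict(set))
        self.docs = {}

    def add(self, doc_id, text):
        words = text.lower().split()
        self.docs[doc_id] = words
        for i in range(len(words) - 1):
            self.edges[(words[i], words[i + 1])][doc_id].add(i)

    def shared_phrases(self, a, b):
        """Runs of two or more consecutive words from document a that
        also appear consecutively in document b."""
        wa = self.docs[a]
        phrases = set()
        active = {}  # position in b of the last matched word -> run start in a
        for i in range(len(wa) - 1):
            hits = self.edges.get((wa[i], wa[i + 1]), {}).get(b, ())
            # Extend runs that continue in b, or start new ones at i.
            extended = {j + 1: active.get(j, i) for j in hits}
            # Runs that could not be extended end at word i.
            for j, start in active.items():
                if j not in hits:
                    phrases.add(tuple(wa[start:i + 1]))
            active = extended
        for start in active.values():
            phrases.add(tuple(wa[start:]))
        return phrases


def cosine(words_a, words_b):
    """Single-term similarity: cosine over raw term counts (assumed weighting)."""
    ca, cb = Counter(words_a), Counter(words_b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * \
        math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def similarity(index, a, b, alpha=0.5):
    """Blend phrase and single-term similarity; alpha is a hypothetical weight."""
    wa, wb = index.docs[a], index.docs[b]
    matched = sum(len(p) for p in index.shared_phrases(a, b))
    # Assumed phrase score: matched phrase length relative to the longer
    # document, capped at 1 because nested phrases can be counted twice.
    phrase_sim = min(1.0, matched / max(len(wa), len(wb), 1))
    return alpha * phrase_sim + (1 - alpha) * cosine(wa, wb)
```

In this sketch, documents are added with `index.add(doc_id, text)`, and `similarity(index, a, b)` returns a score between 0 and 1. The resulting pairwise scores could then be passed to any standard clustering routine that accepts a similarity matrix.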