Document Similarity Using a Phrase Indexing Graph Model

Authors:
Khaled M. Hammouda;Mohamed S. Kamel
Affiliations:
University of Waterloo, Department of Systems Design Engineering, N2L 3G1, Waterloo, Ontario, Canada;University of Waterloo, Department of Systems Design Engineering, N2L 3G1, Waterloo, Ontario, Canada
Venue:
Knowledge and Information Systems
Year:
2004

Cited by:

Distributed collaborative Web document clustering using cluster keyphrase summaries
Information Fusion

S2S: structural-to-syntactic matching similar documents
Knowledge and Information Systems

Exploiting noun phrases and semantic relationships for text document clustering
Information Sciences: an International Journal

Cube index for unstructured text analysis and mining
Proceedings of the 2011 International Conference on Communication, Computing & Security

Linear scale semantic mining algorithms in Microsoft SQL Server's semantic platform
Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining

A measure based on optimal matching in graph theory for document similarity
AIRS'04 Proceedings of the 2004 international conference on Asian Information Retrieval Technology

Abstract

Most document clustering techniques rely on single-term analysis of text, as in the vector space model. To better capture the structure of documents, the underlying data model should represent the phrases in a document as well as its single terms. We present a novel data model, the Document Index Graph, which indexes Web documents by phrases rather than by single terms only. The semistructured nature of Web documents helps identify potential phrases that, when matched against other documents, indicate strong similarity between them. The Document Index Graph captures this information, and finding significant matching phrases between documents becomes easy and efficient with such a model. The model is flexible in that it can revert to a compact representation of the vector space model if phrases are not indexed. However, phrase indexing yields more accurate document similarity calculations. The similarity between documents is based on both single-term weights and matching-phrase weights. The combined similarities are used with standard document clustering techniques to test their effect on clustering quality. Experimental results show that our phrase-based similarity, combined with single-term similarity measures, gives a more accurate measure of document similarity and thus significantly enhances Web document clustering quality.
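
The idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes a toy graph in which nodes are words and edges link consecutive words within a sentence, a greedy search for shared word sequences, and a simple linear blend of phrase and single-term similarity. The class name `PhraseIndexGraph`, the blend weight `alpha`, and the phrase-coverage score are all hypothetical choices made for illustration; the abstract does not specify the actual weighting.

```python
import math
from collections import Counter, defaultdict


class PhraseIndexGraph:
    """Toy phrase index (illustrative only): edges link consecutive words."""

    def __init__(self):
        # (w1, w2) -> {doc_id: {(sentence_no, position_of_w1), ...}}
        self.edges = defaultdict(lambda: defaultdict(set))
        self.sentences = defaultdict(list)       # doc_id -> list of word lists
        self.term_counts = defaultdict(Counter)  # doc_id -> single-term counts

    def add_document(self, doc_id, sentences):
        for text in sentences:
            words = text.lower().split()
            s_no = len(self.sentences[doc_id])
            self.sentences[doc_id].append(words)
            self.term_counts[doc_id].update(words)
            for pos in range(len(words) - 1):
                self.edges[(words[pos], words[pos + 1])][doc_id].add((s_no, pos))

    def _occurrences(self, w1, w2, doc_id):
        by_doc = self.edges.get((w1, w2))
        return by_doc.get(doc_id, set()) if by_doc else set()

    def matching_phrases(self, doc_a, doc_b):
        """Greedily collect word sequences (length >= 2) found in both docs."""
        found = set()
        for words in self.sentences.get(doc_a, []):
            i = 0
            while i < len(words) - 1:
                live = set(self._occurrences(words[i], words[i + 1], doc_b))
                if not live:
                    i += 1
                    continue
                j = i + 1
                while j < len(words) - 1:
                    nxt = self._occurrences(words[j], words[j + 1], doc_b)
                    # keep only matches that continue at the next position
                    advanced = {(s, p + 1) for (s, p) in live} & nxt
                    if not advanced:
                        break
                    live = advanced
                    j += 1
                found.add(tuple(words[i:j + 1]))
                i = j  # the last matched word may start another phrase
        return found

    def term_similarity(self, doc_a, doc_b):
        """Cosine similarity over raw term counts (the single-term view)."""
        a = self.term_counts.get(doc_a, Counter())
        b = self.term_counts.get(doc_b, Counter())
        dot = sum(a[w] * b[w] for w in a)
        norm = (math.sqrt(sum(v * v for v in a.values()))
                * math.sqrt(sum(v * v for v in b.values())))
        return dot / norm if norm else 0.0

    def phrase_similarity(self, doc_a, doc_b):
        """Assumed score: share of words covered by matching phrases."""
        shared = sum(len(p) for p in self.matching_phrases(doc_a, doc_b))
        total = (sum(len(s) for s in self.sentences.get(doc_a, []))
                 + sum(len(s) for s in self.sentences.get(doc_b, [])))
        return min(1.0, 2 * shared / total) if total else 0.0

    def combined_similarity(self, doc_a, doc_b, alpha=0.5):
        """Linear blend; alpha is a hypothetical tuning parameter."""
        return (alpha * self.phrase_similarity(doc_a, doc_b)
                + (1 - alpha) * self.term_similarity(doc_a, doc_b))


g = PhraseIndexGraph()
g.add_document("a", ["fast river rafting trips"])
g.add_document("b", ["cheap river rafting trips today", "river fishing"])
print(g.matching_phrases("a", "b"))
print(g.combined_similarity("a", "b"))
```

Setting `alpha=0` in this sketch ignores the edges entirely and leaves only the term counts, which mirrors the abstract's remark that the model can fall back to a vector space representation when phrases are not indexed.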