Phrase-based Document Similarity Based on an Index Graph Model

Authors:
Khaled M. Hammouda; Mohamed S. Kamel
Venue:
ICDM '02 Proceedings of the 2002 IEEE International Conference on Data Mining
Year:
2002

Citing 0
Cited 16

Efficient Phrase-Based Document Indexing for Web Document Clustering
IEEE Transactions on Knowledge and Data Engineering

Automatic Pattern-Taxonomy Extraction for Web Mining
WI '04 Proceedings of the 2004 IEEE/WIC/ACM International Conference on Web Intelligence

Concept Learning of Text Documents
WI '04 Proceedings of the 2004 IEEE/WIC/ACM International Conference on Web Intelligence

Query-sets: using implicit feedback and query patterns to organize web documents
Proceedings of the 17th international conference on World Wide Web

Distributed collaborative Web document clustering using cluster keyphrase summaries
Information Fusion

Filtering and Sophisticated Data Processing for Web Information Gathering
RSEISP '07 Proceedings of the international conference on Rough Sets and Intelligent Systems Paradigms

A distance-relatedness dynamic model for clustering high dimensional data of arbitrary shapes and densities
Pattern Recognition

Using Link-Based Content Analysis to Measure Document Similarity Effectively
APWeb/WAIM '09 Proceedings of the Joint International Conferences on Advances in Data and Web Management

Ontology-based relevance analysis for automatic reference tracking
International Journal of Computer Applications in Technology

Depth First Rule Generation for Text Categorization
Proceedings of the 2006 conference on Advances in Intelligent IT: Active Media Technology 2006

Incremental Document Clustering Based on Graph Model
ADMA '09 Proceedings of the 5th International Conference on Advanced Data Mining and Applications

Efficient approach for incremental Vietnamese document clustering
Proceedings of the eleventh international workshop on Web information and data management

Comparative evaluation of ontology-based Automatic Reference Tracking (ART)
International Journal of Networking and Virtual Organisations

Similarity analysis of legal judgments
COMPUTE '11 Proceedings of the Fourth Annual ACM Bangalore Conference

Improving suffix tree clustering with new ranking and similarity measures
ADMA'11 Proceedings of the 7th international conference on Advanced Data Mining and Applications - Volume Part II

Using maximal spanning trees and word similarity to generate hierarchical clusters of non-redundant RSS news articles
Journal of Intelligent Information Systems

Abstract

Document clustering techniques mostly rely on single-term analysis of the document data set, such as the Vector Space Model. To better capture the structure of documents, the underlying data model should be able to represent the phrases in the document as well as single terms. We present a novel data model, the Document Index Graph, which indexes web documents based on phrases, rather than single terms only. The semi-structured web documents help in identifying potential phrases that, when matched with other documents, indicate strong similarity between the documents. The Document Index Graph captures this information, and finding significant matching phrases between documents becomes easy and efficient with such a model. The similarity between documents is based on both single-term weights and matching-phrase weights. The combined similarities are used with standard document clustering techniques to test their effect on the clustering quality. Experimental results show that our phrase-based similarity, combined with single-term similarity measures, enhances web document clustering quality significantly.
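
The abstract does not spell out the graph structure or the similarity formulas, so the following Python sketch is only an illustration of the general idea, not the authors' method. It assumes a simplified index in which nodes are words and edges are consecutive word pairs annotated with the documents and positions where they occur. The class `PhraseIndex`, the functions `phrase_similarity`, `term_similarity`, and `combined_similarity`, the blending weight `alpha`, and the normalization used for the phrase score are all hypothetical choices made for this example.

```python
import math
from collections import Counter, defaultdict


class PhraseIndex:
    """Toy word-adjacency index (illustrative; not the paper's exact structure)."""

    def __init__(self):
        # (word, next_word) -> doc_id -> {(sentence_no, position_of_word)}
        self.edges = defaultdict(lambda: defaultdict(set))
        self.docs = {}  # doc_id -> list of tokenized sentences

    def add(self, doc_id, sentences):
        tokenized = [s.lower().split() for s in sentences]
        self.docs[doc_id] = tokenized
        for s_no, words in enumerate(tokenized):
            for pos in range(len(words) - 1):
                self.edges[(words[pos], words[pos + 1])][doc_id].add((s_no, pos))

    def _occurrences(self, w1, w2, doc_id):
        # .get avoids creating empty entries in the defaultdicts
        return self.edges.get((w1, w2), {}).get(doc_id, set())

    def shared_phrases(self, a, b):
        """Greedy left-to-right scan of doc a for word runs (length >= 2)
        that also appear contiguously within one sentence of doc b."""
        found = set()
        for words in self.docs[a]:
            start = 0
            while start < len(words) - 1:
                # start positions in b where the current run could begin
                occ = self._occurrences(words[start], words[start + 1], b)
                end = start + 1  # index of the last matched word in a
                while occ and end < len(words) - 1:
                    nxt = self._occurrences(words[end], words[end + 1], b)
                    offset = end - start
                    cont = {(s, p) for (s, p) in occ if (s, p + offset) in nxt}
                    if not cont:
                        break
                    occ = cont
                    end += 1
                if occ:
                    found.add(" ".join(words[start:end + 1]))
                    start = end  # the last word may begin another phrase
                else:
                    start += 1
        return found


def phrase_similarity(index, a, b):
    # Assumed ad hoc score: words covered by shared phrases over total words.
    total = sum(len(s) for d in (a, b) for s in index.docs[d])
    if total == 0:
        return 0.0
    covered = sum(len(p.split()) for p in index.shared_phrases(a, b))
    return min(1.0, 2.0 * covered / total)


def term_similarity(index, a, b):
    # Cosine similarity over raw term frequencies (single-term view).
    ca = Counter(w for s in index.docs[a] for w in s)
    cb = Counter(w for s in index.docs[b] for w in s)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * \
        math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def combined_similarity(index, a, b, alpha=0.5):
    # alpha is a hypothetical weight blending phrase and term scores.
    return alpha * phrase_similarity(index, a, b) + \
        (1 - alpha) * term_similarity(index, a, b)


index = PhraseIndex()
index.add("d1", ["river rafting trips", "wild river adventures"])
index.add("d2", ["mild river rafting", "river rafting vacation plan"])
print(index.shared_phrases("d1", "d2"))
print(combined_similarity(index, "d1", "d2"))
```

Storing positions on each edge is what lets the scan confirm that matched word pairs are actually contiguous in the other document, rather than merely co-occurring somewhere in it. A real implementation would also need the phrase and term weighting described in the paper itself, which this sketch does not attempt to reproduce.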