Efficient Phrase-Based Document Indexing for Web Document Clustering
IEEE Transactions on Knowledge and Data Engineering
Document clustering techniques mostly rely on single-term analysis of the document data set, such as the Vector Space Model. To better capture the structure of documents, the underlying data model should be able to represent the phrases in the document as well as single terms. We present a novel data model, the Document Index Graph, which indexes web documents based on phrases, rather than single terms only. The semi-structured nature of web documents helps identify potential phrases that, when matched with those in other documents, indicate strong similarity between the documents. The Document Index Graph captures this information, and finding significant matching phrases between documents becomes easy and efficient with such a model. The similarity between documents is based on both single-term weights and matching-phrase weights. The combined similarities are used with standard document clustering techniques to test their effect on the clustering quality. Experimental results show that our phrase-based similarity, combined with single-term similarity measures, enhances web document clustering quality significantly.
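
To make the idea concrete, the following is a minimal sketch of a phrase-aware index and a blended similarity score. It is not the paper's Document Index Graph or its similarity formula, neither of which is specified in the abstract. The class `PhraseIndex`, the coverage-based `phrase_similarity`, the raw term-frequency cosine, and the blending weight `alpha` are all hypothetical choices made for illustration.

```python
import math
from collections import Counter, defaultdict


class PhraseIndex:
    """Toy word graph (an assumption, not the paper's structure):
    an edge is a pair of adjacent words, annotated with the documents
    and positions where that pair occurs."""

    def __init__(self):
        # edges[(w1, w2)][doc_id] -> positions of w1 in that document
        self.edges = defaultdict(lambda: defaultdict(list))
        self.docs = {}

    def add(self, doc_id, text):
        words = text.lower().split()
        self.docs[doc_id] = words
        for i in range(len(words) - 1):
            self.edges[(words[i], words[i + 1])][doc_id].append(i)

    def _starts(self, w1, w2, doc_id):
        # Read-only lookup, so missing edges are not created as a side effect.
        return self.edges.get((w1, w2), {}).get(doc_id, [])

    def shared_phrases(self, a, b):
        """Word runs of document a (length >= 2) that also occur contiguously
        in document b, skipping runs contained in an earlier reported run."""
        words = self.docs[a]
        found, last_end = [], 0
        for i in range(len(words) - 1):
            ends = set(self._starts(words[i], words[i + 1], b))
            if not ends:
                continue
            j = i + 1
            # Extend the run while the next edge continues it inside b.
            while j < len(words) - 1:
                nxt = {p + 1 for p in ends} & set(
                    self._starts(words[j], words[j + 1], b))
                if not nxt:
                    break
                ends, j = nxt, j + 1
            if j > last_end:
                found.append(tuple(words[i:j + 1]))
                last_end = j
        return found


def term_similarity(words_a, words_b):
    """Cosine similarity over raw term frequencies (single-term view)."""
    ca, cb = Counter(words_a), Counter(words_b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * \
        math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def phrase_similarity(index, a, b):
    """Hypothetical measure: words covered by shared phrases, relative to
    the longer document, capped at 1.0."""
    longest = max(len(index.docs[a]), len(index.docs[b]))
    if longest == 0:
        return 0.0
    covered = sum(len(p) for p in index.shared_phrases(a, b))
    return min(1.0, covered / longest)


def combined_similarity(index, a, b, alpha=0.5):
    """Blend phrase and single-term similarity; alpha is an assumed weight."""
    return (alpha * phrase_similarity(index, a, b)
            + (1 - alpha) * term_similarity(index.docs[a], index.docs[b]))


index = PhraseIndex()
index.add("d1", "the quick brown fox jumps")
index.add("d2", "a quick brown fox sleeps")
print(index.shared_phrases("d1", "d2"))
print(combined_similarity(index, "d1", "d2"))
```

The resulting pairwise scores could then be passed to any standard clustering routine that accepts a similarity matrix. The choice of `alpha` controls how much weight shared phrases receive relative to single terms and would need to be tuned on real data.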