Efficient Phrase-Based Document Indexing for Web Document Clustering

Authors: Khaled M. Hammouda; Mohamed S. Kamel
Affiliations: IEEE; IEEE
Venue: IEEE Transactions on Knowledge and Data Engineering
Year: 2004

Abstract

Document clustering techniques mostly rely on single-term analysis of the document data set, as in the Vector Space Model. More accurate clustering requires more informative features, notably phrases and their weights. Document clustering is useful in many applications, such as automatic categorization of documents, grouping search engine results, and building a taxonomy of documents. This paper presents two key parts of successful document clustering. The first is a novel phrase-based document index model, the Document Index Graph, which allows incremental construction of a phrase-based index of the document set with an emphasis on efficiency, rather than relying on single-term indexes only. It provides efficient phrase matching that is used to judge the similarity between documents. The model is flexible in that it can revert to a compact representation of the vector space model if phrases are not indexed. The second is an incremental document clustering algorithm that maximizes the tightness of clusters by carefully watching the pairwise document similarity distribution inside clusters. Together, these two components form an underlying model for robust and accurate document similarity calculation, which leads to much improved results in Web document clustering over traditional methods.
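
The idea of an incrementally built, phrase-based index can be illustrated with a toy sketch in Python. This is not the paper's Document Index Graph, whose details are not given in the abstract. It is a minimal illustration under stated assumptions: words are graph nodes, adjacent word pairs are edges, each edge records where it occurs, and a new document is matched against the documents already indexed while it is being added. The names `PhraseIndexGraph` and `add_document` are hypothetical, tokenization is plain whitespace splitting, and only shared phrases of two or more words are reported.

```python
from collections import defaultdict


class PhraseIndexGraph:
    """Toy word graph (an assumption, not the paper's exact model)."""

    def __init__(self):
        # (word, next_word) -> {doc_id: {(sentence_no, position), ...}}
        self.edges = defaultdict(lambda: defaultdict(set))

    def add_document(self, doc_id, sentences):
        """Index a document incrementally.

        Returns {earlier_doc_id: set of maximal shared phrases}, where each
        phrase is a tuple of at least two consecutive words.
        """
        shared = defaultdict(set)
        for s_no, sentence in enumerate(sentences):
            words = sentence.lower().split()
            # Matches ending on the previous edge:
            # (other_doc, other_sentence, other_position) -> start index here
            open_runs = {}
            for i in range(len(words) - 1):
                edge = (words[i], words[i + 1])
                next_runs = {}
                for other, places in self.edges.get(edge, {}).items():
                    if other == doc_id:
                        continue  # ignore earlier sentences of this document
                    for o_sent, o_pos in places:
                        # Extend a match that ended on the preceding edge of
                        # the same sentence, otherwise start a new match.
                        start = open_runs.get((other, o_sent, o_pos - 1), i)
                        next_runs[(other, o_sent, o_pos)] = start
                # Matches that could not be extended are complete phrases.
                for (other, o_sent, o_pos), start in open_runs.items():
                    if (other, o_sent, o_pos + 1) not in next_runs:
                        shared[other].add(tuple(words[start:i + 1]))
                open_runs = next_runs
            # Matches still open at the end of the sentence are complete too.
            for (other, _, _), start in open_runs.items():
                shared[other].add(tuple(words[start:]))
            # Only now add this sentence's own edges to the graph.
            for i in range(len(words) - 1):
                self.edges[(words[i], words[i + 1])][doc_id].add((s_no, i))
        return dict(shared)


graph = PhraseIndexGraph()
graph.add_document("d1", ["river rafting trips", "mild river rafting"])
graph.add_document("d2", ["wild river rafting vacation plan"])
matches = graph.add_document("d3", ["fishing trips", "river rafting vacation plan"])
# matches["d1"] contains ("river", "rafting");
# matches["d2"] contains ("river", "rafting", "vacation", "plan")
```

Because matching happens during insertion, each new document is compared only against edges it actually traverses, which is the kind of incremental behavior the abstract emphasizes. How shared phrases are weighted into a similarity score is left out here, since the abstract does not specify it.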