Document clustering techniques mostly rely on single-term analysis of the document data set, as in the Vector Space Model. More accurate clustering calls for more informative features, in particular phrases and their weights. Document clustering is useful in many applications, such as automatic categorization of documents, grouping search engine results, and building a taxonomy of documents. This paper presents two key parts of successful document clustering. The first is a novel phrase-based document index model, the Document Index Graph, which allows incremental construction of a phrase-based index of the document set with an emphasis on efficiency, rather than relying on single-term indexes only. It provides efficient phrase matching, which is used to judge the similarity between documents. The model is flexible in that it can revert to a compact representation of the vector space model if phrases are not indexed. The second is an incremental document clustering algorithm that maximizes the tightness of clusters by carefully monitoring the pair-wise document similarity distribution inside clusters. Together, these two components form an underlying model for robust and accurate document similarity calculation that leads to much improved results in Web document clustering over traditional methods.
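To make the idea of incremental phrase matching concrete, the following is a minimal sketch and not the Document Index Graph as defined in the paper. It assumes a much-simplified structure: words are nodes, each pair of adjacent words is an edge, and each edge records the documents and positions where it occurs. The class name `PhraseIndex`, the method `add_document`, and the whitespace tokenization are all hypothetical choices made for illustration. Adding a document returns the maximal word sequences it shares with each previously indexed document.

```python
from collections import defaultdict


class PhraseIndex:
    """Toy incremental phrase index (illustrative only, not the paper's model).

    Nodes are words; an edge is a pair of adjacent words. Each edge stores,
    per document, the positions at which the pair occurs.
    """

    def __init__(self):
        # (word, next_word) -> {doc_id: [positions of `word` in that document]}
        self.edges = defaultdict(lambda: defaultdict(list))

    def add_document(self, doc_id, text):
        """Index `text`; return {earlier_doc_id: set of shared phrases}.

        A phrase is a tuple of at least two consecutive words that also
        appear consecutively in the earlier document.
        """
        words = text.lower().split()  # assumption: naive whitespace tokens
        shared = defaultdict(set)
        # (other_doc, position_in_other) -> start index in `words` of a
        # match that is still being extended
        active = {}
        for i in range(len(words) - 1):
            edge = (words[i], words[i + 1])
            next_active = {}
            for other, positions in self.edges.get(edge, {}).items():
                for p in positions:
                    # Extend a match that ended at p - 1, else start a new one.
                    next_active[(other, p)] = active.get((other, p - 1), i)
            # Matches that could not be extended end at words[i].
            for (other, p), start in active.items():
                if (other, p + 1) not in next_active:
                    shared[other].add(tuple(words[start:i + 1]))
            active = next_active
        # Matches still open at the end of the document run to the last word.
        for (other, _), start in active.items():
            shared[other].add(tuple(words[start:]))
        # Insert this document's edges last so it never matches itself.
        for i in range(len(words) - 1):
            self.edges[(words[i], words[i + 1])][doc_id].append(i)
        return dict(shared)


if __name__ == "__main__":
    index = PhraseIndex()
    index.add_document("d1", "river rafting trips are fun")
    print(index.add_document("d2", "wild river rafting trips"))
```

Because matching happens while a document is inserted, no pass over the earlier documents is needed, which is the property the abstract emphasizes. A phrase-based similarity score could then be derived from the length and number of shared phrases; the specific weighting used in the paper is not given in the abstract and is deliberately left out here.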