Leveraging network structure for incremental document clustering

Authors:
Tieyun Qian;Jianfeng Si;Qing Li;Qian Yu
Affiliations:
State Key Laboratory of Software Engineering, Wuhan University, Wuhan, China and State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China;Department of Computer Science, City University of Hong Kong, Hong Kong, China;Department of Computer Science, City University of Hong Kong, Hong Kong, China;State Key Laboratory of Software Engineering, Wuhan University, Wuhan, China
Venue:
APWeb'12 Proceedings of the 14th Asia-Pacific international conference on Web Technologies and Applications
Year:
2012

Abstract

Recent studies have shown that link-based clustering methods can significantly improve on the performance of content-based clustering. However, most previous algorithms were developed for fixed data sets and are not applicable to dynamic environments such as data warehouses and online digital libraries. In this paper, we introduce a novel approach that leverages network structure for incremental clustering. Under this framework, both link and content information are used to determine the host cluster of a new document. Combining the two types of information yields promising clustering performance. Furthermore, the status of core members is used to quickly determine whether to split or merge a new cluster. This filtering step eliminates unnecessary and time-consuming checks of textual similarity against the whole corpus, and thus greatly speeds up the entire procedure. We evaluate the proposed approach on several real-world publication data sets and conduct an extensive comparison with both classic content-based algorithms and recent link-based algorithms. The experimental results demonstrate the effectiveness and efficiency of our method.
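
The abstract gives no formulas, so the following is only an illustrative sketch of the general idea of choosing a host cluster for a new document from both link and content evidence. It is not the authors' algorithm. The function names, the mixing weight `alpha`, the `threshold`, and the linear scoring rule are all assumptions made for this example, and the split/merge step based on core members is not covered.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_host_cluster(doc_terms, neighbors, cluster_of, centroids,
                        alpha=0.5, threshold=0.2):
    """Return a host cluster id for a new document, or None if none fits.

    doc_terms:  term -> weight for the new document
    neighbors:  ids of already-clustered documents the new one links to
    cluster_of: document id -> cluster id
    centroids:  cluster id -> (term -> weight) centroid vector
    alpha:      assumed mixing weight between link and content evidence
    threshold:  assumed minimum combined score needed to join a cluster
    """
    # Only clusters reachable through links become candidates, so text
    # similarity is never computed against the whole corpus.
    link_counts = Counter(cluster_of[n] for n in neighbors if n in cluster_of)
    total_links = sum(link_counts.values())

    best, best_score = None, threshold
    for cid, count in link_counts.items():
        link_score = count / total_links
        content_score = cosine(doc_terms, centroids[cid])
        score = alpha * link_score + (1 - alpha) * content_score
        if score > best_score:
            best, best_score = cid, score
    return best


# Toy data, invented for illustration only.
cluster_of = {"d1": "A", "d2": "A", "d3": "B"}
centroids = {
    "A": {"cluster": 1.0, "graph": 1.0},
    "B": {"protein": 1.0, "gene": 1.0},
}
new_doc = {"graph": 2.0, "cluster": 1.0}
host = assign_host_cluster(new_doc, ["d1", "d2", "d3"], cluster_of, centroids)
```

In this toy run the new document links mostly into cluster A and shares its vocabulary, so A is selected. A document with no links to clustered documents gets `None`, which a caller could treat as a signal to start a new cluster.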