An incremental document clustering algorithm based on a hierarchical agglomerative approach

Authors:
Kil Hong Joo;SooJung Lee
Affiliations:
Dept. of Computer Education, Gyeongin National University of Education, Inchon, Korea;Dept. of Computer Education, Gyeongin National University of Education, Inchon, Korea
Venue:
ICDCIT'05 Proceedings of the Second international conference on Distributed Computing and Internet Technology
Year:
2005



Abstract

Document clustering classifies a set of documents into groups of closely related documents, so that the resulting clusters can be used to browse and search the documents on a specific topic. In most such applications, new documents are added to the data set incrementally, and the number of words per document can vary widely. This paper proposes an incremental document clustering method for such a growing data set. The normalized inverse document frequency of a word in the data set is introduced to cope with the variation in the number of words per document. Furthermore, instead of the single similarity measure used by the average-link method, two similarity measures are used for document clustering: a cluster cohesion rate and a cluster participation rate. In addition, a category tree for the set of identified clusters is introduced to assist the incremental clustering of newly added documents. The performance of the proposed method is analyzed through a series of experiments to identify its various characteristics.
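
The abstract does not define the normalized inverse document frequency, the cluster cohesion rate, the cluster participation rate, or the category tree, so none of them is reproduced here. As a rough illustration of the baseline the paper builds on, the sketch below shows a generic incremental assignment step using a single average-link similarity. Everything in it is an assumption made for illustration: the names `cosine`, `average_link`, and `add_document`, the raw term-count vectors, the cosine measure, and the `threshold` value of 0.3.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def average_link(vec, cluster):
    """Mean similarity between one document and every member of a cluster."""
    return sum(cosine(vec, member) for member in cluster) / len(cluster)


def add_document(clusters, tokens, threshold=0.3):
    """Place a new document in the most similar cluster, or open a new one.

    `clusters` is a list of lists of term-count vectors and is modified
    in place. Returns the index of the cluster that received the document.
    The threshold is an illustrative value, not one taken from the paper.
    """
    vec = Counter(tokens)
    best_index, best_score = None, 0.0
    for index, cluster in enumerate(clusters):
        score = average_link(vec, cluster)
        if score > best_score:
            best_index, best_score = index, score
    if best_index is not None and best_score >= threshold:
        clusters[best_index].append(vec)
        return best_index
    clusters.append([vec])
    return len(clusters) - 1


clusters = []
first = add_document(clusters, "cluster document text".split())
second = add_document(clusters, "document cluster tree".split())
third = add_document(clusters, "soccer match goal".split())
```

In this toy run the first two documents share two of their three terms and end up in the same cluster, while the third shares none and starts a cluster of its own. Cosine similarity divides by vector length, which is one common way to reduce the effect of differing document sizes; the paper's own normalization may work differently.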