Hierarchical clustering of a Finnish newspaper article collection with graded relevance assessments

Authors:
Tuomo Korenius;Jorma Laurikkala;Martti Juhola;Kalervo Järvelin
Affiliations:
Department of Computer Sciences, University of Tampere, Finland;Department of Computer Sciences, University of Tampere, Finland;Department of Computer Sciences, University of Tampere, Finland;Center for Advanced Studies, University of Tampere, Finland
Venue:
Information Retrieval
Year:
2006

Abstract

Search facilitated with agglomerative hierarchical clustering methods was studied in a collection of Finnish newspaper articles (N = 53,893). To allow quick experiments, clustering was applied to a sample (N = 5,000) whose dimensionality was reduced with principal components analysis. The dendrograms were heuristically cut to find an optimal partition, whose clusters were compared with each of the 30 queries to retrieve the best-matching cluster. The four-level relevance assessment was collapsed into a binary one in two ways: (A) by treating all relevant documents as relevant, and (B) by treating only the highly relevant documents as relevant. Single linkage (SL) was the worst method: it created many tiny clusters, and, consequently, searches based on it had high precision and low recall. Complete linkage (CL), average linkage (AL), and Ward's method (WM) returned reasonably sized clusters, typically of 18–32 documents. Their recall (A: 27–52%, B: 50–82%) was higher than that of the SL clusters, and their precision (A: 83–90%, B: 18–21%) was comparable to it. The AL and WM clusterings were 1–8% more effective than nearest neighbor searching (NN), and SL and CL were 1–9% less effective than NN. However, the differences were not statistically significant. When evaluated with the liberal assessment A, the results suggest that the AL and WM clusterings offer better retrieval ability than NN. Under assessment B, the AL and WM clusterings are better than NN when recall is considered more important than precision. The results imply that collections in highly inflectional and agglutinative languages, such as Finnish, may be clustered in the same way as collections in English, provided that the documents are appropriately preprocessed.
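
The two binary readings of the graded assessments, and the precision and recall of a retrieved cluster, can be illustrated with a minimal sketch. It is not the authors' code. The 0-3 grade scale, the thresholds chosen for A and B, the document ids, and the helper names `collapse` and `precision_recall` are all assumptions made for illustration; the abstract states only that a four-level assessment was collapsed in two ways.

```python
from typing import Dict, Set, Tuple


def collapse(grades: Dict[str, int], threshold: int) -> Set[str]:
    """Return ids of documents whose grade is at least `threshold`.

    Assumed scale: 0 = not relevant ... 3 = highly relevant.
    """
    return {doc for doc, grade in grades.items() if grade >= threshold}


def precision_recall(retrieved: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Precision and recall of a retrieved set against a binary relevant set."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical graded judgments for one query (invented data).
grades = {"d1": 3, "d2": 1, "d3": 0, "d4": 2, "d5": 3, "d6": 1}

# Hypothetical best-matching cluster returned for that query.
cluster = {"d1", "d2", "d3", "d4"}

# Assessment A (liberal): any grade above 0 counts as relevant (assumed threshold).
liberal = collapse(grades, threshold=1)
# Assessment B (strict): only the top grade counts as relevant (assumed threshold).
strict = collapse(grades, threshold=3)

p_a, r_a = precision_recall(cluster, liberal)
p_b, r_b = precision_recall(cluster, strict)

print(f"A: precision={p_a:.2f} recall={r_a:.2f}")
print(f"B: precision={p_b:.2f} recall={r_b:.2f}")
```

With this toy data the same cluster scores lower precision under the strict reading than under the liberal one, because fewer of its members count as relevant. This matches the direction of the precision figures reported in the abstract, though the numbers here are invented.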