An architecture for efficient document clustering and retrieval on a dynamic collection of newspaper texts

Authors:
Alan F. Smeaton; Mark Burnett; Francis Crimmins; Gerard Quinn
Affiliations:
School of Computer Applications, Dublin City University, Dublin 9, Ireland (all authors)
Venue:
IRSG'98 Proceedings of the 20th Annual BCS-IRSG conference on Information Retrieval Research
Year:
1998

Abstract

Clustering of related or similar objects has long been regarded as a potentially useful aid to users navigating an information space such as a document collection. When documents are related by being about the same or similar topics, this is often a good indicator that they will be relevant to the same queries, and this can be exploited during retrieval. Many clustering algorithms and techniques have been developed and implemented since the earliest days of computational information retrieval, but as document collections have grown, these techniques have not been scaled to large collections because of their computational overhead. In this paper we describe a technique for clustering a collection of documents, such as a collection of online newspapers, that uses a number of short-cuts to make the process computable for large collections. Furthermore, our design is extensible in that it caters for a dynamic collection of documents that would be periodically, perhaps nightly, updated, amended, or subject to deletions. An implementation of the clustering on an archive of the Irish Times newspaper is reported here.
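
The abstract does not say which short-cuts the paper uses, so the following is only an illustrative sketch of the general idea of keeping clusters up to date as documents arrive and are deleted, not the authors' method. It assumes a simple single-pass scheme: each new document joins the most similar existing cluster if its cosine similarity to that cluster's centroid reaches a threshold, and otherwise starts a new cluster. The class name `IncrementalClusterer`, the whitespace tokenizer, and the threshold value are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector (naive whitespace tokenizer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class IncrementalClusterer:
    """Hypothetical single-pass clusterer supporting additions and deletions."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold  # assumed value, would need tuning
        self.docs = {}              # doc_id -> term vector
        self.cluster_of = {}        # doc_id -> cluster index
        self.centroids = []         # cluster index -> summed term vector
        self.members = []           # cluster index -> set of doc_ids

    def add(self, doc_id, text):
        """Assign a new document to the best cluster, or open a new one."""
        vec = tf_vector(text)
        best, best_sim = None, self.threshold
        for i, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            self.centroids.append(Counter())
            self.members.append(set())
            best = len(self.centroids) - 1
        self.centroids[best].update(vec)  # adds term counts in place
        self.members[best].add(doc_id)
        self.docs[doc_id] = vec
        self.cluster_of[doc_id] = best
        return best

    def remove(self, doc_id):
        """Delete a document and subtract it from its cluster centroid."""
        vec = self.docs.pop(doc_id)
        i = self.cluster_of.pop(doc_id)
        # Counter subtraction keeps only terms with positive counts.
        self.centroids[i] = self.centroids[i] - vec
        self.members[i].discard(doc_id)


clusterer = IncrementalClusterer(threshold=0.3)
first = clusterer.add("d1", "budget vote in the dail today")
second = clusterer.add("d2", "dail budget vote delayed")
third = clusterer.add("d3", "rugby final at lansdowne road")
clusterer.remove("d2")
```

Each addition compares the document against cluster centroids rather than against every other document, and a deletion touches only one centroid, so a nightly batch of changes does not require reclustering the whole archive. An amended article could be handled in this sketch as a `remove` followed by an `add`.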