Indexing for fast categorisation

Authors:
Vaughan R. Shanks;Hugh E. Williams;Adam Cannane
Affiliations:
School of Computer Science and Information Technology, RMIT University, GPO Box 2476V, Melbourne;School of Computer Science and Information Technology, RMIT University, GPO Box 2476V, Melbourne;School of Computer Science and Information Technology, RMIT University, GPO Box 2476V, Melbourne
Venue:
ACSC '03 Proceedings of the 26th Australasian computer science conference - Volume 16
Year:
2003

Abstract

Automatic categorisation is an important technique for the management of large document collections. Categorisation can be used to store or locate documents that satisfy an information need when the need cannot be expressed as a concise list of query terms. Inverted indexes are used in all query-based retrieval systems to allow efficient query processing. In this paper, we propose the application of inverted indexes to categorisation with the aim of developing a fast, scalable, and accurate approach. Specifically, we propose successful variants of inverted indexing to reduce index size: first, quantisation of term-category weights; second, compression of the quantised weights; and, last, storing only those weights that significantly impact the categorisation process. We show that our techniques permit fast, accurate categorisation: index size is reduced by orders of magnitude compared to conventional inverted indexing and the accuracy of categorisation is preserved.
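
The abstract gives no implementation details, so the following is only a minimal sketch of how quantised, pruned term-category weights might be held in an inverted index. It assumes non-negative weights and a linear scoring scheme in which a document's score for a category is the sum of the weights of its terms; neither assumption is confirmed by the abstract. The function names `build_index` and `categorise`, the parameters `bits` and `min_level`, and the sample weights are all hypothetical. Compression of the quantised weights is omitted.

```python
from collections import defaultdict


def build_index(weights, bits=4, min_level=1):
    """Build an inverted index from term to (category, quantised weight) pairs.

    weights: {category: {term: non-negative float weight}} (hypothetical input).
    bits: number of bits per quantised weight (assumed parameter).
    min_level: quantised weights below this level are not stored (assumed
    stand-in for "storing only weights that significantly impact" the result).
    """
    max_weight = max(w for terms in weights.values() for w in terms.values())
    top_level = (1 << bits) - 1  # largest value representable in `bits` bits
    index = defaultdict(list)
    for category, terms in weights.items():
        for term, weight in terms.items():
            level = round(weight / max_weight * top_level)  # uniform quantisation
            if level >= min_level:  # prune low-impact weights
                index[term].append((category, level))
    return dict(index)


def categorise(index, document_terms):
    """Rank categories by the sum of quantised weights of the document's terms."""
    scores = defaultdict(int)
    for term in document_terms:
        for category, level in index.get(term, ()):
            scores[category] += level
    # Highest score first; ties broken by category name for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical term-category weights, for illustration only.
weights = {
    "sport": {"goal": 0.9, "match": 0.6, "the": 0.02},
    "finance": {"market": 0.8, "goal": 0.1, "the": 0.01},
}
index = build_index(weights, bits=4, min_level=1)
ranking = categorise(index, ["goal", "match", "the"])
print(ranking)
```

In this sketch the low-weight term "the" quantises to level 0 for both categories and is never stored, so it adds nothing to index size or scoring cost. Each retained weight fits in `bits` bits, which is what would make a later compression step worthwhile.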