A comparative study of two automatic document classification methods in a library setting

Authors:
Joanna Yi-Hang Pong;Ron Chi-Wai Kwok;Raymond Yiu-Keung Lau;Jin-Xing Hao;Percy Ching-Chi Wong
Affiliations:
Run Run Shaw Library, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong;Department of Information Systems, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong;Department of Information Systems, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong;Department of Information Systems, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong;Department of Information Systems, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong
Venue:
Journal of Information Science
Year:
2008

Abstract

In current library practice, trained human experts usually catalogue and index documents manually. With the explosive growth in the number of electronic documents available on the Internet and in digital libraries, it is increasingly difficult for library practitioners to categorize both electronic documents and traditional library materials by manual means alone. To improve the effectiveness and efficiency of document categorization in the library setting, more in-depth studies of automatic document classification methods for categorizing library items are required. Machine learning research has advanced rapidly in recent years. However, applying machine learning techniques to improve library practice is still a relatively unexplored area. This paper illustrates the design and development of a machine-learning-based automatic document classification system to alleviate the manual categorization problem encountered within the library setting. Two supervised machine learning algorithms have been tested. Our empirical tests show that supervised machine learning algorithms in general, and the k-nearest neighbours (KNN) algorithm in particular, can be used to develop an effective document classification system to enhance current library practice. Moreover, some concrete recommendations are made on how to apply the KNN algorithm in practice to develop automatic document classification in a library setting. To the best of our knowledge, this is the first in-depth study of applying the KNN algorithm to automatic document classification based on the widely used Library of Congress Classification (LCC) scheme adopted by many large libraries.
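
As a rough illustration of the kind of KNN classification the abstract refers to, the following is a minimal sketch and not the authors' system. It assumes whitespace tokenization, TF-IDF term weighting, cosine similarity, and similarity-weighted voting among the k nearest training documents; none of these choices is stated in the abstract. The training titles, the labels ("QA" for mathematics, "Z" for library science), and all function names are hypothetical toy examples.

```python
import math
from collections import Counter

# Hypothetical toy training set: short titles labelled with LCC-style classes.
TRAIN_TEXTS = [
    "introduction to linear algebra and matrix theory",
    "calculus and differential equations",
    "probability theory and statistics",
    "library cataloguing and indexing practice",
    "digital libraries and information retrieval",
    "bibliography and library science handbook",
]
TRAIN_LABELS = ["QA", "QA", "QA", "Z", "Z", "Z"]


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, very simple tokenizer)."""
    return text.lower().split()


def tfidf_vectors(token_lists):
    """Return sparse TF-IDF vectors (dicts) and the IDF table for a corpus."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    # Smoothed IDF so that every known term gets a positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = [
        {t: count * idf[t] for t, count in Counter(tokens).items()}
        for tokens in token_lists
    ]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(query, train_texts, train_labels, k=3):
    """Label `query` by a similarity-weighted vote of its k nearest neighbours."""
    vectors, idf = tfidf_vectors([tokenize(t) for t in train_texts])
    # Terms unseen in training carry no information here and are dropped.
    query_vec = {
        t: count * idf[t]
        for t, count in Counter(tokenize(query)).items()
        if t in idf
    }
    scored = sorted(
        ((cosine(query_vec, vec), label) for vec, label in zip(vectors, train_labels)),
        reverse=True,
    )
    votes = Counter()
    for similarity, label in scored[:k]:
        votes[label] += similarity
    return votes.most_common(1)[0][0]


print(knn_classify("cataloguing practice in digital libraries", TRAIN_TEXTS, TRAIN_LABELS))
```

In a real library setting the training set would be existing catalogue records with their assigned class numbers, and the value of k and the choice of text fields would need to be tuned empirically; the sketch fixes k=3 only for illustration.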