Classifying Digital Resources in a Practical and Coherent Way with Easy-to-Get Features

Authors:
Chong Chen;Hongfei Yan;Xiaoming Li
Affiliations:
Computer Networks and Distributed Systems Laboratory, School of EECS, Peking University, Beijing, China 100871 and Department of Information Management, School of Management, Beijing Normal Univer ...;Computer Networks and Distributed Systems Laboratory, School of EECS, Peking University, Beijing, China 100871;Computer Networks and Distributed Systems Laboratory, School of EECS, Peking University, Beijing, China 100871
Venue:
PAKM '08 Proceedings of the 7th International Conference on Practical Aspects of Knowledge Management
Year:
2008

Abstract

Digital resources are complex data objects that come in a rich variety of forms and types. They are growing fast in volume on the Web but are hard to classify efficiently. This paper presents a practical classification solution that uses features drawn from the file names and extensions of digital resources. These features are easy to obtain and common to all resources, but they are generally low-frequency and sparse, which implies that statistical approaches may not work well. Our solution combines a Naive Bayes (NB) classifier with Simple Good-Turing (SGT) probability estimation and shows great promise under these conditions, with a total accuracy of 80%. In our opinion, the results arise because 1) the features fit NB's conditional independence hypothesis well; and 2) the abundant one-time-occurrence features lead to reasonable probability estimates for unobserved features, which also means that a general feature selection strategy is not needed in this case. A 7.4TB digital resource collection, CDAL, is used to train and evaluate the model.
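
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes file names are tokenized on non-alphanumeric characters and that the extension is kept as one extra feature. It also replaces full Simple Good-Turing (which smooths the frequency-of-frequency counts before adjusting every count) with only the basic Good-Turing estimate of the unseen probability mass, N1/N, where N1 is the number of features seen exactly once in a class and N is the total number of feature occurrences in that class. The names `features`, `GoodTuringNB`, and `unseen_types`, as well as the toy training file names, are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def features(filename):
    """Split a file name into lowercase word tokens plus one extension feature."""
    stem, dot, ext = filename.rpartition(".")
    if not dot:  # no "." in the name, so there is no extension
        stem, ext = filename, ""
    tokens = [t for t in re.split(r"[^a-z0-9]+", stem.lower()) if t]
    if ext:
        tokens.append("ext:" + ext.lower())
    return tokens


class GoodTuringNB:
    """Naive Bayes over file-name features with a Good-Turing unseen mass.

    Simplification (an assumption of this sketch): seen features keep their
    relative frequency scaled by (1 - p0); full SGT would adjust each count.
    """

    def __init__(self, unseen_types=1000):
        # Assumed number of unseen feature types sharing the unseen mass.
        self.unseen_types = unseen_types
        self.class_docs = Counter()          # label -> number of training files
        self.counts = defaultdict(Counter)   # label -> feature -> count

    def fit(self, samples):
        for filename, label in samples:
            self.class_docs[label] += 1
            self.counts[label].update(features(filename))
        return self

    def _log_prob(self, label, feat):
        counts = self.counts[label]
        total = sum(counts.values())
        singletons = sum(1 for c in counts.values() if c == 1)
        # Good-Turing estimate of the total probability of unseen features.
        p0 = singletons / total if total else 1.0
        # Clamp to avoid log(0) when there are no singletons or only singletons.
        p0 = min(max(p0, 1e-6), 1 - 1e-6)
        c = counts.get(feat, 0)
        if c == 0:
            return math.log(p0 / self.unseen_types)
        return math.log((1 - p0) * c / total)

    def predict(self, filename):
        n_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_docs.items():
            # Log prior plus the sum of per-feature log likelihoods, which is
            # where the conditional independence assumption enters.
            score = math.log(n / n_docs)
            score += sum(self._log_prob(label, f) for f in features(filename))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training set (hypothetical file names and labels).
model = GoodTuringNB().fit([
    ("live_concert.mp3", "audio"),
    ("piano_sonata.mp3", "audio"),
    ("concert_hall.wav", "audio"),
    ("annual_report.pdf", "document"),
    ("project_report.doc", "document"),
    ("lecture_notes.pdf", "document"),
])
print(model.predict("jazz_concert.mp3"))
```

In this toy set most features occur only once, so each class reserves a sizeable share of probability for tokens it has never seen, such as `jazz`. That mirrors the abstract's point that abundant one-time-occurrence features make the estimate for unobserved features reasonable. The sketch behaves poorly when every feature in a class is a singleton, because the clamp then leaves almost no mass for seen features.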