Digital resources are complex data objects that come in a rich variety of forms and types. Their volume on the Web grows quickly, yet they are hard to classify efficiently. This paper presents a practical classification solution that uses features drawn from the file names and extensions of digital resources. These features are easy to obtain and common to all resources, but they are generally low-frequency and sparse, which suggests that a statistical approach may not work well. Our solution combines a Naive Bayes (NB) classifier with Simple Good-Turing (SGT) probability estimation and shows great promise under these conditions, reaching an overall accuracy of 80%. We attribute the results to two factors: 1) the features fit NB's conditional independence assumption well; 2) the abundance of features that occur only once leads to reasonable probability estimates for unobserved features, which also means that a general feature selection strategy is not needed in this case. A 7.4TB digital resource collection, CDAL, is used to train and evaluate the model.
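To make the combination of NB and Good-Turing estimation concrete, the following is a minimal sketch and not the authors' implementation. It assumes a tokenizer that splits file names on non-alphanumeric characters and treats the extension as its own feature, and it uses a simplified Good-Turing estimate: the unseen mass is N_1/N, and seen counts are adjusted with a log-linear fit to the frequency-of-frequency counts. The averaging of N_r and the switch between Turing and smoothed estimates found in full SGT are omitted. All names (`tokenize`, `simple_good_turing`, `FileNameNB`, `TRAIN`) and the toy data are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(filename):
    """Lowercase tokens from the name; the extension becomes one extra feature."""
    stem, dot, ext = filename.rpartition(".")
    if not dot:                                   # no extension present
        stem, ext = filename, ""
    tokens = [t for t in re.split(r"[^a-z0-9]+", stem.lower()) if t]
    if ext:
        tokens.append("ext:" + ext.lower())
    return tokens


def simple_good_turing(counts, n_unseen):
    """Simplified Good-Turing (assumption: not the full SGT procedure).

    Returns ({r: probability of one feature seen r times},
             probability of one unseen feature).
    """
    total = sum(counts.values())
    n_r = Counter(counts.values())                # N_r: features seen r times
    # Mass reserved for unseen features is N_1 / N, clamped for degenerate data.
    p0 = min(max(n_r.get(1, 0) / total, 1e-6), 1 - 1e-6)
    rs = sorted(n_r)
    slope = -1.0                                  # slope -1 leaves counts unchanged
    if len(rs) >= 2:                              # least-squares fit of log N_r on log r
        xs = [math.log(r) for r in rs]
        ys = [math.log(n_r[r]) for r in rs]
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        slope /= sum((x - mx) ** 2 for x in xs)
        slope = min(slope, -1.0)                  # guard: never inflate seen counts
    r_star = {r: r * (1 + 1 / r) ** (slope + 1) for r in rs}
    norm = sum(n_r[r] * r_star[r] for r in rs)
    p_seen = {r: (1 - p0) * r_star[r] / norm for r in rs}
    return p_seen, p0 / max(n_unseen, 1)


class FileNameNB:
    """Multinomial Naive Bayes over file-name tokens with Good-Turing smoothing."""

    def fit(self, examples):
        per_class = defaultdict(Counter)
        class_sizes = Counter(label for _, label in examples)
        for name, label in examples:
            per_class[label].update(tokenize(name))
        vocab = {t for counts in per_class.values() for t in counts}
        self.log_prior = {c: math.log(n / len(examples))
                          for c, n in class_sizes.items()}
        self.model = {}
        for c, counts in per_class.items():
            # Assumption: unseen features are the rest of the shared vocabulary
            # plus one slot for tokens never observed in any class.
            n_unseen = len(vocab) - len(counts) + 1
            p_seen, p_unseen = simple_good_turing(counts, n_unseen)
            self.model[c] = (counts, p_seen, p_unseen)
        return self

    def predict(self, filename):
        def score(c):
            counts, p_seen, p_unseen = self.model[c]
            return self.log_prior[c] + sum(
                math.log(p_seen[counts[t]] if t in counts else p_unseen)
                for t in tokenize(filename))
        return max(self.model, key=score)


# Invented toy data, for illustration only.
TRAIN = [
    ("holiday_photo_001.jpg", "image"),
    ("holiday_photo_002.jpg", "image"),
    ("beach_photo.png", "image"),
    ("lecture_notes_week1.pdf", "document"),
    ("lecture_notes_week2.pdf", "document"),
    ("thesis_draft.doc", "document"),
]

if __name__ == "__main__":
    clf = FileNameNB().fit(TRAIN)
    print(clf.predict("summer_photo.jpg"))
```

In this sketch the single-occurrence features directly set the probability mass given to unseen tokens, which mirrors the abstract's point that rare features are informative for estimation and need not be pruned by feature selection.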