Classifying Digital Resources in a Practical and Coherent Way with Easy-to-Get Features

  • Authors:
  • Chong Chen; Hongfei Yan; Xiaoming Li

  • Affiliations:
  • Computer Networks and Distributed Systems Laboratory, School of EECS, Peking University, Beijing, China 100871 and Department of Information Management, School of Management, Beijing Normal Univer ...;Computer Networks and Distributed Systems Laboratory, School of EECS, Peking University, Beijing, China 100871;Computer Networks and Distributed Systems Laboratory, School of EECS, Peking University, Beijing, China 100871

  • Venue:
  • PAKM '08 Proceedings of the 7th International Conference on Practical Aspects of Knowledge Management
  • Year:
  • 2008

Abstract

Digital resources are complex data objects that come in a rich variety of forms and types. Their volume on the Web grows fast, but they are hard to classify efficiently. This paper presents a practical classification solution that uses features drawn from the file names and extensions of digital resources. These features are easy to obtain and common to all resources, but they are generally low-frequency and sparse, which implies that a statistical approach may not work well. Our solution combines a Naive Bayes (NB) classifier with Simple Good-Turing (SGT) probability estimation, which shows great promise under these conditions, reaching an overall accuracy of 80%. In our opinion, the results have two causes: 1) the features fit NB's conditional independence hypothesis well; 2) the abundant one-time-occurrence features lead to reasonable probability estimates for unobserved features, which also means that a general feature selection strategy is not needed in this case. A 7.4TB digital resource collection, CDAL, is used to train and evaluate the model.
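
To make the combination of file-name features, NB, and Good-Turing estimation concrete, the following is a minimal Python sketch and not the authors' implementation. It uses a simplified Good-Turing variant: the frequency-of-frequency counts are smoothed with a log-linear fit and the mass N_1/N is reserved for unseen features, but the full SGT switching rule between raw and smoothed estimates is omitted. The tokenization rule, the names `features`, `good_turing`, and `FileNameNB`, the `unseen_types` parameter, and the clamp on the unseen mass are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def features(filename):
    """Split a file name into lowercase tokens plus one extension feature."""
    stem, dot, ext = filename.rpartition(".")
    if not dot:  # no extension present
        stem, ext = filename, ""
    tokens = [t for t in re.split(r"[^a-z0-9]+", stem.lower()) if t]
    if ext:
        tokens.append("ext:" + ext.lower())
    return tokens


def good_turing(counts, unseen_types):
    """Simplified Good-Turing estimate (assumption: not the full SGT).

    Returns (probabilities of seen features, probability of one unseen feature).
    """
    total = sum(counts.values())
    if total == 0:
        return {}, 1.0 / unseen_types
    n_r = Counter(counts.values())  # N_r: number of features seen r times
    # Mass reserved for unseen features is N_1 / N. The clamp is a
    # sketch-level guard (an assumption) that keeps both parts non-zero.
    p0 = min(max(n_r.get(1, 0), 1) / total, 0.5)
    # Least-squares fit of log N_r = a + b * log r.
    xs = [math.log(r) for r in n_r]
    ys = [math.log(n) for n in n_r.values()]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    var = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var if var > 0 else -1.0
    # Adjusted count r* = (r + 1) * S(r + 1) / S(r) with S(r) = exp(a) * r**b.
    r_star = {r: r * (1 + 1 / r) ** (b + 1) for r in n_r}
    norm = sum(r_star[c] for c in counts.values())
    seen = {f: (1 - p0) * r_star[c] / norm for f, c in counts.items()}
    return seen, p0 / unseen_types


class FileNameNB:
    """Naive Bayes over file-name tokens with Good-Turing style estimates."""

    def __init__(self, unseen_types=10000):
        # Assumed number of distinct unseen feature types per class.
        self.unseen_types = unseen_types

    def fit(self, filenames, labels):
        per_class = defaultdict(Counter)
        for name, label in zip(filenames, labels):
            per_class[label].update(features(name))
        n = len(labels)
        self.log_prior = {c: math.log(k / n) for c, k in Counter(labels).items()}
        self.model = {c: good_turing(cnt, self.unseen_types)
                      for c, cnt in per_class.items()}
        return self

    def predict(self, filename):
        def score(c):
            seen, unseen = self.model[c]
            # Conditional independence: sum the per-feature log probabilities.
            return self.log_prior[c] + sum(
                math.log(seen.get(f, unseen)) for f in features(filename))
        return max(self.model, key=score)
```

In this sketch every feature is kept, including those that occur only once, because the singleton count N_1 is what determines how much probability is set aside for features never seen in training.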