Combining contents and citations for scientific document classification

  • Authors:
  • Minh Duc Cao;Xiaoying Gao

  • Affiliations:
  • School of Mathematics, Statistics & Computer Science, Victoria University of Wellington, Wellington, New Zealand (both authors)

  • Venue:
  • AI'05: Proceedings of the 18th Australian Joint Conference on Advances in Artificial Intelligence
  • Year:
  • 2005

Abstract

This paper introduces a classification system that exploits both content information and citation structure to classify scientific papers. The system first applies a content-based statistical classification method similar to those used in general text classification. We investigate several classification methods, including K-nearest neighbours, nearest centroid, naive Bayes and decision trees. Among these, K-nearest neighbours outperforms the others, while the rest perform comparably. Using phrases in addition to words, together with a good feature selection strategy such as information gain, can improve system accuracy and reduce training time compared with using words only. To incorporate citation links, the system uses an iterative method that updates the labels of classified instances based on those links. Our results show that combining contents and citations significantly improves system performance.
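
The two-stage idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the update rule, so everything here is an assumption for illustration: the function names (`knn_scores`, `combine_with_citations`), the cosine similarity over raw word counts, the mixing weight `alpha`, and the fixed number of iterations are all hypothetical, and phrase features and information-gain feature selection are omitted. It should not be read as the authors' method.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Lowercased word counts; a simple stand-in for word and phrase features."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def knn_scores(doc, train, classes, k=3):
    """Content-based step: class distribution from the k most similar training docs.

    `train` is a list of (text, label) pairs.
    """
    vec = bag_of_words(doc)
    nearest = sorted(train, key=lambda item: cosine(vec, bag_of_words(item[0])), reverse=True)[:k]
    votes = Counter(label for _, label in nearest)
    return {c: votes[c] / len(nearest) for c in classes}

def combine_with_citations(content, links, classes, alpha=0.5, iterations=10):
    """Citation step (assumed rule, not taken from the paper).

    Each paper's score is a mix of its own content score (weight alpha) and the
    average current score of the papers it is linked to (weight 1 - alpha).
    The update is repeated a fixed number of times, then each paper receives
    the class with the highest score.
    """
    scores = {p: dict(s) for p, s in content.items()}
    for _ in range(iterations):
        updated = {}
        for paper, own in content.items():
            neighbours = [n for n in links.get(paper, []) if n in scores]
            if not neighbours:
                # No citation evidence: fall back to the content scores alone.
                updated[paper] = dict(own)
                continue
            updated[paper] = {
                c: alpha * own[c]
                + (1 - alpha) * sum(scores[n][c] for n in neighbours) / len(neighbours)
                for c in classes
            }
        scores = updated
    return {p: max(classes, key=lambda c: s[c]) for p, s in scores.items()}
```

In this sketch, a paper whose content scores slightly favour one class can be relabelled when the papers it cites or is cited by lean strongly towards another class, which is the intuition behind combining the two sources of evidence. A smaller `alpha` gives the citation links more influence.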