Support vector machines classification with a very large-scale taxonomy

  • Authors:
  • Tie-Yan Liu;Yiming Yang;Hao Wan;Hua-Jun Zeng;Zheng Chen;Wei-Ying Ma

  • Affiliations:
  • Microsoft Research Asia, Beijing, P. R. China;Carnegie Mellon University, PA;Tsinghua University, Beijing, P. R. China;Microsoft Research Asia, Beijing, P. R. China;Microsoft Research Asia, Beijing, P. R. China;Microsoft Research Asia, Beijing, P. R. China

  • Venue:
  • ACM SIGKDD Explorations Newsletter - Natural language processing and text mining
  • Year:
  • 2005

Abstract

Very large-scale classification taxonomies typically have hundreds of thousands of categories, deep hierarchies, and skewed category distribution over documents. However, it is still an open question whether the state-of-the-art technologies in automated text categorization can scale to (and perform well on) such large taxonomies. In this paper, we report the first evaluation of Support Vector Machines (SVMs) in web-page classification over the full taxonomy of the Yahoo! categories. Our accomplishments include: 1) a data analysis on the Yahoo! taxonomy; 2) the development of a scalable system for large-scale text categorization; 3) theoretical analysis and experimental evaluation of SVMs in hierarchical and non-hierarchical settings for classification; 4) an investigation of threshold tuning algorithms with respect to time complexity and their effect on the classification accuracy of SVMs. We found that, in terms of scalability, the hierarchical use of SVMs is efficient enough for very large-scale classification; however, in terms of effectiveness, the performance of SVMs over the Yahoo! Directory is still far from satisfactory, which indicates that more substantial investigation is needed.
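The abstract does not say which threshold tuning algorithms were investigated, so the following is only a minimal sketch of one common approach: for a single category, pick the score cut-off that maximizes F1 on held-out documents. The function names `f1_score` and `tune_threshold`, the choice of F1 as the objective, and the toy scores are assumptions made for illustration, not the paper's method. The sketch sorts once and then sweeps, so it costs O(n log n) per category, which is the kind of time-complexity concern the abstract raises at the scale of hundreds of thousands of categories.

```python
def f1_score(tp, fp, fn):
    """F1 from counts; defined as 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def tune_threshold(scores, labels):
    """Return (threshold, f1) maximizing F1 for the rule `score >= threshold`.

    scores: classifier outputs (e.g. SVM decision values) for one category
            on held-out documents.
    labels: True where the document really belongs to the category.
    """
    total_pos = sum(1 for y in labels if y)
    # Sort by descending score so each prefix is the set predicted positive.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    # Default: accept nothing (infinite threshold) with F1 of 0.0.
    best_threshold, best_f1 = float("inf"), 0.0
    tp = fp = 0
    for i, (score, is_pos) in enumerate(ranked):
        if is_pos:
            tp += 1
        else:
            fp += 1
        # Only evaluate a cut between distinct score values.
        if i + 1 < len(ranked) and ranked[i + 1][0] == score:
            continue
        f1 = f1_score(tp, fp, total_pos - tp)
        if f1 > best_f1:
            best_threshold, best_f1 = score, f1
    return best_threshold, best_f1


# Toy held-out set for one hypothetical category.
scores = [2.1, 0.7, 0.3, -0.2, -1.5]
labels = [True, True, False, True, False]
threshold, f1 = tune_threshold(scores, labels)
```

In a hierarchical setting, a routine like this would be run once per category node, so its per-category cost multiplies across the taxonomy.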