Automatic content-based categorization of Wikipedia articles

  • Authors: Zeno Gantner; Lars Schmidt-Thieme
  • Affiliations: University of Hildesheim (both authors)
  • Venue: People's Web '09: Proceedings of the 2009 Workshop on The People's Web Meets NLP: Collaboratively Constructed Semantic Resources
  • Year: 2009

Abstract

Wikipedia's article contents and category hierarchy are widely used to produce semantic resources that improve performance on tasks such as text classification and keyword extraction. The reverse -- using text classification methods to predict the categories of Wikipedia articles -- has attracted less attention so far. We propose to "return the favor" and use text classifiers to improve Wikipedia. This could support the emergence of a virtuous circle between the wisdom of the crowds and machine learning/NLP methods. We define the categorization of Wikipedia articles as a multi-label classification task, describe two solutions to the task, and perform experiments showing that our approach is feasible despite the large number of labels.
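
To make the multi-label framing concrete, here is a minimal sketch in which each article may receive any subset of categories. It assumes a binary relevance decomposition (one independent yes/no classifier per category) with a small naive Bayes model. The abstract does not say which two solutions the authors use, so this should not be read as their method. The class name `BinaryRelevanceNB`, the toy articles, and the category names are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Deliberately naive: lowercase and split on whitespace.
    return text.lower().split()


class BinaryRelevanceNB:
    """Hypothetical helper: one binary naive Bayes model per category."""

    def fit(self, docs, label_sets):
        self.vocab = {t for d in docs for t in tokenize(d)}
        self.models = {}
        for label in sorted(set().union(*label_sets)):
            # Token counts and document counts for "has label" / "lacks label".
            counts = {True: Counter(), False: Counter()}
            n_docs = {True: 0, False: 0}
            for doc, labels in zip(docs, label_sets):
                y = label in labels
                counts[y].update(tokenize(doc))
                n_docs[y] += 1
            self.models[label] = (counts, n_docs)
        return self

    def _log_score(self, tokens, counts, n_docs, y):
        # Laplace-smoothed log prior plus log likelihood of known tokens.
        total = n_docs[True] + n_docs[False]
        score = math.log((n_docs[y] + 1) / (total + 2))
        denom = sum(counts[y].values()) + len(self.vocab)
        for t in tokens:
            if t in self.vocab:
                score += math.log((counts[y][t] + 1) / denom)
        return score

    def predict(self, doc):
        tokens = tokenize(doc)
        predicted = set()
        # Each category is decided independently, so zero, one, or
        # several categories can be assigned to the same article.
        for label, (counts, n_docs) in self.models.items():
            pos = self._log_score(tokens, counts, n_docs, True)
            neg = self._log_score(tokens, counts, n_docs, False)
            if pos > neg:
                predicted.add(label)
        return predicted


# Toy data (invented for illustration, not taken from Wikipedia).
docs = [
    "python is a programming language",
    "java is a programming language",
    "the python is a large snake",
    "the cobra is a venomous snake",
]
label_sets = [
    {"Programming languages"},
    {"Programming languages"},
    {"Snakes"},
    {"Snakes", "Venomous animals"},
]

clf = BinaryRelevanceNB().fit(docs, label_sets)
print(clf.predict("programming language"))
print(clf.predict("venomous cobra snake"))
```

Binary relevance trains one model per category, so its cost grows linearly with the number of labels. That makes the large label count mentioned in the abstract the main practical concern for any approach of this kind.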