Experiment with a hierarchical text categorization method on the WIPO-alpha patent collection

  • Authors:
  • Domonkos Tikk;György Biró

  • Venue:
  • ISUMA '03 Proceedings of the 4th International Symposium on Uncertainty Modelling and Analysis
  • Year:
  • 2003

Abstract

Text categorization is the task of assigning a text document to an appropriate category in a predefined set of categories. This paper focuses on the special case in which categories are organized in a hierarchy. We present a new approach to this recently emerged subfield of text categorization. The algorithm applies an iterative learning module that allows a classifier to be created gradually by a trial-and-error-like method. We present software that has been developed on the basis of the algorithm to illustrate its capability on a large data collection. We experimented on a very large benchmark collection, the WIPO-alpha (World Intellectual Property Organization, Geneva, Switzerland, 2002) English patent database, which consists of about 75000 XML documents distributed over 5000 categories. Our software is able to index the corpus quickly and creates a classifier in a few iteration cycles. We present the results achieved by the classifier w.r.t. various test settings.
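
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea. It assumes a greedy top-down descent through the category tree and a perceptron-style weight correction whenever a training document is misclassified. The toy taxonomy, the documents, the function names and the update rule are all illustrative assumptions, not the authors' method or software.

```python
from collections import Counter, defaultdict

# Hypothetical toy taxonomy (parent -> children); not the real patent hierarchy.
TAXONOMY = {
    "root": ["chemistry", "physics"],
    "chemistry": ["organic", "inorganic"],
    "physics": ["optics", "acoustics"],
}

# Hypothetical training documents: (tokens, leaf category).
DOCS = [
    (["carbon", "polymer", "carbon"], "organic"),
    (["metal", "oxide", "salt"], "inorganic"),
    (["lens", "laser", "light"], "optics"),
    (["sound", "wave", "echo"], "acoustics"),
]


def path_to(label, taxonomy=TAXONOMY):
    """Return the categories from just below the root down to `label`."""
    parent = {c: p for p, children in taxonomy.items() for c in children}
    path = []
    while label != "root":
        path.append(label)
        label = parent[label]
    return path[::-1]


def classify(tokens, weights, taxonomy=TAXONOMY):
    """Greedy top-down descent: at each level pick the best-scoring child."""
    tf = Counter(tokens)
    node, path = "root", []
    while node in taxonomy:
        node = max(
            taxonomy[node],
            key=lambda c: sum(weights[c][t] * n for t, n in tf.items()),
        )
        path.append(node)
    return path


def train(docs, taxonomy=TAXONOMY, max_iters=10, step=1.0):
    """Repeat passes over the data, correcting weights after each mistake."""
    weights = defaultdict(lambda: defaultdict(float))
    for _ in range(max_iters):
        errors = 0
        for tokens, label in docs:
            true_path = path_to(label, taxonomy)
            predicted = classify(tokens, weights, taxonomy)
            for want, got in zip(true_path, predicted):
                if want != got:
                    # Trial and error: reward the correct category and
                    # penalize the wrongly chosen one at the first divergence.
                    errors += 1
                    for term, n in Counter(tokens).items():
                        weights[want][term] += step * n
                        weights[got][term] -= step * n
                    break
        if errors == 0:  # every training document is classified correctly
            break
    return weights


weights = train(DOCS)
print(classify(["laser", "light"], weights))
```

On a real collection the term weights would normally come from an indexing step over the corpus rather than from raw token counts, and the stopping rule would be checked against held-out data rather than the training set.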