Feature generation for text categorization using world knowledge

  • Authors:
  • Evgeniy Gabrilovich;Shaul Markovitch

  • Affiliations:
  • Computer Science Department, Technion, Haifa, Israel;Computer Science Department, Technion, Haifa, Israel

  • Venue:
  • IJCAI'05 Proceedings of the 19th international joint conference on Artificial intelligence
  • Year:
  • 2005

Quantified Score

Hi-index 0.00

Visualization

Abstract

We enhance machine learning algorithms for text categorization with generated features based on domain-specific and common-sense knowledge. This knowledge is represented using publicly available ontologies that contain hundreds of thousands of concepts, such as the Open Directory; these ontologies are further enriched by several orders of magnitude through controlled Web crawling. Prior to text categorization, a feature generator analyzes the documents and maps them onto appropriate ontology concepts, which in turn induce a set of generated features that augment the standard bag of words. Feature generation is accomplished through contextual analysis of document text, implicitly performing word sense disambiguation. Coupled with the ability to generalize concepts using the ontology, this approach addresses the two main problems of natural language processing--synonymy and polysemy. Categorizing documents with the aid of knowledge-based features leverages information that cannot be deduced from the documents alone. Experimental results confirm improved performance, breaking through the plateau previously reached in the field.