An integrated system for building enterprise taxonomies

  • Authors:
  • Li Zhang; Tao Li; Shixia Liu; Yue Pan

  • Affiliations:
  • IBM China Research Lab, Beijing, PR China; School of Computer Science, Florida International University, Miami, USA 33199; IBM China Research Lab, Beijing, PR China; IBM China Research Lab, Beijing, PR China

  • Venue:
  • Information Retrieval
  • Year:
  • 2007

Abstract

Although considerable research has been conducted on hierarchical text categorization, little has been done on automatically collecting labeled corpora for building hierarchical taxonomies. In this paper, we propose an automatic method for collecting training samples to build hierarchical taxonomies. In our method, each category node is initially defined by a few keywords; a web search engine is then used to construct a small set of labeled documents, and a topic tracking algorithm with keyword-based content normalization is applied to enlarge the training corpus on the basis of these seed documents. We also design a method to check the consistency of the collected corpus. These steps produce a flat category structure that contains all the categories needed to build the hierarchical taxonomy. Next, a linear discriminant projection approach is used to construct more meaningful intermediate levels of the hierarchy from the generated flat set of categories. Experimental results show that the collected training corpus is good enough for statistical classification methods.
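
The corpus-enlargement step can be illustrated with a minimal sketch. The abstract does not specify the topic tracking algorithm or the keyword-based content normalization, so the code below is an assumption, not the authors' method: it swaps in plain centroid-based cosine similarity over term-frequency vectors and omits the normalization step entirely. The function names (`tf_vector`, `cosine`, `centroid`, `enlarge_corpus`) and the `threshold` value are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def tf_vector(text):
    """Lowercased bag-of-words term-frequency vector (illustrative tokenizer)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Sum of the vectors; cosine is scale-invariant, so no averaging is needed."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total


def enlarge_corpus(seed_docs, candidate_docs, threshold=0.3):
    """Return the seeds plus every candidate close enough to the seed centroid.

    Assumption: a fixed similarity threshold stands in for the paper's
    topic tracking algorithm, which the abstract does not describe.
    """
    topic = centroid(tf_vector(d) for d in seed_docs)
    accepted = [d for d in candidate_docs if cosine(topic, tf_vector(d)) >= threshold]
    return list(seed_docs) + accepted


# Hypothetical seed documents for one category and two unlabeled candidates.
seeds = ["database query index", "sql database table"]
candidates = ["database index tuning", "football match score"]
corpus = enlarge_corpus(seeds, candidates)
```

In this toy run the first candidate shares vocabulary with the seeds and is added to the category, while the second shares none and is rejected. A real system would presumably need weighting, the normalization step, and the consistency check that the abstract mentions.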