Web taxonomy integration with hierarchical shrinkage algorithm and fine-grained relations

  • Authors:
  • Chia-Wei Wu;Richard Tzong-Han Tsai;Cheng-Wei Lee;Wen-Lian Hsu

  • Affiliations:
  • Institute of Information Science, Academia Sinica, Taipei, Taiwan;Department of Computer Science and Engineering, Yuan-Ze University, Chung-Li, Taiwan;Institute of Information Science, Academia Sinica, Taipei, Taiwan and Department of Computer Science, National Tsing-Hua University, Hsingchu, Taiwan;Institute of Information Science, Academia Sinica, Taipei, Taiwan and Department of Computer Science, National Tsing-Hua University, Hsingchu, Taiwan

  • Venue:
  • Expert Systems with Applications: An International Journal
  • Year:
  • 2008

Abstract

We address the problem of integrating web taxonomies from different real Internet applications. Integrating web taxonomies means transferring instances from a source taxonomy to a target taxonomy. Unlike the conventional text categorization problem, in taxonomy integration the source taxonomy contains extra information that can be used to improve the categorization. The major existing methods can be divided into two types: those that use neighboring categories to smooth the document term vector, and those that use the semantic relationship between corresponding categories of the target and source taxonomies to facilitate categorization. In contrast to the first type of approach, which uses only a flattened hierarchy for smoothing, we apply a hierarchical shrinkage algorithm to smooth child documents by their parents. We also discuss the effect of using different hierarchical levels for smoothing. To extend the second type of approach, we extract fine-grained semantic relationships, which take into account the relationships between lower-level categories. In addition, we use cosine similarity to measure the semantic relationships, which achieves better performance than existing methods. Finally, we integrate the existing approaches and the proposed methods into one machine learning model to find the best feature configuration. The results of experiments on real Internet data demonstrate that our system outperforms standard text classifiers by about 10%.
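
The two ideas named in the abstract, smoothing a child's term vector with its ancestors and comparing categories by cosine similarity, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `shrink` and `cosine`, the fixed mixing weights, and the toy term counts are all assumptions made for illustration. Shrinkage methods in the literature typically estimate the weights from data; here they are simply supplied by hand.

```python
import math
from collections import Counter


def shrink(child, ancestors, weights):
    """Mix a child's term distribution with those of its ancestors.

    weights[0] applies to the child and weights[1:] to the ancestors,
    nearest parent first. The weights are assumed to sum to 1 and are
    fixed by hand here (an assumption, not a learned quantity).
    """
    if len(weights) != len(ancestors) + 1:
        raise ValueError("need one weight for the child plus one per ancestor")
    smoothed = Counter()
    for weight, vector in zip(weights, [child, *ancestors]):
        total = sum(vector.values())
        if total == 0:
            continue  # an empty node contributes nothing
        for term, count in vector.items():
            # Normalize counts to relative frequencies before mixing.
            smoothed[term] += weight * count / total
    return dict(smoothed)


def cosine(u, v):
    """Cosine similarity between two sparse term vectors (dicts)."""
    dot = sum(value * v.get(term, 0.0) for term, value in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy data: a document, its parent category, and the root.
doc = {"laptop": 2, "battery": 1}
parent = {"laptop": 3, "computer": 5, "battery": 2}
root = {"computer": 4, "phone": 4, "camera": 2}
target_category = {"computer": 6, "laptop": 4}

smoothed = shrink(doc, [parent, root], [0.6, 0.3, 0.1])
print(cosine(doc, target_category), cosine(smoothed, target_category))
```

In this toy case the smoothed vector picks up the term "computer" from its ancestors, so its similarity to the target category rises relative to the raw document vector. Changing which ancestors are passed to `shrink` is one way to experiment with the effect of different hierarchical levels.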