Site abstraction for rare category classification in large-scale web directory

  • Authors:
  • Tie-Yan LIU;Hao WAN;Tao QIN;Zheng CHEN;Yong REN;Wei-Ying MA

  • Affiliations:
  • Microsoft Research Asia, Beijing, P. R. China;Tsinghua University, Beijing, P. R. China;Tsinghua University, Beijing, P. R. China;Microsoft Research Asia, Beijing, P. R. China;Tsinghua University, Beijing, P. R. China;Microsoft Research Asia, Beijing, P. R. China

  • Venue:
  • WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
  • Year:
  • 2005

Abstract

Automatically classifying Web directories is an effective way to manage Web information. However, our experiments showed that state-of-the-art text classification techniques do not achieve acceptable performance on this task. According to our analysis, the main problem is the lack of effective training data in the rare categories of Web directories. To tackle this problem, we propose a novel technique named Site Abstraction, which synthesizes new training examples from the website of an existing training document. The main idea is to propagate features through the parent-child relationships in the sitemap tree. Experiments showed that our method significantly improved classification performance.
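
The abstract does not spell out how features are propagated, so the following Python sketch is only one possible reading of the idea, not the authors' algorithm. It assumes that the sitemap tree can be approximated from URL path prefixes, that page features are plain term counts, and that a page's parent and direct children contribute with a fixed down-weight. The names `parent_url`, `build_children`, `synthesize`, and the `weight` parameter are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def parent_url(url):
    """Return the URL one path level up, or None for a site root."""
    parts = urlparse(url)
    segments = [s for s in parts.path.split("/") if s]
    if not segments:
        return None
    parent_path = "/".join(segments[:-1])
    return f"{parts.scheme}://{parts.netloc}/{parent_path}"


def build_children(pages):
    """Map each URL to its direct children (assumption: URL paths mirror the sitemap)."""
    children = {url: [] for url in pages}
    for url in pages:
        parent = parent_url(url)
        if parent in children:
            children[parent].append(url)
    return children


def synthesize(url, pages, children, weight=0.5):
    """Build a synthetic example: the page's own term counts plus
    down-weighted counts from its parent and direct children.
    The weighting scheme is an assumption made for illustration."""
    features = Counter()
    for term, count in pages[url].items():
        features[term] += count
    neighbours = list(children[url])
    parent = parent_url(url)
    if parent in pages:
        neighbours.append(parent)
    for neighbour in neighbours:
        for term, count in pages[neighbour].items():
            features[term] += weight * count
    return features


# Toy site with made-up term counts (illustrative data only).
pages = {
    "http://example.com/": Counter({"shop": 2}),
    "http://example.com/guitars": Counter({"guitar": 3}),
    "http://example.com/guitars/acoustic": Counter({"acoustic": 1, "guitar": 1}),
}
children = build_children(pages)
example = synthesize("http://example.com/guitars", pages, children)
print(dict(example))
```

In this toy site, the synthetic example for the `guitars` page picks up terms from both its parent and its child, which is the kind of enrichment that could help a rare category with few training documents. A real system would need an actual sitemap and a feature weighting chosen by experiment.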