Site abstraction for rare category classification in large-scale web directory

Authors:
Tie-Yan LIU;Hao WAN;Tao QIN;Zheng CHEN;Yong REN;Wei-Ying MA
Affiliations:
Microsoft Research Asia, Beijing, P. R. China;Tsinghua University, Beijing, P. R. China;Tsinghua University, Beijing, P. R. China;Microsoft Research Asia, Beijing, P. R. China;Tsinghua University, Beijing, P. R. China;Microsoft Research Asia, Beijing, P. R. China
Venue:
WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Year:
2005

Abstract

Automatically classifying Web directories is an effective way to manage Web information. However, our experiments showed that state-of-the-art text classification techniques do not achieve acceptable performance on this task. According to our analysis, the main problem is the lack of effective training data in the rare categories of Web directories. To tackle this problem, we propose a novel technique named Site Abstraction, which synthesizes new training examples from the website that hosts an existing training document. The main idea is to propagate features through the parent-child relationships in the sitemap tree. Experiments showed that our method significantly improved classification performance.
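
The idea of propagating features along parent-child links of a sitemap tree can be illustrated with a small sketch. The abstract does not specify the propagation direction, the weighting scheme, or the feature representation, so everything below is an assumption made for illustration: the names `PageNode` and `propagate_up`, the bag-of-words term weights, and the `decay` factor applied per tree level are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PageNode:
    """One page in a sitemap tree (hypothetical structure)."""
    url: str
    terms: Dict[str, float]                      # term -> weight on this page
    children: List["PageNode"] = field(default_factory=list)


def propagate_up(node: PageNode, decay: float = 0.5) -> Dict[str, float]:
    """Combine a page's own term weights with decayed weights of its descendants.

    Assumption: features flow from children to parents, and each level of
    depth multiplies the contribution by `decay`.
    """
    combined: Dict[str, float] = dict(node.terms)
    for child in node.children:
        for term, weight in propagate_up(child, decay).items():
            combined[term] = combined.get(term, 0.0) + decay * weight
    return combined


# A tiny three-level sitemap with made-up URLs and term counts.
tabs = PageNode("http://example.com/lessons/tabs", {"tab": 4.0})
lessons = PageNode("http://example.com/lessons", {"guitar": 1.0, "chord": 4.0}, [tabs])
root = PageNode("http://example.com/", {"guitar": 2.0}, [lessons])

# The enriched feature vector of the root page could serve as a
# synthesized training example for the root's category.
features = propagate_up(root, decay=0.5)
```

In this sketch the root page, which mentions only "guitar", also picks up down-weighted terms from the pages beneath it. That is one plausible way a sparse training document in a rare category could be enriched from the rest of its site.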