Leveraging Web 2.0 Sources for Web Content Classification

Authors:
Somnath Banerjee;Martin Scholz
Affiliations:
-;-
Venue:
WI-IAT '08 Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Volume 01
Year:
2008

Citing 18
Cited 0

The anatomy of a large-scale hypertextual Web search engine

WWW7 Proceedings of the seventh international conference on World Wide Web 7
Topic-sensitive PageRank

Proceedings of the 11th international conference on World Wide Web
Using the web to obtain frequencies for unseen bigrams

Computational Linguistics - Special issue on web as corpus
Parameterized generation of labeled datasets for text categorization based on a hierarchical directory

Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Tri-Training: Exploiting Unlabeled Data Using Three Classifiers

IEEE Transactions on Knowledge and Data Engineering
Learning Object Categories from Google"s Image Search

ICCV '05 Proceedings of the Tenth IEEE International Conference on Computer Vision - Volume 2
Using additive expert ensembles to cope with concept drift

ICML '05 Proceedings of the 22nd international conference on Machine learning
Tackling concept drift by temporal inductive transfer

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Reverse testing: an efficient framework to select amongst classifiers under sample selection bias

Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
Data Mining

Data Mining
Can social bookmarking enhance search in the web?

Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
The Google Similarity Distance

IEEE Transactions on Knowledge and Data Engineering
A knowledge-based search engine powered by wikipedia

Proceedings of the sixteenth ACM conference on Conference on information and knowledge management
Boosting Inductive Transfer for Text Classification Using Wikipedia

ICMLA '07 Proceedings of the Sixth International Conference on Machine Learning and Applications
Enhancing text clustering by leveraging Wikipedia semantics

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Improving Text Classification by Using Encyclopedia Knowledge

ICDM '07 Proceedings of the 2007 Seventh IEEE International Conference on Data Mining
Overcoming the brittleness bottleneck using wikipedia: enhancing text categorization with encyclopedic knowledge

AAAI'06 proceedings of the 21st national conference on Artificial intelligence - Volume 2
Finding related pages using Green measures: an illustration with Wikipedia

AAAI'07 Proceedings of the 22nd national conference on Artificial intelligence - Volume 2

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper addresses practical aspects of web page classification not captured by the classical text mining framework. Classifiers are supposed to perform well on a broad variety of pages. We argue that constructing training corpora is a bottleneck for building such classifiers, and that care has to be taken if the goal is to generalize to previously unseen kinds of pages on the web. We study techniques for building training corpora automatically from publicly available web resources, quantify the discrepancy between them, and demonstrate that encouraging agreement between classifiers given such diverse sources drastically outperforms methods that ignore the different natures of data sources on the web.