This paper addresses practical aspects of web page classification not captured by the classical text mining framework. Classifiers are expected to perform well on a broad variety of pages. We argue that constructing training corpora is a bottleneck for building such classifiers, and that care must be taken if the goal is to generalize to previously unseen kinds of pages on the web. We study techniques for building training corpora automatically from publicly available web resources, quantify the discrepancy between them, and demonstrate that encouraging agreement between classifiers given such diverse sources drastically outperforms methods that ignore the differing nature of data sources on the web.
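The abstract does not say how agreement between classifiers is encouraged, so the following is only an illustrative sketch and not the paper's method. It assumes two logistic-regression classifiers, each trained on its own labeled source, plus a squared-difference penalty on their outputs over a shared pool of unlabeled pages. The function names, the penalty weight `lam`, and the toy feature vectors are all hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(w, x):
    """Probability that page x (a feature vector) is in the positive class."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def disagreement(w_a, w_b, pages):
    """Mean squared difference between the two classifiers' outputs."""
    return sum((predict(w_a, x) - predict(w_b, x)) ** 2 for x in pages) / len(pages)

def labeled_gradient(w, data):
    """Gradient of the mean log-loss of w on labeled (features, label) pairs."""
    grad = [0.0] * len(w)
    for x, y in data:
        err = predict(w, x) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj / len(data)
    return grad

def train_pair(src_a, src_b, unlabeled, lam, lr=0.5, epochs=300):
    """Train one classifier per source by gradient descent.

    Assumed objective (not taken from the paper):
    log-loss(A) + log-loss(B) + lam * disagreement on the unlabeled pages.
    """
    dim = len(unlabeled[0])
    w_a, w_b = [0.0] * dim, [0.0] * dim
    for _ in range(epochs):
        g_a = labeled_gradient(w_a, src_a)
        g_b = labeled_gradient(w_b, src_b)
        # Gradient of the agreement penalty with respect to each weight vector.
        for x in unlabeled:
            p_a, p_b = predict(w_a, x), predict(w_b, x)
            scale = 2.0 * lam * (p_a - p_b) / len(unlabeled)
            for j, xj in enumerate(x):
                g_a[j] += scale * p_a * (1.0 - p_a) * xj
                g_b[j] -= scale * p_b * (1.0 - p_b) * xj
        w_a = [w - lr * g for w, g in zip(w_a, g_a)]
        w_b = [w - lr * g for w, g in zip(w_b, g_b)]
    return w_a, w_b

# Toy data: features are [bias, cue_1, cue_2]. Each source exposes only one cue,
# mimicking corpora built from different web resources.
src_a = [([1.0, 1.0, 0.0], 1), ([1.0, 0.0, 0.0], 0)]
src_b = [([1.0, 0.0, 1.0], 1), ([1.0, 0.0, 0.0], 0)]
unlabeled = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0], [1.0, 0.0, 0.0]]

plain = train_pair(src_a, src_b, unlabeled, lam=0.0)   # sources ignored by each other
agree = train_pair(src_a, src_b, unlabeled, lam=2.0)   # agreement encouraged

d_plain = disagreement(plain[0], plain[1], unlabeled)
d_agree = disagreement(agree[0], agree[1], unlabeled)
print("disagreement without penalty:", d_plain)
print("disagreement with penalty:   ", d_agree)
```

With `lam=0.0` the two classifiers are trained independently, so the same `disagreement` function also gives a rough way to quantify how far apart the two sources pull their classifiers on unseen pages.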