The nature of statistical learning theory
The nature of statistical learning theory
WordNet: a lexical database for English
Communications of the ACM
Combining labeled and unlabeled data with co-training
COLT' 98 Proceedings of the eleventh annual conference on Computational learning theory
Learning to classify text from labeled and unlabeled documents
AAAI '98/IAAI '98 Proceedings of the fifteenth national/tenth conference on Artificial intelligence/Innovative applications of artificial intelligence
Efficient crawling through URL ordering
WWW7 Proceedings of the seventh international conference on World Wide Web 7
Focused crawling: a new approach to topic-specific Web resource discovery
WWW '99 Proceedings of the eighth international conference on World Wide Web
An introduction to support Vector Machines: and other kernel-based learning methods
An introduction to support Vector Machines: and other kernel-based learning methods
Text Classification from Labeled and Unlabeled Documents using EM
Machine Learning - Special issue on information retrieval
Analyzing the effectiveness and applicability of co-training
Proceedings of the ninth international conference on Information and knowledge management
Combining Labeled and Unlabeled Data for MultiClass Text Categorization
ICML '02 Proceedings of the Nineteenth International Conference on Machine Learning
PEBL: Web Page Classification without Negative Examples
IEEE Transactions on Knowledge and Data Engineering
Unsupervised word sense disambiguation rivaling supervised methods
ACL '95 Proceedings of the 33rd annual meeting on Association for Computational Linguistics
Using urls and table layout for web classification tasks
Proceedings of the 13th international conference on World Wide Web
ICML '04 Proceedings of the twenty-first international conference on Machine learning
Fast webpage classification using URL features
Proceedings of the 14th ACM international conference on Information and knowledge management
Neural Computation
CiteSeerχ: a scalable autonomous scientific digital library
InfoScale '06 Proceedings of the 1st international conference on Scalable information systems
Pattern Recognition and Machine Learning (Information Science and Statistics)
Pattern Recognition and Machine Learning (Information Science and Statistics)
The Influence of Class Imbalance on Cost-Sensitive Learning: An Empirical Study
ICDM '06 Proceedings of the Sixth International Conference on Data Mining
Simple, robust, scalable semi-supervised learning via expectation regularization
Proceedings of the 24th international conference on Machine learning
Broad expertise retrieval in sparse data environments
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Learning from labeled features using generalized expectation criteria
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Introduction to Information Retrieval
Introduction to Information Retrieval
ArnetMiner: extraction and mining of academic social networks
Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Web page classification: Features and algorithms
ACM Computing Surveys (CSUR)
The WEKA data mining software: an update
ACM SIGKDD Explorations Newsletter
Learning URL patterns for webpage de-duplication
Proceedings of the third ACM international conference on Web search and data mining
LIBSVM: A library for support vector machines
ACM Transactions on Intelligent Systems and Technology (TIST)
When Does Cotraining Work in Real Data?
IEEE Transactions on Knowledge and Data Engineering
On identifying academic homepages for digital libraries
Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries
ICADL'06 Proceedings of the 9th international conference on Asian Digital Libraries: achievements, Challenges and Opportunities
Optimal distributed online prediction using mini-batches
The Journal of Machine Learning Research
Hi-index | 0.00 |
A classifier that determines if a webpage is relevant to a specified set of topics comprises a key component for focused crawling. Can a classifier that is tuned to perform well on training datasets continue to filter out irrelevant pages in the face of changed content on the Web? We investigate this question in the context of researcher homepage crawling. We show experimentally that classifiers trained on existing datasets for homepage identification underperform while classifying "irrelevant" pages on current-day academic websites. As an alternative to obtaining datasets to retrain the classifier for the new content, we propose to use effectively unlimited amounts of unlabeled data readily available from these websites in a co-training scenario. To this end, we design novel URL-based features and use them in conjunction with content-based features as complementary views of the data to obtain remarkable improvements in accurately identifying homepages from the current-day university websites. In addition, we propose a novel technique for "learning a conforming pair of classifiers" using mini-batch gradient descent. Our algorithm seeks to minimize a loss (objective) function quantifying the difference in predictions from the two views afforded by co-training. We demonstrate that tuning the classifiers so that they make "similar" predictions on unlabeled data strongly corresponds to the effect achieved by co-training algorithms. We argue that this loss formulation provides insight into understanding the co-training process and can be used even in absence of a validation set.