Researcher homepage classification using unlabeled data

  • Authors:
  • Sujatha Das Gollapalli; Cornelia Caragea; Prasenjit Mitra; C. Lee Giles

  • Affiliations:
  • The Pennsylvania State University, State College, PA, USA; University of North Texas, Denton, TX, USA; The Pennsylvania State University, State College, PA, USA; The Pennsylvania State University, State College, PA, USA

  • Venue:
  • Proceedings of the 22nd international conference on World Wide Web
  • Year:
  • 2013

Abstract

A classifier that determines whether a webpage is relevant to a specified set of topics is a key component of focused crawling. Can a classifier that is tuned to perform well on training datasets continue to filter out irrelevant pages in the face of changed content on the Web? We investigate this question in the context of researcher homepage crawling. We show experimentally that classifiers trained on existing datasets for homepage identification underperform when classifying "irrelevant" pages on current-day academic websites. As an alternative to obtaining new datasets to retrain the classifier for the new content, we propose to use effectively unlimited amounts of unlabeled data readily available from these websites in a co-training scenario. To this end, we design novel URL-based features and use them in conjunction with content-based features as complementary views of the data to obtain substantial improvements in accurately identifying homepages from current-day university websites. In addition, we propose a novel technique for "learning a conforming pair of classifiers" using mini-batch gradient descent. Our algorithm seeks to minimize a loss (objective) function quantifying the difference in predictions from the two views afforded by co-training. We demonstrate that tuning the classifiers so that they make "similar" predictions on unlabeled data strongly corresponds to the effect achieved by co-training algorithms. We argue that this loss formulation provides insight into understanding the co-training process and can be used even in the absence of a validation set.
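As a rough illustration of the "conforming pair of classifiers" idea described in the abstract, the sketch below trains two logistic classifiers, one per view (content features and URL features), with mini-batch gradient descent on a loss combining supervised log-loss on labeled pages with a squared-difference disagreement penalty on unlabeled pages. The feature matrices, the weight lambda_, the learning rate, and the batch size are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch (assumed setup): two linear classifiers, one per co-training
# view, fit with mini-batch gradient descent. The objective adds a term that
# penalizes disagreement between the two views' predictions on unlabeled data,
# mirroring the "conforming pair" loss described in the abstract.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_conforming_pair(Xc_lab, Xu_lab, y, Xc_unlab, Xu_unlab,
                          lambda_=1.0, lr=0.1, epochs=50, batch=64, seed=0):
    rng = np.random.default_rng(seed)
    wc = np.zeros(Xc_lab.shape[1])   # content-view weights
    wu = np.zeros(Xu_lab.shape[1])   # URL-view weights
    n_lab, n_unlab = len(y), Xc_unlab.shape[0]
    for _ in range(epochs):
        for start in range(0, n_lab, batch):
            li = slice(start, start + batch)
            ui = rng.integers(0, n_unlab, size=batch)  # random unlabeled mini-batch
            # supervised gradients (logistic log-loss) on the labeled mini-batch
            pc = sigmoid(Xc_lab[li] @ wc)
            pu = sigmoid(Xu_lab[li] @ wu)
            gc = Xc_lab[li].T @ (pc - y[li]) / len(y[li])
            gu = Xu_lab[li].T @ (pu - y[li]) / len(y[li])
            # disagreement gradients: push the two views' predictions together
            # on unlabeled pages (gradient of 0.5 * (qc - qu)^2 w.r.t. wc, wu)
            qc = sigmoid(Xc_unlab[ui] @ wc)
            qu = sigmoid(Xu_unlab[ui] @ wu)
            diff = qc - qu
            gc += lambda_ * Xc_unlab[ui].T @ (diff * qc * (1 - qc)) / batch
            gu -= lambda_ * Xu_unlab[ui].T @ (diff * qu * (1 - qu)) / batch
            wc -= lr * gc
            wu -= lr * gu
    return wc, wu
```

A larger lambda_ weights agreement on unlabeled data more heavily relative to fit on the labeled set; in practice the unlabeled mini-batches can be drawn from the effectively unlimited pool of crawled university pages mentioned above.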