Using graph-based metrics with empirical risk minimization to speed up active learning on networked data

Authors:
Sofus A. Macskassy
Affiliations:
Fetch Technologies, El Segundo, CA, USA
Venue:
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2009

Citing 17
Cited 3

Selective Sampling Using the Query by Committee Algorithm

Machine Learning
Enhanced hypertext categorization using hyperlinks

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Learning to extract symbolic knowledge from the World Wide Web

AAAI '98/IAAI '98 Proceedings of the fifteenth national/tenth conference on Artificial intelligence/Innovative applications of artificial intelligence
Activity monitoring: noticing interesting changes in behavior

KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
Automating the Construction of Internet Portals with Machine Learning

Information Retrieval
Toward Optimal Active Learning through Sampling Estimation of Error Reduction

ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
Linkage and Autocorrelation Cause Feature Selection Bias in Relational Learning

ICML '02 Proceedings of the Nineteenth International Conference on Machine Learning
Active + Semi-supervised Learning = Robust Multi-View Learning

ICML '02 Proceedings of the Nineteenth International Conference on Machine Learning
Support Vector Machine Active Learning with Application sto Text Classification

ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning
Employing EM and Pool-Based Active Learning for Text Classification

ICML '98 Proceedings of the Fifteenth International Conference on Machine Learning
Learning from Labeled and Unlabeled Data using Graph Mincuts

ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
Learning relational probability trees

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Semi-supervised learning using randomized mincuts

ICML '04 Proceedings of the twenty-first international conference on Machine learning
Label propagation through linear neighborhoods

ICML '06 Proceedings of the 23rd international conference on Machine learning
Classification in Networked Data: A Toolkit and a Univariate Case Study

The Journal of Machine Learning Research
Active learning with statistical models

Journal of Artificial Intelligence Research
Probabilistic classification and clustering in relational data

IJCAI'01 Proceedings of the 17th international joint conference on Artificial intelligence - Volume 2

Combining link and content for collective active learning

CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Batch Mode Active Learning for Networked Data

ACM Transactions on Intelligent Systems and Technology (TIST)
Semi-supervised correction of biased comment ratings

Proceedings of the 21st international conference on World Wide Web

Quantified Score

Hi-index	0.00

Visualization

Abstract

Active and semi-supervised learning are important techniques when labeled data are scarce. Recently a method was suggested for combining active learning with a semi-supervised learning algorithm that uses Gaussian fields and harmonic functions. This classifier is relational in nature: it relies on having the data presented as a partially labeled graph (also known as a within-network learning problem). This work showed yet again that empirical risk minimization (ERM) was the best method to find the next instance to label and provided an efficient way to compute ERM with the semi-supervised classifier. The computational problem with ERM is that it relies on computing the risk for all possible instances. If we could limit the candidates that should be investigated, then we can speed up active learning considerably. In the case where the data is graphical in nature, we can leverage the graph structure to rapidly identify instances that are likely to be good candidates for labeling. This paper describes a novel hybrid approach of using of community finding and social network analytic centrality measures to identify good candidates for labeling and then using ERM to find the best instance in this candidate set. We show on real-world data that we can limit the ERM computations to a fraction of instances with comparable performance.