Semi-supervised approach to rapid and reliable labeling of large data sets

Authors:
György J. Simon; Vipin Kumar; Zhi-Li Zhang
Affiliations:
University of Minnesota, Minneapolis, MN, USA; University of Minnesota, Minneapolis, MN, USA; University of Minnesota, Minneapolis, MN, USA
Venue:
Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2008



Abstract

In this paper, we propose a method in which a data set is labeled in a semi-supervised manner with user-specified guarantees about the quality of the labeling. We assume that for each class some heuristics are available, each of which can identify instances of one particular class. The heuristics are assumed to perform reasonably well, but they need not cover all instances of the class, nor need they be perfectly reliable. We further assume an infallible expert who is willing to manually label a few instances. The aim of the algorithm is to exploit the cluster structure of the problem, the predictions of the imperfect heuristics, and the limited perfect labels provided by the expert to classify (label) the instances of the data set with a guaranteed, user-specified precision with regard to each class. The specified precision is not always attainable, so the algorithm is allowed to classify some instances as dontknow. The algorithm is evaluated by the number of instances labeled by the expert, the number of dontknow instances (global coverage), and the achieved quality of the labeling. On the KDD Cup Network Intrusion data set containing 500,000 instances, we labeled 96.6% of the instances while guaranteeing a nominal precision of 90% (with 95% confidence) by having the expert label 630 instances; by having the expert label 1200 instances, we guaranteed 95% nominal precision while labeling 96.4% of the data. We also provide a case study of applying our scheme to label the network traffic collected at a large campus network.
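
The abstract does not say which statistical test backs the precision guarantee, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes that the expert checks a random sample of the instances a heuristic assigned to one class within one cluster, and that a one-sided Wilson score bound (swapped in here purely for illustration) is used to decide whether the heuristic's label can be accepted at the target precision. The names `precision_lower_bound`, `decide_label`, and `coverage` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def precision_lower_bound(correct: int, checked: int, confidence: float) -> float:
    """One-sided Wilson score lower bound on a heuristic's precision.

    ASSUMPTION: the Wilson bound is an illustrative choice; the abstract
    does not state which test the paper uses.
    """
    if checked == 0:
        return 0.0  # no expert labels yet, so nothing can be guaranteed
    z = NormalDist().inv_cdf(confidence)  # e.g. about 1.645 for 95%
    p = correct / checked
    centre = p + z * z / (2 * checked)
    margin = z * sqrt(p * (1 - p) / checked + z * z / (4 * checked * checked))
    return (centre - margin) / (1 + z * z / checked)


def decide_label(heuristic_class: str, expert_agrees: list[bool],
                 target_precision: float, confidence: float) -> str:
    """Accept the heuristic's class for a cluster, or fall back to dontknow.

    expert_agrees holds one flag per expert-checked instance: True when the
    expert's label matches the class predicted by the heuristic.
    """
    bound = precision_lower_bound(sum(expert_agrees), len(expert_agrees), confidence)
    return heuristic_class if bound >= target_precision else "dontknow"


def coverage(labels: list[str]) -> float:
    """Fraction of instances that received a class label other than dontknow."""
    if not labels:
        return 0.0
    return sum(label != "dontknow" for label in labels) / len(labels)
```

Under these assumptions, 50 expert-confirmed instances out of 50 give a lower bound of roughly 0.95 at 95% confidence, which clears a 90% target, while 10 out of 10 give only about 0.79 and the cluster would stay dontknow until the expert labels more instances. This illustrates the trade-off the abstract reports between the number of expert labels, the nominal precision, and the coverage.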