Text classification for data loss prevention

Authors:
Michael Hart;Pratyusa Manadhata;Rob Johnson
Affiliations:
Computer Science Department, Stony Brook University;HP Labs;Computer Science Department, Stony Brook University
Venue:
PETS'11 Proceedings of the 11th international conference on Privacy enhancing technologies
Year:
2011

Citing 30
Cited 0

Retrieval test evaluation of a rule based automatic indexing (AIR/PHYS)

Proc. of the third joint BCS and ACM symposium on Research and development in information retrieval
Support-Vector Networks

Machine Learning
Combining labeled and unlabeled data with co-training

COLT' 98 Proceedings of the eleventh annual conference on Computational learning theory
Classification by pairwise coupling

NIPS '97 Proceedings of the 1997 conference on Advances in neural information processing systems 10
Automatic Indexing: An Experimental Inquiry

Journal of the ACM (JACM)
Automatic Document Classification

Journal of the ACM (JACM)
An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages

SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
Supervised versus unsupervised binary-learning by feedforward neural networks

Machine Learning
Text classification in a hierarchical mixture model for small training sets

Proceedings of the tenth international conference on Information and knowledge management
Machine learning in automated text categorization

ACM Computing Surveys (CSUR)
Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms

Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms
Text Categorization with Suport Vector Machines: Learning with Many Relevant Features

ECML '98 Proceedings of the 10th European Conference on Machine Learning
CONSTRUE/TIS: A System for Content-Based Indexing of a Database of News Stories

IAAI '90 Proceedings of the The Second Conference on Innovative Applications of Artificial Intelligence
Transductive Inference for Text Classification using Support Vector Machines

ICML '99 Proceedings of the Sixteenth International Conference on Machine Learning
Feature Selection for Unbalanced Class Distribution and Naive Bayes

ICML '99 Proceedings of the Sixteenth International Conference on Machine Learning
Theoretical Views of Boosting and Applications

ALT '99 Proceedings of the 10th International Conference on Algorithmic Learning Theory
Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies

The VLDB Journal — The International Journal on Very Large Data Bases
Editorial: special issue on learning from imbalanced data sets

ACM SIGKDD Explorations Newsletter - Special issue on learning from imbalanced datasets
Filtering product reviews from web search results

Proceedings of the 2007 ACM symposium on Document engineering
Wikipedia-Based Kernels for Text Categorization

SYNASC '07 Proceedings of the Ninth International Symposium on Symbolic and Numeric Algorithms for Scientific Computing
Learning to classify short and sparse text & web with hidden topics from large-scale data collections

Proceedings of the 17th international conference on World Wide Web
Enhancing text clustering by leveraging Wikipedia semantics

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Introduction to Information Retrieval

Introduction to Information Retrieval
WikiRelate! computing semantic relatedness using wikipedia

AAAI'06 proceedings of the 21st national conference on Artificial intelligence - Volume 2
SMOTE: synthetic minority over-sampling technique

Journal of Artificial Intelligence Research
Computing semantic relatedness using Wikipedia-based explicit semantic analysis

IJCAI'07 Proceedings of the 20th international joint conference on Artifical intelligence
Teaching applied natural language processing: triumphs and tribulations

TeachNLP '05 Proceedings of the Second ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics
Learning to classify texts using positive and unlabeled data

IJCAI'03 Proceedings of the 18th international joint conference on Artificial intelligence
The WEKA data mining software: an update

ACM SIGKDD Explorations Newsletter
A general algorithm for approximate inference and its application to hybrid bayes nets

UAI'99 Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence

Quantified Score

Hi-index	0.00

Visualization

Abstract

Businesses, governments, and individuals leak confidential information, both accidentally and maliciously, at tremendous cost in money, privacy, national security, and reputation. Several security software vendors now offer "data loss prevention" (DLP) solutions that use simple algorithms, such as keyword lists and hashing, which are too coarse to capture the features what makes sensitive documents secret. In this paper, we present automatic text classification algorithms for classifying enterprise documents as either sensitive or non-sensitive. We also introduce a novel training strategy, supplement and adjust, to create a classifier that has a low false discovery rate, even when presented with documents unrelated to the enterprise. We evaluated our algorithm on several corpora that we assembled from confidential documents published on WikiLeaks and other archives. Our classifier had a false negative rate of less than 3.0% and a false discovery rate of less than 1.0% on all our tests (i.e, in a real deployment, the classifier can identify more than 97% of information leaks while raising at most 1 false alarm every 100th time).