Text classification for data loss prevention

  • Authors:
  • Michael Hart;Pratyusa Manadhata;Rob Johnson

  • Affiliations:
  • Computer Science Department, Stony Brook University;HP Labs;Computer Science Department, Stony Brook University

  • Venue:
  • PETS'11 Proceedings of the 11th international conference on Privacy enhancing technologies
  • Year:
  • 2011

Abstract

Businesses, governments, and individuals leak confidential information, both accidentally and maliciously, at tremendous cost in money, privacy, national security, and reputation. Several security software vendors now offer "data loss prevention" (DLP) solutions that use simple algorithms, such as keyword lists and hashing, which are too coarse to capture the features that make sensitive documents secret. In this paper, we present automatic text classification algorithms for classifying enterprise documents as either sensitive or non-sensitive. We also introduce a novel training strategy, supplement and adjust, to create a classifier that has a low false discovery rate, even when presented with documents unrelated to the enterprise. We evaluated our algorithm on several corpora that we assembled from confidential documents published on WikiLeaks and other archives. Our classifier had a false negative rate of less than 3.0% and a false discovery rate of less than 1.0% on all our tests (i.e., in a real deployment, the classifier can identify more than 97% of information leaks while raising at most 1 false alarm per 100 alarms).
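
The two rates quoted in the abstract can be made concrete with a short calculation. The sketch below uses the standard definitions of false negative rate (missed sensitive documents over all sensitive documents) and false discovery rate (false alarms over all alarms raised). The function names and the counts are hypothetical, chosen only for illustration; they are not the paper's code or results.

```python
def false_negative_rate(tp: int, fn: int) -> float:
    """Fraction of truly sensitive documents that the classifier misses."""
    total_sensitive = tp + fn
    if total_sensitive == 0:
        raise ValueError("no sensitive documents in the sample")
    return fn / total_sensitive


def false_discovery_rate(tp: int, fp: int) -> float:
    """Fraction of documents flagged as sensitive that are not sensitive."""
    total_flagged = tp + fp
    if total_flagged == 0:
        raise ValueError("classifier raised no alarms")
    return fp / total_flagged


# Hypothetical counts (assumption, not taken from the paper):
# 1000 sensitive documents, of which 980 are caught and 20 are missed,
# plus 5 non-sensitive documents wrongly flagged.
tp, fn, fp = 980, 20, 5

fnr = false_negative_rate(tp, fn)
fdr = false_discovery_rate(tp, fp)

print(f"false negative rate:  {fnr:.2%}")
print(f"false discovery rate: {fdr:.2%}")
```

With these illustrative counts both rates fall under the thresholds stated in the abstract (3.0% and 1.0%). Note that the false discovery rate is measured against the alarms raised, not against all documents scanned, which is why it reads as "false alarms per 100 alarms".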