Behavior-based email analysis with application to spam detection

  • Authors:
  • Salvatore J. Stolfo;Shlomo Hershkop

  • Affiliations:
  • Columbia University;Columbia University

  • Venue:
  • Behavior-based email analysis with application to spam detection
  • Year:
  • 2006

Abstract

Email is the "killer network application": it is ubiquitous and pervasive. In a relatively short time, the Internet has become irrevocably and deeply entrenched in modern society, primarily because of the power of its communication substrate, which links people and organizations around the globe. Much work on email technology has focused on making email easy to use, permitting a wide variety of information and information types to be sent conveniently, reliably, and efficiently throughout the Internet. However, the analysis of the vast storehouse of email content accumulated or produced by individual users has received relatively little attention other than for specific tasks such as spam and virus filtering. As one paper in the literature puts it, "the state of the art is still a messy desktop" (Denning, 1982). The problem is that email clients provide only partial information; users have to manage much on their own, which makes it hard to search or prioritize large amounts of email.
Our thesis is that advanced data mining can provide new opportunities for applications to increase email productivity and extract new information from email archives. This thesis presents an implemented framework for mining behavior models from email data. The Email Mining Toolkit (EMT) is a data mining toolkit designed to analyze offline email corpora, including the entire set of email sent and received by an individual user, revealing much information about individual users as well as the behavior of groups of users in an organization. A number of machine learning and anomaly detection algorithms are embedded in the system to model the user's email behavior in order to classify email for a variety of tasks. The work has been successfully applied to the clustering and classification of similar emails, to spam detection, and to forensic analysis that reveals information about users' behavior.
We organize the core functionality of EMT into a lightweight package called the Profiling Email Toolkit (PET). A novel contribution of PET is its focus on analyzing real-time email flow information from both an individual and an organization in a standard framework. PET includes new algorithms that combine multiple models, using a variety of features extracted from email, to achieve higher accuracy and lower false-positive rates than any single model across a variety of analytical tasks.
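The idea of combining several models into one decision can be illustrated with a minimal sketch. This is not PET's algorithm, which the abstract does not describe. It assumes each model emits a spam score in [0, 1] and that the scores are merged by a weighted average. The model names, weights, and threshold below are hypothetical.

```python
from typing import Dict


def combine_scores(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-model spam scores, each assumed to lie in [0, 1]."""
    total = sum(weights[name] for name in scores)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in scores) / total


def classify(scores: Dict[str, float], weights: Dict[str, float],
             threshold: float = 0.5) -> str:
    """Label a message 'spam' when the combined score reaches the threshold."""
    return "spam" if combine_scores(scores, weights) >= threshold else "ham"


# Hypothetical scores from three illustrative models for one message.
scores = {"content": 0.9, "sender_behavior": 0.2, "recipient_frequency": 0.4}
# Hypothetical weights, e.g. reflecting each model's accuracy on held-out mail.
weights = {"content": 2.0, "sender_behavior": 1.0, "recipient_frequency": 1.0}

print(round(combine_scores(scores, weights), 3))
print(classify(scores, weights))
```

Raising the threshold trades missed spam for fewer false positives, which is usually the more costly error in mail filtering.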