Adding Semantics to Email Clustering

Authors:
Hua Li;Dou Shen;Benyu Zhang;Zheng Chen;Qiang Yang
Affiliations:
Microsoft Research Asia, China;Hong Kong University of Science and Technology, Hong Kong;Microsoft Research Asia, China;Microsoft Research Asia, China;Hong Kong University of Science and Technology, Hong Kong
Venue:
ICDM '06 Proceedings of the Sixth International Conference on Data Mining
Year:
2006

Cited by:
Adapting LDA Model to Discover Author-Topic Relations for Email Analysis. DaWaK '08 Proceedings of the 10th International Conference on Data Warehousing and Knowledge Discovery
Finding topics in email using formal concept analysis and fuzzy membership functions. Canadian AI'08 Proceedings of the Canadian Society for Computational Studies of Intelligence, 21st Conference on Advances in Artificial Intelligence
Towards an integrated e-mail forensic analysis framework. Digital Investigation: The International Journal of Digital Forensics & Incident Response
Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation: The International Journal of Digital Forensics & Incident Response

Abstract

This paper presents a novel algorithm that clusters emails according to their contents and the sentence styles of their subject lines. The algorithm uses natural language processing and frequent itemset mining techniques to automatically generate meaningful generalized sentence patterns (GSPs) from email subjects. We then put forward a novel unsupervised approach that treats GSPs as pseudo class labels and conducts email clustering in a supervised manner, although no human labeling is involved. The proposed algorithm is expected not only to improve clustering performance but also to provide meaningful descriptions of the resulting clusters through the GSPs. Experimental results on an open dataset (the Enron email dataset) and on a personal email dataset that we collected demonstrate that the proposed algorithm outperforms the K-means algorithm in terms of the widely used F1 measure. Furthermore, the readability of cluster names is improved by 68.5% on the personal email dataset.
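The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: frequent word sets mined from subject lines stand in for GSPs (the paper's NLP-based generalization step is omitted), a simple nearest-centroid classifier stands in for the supervised step, and all names (`tokenize`, `mine_patterns`, `pseudo_label`, `cluster_emails`), the stopword list, and the thresholds are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical stopword list, chosen only for this illustration.
STOPWORDS = {"the", "a", "an", "of", "for", "to", "re", "fw", "on", "and", "in"}

def tokenize(text):
    """Lowercase, strip surrounding punctuation, and drop stopwords."""
    words = [w.strip(".,:;!?()[]\"'").lower() for w in text.split()]
    return [w for w in words if w and w not in STOPWORDS]

def mine_patterns(subjects, min_support=2, max_size=3):
    """Return frequent word sets (simplified stand-ins for GSPs) with support."""
    counts = Counter()
    for subject in subjects:
        words = sorted(set(tokenize(subject)))
        for size in range(1, max_size + 1):
            counts.update(frozenset(c) for c in combinations(words, size))
    return {p: n for p, n in counts.items() if n >= min_support}

def pseudo_label(subject, patterns):
    """Pick the largest, then most frequent, pattern contained in the subject."""
    words = set(tokenize(subject))
    matches = [p for p in patterns if p <= words]
    if not matches:
        return None
    return max(matches, key=lambda p: (len(p), patterns[p], sorted(p)))

def centroid(docs):
    """Average term-frequency vector of a group of documents."""
    total = Counter()
    for doc in docs:
        total.update(tokenize(doc))
    return {w: c / len(docs) for w, c in total.items()}

def similarity(doc, cen):
    """Unnormalized dot product between a document and a centroid."""
    return sum(cen.get(w, 0.0) for w in tokenize(doc))

def cluster_emails(emails, min_support=2):
    """emails: list of (subject, body) pairs. Returns one pattern per email."""
    patterns = mine_patterns([s for s, _ in emails], min_support)
    labels = [pseudo_label(s, patterns) for s, _ in emails]
    # "Train" on the pseudo-labeled emails: one centroid per pattern.
    groups = {}
    for (_, body), label in zip(emails, labels):
        if label is not None:
            groups.setdefault(label, []).append(body)
    centroids = {label: centroid(bodies) for label, bodies in groups.items()}
    if not centroids:
        return labels
    # Reassign every email, including those without a matching pattern.
    return [
        max(centroids,
            key=lambda l: (similarity(s + " " + b, centroids[l]), sorted(l)))
        for s, b in emails
    ]
```

Because each cluster is identified by the pattern that seeded it, the returned label doubles as a readable cluster description, which is the property the abstract attributes to GSPs. An email whose subject matches no pattern is still assigned to the closest cluster by its content.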