Automatically tagging email by leveraging other users' folders

Authors:
Yehuda Koren;Edo Liberty;Yoelle Maarek;Roman Sandler
Affiliations:
Yahoo! Research, Haifa, Israel;Yahoo! Research, Haifa, Israel;Yahoo! Research, Haifa, Israel;Yahoo! Research, Haifa, Israel
Venue:
Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2011

Abstract

Most email applications devote a significant part of their screen real estate to organization mechanisms such as folders. Yet we verified on the Yahoo! Mail service that 70% of email users have never defined a single folder. This implies that one of the best-known email features is underexploited. We propose to revive the feature by providing a method for generating a lighter form of folders, or tags, that benefits even the most passive users. The method automatically associates, whenever possible, an appropriate semantic tag with a given email. This gives rise to an alternative mechanism for organizing and searching email. We advocate a novel modeling approach that exploits the overall population of users, thereby learning from the wisdom of crowds how to categorize messages. Given our massive user base, it is enough to learn from the minority of users who label certain messages in order to label messages of that kind for the general population. We design a novel cascade classification approach that copes with the severe scalability and accuracy constraints we face. Significant efficiency gains are achieved by working within a low-dimensional latent space and by using a novel hierarchical classifier. The precision level is controlled by separating the task into a two-phase classification process. We performed an extensive empirical study covering three different time periods, over 100 million messages, and thousands of candidate tags per message. The results are encouraging and compare favorably with alternative approaches. Our method successfully tags 72% of incoming email traffic. Performance-wise, the computational overhead, even under large traffic surges, is sufficiently low for our approach to be applicable in production on any large Web mail service.
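
The two-phase idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no details of the latent space, the hierarchical classifier, or the phase boundaries, so everything below is an assumption made for illustration. In particular, the tag centroids, the use of cosine similarity, the function names `propose_tag` and `tag_message`, and the `threshold` value are all hypothetical. Phase 1 proposes the best-scoring candidate tag for a message vector; phase 2 keeps that tag only if its score clears a threshold, which is one simple way to trade coverage for precision.

```python
import math

# Hypothetical tag centroids in a 2-D latent space (illustrative values only,
# not taken from the paper).
CENTROIDS = {
    "travel": (1.0, 0.0),
    "finance": (0.0, 1.0),
}


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def propose_tag(vec):
    """Phase 1 (assumed): return the (tag, score) pair with the highest similarity."""
    scored = ((tag, cosine(vec, centroid)) for tag, centroid in CENTROIDS.items())
    return max(scored, key=lambda pair: pair[1])


def tag_message(vec, threshold=0.9):
    """Phase 2 (assumed): keep the proposed tag only if it clears the threshold."""
    tag, score = propose_tag(vec)
    return tag if score >= threshold else None
```

In this sketch a message close to one centroid, such as `(0.9, 0.1)`, receives a tag, while an ambiguous message such as `(1.0, 1.0)` is left untagged. Raising `threshold` makes the tagger abstain more often, which matches the abstract's point that tags are assigned only "whenever possible".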