Most email applications devote a significant part of their real estate to organization mechanisms such as folders. Yet, we verified on the Yahoo! Mail service that 70% of email users have never defined a single folder. This implies that one of the best-known email features is underexploited. We propose to revive the feature by providing a method for generating a lighter form of folders, or tags, that benefits even the most passive users. The method automatically associates, whenever possible, an appropriate semantic tag with a given email. This gives rise to an alternative mechanism for organizing and searching email. We advocate a novel modeling approach that exploits the overall population of users, thereby learning from the wisdom of crowds how to categorize messages. Given our massive user base, it is enough to learn from the minority of users who label certain messages in order to label messages of that kind for the general population. We design a novel cascade classification approach that copes with the severe scalability and accuracy constraints we face. Significant efficiency gains are achieved by working within a low-dimensional latent space and by using a novel hierarchical classifier. The precision level is controlled by separating the task into a two-phase classification process. We performed an extensive empirical study covering three different time periods, over 100 million messages, and thousands of candidate tags per message. The results are encouraging and compare favorably with alternative approaches. Our method successfully tags 72% of incoming email traffic. The computational overhead, even under large traffic surges, is low enough for our approach to be applicable in production on any large Web mail service.
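The abstract does not spell out the two phases, so the following is only a minimal sketch of the general idea of a precision-controlled cascade, not the authors' method. It assumes messages are already embedded as short latent vectors, that each tag is summarized by a centroid, and that a per-tag threshold decides whether to tag or abstain. The names `propose_tag`, `cascade_tag`, `CENTROIDS`, and `THRESHOLDS`, and all numeric values, are hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def propose_tag(vec, centroids):
    """Phase 1 (assumed): pick the candidate tag with the most similar centroid."""
    scored = ((tag, cosine(vec, c)) for tag, c in centroids.items())
    return max(scored, key=lambda pair: pair[1])


def cascade_tag(vec, centroids, thresholds, default_threshold=0.9):
    """Phase 2 (assumed): keep the candidate only if it clears its threshold."""
    tag, score = propose_tag(vec, centroids)
    if score >= thresholds.get(tag, default_threshold):
        return tag
    return None  # abstain: leaving a message untagged protects precision


# Hypothetical 3-dimensional latent space with two tags.
CENTROIDS = {"travel": [1.0, 0.0, 0.0], "finance": [0.0, 1.0, 0.0]}
THRESHOLDS = {"travel": 0.8, "finance": 0.8}

confident = cascade_tag([0.9, 0.1, 0.0], CENTROIDS, THRESHOLDS)
ambiguous = cascade_tag([0.5, 0.5, 0.5], CENTROIDS, THRESHOLDS)
```

In this sketch, raising a threshold trades coverage for precision: more messages are left untagged, but those that are tagged are tagged with higher confidence. This is one plausible reading of why a method would tag only part of incoming traffic rather than all of it.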