A comparison of data preparation approaches for e-mail categorisation

  • Authors:
  • Helmut Berger;Dieter Merkl;Michael Dittenbach

  • Affiliations:
  • E-Commerce Competence Center (EC3), Donau City Strasse 1, A-1220 Wien, Austria.;Institut für Softwaretechnik und Interaktive Systeme, Technische Universität Wien, Favoritenstrasse 9-11/188, A-1040 Wien, Austria.;E-Commerce Competence Center (EC3), Donau City Strasse 1, A-1220 Wien, Austria

  • Venue:
  • International Journal of Intelligent Information and Database Systems
  • Year:
  • 2007

Abstract

This paper reports on experiments in multi-class e-mail categorisation with supervised and unsupervised machine learning techniques. To this end, Support Vector Machines, decision tree learners, instance-based classifiers, Naive Bayes classification approaches and Self-Organising Maps were applied. A word-based and a character n-gram document representation approach were employed in order to assess the categorisation performance of the various learning approaches. The results indicate a substantial increase in classification accuracy when e-mail header information is considered in the document representation. To a much lesser degree, word-based document representations are advantageous over n-gram representations.
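
The two document representations contrasted in the abstract can be illustrated with a minimal sketch. The abstract does not describe the authors' actual feature extraction, so everything below is an assumption made for illustration: the tokenisation rule, the n-gram length of 3, the use of the subject line as the only header field, and the function names `word_features`, `char_ngram_features` and `represent`.

```python
from collections import Counter
import re


def word_features(text):
    """Word-based representation: counts of lower-cased alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def char_ngram_features(text, n=3):
    """Character n-gram representation over lower-cased text.

    Whitespace runs are collapsed to single spaces first; n=3 is an
    arbitrary choice for this sketch, not a value taken from the paper.
    """
    normalised = " ".join(text.lower().split())
    return Counter(normalised[i:i + n] for i in range(len(normalised) - n + 1))


def represent(subject, body, featurise, use_header=True):
    """Build a feature vector, optionally including header text.

    Only the subject line stands in for "header information" here;
    which header fields the authors used is not stated in the abstract.
    """
    text = f"{subject} {body}" if use_header else body
    return featurise(text)


# Hypothetical message, invented for illustration only.
subject = "Project meeting"
body = "The meeting is moved to Friday."

words = represent(subject, body, word_features)
trigrams = represent(subject, body, char_ngram_features)
body_only = represent(subject, body, word_features, use_header=False)
```

Either kind of feature vector could then be passed to any of the learners named in the abstract. Toggling `use_header` mirrors the comparison of representations with and without header information.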