A three-step preprocessing algorithm for minimizing e-mail document's atypical characteristics

Authors:
Ok-Ran Jeong;Dong-Sub Cho
Affiliations:
Department of Computer Science and Engineering, Ewha Womans University, Seoul, Korea;Department of Computer Science and Engineering, Ewha Womans University, Seoul, Korea
Venue:
FSKD'05 Proceedings of the Second international conference on Fuzzy Systems and Knowledge Discovery - Volume Part II
Year:
2005

Abstract

Documents in wide use today include many atypical characteristics. In particular, non-standard language appears more frequently in e-mail documents than in other documents because of the extensive use of informal expressions such as slang and abbreviations. The results of automatic document classification may differ significantly depending on the characteristics of the documents being classified, as well as on the classifier's performance. We propose a three-step, staged preprocessing algorithm for accurate automatic classification into each e-mail category. This research identifies the characteristics of e-mail documents and applies the three-step preprocessing algorithm to minimize their atypical characteristics.