Effectiveness of document representation for classification

  • Authors:
  • Ding-Yi Chen;Xue Li;Zhao Yang Dong;Xia Chen

  • Affiliations:
  • School of Information Technology and Electrical Engineering, University of Queensland, QLD, Australia;School of Information Technology and Electrical Engineering, University of Queensland, QLD, Australia;School of Information Technology and Electrical Engineering, University of Queensland, QLD, Australia;School of Information Technology and Electrical Engineering, University of Queensland, QLD, Australia

  • Venue:
  • DaWaK'05 Proceedings of the 7th international conference on Data Warehousing and Knowledge Discovery
  • Year:
  • 2005

Abstract

Conventionally, research on document classification focuses on improving the learning capabilities of classifiers. Nevertheless, according to our observations, the effectiveness of classification is limited by the suitability of the document representation. Intuitively, the more features that are used in a representation, the more comprehensively documents are represented. However, if a representation contains too many irrelevant features, the classifier suffers not only from the curse of dimensionality but also from overfitting. To address this problem of the suitability of document representations, we present a classifier-independent approach to measuring the effectiveness of document representations. Our approach utilises a labelled document corpus to estimate the distribution of documents in the feature space. By examining documents in this way, we can clearly identify the contributions made by different features to document classification. Experiments have been performed to show how the effectiveness is evaluated. Our approach can be used as a tool to assist feature selection, dimensionality reduction and document classification.
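
The abstract does not specify the measure itself, so the following is only an illustrative sketch of what a classifier-independent, per-feature score computed from a labelled corpus might look like. It uses a Fisher-style ratio of between-class to within-class variance as a stand-in, which is an assumption and not the authors' method. The function name `fisher_scores`, the toy term counts and the class labels are all hypothetical.

```python
from collections import defaultdict


def fisher_scores(vectors, labels):
    """Return, for each feature, between-class variance / within-class variance.

    This is a stand-in measure (an assumption), not the paper's own measure.
    No classifier is trained: only the labelled document vectors are used.
    """
    # Group the document vectors by their class label.
    by_class = defaultdict(list)
    for vec, label in zip(vectors, labels):
        by_class[label].append(vec)

    scores = []
    for j in range(len(vectors[0])):
        column = [vec[j] for vec in vectors]
        overall_mean = sum(column) / len(column)

        between = 0.0  # spread of the class means around the overall mean
        within = 0.0   # spread of documents around their own class mean
        for docs in by_class.values():
            values = [doc[j] for doc in docs]
            class_mean = sum(values) / len(values)
            between += len(values) * (class_mean - overall_mean) ** 2
            within += sum((v - class_mean) ** 2 for v in values)

        if within > 0:
            scores.append(between / within)
        elif between > 0:
            scores.append(float("inf"))  # classes perfectly separated
        else:
            scores.append(0.0)           # feature is constant everywhere
    return scores


# Hypothetical term counts for the features ["goal", "election", "the"].
vectors = [[3, 0, 5], [2, 0, 4], [0, 3, 5], [0, 2, 4]]
labels = ["sport", "sport", "politics", "politics"]

scores = fisher_scores(vectors, labels)
print(scores)  # [12.5, 12.5, 0.0]
```

In this toy corpus the topical terms score high, while the third feature scores zero because its distribution is identical in both classes. That is the kind of per-feature contribution a feature-selection or dimensionality-reduction step could act on.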