Document classification via structure synopses

  • Authors:
  • Liping Ma; John Shepherd; Anh Nguyen

  • Affiliations:
  • School of Computer Science and Engineering, University of New South Wales, Sydney, NSW 2052, Australia (all authors)

  • Venue:
  • ADC '03 Proceedings of the 14th Australasian database conference - Volume 17
  • Year:
  • 2003

Abstract

Information available on the Internet is frequently supplied simply as plain ASCII text, structured according to orthographic and semantic conventions. Traditional document classification is typically formulated as a learning problem in which each instance is a whole document represented by a feature vector. Such feature vectors are often generated from the presence and frequencies of words in the documents. The high dimensionality of these feature vectors causes two problems: important clues might be missed, and the classification might be misled by trivial elements. In this paper, we propose a method that makes use of structuring conventions to reduce the size of the feature vector without affecting the accuracy of the classification process. A synopsis of the document structure is extracted, containing only the most informative features; a succinct feature vector is then generated to represent the instance. Finally, a decision tree machine learning algorithm is used to classify the document based on its succinct feature vector.
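
The synopsis step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not list the structural features the authors use, so the cues below (header-style lines, list items, upper-case lines, blank-line ratio) and the names `structure_synopsis`, `HEADER_RE`, `LIST_RE` and `FEATURE_NAMES` are assumptions chosen for illustration only, not the paper's feature set.

```python
import re

# Hypothetical structural cues; the paper's real features are not given in the abstract.
HEADER_RE = re.compile(r"^[A-Za-z][A-Za-z-]*:\s")   # "Subject: ..." style lines
LIST_RE = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")    # bulleted or numbered items

FEATURE_NAMES = (
    "n_lines", "blank_ratio", "header_ratio",
    "list_ratio", "caps_ratio", "mean_line_len",
)


def structure_synopsis(text):
    """Return a short, fixed-length vector describing the layout of a plain-text document."""
    lines = text.splitlines()
    n = len(lines)
    if n == 0:
        return [0.0] * len(FEATURE_NAMES)

    non_blank = [ln for ln in lines if ln.strip()]
    blank = n - len(non_blank)
    headers = sum(1 for ln in lines if HEADER_RE.match(ln))
    items = sum(1 for ln in lines if LIST_RE.match(ln))
    caps = sum(1 for ln in non_blank if ln.strip().isupper())
    # Mean length is taken over non-blank lines only.
    mean_len = sum(len(ln) for ln in non_blank) / len(non_blank) if non_blank else 0.0

    return [float(n), blank / n, headers / n, items / n, caps / n, mean_len]


if __name__ == "__main__":
    sample = "From: a@b.c\nSubject: hi\n\nHELLO\n- one\n- two\n"
    for name, value in zip(FEATURE_NAMES, structure_synopsis(sample)):
        print(f"{name}: {value:.3f}")
```

The vector has six entries regardless of vocabulary size, in contrast to a word-frequency vector with one entry per distinct word. Vectors of this kind, paired with class labels, could then be passed to any decision tree learner for the final classification step.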