Document classification via structure synopses

Authors:
Liping Ma;John Shepherd;Anh Nguyen
Affiliations:
School of Computer Science and Engineering, University of New South Wales, Sydney, NSW 2052, Australia;School of Computer Science and Engineering, University of New South Wales, Sydney, NSW 2052, Australia;School of Computer Science and Engineering, University of New South Wales, Sydney, NSW 2052, Australia
Venue:
ADC '03 Proceedings of the 14th Australasian database conference - Volume 17
Year:
2003

Abstract

Information available on the Internet is frequently supplied simply as plain ASCII text, structured according to orthographic and semantic conventions. Traditional document classification is typically formulated as a learning problem in which each instance is a whole document represented by a feature vector. Such feature vectors are often generated from the occurrence and frequency of words in the documents. The high dimensionality of these feature vectors causes problems: important clues may be missed, and the classifier may be misled by trivial elements. In this paper, we propose a method that uses structuring conventions to reduce the size of the feature vector without affecting classification accuracy. In effect, a synopsis of the document structure is extracted, containing only the most informative features; a succinct feature vector is then generated to represent the instance. Finally, a decision tree machine learning algorithm classifies the document based on its succinct feature vector.
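
The pipeline described in the abstract (structure synopsis, then succinct feature vector, then decision tree) can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration, not the authors' method: the line-level cues in `CUES`, the five-element vector in `FEATURES`, and the one-level decision tree (a stump) that stands in for a full decision tree learner. The abstract does not say which structural features the synopsis actually contains.

```python
import re
from collections import Counter

# Hypothetical orthographic cues; the abstract does not list the real ones.
CUES = {
    "header": re.compile(r"^[A-Za-z][A-Za-z\- ]*:\s"),  # "Key: value" lines
    "bullet": re.compile(r"^\s*(?:[-*]|\d+[.)])\s"),    # list items
    "indented": re.compile(r"^\s{2,}\S"),               # indented text
    "blank": re.compile(r"^\s*$"),                      # empty lines
}
FEATURES = ["header", "bullet", "indented", "blank", "text"]


def synopsis(text):
    """Label each line with the first structural cue it matches."""
    labels = []
    for line in text.splitlines():
        for name, pattern in CUES.items():
            if pattern.search(line):
                labels.append(name)
                break
        else:
            labels.append("text")  # no cue matched: ordinary prose
    return labels


def feature_vector(text):
    """Succinct vector: the fraction of lines carrying each label."""
    labels = synopsis(text)
    total = len(labels) or 1  # avoid division by zero on empty input
    return [labels.count(name) / total for name in FEATURES]


def majority(items):
    """Most common item; ties resolve to the one seen first."""
    return Counter(items).most_common(1)[0][0]


def train_stump(vectors, labels):
    """One-level decision tree: the split with the fewest training errors."""
    best = None
    for f in range(len(vectors[0])):
        for threshold in sorted({v[f] for v in vectors}):
            low = [y for v, y in zip(vectors, labels) if v[f] <= threshold]
            high = [y for v, y in zip(vectors, labels) if v[f] > threshold]
            if not low or not high:
                continue  # this threshold does not separate anything
            low_class, high_class = majority(low), majority(high)
            errors = sum(y != low_class for y in low)
            errors += sum(y != high_class for y in high)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, low_class, high_class)
    if best is None:
        raise ValueError("all training vectors are identical")
    return best[1:]


def classify(stump, text):
    """Apply a trained stump to the feature vector of a new document."""
    f, threshold, low_class, high_class = stump
    return low_class if feature_vector(text)[f] <= threshold else high_class
```

A stump trained on a few mail-like and list-like documents would split on whichever fraction best separates the two classes. The point of the sketch is only that the vector has five dimensions regardless of vocabulary size, which is the size reduction the abstract argues for.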