Effectiveness of document representation for classification

Authors:
Ding-Yi Chen;Xue Li;Zhao Yang Dong;Xia Chen
Affiliations:
School of Information Technology and Electrical Engineering, University of Queensland, QLD, Australia;School of Information Technology and Electrical Engineering, University of Queensland, QLD, Australia;School of Information Technology and Electrical Engineering, University of Queensland, QLD, Australia;School of Information Technology and Electrical Engineering, University of Queensland, QLD, Australia
Venue:
DaWaK'05 Proceedings of the 7th international conference on Data Warehousing and Knowledge Discovery
Year:
2005

Abstract

Conventionally, research on document classification focuses on improving the learning capabilities of classifiers. However, we observe that the effectiveness of classification is limited by the suitability of the document representation. Intuitively, the more features a representation uses, the more comprehensively it represents documents. Yet if a representation contains too many irrelevant features, the classifier suffers not only from the curse of dimensionality but also from overfitting. To address this problem of suitability, we present a classifier-independent approach to measuring the effectiveness of document representations. Our approach utilises a labelled document corpus to estimate the distribution of documents in the feature space. Viewing documents in this way lets us identify the contributions that different features make to document classification. We report experiments that show how the effectiveness is evaluated. Our approach can be used as a tool to assist feature selection, dimensionality reduction and document classification.
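The abstract does not spell out the measure itself, so the following is only an illustrative sketch of what a classifier-independent check on a representation might look like, not the authors' method. It assumes documents are token lists, a representation is a set of retained features, and "effectiveness" is approximated by how often two documents with different labels collapse onto the same point in the feature space. The function name `conflict_rate` and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict


def conflict_rate(docs, labels, features):
    """Fraction of documents that share their feature-space position
    with at least one document carrying a different label.

    Assumption (not from the paper): a document's position is the set of
    retained features it contains, i.e. a binary bag-of-words projection.
    """
    groups = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        # Project the document onto the chosen representation.
        key = frozenset(t for t in tokens if t in features)
        groups[key][label] += 1
    # A group is in conflict when it holds more than one distinct label.
    conflicted = sum(sum(c.values()) for c in groups.values() if len(c) > 1)
    return conflicted / len(docs) if docs else 0.0


# Hypothetical labelled corpus, for illustration only.
docs = [
    ["cheap", "pills", "now"],
    ["cheap", "flights", "now"],
    ["meeting", "now"],
    ["cheap", "pills"],
]
labels = ["spam", "ham", "ham", "spam"]

coarse = conflict_rate(docs, labels, {"now"})
medium = conflict_rate(docs, labels, {"cheap", "now"})
fine = conflict_rate(docs, labels, {"cheap", "now", "pills", "flights"})
print(coarse, medium, fine)
```

A lower rate suggests that the representation keeps differently labelled documents apart without reference to any particular classifier. A rate of zero can also be reached trivially by keeping every feature, so a measure of this kind would need to be weighed against dimensionality.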