Rich document representation and classification: An analysis

Authors:
Mostafa Keikha;Ahmad Khonsari;Farhad Oroumchian
Affiliations:
University of Tehran, Department of Electrical and Computer Engineering, Gorgan 49139-66883, Iran;University of Tehran, Department of Electrical and Computer Engineering, Gorgan 49139-66883, Iran;University of Wollongong in Dubai, The College of Informatics and Computer Science, United Arab Emirates
Venue:
Knowledge-Based Systems
Year:
2009



Abstract

Three factors are involved in text classification: the classification model, the similarity measure, and the document representation model. In this paper, we focus on document representation and demonstrate that the choice of document representation has a profound impact on the quality of the classifier. Our experiments use the centroid-based text classifier, a simple and robust text classification scheme. We compare four types of document representation: N-grams, single terms, phrases, and RDR, a logic-based document representation. The N-gram representation is string-based and involves no linguistic processing. The single-term approach is based on words with minimal linguistic processing. The phrase approach is based on linguistically formed phrases and single words. RDR is based on linguistic processing and represents documents as a set of logical predicates. We experimented with many text collections and obtained similar results; here, we base our arguments on experiments conducted on Reuters-21578. We show that RDR, the more complex representation, produces a more effective classifier on Reuters-21578, followed by the phrase approach.
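To make the setup concrete, the following is a minimal sketch of a centroid-based classifier in which the document representation can be swapped, here between character N-grams and single terms. It is an illustration only, not the authors' implementation: the abstract does not specify the term weighting or the similarity measure, so raw counts and cosine similarity are assumed, and the function names, the `TRAIN` documents, and the category labels are all hypothetical toy values rather than Reuters-21578 data. The phrase and RDR representations require linguistic processing and are not sketched.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Counts of overlapping character n-grams; no linguistic processing."""
    s = " ".join(text.lower().split())
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def single_terms(text):
    """Counts of lowercase whitespace-separated words (minimal processing)."""
    return Counter(text.lower().split())


def norm(vec):
    """Euclidean length of a sparse vector stored as a dict."""
    return math.sqrt(sum(w * w for w in vec.values()))


def cosine(a, b):
    """Cosine similarity of two sparse vectors (assumed similarity measure)."""
    na, nb = norm(a), norm(b)
    if na == 0.0 or nb == 0.0:
        return 0.0
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    return dot / (na * nb)


def train_centroids(labeled_docs, represent):
    """Average the unit-length document vectors of each class into a centroid."""
    sums, counts = {}, Counter()
    for text, label in labeled_docs:
        vec = represent(text)
        length = norm(vec)
        acc = sums.setdefault(label, Counter())
        if length > 0.0:
            for k, w in vec.items():
                acc[k] += w / length
        counts[label] += 1
    return {
        label: {k: w / counts[label] for k, w in acc.items()}
        for label, acc in sums.items()
    }


def classify(text, centroids, represent):
    """Assign the class whose centroid is most similar to the document."""
    vec = represent(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical toy training data; not drawn from Reuters-21578.
TRAIN = [
    ("wheat harvest and grain exports rise", "grain"),
    ("corn and wheat crop prices fall", "grain"),
    ("crude oil prices climb as opec cuts output", "crude"),
    ("oil barrels and crude refinery output", "crude"),
]

if __name__ == "__main__":
    for represent in (char_ngrams, single_terms):
        centroids = train_centroids(TRAIN, represent)
        label = classify("wheat and corn grain shipments", centroids, represent)
        print(represent.__name__, "->", label)
```

Because the representation is passed in as a function, the classifier and the similarity measure stay fixed while only the document representation varies, which mirrors the kind of comparison the abstract describes.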