A feature mining based approach for the classification of text documents into disjoint classes

Authors:
Salvador Nieto Sánchez;Evangelos Triantaphyllou;Donald Kraft
Affiliations:
Department of Industrial and Manufacturing Systems Engineering, 3128 CEBA Building, Louisiana State University, Baton Rouge, LA;Department of Industrial and Manufacturing Systems Engineering, 3128 CEBA Building, Louisiana State University, Baton Rouge, LA;Department of Computer Science, 286 Coates Hall, Louisiana State University, Baton Rouge, LA
Venue:
Information Processing and Management: an International Journal
Year:
2002

Abstract

This paper proposes a new approach for classifying text documents into two disjoint classes. The approach extracts patterns, in the form of two logical expressions defined on features (indexing terms) of the documents, that describe the two classes of positive and negative examples. The extraction is performed by a data mining approach called One Clause At a Time (OCAT), which is based on mathematical logic. A logic-based approach to text document classification is critical when one must be able to justify why a particular document has been assigned to one class rather than the other. This situation occurs, for instance, when declassifying documents that were previously considered important to national security and are therefore currently kept secret. Computational experiments investigated the effectiveness of the OCAT-based approach, compared it with the well-known vector space model (VSM), and examined which indexing terms are best suited to making these classification decisions. The results of these experiments, on a sample of 2897 text documents from the TIPSTER collection, indicate that the OCAT-based approach has many advantages over the VSM approach for this type of text document classification problem. Moreover, a guided strategy for the OCAT-based approach is presented for deciding which document to consider next while building the training example sets.
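The idea of describing each class with a logical expression over indexing terms can be illustrated with a small sketch. The abstract does not give the algorithm's details, so everything below is an assumption made for illustration: documents are modeled as sets of terms, each expression is a conjunction of clauses (CNF) built one clause at a time, and a simple greedy score stands in for whatever search the OCAT approach actually uses. The names `infer_cnf`, `build_clause`, and `classify`, the toy documents, and the three-way "positive / negative / undecided" rule are all hypothetical.

```python
from typing import List, Sequence, Set, Tuple

Literal = Tuple[str, bool]   # (term, True) = term present; (term, False) = term absent
Clause = List[Literal]       # disjunction (OR) of literals
CNF = List[Clause]           # conjunction (AND) of clauses


def holds(lit: Literal, doc: Set[str]) -> bool:
    term, present = lit
    return (term in doc) == present


def clause_accepts(clause: Clause, doc: Set[str]) -> bool:
    return any(holds(lit, doc) for lit in clause)


def cnf_accepts(cnf: CNF, doc: Set[str]) -> bool:
    return all(clause_accepts(clause, doc) for clause in cnf)


def build_clause(accept: Sequence[Set[str]], reject: Sequence[Set[str]],
                 vocabulary: Set[str]) -> Clause:
    """Greedily build one clause that accepts every document in `accept`
    while accepting as few documents in `reject` as it can (assumed heuristic)."""
    candidates = [(t, p) for p in (True, False) for t in sorted(vocabulary)]
    uncovered = list(accept)        # documents the clause does not accept yet
    still_rejected = list(reject)   # documents the clause still rejects
    clause: Clause = []
    while uncovered:
        def score(lit: Literal) -> float:
            gain = sum(holds(lit, d) for d in uncovered)
            cost = sum(holds(lit, d) for d in still_rejected)
            return gain / (cost + 1)
        best = max(candidates, key=score)
        clause.append(best)
        uncovered = [d for d in uncovered if not holds(best, d)]
        still_rejected = [d for d in still_rejected if not holds(best, d)]
    return clause


def infer_cnf(accept: Sequence[Set[str]], reject: Sequence[Set[str]]) -> CNF:
    """Add one clause at a time until every document in `reject` is rejected."""
    vocabulary = set().union(*accept, *reject)
    cnf: CNF = []
    remaining = list(reject)
    while remaining:
        clause = build_clause(accept, remaining, vocabulary)
        survivors = [d for d in remaining if clause_accepts(clause, d)]
        if len(survivors) == len(remaining):
            raise ValueError("greedy heuristic made no progress on these examples")
        cnf.append(clause)
        remaining = survivors
    return cnf


def classify(doc: Set[str], pos_rule: CNF, neg_rule: CNF) -> str:
    """Combine the two expressions; the 'undecided' outcome is an assumption."""
    p, n = cnf_accepts(pos_rule, doc), cnf_accepts(neg_rule, doc)
    if p and not n:
        return "positive"
    if n and not p:
        return "negative"
    return "undecided"


# Toy training sets (invented for illustration, not taken from TIPSTER).
positives = [{"oil", "price"}, {"oil", "export"}]
negatives = [{"court", "ruling"}, {"price", "court"}]

positive_rule = infer_cnf(positives, negatives)   # describes the positive class
negative_rule = infer_cnf(negatives, positives)   # describes the negative class

print(positive_rule, negative_rule)
print(classify({"oil", "tanker"}, positive_rule, negative_rule))
```

Because each rule is an explicit expression over terms, a classification can be traced back to the clauses a document satisfied or violated, which is the kind of justification the abstract argues for. A similarity score such as the one produced by a vector space model does not expose that reasoning directly.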