Feature Reduction and Database Maintenance in NETNEWS Classification

Authors:
Wen-Lin Hsu;Sheau-Dong Lang
Venue:
IDEAS '99 Proceedings of the 1999 International Symposium on Database Engineering & Applications
Year:
1999

Abstract

We propose a statistical feature-reduction technique that filters out the most ambiguous articles in the training data used for categorizing NETNEWS articles. We also incorporate a batch updating scheme to periodically maintain the term structures of the news database after training. The baseline method combines the terms of all the training articles of each newsgroup, so that each newsgroup is represented as a single vector. After training, incoming news articles are classified based on their similarity to the existing newsgroup categories. Our implementation uses an inverted file to store the trained term structures of each newsgroup, and a list similar to the inverted file to buffer newly arrived articles, for efficient routing and updating. Our experimental results using real NETNEWS articles and newsgroups demonstrate that (1) applying feature reduction to the training set improves routing accuracy, efficiency, and database storage; (2) updating improves routing accuracy; and (3) the batch technique improves the efficiency of the updating operation.
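
The baseline routing and batch updating described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the term weighting, the similarity measure, or the buffer layout, so raw term frequency, cosine similarity, and the names `NewsgroupIndex`, `tokenize`, `route`, and `flush` are all assumptions made for illustration. The feature-reduction step is omitted because its statistical criterion is not given here.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Hypothetical tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()


class NewsgroupIndex:
    """Illustrative sketch: one term vector per newsgroup, stored as an inverted file."""

    def __init__(self):
        # Inverted file: term -> {newsgroup: accumulated term frequency}.
        self.postings = defaultdict(dict)
        # Buffer with the same shape, holding routed articles not yet merged.
        self.buffer = defaultdict(Counter)

    def train(self, labeled_articles):
        # Combine the terms of all training articles of a newsgroup into one vector.
        for group, text in labeled_articles:
            for term, tf in Counter(tokenize(text)).items():
                self.postings[term][group] = self.postings[term].get(group, 0) + tf

    def _norms(self):
        # Euclidean length of each newsgroup vector, needed for cosine similarity.
        squares = Counter()
        for by_group in self.postings.values():
            for group, weight in by_group.items():
                squares[group] += weight * weight
        return {group: math.sqrt(s) for group, s in squares.items()}

    def classify(self, text):
        # Assumed similarity measure: cosine between article and newsgroup vectors.
        query = Counter(tokenize(text))
        dots = Counter()
        for term, q in query.items():
            # Only postings of terms present in the article are touched.
            for group, weight in self.postings.get(term, {}).items():
                dots[group] += q * weight
        if not dots:
            return None  # no term overlap with any newsgroup
        norms = self._norms()
        query_norm = math.sqrt(sum(q * q for q in query.values()))
        return max(dots, key=lambda g: dots[g] / (norms[g] * query_norm))

    def route(self, text):
        # Classify an incoming article and buffer its terms instead of updating at once.
        group = self.classify(text)
        if group is not None:
            for term, tf in Counter(tokenize(text)).items():
                self.buffer[term][group] += tf
        return group

    def flush(self):
        # Batch update: merge all buffered terms into the inverted file in one pass.
        for term, by_group in self.buffer.items():
            for group, tf in by_group.items():
                self.postings[term][group] = self.postings[term].get(group, 0) + tf
        self.buffer.clear()
```

In this sketch, `route` leaves the trained postings untouched until `flush` is called, so many arrivals are merged in a single pass over the buffer. That is the general idea behind batching; the paper's actual scheduling policy and data layout may differ.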