Classification algorithms for NETNEWS articles

Authors:
Wen-Lin Hsu; Sheau-Dong Lang
Affiliations:
School of Computer Science, University of Central Florida, Orlando, FL; School of Computer Science, University of Central Florida, Orlando, FL
Venue:
Proceedings of the eighth international conference on Information and knowledge management
Year:
1999


Abstract

We propose several algorithms that use the vector space model to classify news articles posted on NETNEWS according to newsgroup category. The baseline method combines the terms of all the training-set articles in each newsgroup, so that each newsgroup is represented by a single vector. After training, incoming news articles are classified by their similarity to the existing newsgroup categories. We propose three techniques to improve the classification performance of the baseline method: (1) using routing (classification) accuracy and the similarity values to refine the training set; (2) updating the underlying term structures periodically during testing; and (3) applying k-means clustering to partition the articles of each newsgroup and representing each newsgroup by k vectors. Our test collection consists of real news articles posted to the 519 subnewsgroups under the REC newsgroup of NETNEWS over a period of 3 months. Our experimental results show that refining the training set reduces storage by one-third to two-thirds. Periodic updates improve routing accuracy, with improvements ranging from 20% to 100%, but incur runtime overhead. Finally, representing each newsgroup by k vectors (with k = 2 or 3) obtained by clustering yields the most significant improvement in routing accuracy, ranging from 60% to 100%, at the cost of only slightly higher storage requirements.
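
The baseline method described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the term weighting or the similarity measure, so raw term frequencies and cosine similarity are assumed here, and the tokenizer, the function names, the newsgroup labels, and the toy articles are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def term_vector(text):
    """Raw term-frequency vector (assumed weighting; a plain whitespace tokenizer)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts (assumed measure)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def train_baseline(training_set):
    """Combine the terms of all training articles of each newsgroup into one vector."""
    group_vectors = defaultdict(Counter)
    for newsgroup, text in training_set:
        group_vectors[newsgroup].update(term_vector(text))
    # Each newsgroup maps to a list of vectors; the baseline stores exactly one.
    return {group: [vec] for group, vec in group_vectors.items()}


def classify(article, model):
    """Route an article to the newsgroup with the most similar representative vector."""
    query = term_vector(article)
    best_group, best_score = None, -1.0
    for group, vectors in model.items():
        score = max(cosine(query, vec) for vec in vectors)
        if score > best_score:
            best_group, best_score = group, score
    return best_group, best_score


# Toy training data with illustrative labels, not the paper's collection.
training_set = [
    ("rec.autos", "engine oil brake tire engine"),
    ("rec.autos", "car dealer engine transmission"),
    ("rec.music", "guitar chord album band"),
    ("rec.music", "album concert band tour"),
]
model = train_baseline(training_set)
group, score = classify("which engine oil for my car", model)
print(group, round(score, 3))
```

Because the model maps each newsgroup to a list of vectors, a k-vector variant in the spirit of technique (3) could reuse `classify` unchanged by storing k cluster centroids per newsgroup instead of one combined vector.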