We propose a statistical feature-reduction technique that filters the most ambiguous articles out of the training data used to categorize NETNEWS articles. We also incorporate a batch updating scheme that periodically maintains the term structures of the news database after training. The baseline method combines the terms of all training-set articles in each newsgroup so that every newsgroup is represented as a single vector. After training, incoming news articles are classified by their similarity to the existing newsgroup categories. Our implementation stores the trained term structures of each newsgroup in an inverted file and buffers newly arrived articles in a list similar to the inverted file, for efficient routing and updating. Our experimental results on real NETNEWS articles and newsgroups demonstrate that (1) applying feature reduction to the training set improves routing accuracy, efficiency, and database storage; (2) updating improves routing accuracy; and (3) the batch technique improves the efficiency of the updating operation.
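The routing and batch-updating scheme described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `NewsgroupRouter`, its method names, the whitespace tokenizer, raw term-frequency weights, and cosine similarity are all assumptions made for illustration, since the abstract does not specify the weighting or the similarity measure. The sketch keeps one vector per newsgroup in an inverted file (term to per-newsgroup weight), collects new articles in a buffer of the same shape, and merges the buffer into the inverted file in a single batch pass.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


class NewsgroupRouter:
    """Illustrative sketch: one term vector per newsgroup, stored as an inverted file."""

    def __init__(self):
        # Inverted file: term -> {newsgroup: accumulated term frequency}.
        self.index = defaultdict(dict)
        # Buffer for newly arrived articles, with the same layout as the index.
        self.buffer = defaultdict(dict)

    @staticmethod
    def _add(structure, group, text):
        # Fold one article's term counts into the given term structure.
        for term, tf in Counter(tokenize(text)).items():
            postings = structure[term]
            postings[group] = postings.get(group, 0) + tf

    def train(self, group, articles):
        # Combine the terms of all training articles of a newsgroup into one vector.
        for text in articles:
            self._add(self.index, group, text)

    def route(self, text):
        # Return the newsgroup whose vector is most similar to the article
        # (cosine similarity; an assumption), or None if no term matches.
        query = Counter(tokenize(text))
        dots = defaultdict(float)
        for term, tf in query.items():
            for group, weight in self.index.get(term, {}).items():
                dots[group] += tf * weight
        if not dots:
            return None
        # Newsgroup vector norms, recomputed on each call to keep the sketch short.
        squares = defaultdict(float)
        for postings in self.index.values():
            for group, weight in postings.items():
                squares[group] += weight * weight
        # The query norm is the same for every newsgroup, so it is omitted.
        return max(dots, key=lambda g: dots[g] / math.sqrt(squares[g]))

    def buffer_article(self, group, text):
        # Record a new article without touching the inverted file.
        self._add(self.buffer, group, text)

    def flush(self):
        # Batch update: merge all buffered postings into the inverted file at once.
        for term, pending in self.buffer.items():
            postings = self.index[term]
            for group, tf in pending.items():
                postings[group] = postings.get(group, 0) + tf
        self.buffer.clear()
```

In this sketch, buffered articles do not influence `route` until `flush` is called, which is one simple way to model periodic maintenance: each term's postings are touched once per batch rather than once per arriving article.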