In this article, an approach based on unknown words is proposed for meaningful term extraction and discriminative term selection in text categorization. For meaningful term extraction, a phrase-like unit (PLU)-based likelihood ratio is proposed to estimate the likelihood that a word sequence is an unknown word. For term selection, a discriminative measure is proposed and combined with the PLU-based likelihood ratio to determine the text category. We conducted several experiments on a news corpus called MSDN, collected from an online news website maintained by the Min-Sheng Daily News, Taiwan. The corpus contains 44,675 articles with over 35 million words. The experimental results show that the system using a simple classifier achieved 95.31% accuracy. With a state-of-the-art classifier, kNN, the average accuracy is 96.40%, outperforming all the other systems evaluated on the same collection: the traditional term-word by kNN (88.52%), sleeping-experts (82.22%), sparse phrase by four-word sleeping-experts (86.34%), and Boolean combinations of words by RIPPER (87.54%). A proposed purification process effectively reduces the dimensionality of the feature space from 50,576 terms in the word-based approach to 19,865 terms in the unknown word-based approach. In addition, more than 80% of the automatically extracted terms are meaningful. Experiments also show that the proportion of meaningful terms extracted from the training data is related to the classification accuracy in outside testing.
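As an illustration of the kNN classification step named above, the following is a minimal sketch and not the authors' system. It assumes that each document has already been reduced to a list of extracted terms, that similarity is plain cosine similarity over raw term counts, and that the category is chosen by an unweighted majority vote; the abstract does not specify any of these choices. The function names `cosine` and `knn_classify` and the toy data in `TRAINING_DOCS` are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def knn_classify(query_terms, training_docs, k=3):
    """Return the majority category among the k most similar training documents.

    training_docs is a list of (term_list, category) pairs.
    """
    query = Counter(query_terms)
    scored = sorted(
        ((cosine(query, Counter(terms)), category) for terms, category in training_docs),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(category for _, category in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data, not drawn from the MSDN corpus.
TRAINING_DOCS = [
    (["stock", "market", "shares"], "finance"),
    (["bank", "interest", "market"], "finance"),
    (["match", "team", "score"], "sports"),
    (["team", "coach", "season"], "sports"),
]

print(knn_classify(["market", "shares", "bank"], TRAINING_DOCS, k=3))
```

In this sketch the quality of the term lists drives the result, which is consistent with the abstract's observation that extracting more meaningful terms goes together with higher accuracy; a real system would likely also weight terms rather than use raw counts.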