Meaningful term extraction and discriminative term selection in text categorization via unknown-word methodology

Authors:
Yu-Sheng Lai; Chung-Hsien Wu
Affiliations:
National Cheng Kung University, Tainan, Taiwan; National Cheng Kung University, Tainan, Taiwan
Venue:
ACM Transactions on Asian Language Information Processing (TALIP)
Year:
2002

Abstract

In this article, an approach based on unknown words is proposed for meaningful term extraction and discriminative term selection in text categorization. For meaningful term extraction, a phrase-like unit (PLU)-based likelihood ratio is proposed to estimate the likelihood that a word sequence is an unknown word. For term selection, a discriminative measure is proposed and combined with the PLU-based likelihood ratio to determine the text category. We conducted several experiments on a news corpus called MSDN, collected from an online news website maintained by the Min-Sheng Daily News, Taiwan. The corpus contains 44,675 articles with over 35 million words. The experimental results show that the system using a simple classifier achieved 95.31% accuracy. With a state-of-the-art classifier, kNN, the average accuracy is 96.40%, outperforming all the other systems evaluated on the same collection, including the traditional term-word approach with kNN (88.52%); sleeping-experts (82.22%); sparse phrases with four-word sleeping-experts (86.34%); and Boolean combinations of words with RIPPER (87.54%). A proposed purification process effectively reduces the dimensionality of the feature space from 50,576 terms in the word-based approach to 19,865 terms in the unknown word-based approach, a reduction of about 61%. In addition, more than 80% of automatically extracted terms are meaningful. Experiments also show that the proportion of meaningful terms extracted from training data is related to the classification accuracy in outside testing.
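
The abstract does not give the details of the kNN classifier or of the PLU-based likelihood ratio, so the following is only a generic sketch of kNN text categorization over extracted terms, not the authors' method. It assumes raw term counts as features, cosine similarity, and similarity-weighted voting; the names `cosine`, `knn_classify`, and `TRAINING`, along with the toy English terms and labels, are hypothetical and chosen purely for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(query_terms, training, k=3):
    """Label a document by a similarity-weighted vote of its k nearest neighbours.

    `training` is a list of (terms, label) pairs. This is an assumed,
    generic formulation; the paper's term weighting is not reproduced.
    """
    query = Counter(query_terms)
    scored = [(cosine(query, Counter(terms)), label) for terms, label in training]
    # Keep the k most similar training documents.
    nearest = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    votes = Counter()
    for similarity, label in nearest:
        votes[label] += similarity
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical toy training set: each document is a list of extracted terms.
TRAINING = [
    (["election", "vote", "party"], "politics"),
    (["vote", "parliament", "party"], "politics"),
    (["match", "goal", "team"], "sports"),
    (["team", "coach", "season"], "sports"),
]

if __name__ == "__main__":
    print(knn_classify(["party", "vote", "debate"], TRAINING, k=3))
```

In a setup like the one the abstract describes, the term lists would come from the term extraction and selection stages rather than from plain word segmentation, which is where the reported reduction in feature-space size would take effect.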