In automatic text categorization, the influence of each feature on the classification decision is set by its term weight, which is conventionally computed as the product of a term frequency factor and a collection frequency factor. The raw term frequency or its logarithmic form is generally used as the term frequency factor, whereas the leading collection frequency factors take into account the document frequency of each term. This study first shows that the best-fitting form of the term frequency factor depends on the distribution of term frequency values in the dataset under concern. Taking this observation into account, a novel collection frequency factor is proposed which considers term frequencies. Five datasets are first tested to show that the distribution of term frequency values is task dependent. The proposed method is then shown to provide better F1 scores than two recent approaches on the majority of the datasets considered. It is confirmed that using term frequencies in the collection frequency factor is beneficial on tasks that do not involve highly repeated terms. It is also shown that the best F1 scores are achieved on the majority of the datasets when a smaller number of features is considered.
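The conventional weighting scheme the abstract refers to can be sketched as follows. This is a minimal illustration of the standard tf-idf baseline (raw or logarithmic term frequency multiplied by a document-frequency-based collection frequency factor), not the novel collection frequency factor proposed in the paper; the function name and toy documents are illustrative assumptions.

```python
import math
from collections import Counter

def tfidf_weights(docs, use_log_tf=False):
    """Weight each term in each document as tf_factor * idf.

    tf_factor : raw term frequency, or 1 + log(tf) when use_log_tf is True
                (the logarithmic form mentioned in the abstract)
    idf       : log(N / df), the standard collection frequency factor
                based on each term's document frequency df.
    """
    n_docs = len(docs)
    # document frequency: number of documents containing each term
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    weights = []
    for doc in docs:
        tf = Counter(doc)
        w = {}
        for term, freq in tf.items():
            tf_factor = 1 + math.log(freq) if use_log_tf else freq
            w[term] = tf_factor * math.log(n_docs / df[term])
        weights.append(w)
    return weights

# toy corpus (hypothetical, for illustration only)
docs = [["cat", "cat", "dog"], ["dog", "fish"], ["cat", "fish", "fish"]]
w = tfidf_weights(docs)
```

Under this scheme a term occurring in every document receives zero weight regardless of its frequency, which is exactly the behavior the paper's frequency-aware collection frequency factor aims to refine.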