Nonlinear transformation of term frequencies for term weighting in text categorization

Authors:
Zafer Erenel;Hakan Altınçay
Affiliations:
Department of Computer Engineering, European University of Lefke, Gemikonağı-Lefke, Northern Cyprus;Department of Computer Engineering, Eastern Mediterranean University, Famagusta, Northern Cyprus
Venue:
Engineering Applications of Artificial Intelligence
Year:
2012


Abstract

In automatic text categorization, the influence of each feature on the decision is set by its term weight, which is conventionally computed as the product of a term frequency factor and a collection frequency factor. Raw term frequencies or their logarithmic forms are generally used as the term frequency factor, whereas the leading collection frequency factors take into account the document frequency of each term. In this study, it is first shown that the best-fitting form of the term frequency factor depends on the distribution of term frequency values in the dataset under consideration. Taking this observation into account, a novel collection frequency factor that considers term frequencies is proposed. Five datasets are first examined to show that the distribution of term frequency values is task dependent. The proposed method is then shown to provide better F1 scores than two recent approaches on the majority of the datasets considered. The results confirm that using term frequencies in the collection frequency factor is beneficial on tasks that do not involve highly repeated terms. It is also shown that the best F1 scores are achieved on the majority of the datasets when a smaller number of features is considered.
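The conventional weighting described in the abstract, a term frequency factor multiplied by a collection frequency factor, can be illustrated with a minimal sketch. This is not the collection frequency factor proposed in the paper, which the abstract does not specify. The sketch assumes the standard inverse document frequency, log(N / df), as a document-frequency-based collection factor and log(1 + tf) as the logarithmic term frequency form. The function name `term_weights` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def term_weights(docs, tf_form="raw"):
    """Weight each term in each document as tf_factor * collection_factor.

    docs: list of tokenized documents (lists of strings).
    tf_form: "raw" uses the raw count, anything else uses log(1 + tf).
    Both choices, and the idf collection factor, are illustrative assumptions.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        doc_weights = {}
        for term, tf in counts.items():
            tf_factor = tf if tf_form == "raw" else math.log(1 + tf)
            collection_factor = math.log(n_docs / df[term])  # assumed idf
            doc_weights[term] = tf_factor * collection_factor
        weights.append(doc_weights)
    return weights


# Toy corpus: "wheat" is repeated three times in the first document.
docs = [
    ["wheat", "wheat", "wheat", "price"],
    ["price", "oil"],
    ["oil", "oil", "export"],
]
raw = term_weights(docs, "raw")
log = term_weights(docs, "log")
print(raw[0]["wheat"], log[0]["wheat"])
```

In this toy corpus the repeated term "wheat" receives a noticeably smaller weight under the logarithmic form than under the raw form, which shows why the better-fitting form can depend on how term frequency values are distributed in a dataset.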