Hybrid DIAAF/RS: statistical textual feature selection for language-independent text classification

Authors:
Yanbo J. Wang; Fan Li; Frans Coenen; Robert Sanderson; Qin Xin
Affiliations:
Information Management Center, China Minsheng Banking Corp., Ltd., Beijing, China; Information Management Center, China Minsheng Banking Corp., Ltd., Beijing, China; Department of Computer Science, University of Liverpool, Liverpool, UK; Los Alamos National Laboratory, Los Alamos, New Mexico; Simula Research Laboratory, Oslo, Norway
Venue:
ICDM'10 Proceedings of the 10th industrial conference on Advances in data mining: applications and theoretical aspects
Year:
2010

Abstract

Textual Feature Selection (TFS) is an important phase in text classification. It aims to identify the most significant textual features (i.e., key words and/or phrases) in a textual dataset that serve to distinguish between text categories. Basic TFS techniques fall into two groups: linguistic and statistical. Because the goal is to build a language-independent text classifier, the study reported here is concerned with statistical TFS only. In this paper, we propose a novel statistical TFS approach that hybridizes the ideas of two existing techniques, DIAAF (Darmstadt Indexing Approach Association Factor) and RS (Relevancy Score). With respect to associative (text) classification, the experimental results demonstrate that the proposed approach can produce greater classification accuracy than alternative approaches.
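
The abstract names the two component techniques but does not state how they are combined. As a rough illustration only, the sketch below computes the two scores in formulations commonly given in the text categorization literature: DIAAF as the estimated probability of a category given a term, and RS as the log ratio of a term's document frequency inside versus outside a category, smoothed by a damping constant. The function names, the toy documents, and the damping value d = 1/6 are assumptions made for this example, not details taken from the paper, and the hybrid scoring itself is deliberately not reproduced.

```python
import math

# Each document is a (set_of_terms, category) pair. Hypothetical toy data.
DOCS = [
    ({"goal", "match"}, "sport"),
    ({"goal", "team"}, "sport"),
    ({"vote", "party"}, "politics"),
    ({"vote", "goal"}, "politics"),
]


def diaaf(docs, term, category):
    """Estimate P(category | term): of the documents containing the term,
    the fraction that belong to the category. Returns 0.0 for unseen terms."""
    categories_with_term = [c for terms, c in docs if term in terms]
    if not categories_with_term:
        return 0.0
    hits = sum(1 for c in categories_with_term if c == category)
    return hits / len(categories_with_term)


def relevancy_score(docs, term, category, d=1 / 6):
    """Log ratio of the term's document frequency inside the category to its
    frequency outside it. The damping constant d (an assumed value) keeps
    the ratio finite when a frequency is zero."""
    inside = [terms for terms, c in docs if c == category]
    outside = [terms for terms, c in docs if c != category]
    p_in = sum(term in t for t in inside) / len(inside) if inside else 0.0
    p_out = sum(term in t for t in outside) / len(outside) if outside else 0.0
    return math.log((p_in + d) / (p_out + d))


if __name__ == "__main__":
    for term in ("goal", "vote"):
        print(term, diaaf(DOCS, term, "sport"), relevancy_score(DOCS, term, "sport"))
```

Both scores depend only on term and category counts, never on grammar or morphology, which is what makes statistical TFS of this kind applicable regardless of language.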