Document-Base Extraction for Single-Label Text Classification

Authors:
Yanbo J. Wang;Robert Sanderson;Frans Coenen;Paul Leng
Affiliations:
Department of Computer Science, The University of Liverpool, Liverpool, UK L69 3BX;Department of Computer Science, The University of Liverpool, Liverpool, UK L69 3BX;Department of Computer Science, The University of Liverpool, Liverpool, UK L69 3BX;Department of Computer Science, The University of Liverpool, Liverpool, UK L69 3BX
Venue:
DaWaK '08 Proceedings of the 10th international conference on Data Warehousing and Knowledge Discovery
Year:
2008

Abstract

Many text mining applications, particularly those investigating Text Classification (TC), require experiments to be performed on common text-collections so that results can be compared with alternative approaches. With regard to single-label TC, most text-collections (textual data-sources) in their original form have at least one of the following limitations: the overall volume of textual data is too large for ease of experimentation; there are many predefined classes; most of the classes consist of only a very few documents; some documents are labeled with a single class whereas others have multiple classes; and some documents contain little or no actual text-content. In this paper, we propose a standard approach for automatically extracting "qualified" document-bases from a given textual data-source, so that they can be used more effectively and reliably in single-label TC experiments. The experimental results demonstrate that document-bases extracted using our approach can be used effectively in single-label TC experiments.
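
The abstract does not spell out the extraction procedure, so the following is only a minimal sketch of how the listed limitations might be filtered out, not the authors' method. The function name `extract_document_base`, its parameters (`max_classes`, `min_docs_per_class`, `min_words`), the thresholds, and the `(text, labels)` input format are all hypothetical choices made for illustration.

```python
from collections import Counter


def extract_document_base(docs, max_classes=10, min_docs_per_class=2, min_words=5):
    """Filter (text, labels) pairs into a single-label document-base.

    Hypothetical sketch: every threshold here is an assumption,
    not a value taken from the paper.
    """
    # Keep documents that carry exactly one label and enough text content.
    single = [
        (text, labels[0])
        for text, labels in docs
        if len(labels) == 1 and len(text.split()) >= min_words
    ]
    # Count documents per class, then keep at most max_classes of the
    # largest classes, discarding any that are still too small.
    counts = Counter(label for _, label in single)
    kept = {
        label
        for label, n in counts.most_common(max_classes)
        if n >= min_docs_per_class
    }
    return [(text, label) for text, label in single if label in kept]


# Toy collection: one multi-label document, one empty document,
# and one class ("y") left with too few documents after filtering.
docs = [
    ("a b c", ["x"]),
    ("a b c d", ["x"]),
    ("a b c", ["x", "y"]),
    ("", ["y"]),
    ("e f g", ["y"]),
    ("h i j", ["z"]),
    ("k l m", ["z"]),
]
result = extract_document_base(docs, max_classes=2, min_docs_per_class=2, min_words=3)
```

In this sketch the volume limitation is handled only indirectly, by capping the number of classes; a real extraction would likely also need collection-specific rules for what counts as usable text.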