A study of thresholding strategies for text categorization

Authors:
Yiming Yang
Affiliations:
Carnegie Mellon Univ., Pittsburgh, PA
Venue:
Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Year:
2001



Abstract

Thresholding strategies in automated text categorization are an underexplored area of research. This paper examines the effect of thresholding strategies on the performance of a classifier under various conditions. Using k-Nearest Neighbor (kNN) as the classifier and five evaluation benchmark collections as the testbeds, three common thresholding methods were investigated: rank-based thresholding (RCut), proportion-based assignments (PCut) and score-based local optimization (SCut). In addition, new variants of these methods are proposed to overcome significant problems in the existing approaches. Experimental results show that the choice of thresholding strategy can significantly influence the performance of kNN, and that the "optimal" strategy may vary by application. SCut is potentially better for fine-tuning but risks overfitting. PCut copes better with rare categories and exhibits a smoother trade-off in recall versus precision, but is not suitable for online decision making. RCut is most natural for online response but is too coarse-grained for global or local optimization. RTCut, a new method combining the strength of category ranking and scoring, outperforms both PCut and RCut significantly.
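The three baseline strategies named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the `scores[d][c]` layout, the PCut quota formula (scaling factor `x` times category prior times number of documents) and the use of F1 as the SCut tuning criterion are all assumptions made for illustration. RTCut is omitted because the abstract does not specify how it combines ranks and scores.

```python
from typing import List

# Assumed layout: scores[d][c] is the classifier score of document d for category c.


def rcut(scores: List[List[float]], t: int = 1) -> List[List[bool]]:
    """Rank-based: each document is assigned its t top-scoring categories."""
    decisions = []
    for row in scores:
        top = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:t]
        decisions.append([c in top for c in range(len(row))])
    return decisions


def pcut(scores: List[List[float]], priors: List[float], x: float = 1.0) -> List[List[bool]]:
    """Proportion-based: category c receives its k top-scoring documents.

    Assumption: k = round(x * prior[c] * number_of_documents). Needs the
    whole document batch at once, which is why it does not suit online use.
    """
    n_docs, n_cats = len(scores), len(priors)
    decisions = [[False] * n_cats for _ in range(n_docs)]
    for c in range(n_cats):
        k = min(n_docs, int(round(x * priors[c] * n_docs)))
        ranked = sorted(range(n_docs), key=lambda d: scores[d][c], reverse=True)
        for d in ranked[:k]:
            decisions[d][c] = True
    return decisions


def f1(pred: List[bool], gold: List[bool]) -> float:
    """F1 of binary predictions against binary gold labels."""
    tp = sum(p and g for p, g in zip(pred, gold))
    fp = sum(p and not g for p, g in zip(pred, gold))
    fn = sum(g and not p for p, g in zip(pred, gold))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def scut_fit(val_scores: List[List[float]], val_labels: List[List[bool]]) -> List[float]:
    """Score-based: per category, pick the threshold with the best validation F1."""
    thresholds = []
    for c in range(len(val_scores[0])):
        col = [row[c] for row in val_scores]
        gold = [row[c] for row in val_labels]
        best_t, best_f = float("inf"), 0.0  # inf means "never assign"
        for cand in sorted(set(col), reverse=True):
            score = f1([s >= cand for s in col], gold)
            if score > best_f:  # ties keep the higher threshold
                best_t, best_f = cand, score
        thresholds.append(best_t)
    return thresholds


def scut_apply(scores: List[List[float]], thresholds: List[float]) -> List[List[bool]]:
    """Assign category c to a document when its score reaches thresholds[c]."""
    return [[s >= t for s, t in zip(row, thresholds)] for row in scores]
```

In this sketch RCut decides each document independently of the others, PCut decides each category over the full batch, and SCut fits one threshold per category on held-out data, which mirrors the trade-offs the abstract describes: online suitability for RCut, batch dependence for PCut, and overfitting risk for SCut when a category has few validation examples.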