Although text categorization is a burgeoning area of IR research, readily available test collections in this field are surprisingly scarce. We describe a methodology and system (named ACCIO) for automatically acquiring labeled datasets for text categorization from the World Wide Web by capitalizing on the body of knowledge encoded in the structure of existing hierarchical directories such as the Open Directory. We define parameters of categories that make it possible to acquire numerous datasets with desired properties, which in turn allow better control over categorization experiments. In particular, we develop metrics that estimate the difficulty of a dataset by examining the host directory structure. These metrics are shown to be good predictors of the categorization accuracy that can be achieved on a dataset, and they serve as efficient heuristics for generating datasets subject to the user's requirements. A large collection of automatically generated datasets is made available for other researchers to use.
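The abstract does not spell out the difficulty metrics, so the following is only an illustrative sketch of one structural measure that could be read off a directory hierarchy: the number of edges between two category nodes. The function name `path_distance`, the slash-separated path format, and the working assumption that nearby categories are harder to tell apart are all hypothetical and are not taken from the paper.

```python
def path_distance(cat_a: str, cat_b: str) -> int:
    """Count the edges between two category nodes in a directory tree.

    Categories are given as slash-separated paths from the root,
    e.g. "Top/Science/Physics" (an assumed format, not ACCIO's own).
    """
    a = cat_a.strip("/").split("/")
    b = cat_b.strip("/").split("/")

    # Length of the shared prefix, i.e. depth of the lowest common ancestor.
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1

    # Edges from each node up to the common ancestor.
    return (len(a) - common) + (len(b) - common)


if __name__ == "__main__":
    # Sibling categories are close; categories in different branches are far.
    print(path_distance("Top/Science/Physics", "Top/Science/Chemistry"))
    print(path_distance("Top/Science/Physics", "Top/Arts/Music"))
```

Under the stated assumption, a pair of categories with a small distance would yield a harder two-class dataset than a pair drawn from distant branches; whether that holds in practice is exactly what the accuracy experiments described above would need to confirm.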