Parameterized generation of labeled datasets for text categorization based on a hierarchical directory

Authors:
Dmitry Davidov; Evgeniy Gabrilovich; Shaul Markovitch
Affiliations:
Technion, Haifa, Israel; Technion, Haifa, Israel; Technion, Haifa, Israel
Venue:
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Year:
2004

Abstract

Although text categorization is a burgeoning area of IR research, readily available test collections in this field are surprisingly scarce. We describe a methodology and system (named ACCIO) for automatically acquiring labeled datasets for text categorization from the World Wide Web, by capitalizing on the body of knowledge encoded in the structure of existing hierarchical directories such as the Open Directory. We define parameters of categories that make it possible to acquire numerous datasets with desired properties, which in turn allow better control over categorization experiments. In particular, we develop metrics that estimate the difficulty of a dataset by examining the host directory structure. These metrics are shown to be good predictors of the categorization accuracy achievable on a dataset, and serve as efficient heuristics for generating datasets subject to user requirements. A large collection of automatically generated datasets is made available for other researchers to use.
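
To make the idea of a structure-based difficulty metric concrete, the following is a minimal sketch and not ACCIO's actual implementation. It assumes that categories are given as slash-separated paths under a common root (for example "Top/Arts/Music"), that the directory is a plain tree, and that the number of edges between two categories can stand in as a rough proxy for how easy they are to tell apart. The function names category_distance and select_pairs, and the example paths, are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def category_distance(path_a: str, path_b: str, sep: str = "/") -> int:
    """Edges between two categories in a tree-shaped directory.

    Assumption: both paths start at the same root and the directory
    has no cross-links, so the shortest path runs through the deepest
    common ancestor.
    """
    a = path_a.strip(sep).split(sep)
    b = path_b.strip(sep).split(sep)
    # Count how many leading path components the two categories share.
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    # Steps up from a to the common ancestor, plus steps down to b.
    return (len(a) - common) + (len(b) - common)


def select_pairs(categories, min_dist, max_dist):
    """Return category pairs whose distance lies in [min_dist, max_dist].

    Hypothetical helper: a small distance suggests closely related
    (presumably harder) pairs, a large one suggests easier pairs.
    """
    return [
        (a, b)
        for a, b in combinations(categories, 2)
        if min_dist <= category_distance(a, b) <= max_dist
    ]


if __name__ == "__main__":
    cats = ["Top/Arts/Music", "Top/Arts/Movies", "Top/Science/Physics"]
    for a, b in combinations(cats, 2):
        print(a, b, category_distance(a, b))
```

Under these assumptions, a request such as select_pairs(cats, 2, 2) keeps only sibling categories, which is one simple way a user requirement on dataset difficulty could be turned into a filter over the directory.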