On Using Partial Supervision for Text Categorization

Authors:
Charu C. Aggarwal;Stephen C. Gates;Philip S. Yu
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2004

Abstract

In this paper, we discuss the merits of building text categorization systems using supervised clustering techniques. Traditional approaches to classifying documents into a predefined set of classes often fail to provide sufficient accuracy because it is difficult to fit a manually categorized collection of documents to a given classification model. This is especially true for heterogeneous collections of Web documents, which vary in style, vocabulary, and authorship. Hence, this paper investigates the use of clustering to create the set of categories and the use of those categories to classify documents. Completely unsupervised clustering has the disadvantage that it has difficulty isolating sufficiently fine-grained classes of documents relating to a coherent subject matter. We therefore use the information from a preexisting taxonomy to supervise the creation of a set of related clusters, while retaining some freedom in defining and creating the classes. We show that the advantage of partially supervised clustering is that it gives some control over the range of subjects the categorization system should address, while each category still has a precise mathematical definition. Documents can then be categorized very effectively by using this a priori knowledge of each category's definition. We also discuss a new technique that helps the classifier distinguish better among closely related clusters.
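
The general idea in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not specify the document representation, the similarity measure, or the refinement procedure, so the bag-of-words vectors, cosine similarity, fixed iteration count, and all function names below are assumptions chosen for illustration. The sketch seeds one cluster per taxonomy label, lets documents move to the nearest centroid (the "freedom" in defining classes), and then categorizes a new document by its nearest centroid.

```python
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Turn a document into a unit-length bag-of-words vector (term -> weight)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def centroid(vectors):
    """Sum a group of sparse vectors and rescale the result to unit length."""
    total = defaultdict(float)
    for vec in vectors:
        for t, w in vec.items():
            total[t] += w
    norm = math.sqrt(sum(w * w for w in total.values()))
    return {t: w / norm for t, w in total.items()} if norm else {}


def nearest(vec, centroids):
    """Return the label of the most similar centroid."""
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


def supervised_clusters(labeled_docs, iterations=3):
    """Seed one cluster per taxonomy label, then let documents move (assumed scheme)."""
    vectors = [vectorize(text) for text, _ in labeled_docs]
    groups = defaultdict(list)
    for vec, (_, label) in zip(vectors, labeled_docs):
        groups[label].append(vec)
    centroids = {label: centroid(vs) for label, vs in groups.items()}
    for _ in range(iterations):
        groups = defaultdict(list)
        for vec in vectors:
            groups[nearest(vec, centroids)].append(vec)
        # Clusters that attract no documents are dropped in this sketch.
        centroids = {label: centroid(vs) for label, vs in groups.items()}
    return centroids


def categorize(text, centroids):
    """Assign a new document to the cluster with the closest centroid."""
    return nearest(vectorize(text), centroids)


# Hypothetical toy taxonomy with two labels.
training = [
    ("stock market shares trading", "finance"),
    ("bank loan interest rates", "finance"),
    ("football match goal score", "sports"),
    ("tennis match serve score", "sports"),
]
centroids = supervised_clusters(training)
print(categorize("interest rates on a bank loan", centroids))
```

Because each category is just a centroid vector, its definition is explicit and can be inspected term by term, which loosely mirrors the abstract's point about categories having a precise mathematical definition.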