Text classification from unlabeled documents with bootstrapping and feature projection techniques

Authors:
Youngjoong Ko;Jungyun Seo
Affiliations:
Department of Computer Engineering, Dong-A University, 840 Hadan 2-dong, Saha-gu, Busan 604-714, Republic of Korea;Department of Computer Science and Program of Integrated Biotechnology, Sogang University, Sinsu-dong 1, Mapo-gu, Seoul 121-742, Republic of Korea
Venue:
Information Processing and Management: an International Journal
Year:
2009

Abstract

Many machine learning algorithms have been applied to text classification tasks. In the machine learning paradigm, a general inductive process automatically builds a text classifier by learning from examples, an approach generally known as supervised learning. Supervised approaches, however, have a notable drawback: they require a large number of labeled training documents to learn accurately. Unlabeled documents are plentiful and easy to collect, but labeled documents are difficult to produce because labeling must be done manually by human developers. In this paper, we propose a new text classification method based on unsupervised or semi-supervised learning. The proposed method starts a text classification task with only unlabeled documents and the title word of each category, and then automatically learns a text classifier using bootstrapping and feature projection techniques. Experimental results show that the proposed method achieves reasonably useful performance compared with a supervised method. Using the proposed method in a text classification task could make building text classification systems significantly faster and less expensive.
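
The general idea of bootstrapping from title words can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the abstract gives no procedural details, so the seeding rule, the word-count scoring, the function name `bootstrap_labels`, and the toy documents are all assumptions made for illustration, and the feature projection step is not modeled at all.

```python
from collections import Counter, defaultdict


def tokenize(text):
    # Naive whitespace tokenizer; an assumption, not taken from the paper.
    return text.lower().split()


def bootstrap_labels(docs, title_words, rounds=3):
    """Hypothetical self-training loop seeded by one title word per category."""
    tokens = [tokenize(d) for d in docs]

    # Seed step: a document containing exactly one category's title word
    # is labeled with that category.
    labels = {}
    for i, toks in enumerate(tokens):
        hits = [c for c, w in title_words.items() if w in toks]
        if len(hits) == 1:
            labels[i] = hits[0]

    for _ in range(rounds):
        # Build a word-count profile per category from the labeled documents.
        profiles = defaultdict(Counter)
        for i, c in labels.items():
            profiles[c].update(tokens[i])

        new = {}
        for i, toks in enumerate(tokens):
            if i in labels:
                continue
            # Score: share of the category's word mass matched by the document.
            scores = {
                c: sum(p[t] for t in toks) / sum(p.values())
                for c, p in profiles.items()
            }
            ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
            # Accept only a strictly best, non-zero score; leave ties unlabeled.
            if ranked and ranked[0][1] > 0 and (
                len(ranked) == 1 or ranked[0][1] > ranked[1][1]
            ):
                new[i] = ranked[0][0]

        if not new:
            break  # nothing left to label with the current profiles
        labels.update(new)

    return labels


# Toy data (invented for illustration only).
docs = [
    "sports fans watched the football match",
    "finance news bank raised interest rates",
    "football match ended in a draw",
    "bank cut interest rates again",
    "draw ended early",
]
title_words = {"sports": "sports", "finance": "finance"}
labels = bootstrap_labels(docs, title_words)
```

Only the first two toy documents contain a title word. The others are labeled in later rounds because they share vocabulary with documents labeled earlier, which is the bootstrapping effect the sketch is meant to show.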