Unsupervised topic detection model and its application in text categorization

Authors:
Yashodhara Haribhakta;Arti Malgaonkar;Parag Kulkarni
Affiliations:
College of Engineering, Pune Shivajinagar, Pune Maharashtra, India;College of Engineering, Pune Shivajinagar, Pune Maharashtra, India;College of Engineering, Pune Shivajinagar, Pune Maharashtra, India
Venue:
Proceedings of the CUBE International Information Technology Conference
Year:
2012

Abstract

Most research defines topic detection as the task of finding the different themes in a collection of documents. Our topic detection approach instead finds a topic for every document in the corpus, where a topic is any word or group of words that tells what the document is about. In this paper, we propose a novel topic detection approach using an unsupervised model. It is a simple yet effective approach for detecting topics and finding keywords in the corpus. The keywords are extracted by automatically identifying the relationships between the words in a set of unstructured data, without any training data. Keyword extraction is based on a hypothesis for word decomposition, which says that the bigram and trigram word vectors contain words that can be a potential distribution of words from the unigram word vector. After keyword extraction, our proposed topic detection algorithm determines the most suitable topic for each document. The topics detected in the entire corpus and the keywords related to each topic are stored and analyzed. We use the standard term frequency (TF) measure to find the keywords. The effectiveness and accuracy of the keywords are judged by using them as features for classification and comparing the results against the standard bag-of-words approach. The topics detected by our algorithm are found to be relevant to their documents. The experimental results show that using keywords drastically reduces the dimensionality of the corpus while maintaining, and in most cases improving, the F-measure of categorization. Thus, our approach to feature selection for text categorization not only improves classification accuracy but also considerably reduces the time required for classification.
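
The abstract does not spell out the algorithm, so the following is only a minimal sketch of one possible reading of the pipeline: count term frequency (TF) for unigrams, bigrams, and trigrams, keep the unigrams that appear inside a repeated bigram or trigram, and take the most frequent repeated n-gram as the document's topic. The function names, the stopword list, the `min_ngram_tf` threshold, and the tie-breaking rules are all assumptions made for illustration and are not taken from the paper. For simplicity, n-grams here may also span sentence boundaries.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not from the paper).
STOPWORDS = frozenset({"the", "a", "an", "of", "and", "in", "is", "to", "for", "on"})


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def ngram_tf(tokens, n):
    """Term frequency of every n-gram, keyed by a tuple of words."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def repeated_ngrams(tokens, min_ngram_tf):
    """Bigrams and trigrams that occur at least min_ngram_tf times."""
    repeated = Counter()
    for n in (2, 3):
        for gram, tf in ngram_tf(tokens, n).items():
            if tf >= min_ngram_tf:
                repeated[gram] = tf
    return repeated


def extract_keywords(text, top_k=5, min_ngram_tf=2):
    """Unigrams that a repeated bigram/trigram decomposes into, ranked by TF."""
    tokens = tokenize(text)
    unigram_tf = Counter(tokens)
    supported = {w for gram in repeated_ngrams(tokens, min_ngram_tf) for w in gram}
    # Sort by descending TF, then alphabetically to make ties deterministic.
    ranked = sorted(supported, key=lambda w: (-unigram_tf[w], w))
    return ranked[:top_k]


def detect_topic(text, min_ngram_tf=2):
    """Most frequent repeated bigram/trigram as a string, or None if there is none."""
    candidates = repeated_ngrams(tokenize(text), min_ngram_tf)
    if not candidates:
        return None
    # Prefer higher TF, then longer n-grams, then alphabetical order.
    best = min(candidates, key=lambda g: (-candidates[g], -len(g), g))
    return " ".join(best)


doc = ("Topic detection finds themes. Topic detection uses keywords. "
       "Keywords support topic detection in text.")
keywords = extract_keywords(doc)
topic = detect_topic(doc)
```

In this toy document only the bigram "topic detection" repeats, so it becomes the topic and its two component words become the keywords. A keyword set selected this way could then replace the full bag-of-words vocabulary as the feature set for a classifier, which is the comparison the abstract describes.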