Expert Systems with Applications: An International Journal
In most research, topic detection is defined as the task of finding the different themes in a collection of documents. Our approach instead finds a topic for every document in the corpus, where a topic is any word or group of words that tells what the document is about. In this paper, we propose a novel topic detection approach based on an unsupervised model. It is a simple yet effective approach for detecting topics and finding keywords in a corpus. The keywords are extracted automatically by identifying the relationships between words in a set of unstructured data, without any training data. Keyword extraction rests on a hypothesis for word decomposition: the words that make up the bigram and trigram word vectors are potential distributions of words from the unigram word vector. We use the standard term frequency (TF) measure to find the keywords. After keyword extraction, our proposed topic detection algorithm determines the most suitable topic for each document. The topics detected across the entire corpus, and the keywords related to each topic, are stored and analyzed. The effectiveness and accuracy of the keywords are judged by using them as features for classification and comparing the results against the standard bag-of-words approach. The topics detected by our algorithm are found to be relevant to their documents. The experimental results show that using keywords drastically reduces the dimensionality of the corpus while maintaining, and in most cases improving, the F-measure of categorization. Our approach to feature selection for text categorization therefore not only improves classification accuracy but also considerably reduces the time required for classification.
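The abstract does not give the algorithm's details, so the following Python sketch is only one possible reading of the pipeline it describes, not the authors' method. It assumes that keywords are the highest-TF unigrams plus any repeated bigram or trigram composed entirely of those unigrams, and that a document's topic is the keyword occurring most often in it, with longer phrases preferred on ties. The stopword list, the `top_unigrams` and `min_count` parameters, and all function names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; the abstract does not specify one).
STOPWORDS = {"a", "an", "and", "as", "in", "of", "the", "to"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def ngram_counts(tokens, n):
    """Term frequency of every n-gram (a tuple of words) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def extract_keywords(docs, top_unigrams=4, min_count=2):
    """Return {keyword tuple: corpus TF} under the assumed decomposition rule."""
    token_lists = [tokenize(d) for d in docs]

    # Corpus-wide unigram TF; the most frequent unigrams seed the keyword set.
    unigram_tf = Counter()
    for toks in token_lists:
        unigram_tf.update(ngram_counts(toks, 1))
    keywords = dict(unigram_tf.most_common(top_unigrams))
    seed_words = {gram[0] for gram in keywords}

    # Assumed reading of the hypothesis: keep a bigram or trigram only if it
    # repeats in the corpus and every word in it is one of the seed unigrams.
    for n in (2, 3):
        tf = Counter()
        for toks in token_lists:
            tf.update(ngram_counts(toks, n))
        for gram, count in tf.items():
            if count >= min_count and all(w in seed_words for w in gram):
                keywords[gram] = count
    return keywords


def detect_topic(doc, keywords):
    """Pick the keyword with the highest in-document TF, or None if none occur."""
    toks = tokenize(doc)
    doc_tf = Counter()
    for n in (1, 2, 3):
        doc_tf.update(ngram_counts(toks, n))
    present = [g for g in keywords if doc_tf[g] > 0]
    if not present:
        return None
    # Ties: prefer longer phrases, then higher corpus TF, then a fixed order.
    best = max(present, key=lambda g: (doc_tf[g], len(g), keywords[g], g))
    return " ".join(best)


docs = [
    "Neural network training improves neural network accuracy",
    "Stock market prices fall as stock market fears grow",
    "Neural network models predict stock market trends",
]
keywords = extract_keywords(docs)
for doc in docs:
    print(detect_topic(doc, keywords), "<-", doc)
```

On this toy corpus the first two documents are assigned "neural network" and "stock market". The resulting keyword set is also far smaller than the full vocabulary, which is the kind of dimensionality reduction the abstract reports when keywords replace bag-of-words features.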