It is often desirable to identify the concepts present in a corpus. A popular approach is to discover clusters of words, or topics, and many algorithms for this exist in the literature. Yet most of these methods lack the interpretability that would let a user unfamiliar with their inner workings interact with them. This paper proposes a graph-based topic extraction algorithm, which can also be viewed as a soft clustering of the words in a given corpus. Each topic, in the form of a set of words, represents an underlying concept in the corpus. The clustering process is easy to interpret, which makes room for user involvement at various steps. For a quantitative evaluation, we use the extracted topics as features to obtain a compact representation of documents for classification tasks. We compare the classification accuracy achieved with the reduced feature set from our method against that of two other topic extraction techniques, Latent Dirichlet Allocation and Non-negative Matrix Factorization. While the results of all three algorithms are comparable, the speed and easy interpretability of our algorithm make it better suited to interactive use by lay users.
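To make the idea of topics as a compact document representation concrete, here is a minimal sketch in Python. The abstract does not say how topics are turned into feature values, so the weighting used here (summing the occurrences of each topic's words in the document) is an assumption, and the function name `topic_features`, the two topics, and the sample document are all hypothetical.

```python
from collections import Counter


def topic_features(tokens, topics):
    """Map a tokenized document to one count per topic.

    `topics` is a list of word sets. Because the clustering is soft,
    a word may belong to several topics; it then contributes to each.
    The count-based weighting is an assumption, not the paper's method.
    """
    counts = Counter(tokens)
    # Counter returns 0 for words that do not occur in the document.
    return [sum(counts[word] for word in topic) for topic in topics]


# Hypothetical topics and document, made up for illustration only.
topics = [
    {"goal", "match", "team"},      # a sports-like concept
    {"election", "vote", "team"},   # "team" is shared between topics
]
doc = "the team won the match after a late goal".split()

features = topic_features(doc, topics)
print(features)  # one value per topic instead of one per vocabulary word
```

The resulting vector has as many entries as there are topics, which is typically far fewer than the vocabulary size, and it could be passed to any standard classifier in place of raw word counts.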