In the last decade, latent Dirichlet allocation (LDA) has been successful at discovering the statistical distribution of topics over an unstructured text corpus. Meanwhile, as the Internet has evolved, more and more document data come with rich human-provided tag information; such data are called semi-structured data. Semi-structured data contain both unstructured content (e.g., plain text) and metadata, such as papers with authors and web pages with tags. In general, different tags in a document play different roles and carry their own weights, so modeling such semi-structured documents is nontrivial. In this paper, we propose a novel topic model for tagged documents, called the Tag-Weighted Topic Model (TWTM). TWTM is a framework that leverages the tags in each document to infer the topic components of that document. This allows us not only to learn document-topic distributions, but also to infer tag-topic distributions for text mining tasks (e.g., classification, clustering, and recommendation). Moreover, TWTM automatically infers the probabilistic weights of the tags in each document. We present an efficient variational inference method with an EM algorithm for estimating the model parameters. The experimental results show that our TWTM approach outperforms the baseline algorithms over three corpora in document modeling and text classification.
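To make the idea concrete, below is a minimal toy sketch, not the paper's actual variational updates: it assumes each document's topic proportions are a tag-weighted mixture of per-tag topic distributions (psi), with document-specific tag weights (w) re-estimated by a plain EM loop alongside the topic-word distributions (phi). All variable names, the corpus, and the weight-update rule are illustrative assumptions, not TWTM's definitions.

import numpy as np

rng = np.random.default_rng(0)
K, V, T = 3, 50, 4                       # topics, vocabulary size, tags

# Toy corpus: each document is (word-id array, tag-id list).
docs = []
for _ in range(20):
    tags = rng.choice(T, size=2, replace=False).tolist()
    words = rng.integers(0, V, size=30)
    docs.append((words, tags))

# Random initialization of the (assumed) model parameters.
phi = rng.dirichlet(np.ones(V), size=K)           # K x V  topic-word
psi = rng.dirichlet(np.ones(K), size=T)           # T x K  tag-topic
w = [np.full(len(t), 1.0 / len(t)) for _, t in docs]  # per-doc tag weights

for _ in range(50):                               # EM iterations
    phi_new = np.zeros_like(phi) + 1e-6           # smoothing avoids zeros
    psi_new = np.zeros_like(psi) + 1e-6
    for d, (words, tags) in enumerate(docs):
        theta = w[d] @ psi[tags]                  # doc-topic mixture, shape (K,)
        # E-step: responsibility of each topic for each token.
        r = theta[:, None] * phi[:, words]        # K x N_d
        r /= r.sum(axis=0, keepdims=True)
        # M-step accumulation for the topic-word distributions.
        for k in range(K):
            np.add.at(phi_new[k], words, r[k])
        topic_counts = r.sum(axis=1)              # expected topic counts, (K,)
        # Crude tag-weight update: score each tag by how well its topic
        # profile matches the document's expected topic counts.
        scores = psi[tags] @ topic_counts
        w[d] = scores / scores.sum()
        # Spread the document's expected topic mass onto its tags.
        for i, t in enumerate(tags):
            psi_new[t] += w[d][i] * topic_counts
    phi = phi_new / phi_new.sum(axis=1, keepdims=True)
    psi = psi_new / psi_new.sum(axis=1, keepdims=True)

print("tag-topic distributions:\n", np.round(psi, 3))

The sketch shows the structural point of the abstract: because theta is built from tag-topic distributions rather than drawn independently per document, fitting the corpus forces tags to acquire interpretable topic profiles, and the learned per-document weights w indicate which tags dominate each document. The paper's actual model uses variational inference over a full probabilistic specification rather than these ad hoc updates.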