Topic models such as PLSI (Probabilistic Latent Semantic Indexing) and LDA (Latent Dirichlet Allocation) were originally developed for modeling the contents of plain texts. Recently, topic models for processing hypertexts such as web pages have also been proposed. These hypertext models are generative models that give rise to both words and hyperlinks. This paper argues that, to better represent the contents of hypertexts, it is preferable to treat the hyperlinks as fixed and to define the topic model over the generation of words only. The paper then proposes a new topic model for hypertext processing, referred to as the Hypertext Topic Model (HTM). HTM defines the distribution of words in a document (i.e., the content of the document) as a mixture over the latent topics of the document itself and the latent topics of the documents it cites. The topics are in turn characterized as distributions over words, as in conventional topic models. The paper further proposes a method for learning the HTM model. Experimental results on three datasets show that HTM outperforms the baselines on topic discovery and document classification.
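The core modeling idea above can be sketched numerically: a document's word distribution is a mixture over its own topic proportions and those of the documents it cites, with each topic being a distribution over words. The sketch below is illustrative only; all dimensions, parameter values, and names (`lam`, `theta_d`, `theta_c`, `phi`) are assumptions, not the paper's actual learned parameters or learning method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration
V, K = 6, 2                              # vocabulary size, number of topics
phi = rng.dirichlet(np.ones(V), size=K)  # topic-word distributions, shape (K, V)

# Topic proportions for a document d and for one document c that d cites
theta_d = np.array([0.8, 0.2])
theta_c = np.array([0.1, 0.9])

# Mixing weights: how much d's content draws on its own topics
# versus the topics of the document it cites (assumed values)
lam = np.array([0.7, 0.3])

def htm_word_dist(thetas, lam, phi):
    """HTM-style word distribution for a document:
    P(w) = sum_s lam[s] * sum_k theta_s[k] * phi[k, w],
    where s ranges over the document itself and its cited documents."""
    return sum(l * theta @ phi for l, theta in zip(lam, thetas))

p_w = htm_word_dist([theta_d, theta_c], lam, phi)
words = rng.choice(V, size=10, p=p_w)    # sample word tokens for the document
```

Because the hyperlinks (here, which documents `d` cites) are treated as fixed, only the words are generated; the citation structure simply determines which topic proportions enter the mixture.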