Going beyond Corr-LDA for detecting specific comments on news & blogs

Authors:
Mrinal Kanti Das;Trapit Bansal;Chiranjib Bhattacharyya
Affiliations:
Indian Institute of Science, Bangalore, India;Indian Institute of Science, Bangalore, India;Indian Institute of Science, Bangalore, India
Venue:
Proceedings of the 7th ACM international conference on Web search and data mining
Year:
2014

Abstract

Understanding user-generated comments in response to news and blog posts is an important area of research. After ignoring irrelevant comments, one finds that a large fraction, approximately 50%, of the comments are very specific and can be related to certain parts of the article rather than to the entire story. For example, in a recent product review of the Google Nexus 7 on ArsTechnica (a popular blog), the reviewer discusses the prospect of a "Retina equipped iPad mini" in a few sentences. Interestingly, although the article is about the Nexus 7, a significant number of comments focus on this specific point regarding the "iPad". We pose the problem of detecting such comments as the specific comments location (SCL) problem. SCL is an important open problem with no prior work. SCL can be posed as a correspondence problem between comments and the parts of the relevant article, and one could potentially use Corr-LDA type models. Unfortunately, such models do not give satisfactory performance, as they are restricted to a single topic vector per article-comments pair. In this paper we go beyond the single topic vector assumption and propose a novel correspondence topic model, namely SCTM, which admits multiple topic vectors (MTV) per article-comments pair. The resulting inference problem is quite complicated because of MTV and has no off-the-shelf solution. One of the major contributions of this paper is to show that, using a stick-breaking process as a prior over MTV, one can derive a collapsed Gibbs sampling procedure that empirically works well for SCL. SCTM is rigorously evaluated on three datasets, crawled from Yahoo! News (138,000 comments) and two blogs, ArsTechnica (AT) Science (90,000 comments) and AT-Gadget (160,000 comments). We observe that SCTM not only performs better than Corr-LDA on metrics such as perplexity and topic coherence but also discovers more unique topics. We see that this immediately leads to an order-of-magnitude improvement in F1 score over Corr-LDA for SCL.
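
The abstract names a stick-breaking process as the prior over the multiple topic vectors. As a rough illustration of that general construction only, and not of SCTM itself or its Gibbs sampler, the sketch below draws weights from a truncated stick-breaking process. The function name `stick_breaking_weights`, the concentration parameter `alpha`, and the truncation level `num_sticks` are assumptions introduced here for illustration; none of them come from the paper.

```python
import random


def stick_breaking_weights(alpha, num_sticks, seed=None):
    """Draw weights from a truncated stick-breaking process.

    Hypothetical helper for illustration; not the authors' model.
    Each step breaks off a Beta(1, alpha) fraction of the stick
    that is still left, so the weights are non-negative and sum to 1.
    """
    rng = random.Random(seed)
    weights = []
    remaining = 1.0  # length of the stick not yet assigned
    for _ in range(num_sticks - 1):
        fraction = rng.betavariate(1.0, alpha)
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    # Truncation: the final weight absorbs whatever stick is left.
    weights.append(remaining)
    return weights


# Illustrative values only (assumed, not taken from the paper).
weights = stick_breaking_weights(alpha=1.0, num_sticks=5, seed=0)
```

A smaller `alpha` tends to concentrate mass on the first few weights, while a larger `alpha` spreads it across more of them, which is one reason such priors are often used when the number of components is not fixed in advance.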