Learning document-level semantic properties from free-text annotations

Authors:
S. R. K. Branavan;Harr Chen;Jacob Eisenstein;Regina Barzilay
Affiliations:
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA;Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA;Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA;Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA
Venue:
Journal of Artificial Intelligence Research
Year:
2009

Abstract

This paper presents a new method for inferring the semantic properties of documents by leveraging free-text keyphrase annotations. Such annotations are becoming increasingly abundant due to the recent dramatic growth in semi-structured, user-generated online content. One especially relevant domain is product reviews, which are often annotated by their authors with pros/cons keyphrases such as "a real bargain" or "good value." These annotations are representative of the underlying semantic properties; however, unlike expert annotations, they are noisy: lay authors may use different labels to denote the same property, and some labels may be missing. To learn using such noisy annotations, we find a hidden paraphrase structure which clusters the keyphrases. The paraphrase structure is linked with a latent topic model of the review texts, enabling the system to predict the properties of unannotated documents and to effectively aggregate the semantic properties of multiple reviews. Our approach is implemented as a hierarchical Bayesian model with joint inference. We find that joint inference increases the robustness of the keyphrase clustering and encourages the latent topics to correlate with semantically meaningful properties. Multiple evaluations demonstrate that our model substantially outperforms alternative approaches for summarizing single and multiple documents into a set of semantically salient keyphrases.
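
As a rough illustration of the two ideas named in the abstract, grouping noisy keyphrases that denote the same property and aggregating properties across several reviews, the following minimal sketch may help. It is not the hierarchical Bayesian model described above: it substitutes a greedy clustering over token-level Jaccard overlap for the learned paraphrase structure, and it ignores the review text entirely. The function names, the 0.3 threshold, and the sample reviews are assumptions made for this example only.

```python
from collections import Counter


def tokens(phrase):
    """Lowercased bag of word types for one keyphrase."""
    return frozenset(phrase.lower().split())


def jaccard(a, b):
    """Jaccard overlap between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_keyphrases(phrases, threshold=0.3):
    """Greedy single-pass clustering (an illustrative stand-in, not the
    paper's inference): a phrase joins the first cluster that contains a
    member with Jaccard overlap >= threshold, else it starts a new one."""
    clusters = []
    for phrase in phrases:
        for cluster in clusters:
            if any(jaccard(tokens(phrase), tokens(m)) >= threshold for m in cluster):
                cluster.append(phrase)
                break
        else:
            clusters.append([phrase])
    return clusters


def aggregate(reviews, clusters):
    """Count how many reviews mention each cluster at least once."""
    index = {p: i for i, c in enumerate(clusters) for p in c}
    counts = Counter()
    for keyphrases in reviews:
        counts.update({index[p] for p in keyphrases})
    return counts


# Hypothetical author-supplied "pros" keyphrases for three reviews.
reviews = [
    ["good value", "friendly staff"],
    ["great value"],
    ["a real bargain", "staff friendly"],
]

# Unique keyphrases in first-seen order.
vocabulary = list(dict.fromkeys(p for r in reviews for p in r))
clusters = cluster_keyphrases(vocabulary)
support = aggregate(reviews, clusters)

for i, cluster in enumerate(clusters):
    print(support[i], cluster)
```

In this toy run, "good value" and "great value" end up in one cluster, while "a real bargain" stays in a cluster of its own because it shares no words with them. Surface overlap alone cannot recover that kind of paraphrase, which is the sort of gap the joint model in the paper is meant to address.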