Discovering diverse and salient threads in document collections

Authors:
Jennifer Gillenwater; Alex Kulesza; Ben Taskar
Affiliations:
University of Pennsylvania, Philadelphia, PA; University of Pennsylvania, Philadelphia, PA; University of Pennsylvania, Philadelphia, PA
Venue:
EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
Year:
2012


Abstract

We propose a novel probabilistic technique for modeling and extracting salient structure from large document collections. As in clustering and topic modeling, our goal is to provide an organizing perspective into otherwise overwhelming amounts of information. We are particularly interested in revealing and exploiting relationships between documents. To this end, we focus on extracting diverse sets of threads (singly linked, coherent chains of important documents). To illustrate, we extract research threads from citation graphs and construct timelines from news articles. Our method is highly scalable, running on a corpus of over 30 million words in about four minutes, more than 75 times faster than a dynamic topic model. Finally, the results from our model more closely resemble human news summaries according to several metrics and are also preferred by human judges.
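
To make "diverse sets of threads" concrete, here is a toy sketch. It is not the paper's probabilistic model, which the abstract does not specify. It swaps in a simple greedy heuristic: enumerate fixed-length chains in a small link graph, score each chain by the summed salience of its documents, and penalize documents already covered by an earlier pick. The document ids, salience values, links, and the `penalty` parameter are all made up for illustration.

```python
# Hypothetical toy data: document id -> salience score (illustrative values only).
salience = {"a": 3.0, "b": 2.0, "c": 2.5, "d": 1.0, "e": 2.0, "f": 1.5}
# Directed links from an earlier document to a later, related one.
links = {"a": ["b", "d"], "b": ["c"], "d": ["e"], "e": ["f"], "c": [], "f": []}


def chains(links, length):
    """Enumerate every path containing exactly `length` documents."""
    paths = [[doc] for doc in links]
    for _ in range(length - 1):
        paths = [p + [nxt] for p in paths for nxt in links[p[-1]]]
    return [tuple(p) for p in paths]


def pick_threads(salience, links, length, k, penalty=2.0):
    """Greedy stand-in (an assumption, not the authors' method):
    pick up to k chains, trading summed salience against document overlap."""
    candidates = chains(links, length)
    chosen, covered = [], set()
    for _ in range(k):
        remaining = [t for t in candidates if t not in chosen]
        if not remaining:
            break

        def gain(thread):
            overlap = len(covered & set(thread))
            return sum(salience[d] for d in thread) - penalty * overlap

        best = max(remaining, key=gain)
        chosen.append(best)
        covered |= set(best)
    return chosen


threads = pick_threads(salience, links, length=3, k=2)
print(threads)
```

With the overlap penalty, the second pick avoids reusing document `a` and falls on a disjoint chain. With `penalty=0.0` the second pick is simply the next most salient chain, even though it shares a document with the first.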