Identifying the Original Contribution of a Document via Language Modeling

Authors: Benyah Shaparenko; Thorsten Joachims
Affiliations: Department of Computer Science, Cornell University, Ithaca, USA 14853 (both authors)
Venue: ECML PKDD '09 Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases: Part II
Year: 2009

Abstract

One major goal of text mining is to provide automatic methods that help humans grasp the key ideas in ever-growing text corpora. To this end, we propose a statistically well-founded method for identifying the original ideas that a document contributes to a corpus, focusing on self-referential diachronic corpora such as research publications, blogs, email, and news articles. Our statistical model of passage impact defines (interesting) original content through a combination of impact and novelty, and we use the model to identify each document's most original passages. Unlike heuristic approaches, the statistical model is extensible and open to analysis. We evaluate the approach on synthetic data and on real data from the domains of research publications and news, showing that the passage impact model outperforms a heuristic baseline method.
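
The abstract does not give the model's equations, so the following is only a toy illustration of the idea of combining novelty and impact, not the authors' passage impact model. It assumes additively smoothed unigram language models and scores a passage by the average log-likelihood ratio between a model of later documents and a model of earlier documents: words that are rare before (novel) and common afterwards (impactful) raise the score. The function names, the smoothing constant, and the example texts are all hypothetical.

```python
import math
from collections import Counter


def unigram_model(texts, vocab, alpha=1.0):
    """Additively smoothed unigram model over a fixed vocabulary (assumed form)."""
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def avg_log_prob(passage, model):
    """Average per-word log-probability of a passage under a unigram model."""
    words = passage.lower().split()
    return sum(math.log(model[w]) for w in words) / len(words)


def originality(passage, earlier, later, vocab):
    """Toy score: high when the passage is unlikely under earlier documents
    (novelty) but likely under later documents (impact)."""
    past = unigram_model(earlier, vocab)
    future = unigram_model(later, vocab)
    return avg_log_prob(passage, future) - avg_log_prob(passage, past)


# Hypothetical miniature corpus: documents before and after the one under study.
earlier = ["topic models describe documents", "documents are mixtures of topics"]
later = ["kernel tricks improve topic models", "kernel tricks are popular"]
passages = ["topic models describe documents", "kernel tricks improve models"]

# Shared vocabulary so every passage word has a smoothed probability.
vocab = {w for t in earlier + later + passages for w in t.lower().split()}

scores = [originality(p, earlier, later, vocab) for p in passages]
best = max(zip(scores, passages))[1]
print(best)
```

In this sketch the second passage scores higher because "kernel" and "tricks" are absent from the earlier texts and recur in the later ones, while the first passage mostly repeats earlier vocabulary.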