Identifying the original contribution of a document via language modeling

Authors:
Benyah Shaparenko;Thorsten Joachims
Affiliations:
Cornell University, Ithaca, NY, USA;Cornell University, Ithaca, NY, USA
Venue:
Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
Year:
2009

Abstract

One goal of text mining is to give readers automatic methods for quickly finding the key ideas in individual documents and in whole corpora. To this end, we propose a statistically well-founded method for identifying the original ideas that a document contributes to a corpus, focusing on self-referential diachronic corpora such as research publications, blogs, email, and news articles. Our statistical model of passage impact defines (interesting) original content as a combination of impact and novelty, and it can be used to identify the most original passages in a document. Unlike heuristic approaches, this statistical model is extensible and open to analysis. We evaluate the approach on both synthetic and real data and show that the passage impact model outperforms a heuristic baseline method.
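
The abstract does not spell out the passage impact model, so the following is only a rough illustration of the idea of combining novelty and impact with language models, not the authors' method. It assumes add-one smoothed unigram models and a simple log-ratio score: a passage scores high when its words are unlikely under documents published earlier (novelty) and likely under documents published later (impact). The function names, the scoring rule, and the toy corpus are all hypothetical.

```python
import math
from collections import Counter


def unigram_lm(docs, vocab, alpha=1.0):
    """Add-alpha smoothed unigram model over a fixed vocabulary (an assumption)."""
    counts = Counter(w for d in docs for w in d.lower().split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def originality(passage, earlier_lm, later_lm):
    """Mean per-word log-ratio: log p_later(w) - log p_earlier(w).

    Hypothetical score: positive when the passage's words are more
    probable in later documents than in earlier ones.
    """
    words = passage.lower().split()
    return sum(math.log(later_lm[w] / earlier_lm[w]) for w in words) / len(words)


# Toy diachronic corpus (invented for illustration only).
earlier = ["we rank documents with term weighting",
           "term weighting improves retrieval"]
later = ["topic models improve retrieval",
         "we extend topic models to rank documents"]
passages = ["we rank documents with term weighting",
            "we introduce topic models"]

# Shared vocabulary so every word has a smoothed probability in both models.
vocab = {w for text in earlier + later + passages for w in text.lower().split()}
earlier_lm = unigram_lm(earlier, vocab)
later_lm = unigram_lm(later, vocab)

scores = {p: originality(p, earlier_lm, later_lm) for p in passages}
best = max(scores, key=scores.get)
print(best)
```

In this toy setup the passage that introduces vocabulary picked up by later documents ranks above the one that repeats earlier vocabulary. A real model would need a principled way to weigh novelty against impact, which is what the paper's statistical formulation is meant to provide.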