Information genealogy: uncovering the flow of ideas in non-hyperlinked document databases

Authors:
Benyah Shaparenko;Thorsten Joachims
Affiliations:
Cornell University;Cornell University
Venue:
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2007

Abstract

We now have incrementally grown databases of text documents reaching back over a decade, in areas ranging from personal email to news articles and conference proceedings. While accessing individual documents is easy, few methods exist for overviewing and understanding these collections as a whole, and those that do are limited in scope. In this paper, we address one such global analysis task: automatically uncovering how ideas spread through a collection over time. We refer to this problem as Information Genealogy. In contrast to bibliometric methods, which are limited to collections with explicit citation structure, we investigate content-based methods that require only the text and timestamps of the documents. In particular, we propose a language-modeling approach and a likelihood ratio test to detect influence between documents in a statistically well-founded way. Furthermore, we show how this method can be used to infer citation graphs and to identify the most influential documents in the collection. Experiments on the NIPS conference proceedings and the Physics ArXiv show that our method is more effective than methods based on document similarity.
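To make the idea of a likelihood ratio test over language models concrete, here is a minimal sketch in Python. It is an illustration under simplifying assumptions, not the authors' implementation: the abstract does not specify the model, so the smoothed unigram models, the two-component mixture, the grid search over the mixing weight, and all names (`unigram`, `influence_llr`, the toy documents) are hypothetical choices made for this example. The null hypothesis is that a newer document is generated by a background model alone; the alternative lets it also draw words from an older candidate document.

```python
import math
from collections import Counter


def unigram(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram model over a fixed vocabulary (assumed choice)."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def log_likelihood(tokens, model):
    """Log-probability of a token sequence under a unigram model."""
    return sum(math.log(model[w]) for w in tokens)


def influence_llr(new_doc, old_doc, background, grid=21):
    """Likelihood ratio statistic for 'old_doc influenced new_doc'.

    Null: new_doc is drawn from the background model only (weight 0).
    Alternative: new_doc is drawn from a mixture of the background model
    and the old document's model, with the weight chosen on a grid.
    Returns (2 * log-likelihood gain, best mixing weight).
    """
    vocab = set(new_doc) | set(old_doc) | set(background)
    p_bg = unigram(background, vocab)
    p_old = unigram(old_doc, vocab)

    ll_null = log_likelihood(new_doc, p_bg)
    best_ll, best_lam = ll_null, 0.0
    for i in range(1, grid):
        lam = i / (grid - 1)
        mix = {w: (1 - lam) * p_bg[w] + lam * p_old[w] for w in vocab}
        ll = log_likelihood(new_doc, mix)
        if ll > best_ll:
            best_ll, best_lam = ll, lam
    return 2.0 * (best_ll - ll_null), best_lam


# Toy data (hypothetical): a generic background and one older document.
background = "the model is trained on data and the results are shown in the table".split() * 5
old_doc = "kernel support vector machine margin kernel margin".split()
new_related = "the kernel margin results kernel support vector".split()
new_unrelated = "the results are shown in the table".split()

print(influence_llr(new_related, old_doc, background))
print(influence_llr(new_unrelated, old_doc, background))
```

In this toy setup the document that reuses the older document's vocabulary receives a positive statistic, while the unrelated one gains nothing from the mixture. Turning the statistic into a decision would require a calibrated threshold, and the timestamps mentioned in the abstract would restrict candidate sources to documents that predate the new one; neither is shown here.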