Using the past to score the present: extending term weighting models through revision history analysis

Authors:
Ablimit Aji; Yu Wang; Eugene Agichtein; Evgeniy Gabrilovich
Affiliations:
Emory University, Atlanta, GA, USA; Emory University, Atlanta, GA, USA; Emory University, Atlanta, GA, USA; Yahoo! Research, Santa Clara, CA, USA
Venue:
CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Year:
2010

Abstract

A generative process underlies many information retrieval models, notably statistical language models. Yet these models examine only one version of the document, the current one, and so ignore how the document was actually written. We posit that a considerable amount of information is encoded in the document authoring process, and that this information is complementary to the word occurrence statistics on which most modern retrieval models are based. We propose a new term weighting model, Revision History Analysis (RHA), which uses the revision history of a document (e.g., the edit history of a page in Wikipedia) to redefine term frequency, a key indicator of document topic and relevance for many retrieval models and text processing tasks. We then apply RHA to document ranking by extending two state-of-the-art text retrieval models, namely BM25 and the generative statistical language model (LM). To the best of our knowledge, our paper is the first attempt to directly incorporate document authoring history into retrieval models. Empirical results on standard retrieval tasks and benchmarks show that RHA consistently improves both of these state-of-the-art retrieval models.
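
The abstract does not give the RHA formula, so the following is only a minimal sketch of the general idea: replace the raw term frequency of the current version with a value that also reflects earlier revisions, then feed it into BM25. The function names (`history_tf`, `rha_tf`, `bm25_term_score`), the 1 / j^alpha damping by revision index, and the interpolation weight `lam` are all illustrative assumptions rather than the authors' definitions. The BM25 term score uses the common non-negative IDF variant.

```python
import math
from collections import Counter


def history_tf(revisions, alpha=1.0):
    """Sum each term's count over all revisions, damped by revision index.

    `revisions` is a non-empty list of token lists, oldest first.
    The 1 / j**alpha damping is an assumption made for illustration.
    """
    tf = Counter()
    for j, tokens in enumerate(revisions, start=1):
        for term, count in Counter(tokens).items():
            tf[term] += count / j ** alpha
    return tf


def rha_tf(revisions, lam=0.5, alpha=1.0):
    """Interpolate history-based counts with counts from the latest revision.

    `lam` is a hypothetical mixing weight: 0 ignores the history entirely,
    1 ignores the plain count from the current version.
    """
    past = history_tf(revisions, alpha)
    current = Counter(revisions[-1])
    terms = set(past) | set(current)
    return {t: lam * past[t] + (1.0 - lam) * current[t] for t in terms}


def bm25_term_score(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25 score of one term; `tf` may be any non-negative real number."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + norm)


# Toy revision history of one page, oldest revision first.
revisions = [
    ["wiki", "history"],
    ["wiki", "history", "wiki"],
    ["wiki", "edit"],
]
weights = rha_tf(revisions, lam=0.5, alpha=1.0)
score = bm25_term_score(weights["wiki"], df=10, n_docs=1000,
                        doc_len=len(revisions[-1]), avg_doc_len=3.0)
```

In this toy history, "history" no longer appears in the latest revision, yet it keeps a non-zero weight because earlier revisions contained it. A retrieval model that looks only at the current version would assign it zero.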