Hypergeometric language models for republished article finding

Authors:
Manos Tsagkias;Maarten de Rijke;Wouter Weerkamp
Affiliations:
University of Amsterdam, Amsterdam, Netherlands;University of Amsterdam, Amsterdam, Netherlands;University of Amsterdam, Amsterdam, Netherlands
Venue:
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Year:
2011

Abstract

Republished article finding is the task of identifying instances of articles that have been published in one source and republished more or less verbatim in another source, which is often a social media source. We address this task as an ad hoc retrieval problem, using the source article as a query. Our approach is based on language modeling. We revisit the assumptions underlying the unigram language model, taking into account that in our setup queries are as long as complete news articles. We argue that in this case the underlying generative assumption of sampling words from a document with replacement, i.e., the multinomial modeling of documents, produces less accurate query likelihood estimates. To make up for this discrepancy, we consider distributions that emerge from sampling without replacement: the central and non-central hypergeometric distributions. We present two retrieval models that build on these distributions: a log odds model and a Bayesian model in which document parameters are estimated using the Dirichlet compound multinomial distribution. We analyze the behavior of our new models using a corpus of news articles and blog posts and find that for the task of republished article finding, where query length approaches the length of the documents to be retrieved, models based on distributions associated with sampling without replacement outperform traditional models based on multinomial distributions.
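
The contrast between sampling with and without replacement can be made concrete with a toy calculation. The sketch below is an illustration only, not the retrieval models of the paper: it compares the multinomial query likelihood with the multivariate central hypergeometric likelihood for whitespace-tokenized text, without any smoothing, log odds, or Bayesian estimation. The function names and the toy document are assumptions introduced here.

```python
from collections import Counter
from math import comb, exp, factorial, inf, log


def multinomial_log_likelihood(query, doc):
    """Log P(query | doc) when query terms are drawn WITH replacement."""
    q, d = Counter(query), Counter(doc)
    n_q, n_d = sum(q.values()), sum(d.values())
    # Multinomial coefficient |q|! / prod_w c(w, q)!
    log_p = log(factorial(n_q))
    for w, c_q in q.items():
        if d[w] == 0:
            return -inf  # unseen term and no smoothing in this sketch
        log_p += c_q * log(d[w] / n_d) - log(factorial(c_q))
    return log_p


def hypergeometric_log_likelihood(query, doc):
    """Log P(query | doc) when query terms are drawn WITHOUT replacement."""
    q, d = Counter(query), Counter(doc)
    n_q, n_d = sum(q.values()), sum(d.values())
    if n_q > n_d:
        return -inf  # cannot draw more terms than the document holds
    # prod_w C(c(w, d), c(w, q)) / C(|d|, |q|)
    log_p = -log(comb(n_d, n_q))
    for w, c_q in q.items():
        ways = comb(d[w], c_q)  # 0 if the query uses w more often than doc
        if ways == 0:
            return -inf
        log_p += log(ways)
    return log_p


# Toy data (hypothetical): a four-term document and two short queries.
doc = "a a b c".split()
query_ok = "a b".split()
query_repeat = "b b".split()  # uses "b" twice, but doc contains it once

print(exp(multinomial_log_likelihood(query_ok, doc)))
print(exp(hypergeometric_log_likelihood(query_ok, doc)))
print(multinomial_log_likelihood(query_repeat, doc))
print(hypergeometric_log_likelihood(query_repeat, doc))
```

In this toy setting the second query receives a non-zero multinomial likelihood even though the document contains "b" only once, whereas the hypergeometric likelihood is zero. The difference matters more as the query length approaches the document length, which is the regime the abstract describes.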