Republished article finding is the task of identifying instances of articles that have been published in one source and republished more or less verbatim in another source, often a social media source. We address this task as an ad hoc retrieval problem, using the source article as a query. Our approach is based on language modeling. We revisit the assumptions underlying the unigram language model, taking into account the fact that in our setup queries are as long as complete news articles. We argue that in this case the underlying generative assumption of sampling words from a document with replacement, i.e., the multinomial modeling of documents, produces less accurate query likelihood estimates. To make up for this discrepancy, we consider distributions that emerge from sampling without replacement: the central and non-central hypergeometric distributions. We present two retrieval models that build on these distributions: a log odds model and a Bayesian model in which document parameters are estimated using the Dirichlet compound multinomial distribution. We analyse the behavior of our new models using a corpus of news articles and blog posts and find that for republished article finding, where query length approaches the length of the documents to be retrieved, models based on distributions associated with sampling without replacement outperform traditional models based on multinomial distributions.
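
The contrast between sampling with and without replacement can be illustrated with a minimal Python sketch. It is not the paper's log odds or Bayesian model: it omits smoothing and the non-central variant, and it simply compares the unsmoothed multinomial query likelihood with the multivariate central hypergeometric likelihood on a toy document. The function names and the toy data are assumptions made for illustration only.

```python
from collections import Counter
from math import comb, factorial, log, prod


def multinomial_log_likelihood(query, doc):
    """log P(query | doc) when query terms are drawn WITH replacement."""
    n_q = sum(query.values())
    n_d = sum(doc.values())
    # Multinomial coefficient: number of orderings of the query bag of words.
    coeff = factorial(n_q) // prod(factorial(c) for c in query.values())
    ll = log(coeff)
    for term, count in query.items():
        tf = doc.get(term, 0)
        if tf == 0:
            return float("-inf")  # no smoothing in this sketch
        ll += count * log(tf / n_d)
    return ll


def hypergeometric_log_likelihood(query, doc):
    """log P(query | doc) when query terms are drawn WITHOUT replacement
    (multivariate central hypergeometric distribution)."""
    n_q = sum(query.values())
    n_d = sum(doc.values())
    if n_q > n_d:
        return float("-inf")  # cannot draw more terms than the document holds
    # Ways to pick the query's term counts out of the document's term counts.
    ways = prod(comb(doc.get(term, 0), count) for term, count in query.items())
    if ways == 0:
        return float("-inf")
    return log(ways) - log(comb(n_d, n_q))


# Hypothetical toy document with 6 tokens: a x3, b x2, c x1.
doc = Counter("a a a b b c".split())
verbatim_query = Counter(doc)      # query as long as the document itself
one_word_query = Counter(["a"])    # very short query

if __name__ == "__main__":
    for name, q in [("verbatim", verbatim_query), ("one word", one_word_query)]:
        print(name,
              multinomial_log_likelihood(q, doc),
              hypergeometric_log_likelihood(q, doc))
```

In this toy case the two likelihoods coincide for the one-word query, while for the verbatim query the hypergeometric likelihood is exactly 1 (log-likelihood 0) and the multinomial likelihood is 5/36, or about 0.139. This shows only how the two sampling assumptions diverge as query length approaches document length; it says nothing about retrieval effectiveness, which the paper evaluates on a corpus of news articles and blog posts.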