Finding text reuse on the web

Authors:
Michael Bendersky;W. Bruce Croft
Affiliations:
University of Massachusetts, Amherst, MA;University of Massachusetts, Amherst, MA
Venue:
Proceedings of the Second ACM International Conference on Web Search and Data Mining
Year:
2009

Citing 25
Cited 19

A language modeling approach to information retrieval

Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
The anatomy of a large-scale hypertextual Web search engine

WWW7 Proceedings of the seventh international conference on World Wide Web 7
Authoritative sources in a hyperlinked environment

Journal of the ACM (JACM)
Temporal summaries of new topics

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Relevance based language models

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Similarity estimation techniques from rounding algorithms

STOC '02 Proceedings of the thirty-fourth annual ACM symposium on Theory of computing
Topic-sensitive PageRank

Proceedings of the 11th international conference on World Wide Web
DOM-based content extraction of HTML documents

WWW '03 Proceedings of the 12th international conference on World Wide Web
Time-based language models

CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
A maximum entropy approach to identifying sentence boundaries

ANLC '97 Proceedings of the fifth conference on Applied natural language processing
Using temporal profiles of queries for precision prediction

Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Detecting phrase-level duplication on the world wide web

Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
A Markov random field model for term dependencies

Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Discovering evolutionary theme patterns from text: an exploration of temporal text mining

Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
Tracking Information Epidemics in Blogspace

WI '05 Proceedings of the 2005 IEEE/WIC/ACM International Conference on Web Intelligence
Similarity measures for tracking information flow

Proceedings of the 14th ACM international conference on Information and knowledge management
Information Extraction: Distilling Structured Data from Unstructured Text

Queue - Social Computing
Finding near-duplicate web pages: a large-scale evaluation of algorithms

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
A translation model for sentence retrieval

HLT '05 Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing
Web projections: learning from contextual subgraphs of the web

Proceedings of the 16th international conference on World Wide Web
A comparison of sentence retrieval techniques

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Finding high-quality content in social media

WSDM '08 Proceedings of the 2008 International Conference on Web Search and Data Mining
Genealogical trees on the web: a search engine user perspective

Proceedings of the 17th international conference on World Wide Web
Local text reuse detection

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Utilizing passage-based language models for document retrieval

ECIR'08 Proceedings of the IR research, 30th European conference on Advances in information retrieval

Detecting the origin of text segments efficiently

Proceedings of the 18th international conference on World wide web
SBotMiner: large scale search bot detection

Proceedings of the third ACM international conference on Web search and data mining
b-Bit minwise hashing

Proceedings of the 19th international conference on World wide web
Efficient partial-duplicate detection based on sequence matching

Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Evaluating text reuse discovery on the web

Proceedings of the third symposium on Information interaction in context
Large-scale copy detection

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Hypergeometric language models for republished article finding

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Candidate document retrieval for web-scale text reuse detection

SPIRE'11 Proceedings of the 18th international conference on String processing and information retrieval
Plagiarism detection based on structural information

Proceedings of the 20th ACM international conference on Information and knowledge management
Noise robust detection of the emergence and spread of topics on the web

Proceedings of the 2nd Temporal Web Analytics Workshop
Detecting quilted web pages at scale

SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Generating queries from user-selected text

Proceedings of the 4th Information Interaction in Context Symposium
University_of_Sheffield: two approaches to semantic text similarity

SemEval '12 Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation
Text reuse with ACL: (upward) trends

ACL '12 Proceedings of the ACL-2012 Special Workshop on Rediscovering 50 Years of Discoveries
Computing similarity between items in a digital library of cultural heritage

Journal on Computing and Cultural Heritage (JOCCH)
Reconstructing provenance

ISWC'12 Proceedings of the 11th international conference on The Semantic Web - Volume Part II
Folktale classification using learning to rank

ECIR'13 Proceedings of the 35th European conference on Advances in Information Retrieval
Synthetic review spamming and defense

Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining
Efficient estimation for high similarities using odd sketches

Proceedings of the 23rd international conference on World wide web

Abstract

With the overwhelming number of reports on similar events originating from different sources on the web, it is often hard, using existing web search paradigms, to find the original source of "facts", statements, rumors, and opinions, and to track their development. Several techniques have previously been proposed for detecting such text reuse between different sources; however, these techniques have been tested against relatively small and homogeneous TREC collections. In this work, we test the feasibility of text reuse detection techniques in the setting of web search. In addition to text reuse detection, we develop a novel technique that addresses the unique challenges of finding original sources on the web, such as defining a timeline. We also explore the use of link analysis for identifying reliable and relevant reports. Our experimental results show that the proposed techniques can operate on the scale of the web, are significantly more accurate than standard web search for finding text reuse, and provide a richer representation for tracking the information flow.