EPCI: extracting potentially copyright infringement texts from the web

Authors:
Takashi Tashiro;Takanori Ueda;Taisuke Hori;Yu Hirate;Hayato Yamana
Affiliations:
Waseda University;Waseda University;Waseda University;Waseda University;Waseda University
Venue:
Proceedings of the 16th international conference on World Wide Web
Year:
2007

Abstract

In this paper, we propose EPCI, a new system for extracting texts from the Web that potentially infringe copyright. EPCI works as follows: (1) it generates a set of queries from a given copyrighted seed-text; (2) it submits every query to a search engine API; (3) it gathers the search result pages in descending rank order until the similarity between the seed-text and the result pages falls below a given threshold; and (4) it merges all gathered pages and re-ranks them in order of similarity. Experiments with 40 seed-texts show that EPCI extracts, on average, 132 potentially infringing Web pages per seed-text with 94% precision.
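
The four steps can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: the abstract does not say how queries are generated or how similarity is measured, so the sketch assumes fixed-length word-window queries and Jaccard similarity over word trigrams. The `search` callable is a hypothetical stand-in for a search engine API, the threshold default of 0.3 is arbitrary, and applying the cut-off separately to each query's result list is also an assumption.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

# Hypothetical search interface: takes a query string and yields
# (url, page_text) pairs in rank order, best result first.
SearchFn = Callable[[str], Iterable[Tuple[str, str]]]


def words(text: str) -> List[str]:
    return text.lower().split()


def shingles(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Set of word n-grams in the text (assumed similarity features)."""
    w = words(text)
    return {tuple(w[i:i + n]) for i in range(max(len(w) - n + 1, 0))}


def similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity of word n-gram sets (an assumed measure)."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def generate_queries(seed: str, length: int = 5, step: int = 5) -> List[str]:
    """Step 1: cut the seed-text into word windows (assumed strategy)."""
    w = words(seed)
    if not w:
        return []
    last_start = max(len(w) - length + 1, 1)
    return [" ".join(w[i:i + length]) for i in range(0, last_start, step)]


def extract_candidates(
    seed: str, search: SearchFn, threshold: float = 0.3
) -> List[Tuple[str, float]]:
    best: Dict[str, float] = {}
    for query in generate_queries(seed):
        # Step 2: submit the query to the search engine.
        for url, page_text in search(query):
            score = similarity(seed, page_text)
            # Step 3: walk down the ranking until similarity drops
            # below the threshold, then stop for this query.
            if score < threshold:
                break
            best[url] = max(score, best.get(url, 0.0))
    # Step 4: merge results from all queries, re-rank by similarity.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)
```

Passing the search function in as an argument keeps the sketch independent of any particular search engine API, and lets it be exercised with a stub that returns canned pages.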