Retrieving similar documents from the web

Authors:
Álvaro R. Pereira;Nivio Ziviani
Affiliations:
Department of Computer Science, Federal University of Minas Gerais, Belo Horizonte, Brazil;Department of Computer Science, Federal University of Minas Gerais, Belo Horizonte, Brazil
Venue:
Journal of Web Engineering
Year:
2003

Abstract

This paper presents a mechanism for detecting and retrieving documents from the Web that are similar to a suspicious document. The process has three stages: a) generating a "fingerprint" of the suspicious document, b) gathering candidate documents from the Web, and c) comparing each candidate document with the suspicious document. In the first stage, a fingerprint composed of representative sentences is extracted from the suspicious document and serves as its identification. In the second stage, the sentences composing the fingerprint are submitted as queries to a search engine. The documents identified by the URLs returned by the search engine are collected to form a set of candidate documents. In the third stage, the candidate documents are compared with the suspicious document using two different methods: Shingles and Patricia tree. We implemented and evaluated the methods used for generating the document fingerprint and for comparing the suspicious document with the candidate documents. The experiments were performed on a collection of plagiarized documents constructed specifically for this work. In the best experimental result, all of the source documents used in the composition were retrieved from the Web in 61.53% of the runs. In this case, fewer than 50% of the source documents used in the composition were retrieved in only 5.44% of the runs. For the best fingerprint implemented, on average 87.06% of the documents were retrieved.
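
To make the third stage more concrete, the following is a minimal sketch of a shingle-based comparison between a suspicious document and a candidate document. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the shingle size, the tokenization, or the scoring formula, so the word-level shingles, the default size w = 4, and the function names shingles, resemblance, and containment are all hypothetical choices. The scores follow the common set-based definitions of resemblance (intersection over union) and containment (intersection over the size of the first document's shingle set).

```python
import re


def shingles(text, w=4):
    """Return the set of contiguous w-word sequences (shingles) in text.

    Assumption: words are lowercased alphanumeric runs; the paper's
    actual tokenization is not given in the abstract.
    """
    words = re.findall(r"\w+", text.lower())
    if not words:
        return set()
    if len(words) < w:
        # A text shorter than w words yields a single, shorter shingle.
        return {tuple(words)}
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=4):
    """Fraction of shingles shared by a and b over all their shingles."""
    sa, sb = shingles(a, w), shingles(b, w)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def containment(a, b, w=4):
    """Fraction of the shingles of a that also occur in b."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa) if sa else 0.0


if __name__ == "__main__":
    suspicious = "the quick brown fox jumps over the lazy dog"
    candidate = "a quick brown fox jumps over the lazy cat"
    print(resemblance(suspicious, candidate, w=3))
    print(containment(suspicious, candidate, w=3))
```

In a retrieval setting like the one described above, each candidate document gathered in the second stage would be scored against the suspicious document this way, and candidates with higher scores would be treated as more likely sources. Containment is the more natural score when the suspicious document is assembled from fragments of several sources, since it is not penalized by the unrelated parts of the other document.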