Information retrieval: data structures and algorithms
New indices for text: PAT Trees and PAT arrays
Information retrieval
Copy detection mechanisms for digital documents
SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Syntactic clustering of the Web
Selected papers from the sixth international conference on World Wide Web
PATRICIA—Practical Algorithm To Retrieve Information Coded in Alphanumeric
Journal of the ACM (JACM)
CHECK: a document plagiarism detection system
SAC '97 Proceedings of the 1997 ACM symposium on Applied computing
Towards an error free plagarism detection process
Proceedings of the 6th annual conference on Innovation and technology in computer science education
dSCAM: finding document copies across multiple databases
DIS '96 Proceedings of the fourth international conference on Parallel and distributed information systems
Efficiency of data structures for detecting overlaps in digital documents
ACSC '01 Proceedings of the 24th Australasian conference on Computer science
Analysis of lexical signatures for finding lost or related documents
SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
On the Resemblance and Containment of Documents
SEQUENCES '97 Proceedings of the Compression and Complexity of Sequences 1997
Using Visualization to Detect Plagiarism in Computer Science Classes
INFOVIS '00 Proceedings of the IEEE Symposium on Information Visualization 2000
Visualising Intra-Corpal Plagiarism
IV '01 Proceedings of the Fifth International Conference on Information Visualisation
Syntactic Similarity of Web Documents
LA-WEB '03 Proceedings of the First Conference on Latin American Web Congress
Finding similar files in a large file system
WTEC'94 Proceedings of the USENIX Winter 1994 Technical Conference on USENIX Winter 1994 Technical Conference
EPCI: extracting potentially copyright infringement texts from the web
Proceedings of the 16th international conference on World Wide Web
Adaptive Web Sites: A Knowledge Extraction from Web Data Approach
Proceedings of the 2008 conference on Adaptive Web Sites: A Knowledge Extraction from Web Data Approach
Automatic retrieval of similar content using search engine query interface
Proceedings of the 18th ACM conference on Information and knowledge management
DOCODE-lite: a meta-search engine for document similarity retrieval
KES'10 Proceedings of the 14th international conference on Knowledge-based and intelligent information and engineering systems: Part II
Hypergeometric language model and Zipf-like scoring function for web document similarity retrieval
SPIRE'10 Proceedings of the 17th international conference on String processing and information retrieval
An algorithmic treatment of strong queries
Proceedings of the fourth ACM international conference on Web search and data mining
A logical framework for web data mining based on heterogeneous algebraic structure hierarchies
MMACTEE'06 Proceedings of the 8th WSEAS international conference on Mathematical methods and computational techniques in electrical engineering
Extracting significant Website Key Objects: A Semantic Web mining approach
Engineering Applications of Artificial Intelligence
A Text Similarity Meta-Search Engine Based on Document Fingerprints and Search Results Records
WI-IAT '11 Proceedings of the 2011 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01
Using word clusters to detect similar web documents
KSEM'06 Proceedings of the First international conference on Knowledge Science, Engineering and Management
This paper presents a mechanism for detecting and retrieving documents from the web that bear a similarity relation to a suspicious document. The process is composed of three stages: a) generation of a "fingerprint" of the suspicious document, b) gathering of candidate documents from the web, and c) comparison of each candidate document against the suspicious document. In the first stage, the fingerprint of the suspicious document serves as its identification; it is composed of representative sentences of the document. In the second stage, the sentences composing the fingerprint are submitted as queries to a search engine, and the documents identified by the returned URLs are collected to form a set of similarity candidates. In the third stage, the candidate documents are compared to the suspicious document using two different methods: shingles and the Patricia tree. We implemented and evaluated the methods used for generating the document fingerprint and for comparing the suspicious document with the candidate documents. The experiments were performed on a collection of plagiarized documents constructed specifically for this work. In the best experimental result, all of the source documents used in the composition were retrieved from the Web in 61.53% of the trials, and in only 5.44% of the runs were fewer than 50% of the source documents retrieved. For the best fingerprint implemented, 87.06% of the source documents were retrieved on average.
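The shingle-based comparison in the third stage can be illustrated with Broder's resemblance measure: each document is reduced to its set of contiguous word w-grams (shingles), and two documents are compared by the Jaccard similarity of those sets. The sketch below is an assumption-laden illustration of that technique, not the paper's exact implementation; the function names, the shingle width `w=4`, and the whitespace tokenization are all illustrative choices.

```python
# Minimal sketch of w-shingling with Jaccard resemblance (Broder, 1997),
# one of the two comparison methods mentioned in the abstract.
# Names and parameters here are illustrative, not taken from the paper.

def shingles(text, w=4):
    """Return the set of w-word shingles (contiguous word w-grams) of text."""
    words = text.lower().split()
    if len(words) <= w:
        return {tuple(words)}
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(doc_a, doc_b, w=4):
    """Jaccard similarity of the shingle sets of two documents, in [0, 1]."""
    sa, sb = shingles(doc_a, w), shingles(doc_b, w)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0
```

A suspicious document would be scored against every candidate gathered in stage two; candidates whose resemblance exceeds a threshold are reported as likely sources. Identical documents score 1.0 and documents sharing no w-gram score 0.0.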