Journal of the American Society for Information Science and Technology
Web documents that are partially or completely duplicated in content are easily found on the Internet these days. Not only do these documents create redundant information on the Web, which makes filtering for unique information slower and consumes additional storage space, but they also degrade the efficiency of Web information retrieval. In this paper, we present a sentence-based copy detection approach for Web documents, which determines whether any two given Web documents contain overlapping portions and graphically displays the locations of the same (or semantically the same) sentences detected in the documents. Two sentences are treated as either the same or different according to their degree of similarity, which is computed using either the three least-frequent 4-gram approach or the fuzzy-set information retrieval (IR) approach. Experimental results show that the fuzzy-set IR approach outperforms the three least-frequent 4-gram approach within our copy detection approach, which handles a wide range of documents in different subject areas and does not require static word lists.
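To make the three least-frequent 4-gram idea concrete, the following is a minimal sketch under stated assumptions, not the procedure used in the paper. The abstract does not specify how 4-grams are extracted, how frequencies are counted, or how sentences are compared. Here we assume character 4-grams over lowercased alphanumeric text, frequencies counted over the sentences of the two documents being compared, and an exact match of the three rarest 4-grams as the "same sentence" test. All function names are hypothetical.

```python
from collections import Counter


def char_4grams(sentence):
    """Character 4-grams of a sentence (assumed normalisation:
    lowercase, alphanumeric characters only)."""
    text = "".join(ch for ch in sentence.lower() if ch.isalnum())
    return [text[i:i + 4] for i in range(len(text) - 3)]


def build_frequencies(sentences):
    """Count every 4-gram across all given sentences (assumed
    frequency source: the documents under comparison)."""
    counts = Counter()
    for sentence in sentences:
        counts.update(char_4grams(sentence))
    return counts


def fingerprint(sentence, counts, k=3):
    """The k least-frequent distinct 4-grams of a sentence.
    Ties are broken alphabetically so the result is deterministic."""
    grams = set(char_4grams(sentence))
    ranked = sorted(grams, key=lambda g: (counts[g], g))
    return frozenset(ranked[:k])


def same_sentence(a, b, counts):
    """Assumed decision rule: two sentences are 'the same' when
    their non-empty fingerprints are identical."""
    fa = fingerprint(a, counts)
    fb = fingerprint(b, counts)
    return bool(fa) and fa == fb


# Hypothetical example documents, each given as a list of sentences.
doc_a = ["The cat sat on the mat.", "Dogs bark loudly at night."]
doc_b = ["the cat sat on the mat", "Birds sing in the morning."]
counts = build_frequencies(doc_a + doc_b)

# Index pairs of sentences judged to be copies of each other.
matches = [
    (i, j)
    for i, sa in enumerate(doc_a)
    for j, sb in enumerate(doc_b)
    if same_sentence(sa, sb, counts)
]
```

The index pairs in `matches` are the kind of information a graphical display of overlapping sentence locations could be built from. An exact fingerprint match only catches near-verbatim copies; it has no notion of the degree of similarity between differently worded sentences, which is the role the abstract assigns to the fuzzy-set IR approach.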