An evaluation of provenance-based near-duplicates detection

Authors:
Y. Syed Mudhasir;J. Deepika;S. Sendhilkumar
Affiliations:
Department of Computer Science and Engineering, Anna University, Chennai-25, Tamil Nadu, India;Department of Computer Science and Engineering, Anna University, Chennai-25, Tamil Nadu, India;Department of Information Science and Technology, Anna University, Chennai-25, Tamil Nadu, India
Venue:
International Journal of Knowledge and Web Intelligence
Year:
2011

Abstract

Existing search engines suffer from redundancy in their search results. Detecting and eliminating such redundancy (near-duplicates) is an active area of research among search engine researchers. Provenance-based factors can improve web search by helping to deliver quality content to the user. Many of the factors that affect personalisation may also help users judge the quality and trustworthiness of web documents. Provenance information is likewise helpful in filtering near-duplicates from search results based on the 6W factors. This paper therefore aims to develop a web search system that uses a provenance-based technique for near-duplicate detection and elimination. The system incorporates a personalised crawler (focused crawler) that computes author credentials, which contribute to the trustworthiness of a web document. Finally, the results of the proposed system are compared with those of existing algorithms using a test bed of web documents.