Expert Systems with Applications: An International Journal
Traditional Web search engines do not currently support retrieving similar documents from the Web when the input is a document rather than a key-term query. One approach to the problem consists of fingerprinting the document's content into a set of queries that are submitted to a list of Web search engines. Afterward, the results are merged, their URLs are fetched, and their content is compared with the given document using text comparison algorithms. However, requesting results from multiple Web servers can take a significant amount of time and effort. In this work, a similarity function between the given document and the retrieved results is estimated. The function takes as variables features derived from the information provided by search engine result records, such as rankings, titles and snippets, and therefore avoids the bottleneck of requesting external Web servers. We created a collection of around 10,000 search engine results by generating queries from 2,000 crawled Web documents. We then fitted the similarity function using the cosine similarity between the content of the input document and the content of each result as the target variable. The execution times of the exact and approximated solutions were compared. The approximated solution reduced computational time by 86% at an acceptable level of precision with respect to the exact solution of the Web document retrieval problem.
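The setup described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tokenizer, the term-count vectors, the three features (reciprocal rank, title similarity, snippet similarity) and all function names are assumptions chosen for illustration. It only shows how a feature vector could be built from a result record alone, and how the exact cosine-similarity target would be computed once the result's content has been fetched.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count its alphanumeric tokens (assumed tokenizer)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors (Counter objects)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


def result_features(document, rank, title, snippet):
    """Features computable from a result record alone (hypothetical feature set)."""
    doc_vec = term_counts(document)
    return [
        1.0 / rank,  # reciprocal of the 1-based rank in the result list
        cosine_similarity(doc_vec, term_counts(title)),
        cosine_similarity(doc_vec, term_counts(snippet)),
    ]


def target_similarity(document, fetched_content):
    """Exact target: cosine similarity against the fetched page content."""
    return cosine_similarity(term_counts(document), term_counts(fetched_content))


# One training pair: features from the result record, target from the fetched page.
doc = "similar document retrieval from the web"
x = result_features(doc, rank=1, title="Document retrieval",
                    snippet="retrieval of similar documents")
y = target_similarity(doc, "a page about similar document retrieval on the web")
```

Pairs such as `(x, y)` collected over many results could then be used to fit any regression model. At query time only `result_features` would be evaluated, so no result page needs to be fetched.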