A Text Similarity Meta-Search Engine Based on Document Fingerprints and Search Results Records

Authors:
Felipe Bravo-Marquez;Gaston L'Huillier;Sebastián A. Ríos;Juan D. Velásquez
Venue:
WI-IAT '11 Proceedings of the 2011 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01
Year:
2011


Abstract

Traditional Web search engines do not currently support retrieving similar documents from the Web using a document, rather than key-term queries, as input. One approach to this problem consists of fingerprinting the document's content into a set of queries that are submitted to a list of Web search engines. Afterward, the results are merged, their URLs are fetched, and their content is compared with the given document using text comparison algorithms. However, requesting results from multiple Web servers can take a significant amount of time and effort. In this work, a similarity function between the given document and the retrieved results is estimated. The function takes as variables features derived from the information in search engine results records, such as rankings, titles, and snippets, thereby avoiding the bottleneck of requesting external Web servers. We created a collection of around 10,000 search engine results by generating queries from 2,000 crawled Web documents. We then fitted the similarity function using the cosine similarity between the content of the input document and that of each result as the target variable. The execution times of the exact and approximated solutions were compared. Our approximated solution reduced computational time by 86% at an acceptable level of precision with respect to the exact solution of the Web document retrieval problem.
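
To make the two quantities in the abstract concrete, the following minimal Python sketch computes a cosine similarity between two texts (the kind of value used as the target variable) and builds a small feature vector from a single search result record. The abstract does not specify the actual feature set, term weighting, or fitted model, so the raw term-frequency weighting, the record fields `rank`, `title`, and `snippet`, and the three features returned by `record_features` are illustrative assumptions rather than the authors' method.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between raw term-frequency vectors of two texts.

    Assumption: plain term counts are used; the paper may weight terms
    differently (for example with TF-IDF).
    """
    tf_a, tf_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction in term space
    return dot / (norm_a * norm_b)


def record_features(document, record):
    """Hypothetical features computed only from a search result record.

    `record` is assumed to be a dict with a 1-based `rank`, a `title`,
    and a `snippet`; no page content is fetched.
    """
    return {
        "reciprocal_rank": 1.0 / record["rank"],
        "title_cosine": cosine_similarity(document, record["title"]),
        "snippet_cosine": cosine_similarity(document, record["snippet"]),
    }
```

In a setup like the one described, features of this kind would be computed for every result record, while `cosine_similarity` applied to the fetched page content would supply the target value during fitting only. At query time the fitted function would then stand in for the fetch-and-compare step.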