The retrieval of similar documents from large-scale datasets has been one of the main concerns in knowledge management environments, in applications such as plagiarism detection, news impact analysis, and the matching of ideas within sets of documents. In all of these applications, a lightweight architecture is fundamental given the large volume of information to be analyzed. Furthermore, the relevance scores used for document retrieval can be significantly improved by combining several existing search engines and taking into account relevance feedback from users. In this work, we propose a web-services architecture for the retrieval of similar documents from the web. We focus on the software engineering needed to incorporate users' knowledge into the retrieval algorithm. A human evaluation of the system's relevance feedback over a constructed set of documents is presented, showing that the proposed architecture can retrieve similar documents by using the main search engines. In particular, the system was evaluated on the document plagiarism detection task, and the main results are reported.