Advanced information retrieval from web pages

  • Authors: A. Vedeshin
  • Affiliations: Tallinn University of Technology, Tallinn, Estonia
  • Venue: FDIA'07 Proceedings of the 1st BCS IRSG conference on Future Directions in Information Access
  • Year: 2007

Abstract

This work proposes a lightweight, web-based algorithm that runs at near real-time speed. It retrieves the main parts (menu, main text, header, and footer) of a randomly selected web page, making full use of CSS, JavaScript, frames, layers, images, etc. for retrieval. The proposal also discusses the shortcomings of well-known modern algorithms for content retrieval from web pages. The algorithm can improve existing engines for search, content matching, summarization, web graph calculation, and similar tasks, and it is practical as a data provider for classification and data mining. In experiments, a PHP implementation of the algorithm ran at near real-time speed, with a 20-25% error rate in the multipurpose mode and an error rate below 1% in the specific mode.