Stochastic simulations of web search engines: RBF versus second-order regression models

  • Authors:
  • George Meghabghab; Abraham Kandel

  • Affiliations:
  • Department of Computer Science Technology, Roane State, Oak Ridge, TN; Department of Computer Science and Engineering, University of South Florida, 4202 East Fowler Avenue, ENB 118, Tampa, FL

  • Venue:
  • Information Sciences—Informatics and Computer Science: An International Journal
  • Year:
  • 2004

Abstract

Stochastic simulation has been very effective in many domains but has never been applied to the World Wide Web (WWW). This study is the first to use neural networks in the stochastic simulation of the number of rejected web pages per search query. The evaluation of the quality of search engines should involve not only the resulting set of web pages but also an estimate of the rejected set of web pages. The iterative Radial Basis Functions (RBF) neural network developed by Meghabghab and Nasr [Iterative RBF neural networks as meta-models for stochastic simulations, in: The Second International Conference on Intelligent Processing and Manufacturing of Materials, 1999, p. 729] was applied to estimate the number of rejected web pages on four search engines, i.e., Yahoo, Alta Vista, Google, and Northern Light. Nine input variables were selected for the simulation: (1) precision, (2) overlap, (3) response time, (4) coverage, (5) update frequency, (6) Boolean logic, (7) truncation, (8) word and multiword searching, and (9) portion of the web pages indexed. Typical stochastic simulation meta-modeling uses regression models in Response Surface Methods (RSM) to test the N collected training patterns. RBF neural networks are a natural choice for RSM because they use a family of surfaces, each of which naturally divides the input space into two regions, Z+ and Z-, so that each of the N test patterns is assigned to either class Z+ or Z-. This technique divides the resulting set of responses to a query into accepted and rejected web pages. To test the hypothesis that the evaluation of any search engine query should include an estimate of the number of rejected web pages, the RBF meta-model was trained on a set of 9000 different simulation runs over the nine input variables. Results show that two of the variables, response time and portion of the web indexed, can be eliminated without affecting the evaluation results. Results also show that the number of rejected web pages for a specific set of search queries on these four engines is very high. In addition, a goodness measure of a search engine for a given set of queries can be designed as a function of the engine's coverage and the normalized age of a new document in the result set for the query. This study concludes that unless search engine designers address the issues of rejected web pages, indexing, and crawling, the usage of the Web as a research tool for academic and educational purposes will remain hindered.
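
The contrast between the two meta-models can be sketched as follows. This is a minimal, hypothetical illustration rather than the authors' implementation: it fits a Gaussian RBF network and a full second-order polynomial regression to a synthetic stand-in for the nine input variables, then thresholds each fitted surface to split the simulated responses into the accepted (Z+) and rejected (Z-) classes. The synthetic data, the labeling rule, the number of centers, and the Gaussian width are all assumptions made purely for the sketch.

```python
# Sketch: RBF meta-model vs. second-order regression meta-model on synthetic
# stand-ins for the nine search-engine input variables (not the authors' code).
import numpy as np

rng = np.random.default_rng(0)

FEATURES = ["precision", "overlap", "response_time", "coverage",
            "update_frequency", "boolean_logic", "truncation",
            "multiword_search", "portion_indexed"]

# --- Hypothetical simulation data: N runs over the nine inputs -------------
N = 9000
X = rng.uniform(0.0, 1.0, size=(N, len(FEATURES)))
# Synthetic "accepted (Z+) vs rejected (Z-)" label: an arbitrary nonlinear
# rule standing in for the unknown response surface of a real search engine.
score = np.sin(3 * X[:, 0]) + X[:, 3] ** 2 - 0.5 * X[:, 1] * X[:, 4]
y = (score > np.median(score)).astype(float)         # 1 = Z+, 0 = Z-

# --- RBF meta-model: Gaussian basis functions + linear readout -------------
def rbf_design(X, centers, width):
    """Gaussian activations phi_k(x) = exp(-||x - c_k||^2 / (2 * width^2))."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * width ** 2))

K = 40                                                # number of centers (assumed)
centers = X[rng.choice(N, size=K, replace=False)]     # centers sampled from the data
width = 0.5                                           # shared Gaussian width (assumed)
Phi = np.hstack([rbf_design(X, centers, width), np.ones((N, 1))])
w_rbf, *_ = np.linalg.lstsq(Phi, y, rcond=None)       # least-squares output weights
rbf_pred = (Phi @ w_rbf > 0.5)                        # threshold splits Z+ / Z-

# --- Second-order regression meta-model (classical RSM surface) ------------
def quadratic_design(X):
    """Full second-order polynomial basis: [1, x_i, x_i * x_j for i <= j]."""
    n, p = X.shape
    cols = [np.ones(n)] + [X[:, i] for i in range(p)]
    cols += [X[:, i] * X[:, j] for i in range(p) for j in range(i, p)]
    return np.column_stack(cols)

Q = quadratic_design(X)
w_quad, *_ = np.linalg.lstsq(Q, y, rcond=None)
quad_pred = (Q @ w_quad > 0.5)

print("RBF meta-model training accuracy:       %.3f" % (rbf_pred == y).mean())
print("2nd-order regression training accuracy: %.3f" % (quad_pred == y).mean())
```

Under these assumptions, the RBF surface can bend locally around each center, while the quadratic regression is limited to a single global second-order surface; comparing the two thresholded fits on the same simulated runs mirrors the paper's RBF-versus-RSM comparison in spirit only.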