Deep Web Navigation in Web Data Extraction

  • Authors:
  • Robert Baumgartner;Michal Ceresna;Gerald Ledermuller

  • Affiliations:
  • Lixto Software GmbH and DBAI, TU Wien;DBAI, TU Wien, Vienna, Austria;Lixto Software GmbH, Vienna, Austria

  • Venue:
  • CIMCA '05 Proceedings of the International Conference on Computational Intelligence for Modelling, Control and Automation and International Conference on Intelligent Agents, Web Technologies and Internet Commerce Vol-2 (CIMCA-IAWTIC'06) - Volume 02
  • Year:
  • 2005

Abstract

In the literature, data extraction techniques for HTML and semi-structured data in general have been exhaustively studied, and a number of automatic and semi-automatic approaches have been proposed. However, in real-life scenarios data extraction capabilities are only one half of the game. Password-protected sites, cookies, non-HTML data formats, JavaScript, session IDs, Web form iterations and dynamic changes on Web sites are the obstacles that make Web data extraction difficult in real-life application scenarios. We propose, based on current Lixto technology, a novel approach that introduces action-based recording and replaying of Web navigation sequences and integrates it closely with extraction technologies. On the one hand, the technical innovation is the embedding of the Mozilla browser into the Lixto Visual Wrapper, which brings support for a large number of Web standards and an open-source API that permits close interaction between Lixto and Mozilla. On the other hand, we develop a navigation language and explore its close interaction with Elog, the extraction language of Lixto. The current research status and sample screenshots are given. The paper closes with a description of two application domains where Deep Web navigation capabilities play a crucial role, namely automotive B2B Web platforms and Business Intelligence scenarios.