Automatically maintaining navigation sequences for querying semi-structured web sources

  • Authors:
  • Alberto Pan; Juan Raposo; Manuel Álvarez; Víctor Carneiro; Fernando Bellas

  • Affiliations:
  • Department of Information and Communication Technologies, Facultad de Informatica, Campus de Elviña s/n, University of A Coruña, 15071 A Coruña, Spain (all authors)

  • Venue:
  • Data & Knowledge Engineering
  • Year:
  • 2007

Abstract

A substantial subset of Web data has an underlying structure. For instance, the pages obtained in response to a query executed through a Web search form are usually generated by a program that accesses structured data in a local database and embeds them into an HTML template. For software programs to gain full benefit from these "semi-structured" Web sources, wrapper programs must be built to provide a "machine-readable" view over them. Since Web sources are autonomous, they may experience changes that invalidate the current wrapper; automatic wrapper maintenance is therefore an important issue. Wrappers must perform two tasks: navigating through Web sites and extracting structured data from HTML pages. While several works have addressed the automatic maintenance of data extraction tasks, the problem of maintaining the navigation sequences remains, to the best of our knowledge, unaddressed. In this paper, we propose a set of novel techniques to fill this gap.