Finding and extracting data records from web pages

  • Authors:
  • Manuel Álvarez;Alberto Pan;Juan Raposo;Fernando Bellas;Fidel Cacheda

  • Affiliations:
  • Department of Information and Communications Technologies, University of A Coruña, A Coruña, Spain (all authors)

  • Venue:
  • EUC'07 Proceedings of the 2007 international conference on Embedded and ubiquitous computing
  • Year:
  • 2007

Abstract

Many HTML pages are generated by software that queries an underlying database and fills in a template with the results. In this process the meta-information about the data structure is lost, so automated programs cannot process the data as powerfully as they could process information taken directly from a database. We propose a set of novel techniques for detecting structured records in a web page and extracting the data values that constitute them. Our method needs only a single input page. It first identifies the data region of interest in the page. It then partitions that region into records using a clustering method that groups similar subtrees in the DOM tree of the page. Finally, it extracts the attributes of the data records using a method based on multiple string alignment. We have tested our techniques on a large number of real web sources, obtaining high precision and recall values.
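
The three steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not give the similarity measure, the clustering procedure, or the alignment method, so everything below is an assumption made for illustration. In particular, the data region is assumed to be the first element of the page, subtree similarity is approximated with `difflib.SequenceMatcher` over preorder tag sequences, clustering is a greedy single pass with a hypothetical threshold of 0.8, and multiple string alignment is replaced by a much cruder alignment of each record against the longest one. All names (`Node`, `TreeBuilder`, `cluster_subtrees`, `align`, `extract_records`, `PAGE`) are hypothetical.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    """Minimal DOM-like node; text is stored as '#text' child nodes."""

    def __init__(self, tag, parent=None, value=None):
        self.tag, self.parent, self.value = tag, parent, value
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a Node tree; assumes the input HTML is well nested."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        if self.current.tag == tag and self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        if data.strip():  # ignore whitespace-only text between elements
            self.current.children.append(Node("#text", self.current, data.strip()))


def tag_sequence(node):
    """Preorder list of tag names: a cheap structural signature of a subtree."""
    seq = [node.tag]
    for child in node.children:
        seq.extend(tag_sequence(child))
    return seq


def fields(node):
    """Preorder list of (parent tag, text) pairs for the text in a subtree."""
    if node.tag == "#text":
        return [(node.parent.tag, node.value)]
    return [f for child in node.children for f in fields(child)]


def cluster_subtrees(nodes, threshold=0.8):
    """Greedy clustering: join the first cluster whose first member is similar enough."""
    clusters = []
    for node in nodes:
        seq = tag_sequence(node)
        for cluster in clusters:
            if SequenceMatcher(None, seq, tag_sequence(cluster[0])).ratio() >= threshold:
                cluster.append(node)
                break
        else:
            clusters.append([node])
    return clusters


def align(records):
    """Align each record's fields to the longest record; gaps become None.

    Simplification: fields with no counterpart in the longest record are dropped.
    """
    ref_tags = [tag for tag, _ in max(records, key=len)]
    rows = []
    for record in records:
        row = [None] * len(ref_tags)
        matcher = SequenceMatcher(None, ref_tags, [tag for tag, _ in record])
        for i, j, size in matcher.get_matching_blocks():
            for k in range(size):
                row[i + k] = record[j + k][1]
        rows.append(row)
    return rows


def extract_records(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    region = builder.root.children[0]  # assumption: data region = first element
    candidates = [c for c in region.children if c.tag != "#text"]
    largest = max(cluster_subtrees(candidates), key=len)  # assume records dominate
    return align([fields(n) for n in largest])


PAGE = """
<div id="results">
  <h2>Books</h2>
  <div><a>Title A</a><span>10 EUR</span></div>
  <div><a>Title B</a><span>12 EUR</span><i>new</i></div>
  <div><a>Title C</a><span>9 EUR</span></div>
</div>
"""
```

Calling `extract_records(PAGE)` on this toy page yields one row per record with three aligned columns. The heading falls into its own cluster and is discarded, and the optional third field is `None` for the two records that lack it. A realistic implementation would also have to locate the data region automatically and handle records that span several sibling subtrees, neither of which this sketch attempts.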