Information Extraction from HTML Product Catalogues: From Source Code and Images to RDF

  • Authors:
  • Martin Labsky;Vojtech Svatek;Ondrej Svab;Pavel Praks;Michal Kratky;Vaclav Snasel

  • Affiliations:
  • University of Economics;University of Economics;University of Economics;VŠB - Technical University of Ostrava;VŠB - Technical University of Ostrava;VŠB - Technical University of Ostrava

  • Venue:
  • WI '05 Proceedings of the 2005 IEEE/WIC/ACM International Conference on Web Intelligence
  • Year:
  • 2005

Abstract

We describe an application that extracts information about product offers from company websites. Statistical text analysis is combined with several image classification methods. Ontological knowledge is then used to group the extracted items into structured objects. The results are stored in an RDF repository and made available for structured search.
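
The grouping and RDF steps mentioned in the abstract could look roughly like the sketch below. It is only an illustration under stated assumptions, not the authors' method: the `http://example.org/catalogue#` namespace, the property labels (`name`, `price`, `image`), the `ProductOffer` class, and the rule that a repeated property starts a new object are all hypothetical, and the output is plain N-Triples text rather than a real repository.

```python
# Hypothetical namespace; the abstract does not name a vocabulary.
EX = "http://example.org/catalogue#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def group_items(items):
    """Group document-ordered (label, value) pairs into objects.

    Assumed constraint standing in for ontological knowledge: an offer
    has at most one value per property, so a repeated label starts a
    new object.
    """
    objects, current = [], {}
    for label, value in items:
        if label in current:
            objects.append(current)
            current = {}
        current[label] = value
    if current:
        objects.append(current)
    return objects


def escape(literal):
    """Escape a string for use as an N-Triples literal."""
    return (literal.replace("\\", "\\\\").replace('"', '\\"')
                   .replace("\n", "\\n").replace("\r", "\\r"))


def to_ntriples(objects):
    """Serialize grouped objects as a list of N-Triples lines."""
    lines = []
    for i, obj in enumerate(objects, start=1):
        subject = f"<{EX}offer{i}>"
        lines.append(f"{subject} <{RDF_TYPE}> <{EX}ProductOffer> .")
        for label, value in obj.items():
            lines.append(f'{subject} <{EX}{label}> "{escape(value)}" .')
    return lines


# Made-up extracted items, in the order they might appear on a page.
items = [
    ("name", "Bike A"), ("price", "499 EUR"), ("image", "a.jpg"),
    ("name", "Bike B"), ("price", "650 EUR"),
]
objects = group_items(items)
triples = to_ntriples(objects)
print("\n".join(triples))
```

In this sketch the second `name` item closes the first offer, so the five extracted items yield two objects. A real system would need richer constraints than a single "no repeated property" rule.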