Extracting product descriptions from Polish e-commerce websites using classification and clustering

  • Authors: Piotr Kołaczkowski; Piotr Gawrysiak
  • Affiliations: Institute of Computer Science, Warsaw University of Technology (both authors)
  • Venue: ISMIS'11: Proceedings of the 19th International Conference on Foundations of Intelligent Systems
  • Year: 2011

Abstract

A novel method for extracting product descriptions from e-commerce websites is presented. The algorithm consists of three major steps: (1) extracting descriptions of appropriate length from the source documents related to the search query, using shallow text analysis methods; (2) assigning each description to one of the predefined categories by means of text classification; and (3) grouping the results with a text clustering algorithm and returning the descriptions found in the highest-quality clusters. The recall and precision of the search are examined using a set of queries for laptops currently sold on popular shopping sites. It is shown that, although extraction based purely on classification and extraction based purely on clustering both give acceptable results, the highest precision is achieved when the two are used together. It was also observed that examining roughly the first 20 sites returned by Google is sufficient to obtain high-quality descriptions of popular products.
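
The three steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the classifier, the clustering algorithm, the length limits, or the cluster quality measure, so everything below is an assumption chosen for illustration. The keyword table `CATEGORY_KEYWORDS`, the word-count limits, the Jaccard threshold, and the use of cluster size as a proxy for cluster quality are all hypothetical stand-ins.

```python
import re

# Hypothetical keyword table standing in for a trained text classifier.
CATEGORY_KEYWORDS = {
    "laptop": {"laptop", "notebook", "ram", "processor", "ssd", "screen"},
}


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def extract_candidates(documents, min_words=8, max_words=200):
    """Step 1 (assumed form): keep paragraphs of 'appropriate length'."""
    candidates = []
    for doc in documents:
        for paragraph in doc.split("\n\n"):
            paragraph = paragraph.strip()
            if min_words <= len(paragraph.split()) <= max_words:
                candidates.append(paragraph)
    return candidates


def classify(text):
    """Step 2 (toy version): category with the most keyword hits, else 'other'."""
    tokens = tokenize(text)
    scores = {cat: len(tokens & kws) for cat, kws in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(texts, threshold=0.3):
    """Step 3 (toy version): greedy single-pass clustering by Jaccard similarity.

    Each text joins the first cluster whose first member is similar enough;
    otherwise it starts a new cluster.
    """
    clusters = []
    for text in texts:
        tokens = tokenize(text)
        for group in clusters:
            if jaccard(tokens, tokenize(group[0])) >= threshold:
                group.append(text)
                break
        else:
            clusters.append([text])
    return clusters


def extract_descriptions(documents, category="laptop"):
    """Run the three steps and return the largest cluster.

    Cluster size is used here as an assumed proxy for cluster quality.
    """
    candidates = extract_candidates(documents)
    in_category = [c for c in candidates if classify(c) == category]
    clusters = cluster(in_category)
    return max(clusters, key=len) if clusters else []
```

In this sketch, short navigation fragments are dropped by the length filter in step 1, off-topic paragraphs such as shipping notices are dropped by the classifier in step 2, and near-duplicate descriptions from different shops end up in the same cluster in step 3. Combining the filters in this order mirrors the abstract's observation that classification and clustering together give higher precision than either alone, although the sketch makes no claim about the actual figures.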