Extracting product descriptions from Polish e-commerce websites using classification and clustering

Authors:
Piotr Kołaczkowski;Piotr Gawrysiak
Affiliations:
Institute of Computer Science, Warsaw University of Technology;Institute of Computer Science, Warsaw University of Technology
Venue:
ISMIS'11 Proceedings of the 19th international conference on Foundations of intelligent systems
Year:
2011

Abstract

A novel method for extracting product descriptions from e-commerce websites is presented. The algorithm consists of three major steps: (1) extracting descriptions of appropriate length from the source documents related to the search query, using shallow text analysis methods; (2) assigning each description to one of the predefined categories by means of text classification; and (3) grouping the results with a text clustering algorithm and returning the descriptions found in the highest-quality clusters. The recall and precision of the search are examined using a set of queries for laptops currently sold on popular shopping sites. Although extraction based purely on classification and extraction based purely on clustering both give acceptable results, the highest precision is achieved when the two are used together. It was also observed that examining roughly the first 20 sites returned by Google is sufficient to obtain high-quality descriptions of popular products.
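
The three steps in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the classifier, the clustering algorithm, the length bounds, or the cluster quality measure, so everything here is an assumption made for illustration. The keyword table `CATEGORY_KEYWORDS`, the word-count limits, the Jaccard threshold, and the use of cluster size as a stand-in for cluster quality are all hypothetical.

```python
import re

# Hypothetical category vocabularies; the abstract does not list categories or features.
CATEGORY_KEYWORDS = {
    "laptop": {"laptop", "notebook", "processor", "ram", "screen", "battery"},
    "other": {"shipping", "cookies", "login", "newsletter", "cart"},
}


def tokenize(text):
    """Return the set of lowercase word tokens in the text."""
    return set(re.findall(r"\w+", text.lower()))


def extract_candidates(documents, min_words=8, max_words=60):
    """Step 1: keep lines whose word count lies within the assumed bounds."""
    candidates = []
    for doc in documents:
        for line in doc.split("\n"):
            line = line.strip()
            if min_words <= len(line.split()) <= max_words:
                candidates.append(line)
    return candidates


def classify(text):
    """Step 2: keyword-overlap classifier, a stand-in for a trained text classifier."""
    tokens = tokenize(text)
    scores = {c: len(tokens & kw) for c, kw in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    # Texts that match no keyword at all are treated as "other".
    return best if scores[best] > 0 else "other"


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(texts, threshold=0.3):
    """Step 3: greedy one-pass clustering against each cluster's first member."""
    clusters = []
    for text in texts:
        tokens = tokenize(text)
        for members in clusters:
            if jaccard(tokens, tokenize(members[0])) >= threshold:
                members.append(text)
                break
        else:
            clusters.append([text])
    return clusters


def extract_descriptions(documents, target_category="laptop"):
    """Run the three steps and return the largest cluster of matching passages."""
    candidates = extract_candidates(documents)
    relevant = [c for c in candidates if classify(c) == target_category]
    clusters = cluster(relevant)
    if not clusters:
        return []
    # Assumption: cluster size approximates cluster quality.
    return max(clusters, key=len)
```

Calling `extract_descriptions(pages)` on a list of page texts returns the passages in the largest cluster for the target category. Combining the classification filter with the clustering step mirrors the abstract's observation that the two techniques together give higher precision than either one alone.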