Strigil: A Framework for Data Extraction in Semi-Structured Web Documents

  • Authors:
  • Jakub Stárka;Irena Holubová;Martin Nečaský

  • Affiliations:
  • Department of Software Engineering, Charles University in Prague, Czech Republic;Department of Software Engineering, Charles University in Prague, Czech Republic;Department of Software Engineering, Charles University in Prague, Czech Republic

  • Venue:
  • Proceedings of International Conference on Information Integration and Web-based Applications & Services
  • Year:
  • 2013

Abstract

In this paper we introduce Strigil, a framework for automated data extraction. It is an easily configurable tool for retrieving data from textual or weakly structured documents. The paper describes the framework architecture and its important components. Additionally, we propose a scraping language, inspired by XSL Transformations, designed to extract data from different kinds of documents. Although many existing approaches address various aspects of data scraping, they are usually highly specialized to a particular domain or data source. We compare these solutions and discuss their advantages and disadvantages. Our scraping language is designed to work with an ontology so that scraped data is mapped directly to classes and attributes.