DEiXTo: a web data extraction suite

  • Authors:
  • Fotios Kokkoras; Konstantinos Ntonas; Nick Bassiliades

  • Affiliations:
  • TEI of Larisa, Larisa, Greece; International Hellenic University, Moudania, Thermi, Greece; International Hellenic University, Moudania, Thermi, Greece and Aristotle University of Thessaloniki, Thessaloniki, Greece

  • Venue:
  • Proceedings of the 6th Balkan Conference in Informatics
  • Year:
  • 2013

Abstract

Web data extraction (or web scraping) is the process of collecting unstructured or semi-structured information from the World Wide Web, with varying degrees of automation. It is an important, valuable and practical approach to web reuse and, at the same time, it can support the transition of the web to the semantic web by providing the structured data that the latter requires. In this paper we present DEiXTo, a web data extraction suite that provides an arsenal of features for designing and deploying well-engineered extraction tasks. We focus on presenting the core pattern matching algorithm and the overall architecture, which allows custom-made solutions to be programmed for hard extraction tasks. DEiXTo consists of both freeware and open source components.