The Lixto data extraction project: back and forth between theory and practice

  • Authors: Georg Gottlob; Christoph Koch; Robert Baumgartner; Marcus Herzog; Sergio Flesca

  • Affiliations: DBAI, TU Wien, Austria; DBAI, TU Wien, Austria; Lixto Software GmbH, Austria; Lixto Software GmbH, Austria; D.E.I.S. - Università della Calabria, Italy

  • Venue: PODS '04 Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems

  • Year: 2004

Abstract

We present the Lixto project, which is both a research project in database theory and a commercial enterprise that develops Web data extraction (wrapping) and Web service definition software. We discuss the project's main motivations and ideas, in particular the use of a logic-based framework for wrapping. We then present theoretical results on monadic datalog over trees and on Elog, its close relative, which serves as the internal wrapper language of the Lixto system. These results characterize both the expressive power and the complexity of these languages. We describe the visual wrapper specification process in Lixto and various practical aspects of wrapping. We discuss work on the complexity of query languages for trees that was inspired by our theoretical study of logic-based languages for wrapping. We then return to the practice of wrapping and to the Lixto Transformation Server, which allows for streaming integration of data extracted from Web pages, a natural requirement in complex services based on Web wrapping. Finally, we discuss industrial applications of Lixto and point to open problems for future study.