Automatic web spreadsheet data extraction

  • Authors: Zhe Chen; Michael Cafarella
  • Affiliations: University of Michigan, Ann Arbor, MI (both authors)
  • Venue: Proceedings of the 3rd International Workshop on Semantic Search Over the Web
  • Year: 2013

Abstract

Spreadsheets contain a huge amount of high-value data but do not observe a standard data model and are thus difficult to integrate. A large number of data integration tools exist, but they generally work only on relational data. Existing systems for extracting relational data from spreadsheets are too labor-intensive to support ad hoc integration tasks, in which the correct extraction target is learned only during the course of user interaction. This paper introduces a system that automatically extracts relational data from spreadsheets, thereby enabling relational spreadsheet integration. The resulting integrated relational data can be queried directly or translated into RDF triples. When compared to standard techniques for spreadsheet data extraction on a set of 100 random Web spreadsheets, the system reduces the amount of human labor by 72% to 92%. In addition to the system design, we present the results of a general survey of more than 400,000 spreadsheets we downloaded from the Web, giving a novel view of how users organize their data in spreadsheets.