Automatic web spreadsheet data extraction

Authors:
Zhe Chen; Michael Cafarella
Affiliations:
University of Michigan, Ann Arbor, MI; University of Michigan, Ann Arbor, MI
Venue:
Proceedings of the 3rd International Workshop on Semantic Search Over the Web
Year:
2013

Abstract

Spreadsheets contain a huge amount of high-value data, but they do not observe a standard data model and are therefore difficult to integrate. Many data integration tools exist, but they generally work only on relational data. Existing systems for extracting relational data from spreadsheets are too labor-intensive to support ad hoc integration tasks, in which the correct extraction target is learned only during the course of user interaction. This paper introduces a system that automatically extracts relational data from spreadsheets, thereby enabling relational spreadsheet integration. The resulting integrated relational data can be queried directly or translated into RDF triples. Compared with standard techniques for spreadsheet data extraction on a set of 100 random Web spreadsheets, the system reduces the amount of human labor by 72% to 92%. In addition to the system design, we present the results of a general survey of more than 400,000 spreadsheets downloaded from the Web, giving a novel view of how users organize their data in spreadsheets.