Quarrying dataspaces: Schemaless profiling of unfamiliar information sources

Authors:
Bill Howe;David Maier;Nicolas Rayner;James Rucker
Affiliations:
Oregon Health & Science University, Center for Coastal Margin Observation and Prediction, 20000 NW Walker Road, Beaverton, USA;Portland State University, Department of Computer Science, 1900 SW 4th Avenue, Oregon, USA;Portland State University, Department of Computer Science, 1900 SW 4th Avenue, Oregon, USA;Portland State University, Department of Computer Science, 1900 SW 4th Avenue, Oregon, USA
Venue:
ICDEW '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering Workshop
Year:
2008



Abstract

Traditional data integration and analysis approaches tend to assume intimate familiarity with the structure, semantics, and capabilities of the available information sources before applicable tools can be used effectively. This assumption often does not hold in practice. We introduce dataspace profiling as the cardinal activity when beginning a project in an unfamiliar dataspace. Dataspace profiling is an analysis of the structures and properties exposed by an information source, allowing (1) assessment of the utility and importance of the information source as a whole, (2) assessment of compatibility with the services of a dataspace support platform, and (3) determination and externalization of structure in preparation for specific data applications. In this paper, we define dataspace profiling and articulate requirements for dataspace profilers. We then describe the Quarry system, which offers a generic browse-and-query interface to support dataspace profiling activities, including path profiling, over a variety of data sources with minimal setup costs and minimal a priori assumptions. We show that the mechanisms used in Quarry deliver strong performance in large-scale applications. Specifically, we use Quarry to efficiently profile (1) a detailed standard for medication nomenclature supplied under a generic schema and (2) the metadata for an environmental observation and forecasting system, and conclude that in these contexts Quarry offers advantages over existing tools.
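
As a rough illustration of what profiling a source under a generic schema might involve, the sketch below assumes the data arrives as (resource, property, value) triples. It is not Quarry's implementation or API. The function names `profile` and `two_step_paths`, the triple layout, and the sample medication data are all hypothetical. The first function groups resources by the set of properties they expose, and the second counts two-step property paths, a simple form of path profiling.

```python
from collections import Counter, defaultdict

def profile(triples):
    """Group resources by the set of properties they expose (hypothetical helper)."""
    props_by_resource = defaultdict(set)
    for subject, prop, _value in triples:
        props_by_resource[subject].add(prop)
    # Resources that share the same property set land in the same group.
    groups = defaultdict(list)
    for subject, props in props_by_resource.items():
        groups[frozenset(props)].append(subject)
    return dict(groups)

def two_step_paths(triples):
    """Count property pairs (p1, p2) where the value of p1 is itself a resource with p2."""
    out_edges = defaultdict(list)
    for subject, prop, value in triples:
        out_edges[subject].append((prop, value))
    counts = Counter()
    for _subject, prop, value in triples:
        # .get avoids creating empty entries for values that are plain literals.
        for next_prop, _ in out_edges.get(value, ()):
            counts[(prop, next_prop)] += 1
    return counts

# Hypothetical sample data: two drugs, each linked to one ingredient.
triples = [
    ("drug1", "name", "Aspirin 325 mg"),
    ("drug1", "has_ingredient", "ing1"),
    ("drug2", "name", "Ibuprofen 200 mg"),
    ("drug2", "has_ingredient", "ing2"),
    ("ing1", "name", "aspirin"),
    ("ing2", "name", "ibuprofen"),
]

for signature, resources in profile(triples).items():
    print(sorted(signature), resources)
for path, count in two_step_paths(triples).items():
    print(" / ".join(path), count)
```

Neither function needs a schema declared in advance: the structure (which kinds of resources exist, and how they connect) is derived from the data itself, which is the general idea behind schemaless profiling.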