SchemEX - Efficient construction of a data catalogue by stream-based indexing of linked data

Authors:
Mathias Konrath;Thomas Gottron;Steffen Staab;Ansgar Scherp
Venue:
Web Semantics: Science, Services and Agents on the World Wide Web
Year:
2012

Citing 9

Extracting schema from semistructured data
SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data

Index Structures for Path Expressions
ICDT '99 Proceedings of the 7th International Conference on Database Theory

DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases
VLDB '97 Proceedings of the 23rd International Conference on Very Large Data Bases

D(k)-index: an adaptive structural summary for graph-structured data
Proceedings of the 2003 ACM SIGMOD international conference on Management of data

Exploiting Local Similarity for Indexing Paths in Graph-Structured Data
ICDE '02 Proceedings of the 18th International Conference on Data Engineering

Semantic Search --- Using Graph-Structured Semantic Models for Supporting the Search Process
ICCS '09 Proceedings of the 17th International Conference on Conceptual Structures: Conceptual Structures: Leveraging Semantic Technologies

Data summaries for on-demand queries over linked data
Proceedings of the 19th international conference on World wide web

Creating voiD descriptions for Web-scale data
Web Semantics: Science, Services and Agents on the World Wide Web

ExpLOD: summary-based exploration of interlinking and RDF usage in the linked open data cloud
ESWC'10 Proceedings of the 7th international conference on The Semantic Web: research and Applications - Volume Part II

Cited 5

Incompleteness-aware programming with RDF data
DDFP '13 Proceedings of the 2013 workshop on Data driven functional programming

Structure inference for linked data sources using clustering
Proceedings of the Joint EDBT/ICDT 2013 Workshops

LOVER: support for modeling data using linked open vocabularies
Proceedings of the Joint EDBT/ICDT 2013 Workshops

LODatio: using a schema-level index to support users in finding relevant sources of linked data
Proceedings of the seventh international conference on Knowledge capture

Large-scale bisimulation of RDF graphs
Proceedings of the Fifth Workshop on Semantic Web Information Management

Abstract

We present SchemEX, an approach and tool for stream-based indexing and schema extraction of Linked Open Data (LOD) at web scale. The schema index provided by SchemEX can be used to locate distributed data sources in the LOD cloud. It serves typical LOD information needs such as finding sources that contain instances of one specific data type, of a given set of data types (so-called type clusters), or of instances in type clusters that are connected by one or more common properties (so-called equivalence classes). The entire process of extracting the schema from triples and constructing an index is designed to have linear runtime complexity. Thus, the schema index can be computed on the fly while the triples are crawled and provided as a stream by a linked data spider. To demonstrate the web-scalability of our approach, we have computed a SchemEX index over the Billion Triples Challenge (BTC) dataset 2011, consisting of 2,170 million triples. In addition, we have computed the SchemEX index on a dataset with 11 million triples. We use this smaller dataset for conducting a detailed qualitative analysis. We are capable of locating relevant data sources with a recall between 71% and 98% and a precision between 74% and 100% at a window size of 100 K triples observed in the stream, depending on the complexity of the query, i.e., whether one wants to find specific data types, type clusters, or equivalence classes.
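
The three lookup levels named in the abstract (single type, type cluster, equivalence class) can be illustrated with a minimal Python sketch. It is not the SchemEX implementation: the class name `SchemaIndexSketch`, the quad layout `(subject, predicate, object, source)`, the prefixed names, and the example URLs are all hypothetical. Two simplifications are assumed: the window counts instances rather than triples, and an equivalence class is approximated as a type cluster paired with the set of properties used by the instance.

```python
from collections import OrderedDict, defaultdict

RDF_TYPE = "rdf:type"  # prefixed name used as a plain string in this sketch


class SchemaIndexSketch:
    """Hypothetical windowed schema index; not the SchemEX code."""

    def __init__(self, window_size=3):
        self.window_size = window_size  # assumption: counts instances, not triples
        self.window = OrderedDict()     # subject -> collected types/props/sources
        self.by_type = defaultdict(set)
        self.by_type_cluster = defaultdict(set)
        self.by_equivalence_class = defaultdict(set)

    def add(self, subject, predicate, obj, source):
        """Consume one quad from the stream."""
        if subject not in self.window:
            if len(self.window) >= self.window_size:
                # Window is full: evict the oldest instance and index it.
                _, oldest = self.window.popitem(last=False)
                self._flush(oldest)
            self.window[subject] = {"types": set(), "props": set(), "sources": set()}
        entry = self.window[subject]
        if predicate == RDF_TYPE:
            entry["types"].add(obj)
        else:
            entry["props"].add(predicate)
        entry["sources"].add(source)

    def _flush(self, entry):
        """Write the schema information of one instance into the index."""
        cluster = frozenset(entry["types"])
        eq_class = (cluster, frozenset(entry["props"]))  # simplified definition
        for rdf_type in cluster:
            self.by_type[rdf_type] |= entry["sources"]
        self.by_type_cluster[cluster] |= entry["sources"]
        self.by_equivalence_class[eq_class] |= entry["sources"]

    def close(self):
        """Flush every instance still held in the window."""
        while self.window:
            _, entry = self.window.popitem(last=False)
            self._flush(entry)


# Hypothetical example stream.
quads = [
    ("ex:alice", RDF_TYPE, "foaf:Person", "http://a.example/people"),
    ("ex:alice", RDF_TYPE, "ex:Author", "http://a.example/people"),
    ("ex:alice", "foaf:knows", "ex:bob", "http://a.example/people"),
    ("ex:bob", RDF_TYPE, "foaf:Person", "http://b.example/bob"),
]

index = SchemaIndexSketch(window_size=2)
for quad in quads:
    index.add(*quad)
index.close()

person_sources = index.by_type["foaf:Person"]
author_person_sources = index.by_type_cluster[frozenset({"foaf:Person", "ex:Author"})]
```

Each quad is handled once and each instance is flushed once, so the work in this sketch grows linearly with the length of the stream. An instance whose triples arrive farther apart than the window is indexed in fragments, which is one way a bounded window could trade accuracy for memory.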