Using Linked Data to mine RDF from Wikipedia's tables

Authors:
Emir Muñoz;Aidan Hogan;Alessandra Mileo
Affiliations:
Fujitsu (Ireland) Limited, Galway, Ireland;Universidad de Chile, Santiago, Chile;INSIGHT @ NUI Galway, Galway, Ireland
Venue:
Proceedings of the 7th ACM international conference on Web search and data mining
Year:
2014

Abstract

The tables embedded in Wikipedia articles contain rich, semi-structured encyclopaedic content. However, the cumulative content of these tables cannot be queried. We thus propose methods to recover the semantics of Wikipedia tables and, in particular, to extract facts from them in the form of RDF triples. Our core method uses an existing Linked Data knowledge base to find pre-existing relations between entities in Wikipedia tables, and then suggests that the same relations hold for other entities in analogous columns on different rows. We find that such an approach extracts RDF triples from Wikipedia's tables at a raw precision of 40%. To improve the raw precision, we define a set of features for extracted triples that are tracked during the extraction phase. Using a manually labelled gold standard, we then test a variety of machine learning methods for classifying correct/incorrect triples. One such method extracts 7.9 million unique and novel RDF triples from over one million Wikipedia tables at an estimated precision of 81.5%.
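
The core idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `suggest_triples`, the `ex:` identifiers, the toy knowledge base, and the row-support score are all assumptions made for illustration. The sketch assumes table cells have already been resolved to entity identifiers, looks up which predicates the knowledge base already asserts between each ordered pair of columns, and proposes those predicates for rows where the triple is missing.

```python
from collections import Counter
from itertools import permutations


def suggest_triples(table, kb):
    """Propose new triples for a table of entity IDs.

    table: list of rows, each a list of entity IDs (None for unresolved cells).
    kb:    set of (subject, predicate, object) triples already known.
    Returns a list of ((s, p, o), support) pairs, where support is the
    fraction of table rows in which the KB already asserts p between the
    same two columns (a hypothetical feature, not one taken from the paper).
    """
    # Index the KB by (subject, object) so relation lookup is a dict access.
    by_pair = {}
    for s, p, o in kb:
        by_pair.setdefault((s, o), set()).add(p)

    n_cols = max((len(row) for row in table), default=0)
    candidates = []
    for i, j in permutations(range(n_cols), 2):
        # Count rows in which predicate p already links column i to column j.
        support = Counter()
        for row in table:
            if i < len(row) and j < len(row):
                for p in by_pair.get((row[i], row[j]), ()):
                    support[p] += 1
        # Suggest each observed predicate for rows where the KB lacks it.
        for p, count in support.items():
            for row in table:
                if i >= len(row) or j >= len(row):
                    continue
                s, o = row[i], row[j]
                if s is None or o is None:
                    continue
                if (s, p, o) not in kb:
                    candidates.append(((s, p, o), count / len(table)))
    return candidates


# Toy example: two of three rows are already linked by ex:team in the KB.
table = [
    ["ex:Messi", "ex:Barcelona"],
    ["ex:Xavi", "ex:Barcelona"],
    ["ex:Pique", "ex:Barcelona"],
]
kb = {
    ("ex:Messi", "ex:team", "ex:Barcelona"),
    ("ex:Xavi", "ex:team", "ex:Barcelona"),
}
result = suggest_triples(table, kb)
for triple, score in result:
    print(triple, round(score, 2))
```

In this toy run, the only suggestion is the `ex:team` triple for the third row, supported by two of the three rows. A score of this kind is one plausible example of a feature that could be tracked during extraction and passed to a classifier to separate correct from incorrect triples.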