Scalable column concept determination for web tables using large knowledge bases

Authors:
Dong Deng;Yu Jiang;Guoliang Li;Jian Li;Cong Yu
Affiliations:
Department of Computer Science, Tsinghua University, Beijing, China;Department of Computer Science, Tsinghua University, Beijing, China;Department of Computer Science, Tsinghua University, Beijing, China;Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China;Google Research, New York
Venue:
Proceedings of the VLDB Endowment
Year:
2013

Abstract

Tabular data on the Web has become a rich source of structured data that ordinary users can explore. Because of this potential, Web tables have recently attracted a number of studies that aim to understand their semantics and to provide effective search and exploration mechanisms over them. An important part of table understanding and search is column concept determination, i.e., identifying the most appropriate concepts associated with the columns of a table. The problem becomes especially challenging with the availability of increasingly rich knowledge bases that contain hundreds of millions of entities. In this paper, we focus on an important instantiation of the column concept determination problem, in which the concepts of a column are determined by fuzzy matching its cell values to the entities within a large knowledge base. We provide an efficient MapReduce-based solution that scales with both the number of tables and the size of the knowledge base, and we propose two novel techniques: knowledge concept aggregation and knowledge entity partition. We prove that finding the optimal aggregation strategy and finding the optimal partition strategy are both NP-hard, and we propose efficient heuristics that leverage the hierarchy of the knowledge base. Experimental results on real-world datasets show that our method achieves high annotation quality and performance, and scales well.
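
To make the problem statement concrete, the following is a minimal single-machine sketch of column concept determination by fuzzy matching. It is an illustration only, not the paper's MapReduce method. The toy knowledge base `TOY_KB`, the helper names `similarity` and `column_concepts`, the use of Python's `difflib.SequenceMatcher` as the similarity measure, the 0.85 threshold, and the one-vote-per-cell scoring are all assumptions made here for illustration. The sketch compares every cell against every entity, which is exactly the brute-force cost that a scalable solution has to avoid.

```python
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical toy knowledge base: entity name -> concepts it belongs to.
TOY_KB = {
    "Barack Obama": {"politician", "person"},
    "Angela Merkel": {"politician", "person"},
    "Tsinghua University": {"university", "organization"},
    "Peking University": {"university", "organization"},
}


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (assumed measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def column_concepts(cells, kb, threshold=0.85):
    """Rank concepts by how many cells fuzzily match one of their entities."""
    votes = Counter()
    for cell in cells:
        matched = set()
        # Brute-force scan over all entities; fine for a toy KB only.
        for entity, concepts in kb.items():
            if similarity(cell, entity) >= threshold:
                matched |= concepts
        # Each cell votes at most once per concept.
        votes.update(matched)
    return votes.most_common()


if __name__ == "__main__":
    # "Barak Obama" is a misspelling that should still match.
    print(column_concepts(["Barak Obama", "Angela Merkel"], TOY_KB))
```

In this sketch a column's candidate concepts are simply those with the most matching cells. How candidates are scored and ranked in the paper is not specified in the abstract, so the voting rule should be read as a placeholder.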