InfoGather: entity augmentation and attribute discovery by holistic matching with web tables

Authors:
Mohamed Yakout;Kris Ganjam;Kaushik Chakrabarti;Surajit Chaudhuri
Affiliations:
Purdue University, West Lafayette, IN, USA;Microsoft Research, Redmond, WA, USA;Microsoft Research, Redmond, WA, USA;Microsoft Research, Redmond, WA, USA
Venue:
SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Year:
2012

Abstract

The Web contains a vast corpus of HTML tables, specifically entity-attribute tables. We present three core operations, namely entity augmentation by attribute name, entity augmentation by example, and attribute discovery, that are useful for "information gathering" tasks (e.g., researching products or stocks). We propose to use a web table corpus to perform them automatically. We require the operations to have high precision and coverage, to have fast (ideally interactive) response times, and to be applicable to any domain of entities. The naive approach that attempts to directly match the user input with the web tables suffers from poor precision and coverage. Our key insight is that we can achieve much higher precision and coverage by considering indirectly matching tables in addition to the directly matching ones. The challenge is to be robust to spuriously matched tables: we address it by developing a holistic matching framework based on topic-sensitive PageRank and an augmentation framework that aggregates predictions from multiple matched tables. We propose a novel architecture that leverages preprocessing in MapReduce to achieve extremely fast response times at query time. Our experiments on real-life datasets and 573M web tables show that our approach has (i) significantly higher precision and coverage and (ii) four orders of magnitude faster response times compared with the state-of-the-art approach.
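
The idea of scoring indirectly matching tables with topic-sensitive PageRank and then aggregating their predictions can be illustrated with a toy sketch. This is not the paper's implementation: the graph, edge weights, table names (`T1` to `T4`), the functions `topic_sensitive_pagerank` and `aggregate_predictions`, and the score-weighted voting rule are all assumptions made for illustration. The sketch assumes a small weighted graph whose nodes are web tables and whose edges encode table-to-table match strength, with the random walk teleporting only to tables that directly match the query.

```python
from collections import defaultdict


def topic_sensitive_pagerank(edges, seeds, damping=0.85, iterations=50):
    """Power iteration for PageRank with a seed-biased teleport vector.

    edges: {table: {neighbor: match_weight}} (hypothetical match graph)
    seeds: {table: weight} for tables that directly match the query
    """
    nodes = set(edges) | {n for nbrs in edges.values() for n in nbrs} | set(seeds)
    seed_total = sum(seeds.values())
    teleport = {n: seeds.get(n, 0.0) / seed_total for n in nodes}
    scores = dict(teleport)
    for _ in range(iterations):
        nxt = {n: (1 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            out = edges.get(n, {})
            out_total = sum(out.values())
            if out_total == 0:
                # Dangling table: send its mass back to the seed tables.
                for m in nodes:
                    nxt[m] += damping * scores[n] * teleport[m]
            else:
                for m, w in out.items():
                    nxt[m] += damping * scores[n] * w / out_total
        scores = nxt
    return scores


def aggregate_predictions(table_scores, table_values, entity):
    """Pick the value with the highest total score over tables containing the entity."""
    votes = defaultdict(float)
    for table, score in table_scores.items():
        value = table_values.get(table, {}).get(entity)
        if value is not None:
            votes[value] += score
    return max(votes, key=votes.get) if votes else None


# Toy data: T1 matches the query directly; T2 and T3 match it indirectly;
# T4 is a weakly connected, spurious match with a conflicting value.
edges = {
    "T1": {"T2": 0.9, "T4": 0.1},
    "T2": {"T1": 0.9, "T3": 0.8},
    "T3": {"T2": 0.8},
    "T4": {"T1": 0.1},
}
seeds = {"T1": 1.0}
table_values = {
    "T1": {"camera A": "10 MP"},
    "T2": {"camera B": "12 MP"},
    "T3": {"camera B": "12 MP"},
    "T4": {"camera B": "10 MP"},
}

scores = topic_sensitive_pagerank(edges, seeds)
prediction = aggregate_predictions(scores, table_values, "camera B")
print(prediction)
```

In this toy setup no directly matching table contains "camera B", so a direct-match-only approach would return nothing. The indirectly matching tables supply the value, and the weakly connected table receives too little score to override them.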