The Web contains a vast corpus of HTML tables, specifically entity-attribute tables. We present three core operations, namely entity augmentation by attribute name, entity augmentation by example, and attribute discovery, that are useful for "information gathering" tasks (e.g., researching products or stocks). We propose to use a web table corpus to perform them automatically. We require the operations to have high precision and coverage, to have fast (ideally interactive) response times, and to be applicable to arbitrary entity domains. The naive approach of directly matching the user input against the web tables suffers from poor precision and coverage. Our key insight is that we can achieve much higher precision and coverage by considering indirectly matching tables in addition to the directly matching ones. The challenge is to be robust to spuriously matched tables. We address it by developing a holistic matching framework based on topic-sensitive PageRank and an augmentation framework that aggregates predictions from multiple matched tables. We propose a novel architecture that leverages preprocessing in MapReduce to achieve extremely fast response times at query time. Our experiments on real-life datasets and 573M web tables show that our approach has (i) significantly higher precision and coverage and (ii) four orders of magnitude faster response times compared with the state-of-the-art approach.
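To make the idea of indirect matching concrete, the following is a minimal sketch and not the paper's implementation. It assumes a small hypothetical table-similarity graph (`graph`), hypothetical direct-match scores used as the teleport preference (`seeds`), and a toy mapping from tables to the values they assert for an entity (`table_values`). The function names, edge weights, damping factor, and the simple score-weighted vote are all illustrative assumptions; the abstract does not specify how the graph is built or how predictions are aggregated.

```python
from collections import defaultdict


def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Topic-sensitive (personalized) PageRank by power iteration.

    graph: dict mapping a table id to {neighbor table id: edge weight}.
    seeds: dict mapping a table id to its direct-match score; random
           jumps land only on these tables (assumed teleport vector).
    """
    total = sum(seeds.values())
    if total <= 0:
        raise ValueError("at least one seed must have a positive score")
    nodes = set(graph) | set(seeds)
    for neighbors in graph.values():
        nodes |= set(neighbors)
    teleport = {n: seeds.get(n, 0.0) / total for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            out = graph.get(n, {})
            out_sum = sum(out.values())
            if out_sum == 0:
                # Dangling table: hand its mass back to the seed tables.
                for m in nodes:
                    nxt[m] += damping * rank[n] * teleport[m]
            else:
                for m, weight in out.items():
                    nxt[m] += damping * rank[n] * weight / out_sum
        rank = nxt
    return rank


def aggregate_predictions(rank, table_values, entity):
    """Score-weighted vote over the values that matched tables assert."""
    votes = defaultdict(float)
    for table, score in rank.items():
        value = table_values.get(table, {}).get(entity)
        if value is not None:
            votes[value] += score
    return max(votes, key=votes.get) if votes else None


# Hypothetical toy corpus: T1 matches the query directly, T2 and T3 are
# reachable only through T1, and T4 is a weak, isolated (spurious) match.
graph = {
    "T1": {"T2": 0.9, "T3": 0.8},
    "T2": {"T1": 0.9},
    "T3": {"T1": 0.8},
}
seeds = {"T1": 1.0, "T4": 0.3}
table_values = {
    "T2": {"D3100": "Nikon"},
    "T3": {"D3100": "Nikon"},
    "T4": {"D3100": "Sony"},
}

rank = personalized_pagerank(graph, seeds)
prediction = aggregate_predictions(rank, table_values, "D3100")
```

In this toy setup T1 does not itself contain the queried entity, so the prediction comes entirely from the indirectly matching tables T2 and T3. The isolated table T4 keeps only the small score it receives from random jumps, so its conflicting value is outvoted.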