Algorithms on strings, trees, and sequences: computer science and computational biology
Algorithms on strings, trees, and sequences: computer science and computational biology
A Survey of Longest Common Subsequence Algorithms
SPIRE '00 Proceedings of the Seventh International Symposium on String Processing Information Retrieval (SPIRE'00)
Automatic information extraction from large websites
Journal of the ACM (JACM)
Learning to paraphrase: an unsupervised approach using multiple-sequence alignment
NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1
Bootstrapping lexical choice via multiple-sequence alignment
EMNLP '02 Proceedings of the ACL-02 conference on Empirical methods in natural language processing - Volume 10
WebTables: exploring the power of tables on the web
Proceedings of the VLDB Endowment
Learning to link with wikipedia
Proceedings of the 17th ACM conference on Information and knowledge management
Table extraction using spatial reasoning on the CSS2 visual box model
AAAI'06 proceedings of the 21st national conference on Artificial intelligence - Volume 2
Harvesting relational tables from lists on the web
Proceedings of the VLDB Endowment
Data integration for the relational web
Proceedings of the VLDB Endowment
Annotating and searching web tables using entities, types and relationships
Proceedings of the VLDB Endowment
Joint training for open-domain extraction on the web: exploiting overlap when supervision is limited
Proceedings of the fourth ACM international conference on Web search and data mining
Local and global algorithms for disambiguation to Wikipedia
HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
Recovering semantics of tables on the web
Proceedings of the VLDB Endowment
SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Understanding tables on the web
ER'12 Proceedings of the 31st international conference on Conceptual Modeling
Hi-index | 0.00 |
Several recent works have focused on harvesting HTML tables from the Web and recovering their semantics [Cafarella et al., 2008a; Elmeleegy et al., 2009; Limaye et al., 2010; Venetis et al., 2011]. As a result, hundreds of millions of high quality structured data tables can now be explored by the users. In this paper, we argue that those efforts only scratch the surface of the true value of structured data on the Web, and study the challenging problem of synthesizing tables from the Web, i.e., producing never-before-seen tables from raw tables on the Web. Table synthesis offers an important semantic advantage: when a set of related tables are combined into a single union table, powerful mechanisms, such as temporal or geographical comparison and visualization, can be employed to understand and mine the underlying data holistically. We focus on one fundamental task of table synthesis, namely, table stitching. Within a given site, many tables with identical schemas can be scattered across many pages. The task of table stitching involves combining such tables into a single meaningful union table and identifying extra attributes and values for its rows so that rows from different original tables can be distinguished. Specifically, we first define the notion of stitchable tables and identify collections of tables that can be stitched. Second, we design an effective algorithm for extracting hidden attributes that are essential for the stitching process and for aligning values of those attributes across tables to synthesize new columns. We also assign meaningful names to these synthesized columns. Experiments on real world tables demonstrate the effectiveness of our approach.