WebTables: exploring the power of tables on the web

Authors:
Michael J. Cafarella;Alon Halevy;Daisy Zhe Wang;Eugene Wu;Yang Zhang
Affiliations:
University of Washington, Seattle, WA;Google, Inc., Mountain View, CA;UC Berkeley, Berkeley, CA;MIT, Cambridge, MA;MIT, Cambridge, MA
Venue:
Proceedings of the VLDB Endowment
Year:
2008


Abstract

The World-Wide Web consists of a huge number of unstructured documents, but it also contains structured data in the form of HTML tables. We extracted 14.1 billion HTML tables from Google's general-purpose web crawl, and used statistical classification techniques to find the estimated 154M that contain high-quality relational data. Because each relational table has its own "schema" of labeled and typed columns, each such table can be considered a small structured database. The resulting corpus of databases is larger than any other corpus we are aware of, by at least five orders of magnitude. We describe the WEBTABLES system, which we use to explore two fundamental questions about this collection of databases. First, what are effective techniques for searching for structured data at search-engine scales? Second, what additional power can be derived by analyzing such a huge corpus? To address the first question, we develop new techniques for keyword search over a corpus of tables, and show that they can achieve substantially higher relevance than solutions based on a traditional search engine. To address the second, we introduce a new object derived from the database corpus: the attribute correlation statistics database (AcsDB), which records corpus-wide statistics on co-occurrences of schema elements. In addition to improving search relevance, the AcsDB makes possible several novel applications: schema auto-complete, which helps a database designer choose schema elements; attribute synonym finding, which automatically computes attribute synonym pairs for schema matching; and join-graph traversal, which allows a user to navigate between extracted schemas using automatically generated join links.
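The idea of recording co-occurrence statistics over schema elements, and using them for schema auto-complete, can be illustrated with a small sketch. This is not the WEBTABLES implementation: the function names `build_acsdb` and `suggest_attributes`, the toy schemas, and the ranking rule (the fraction of matching schemas that also contain a candidate attribute) are all assumptions made for illustration only.

```python
from collections import Counter


def build_acsdb(schemas):
    """Count how often each distinct schema (a set of attribute names) occurs.

    Hypothetical stand-in for an attribute co-occurrence store: it keeps
    one counter entry per unique set of lower-cased attribute names.
    """
    acsdb = Counter()
    for schema in schemas:
        acsdb[frozenset(attr.lower() for attr in schema)] += 1
    return acsdb


def suggest_attributes(acsdb, given, k=3):
    """Rank attributes by how often they co-occur with all `given` attributes.

    Each score is an estimate of P(candidate | given): the number of
    schemas containing the given attributes and the candidate, divided by
    the number of schemas containing the given attributes.
    """
    given = frozenset(attr.lower() for attr in given)
    support = 0          # schemas that contain every given attribute
    cooccur = Counter()  # per-candidate co-occurrence counts
    for schema, count in acsdb.items():
        if given <= schema:
            support += count
            for attr in schema - given:
                cooccur[attr] += count
    if support == 0:
        return []
    # Highest count first; ties broken alphabetically for determinism.
    ranked = sorted(cooccur.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(attr, count / support) for attr, count in ranked[:k]]


# Toy corpus of extracted schemas (invented for this example).
toy_schemas = [
    ["make", "model", "year"],
    ["make", "model", "year"],
    ["make", "model", "price"],
    ["name", "address", "phone"],
    ["make", "model", "year", "price"],
]

acsdb = build_acsdb(toy_schemas)
suggestions = suggest_attributes(acsdb, ["make"])
print(suggestions)
```

On this toy corpus, a designer who has typed only `make` would be offered `model`, `year`, and `price`, in that order. The same counts could in principle feed the other applications the abstract mentions, though how the actual system computes synonyms or join links is not specified here.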