Structured data on the web

Author: Alon Y. Halevy
Affiliation: Google Inc., Mountain View, California
Venue: NGITS'09: Proceedings of the 7th International Conference on Next Generation Information Technologies and Systems
Year: 2009

References

[1] Google's Deep Web crawl. Proceedings of the VLDB Endowment.
[2] WebTables: exploring the power of tables on the web. Proceedings of the VLDB Endowment.


Abstract

Though search on the World Wide Web has focused mostly on unstructured text, there is an increasing amount of structured data on the Web and growing interest in harnessing it. I will describe several current projects at Google whose overall goal is to leverage structured data and better expose it to our users. The first project concerns crawling the deep web. The deep web refers to content that resides in databases behind forms and is unreachable by search engines because no links lead to the resulting pages. I will describe a system that surfaces pages from the deep web by guessing queries to submit to these forms and entering the results into the Google index [1]. The pages generated by this system come from millions of forms, hundreds of domains, and over 40 languages. Pages from the deep web are served in the top-10 results on google.com for over 1000 queries per second. The second project considers the collection of HTML tables on the Web. The WebTables project [2] built a corpus of over 150 million tables from HTML tables on the Web. The WebTables system addresses the challenges of extracting these tables from the Web and offers search over the resulting collection. The project also illustrates the potential of leveraging the collection of schemas of these tables. Finally, I will discuss current work on computing aspects of queries in order to better organize search results for exploratory queries.