An analysis of structured data on the web

Authors:
Nilesh Dalvi;Ashwin Machanavajjhala;Bo Pang
Affiliations:
Yahoo! Research, Great America Parkway, Santa Clara, CA;Yahoo! Research, Great America Parkway, Santa Clara, CA;Yahoo! Research, Great America Parkway, Santa Clara, CA
Venue:
Proceedings of the VLDB Endowment
Year:
2012

Abstract

In this paper, we analyze the nature and distribution of structured data on the Web. Web-scale information extraction, the problem of creating structured tables by extracting from the entire Web, is attracting considerable research interest. We perform a study to understand and quantify the value of Web-scale extraction, and to determine how structured information is distributed between top aggregator websites and tail sites across several domains of interest. We believe this is the first study of its kind, and that it offers new insights into information extraction over the Web.