The currently established formats through which a Web site can publish metadata about its pages, the robots.txt file and sitemaps, focus on telling crawlers which parts of a site to avoid and which parts to visit. This is sufficient as input for crawlers, but it does not allow Web sites to publish richer metadata about their structure, such as the navigational structure. This paper examines the availability of Web site metadata on today's Web, both in terms of the information resources that are available and the quantitative properties of their contents. Such an analysis of the available Web site metadata not only makes it easier to understand what data is available today; it also serves as the foundation for investigating what kinds of information retrieval processes could be driven by that data, and what additional data Web sites could provide if richer formats for publishing metadata existed.
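To make the two formats concrete, the following is a minimal sketch (not part of the paper itself) using Python's standard urllib.robotparser module: it fetches a site's robots.txt, checks whether a given path may be crawled, and lists any sitemaps the file advertises. The example.com URLs are placeholders, and site_maps() requires Python 3.8 or later.

```python
from urllib import robotparser

# Placeholder URL; a site's robots.txt lives at this well-known location.
ROBOTS_URL = "https://example.com/robots.txt"

rp = robotparser.RobotFileParser()
rp.set_url(ROBOTS_URL)
rp.read()  # fetch the file over HTTP and parse it

# robots.txt tells crawlers where not to go (Disallow) and where to go (Allow);
# can_fetch() answers that question for a given user agent and URL.
print(rp.can_fetch("*", "https://example.com/private/page"))

# Sitemap directives point crawlers at the "where to go" metadata;
# site_maps() returns the listed sitemap URLs, or None if there are none.
print(rp.site_maps())
```

Note that this captures exactly the crawler-oriented scope the abstract describes: permissions and sitemap locations, but nothing about the site's navigational structure.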