The currently established formats through which a Web site can publish metadata about its pages, the robots.txt file and sitemaps, focus on telling crawlers which parts of a site to avoid and which parts to visit. This is sufficient as input for crawlers, but it does not allow Web sites to publish richer metadata about their structure, such as the navigational structure. This paper examines the availability of Web site metadata on today's Web, both in terms of the information resources that are available and the quantitative properties of their contents. Such an analysis of the available Web site metadata not only makes it easier to understand what data is available today; it also serves as the foundation for investigating what kinds of information retrieval processes could be driven by that data, and what additional data Web sites could provide if richer formats for publishing metadata existed.
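To make the two formats concrete, the following is a minimal sketch (not part of the paper itself) using Python's standard urllib.robotparser module: it fetches a site's robots.txt, checks whether a given path may be crawled, and lists any sitemaps the file advertises. The example.com URLs are placeholders, and site_maps() requires Python 3.8 or later.

```python
from urllib import robotparser

# Placeholder URL; a site's robots.txt lives at this well-known location.
ROBOTS_URL = "https://example.com/robots.txt"

rp = robotparser.RobotFileParser()
rp.set_url(ROBOTS_URL)
rp.read()  # fetch the file over HTTP and parse it

# robots.txt tells crawlers where not to go (Disallow) and where to go (Allow);
# can_fetch() answers that question for a given user agent and URL.
print(rp.can_fetch("*", "https://example.com/private/page"))

# Sitemap directives point crawlers at the "where to go" metadata;
# site_maps() returns the listed sitemap URLs, or None if there are none.
print(rp.site_maps())
```

Note that this captures exactly the crawler-oriented scope the abstract describes: permissions and sitemap locations, but nothing about the site's navigational structure.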