The downside of markup: examining the harmful effects of CSS and javascript on indexing today's web

Authors:
Karl Gyllstrom;Carsten Eickhoff;Arjen P. de Vries;Marie-Francine Moens
Affiliations:
Katholieke Universiteit Leuven, Leuven, Belgium;Delft University of Technology, Delft, Netherlands;Centrum Wiskunde & Informatica, Amsterdam, Netherlands;Katholieke Universiteit Leuven, Leuven, Belgium
Venue:
Proceedings of the 21st ACM international conference on Information and knowledge management
Year:
2012

Citing 4
Cited 0

The anatomy of a large-scale hypertextual Web search engine

WWW7 Proceedings of the seventh international conference on World Wide Web 7
Multi-style language model for web scale information retrieval

Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
A site oriented method for segmenting web pages

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
DOM based content extraction via text density

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval

Quantified Score

Hi-index	0.00

Visualization

Abstract

The continued development and maturation of advanced HTML features such as Cascading style sheets (CSS), Javascript, and AJAX, as well as their widespread adoption by browsers, has enabled web pages to flourish with sophistication and interactivity. Unfortunately, this presents challenges to the web search community, as a web page's representation in the browser (i.e., what users see) can diverge dramatically from its raw HTML content (i.e., what search engines index and retrieve). For example, interactive pages may contain content in regions that are not visible before a user action, such as focusing a tab, but which are nonetheless still contained within the raw HTML. We study this divergence by comparing raw HTML to its fully rendered form across a number of metrics spanning presentation, geometry, and content, using a large, representative sample of popular web pages. We find that a large divergence currently exists, and we show via a historical analysis that this divergence has grown more pronounced over the last decade. The general finding of our study is that continuing to index the web via simple HTML parsing will diminish the effectiveness of retrieval on the modern web, and that the IR community should work toward more sophisticated web page processing in indexing technology.