Robust web content extraction

Authors:
Marek Kowalkiewicz;Maria E. Orlowska;Tomasz Kaczmarek;Witold Abramowicz
Affiliations:
The Poznan University of Economics, Poznan, Poland;The University of Queensland, St. Lucia, Australia;The Poznan University of Economics, Poznan, Poland;The Poznan University of Economics, Poznan, Poland
Venue:
Proceedings of the 15th international conference on World Wide Web
Year:
2006

Abstract

We present an empirical evaluation and comparison of two content extraction methods for HTML: absolute XPath expressions and relative XPath expressions. We argue that relative XPath expressions, although not widely used, should be preferred to absolute XPath expressions when extracting content from human-created Web documents. The robustness evaluation covers four thousand queries executed on several hundred webpages. We show that, in referencing parts of real-world dynamic HTML documents, relative XPath expressions are on average significantly more robust than absolute ones.
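
To illustrate the distinction the abstract draws, the following is a minimal sketch and not the authors' evaluation setup. It assumes that "absolute" means a path counted step by step from the document root, and that "relative" means a path anchored on a nearby stable element (here, a table header containing the text "Price"). The two page snippets, the constant names, and the extract helper are hypothetical. The pages are written as well-formed XML so that Python's standard xml.etree.ElementTree module, which supports only a subset of XPath, can parse them.

```python
import xml.etree.ElementTree as ET

# Hypothetical page as first seen by the extractor.
ORIGINAL = """<html><body>
<div><h1>Widget</h1></div>
<div><table>
<tr><th>Price</th><td>10 USD</td></tr>
</table></div>
</body></html>"""

# The same hypothetical page after an edit: a banner <div> was inserted
# before the existing content, shifting every positional index below <body>.
REVISED = """<html><body>
<div>Sale banner</div>
<div><h1>Widget</h1></div>
<div><table>
<tr><th>Price</th><td>10 USD</td></tr>
</table></div>
</body></html>"""

# Absolute expression: every step is counted from the document root.
ABSOLUTE = "./body/div[2]/table/tr[1]/td[1]"

# Relative expression (assumed meaning): locate the row through an anchor,
# the <th> whose text is "Price", then step to the neighbouring <td>.
RELATIVE = ".//tr[th='Price']/td"


def extract(page: str, path: str):
    """Return the text of the first node matched by path, or None."""
    root = ET.fromstring(page)
    node = root.find(path)
    return node.text if node is not None else None


if __name__ == "__main__":
    for label, page in (("original", ORIGINAL), ("revised", REVISED)):
        print(label, "absolute:", extract(page, ABSOLUTE))
        print(label, "relative:", extract(page, RELATIVE))
```

In this sketch both expressions find the price on the original page, but only the anchored expression still finds it after the banner is inserted, because the absolute path's second div now points at a different element. Real HTML is often not well-formed XML, so a practical version would need an HTML-aware parser.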