Retrieval of snippets of web pages converted to plain text: more questions than answers

Authors:
Carlos G. Figuerola;José Luis Alonso Berrocal;Ángel F. Zazo Rodríguez;Montserrat Mateos
Affiliations:
University of Salamanca, REINA Research Group, Salamanca, Spain;University of Salamanca, REINA Research Group, Salamanca, Spain;University of Salamanca, REINA Research Group, Salamanca, Spain;University of Salamanca, REINA Research Group, Salamanca, Spain
Venue:
CLEF'08 Proceedings of the 9th Cross-language evaluation forum conference on Evaluating systems for multilingual and multimodal information access
Year:
2008

Abstract

This year's WebCLEF task was to retrieve snippets and pieces from documents on various topics. The extraction and selection of the most widely used snippets can be carried out using various methods. However, the way in which web pages are usually converted to plain text introduces a series of problems that make retrieval inefficient. Duplicate information and snippets that are entirely irrelevant or even meaningless are some of these problems. This paper also aims to explore the real impact of using several languages on obtaining relevant fragments.