University of Glasgow at WebCLEF 2005: experiments in per-field normalisation and language-specific stemming

Authors:
Craig Macdonald;Vassilis Plachouras;Ben He;Christina Lioma;Iadh Ounis
Affiliations:
University of Glasgow, UK;University of Glasgow, UK;University of Glasgow, UK;University of Glasgow, UK;University of Glasgow, UK
Venue:
CLEF'05 Proceedings of the 6th international conference on Cross-Language Evaluation Forum: accessing Multilingual Information Repositories
Year:
2005

Abstract

We participated in the WebCLEF 2005 monolingual task. In this task, a search system aims to retrieve relevant documents from a multilingual corpus of Web documents drawn from the Web sites of European governments. Both the documents and the queries are written in a wide range of European languages. A challenge in this setting is to detect the language of documents and topics, and to process them appropriately. We develop a language-specific technique for applying the correct stemming approach, as well as for removing the correct stopwords from the queries. We represent documents using three fields, namely content, title, and the anchor text of incoming hyperlinks. We use a technique called per-field normalisation, which extends the Divergence From Randomness (DFR) framework, to normalise the term frequencies and to combine them across the three fields. We also employ the length of the URL path of Web documents. The ranking is based on combinations of the language-specific stemming, if applied, and the per-field normalisation. We use our Terrier platform for all our experiments. Our techniques perform strongly, achieving the four top-performing runs overall, as well as the top-performing run without metadata in the monolingual task. The best run uses only per-field normalisation, without applying stemming.
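
To make the idea of per-field normalisation concrete, the following is a minimal sketch and not the authors' implementation. It assumes the commonly cited form of DFR Normalisation 2 applied separately to each field, with the normalised frequencies combined as a weighted sum. The abstract does not state the formula, so the formula itself, the function name `per_field_tfn`, and the per-field parameters `weight` and `c` are all illustrative assumptions.

```python
import math

# Hypothetical field names, matching the three fields named in the abstract.
FIELDS = ("content", "title", "anchor")


def per_field_tfn(tf, length, avg_length, weight, c):
    """Combine per-field normalised term frequencies into one value.

    Assumed form (DFR Normalisation 2 applied to each field f):
        tfn = sum_f weight[f] * tf[f] * log2(1 + c[f] * avg_length[f] / length[f])

    tf         -- term frequency of the query term in each field of the document
    length     -- length (in tokens) of each field of the document
    avg_length -- average length of each field across the collection
    weight     -- assumed per-field weight
    c          -- assumed per-field normalisation hyper-parameter
    """
    total = 0.0
    for field in FIELDS:
        # A field with no occurrences, or an empty field, contributes nothing.
        if tf.get(field, 0) == 0 or length.get(field, 0) == 0:
            continue
        norm = math.log2(1.0 + c[field] * avg_length[field] / length[field])
        total += weight[field] * tf[field] * norm
    return total


# Usage: each field has exactly average length and c = 1, so each
# normalisation factor is log2(2) = 1 and the result is the weighted sum.
example = per_field_tfn(
    tf={"content": 3, "title": 1, "anchor": 2},
    length={"content": 100, "title": 5, "anchor": 10},
    avg_length={"content": 100, "title": 5, "anchor": 10},
    weight={"content": 1.0, "title": 2.0, "anchor": 0.5},
    c={"content": 1.0, "title": 1.0, "anchor": 1.0},
)
```

Under these assumptions, the combined frequency would then be passed to a DFR weighting model in place of the raw term frequency. Normalising each field against its own average length keeps short fields such as titles from being penalised or rewarded by the statistics of the much longer content field.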