Towards realistic known-item topics for the ClueWeb

Authors:
Claudia Hauff; Matthias Hagen; Anna Beyer; Benno Stein
Affiliations:
Delft University of Technology, Delft, The Netherlands; Bauhaus-Universität Weimar, Weimar, Germany; Bauhaus-Universität Weimar, Weimar, Germany; Bauhaus-Universität Weimar, Weimar, Germany
Venue:
Proceedings of the 4th Information Interaction in Context Symposium
Year:
2012

Abstract

Known-item finding is the task of re-finding and re-accessing a previously seen item. Typical examples of known items include previously visited Web sites, received emails, or documents on one's personal desktop. Current research on known-item finding relies heavily on corpora of known-item queries and the respective known items. However, many existing corpora are proprietary and not available to the public (in particular those derived from Web query logs), which prevents repeatable research. The existing publicly available corpora contain either automatically generated queries or queries that were manually generated while seeing the known item itself. Hence, we consider these public corpora to be rather artificial in nature. In this paper, we propose a methodology to create a known-item topic set that is much more realistic and that is built on top of a large-scale public test corpus. From known-item questions posted on the popular Yahoo! Answers platform, we extract queries for known items in a crowdsourcing setup. Since we ensure that all the known items correspond to Web pages in the publicly available ClueWeb09 corpus (a large static Web crawl), we provide an environment for repeatable, realistic, Web-scale known-item searches.