Improving retrievability and recall by automatic corpus partitioning

Authors:
Shariq Bashir;Andreas Rauber
Affiliations:
Institute of Software Technology and Interactive Systems, Vienna University of Technology, Austria;Institute of Software Technology and Interactive Systems, Vienna University of Technology, Austria
Venue:
Transactions on large-scale data- and knowledge-centered systems II
Year:
2010

Abstract

With increasing volumes of data, much effort has been devoted to finding the most suitable answer to an information need. However, in many domains, the question of whether any specific information item can be found at all via a reasonable set of queries is essential. This concept of retrievability of information has evolved into an important evaluation measure of IR systems in recall-oriented application domains. While several studies have evaluated retrieval bias in systems, solid validation of the impact of retrieval bias and the development of methods to counter low retrievability of certain document types would be desirable. This paper provides an in-depth study of retrievability characteristics over queries of different lengths in a large benchmark corpus, validating previous studies. It analyzes the possibility of automatically categorizing documents into low- and high-retrievability classes based on document properties rather than complex retrievability analysis. We furthermore show that this classification can be used to improve the overall retrievability of documents by treating these classes as separate document corpora and combining the individual retrieval results. Experiments are validated on 1.2 million patents of the TREC Chemical Retrieval Track.
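
To make the two central ideas of the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the common cumulative formulation of retrievability, in which a document's score is the number of queries that rank it within a cutoff, and it assumes a simple round-robin interleaving to combine results from separately searched partitions; the abstract does not specify the merging strategy. The names `retrievability`, `partitioned_search`, `search`, and `cutoff` are hypothetical.

```python
from collections import Counter


def retrievability(rankings, cutoff):
    """Count, per document, how many queries rank it within `cutoff`.

    `rankings` maps a query to its ranked list of document ids.
    Assumption: cumulative retrievability with equal query weights.
    """
    scores = Counter()
    for ranked_docs in rankings.values():
        for doc_id in ranked_docs[:cutoff]:
            scores[doc_id] += 1
    return scores


def partitioned_search(search, partitions, query, cutoff):
    """Search each partition separately, then interleave the results.

    `search(partition, query)` is a hypothetical function returning a
    ranked list of document ids. Round-robin merging is an assumption
    made for illustration only.
    """
    per_partition = [search(p, query)[:cutoff] for p in partitions]
    merged = []
    for rank in range(cutoff):
        for results in per_partition:
            if rank < len(results):
                merged.append(results[rank])
    return merged[:cutoff]
```

Documents that never appear within the cutoff for any query get a score of zero, which is the kind of low-retrievability case the partitioning is meant to address: searching the low-retrievability class as its own corpus gives such documents a chance to reach the merged result list.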