Bias-variance analysis in estimating true query model for information retrieval
Information Processing and Management: an International Journal
Classification tasks in information retrieval deal with document collections of enormous size, so the ratio between the set of documents underlying the learning process and the set of unseen documents is very small. With a ratio close to zero, evaluating the generalization ability of a model-classifier combination with leave-n-out methods or cross-validation becomes unreliable: the generalization error of a complex model (one with a more complex hypothesis structure) might be underestimated compared to the generalization error of a simple model (one with a less complex hypothesis structure). In this situation, optimizing the bias-variance tradeoff to select among these models will lead one astray. To address this problem we introduce the idea of robust models, where one intentionally restricts the hypothesis structure within the model formation process. We observe that, although such a robust model entails a higher test error, "in the wild" it outperforms the model that would normally have been chosen for offering the best bias-variance tradeoff. We present two case studies: (1) a categorization task, which demonstrates that robust models are more stable in retrieval situations when training data is scarce, and (2) a genre identification task, which underlines the practical relevance of robust models.