Statistical machine learning for information retrieval

  • Authors:
  • Adam Berger; John Lafferty

  • Affiliations:
  • Carnegie Mellon University; Carnegie Mellon University

  • Venue:
  • Statistical machine learning for information retrieval
  • Year:
  • 2001

Abstract

The purpose of this work is to introduce and experimentally validate a framework, based on statistical machine learning, for handling a broad range of problems in information retrieval (IR). Probably the most important single component of this framework is a parametric statistical model of word relatedness. A longstanding problem in IR has been to develop a mathematically principled model for document processing which acknowledges that one sequence of words may be closely related to another even if the two have few (or no) words in common. Until now, the word-relatedness problem has typically been addressed with techniques like automatic query expansion [75], an often successful though ad hoc technique which artificially injects new, related words into a document to ensure that related documents have some lexical overlap. In the past few years, a number of novel probabilistic approaches to information processing have emerged, including the language modeling approach to document ranking first suggested by Ponte and Croft [67], the non-extractive summarization work of Mittal and Witbrock [87], and the Hidden Markov Model-based ranking of Miller et al. [61]. This thesis advances that body of work by proposing a principled, general probabilistic framework which naturally accounts for word-relatedness issues, using techniques from statistical machine learning such as the Expectation-Maximization (EM) algorithm [24]. Applying this new framework to the problem of ranking documents by relevance to a query, for instance, we obtain a model that contains a version of the Ponte and Miller models as a special case, but surpasses them in its ability to recognize the relevance of a document to a query even when the two have minimal lexical overlap. (Abstract shortened by UMI.)
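The core word-relatedness idea described above — scoring p(query | document) as a mixture of per-word "translation" probabilities estimated with EM — can be sketched in miniature. This is an illustrative toy in the spirit of the framework, not the thesis's actual model: the function names, the toy training pairs, and the smoothing constants are all invented for the example.

```python
import math
from collections import defaultdict

def train_translation_model(pairs, iterations=20):
    # EM estimation of t(q | w): the probability that document word w
    # accounts for query word q (an IBM Model 1-style mixture).
    # `pairs` is a list of (query_words, doc_words) tuples assumed related.
    t = {}
    for qs, ds in pairs:                      # uniform initialization over
        for q in qs:                          # co-occurring (q, w) pairs
            for w in ds:
                t[(q, w)] = 1.0
    for _ in range(iterations):
        count = defaultdict(float)            # expected fractional counts
        total = defaultdict(float)
        for qs, ds in pairs:
            for q in qs:
                z = sum(t[(q, w)] for w in ds)
                for w in ds:                  # E-step: share credit for q
                    c = t[(q, w)] / z         # among the doc's words
                    count[(q, w)] += c
                    total[w] += c
        for (q, w), c in count.items():       # M-step: renormalize per w
            t[(q, w)] = c / total[w]
    return t

def score(query, doc, t, alpha=0.9, background=1e-4):
    # log p(query | doc): each query word is generated by "translating"
    # a document word; a small background mass keeps zero-overlap terms
    # finite (alpha and background are arbitrary illustrative constants).
    s = 0.0
    for q in query:
        p_trans = sum(t.get((q, w), 0.0) for w in doc) / len(doc)
        s += math.log(alpha * p_trans + (1 - alpha) * background)
    return s
```

With toy training pairs such as `(["car"], ["automobile", "engine"])`, the learned model ranks a document containing "automobile" above an unrelated document for the query "car", even though query and document share no words — the behavior that plain lexical-overlap ranking cannot produce.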