Beyond keywords: finding information more accurately and easily using natural language

  • Authors:
  • Eugene Charniak; Matthew Lease

  • Affiliations:
  • Brown University; Brown University

  • Venue:
  • Beyond keywords: finding information more accurately and easily using natural language
  • Year:
  • 2010

Abstract

Information retrieval (IR) has become a ubiquitous technology for quickly and easily finding information on a given topic amidst the wealth of digital content available today. This dissertation addresses search for written and spoken natural language documents, including news articles, Web pages, and spoken interviews. Effective model estimation is identified as a key problem, and several novel estimation techniques are presented and shown to significantly enhance search accuracy.

While search is typically performed via a few carefully chosen keywords, formulating effective keyword queries is often unintuitive and iterative, particularly when seeking complex information. As an alternative to keyword search, this dissertation investigates search using “natural” queries, such as questions or sentences a person might naturally articulate in communicating their information need to another person. By moving toward supporting natural queries, the communication burden is shifted from user query formulation to system interpretation of natural language. The challenge in enacting such a shift is enabling automatic IR systems to more effectively cope with natural language. To this end, several new estimation techniques for modeling natural queries are described. In comparison to a maximum likelihood baseline, 15-20% relative improvement in mean average precision (MAP) is demonstrated without use of query expansion.

When an IR system discovers or is provided one or more feedback documents exemplifying a user’s information need, there is further opportunity to improve search accuracy by exploiting document contents for query expansion. However, since documents typically discuss multiple topics varying in importance and relevance to any information need, the system must again be able to effectively interpret verbose natural language. Consequently, an estimation method for leveraging such documents is presented and shown to yield state-of-the-art search accuracy. Depending on the base model employed, 15-85% relative MAP improvement is achieved.

When modeling higher-order lexical features or searching smaller document collections like cultural history archives, sparsity becomes particularly problematic for estimation. To cope with such sparsity, additional estimation methods are described which yield 5-20% relative improvement in MAP across varying conditions of query verbosity.
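
The results above are reported as relative improvement in MAP. As a reading aid, the sketch below shows how MAP and a relative improvement figure are conventionally computed. It is a minimal illustration and not the dissertation's evaluation code: the function names, the document and query identifiers, and the toy rankings are all hypothetical, and relevant documents that never appear in a ranking are assumed to contribute zero precision.

```python
from typing import Dict, List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list against the relevant doc ids."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    # Dividing by all relevant docs penalizes relevant docs never retrieved.
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Mean of per-query average precision over all judged queries."""
    return sum(
        average_precision(runs.get(q, []), rel) for q, rel in qrels.items()
    ) / len(qrels)


def relative_improvement(baseline: float, system: float) -> float:
    """Relative gain of `system` over `baseline`, e.g. 0.15 means 15%."""
    return (system - baseline) / baseline


# Hypothetical toy data: two queries with invented relevance judgments.
qrels = {"q1": {"d1", "d3"}, "q2": {"d2"}}
baseline_run = {"q1": ["d2", "d1", "d3"], "q2": ["d1", "d2"]}
improved_run = {"q1": ["d1", "d2", "d3"], "q2": ["d2", "d1"]}

map_baseline = mean_average_precision(baseline_run, qrels)
map_improved = mean_average_precision(improved_run, qrels)
gain = relative_improvement(map_baseline, map_improved)
print(f"baseline MAP={map_baseline:.4f}, improved MAP={map_improved:.4f}, "
      f"relative improvement={gain:.1%}")
```

Relative improvement is measured against the baseline score, so the same absolute MAP gain looks larger over a weak base model than over a strong one. This is one reason a reported range can be as wide as 15-85% depending on the base model.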