Learning to identify new information

  • Authors:
  • Barry Schiffman; Kathleen R. McKeown

  • Affiliations:
  • Columbia University; Columbia University

  • Venue:
  • Columbia University (Ph.D. thesis)
  • Year:
  • 2005

Abstract

This thesis investigates a new problem in natural language processing: new-information detection. The task resembles first-story detection, but with one crucial difference: first-story detection operates at the document level, while new-information detection operates at the statement level. In its fundamental guise, new-information detection is the ability of a machine to compare two textual statements and decide whether they say the same thing. The task is complicated by the fact that each new statement must be tested against all previous statements. In this thesis, I show that the sentence is a poor choice of syntactic unit for this task, since sentences are arbitrarily composed of one or more structures; the system must therefore perform a deeper syntactic analysis of the inputs than merely recognizing sentence boundaries. At the same time, I found that context is important, and I developed a mechanism to look beyond sentence boundaries for evidence of novelty. The system I developed thus considers a mixture of features, from a micro perspective, looking within sentences, and from a macro perspective, looking beyond sentence boundaries. I apply machine learning techniques, using rule induction, to combine the features coherently into a unified hypothesis for the problem. The system is designed to function in a multi-document summarization system, such as Columbia's NEWSBLASTER, for which it produces update summaries focusing on the day's developments in an event that has interested the public over a period of several days. The new-information system provides all the novel statements to the DEMS summarizer, which I had previously built for NEWSBLASTER, for the final selection of material. The system also includes a semantic unit that improves performance somewhat, though not as much as I had hoped. At present it uses a plugin lexicon that is largely drawn from the WordNet data.
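To make the core task concrete, the following is a minimal sketch, in Python, of the statement-level comparison described above: each incoming statement is tested against all previous statements and kept only if none of them says the same thing. The token-overlap score and the threshold value are illustrative assumptions only; the thesis combines micro- and macro-level features through learned rule induction rather than any single surface-similarity measure.

    import re

    def tokens(statement):
        """Lowercased word set; a crude stand-in for the thesis's deeper syntactic analysis."""
        return set(re.findall(r"[a-z0-9]+", statement.lower()))

    def overlap(a, b):
        """Jaccard similarity between two token sets."""
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    def novel_statements(stream, threshold=0.6):
        """Yield statements judged novel against every statement seen so far."""
        seen = []
        for statement in stream:
            toks = tokens(statement)
            # A statement is novel only if no earlier statement is too similar to it.
            if all(overlap(toks, prev) < threshold for prev in seen):
                yield statement
            seen.append(toks)

    if __name__ == "__main__":
        feed = [
            "The storm hit the coast on Monday.",
            "On Monday the storm hit the coast.",    # paraphrase: filtered out
            "Evacuations began in three counties.",  # new development: kept
        ]
        for s in novel_statements(feed):
            print("NOVEL:", s)

In the thesis's setting, the overlap test would be replaced by the induced rules over within-sentence (micro) and cross-sentence (macro) features, and the surviving novel statements would be passed to the DEMS summarizer for final selection.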