OCELOT: a system for summarizing Web pages

  • Authors: Adam L. Berger; Vibhu O. Mittal
  • Affiliations: School of Computer Science, Carnegie Mellon University, Pittsburgh, PA; Just Research, 4616 Henry Street, Pittsburgh, PA
  • Venue: SIGIR '00: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
  • Year: 2000

Abstract

We introduce OCELOT, a prototype system for automatically generating the “gist” of a web page by summarizing it. Although most text summarization research to date has focused on news articles, web pages are quite different in both structure and content. Instead of coherent text with a well-defined discourse structure, they are more often a chaotic jumble of phrases, links, graphics and formatting commands. Such text provides little foothold for extractive summarization techniques, which attempt to generate a summary of a document by excerpting a contiguous, coherent span of text from it. This paper builds upon recent work in non-extractive summarization, producing the gist of a web page by “translating” it into a more concise representation rather than attempting to extract a text span verbatim. OCELOT uses probabilistic models to guide it in selecting and ordering words into a gist. This paper describes a technique for learning these models automatically from a collection of human-summarized web pages.
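The abstract does not specify the models, so the following is only a toy sketch of the general idea of selecting and then ordering words with probabilities learned from (page, gist) pairs. It assumes a per-word selection probability and a bigram ordering model, both with add-one smoothing; these choices, the training data, and every name (`train`, `selection_prob`, `bigram_prob`, `generate_gist`, `PAIRS`) are hypothetical illustrations and are not taken from OCELOT itself.

```python
from collections import Counter, defaultdict

START, END = "<s>", "</s>"  # sentence-boundary markers (illustrative)


def train(pairs):
    """Collect counts from (page_tokens, gist_tokens) pairs."""
    page_count = Counter()          # pages in which a word occurs
    kept_count = Counter()          # pages whose gist also contains that word
    bigrams = defaultdict(Counter)  # word-to-next-word counts inside gists
    vocab = {END}
    for page, gist in pairs:
        gist_words = set(gist)
        vocab.update(gist_words)
        for word in set(page):
            page_count[word] += 1
            if word in gist_words:
                kept_count[word] += 1
        prev = START
        for word in list(gist) + [END]:
            bigrams[prev][word] += 1
            prev = word
    return page_count, kept_count, bigrams, vocab


def selection_prob(word, model):
    """Smoothed estimate that a word on a page is kept in the gist."""
    page_count, kept_count, _, _ = model
    return (kept_count[word] + 1) / (page_count[word] + 2)


def bigram_prob(prev, word, model):
    """Add-one smoothed score for `word` following `prev` in a gist."""
    _, _, bigrams, vocab = model
    following = bigrams.get(prev, Counter())
    return (following[word] + 1) / (sum(following.values()) + len(vocab))


def generate_gist(page, model, length=3):
    """Pick the `length` most gist-like page words, then order them greedily."""
    candidates = sorted(set(page))  # alphabetical order makes ties deterministic
    ranked = sorted(candidates, key=lambda w: -selection_prob(w, model))
    remaining = ranked[:length]
    gist, prev = [], START
    while remaining:
        best = max(remaining, key=lambda w: bigram_prob(prev, w, model))
        gist.append(best)
        remaining.remove(best)
        prev = best
    return gist


# Made-up training data: tokenized pages paired with human-written gists.
PAIRS = [
    ("click here home jazz music reviews click".split(), "jazz music reviews".split()),
    ("home login jazz festival tickets click here".split(), "jazz festival tickets".split()),
    ("home click here music store guitars".split(), "music store".split()),
]

if __name__ == "__main__":
    model = train(PAIRS)
    print(generate_gist("click here home jazz music news".split(), model, length=2))
```

In this sketch, navigation words such as "click" and "home" appear on every training page but never in a gist, so they receive a low selection probability. That is the sense in which such a model can cope with pages that are a jumble of phrases and links. A greedy ordering step is used here purely for brevity.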