Elimination of junk document surrogate candidates through pattern recognition

  • Authors:
  • Eunyee Koh;Daniel Caruso;Andruid Kerne;Ricardo Gutierrez-Osuna

  • Affiliations:
  • Texas A&M University, College Station, TX;Texas A&M University, College Station, TX;Texas A&M University, College Station, TX;Texas A&M University, College Station, TX

  • Venue:
  • Proceedings of the 2007 ACM symposium on Document engineering
  • Year:
  • 2007

Quantified Score

Hi-index 0.00

Visualization

Abstract

A surrogate is an object that stands for a document and enables navigation to that document. Hypermedia is often represented with textual surrogates, even though studies have shown that image and text surrogates facilitate the formation of mental models and overall understanding. Surrogates may be formed by breaking a document down into a set of smaller elements, each of which is a surrogate candidate. While processing these surrogate candidates from an HTML document, relevant information may appear together with less useful junk material, such as navigation bars and advertisements. This paper develops a pattern recognition based approach for eliminating junk while building the set of surrogate candidates. The approach defines features on candidate elements, and uses classification algorithms to make selection decisions based on these features. For the purpose of defining features in surrogate candidates, we introduce the Document Surrogate Model (DSM), a streamlined Document Object Model (DOM)-like representation of semantic structure. Using a quadratic classifier, we were able to eliminate junk surrogate candidates with an average classification rate of 80%. By using this technique, semiautonomous agents can be developed to more effectively generate surrogate collections for users. We end by describing a new approach for hypermedia and the semantic web, which uses the DSM to define value-added surrogates for a document.