Exploring a new space of features for document classification: figure clustering

  • Authors:
  • Nawei Chen;Hagit Shatkay;Dorothea Blostein

  • Affiliations:
  • Queen's University, Kingston, Ontario, Canada;Queen's University, Kingston, Ontario, Canada;Queen's University, Kingston, Ontario, Canada

  • Venue:
  • CASCON '06 Proceedings of the 2006 conference of the Center for Advanced Studies on Collaborative research
  • Year:
  • 2006

Quantified Score

Hi-index 0.00

Visualization

Abstract

Automatic document classification is an important step in organizing and mining documents. Information in documents is often conveyed using both text and images that complement each other. Typically, only the text content forms the basis for features that are used in document classification. In this paper, we explore the use of information from figure images to assist in this task. We explore image clustering as a basis for constructing visual words for representing documents. Once such visual words are formed, the standard bag-of-words representation along with commonly used classifiers, such as the naïve Bayes, can be used to classify a document. We report here results from classifying biomedical documents that were previously used in the TREC Genomics track, employing the image-based representation. Efforts are ongoing to improve image-based classification and analyze the relationships between text and images. The goal is to develop a new set of features to supplement current text-based features.