Text categorization for multi-page documents: a hybrid naive Bayes HMM approach

  • Authors:
  • Paolo Frasconi;Giovanni Soda;Alessandro Vullo

  • Affiliations:
  • Department of Systems and Computer Science, University of Florence, 50139 Firenze, Italy;Department of Systems and Computer Science, University of Florence, 50139 Firenze, Italy;Department of Systems and Computer Science, University of Florence, 50139 Firenze, Italy

  • Venue:
  • Proceedings of the 1st ACM/IEEE-CS joint conference on Digital libraries
  • Year:
  • 2001

Abstract

Text categorization is typically formulated as a concept learning problem where each instance is a single isolated document. In this paper we are interested in a more general formulation where documents are organized as page sequences, as naturally occurs in digital libraries of scanned books and magazines. We describe a method for classifying pages of sequential OCR text documents into one of several assigned categories and suggest that taking into account contextual information provided by the whole page sequence can significantly improve classification accuracy. The proposed architecture relies on hidden Markov models whose emissions are bags of words generated according to a multinomial word event model, as in the generative portion of the Naive Bayes classifier. Our results on a collection of scanned journals from the Making of America project confirm the importance of using whole page sequences. Empirical evaluation indicates that the error rate (as obtained by running a plain Naive Bayes classifier on isolated pages) can be reduced by roughly half if contextual information is incorporated.
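The idea of combining Naive Bayes emissions with a hidden Markov model over page categories can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy vocabulary, the two categories ("article", "ads"), the Laplace smoothing, and the hand-set start and transition probabilities are all assumptions made for illustration. Each hidden state is a page category, each page's emission score is the multinomial bag-of-words log-likelihood (the multinomial coefficient is omitted because it is the same for every category), and Viterbi decoding picks the most likely category sequence for the whole document.

```python
import math
from collections import Counter


def train_emissions(pages, labels, vocab, alpha=1.0):
    """Laplace-smoothed multinomial word log-probabilities per category."""
    counts = {c: Counter() for c in set(labels)}
    for words, c in zip(pages, labels):
        counts[c].update(w for w in words if w in vocab)
    log_p = {}
    for c, cnt in counts.items():
        total = sum(cnt.values()) + alpha * len(vocab)
        log_p[c] = {w: math.log((cnt[w] + alpha) / total) for w in vocab}
    return log_p


def page_log_likelihood(words, log_p_c):
    """Bag-of-words log-likelihood of one page; unknown words are skipped."""
    return sum(log_p_c[w] for w in words if w in log_p_c)


def naive_bayes(pages, states, log_p):
    """Baseline: classify each page in isolation (uniform class prior)."""
    return [max(states, key=lambda s: page_log_likelihood(p, log_p[s]))
            for p in pages]


def viterbi(pages, states, log_start, log_trans, log_p):
    """Most likely category sequence for a whole page sequence."""
    # delta[s] = best log score of any category path ending in state s
    delta = {s: log_start[s] + page_log_likelihood(pages[0], log_p[s])
             for s in states}
    back = []
    for words in pages[1:]:
        prev, delta, ptr = delta, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + log_trans[r][s])
            ptr[s] = best
            delta[s] = (prev[best] + log_trans[best][s]
                        + page_log_likelihood(words, log_p[s]))
        back.append(ptr)
    # Backtrack from the best final state
    path = [max(states, key=lambda s: delta[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


# Toy training data (hypothetical, for illustration only)
train_pages = [["war", "army", "general"], ["army", "battle", "war"],
               ["sale", "price", "goods"], ["price", "sale", "shop"]]
train_labels = ["article", "article", "ads", "ads"]
vocab = {w for page in train_pages for w in page}
states = ["article", "ads"]
log_p = train_emissions(train_pages, train_labels, vocab)

# Assumed "sticky" transitions: consecutive pages tend to share a category
log_start = {s: math.log(0.5) for s in states}
log_trans = {r: {s: math.log(0.9 if r == s else 0.1) for s in states}
             for r in states}

# The middle page contains a single word that looks like an advertisement
doc = [["war", "army"], ["price"], ["battle", "general"]]
isolated = naive_bayes(doc, states, log_p)
in_context = viterbi(doc, states, log_start, log_trans, log_p)
```

On this toy document the isolated classifier labels the middle page "ads", while the sequence decoder keeps it as "article" because the neighbouring pages make a category switch unlikely. In a real setting the start and transition probabilities would be estimated from labelled page sequences rather than set by hand.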