Text categorization for multi-page documents: a hybrid naive Bayes HMM approach

Authors:
Paolo Frasconi;Giovanni Soda;Alessandro Vullo
Affiliations:
Department of Systems and Computer Science, University of Florence, 50139 Firenze, Italy;Department of Systems and Computer Science, University of Florence, 50139 Firenze, Italy;Department of Systems and Computer Science, University of Florence, 50139 Firenze, Italy
Venue:
Proceedings of the 1st ACM/IEEE-CS joint conference on Digital libraries
Year:
2001

Citing 21
Cited 8


Abstract

Text categorization is typically formulated as a concept learning problem where each instance is a single isolated document. In this paper we are interested in a more general formulation where documents are organized as page sequences, as naturally occurs in digital libraries of scanned books and magazines. We describe a method for classifying the pages of sequential OCR text documents into one of several assigned categories, and suggest that taking into account the contextual information provided by the whole page sequence can significantly improve classification accuracy. The proposed architecture relies on hidden Markov models whose emissions are bags of words generated according to a multinomial word event model, as in the generative portion of the Naive Bayes classifier. Our results on a collection of scanned journals from the Making of America project confirm the importance of using whole page sequences. Empirical evaluation indicates that the error rate obtained by running a plain Naive Bayes classifier on isolated pages can be roughly halved if contextual information is incorporated.
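
The architecture described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes fully labeled training sequences, Laplace smoothing with a hypothetical parameter `alpha`, Viterbi decoding, and a toy corpus with two invented categories ("article" and "ads"). The function names (`train`, `page_loglik`, `classify_isolated`, `viterbi`) are likewise made up for illustration. Hidden states are page categories, and each state emits a page as a bag of words under a per-category multinomial, as in Naive Bayes.

```python
import math
from collections import Counter, defaultdict

def train(docs, alpha=1.0):
    """Estimate HMM parameters from labeled page sequences.
    docs: list of documents, each a non-empty list of (words, label) pages.
    alpha: Laplace smoothing constant (an assumption of this sketch)."""
    labels = sorted({lab for doc in docs for _, lab in doc})
    vocab = sorted({w for doc in docs for words, _ in doc for w in words})
    start, trans, counts = Counter(), defaultdict(Counter), defaultdict(Counter)
    for doc in docs:
        start[doc[0][1]] += 1
        for i, (words, lab) in enumerate(doc):
            counts[lab].update(words)            # bag-of-words counts per category
            if i > 0:
                trans[doc[i - 1][1]][lab] += 1   # category-to-category transitions
    k = len(labels)
    log_start = {c: math.log((start[c] + alpha) / (len(docs) + alpha * k))
                 for c in labels}
    log_trans = {a: {b: math.log((trans[a][b] + alpha) /
                                 (sum(trans[a].values()) + alpha * k))
                     for b in labels} for a in labels}
    log_emit = {}
    for c in labels:
        total = sum(counts[c].values()) + alpha * len(vocab)
        log_emit[c] = {w: math.log((counts[c][w] + alpha) / total) for w in vocab}
    return labels, log_start, log_trans, log_emit

def page_loglik(words, log_emit_c):
    """Multinomial log-likelihood of a page; out-of-vocabulary words are skipped."""
    return sum(log_emit_c[w] for w in words if w in log_emit_c)

def classify_isolated(pages, model):
    """Baseline: label each page on its own, with a uniform category prior."""
    labels, _, _, log_emit = model
    return [max(labels, key=lambda c: page_loglik(p, log_emit[c])) for p in pages]

def viterbi(pages, model):
    """Most likely category sequence for a whole page sequence."""
    labels, log_start, log_trans, log_emit = model
    score = {c: log_start[c] + page_loglik(pages[0], log_emit[c]) for c in labels}
    back = []
    for words in pages[1:]:
        new_score, ptr = {}, {}
        for c in labels:
            prev = max(labels, key=lambda p: score[p] + log_trans[p][c])
            ptr[c] = prev
            new_score[c] = (score[prev] + log_trans[prev][c]
                            + page_loglik(words, log_emit[c]))
        score = new_score
        back.append(ptr)
    path = [max(labels, key=lambda c: score[c])]
    for ptr in reversed(back):                   # follow back-pointers to the start
        path.append(ptr[path[-1]])
    return path[::-1]

# Toy training data (invented for illustration only).
docs = [
    [(["war", "army", "general"], "article"),
     (["army", "battle"], "article"),
     (["soap", "sale", "price"], "ads")],
    [(["battle", "general", "war"], "article"),
     (["price", "sale"], "ads"),
     (["soap", "price"], "ads")],
]
model = train(docs)
pages = [["war", "battle"], ["sale"], ["soap", "price"]]
print(viterbi(pages, model))            # ['article', 'ads', 'ads']
print(classify_isolated(pages, model))  # ['article', 'ads', 'ads']
```

On this tiny example both decoders agree because every page contains strongly indicative words. The transition term in `viterbi` matters when a page's own words are ambiguous or corrupted (for instance by OCR errors), since the neighbouring pages then pull the decision toward a consistent sequence.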