Hidden Markov Models for Text Categorization in Multi-Page Documents

  • Authors:
  • Paolo Frasconi, Giovanni Soda, Alessandro Vullo

  • Affiliations:
  • Department of Systems and Computer Science, University of Florence, Firenze I-50139, Italy. paolo@dsi.unifi.it, giovanni@dsi.unifi.it, vullo@dsi.unifi.it

  • Venue:
  • Journal of Intelligent Information Systems
  • Year:
  • 2002

Abstract

In the traditional setting, text categorization is formulated as a concept learning problem where each instance is a single, isolated document. However, this perspective is not appropriate for many digital libraries whose contents are scanned and optically recognized books or magazines. In this paper, we propose a more general formulation of text categorization that allows documents to be organized as sequences of pages. We introduce a novel hybrid system specifically designed for multi-page text documents. The architecture relies on hidden Markov models whose emissions are bags of words generated by a multinomial word event model, as in the generative portion of the Naive Bayes classifier. The rationale behind our proposal is that taking into account the contextual information provided by the whole page sequence helps disambiguate page content and improves single-page classification accuracy. Our results on two datasets of scanned journals from the Making of America collection confirm the importance of using whole page sequences. The empirical evaluation indicates that the error rate obtained by running the Naive Bayes classifier on isolated pages can be significantly reduced when contextual information is incorporated.
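
The following is a minimal sketch (not the authors' implementation) of the idea summarized in the abstract: hidden states correspond to page categories, each page's bag of words is emitted by a state-specific multinomial as in the generative part of Naive Bayes, and Viterbi decoding labels all pages of a document jointly so that neighbouring pages provide contextual disambiguation. All names, the toy vocabulary, and the parameter values are illustrative assumptions.

import numpy as np

def multinomial_log_emission(counts, log_theta):
    """Log P(bag of words | state), dropping the state-independent multinomial coefficient."""
    # counts: (V,) word counts for one page; log_theta: (K, V) per-state word log-probabilities
    return log_theta @ counts

def viterbi_page_labels(pages, log_pi, log_A, log_theta):
    """Most likely category sequence for the pages of one multi-page document.

    pages:     (T, V) array of per-page word counts
    log_pi:    (K,)   log initial state probabilities
    log_A:     (K, K) log transition probabilities between page categories
    log_theta: (K, V) log multinomial word probabilities per category
    """
    T, K = len(pages), len(log_pi)
    delta = np.empty((T, K))            # best log-score of any path ending in state k at page t
    backptr = np.zeros((T, K), dtype=int)

    delta[0] = log_pi + multinomial_log_emission(pages[0], log_theta)
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A       # rows: previous state, columns: current state
        backptr[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + multinomial_log_emission(pages[t], log_theta)

    # Backtrack the best path of page categories.
    labels = np.empty(T, dtype=int)
    labels[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        labels[t] = backptr[t + 1, labels[t + 1]]
    return labels

if __name__ == "__main__":
    # Toy example: 2 hypothetical page categories over a 4-word vocabulary.
    rng = np.random.default_rng(0)
    log_pi = np.log([0.6, 0.4])
    log_A = np.log([[0.8, 0.2],
                    [0.3, 0.7]])                      # consecutive pages tend to share a category
    log_theta = np.log([[0.4, 0.4, 0.1, 0.1],         # word distribution of category 0
                        [0.1, 0.1, 0.4, 0.4]])        # word distribution of category 1
    pages = rng.multinomial(50, [0.35, 0.35, 0.15, 0.15], size=5)   # 5 pages of word counts
    print(viterbi_page_labels(pages, log_pi, log_A, log_theta))

Classifying each page in isolation would use only the emission term; the transition matrix is what lets the surrounding pages pull an ambiguous page toward the category of its context.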