Bayesian network model for semi-structured document classification

Authors:
Ludovic Denoyer;Patrick Gallinari
Affiliations:
Laboratoire d'Informatique de Paris VI, LIP6, 8 rue du Capitaine Scott, 75015 Paris, France;Laboratoire d'Informatique de Paris VI, LIP6, 8 rue du Capitaine Scott, 75015 Paris, France
Venue:
Information Processing and Management: an International Journal - Special issue: Bayesian networks and information retrieval
Year:
2004

Abstract

Recently, a new community has started to emerge around the development of new information retrieval methods for searching and analyzing semi-structured and XML-like documents. The goal is to handle both content and structural information, and to deal with different types of content (text, images, etc.). We consider here the task of structured document classification. We propose a generative model, based on Bayesian networks, that handles both structure and content. We then show how to transform this generative model into a discriminant classifier using the Fisher kernel method. The model is then extended to deal with different types of content information (here, text and images). The model was tested on three databases: the classical WebKB corpus composed of HTML pages, the new INEX corpus, which has become a reference in the field of ad hoc retrieval for XML documents, and a multimedia corpus of Web pages.