Exploiting structural information for semi-structured document categorization

Authors:
Andrej Bratko; Bogdan Filipič
Affiliations:
Klika, informacijske tehnologije d.o.o., Stegne 21c, SI-1000 Ljubljana, Slovenia and Department of Intelligent Systems, Jozef Stefan Institute, Jamova 39, SI-1000 Ljubljana, Slovenia; Department of Intelligent Systems, Jozef Stefan Institute, Jamova 39, SI-1000 Ljubljana, Slovenia
Venue:
Information Processing and Management: an International Journal
Year:
2006

Abstract

This paper examines several different approaches to exploiting structural information in semi-structured document categorization. The methods under consideration are designed for categorization of documents consisting of a collection of fields, or arbitrary tree-structured documents that can be adequately modeled with such a flat structure. The approaches range from trivial modifications of text modeling to more elaborate schemes, specifically tailored to structured documents. We combine these methods with three different text classification algorithms and evaluate their performance on four standard datasets containing different types of semi-structured documents. The best results were obtained with stacking, an approach in which predictions based on different structural components are combined by a meta classifier. A further improvement of this method is achieved by including the flat text model in the final prediction.
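
As a rough illustration of the stacking idea described in the abstract, the sketch below trains one base classifier per document field, optionally adds a base classifier for the flat (all fields concatenated) text, and feeds their predictions to a meta classifier. Everything concrete here is an assumption made for illustration and is not taken from the paper: the tiny naive Bayes base learner, the logistic regression meta classifier, the use of leave-one-out predictions as meta-level training data, the function names, and the invented two-field email examples.

```python
import math
from collections import Counter

FLAT = "__flat__"  # hypothetical name for the "all fields concatenated" view


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


class NaiveBayes:
    """Binary multinomial naive Bayes with add-one smoothing (labels 0/1)."""

    def fit(self, texts, labels):
        self.n = [labels.count(0), labels.count(1)]
        self.counts = [Counter(), Counter()]
        for text, y in zip(texts, labels):
            self.counts[y].update(text.lower().split())
        vocab = set(self.counts[0]) | set(self.counts[1])
        # Smoothed denominators; the extra +1 reserves mass for unseen words.
        self.denom = [sum(c.values()) + len(vocab) + 1 for c in self.counts]
        return self

    def prob_pos(self, text):
        log_odds = math.log((self.n[1] + 1) / (self.n[0] + 1))
        for w in text.lower().split():
            p0, p1 = ((self.counts[y][w] + 1) / self.denom[y] for y in (0, 1))
            log_odds += math.log(p1 / p0)
        return sigmoid(log_odds)


def view(doc, name):
    """Text of one field, or of all fields joined for the flat view."""
    return " ".join(doc.values()) if name == FLAT else doc.get(name, "")


def fit_base(docs, labels, views):
    return {v: NaiveBayes().fit([view(d, v) for d in docs], labels) for v in views}


def meta_features(base, doc):
    # One feature per view: that view's estimated probability of class 1.
    return [model.prob_pos(view(doc, v)) for v, model in base.items()]


def fit_logreg(X, y, epochs=500, lr=0.5):
    """Plain SGD logistic regression, used here as the meta classifier."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = yi - sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
            w = [wj + lr * err * xj for wj, xj in zip(w, xi)]
            b += lr * err
    return w, b


def fit_stacked(docs, labels, fields, include_flat=True):
    views = list(fields) + ([FLAT] if include_flat else [])
    # Meta-level training data comes from leave-one-out predictions, so the
    # meta classifier never sees a base prediction for a document that the
    # base model was trained on.
    X = []
    for i in range(len(docs)):
        base = fit_base(docs[:i] + docs[i + 1:], labels[:i] + labels[i + 1:], views)
        X.append(meta_features(base, docs[i]))
    return fit_base(docs, labels, views), fit_logreg(X, labels)


def predict(model, doc):
    base, (w, b) = model
    z = b + sum(wj * xj for wj, xj in zip(w, meta_features(base, doc)))
    return 1 if z > 0 else 0


# Invented toy data: label 1 = spam, label 0 = legitimate mail.
DOCS = [
    {"subject": "cheap pills offer", "body": "buy cheap pills online now"},
    {"subject": "cheap offer today", "body": "buy pills online cheap offer"},
    {"subject": "win prize offer", "body": "win cheap prize buy now"},
    {"subject": "prize offer today", "body": "buy online win prize now"},
    {"subject": "meeting agenda update", "body": "please review agenda before meeting"},
    {"subject": "project meeting notes", "body": "notes from project meeting attached"},
    {"subject": "project report update", "body": "please review attached project report"},
    {"subject": "agenda notes report", "body": "review report before project meeting"},
]
LABELS = [1, 1, 1, 1, 0, 0, 0, 0]

model = fit_stacked(DOCS, LABELS, fields=["subject", "body"])
print(predict(model, {"subject": "cheap prize offer", "body": "buy cheap pills now"}))
print(predict(model, {"subject": "project update", "body": "please review the meeting notes"}))
```

Calling `fit_stacked(..., include_flat=False)` drops the flat-text view, which gives a simple way to compare the two variants the abstract mentions. On real data the leave-one-out loop would normally be replaced by k-fold cross-validation to keep training time manageable.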