Exploiting structural information for semi-structured document categorization

Authors:
Andrej Bratko;Bogdan Filipič
Affiliations:
Klika, informacijske tehnologije d.o.o., Stegne, Ljubljana, Slovenia and Department of Intelligent Systems, Jozef Stefan Institute, Jamova, Ljubljana, Slovenia;Department of Intelligent Systems, Jozef Stefan Institute, Jamova, Ljubljana, Slovenia
Venue:
Information Processing and Management: an International Journal
Year:
2006

Abstract

This paper examines several approaches to exploiting structural information in semi-structured document categorization. The methods under consideration are designed for categorizing documents that consist of a collection of fields, or arbitrary tree-structured documents that can be adequately modeled with such a flat structure. The approaches range from trivial modifications of text modeling to more elaborate schemes specifically tailored to structured documents. We combine these methods with three different text classification algorithms and evaluate their performance on four standard datasets containing different types of semi-structured documents. The best results were obtained with stacking, an approach in which predictions based on different structural components are combined by a meta classifier. This method is improved further by including the flat text model in the final prediction.
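
The stacking scheme described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the base learner (a small multinomial naive Bayes), the meta classifier (logistic regression trained by gradient descent), the use of leave-one-out predictions to build the meta-level training set, the field names `subject` and `body`, and the toy documents. The only ideas carried over from the abstract are that one classifier is trained per structural component, that a meta classifier combines their predictions, and that the flat text model is included as an additional input.

```python
import math
from collections import Counter

def train_nb(texts, labels):
    """Fit a binary multinomial naive Bayes model with Laplace smoothing."""
    counts = {0: Counter(), 1: Counter()}
    for text, y in zip(texts, labels):
        counts[y].update(text.split())
    vocab_size = len(set(counts[0]) | set(counts[1])) + 1  # +1 slot for unseen tokens
    totals = {y: sum(counts[y].values()) for y in (0, 1)}
    prior = math.log((labels.count(1) + 1) / (labels.count(0) + 1))
    return counts, totals, vocab_size, prior

def nb_score(model, text):
    """Log-odds of class 1 versus class 0 for one piece of text."""
    counts, totals, v, score = model
    for tok in text.split():
        p1 = (counts[1][tok] + 1) / (totals[1] + v)
        p0 = (counts[0][tok] + 1) / (totals[0] + v)
        score += math.log(p1 / p0)
    return score

def views(doc, fields):
    """Text of each field, followed by the flat text of the whole document."""
    parts = [doc.get(f, "") for f in fields]
    return parts + [" ".join(parts)]

def train_base(docs, labels, fields):
    """One base classifier per view (each field, plus the flat text)."""
    per_doc = [views(d, fields) for d in docs]
    return [train_nb([v[i] for v in per_doc], labels) for i in range(len(fields) + 1)]

def base_scores(models, doc, fields):
    """Level-0 predictions for one document: one score per view."""
    return [nb_score(m, t) for m, t in zip(models, views(doc, fields))]

def meta_rows(docs, labels, fields):
    """Leave-one-out level-0 scores: no row comes from a model that saw its own document."""
    rows = []
    for i, doc in enumerate(docs):
        models = train_base(docs[:i] + docs[i + 1:], labels[:i] + labels[i + 1:], fields)
        rows.append(base_scores(models, doc, fields))
    return rows

def sigmoid(z):
    # Clamp the argument so math.exp cannot overflow.
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))

def train_meta(rows, labels, lr=0.1, epochs=500):
    """Logistic regression by stochastic gradient descent; the last weight is the bias."""
    w = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            err = y - sigmoid(w[-1] + sum(wi * xi for wi, xi in zip(w, x)))
            for j, xj in enumerate(x):
                w[j] += lr * err * xj
            w[-1] += lr * err
    return w

def fit_stacked(docs, labels, fields):
    """Train the meta classifier on held-out scores, then refit base models on all data."""
    w = train_meta(meta_rows(docs, labels, fields), labels)
    return train_base(docs, labels, fields), w

def predict_stacked(model, doc, fields):
    """Combine the per-view scores with the meta classifier; returns 0 or 1."""
    models, w = model
    x = base_scores(models, doc, fields)
    return int(w[-1] + sum(wi * xi for wi, xi in zip(w, x)) > 0)

# Hypothetical toy data: e-mail-like documents with two fields (1 = spam, 0 = not spam).
FIELDS = ["subject", "body"]
DOCS = [
    {"subject": "cheap pills offer", "body": "buy now limited offer"},
    {"subject": "win money now", "body": "claim your prize money"},
    {"subject": "free offer inside", "body": "cheap prize buy now"},
    {"subject": "meeting agenda", "body": "please review the project notes"},
    {"subject": "project update", "body": "the meeting is moved to monday"},
    {"subject": "lunch monday", "body": "notes from the project review"},
]
LABELS = [1, 1, 1, 0, 0, 0]

if __name__ == "__main__":
    stacked = fit_stacked(DOCS, LABELS, FIELDS)
    new_doc = {"subject": "cheap offer now", "body": "buy cheap prize now"}
    print(predict_stacked(stacked, new_doc, FIELDS))
```

In this sketch the meta classifier sees three inputs per document: the score from the `subject` model, the score from the `body` model, and the score from the flat text model. Training the meta level on held-out scores rather than on scores from models that already saw the document is the usual way to keep it from trusting overfitted base predictions; leave-one-out is used here only because the toy set is tiny, and k-fold cross-validation would be the more practical choice on real data.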