A bottom-up approach for XML documents classification

Authors:
Junwei Wu;Jian Tang
Affiliations:
Memorial University of Newfoundland, St. John's, Canada;Memorial University of Newfoundland, St. John's, Canada
Venue:
IDEAS '08 Proceedings of the 2008 international symposium on Database engineering & applications
Year:
2008

Citing 10

A classifier for semi-structured documents
Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining

Machine learning in automated text categorization
ACM Computing Surveys (CSUR)

Web classification using support vector machine
Proceedings of the 4th international workshop on Web information and data management

Email classification with co-training
CASCON '01 Proceedings of the 2001 conference of the Centre for Advanced Studies on Collaborative research

XRules: an effective structural classifier for XML data
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining

Web-page classification through summarization
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval

Bayesian network model for semi-structured document classification
Information Processing and Management: an International Journal - Special issue: Bayesian networks and information retrieval

Feature selection methods for text classification
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining

A belief networks-based generative model for structured documents: an application to the XML categorization
MLDM'03 Proceedings of the 3rd international conference on Machine learning and data mining in pattern recognition

Classification of XSLT-Generated web documents with support vector machines
KDXD'06 Proceedings of the First international conference on Knowledge Discovery from XML Documents

Cited 1

X-Class: Associative Classification of XML Documents by Structure
ACM Transactions on Information Systems (TOIS)

Abstract

Extensible Markup Language (XML) is a simple and flexible text format derived from SGML [1]. It has been widely accepted as a crucial component of many information retrieval applications, such as XML databases and web services. One reason for its wide acceptance is that its format can be customized for data transmission and data storage. Classification is an important data mining task that aims to assign unknown objects to the classes that best characterize them. In this paper, we propose a method to classify XML documents under the assumption that they do not share a common schema and that a schema may or may not be available. Our method is similarity-based. Its main characteristic is the way it handles the roles played by the text and the structural information. Unlike most existing methods, we use a bottom-up approach: we start from the text and then embed the structural information. This is based on the observation that, in XML documents with diversified tag structures, the most informative content is carried by the terms in the text. Our experiments show that this strategy achieves better performance than existing methods on documents from sources that exhibit heterogeneous structures.
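
To make the bottom-up idea concrete, the sketch below illustrates one possible reading of it; it is not the authors' algorithm, which the abstract does not specify. It assumes a bag-of-words representation, a hypothetical `structure_weight` parameter that adds tag-qualified terms on top of the plain terms, cosine similarity, and a 1-nearest-neighbour decision. All function names and the toy documents are invented for illustration.

```python
import math
import re
import xml.etree.ElementTree as ET
from collections import Counter


def text_terms(elem):
    """Yield (term, enclosing_tag) pairs for every word found in the element tree."""
    for word in re.findall(r"[a-z]+", (elem.text or "").lower()):
        yield word, elem.tag
    for child in elem:
        yield from text_terms(child)
        # Text that follows a child element belongs to the current element.
        for word in re.findall(r"[a-z]+", (child.tail or "").lower()):
            yield word, elem.tag


def features(xml_string, structure_weight=0.5):
    """Start from the text terms, then embed structure as tag-qualified terms.

    structure_weight is an assumed tuning knob, not a value from the paper.
    """
    vec = Counter()
    for term, tag in text_terms(ET.fromstring(xml_string)):
        vec[term] += 1.0                          # text evidence comes first
        vec[f"{tag}:{term}"] += structure_weight  # structural evidence is added on top
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def classify(xml_string, labelled_docs):
    """Return the label of the most similar labelled document (1-nearest neighbour)."""
    query = features(xml_string)
    return max(labelled_docs, key=lambda pair: cosine(query, features(pair[0])))[1]


# Toy documents with different tag structures and no shared schema.
TRAINING = [
    ("<article><title>neural network training</title>"
     "<body>gradient descent optimizer</body></article>", "ml"),
    ("<recipe><name>tomato soup</name>"
     "<steps>boil tomato and add salt</steps></recipe>", "cooking"),
]
QUERY = "<post><heading>soup</heading><content>add salt to the tomato</content></post>"

predicted = classify(QUERY, TRAINING)
```

Because the query shares no tag names with either training document, only the plain text terms contribute to the similarity in this toy case, which mirrors the abstract's observation that terms carry most of the signal when tag structures are heterogeneous.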