A flexible structured-based representation for XML document mining

Authors:
Anne-Marie Vercoustre; Mounir Fegas; Saba Gul; Yves Lechevallier
Affiliations:
INRIA, Rocquencourt, France; INRIA, Rocquencourt, France; INRIA, Rocquencourt, France; INRIA, Rocquencourt, France
Venue:
INEX'05 Proceedings of the 4th international conference on Initiative for the Evaluation of XML Retrieval
Year:
2005

Abstract

This paper reports on the INRIA group’s approach to XML mining in the INEX 2005 XML Mining track. We use a flexible representation of XML documents that can take into account either the structure alone or both structure and content. Each XML document is represented by a set of its sub-paths, selected according to criteria such as length, whether the path starts at the root, and whether it ends at a leaf. By treating these sub-paths as words, we can apply standard vocabulary-reduction methods and simple clustering methods such as k-means. We use an implementation of the clustering algorithm known as dynamic clouds that can work with distinct groups of independent modalities placed in separate variables. This is useful in our model because embedded sub-paths are not independent: we split potentially dependent paths into separate variables, so that each variable contains only independent paths. Experiments with the INEX collections show good results for the structure-only collections, but our approach did not scale well to the large structure-and-content collections.
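
The sub-path representation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `sub_paths` and `split_by_length`, the parameters `min_len`, `max_len`, `from_root` and `to_leaf`, and the sample document are all hypothetical. In particular, grouping sub-paths by length is only one assumed way of keeping embedded paths out of the same variable; the abstract does not state the splitting rule actually used. Vocabulary reduction and the clustering step are left out.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def tag_paths(root):
    """Yield (path, ends_at_leaf) for the root-to-node tag path of every element."""
    stack = [(root, (root.tag,))]
    while stack:
        node, path = stack.pop()
        children = list(node)
        yield path, not children
        for child in children:
            stack.append((child, path + (child.tag,)))


def sub_paths(xml_text, min_len=1, max_len=3, from_root=False, to_leaf=False):
    """Count the contiguous tag sub-paths of one document (structure only).

    from_root keeps only sub-paths that start at the document root;
    to_leaf keeps only sub-paths that end at a leaf element.
    Sub-paths are stored as tuples of tag names.
    """
    counts = Counter()
    for path, is_leaf in tag_paths(ET.fromstring(xml_text)):
        if to_leaf and not is_leaf:
            continue
        # Every sub-path ending at this node is a suffix of its root path.
        for length in range(min_len, min(max_len, len(path)) + 1):
            start = len(path) - length
            if from_root and start != 0:
                continue
            counts[path[start:]] += 1
    return counts


def split_by_length(counts):
    """Assumed splitting rule: one variable (Counter) per sub-path length.

    Two distinct paths of equal length cannot be embedded in one another,
    so each group holds paths that are independent in that sense.
    """
    groups = defaultdict(Counter)
    for path, n in counts.items():
        groups[len(path)][path] = n
    return dict(groups)


# Hypothetical sample document, used only for illustration.
doc = "<article><fm><ti>T</ti></fm><bdy><sec><p>a</p><p>b</p></sec></bdy></article>"
counts = sub_paths(doc, min_len=2, max_len=3)
for length, group in sorted(split_by_length(counts).items()):
    for path, n in sorted(group.items()):
        print(length, "/".join(path), n)
```

Each resulting counter can be treated like a bag of words, so one document becomes a sparse vector of sub-path frequencies per variable, which is the form a k-means style algorithm would take as input.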