This paper reports on the INRIA group’s approach to XML mining in the INEX 2005 XML Mining track. We use a flexible representation of XML documents that can take into account either the structure alone or both the structure and the content. The approach represents each XML document by a set of its sub-paths, selected according to criteria such as length, whether the path starts at the root, and whether it ends at a leaf. By treating those sub-paths as words, we can apply standard vocabulary-reduction methods and simple clustering methods such as k-means. We use an implementation of the clustering algorithm known as dynamic clouds, which can work with distinct groups of independent modalities placed in separate variables. This is useful in our model because embedded sub-paths are not independent: we split potentially dependent paths into separate variables, so that each variable contains only independent paths. Experiments with the INEX collections show good results for the structure-only collections, but the approach did not scale well to large structure-and-content collections.
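The sub-path representation described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names and the parameters `min_len`, `max_len`, `from_root`, and `to_leaf` are hypothetical, chosen only to mirror the three criteria the abstract names (length, root beginning, leaf ending). It covers structure only (tag names, no text content) and stops at the bag-of-sub-paths stage, before any vocabulary reduction or clustering.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def root_to_node_paths(root):
    """Yield (tag_path, is_leaf) for every element, where tag_path is the
    tuple of tags from the document root down to that element."""
    stack = [(root, (root.tag,))]
    while stack:
        node, path = stack.pop()
        yield path, len(node) == 0  # an element with no children is a leaf
        for child in node:
            stack.append((child, path + (child.tag,)))


def sub_paths(xml_text, min_len=1, max_len=3, from_root=False, to_leaf=False):
    """Count the tag sub-paths of a document (hypothetical helper).

    Each sub-path is a contiguous run of tags ending at some element.
    from_root keeps only sub-paths that start at the root;
    to_leaf keeps only sub-paths that end at a leaf element.
    """
    bag = Counter()
    for path, is_leaf in root_to_node_paths(ET.fromstring(xml_text)):
        if to_leaf and not is_leaf:
            continue
        n = len(path)
        for length in range(min_len, min(max_len, n) + 1):
            start = n - length
            if from_root and start != 0:
                continue
            bag["/".join(path[start:])] += 1
    return bag


if __name__ == "__main__":
    doc = "<article><title>t</title><body><sec><p>x</p><p>y</p></sec></body></article>"
    # Sub-paths of length 1 or 2, treated as "words" of the document.
    print(sub_paths(doc, min_len=1, max_len=2))
    # Only complete root-to-leaf paths.
    print(sub_paths(doc, min_len=1, max_len=4, from_root=True, to_leaf=True))
```

Each resulting counter plays the role of a bag of words, so a collection of such counters could be fed to any standard term-weighting and clustering pipeline. How the sub-paths are then grouped into separate variables for the dynamic clouds algorithm is not specified in the abstract and is left out of this sketch.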