A semi-structured document model for text mining

Authors:
Yang Jianwu;Chen Xiaoou
Affiliations:
National Key Laboratory for Text Processing, Institute of Computer Science and Technology, Peking University, Beijing 100871, P.R. China;National Key Laboratory for Text Processing, Institute of Computer Science and Technology, Peking University, Beijing 100871, P.R. China
Venue:
Journal of Computer Science and Technology
Year:
2002

Abstract

A semi-structured document carries more structural information than an ordinary document, and the relations among semi-structured documents can also be exploited. To take advantage of the structure and link information in semi-structured documents for better mining, this paper presents a structured link vector model (SLVM), in which each document is represented by a vector whose elements are determined by terms, document structure, and neighboring documents. For brevity and clarity, text mining based on SLVM is described through the two steps of the K-means procedure: calculating document similarity and calculating the cluster center. In the experiments, clustering based on SLVM performs significantly better than clustering based on a conventional vector space model, raising the F value from 0.65-0.73 to 0.82-0.86.
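The abstract does not give the SLVM formulas, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a document vector built from term counts weighted by structural element, plus a damped average of the vectors of linked documents, and then shows the two K-means ingredients the abstract names: document similarity (cosine here) and the cluster center (mean vector). The names `ELEMENT_WEIGHTS`, `LINK_WEIGHT`, `structured_vector`, and `slvm_like_vector`, as well as the weight values and the toy documents, are hypothetical.

```python
import math
from collections import Counter

# Hypothetical weights; the abstract does not specify any values.
ELEMENT_WEIGHTS = {"title": 2.0, "body": 1.0}
LINK_WEIGHT = 0.5  # assumed share contributed by neighboring documents


def structured_vector(doc):
    """Term counts of one document, weighted by the element they occur in."""
    vec = Counter()
    for element, text in doc["elements"].items():
        weight = ELEMENT_WEIGHTS.get(element, 1.0)
        for term in text.lower().split():
            vec[term] += weight
    return vec


def slvm_like_vector(doc_id, docs):
    """Own structured vector plus a damped average of linked neighbors' vectors."""
    vec = Counter(structured_vector(docs[doc_id]))
    neighbors = docs[doc_id].get("links", [])
    for neighbor_id in neighbors:
        for term, value in structured_vector(docs[neighbor_id]).items():
            vec[term] += LINK_WEIGHT * value / len(neighbors)
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Cluster center as the component-wise mean of the member vectors."""
    total = Counter()
    for v in vectors:
        for term, value in v.items():
            total[term] += value / len(vectors)
    return total


# Toy collection (hypothetical): "a" links to "b"; "c" is unrelated.
docs = {
    "a": {"elements": {"title": "xml mining", "body": "clustering xml documents"},
          "links": ["b"]},
    "b": {"elements": {"title": "text clustering", "body": "k-means clustering"},
          "links": []},
    "c": {"elements": {"title": "cooking", "body": "pasta recipes"},
          "links": []},
}
vectors = {doc_id: slvm_like_vector(doc_id, docs) for doc_id in docs}
sim_ab = cosine(vectors["a"], vectors["b"])
sim_ac = cosine(vectors["a"], vectors["c"])
center_ab = centroid([vectors["a"], vectors["b"]])
```

In this toy setup the linked pair "a" and "b" ends up more similar than the unrelated pair "a" and "c". A full K-means loop would alternate between assigning each document to its most similar center and recomputing each center with `centroid`.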