It has been shown that storing documents with similar structures together can reduce fragmentation and improve query efficiency. Unlike a flat text document, a Web document has no standard vectorial representation, which most existing classification algorithms require. In this paper, we propose a vectorization method for XML documents based on multidimensional scaling (MDS), so that Web documents can be fed into an existing classification algorithm. Classical MDS embeds data points into a Euclidean space if the similarity matrix constructed from the data points is semidefinite. The semidefiniteness condition, however, may not hold due to the inference technique used in practice. We therefore find the semidefinite matrix that is closest to the distance matrix in the Euclidean space. Based on recent developments in strongly semismooth matrix-valued functions, we solve the nearest semidefinite matrix problem with a Newton-type method. Experimental studies show that the classification accuracy can be improved.
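The embedding step described above can be illustrated with a minimal sketch. It assumes a precomputed pairwise distance matrix between documents and uses NumPy; the function name `classical_mds` and its parameters are hypothetical and not taken from the paper. In place of the Newton-type method, the sketch substitutes a much simpler technique: negative eigenvalues of the doubly centered matrix are clipped to zero, which gives the nearest semidefinite matrix in the Frobenius norm but ignores any further constraints the paper's formulation may impose.

```python
import numpy as np


def classical_mds(dist, dim=2):
    """Embed n items into `dim` dimensions from an n x n distance matrix.

    Hypothetical helper for illustration only; not the paper's method.
    """
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]

    # Double centering: B = -1/2 * J * D^2 * J, with J = I - (1/n) * 1 1^T.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    gram = (gram + gram.T) / 2.0  # remove floating-point asymmetry

    # Eigendecomposition of the symmetric matrix (eigenvalues ascending).
    eigvals, eigvecs = np.linalg.eigh(gram)

    # Assumption: eigenvalue clipping stands in for the Newton-type solver.
    # It yields the Frobenius-nearest semidefinite matrix to `gram`.
    eigvals = np.clip(eigvals, 0.0, None)

    # Keep the `dim` largest eigenpairs and scale to get coordinates.
    order = np.argsort(eigvals)[::-1][:dim]
    return eigvecs[:, order] * np.sqrt(eigvals[order])


def pairwise_distances(points):
    """Euclidean distance matrix between the rows of `points`."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


# Usage: distances that come from real planar points are reproduced exactly
# (up to rounding), while a non-Euclidean matrix still yields coordinates.
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
euclidean = pairwise_distances(points)
embedded = classical_mds(euclidean, dim=2)

non_euclidean = np.array([[0.0, 1.0, 5.0],
                          [1.0, 0.0, 1.0],
                          [5.0, 1.0, 0.0]])  # violates the triangle inequality
repaired = classical_mds(non_euclidean, dim=2)
```

The resulting rows of `embedded` or `repaired` are ordinary feature vectors, so they could be passed to any vector-based classifier. Clipping is the cheapest possible repair; it was chosen here only to keep the sketch short.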