Semi-structured document categorization with a semantic kernel

Authors:
Sujeevan Aseervatham;Younès Bennani
Affiliations:
LIPN - UMR 7030, CNRS, Université Paris 13, 99, Av. J.B. Clément, F-93430 Villetaneuse, France;LIPN - UMR 7030, CNRS, Université Paris 13, 99, Av. J.B. Clément, F-93430 Villetaneuse, France
Venue:
Pattern Recognition
Year:
2009

Abstract

Over the past decade, text categorization has become an active field of research in the machine learning community. Most approaches are based on term occurrence frequency. The performance of such surface-based methods can degrade when the texts are complex, i.e., ambiguous. One alternative is to use semantic-based approaches, which process textual documents according to their meaning. Furthermore, research in text categorization has mainly focused on "flat texts", whereas many documents are now semi-structured, particularly in XML format. In this paper, we propose a semantic kernel for semi-structured biomedical documents. The semantic meanings of words are extracted using the Unified Medical Language System (UMLS) framework. The kernel, combined with an SVM classifier, has been applied to a text categorization task on a medical corpus of free-text documents. The results show that the semantic kernel outperforms the linear kernel and the naive Bayes classifier. Moreover, this kernel ranked in the top 10 of 44 classification methods at the 2007 Computational Medicine Center (CMC) Medical NLP International Challenge.
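To make the idea of a semantic kernel over semi-structured documents concrete, here is a minimal sketch in Python. It is not the kernel defined in the paper, whose details the abstract does not give. It assumes a generic construction: words are mapped to concept identifiers, each document field becomes a bag of concepts, fields are compared with a cosine-normalized dot product, and the per-field values are summed. The word-to-concept table, the concept identifiers, and the field names are all hypothetical stand-ins for what a UMLS lookup and an XML parser would provide.

```python
from collections import Counter
from math import sqrt

# Hypothetical word -> concept table (assumption). A real system would obtain
# these annotations from UMLS; the identifiers below are made up.
WORD_TO_CONCEPTS = {
    "cough": ["C_COUGH", "C_RESPIRATORY_SIGN"],
    "wheezing": ["C_WHEEZE", "C_RESPIRATORY_SIGN"],
    "fever": ["C_FEVER"],
    "fracture": ["C_FRACTURE"],
}


def concept_vector(text):
    """Map a text to a bag of concepts (concept -> count)."""
    counts = Counter()
    for word in text.lower().split():
        for concept in WORD_TO_CONCEPTS.get(word, []):
            counts[concept] += 1
    return counts


def dot(u, v):
    """Dot product of two sparse concept vectors."""
    return sum(u[c] * v[c] for c in u if c in v)


def semantic_kernel(text_a, text_b):
    """Cosine-normalized dot product in concept space (0.0 if either is empty)."""
    u, v = concept_vector(text_a), concept_vector(text_b)
    norm = sqrt(dot(u, u) * dot(v, v))
    return dot(u, v) / norm if norm else 0.0


def field_kernel(doc_a, doc_b):
    """Sum the per-field kernels over the fields both documents contain."""
    shared = doc_a.keys() & doc_b.keys()
    return sum(semantic_kernel(doc_a[f], doc_b[f]) for f in shared)


# The two words share no surface form but share one of two concepts each.
print(semantic_kernel("cough", "wheezing"))  # 0.5

# Hypothetical field names for two semi-structured documents.
doc_a = {"history": "cough fever", "impression": "wheezing"}
doc_b = {"history": "cough", "impression": "fracture"}
print(field_kernel(doc_a, doc_b))
```

A term-frequency (linear) kernel would score "cough" against "wheezing" as 0, whereas the concept-level comparison gives a nonzero similarity. Because each per-field value is an inner product of normalized explicit feature vectors and the fields are summed, the result is a valid kernel, so a Gram matrix built with `field_kernel` could be handed to an SVM implementation that accepts precomputed kernels.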