Semantic smoothing for text clustering

Authors:
Jamal A. Nasir;Iraklis Varlamis;Asim Karim;George Tsatsaronis
Venue:
Knowledge-Based Systems
Year:
2013

Citing 38

Using WordNet to disambiguate word senses for text retrieval

SIGIR '93 Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval
Experiments in multilingual information retrieval using the SPIDER system

SIGIR '96 Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval
Generalized vector spaces model in information retrieval

SIGIR '85 Proceedings of the 8th annual international ACM SIGIR conference on Research and development in information retrieval
Foundations of statistical natural language processing

Foundations of statistical natural language processing
A vector space model for automatic indexing

Communications of the ACM
Evaluation of hierarchical clustering algorithms for document datasets

Proceedings of the eleventh international conference on Information and knowledge management
Latent Semantic Kernels

Journal of Intelligent Information Systems
Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL

ECML '01 Proceedings of the 12th European Conference on Machine Learning
Latent Semantic Kernels

ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
Document clustering based on non-negative matrix factorization

Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval
Ontologies Improve Text Document Clustering

ICDM '03 Proceedings of the Third IEEE International Conference on Data Mining
Entity-based cross-document coreferencing using the Vector Space Model

COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 1
Learning similarity measures in non-orthogonal space

Proceedings of the thirteenth ACM international conference on Information and knowledge management
Hierarchical Clustering Algorithms for Document Datasets

Data Mining and Knowledge Discovery
Evaluating WordNet-based Measures of Lexical Semantic Relatedness

Computational Linguistics
Semantic Kernels for Text Classification Based on Topological Measures of Feature Similarity

ICDM '06 Proceedings of the Sixth International Conference on Data Mining
The phrase-based vector space model for automatic retrieval of free-text medical documents

Data & Knowledge Engineering
Statistical Comparisons of Classifiers over Multiple Data Sets

The Journal of Machine Learning Research
Introduction to Information Retrieval

Introduction to Information Retrieval
Word sense disambiguation: A survey

ACM Computing Surveys (CSUR)
Exploiting noun phrases and semantic relationships for text document clustering

Information Sciences: an International Journal
A comparison of extrinsic clustering evaluation metrics based on formal constraints

Information Retrieval
Exploiting Wikipedia as external knowledge for document clustering

Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
A generalized vector space model for text retrieval based on semantic relatedness

EACL '09 Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop
WordNet-based text document clustering

ROMAND '04 Proceedings of the 3rd Workshop on RObust Methods in Analysis of Natural Language Data
Semantic smoothing of document models for agglomerative clustering

IJCAI'07 Proceedings of the 20th international joint conference on Artificial intelligence
An extensive empirical study of collocation extraction methods

ACLstudent '05 Proceedings of the ACL Student Research Workshop
Document clustering using nonnegative matrix factorization

Information Processing and Management: an International Journal
Automatic evaluation of topic coherence

HLT '10 Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Text relatedness based on a word thesaurus

Journal of Artificial Intelligence Research
Knowledge-based vector space model for text clustering

Knowledge and Information Systems
Concept-Based Information Retrieval Using Explicit Semantic Analysis

ACM Transactions on Information Systems (TOIS)
Composite kernels for semi-supervised clustering

Knowledge and Information Systems
A knowledge-based semantic Kernel for text classification

SPIRE'11 Proceedings of the 18th international conference on String processing and information retrieval
On ontology-driven document clustering using core semantic features

Knowledge and Information Systems - Special Issue on "Context-Aware Data Mining (CADM)"
Word sense disambiguation for exploiting hierarchical thesauri in text classification

PKDD'05 Proceedings of the 9th European conference on Principles and Practice of Knowledge Discovery in Databases
Efficient semantic kernel-based text classification using matching pursuit KFDA

ICONIP'11 Proceedings of the 18th international conference on Neural Information Processing - Volume Part II
Combining vector space model and multi word term extraction for semantic query expansion

NLDB'07 Proceedings of the 12th international conference on Applications of Natural Language to Information Systems

Abstract

In this paper we present a new semantic smoothing vector space kernel (S-VSM) for text document clustering. In the proposed approach, semantic relatedness between words is used to smooth both the similarity and the representation of text documents. The basic hypothesis examined is that considering the semantic relatedness between two text documents may improve the performance of text document clustering. For our experimental evaluation we analyze the performance of several semantic relatedness measures when embedded in the proposed S-VSM and present results under different experimental conditions: (i) the datasets used, (ii) the underlying knowledge sources of the measures, and (iii) the clustering algorithms employed. To the best of our knowledge, the current study is the first to systematically compare, analyze and evaluate the impact of semantic smoothing in text clustering based on the 'wisdom of linguists', e.g., WordNets, the 'wisdom of crowds', e.g., Wikipedia, and the 'wisdom of corpora', e.g., large text corpora represented with the traditional Bag of Words (BoW) model. Three semantic relatedness measures for text are considered: two knowledge-based (Omiotis[1], which uses WordNet, and WLM[2], which uses Wikipedia) and one corpus-based (PMI[3], trained on a semantically tagged version of SemCor). To compare the different experimental conditions we use the BCubed F-Measure evaluation metric, which satisfies all formal constraints for a good clustering quality metric. The experimental results show that clustering performance based on the S-VSM is better than that of the traditional VSM model and compares favorably against the standard GVSM kernel, which uses word co-occurrences to compute the latent similarities between document terms.
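
The abstract does not give the S-VSM formula, so the following is only a minimal sketch of the general idea of semantic smoothing. It assumes a GVSM-style form in which each document vector is multiplied by a term-by-term relatedness matrix before similarities are computed. The function name smoothed_kernel, the three-word vocabulary and the relatedness values are hypothetical and chosen for illustration; they are not taken from the paper or from Omiotis, WLM or PMI.

```python
import numpy as np


def smoothed_kernel(docs, relatedness):
    """Cosine similarities between semantically smoothed document vectors.

    docs:        (n_docs, n_terms) bag-of-words matrix.
    relatedness: (n_terms, n_terms) term relatedness matrix (assumed form).
    """
    smoothed = docs @ relatedness          # smooth each document representation
    gram = smoothed @ smoothed.T           # inner products of smoothed documents
    norms = np.sqrt(np.diag(gram))
    norms[norms == 0] = 1.0                # avoid division by zero for empty documents
    return gram / np.outer(norms, norms)   # cosine normalisation


# Hypothetical vocabulary: ["car", "automobile", "banana"]
docs = np.array([
    [1.0, 0.0, 0.0],   # document containing only "car"
    [0.0, 1.0, 0.0],   # document containing only "automobile"
    [0.0, 0.0, 1.0],   # document containing only "banana"
])

# Hypothetical relatedness scores: "car" and "automobile" are closely related.
relatedness = np.array([
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

plain = smoothed_kernel(docs, np.eye(3))       # identity matrix: ordinary VSM cosine
smooth = smoothed_kernel(docs, relatedness)    # semantically smoothed similarities

print("VSM similarity (car vs automobile):     ", plain[0, 1])
print("Smoothed similarity (car vs automobile):", smooth[0, 1])
print("Smoothed similarity (car vs banana):    ", smooth[0, 2])
```

With the identity matrix the sketch reduces to the plain VSM cosine, where documents that share no terms have zero similarity. With a non-trivial relatedness matrix, documents that use different but related words receive a positive similarity, which is the effect the smoothing is meant to achieve.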