Semantically-grounded construction of centroids for datasets with textual attributes

Authors:
Sergio Martínez; Aida Valls; David Sánchez
Affiliations:
Department of Computer Science and Mathematics, Universitat Rovira i Virgili, Avda. Països Catalans 26, 43007 Tarragona, Catalonia, Spain; Department of Computer Science and Mathematics, Universitat Rovira i Virgili, Avda. Països Catalans 26, 43007 Tarragona, Catalonia, Spain; Department of Computer Science and Mathematics, Universitat Rovira i Virgili, Avda. Països Catalans 26, 43007 Tarragona, Catalonia, Spain
Venue:
Knowledge-Based Systems
Year:
2012


Abstract

Centroids are key components of many data analysis algorithms, such as clustering and microaggregation. A centroid is taken to be the central value that minimises the distance to all the objects in a dataset or cluster. Methods for centroid construction are mainly devoted to datasets with numerical and categorical attributes, and focus on the numerical and distributional properties of the data. Textual attributes, in contrast, consist of term lists referring to concepts with a specific semantic content (i.e., meaning), which cannot be evaluated by means of classical numerical operators. Hence, the centroid of a dataset with textual attributes should be the term that minimises the semantic distance to the members of the set. Semantically-grounded methods for constructing centroids for datasets with textual attributes are scarce and, as will be discussed in this paper, they are hampered by their limited semantic analysis of the data. In this paper, we propose a method that exploits the knowledge provided by background ontologies (such as WordNet) to construct the centroid of multivariate datasets described by means of textual attributes. Special effort has been put into minimising the semantic distance between the centroid and the input data. As a result, our method is able to provide optimal centroids (i.e., those that minimise the distance to all the objects in the dataset) with respect to the exploited background ontology and a semantic similarity measure. Our proposal has been evaluated on a real dataset consisting of short textual answers provided by visitors to a natural park. Results show that our centroids retain the semantic content of the input data better than those of related work.
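
The idea of a centroid as the term that minimises the semantic distance to the members of a set can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy taxonomy PARENT stands in for a background ontology such as WordNet, the edge-counting distance stands in for whatever semantic similarity measure is actually used, and the candidate set (the input terms plus their taxonomic ancestors) is one plausible choice among several. It handles a single textual attribute only.

```python
from collections import Counter

# Hypothetical toy taxonomy (child -> parent); a stand-in for a real ontology.
PARENT = {
    "activity": None,
    "sport": "activity",
    "leisure": "activity",
    "hiking": "sport",
    "cycling": "sport",
    "climbing": "sport",
    "picnic": "leisure",
}


def ancestors(term):
    """Return the path from term up to the root, with term itself first."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path


def distance(a, b):
    """Edge-counting distance: number of taxonomy edges on the path a..b.

    This is an illustrative measure, not the one used in the paper.
    """
    steps_from_a = {t: i for i, t in enumerate(ancestors(a))}
    for j, t in enumerate(ancestors(b)):
        if t in steps_from_a:  # first shared ancestor found walking up from b
            return steps_from_a[t] + j
    raise ValueError("terms share no common ancestor")


def centroid(values):
    """Return the candidate term with the smallest summed distance to values.

    Candidates are the input terms and all of their ancestors (an assumption).
    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter(values)
    candidates = set()
    for term in counts:
        candidates.update(ancestors(term))

    def cost(candidate):
        return sum(n * distance(candidate, term) for term, n in counts.items())

    return min(candidates, key=lambda c: (cost(c), c))


answers = ["hiking", "cycling", "climbing", "picnic"]
print(centroid(answers))
```

In this toy setting the selected term need not appear in the input: a more general concept from the taxonomy can sit closer, in aggregate, to all the answers than any single answer does. Repeated values pull the result towards themselves because each occurrence contributes to the summed distance.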