Clustering and semantics preservation in cultural heritage information spaces

Authors:
Javier Pereira;Felipe Schmidt;Pedro Contreras;Fionn Murtagh;Hernan Astudillo
Affiliations:
Universidad Diego Portales, Santiago, Chile;Universidad Diego Portales, Santiago, Chile;University of London, Egham Hill, Surrey, England;University of London, Egham Hill, Surrey, England;Universidad Técnica Federico Santa María, Valparaíso, Chile
Venue:
RIAO '10 Adaptivity, Personalization and Fusion of Heterogeneous Information
Year:
2010

Abstract

In this paper, we analyze whether the original semantic similarity among objects is preserved when dimensionality reduction is applied to the source data and clustering is then performed on the reduced data. An experiment is designed to test the Baire (longest common prefix) ultrametric and K-Means after a prior random projection. A data matrix extracted from a cultural heritage database was prepared for the experiment. Because the random projection produces a vector with components in the interval [0, 1], clusters are obtained at different levels of digit precision. The mean semantic similarity of the clusters is then calculated using a modified version of the Jaccard index. Our findings show that semantics is difficult to preserve with these methods. However, a Student's t-test on mean similarity indicates that the Baire approach yields semantically better clusters than K-Means as the digit precision increases, at the cost of a growing number of orphan objects. Despite this cost, it is argued that the ultrametric technique provides an efficient way to detect semantic homogeneity in the original data space.
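
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made for illustration: the projection uses a single random axis with uniform weights, rescaled to [0, 1]; Baire clusters are formed by grouping values that share the same leading decimal digits; the plain Jaccard index stands in for the paper's modified version, which the abstract does not specify; and a singleton cluster is treated as an "orphan". All function names are hypothetical, and the K-Means baseline is omitted.

```python
import random
from collections import defaultdict
from itertools import combinations


def random_projection(rows, seed=0):
    """Project each row onto one random axis and rescale the scores to [0, 1].

    Assumption: a single axis with uniform random weights; the paper's exact
    projection scheme is not given in the abstract.
    """
    rng = random.Random(seed)
    axis = [rng.random() for _ in range(len(rows[0]))]
    scores = [sum(a * x for a, x in zip(axis, row)) for row in rows]
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0  # avoid division by zero if all scores coincide
    return [(s - lo) / span for s in scores]


def baire_clusters(values, precision):
    """Group indices whose values share the same first `precision` decimal digits."""
    buckets = defaultdict(list)
    for i, v in enumerate(values):
        # Use the decimal string as the key so that it is a true common prefix;
        # the integer part is kept so that 1.0 does not collide with 0.0.
        key = f"{v:.12f}"[: 2 + precision]
        buckets[key].append(i)
    return list(buckets.values())


def jaccard(a, b):
    """Plain Jaccard index between two descriptor sets (stand-in for the modified index)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_cluster_similarity(cluster, features):
    """Mean pairwise Jaccard similarity inside a cluster; None for a singleton."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return None  # assumption: a one-object cluster counts as an orphan
    return sum(jaccard(features[i], features[j]) for i, j in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy projected values and descriptor sets, invented for this example.
    values = [0.1234, 0.1278, 0.1911, 0.5432]
    features = [{"a", "b"}, {"a", "b", "c"}, {"x"}, {"y"}]
    for precision in (1, 2):
        clusters = baire_clusters(values, precision)
        sims = [mean_cluster_similarity(c, features) for c in clusters]
        print(precision, clusters, sims)
```

With the toy values above, raising the precision from one digit to two splits the first cluster and leaves more singletons, which mirrors the trade-off the abstract reports between semantic homogeneity and orphan objects.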