Computing similarity between items in a digital library of cultural heritage

Authors:
Nikolaos Aletras; Mark Stevenson; Paul Clough
Affiliations:
The University of Sheffield; The University of Sheffield; The University of Sheffield
Venue:
Journal on Computing and Cultural Heritage (JOCCH)
Year:
2013

Abstract

Large amounts of cultural heritage content have now been digitized and are available in digital libraries. However, these collections are often unstructured and difficult to navigate. Automatic techniques for identifying similar items in these collections could improve navigation, since they would allow implicitly connected items to be linked together and sets of similar items to be clustered. Europeana is a large digital library containing more than 20 million digital objects from a set of cultural heritage providers throughout Europe. The diverse nature of this collection means that the items do not have standard metadata to assist navigation. A range of methods for computing the similarity between pairs of texts is applied to metadata records in Europeana in order to estimate the similarity between items. The methods proposed for computing similarity can be classified into two main approaches: (1) knowledge-based approaches, which make use of external knowledge sources, and (2) corpus-based approaches, which rely on analyzing the frequency distributions of words in documents. Both kinds of technique are evaluated against manual judgements obtained for this study and against a multiple-choice test created from manually generated categories in cultural heritage collections. We find that a combination of corpus-based and knowledge-based approaches provides the best results in both experiments.
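
As a rough illustration of the corpus-based family mentioned in the abstract, the sketch below scores pairs of short metadata texts by the cosine of their TF-IDF vectors. It is not the authors' implementation: the tokenizer, the smoothed IDF formula, the function names (`tokenize`, `tfidf_vectors`, `cosine`) and the three sample records are all assumptions made for the example, and the records are invented rather than taken from Europeana.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters (a crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict term -> weight) per document.

    Assumption: a smoothed IDF, log((1 + N) / (1 + df)) + 1, is used so that
    terms occurring in every document still receive a non-zero weight.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical metadata records, invented for this example.
records = [
    "Roman coin, silver denarius, found near York",
    "Silver denarius of the Roman emperor Hadrian",
    "Oil painting of a fishing boat at sea",
]

vecs = tfidf_vectors(records)
sim_coins = cosine(vecs[0], vecs[1])  # two coin records share several terms
sim_mixed = cosine(vecs[0], vecs[2])  # coin record vs. painting record
print(f"coin vs coin:     {sim_coins:.3f}")
print(f"coin vs painting: {sim_mixed:.3f}")
```

A purely lexical measure like this one only rewards shared words, which is one reason a knowledge-based signal drawn from an external resource might usefully complement it on short, heterogeneous metadata records.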