Wikipedia-assisted concept thesaurus for better web media understanding

  • Authors: Huan Wang; Liang-Tien Chia; Shenghua Gao
  • Affiliations: Nanyang Technological University, Singapore, Singapore (all three authors)
  • Venue: Proceedings of the international conference on Multimedia information retrieval
  • Year: 2010

Abstract

Concept ontologies have been used in artificial intelligence, biomedical informatics, and library science, and have been shown to be an effective way to better understand data in those domains. One main difficulty that hinders the development of ontology approaches is the extra work required for ontology construction and annotation. With the emergence of lexical dictionaries and encyclopedias such as WordNet and Wikipedia, innovations from different directions have been proposed to automatically extract concept ontologies. Unfortunately, many of the proposed ontologies do not fully exploit general human knowledge. We study the various knowledge sources and aim to construct, from Wikipedia, a scalable concept thesaurus suited to better understanding of media on the World Wide Web. With its wide concept coverage, finely organized categories, diverse concept relations, and up-to-date information, the collaborative encyclopedia Wikipedia has almost all the requisite attributes to contribute to a well-defined concept ontology. Besides explicit concept relations such as disambiguation and synonymy, Wikipedia also provides implicit concept relations through cross-references between articles. In our previous work, we built an ontology with explicit relations from Wikipedia page contents. Although the method works, mining explicit semantic relations from the content of every Wikipedia concept page has unresolved scalability issues as more concepts are involved. This paper describes our attempt to automatically build a concept thesaurus that encodes both explicit and implicit semantic relations for a large number of concepts from Wikipedia. Our proposed thesaurus construction takes advantage of both the structure and content features of the downloaded Wikipedia database, and defines each concept entry with its related concepts and relations.
This thesaurus is then used to exploit semantics from web page context to build a more semantically meaningful space. We go a step further and combine this with the similarity distance from the image feature space to boost performance. We evaluate our approach by applying the constructed concept thesaurus to web image retrieval. The results show that implicit semantic relations can be used to improve retrieval performance.
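
The abstract does not give the thesaurus format or the fusion rule, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a thesaurus entry is a concept name plus a map of related concepts to relation labels, that explicit relations are weighted more heavily than implicit ones, and that the semantic and visual distances are combined linearly. The names `ConceptEntry`, `expand`, `cosine_distance`, `fused_distance`, the relation labels, the weights 0.8 and 0.4, and the parameter `alpha` are all hypothetical.

```python
import math
from dataclasses import dataclass, field

# Hypothetical relation labels. The abstract names disambiguation and synonymy
# as explicit relations and article cross-references as the implicit source.
EXPLICIT = {"synonym", "disambiguation"}


@dataclass
class ConceptEntry:
    """One thesaurus entry: a concept and its related concepts with relation labels."""
    name: str
    related: dict = field(default_factory=dict)  # related concept -> relation label


def expand(context_terms, thesaurus):
    """Turn page-context terms into a weighted concept vector using the thesaurus."""
    weights = {}
    for term in context_terms:
        weights[term] = weights.get(term, 0.0) + 1.0
        entry = thesaurus.get(term)
        if entry is None:
            continue
        for other, relation in entry.related.items():
            # Assumed weighting: explicit relations count more than implicit ones.
            w = 0.8 if relation in EXPLICIT else 0.4
            weights[other] = weights.get(other, 0.0) + w
    return weights


def cosine_distance(a, b):
    """Cosine distance between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def fused_distance(semantic_d, visual_d, alpha=0.5):
    """Assumed linear fusion of semantic and image-feature distances."""
    return alpha * semantic_d + (1.0 - alpha) * visual_d


# Toy thesaurus with one explicit and one implicit relation.
thesaurus = {
    "car": ConceptEntry("car", {"automobile": "synonym", "road": "cross_reference"}),
}

# Without the thesaurus, the two contexts share no terms.
plain = cosine_distance({"car": 1.0}, {"automobile": 1.0})
# With expansion, the synonym relation brings them closer.
expanded = cosine_distance(expand(["car"], thesaurus), expand(["automobile"], thesaurus))
```

In this toy case the query context "car" and the page context "automobile" are maximally distant as raw terms, while the expanded vectors overlap through the synonym relation. How the paper actually weights relations and fuses the two spaces is not stated in the abstract.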