Ontology-driven web-based semantic similarity

  • Authors:
  • David Sánchez; Montserrat Batet; Aida Valls; Karina Gibert

  • Affiliations:
  • Department of Computer Science and Mathematics, Universitat Rovira i Virgili (URV), Tarragona, Spain 43007; Department of Computer Science and Mathematics, Universitat Rovira i Virgili (URV), Tarragona, Spain 43007; Department of Computer Science and Mathematics, Universitat Rovira i Virgili (URV), Tarragona, Spain 43007; Department of Statistics and Operations Research, Universitat Politècnica de Catalunya, Barcelona, Spain 08034

  • Venue:
  • Journal of Intelligent Information Systems
  • Year:
  • 2010

Abstract

Estimating the degree of semantic similarity (or distance) between concepts is a very common problem in research areas such as natural language processing, knowledge acquisition, information retrieval and data mining. Many similarity measures have been proposed in the past, exploiting either explicit knowledge, such as the structure of a taxonomy, or implicit knowledge, such as information distribution. In the former case, taxonomies and/or ontologies are used to introduce additional semantics; in the latter case, frequencies of term appearances in a corpus are considered. Classical measures based on those premises suffer from some problems: in the first case, an excessive dependency on the taxonomical/ontological structure; in the second case, the lack of semantics of a purely statistical analysis of occurrences and/or the ambiguity of estimating the statistical distribution of concepts from term appearances. Measures based on the Information Content (IC) of taxonomical concepts combine both approaches. However, to compute accurate concept appearance probabilities they depend heavily on a corpus that has been properly pre-tagged and disambiguated according to the ontological entities. This limits the applicability of those measures to other ontologies (such as specific domain ontologies) and to massive corpora (such as the Web). In this paper, several of these issues are analyzed and modifications of classical similarity measures are proposed. They are based on a contextualized and scalable version of IC computation on the Web that exploits taxonomical knowledge. The goal is to remove the measures' dependency on corpus pre-processing while still achieving reliable results and minimizing language ambiguity. Our proposals are able to outperform classical approaches when the Web is used for estimating concept probabilities.
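
As a rough illustration of what an IC-based measure computes, the following minimal sketch implements the classical formulation, IC(c) = -log p(c), together with a Resnik-style similarity that takes the IC of the most informative common subsumer. It is not the contextualized Web-based computation proposed in the paper. The taxonomy, the occurrence counts, and all function names are hypothetical and chosen only for the example; in a Web setting the counts might instead come from search engine page counts.

```python
import math

# Hypothetical toy taxonomy: concept -> parent (None marks the root).
PARENT = {
    "entity": None,
    "animal": "entity",
    "dog": "animal",
    "cat": "animal",
    "vehicle": "entity",
    "car": "vehicle",
}

# Hypothetical occurrence counts. Each count includes the occurrences of the
# concept's descendants, so probabilities never decrease toward the root.
COUNTS = {"entity": 1000, "animal": 200, "dog": 50, "cat": 40,
          "vehicle": 300, "car": 120}
TOTAL = COUNTS["entity"]


def subsumers(concept):
    """Return the concept followed by all its ancestors up to the root."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def information_content(concept):
    """Classical IC: negative log of the concept's appearance probability."""
    return -math.log2(COUNTS[concept] / TOTAL)


def resnik_similarity(a, b):
    """IC of the most informative concept that subsumes both a and b."""
    common = set(subsumers(a)) & set(subsumers(b))
    return max(information_content(c) for c in common)


if __name__ == "__main__":
    print(resnik_similarity("dog", "cat"))  # shared subsumer: "animal"
    print(resnik_similarity("dog", "car"))  # shared subsumer: only the root
```

With these made-up counts, "dog" and "cat" share the fairly specific subsumer "animal" and so score higher than "dog" and "car", which share only the root. The sketch also shows where the dependency discussed in the abstract arises: the result is only as good as the per-concept counts, which classically require a tagged and disambiguated corpus.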