Hypernym discovery based on distributional similarity and hierarchical structures

  • Authors:
  • Ichiro Yamada;Kentaro Torisawa;Jun'ichi Kazama;Kow Kuroda;Masaki Murata;Stijn De Saeger;Francis Bond;Asuka Sumida

  • Affiliations:
  • National Institute of Information and Communications Technology, Keihanna Science City, Japan;National Institute of Information and Communications Technology, Keihanna Science City, Japan;National Institute of Information and Communications Technology, Keihanna Science City, Japan;National Institute of Information and Communications Technology, Keihanna Science City, Japan;National Institute of Information and Communications Technology, Keihanna Science City, Japan;National Institute of Information and Communications Technology, Keihanna Science City, Japan;National Institute of Information and Communications Technology, Keihanna Science City, Japan;Japan Advanced Institute of Science and Technology, Nomi-shi, Ishikawa-ken, Japan

  • Venue:
  • EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2
  • Year:
  • 2009

Abstract

This paper presents a new method for building a large-scale hyponymy relation database by combining Wikipedia with other Web documents. We attach new words to a hyponymy database extracted from Wikipedia, using distributional similarity calculated from documents on the Web. For a given target word, our algorithm first finds k similar words in the Wikipedia database. The hypernyms of these k similar words are then assigned scores based on their distributional similarities and hierarchical distances in the Wikipedia database. Finally, new hyponymy relations are output according to the scores. We tested two distributional similarity measures: one based on raw verb-noun dependencies (which we call "RVD"), and the other based on a large-scale clustering of verb-noun dependencies (called "CVD"). With CVD, our method achieved an attachment accuracy of 91.0% for the top 10,000 relations and 74.5% for the top 100,000 relations, far better than the baseline approaches. Excluding the region with very high scores, CVD was found to be more effective than RVD. We also confirmed that most relations extracted by our method cannot be extracted merely by applying well-known lexico-syntactic patterns to Web documents.
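
The three steps described in the abstract (find k similar words, score their hypernyms, rank the candidates) can be illustrated with a small sketch. The abstract does not give the similarity measure or the scoring formula, so everything below is an assumption made for illustration: cosine similarity over raw verb-noun dependency counts stands in for an RVD-style similarity, each hypernym's score is the sum of the neighbours' similarities discounted by a hypothetical `decay` factor per step up the hierarchy, and the toy vocabulary, the `parent` map, and all function names are invented. It is not the authors' implementation.

```python
from collections import Counter, defaultdict
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (dict-like)."""
    dot = sum(v * b.get(f, 0) for f, v in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def ancestors(word, parent):
    """Yield (hypernym, distance) pairs while walking up the hierarchy."""
    seen = {word}
    dist = 1
    while word in parent and parent[word] not in seen:
        word = parent[word]
        seen.add(word)
        yield word, dist
        dist += 1


def rank_hypernyms(target_vec, vectors, parent, k=2, decay=0.5):
    """Score hypernyms of the k words most similar to the target.

    Assumed scoring rule (not from the paper): each neighbour contributes
    similarity * decay ** (distance - 1) to every one of its ancestors.
    """
    neighbours = sorted(
        ((cosine(target_vec, vec), w) for w, vec in vectors.items()),
        reverse=True,
    )[:k]
    scores = defaultdict(float)
    for sim, w in neighbours:
        for hyper, dist in ancestors(w, parent):
            scores[hyper] += sim * decay ** (dist - 1)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy verb-noun dependency counts for words already in the hierarchy.
vectors = {
    "espresso": Counter({("drink", "obj"): 3, ("brew", "obj"): 2}),
    "latte": Counter({("drink", "obj"): 2, ("order", "obj"): 1}),
    "sedan": Counter({("drive", "obj"): 4, ("park", "obj"): 1}),
}
# Toy hyponym -> hypernym links standing in for the Wikipedia database.
parent = {
    "espresso": "coffee",
    "latte": "coffee",
    "coffee": "beverage",
    "sedan": "car",
    "car": "vehicle",
}
# Dependency counts for a new, unattached target word.
target = Counter({("drink", "obj"): 2, ("brew", "obj"): 1})

ranked = rank_hypernyms(target, vectors, parent, k=2, decay=0.5)
for hypernym, score in ranked:
    print(f"{hypernym}\t{score:.3f}")
```

In this toy setting the nearest hypernym shared by both neighbours ranks first and its own hypernym ranks second with half the score, which is the intended effect of combining similarity with hierarchical distance. A CVD-style variant would swap the raw dependency features for cluster-based ones while leaving the ranking step unchanged.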