Handling orthographic varieties in Japanese IR: fusion of word-, n-gram-, and yomi-based indices across different document collections

  • Authors: Nina Kummer; Christa Womser-Hacker; Noriko Kando

  • Affiliations: Universität Hildesheim, Germany; Universität Hildesheim, Germany; National Institute of Informatics, Tokyo, Japan

  • Venue: AIRS'05 Proceedings of the Second Asia conference on Asia Information Retrieval Technology
  • Year: 2005

Abstract

Orthographic varieties are common in Japanese and pose a serious problem for Japanese information retrieval (IR): an IR system risks missing documents that contain variant forms of the search term. We propose two strategies for handling orthographic varieties: pronunciation-based (yomi-based) indexing, and “Fuzzy Querying”, which compares katakana terms by edit distance. Both strategies were integrated into our multiple index and fusion system [1] and tested on two test collections, newspaper articles (Mainichi Shimbun ’98) and scientific abstracts (NTCIR-1), to compare their performance across text genres.
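
To illustrate the idea behind comparing katakana terms by edit distance, here is a minimal sketch. It is not the authors' implementation: the function names `edit_distance` and `fuzzy_variants`, the use of plain Levenshtein distance over Unicode characters, the threshold of 1, and the small vocabulary are all assumptions made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, counted per character."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        prev = curr
    return prev[-1]


def fuzzy_variants(term: str, vocabulary: list[str], max_distance: int = 1) -> list[str]:
    """Return index terms within max_distance edits of the query term.

    Hypothetical helper: the threshold and the linear scan over the
    vocabulary are illustrative assumptions, not taken from the paper.
    """
    return [
        v for v in vocabulary
        if v != term and edit_distance(term, v) <= max_distance
    ]


# Two common spellings of "computer" differ only by a trailing long-vowel mark.
vocabulary = ["コンピューター", "コンピュータ", "データ", "バイオリン"]
variants = fuzzy_variants("コンピュータ", vocabulary)
```

With this vocabulary, the query コンピュータ is expanded with コンピューター, while unrelated katakana terms are left out. A larger threshold would catch variants such as ヴァイオリン and バイオリン (distance 2) at the cost of more false matches.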