On knowledge-poor methods for person name matching and lemmatization for highly inflectional languages

  • Authors:
  • Jakub Piskorski;Karol Wieloch;Marcin Sydow

  • Affiliations:
  • Joint Research Centre of the European Commission, Ispra, Italy 21027;Poznań University of Economics, Poznan, Poland 61-875;Web Mining Lab, Polish-Japanese Institute of Information Technology, Warszawa, Poland 02-008

  • Venue:
  • Information Retrieval
  • Year:
  • 2009

Quantified Score

Hi-index 0.01

Visualization

Abstract

Web person search is one of the most common activities of Internet users. Recently, a vast amount of work on applying various NLP techniques for person name disambiguation in large web document collections has been reported, where the main focus was on English and few other major languages. This article reports on knowledge-poor methods for tackling person name matching and lemmatization in Polish, a highly inflectional language with complex person name declension paradigm. These methods apply mainly well-established string distance metrics, some new variants thereof, automatically acquired simple suffix-based lemmatization patterns and some combinations of the aforementioned techniques. Furthermore, we also carried out some initial experiments on deploying techniques that utilize the context, in which person names appear. Results of numerous experiments are presented. The evaluation carried out on a data set extracted from a corpus of on-line news articles revealed that achieving lemmatization accuracy figures greater than 90% seems to be difficult, whereas combining string distance metrics with suffix-based patterns results in 97.6---99% accuracy for the name matching task. Interestingly, no significant additional gain could be achieved through integrating some basic techniques, which try to exploit the local context the names appear in. Although our explorations were focused on Polish, we believe that the work presented in this article constitutes practical guidelines for tackling the same problem for other highly inflectional languages with similar phenomena.