On knowledge-poor methods for person name matching and lemmatization for highly inflectional languages

Authors:
Jakub Piskorski; Karol Wieloch; Marcin Sydow
Affiliations:
Joint Research Centre of the European Commission, Ispra, Italy 21027; Poznań University of Economics, Poznań, Poland 61-875; Web Mining Lab, Polish-Japanese Institute of Information Technology, Warszawa, Poland 02-008
Venue:
Information Retrieval
Year:
2009

Abstract

Web person search is one of the most common activities of Internet users. Recently, a vast amount of work has been reported on applying various NLP techniques to person name disambiguation in large web document collections, with the main focus on English and a few other major languages. This article reports on knowledge-poor methods for tackling person name matching and lemmatization in Polish, a highly inflectional language with a complex person name declension paradigm. These methods mainly apply well-established string distance metrics, some new variants thereof, automatically acquired simple suffix-based lemmatization patterns, and some combinations of these techniques. We also carried out initial experiments on deploying techniques that utilize the context in which person names appear. Results of numerous experiments are presented. The evaluation, carried out on a data set extracted from a corpus of online news articles, revealed that achieving lemmatization accuracy greater than 90% seems to be difficult, whereas combining string distance metrics with suffix-based patterns results in 97.6% to 99% accuracy for the name matching task. Interestingly, no significant additional gain could be achieved by integrating some basic techniques that try to exploit the local context in which the names appear. Although our explorations focused on Polish, we believe that the work presented in this article provides practical guidelines for tackling the same problem in other highly inflectional languages with similar phenomena.
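
The idea of combining a string distance metric with suffix-based patterns can be illustrated with a minimal sketch. This is not the authors' implementation: plain Levenshtein distance stands in for the string distance metrics mentioned in the abstract, the entries in `SUFFIX_RULES` are hand-written hypothetical examples rather than automatically acquired patterns, and all function names are assumptions made for illustration.

```python
from typing import List, Optional, Set, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Map edit distance to a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# Hypothetical (inflected ending, base ending) pairs, for illustration only.
SUFFIX_RULES: List[Tuple[str, str]] = [
    ("owi", ""),
    ("iego", "i"),
    ("em", ""),
    ("a", ""),
]


def candidate_lemmas(token: str) -> Set[str]:
    """Return the token itself plus every form produced by a matching rule."""
    candidates = {token}
    for suffix, replacement in SUFFIX_RULES:
        # Keep at least two characters of stem to avoid degenerate lemmas.
        if token.endswith(suffix) and len(token) > len(suffix) + 1:
            candidates.add(token[: -len(suffix)] + replacement)
    return candidates


def best_match(inflected: str, base_names: List[str]) -> Tuple[Optional[str], float]:
    """Pick the base name whose tokens are closest to the inflected name."""
    tokens = inflected.lower().split()
    best_name, best_score = None, -1.0
    for name in base_names:
        base_tokens = name.lower().split()
        if len(base_tokens) != len(tokens):
            continue  # this sketch only compares names with equal token counts
        per_token = [
            max(similarity(c, base) for c in candidate_lemmas(tok))
            for tok, base in zip(tokens, base_tokens)
        ]
        score = sum(per_token) / len(per_token)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


if __name__ == "__main__":
    names = ["Jan Kowalski", "Janina Kowalska", "Adam Nowak"]
    print(best_match("Jana Kowalskiego", names))
```

Scoring each token by its best candidate lemma lets a suffix rule help when it applies, while the unmodified token remains a fallback when no rule fits. A rule such as stripping a final "a" is ambiguous for Polish names, which is one reason a sketch like this says nothing about the accuracy figures reported above.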