Rich set of features for proper name recognition in Polish texts

  • Authors:
  • Michał Marcińczuk; Michał Stanek; Maciej Piasecki; Adam Musiał

  • Affiliations:
  • Wrocław University of Technology, Wrocław, Poland; Wrocław University of Technology, Wrocław, Poland; Wrocław University of Technology, Wrocław, Poland; Wrocław University of Technology, Wrocław, Poland

  • Venue:
  • SIIS'11 Proceedings of the 2011 international conference on Security and Intelligent Information Systems
  • Year:
  • 2011

Abstract

In this paper we analyse the importance of data generalisation and the use of local context in the problem of proper name recognition. We present an extended set of features that provide a generalised description of the data and encode linguistic information. To exploit this rich set of features we applied Conditional Random Fields (CRF), a modern approach to sequence labelling. We present the results of a single-domain evaluation following the cross-validation scheme and of a cross-domain evaluation based on training and testing on different corpora. We show that the extended set of features improves the final results for CRF, and that this approach outperforms Hidden Markov Models (HMM). On the single domain, CRF obtained an F-measure of 92.53% for 5 categories of proper names; in the cross-domain evaluation it obtained F-measures of 67.72% and 72.62% on two other corpora.
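
To make the idea of generalised, context-aware features more concrete, the following is a minimal sketch of how per-token feature dictionaries for a sequence labeller such as a CRF might be built. It is an illustration only, not the authors' feature set: the function names `word_shape` and `token_features`, the window size, the suffix feature, and the example sentence are all assumptions introduced here.

```python
def word_shape(token):
    """Collapse a token into a generalised pattern (hypothetical feature).

    Uppercase letters map to 'A', lowercase to 'a', digits to '9',
    anything else to '-'; runs of the same class are merged.
    """
    shape = []
    for ch in token:
        if ch.isupper():
            cls = "A"
        elif ch.islower():
            cls = "a"
        elif ch.isdigit():
            cls = "9"
        else:
            cls = "-"
        if not shape or shape[-1] != cls:
            shape.append(cls)
    return "".join(shape)


def token_features(tokens, i, window=2):
    """Describe token i using itself and its neighbours within `window`."""
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            feats[f"shape[{offset}]"] = word_shape(tokens[j])
            feats[f"lower[{offset}]"] = tokens[j].lower()
        else:
            # Mark positions that fall outside the sentence boundary.
            feats[f"pad[{offset}]"] = True
    # A short suffix is a cheap generalisation for an inflected language
    # (an assumption of this sketch, not a feature taken from the paper).
    feats["suffix3"] = tokens[i][-3:].lower()
    return feats


# Hypothetical example sentence, already tokenised.
sentence = ["Maciej", "Piasecki", "pracuje", "we", "Wrocławiu", "."]
features = [token_features(sentence, i) for i in range(len(sentence))]
```

Each dictionary in `features` describes one token together with its local context, which is the kind of input that CRF toolkits typically accept alongside a parallel list of labels.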