Rich set of features for proper name recognition in Polish texts

Authors:
Michał Marcińczuk; Michał Stanek; Maciej Piasecki; Adam Musiał
Affiliations:
Wrocław University of Technology, Wrocław, Poland; Wrocław University of Technology, Wrocław, Poland; Wrocław University of Technology, Wrocław, Poland; Wrocław University of Technology, Wrocław, Poland
Venue:
SIIS'11 Proceedings of the 2011 international conference on Security and Intelligent Information Systems
Year:
2011

Abstract

In this paper we analyse the importance of data generalisation and the use of local context in the problem of proper name recognition. We present an extended set of features that provide a generalised description of the data and encode linguistic information. To utilise this rich set of features we applied Conditional Random Fields (CRF), a modern approach to sequence labelling. We present the results of an evaluation on a single domain following the cross-validation scheme, and of a cross-domain evaluation based on training and testing on different corpora. We show that the extended set of features improves the final results for CRF and that this approach also outperforms Hidden Markov Models (HMM). On the single domain, CRF obtained an F-measure of 92.53% for 5 categories of proper names; in the cross-domain evaluation it obtained F-measures of 67.72% and 72.62% on the two other corpora.
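The abstract does not list the individual features, so the following is only a minimal sketch of what "data generalisation" and "local context" can mean for a sequence labeller such as a CRF. The function names (`word_shape`, `token_features`), the feature names, the window size, and the sample sentence are all hypothetical illustrations and are not taken from the paper.

```python
from typing import Dict, List


def word_shape(token: str) -> str:
    """Map a token to a coarse orthographic pattern (hypothetical feature).

    Upper-case letters become "A", lower-case letters "a", digits "0",
    anything else "-"; runs of the same class are collapsed.
    """
    shape: List[str] = []
    for ch in token:
        if ch.isupper():
            code = "A"
        elif ch.islower():
            code = "a"
        elif ch.isdigit():
            code = "0"
        else:
            code = "-"
        # Collapse consecutive characters of the same class into one symbol.
        if not shape or shape[-1] != code:
            shape.append(code)
    return "".join(shape)


def token_features(tokens: List[str], i: int, window: int = 1) -> Dict[str, str]:
    """Describe token i together with its neighbours inside the window.

    The window size of 1 is an assumption made for this illustration.
    """
    feats: Dict[str, str] = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            feats[f"lower[{offset}]"] = tokens[j].lower()
            feats[f"shape[{offset}]"] = word_shape(tokens[j])
        else:
            # Mark positions that fall outside the sentence boundary.
            feats[f"pad[{offset}]"] = "1"
    return feats


# Invented sample sentence, one feature dictionary per token.
sentence = ["Jan", "Kowalski", "mieszka", "we", "Wrocławiu", "."]
features = [token_features(sentence, i) for i in range(len(sentence))]
```

Per-token dictionaries of this kind are a common input format for CRF toolkits. The shape feature lets a model treat unseen capitalised words like seen ones, which is one plausible reading of the generalisation the abstract refers to.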