A hybrid named entity recognizer for Turkish

Authors:
Dilek Küçük;Adnan Yazıcı
Affiliations:
TÜBİTAK UZAY, Power Electronics Group, Ankara, Turkey;Middle East Technical University, Dept. of Computer Eng., Ankara, Turkey
Venue:
Expert Systems with Applications: An International Journal
Year:
2012

Abstract

Named entity recognition is an important subfield of information extraction from textual data. Yet named entity recognition research on Turkish texts is still rare compared to related research on other languages such as English, Spanish, Chinese, and Japanese. In this study, we present a hybrid named entity recognizer for Turkish, built on a manually engineered rule-based recognizer that we proposed previously. Because rule-based systems built for specific domains require their knowledge sources to be manually revised when ported to other domains, we enrich our rule-based recognizer and turn it into a hybrid recognizer that learns from annotated data when available and improves its knowledge sources accordingly. The hybrid recognizer was originally engineered for generic news texts, but with its learning capability it becomes applicable to financial news texts, historical texts, and children's stories as well, without human intervention. Both the hybrid recognizer and its rule-based predecessor are evaluated on the same corpora, and the hybrid recognizer achieves better results than its predecessor. The proposed recognizer is significant because it is the first hybrid recognizer proposed for Turkish that addresses the above porting problem; Turkish has different structural properties from widely studied languages such as English, and very little information extraction research has been conducted on Turkish texts. Moreover, the use of the proposed hybrid recognizer for semantic video indexing is shown as a case study on Turkish news videos. The genuine textual and video corpora used throughout the paper were compiled and annotated by the authors, because publicly available annotated corpora for information extraction research on Turkish texts are lacking.
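
The porting idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class name `HybridRecognizer`, the single `lexicon` knowledge source, the `normalize` helper, and the greedy longest-match lookup are all assumptions made for illustration. The sketch only shows how a lexicon-driven recognizer could extend its knowledge source from annotated data so that entities from a new domain are found without manual revision.

```python
from dataclasses import dataclass, field


def normalize(token: str) -> str:
    """Drop a suffix attached with an apostrophe, e.g. "Ankara'da" -> "Ankara".

    Assumption for this sketch: suffixes on proper nouns follow an apostrophe.
    """
    return token.replace("\u2019", "'").split("'")[0]


@dataclass
class HybridRecognizer:
    # Hypothetical knowledge source: entity surface form -> entity type.
    lexicon: dict[str, str] = field(default_factory=dict)

    def learn(self, annotated):
        """Add every annotated entity to the lexicon.

        `annotated` is a list of (tokens, spans) pairs, where each span is
        (start, end, label) with `end` exclusive.
        """
        for tokens, spans in annotated:
            for start, end, label in spans:
                key = " ".join(normalize(t) for t in tokens[start:end])
                self.lexicon[key] = label

    def recognize(self, tokens):
        """Greedy longest-match lookup of token n-grams in the lexicon."""
        found = []
        i = 0
        while i < len(tokens):
            for j in range(len(tokens), i, -1):
                key = " ".join(normalize(t) for t in tokens[i:j])
                if key in self.lexicon:
                    found.append((i, j, self.lexicon[key]))
                    i = j
                    break
            else:
                i += 1  # no entity starts at this token
        return found


# Start with a lexicon built for generic news (illustrative entry only).
recognizer = HybridRecognizer(lexicon={"Ankara": "LOCATION"})
sentence = ["Merkez", "Bankası", "Ankara'da", "toplandı"]
before = recognizer.recognize(sentence)

# Learn from one annotated financial-news sentence (made-up example data).
recognizer.learn([
    (["Merkez", "Bankası", "faiz", "kararını", "açıkladı"],
     [(0, 2, "ORGANIZATION")]),
])
after = recognizer.recognize(sentence)
```

In this sketch `before` contains only the location, while `after` also contains the organization learned from the annotated sentence. A real system would need far richer resources, such as pattern rules and morphological analysis, which the abstract does not detail.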