A statistical information extraction system for Turkish

Authors:
Gökhan Tür;Dilek Hakkani-Tür;Kemal Oflazer
Affiliations:
AT&T Labs - Research, 180 Park Avenue, Florham Park, NJ 07932, USA;AT&T Labs - Research, 180 Park Avenue, Florham Park, NJ 07932, USA;Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul TR-81474, Turkey
Venue:
Natural Language Engineering
Year:
2003

Abstract

This paper presents the results of a study on information extraction from unrestricted Turkish text using statistical language processing methods. In languages like English, only a very small number of word forms can be derived from a given root word. Languages like Turkish, however, have very productive agglutinative morphology. Building statistical models for specific tasks from the surface forms of the words is therefore difficult, mainly because of the data sparseness problem. To alleviate this problem, we used additional syntactic information, i.e., the morphological structure of the words. We have successfully applied statistical methods using both lexical and morphological information to sentence segmentation, topic segmentation, and name tagging tasks. For sentence segmentation, we modeled the final inflectional groups of the words and combined this model with the lexical model, decreasing the error rate to 4.34%, which is 21% better than the result obtained using only the surface forms of the words. For topic segmentation, stems of the words (especially nouns) were found to be more effective than surface forms, and we achieved a 10.90% segmentation error rate on our test set according to the weighted TDT-2 segmentation cost metric. This is 32% better than the word-based baseline model. For name tagging, we used four different information sources to model names. Our first information source is based on the surface forms of the words. We then combined contextual cues with the lexical model and obtained some improvement. After this, we modeled the morphological analyses of the words, and finally we modeled the tag sequence, reaching an F-measure of 91.56% according to the MUC evaluation criteria. Our results are important in that using linguistic information, i.e., morphological analyses of the words, together with a corpus large enough to train a statistical model, significantly improves these basic information extraction tasks for Turkish.
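The data sparseness point can be illustrated with a toy sketch. It is not the authors' system: the stem table `TOY_STEMS` and the function `type_counts` are hypothetical names, and the stems are written by hand where a real pipeline would use a morphological analyzer. The sketch only shows that several inflected forms of one Turkish root count as separate types on the surface but collapse into a single type once mapped to their stem.

```python
from collections import Counter

# Hypothetical, hand-written stem table (an assumption for illustration).
# A real system would obtain stems from a morphological analyzer.
TOY_STEMS = {
    "ev": "ev",              # house
    "evler": "ev",           # houses
    "evlerimiz": "ev",       # our houses
    "evlerimizde": "ev",     # in our houses
    "kitap": "kitap",        # book
    "kitaplar": "kitap",     # books
}


def type_counts(tokens):
    """Return (surface-form counts, stem counts) for a list of tokens."""
    surface = Counter(tokens)
    # Fall back to the token itself when no stem is known.
    stems = Counter(TOY_STEMS.get(tok, tok) for tok in tokens)
    return surface, stems


tokens = ["ev", "evler", "evlerimiz", "evlerimizde",
          "kitap", "kitaplar", "evler"]
surface, stems = type_counts(tokens)

print("distinct surface forms:", len(surface))
print("distinct stems:", len(stems))
print("stem counts:", dict(stems))
```

With fewer distinct types and more observations per type, counts over stems are less sparse than counts over surface forms, which matches the motivation given in the abstract for using morphological information.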