A multilingual named entity recognition system using boosting and C4.5 decision tree learning algorithms

Authors:
György Szarvas;Richárd Farkas;András Kocsor
Affiliations:
Department of Informatics, University of Szeged, Szeged, Hungary;Research Group on Artificial Intelligence, MTA-SZTE, Szeged, Hungary;Research Group on Artificial Intelligence, MTA-SZTE, Szeged, Hungary
Venue:
DS'06 Proceedings of the 9th international conference on Discovery Science
Year:
2006

Abstract

In this paper we introduce a multilingual Named Entity Recognition (NER) system that uses statistical modeling techniques. The system identifies and classifies named entities (NEs) in Hungarian and English by applying AdaBoostM1 and the C4.5 decision tree learning algorithm. We focused on building as large a feature set as possible, and used a split-and-recombine technique to fully exploit its potential. This methodology allowed us to train several independent decision tree classifiers on different subsets of features and to combine their decisions in a majority voting scheme. The corpus made for the CoNLL 2003 conference and a segment of the Szeged Corpus were used for training and validation. Both consist entirely of newswire articles. Our system remains portable across languages without requiring any major modification; it slightly outperforms the best system of CoNLL 2003 and achieves a 94.77% F measure for Hungarian. The real value of our approach lies in its different basis compared to other top-performing models for English, which makes our system extremely successful when used in combination with CoNLL models.
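The split-and-recombine idea described in the abstract (train one classifier per feature subset, then combine the decisions by majority vote) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the round-robin feature split, the toy binary features, and the `LookupClassifier` stand-in, which simply memorises the most frequent label per feature pattern. It is not C4.5 or AdaBoostM1, and it is not the authors' implementation.

```python
from collections import Counter, defaultdict


def split_features(n_features, n_subsets):
    """Partition feature indices round-robin (a hypothetical split rule)."""
    return [list(range(i, n_features, n_subsets)) for i in range(n_subsets)]


class LookupClassifier:
    """Stand-in base learner, NOT C4.5 or AdaBoost: it memorises the most
    frequent label seen for each combination of feature values."""

    def fit(self, rows, labels):
        counts = defaultdict(Counter)
        for row, label in zip(rows, labels):
            counts[tuple(row)][label] += 1
        self.table = {key: c.most_common(1)[0][0] for key, c in counts.items()}
        # Fall back to the overall majority label for unseen patterns.
        self.default = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, row):
        return self.table.get(tuple(row), self.default)


def train_ensemble(rows, labels, subsets):
    """Train one independent classifier per feature subset."""
    models = []
    for idx in subsets:
        projected = [[row[i] for i in idx] for row in rows]
        models.append((idx, LookupClassifier().fit(projected, labels)))
    return models


def majority_vote(models, row):
    """Combine the per-subset decisions; ties go to the label voted first."""
    votes = Counter(model.predict([row[i] for i in idx]) for idx, model in models)
    return votes.most_common(1)[0][0]


# Toy token features (all hypothetical): [capitalised, in_gazetteer, follows_title]
train_rows = [[1, 1, 1], [1, 1, 0], [1, 0, 1], [0, 0, 0], [0, 0, 0], [1, 0, 0]]
train_labels = ["NE", "NE", "NE", "O", "O", "O"]

models = train_ensemble(train_rows, train_labels, split_features(3, 3))
prediction = majority_vote(models, [0, 1, 1])
```

In this toy run each of the three classifiers sees a single feature, so for the token `[0, 1, 1]` the capitalisation model votes `O` while the other two vote `NE`, and the ensemble returns `NE`. A real system would replace the stand-in learner with boosted decision trees and use far larger feature subsets.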