Country wise classification of human names

Authors: Raju Balakrishnan
Affiliations: India Software Lab, IBM™, Bangalore, India
Venue: AIKED'06 Proceedings of the 5th WSEAS International Conference on Artificial Intelligence, Knowledge Engineering and Data Bases
Year: 2006

Abstract

Person names in a country follow a particular statistical trend, and the names of a large set of individuals in a country are drawn from a much smaller set of distinct names. The frequency distribution of person names also differs from country to country. The intuitive human ability to guess a person's country of origin from his or her name rests on these facts. It is therefore possible to design a data mining approach that decides a person's country of origin from the name alone, using the first name and second name as the only independent parameters, and such a tool has a wide range of applications. The problem is nevertheless unexplored, possibly because of its complexity and the lack of information about human names across different countries. In this paper we tackle the problem with two data mining algorithms. First, we apply k-nearest neighbor classification to first names and second names, followed by rule-based decision making. The algorithm is trained and tested on person names from nine countries. This method shows accuracy up to 73% for a set of ten countries. Second, we try an unsupervised method that improves the knowledge base of the system at runtime. This algorithm can effectively handle two scenarios: (1) the training set is small, and (2) the a priori probabilities of the working set are unknown at training time. The method shows accuracy up to 64% for nine countries.
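
The first approach (k-nearest neighbor classification of first and second names, then a rule-based decision) might look roughly like the sketch below. The abstract does not say which distance measure, value of k, or combination rule the paper uses, so everything here is an assumption made for illustration: Levenshtein edit distance as the name distance, k = 3, a rule that prefers the more confident of the two votes and otherwise falls back to the second name, and a tiny invented name list covering two countries. It is not the author's implementation.

```python
from collections import Counter

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (an assumed metric; the abstract names none)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def knn_vote(name, labelled, k=3):
    """Return (country, vote share) among the k labelled names nearest to `name`."""
    nearest = sorted(
        labelled,
        key=lambda pair: edit_distance(name.lower(), pair[0].lower()),
    )[:k]
    votes = Counter(country for _, country in nearest)
    country, count = votes.most_common(1)[0]
    return country, count / len(nearest)

def classify(first, second, first_names, second_names, k=3):
    """Combine the two k-NN votes with a simple, hypothetical rule."""
    f_country, f_share = knn_vote(first, first_names, k)
    s_country, s_share = knn_vote(second, second_names, k)
    if f_country == s_country:
        return f_country
    # Assumed rule: trust the more confident vote; on a tie use the second name.
    return f_country if f_share > s_share else s_country

# Invented toy training data, for illustration only.
FIRST_NAMES = [("Raju", "India"), ("Rajesh", "India"), ("Ravi", "India"),
               ("Hans", "Germany"), ("Heinz", "Germany"), ("Helmut", "Germany")]
SECOND_NAMES = [("Sharma", "India"), ("Verma", "India"), ("Varma", "India"),
                ("Schmidt", "Germany"), ("Schmitt", "Germany"), ("Schmitz", "Germany")]

if __name__ == "__main__":
    print(classify("Rajan", "Sharma", FIRST_NAMES, SECOND_NAMES))
    print(classify("Hanz", "Schmid", FIRST_NAMES, SECOND_NAMES))
```

Classifying the two name parts separately means a query can still get a usable vote from one part when the other is rare or ambiguous. A brute-force scan like this is only practical for small name lists.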