Name-ethnicity classification from open sources

  • Authors:
  • Anurag Ambekar;Charles Ward;Jahangir Mohammed;Swapna Male;Steven Skiena

  • Affiliations:
  • Stony Brook University, Stony Brook, NY, USA;Stony Brook University, Stony Brook, NY, USA;Stony Brook University, Stony Brook, NY, USA;Stony Brook University, Stony Brook, NY, USA;Stony Brook University, Stony Brook, NY, USA

  • Venue:
  • Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
  • Year:
  • 2009

Abstract

The problem of ethnicity identification from names has a variety of important applications, including biomedical research, demographic studies, and marketing. Here we report on the development of an ethnicity classifier where all training data is extracted from public, non-confidential (and hence somewhat unreliable) sources. Our classifier uses hidden Markov models (HMMs) and decision trees to classify names into 13 cultural/ethnic groups, with individual group accuracy comparable to that of earlier binary (e.g., Spanish/non-Spanish) classifiers. We have applied this classifier to over 20 million names from a large-scale news corpus, identifying interesting temporal and spatial trends in the representation of particular cultural/ethnic groups.