Drive-by language identification: a byproduct of applied prototype semantics

  • Authors:
  • Ronald Winnemöller

  • Affiliations:
  • Regional Computer Centre, University of Hamburg, Hamburg

  • Venue:
  • CICLing'10: Proceedings of the 11th International Conference on Computational Linguistics and Intelligent Text Processing
  • Year:
  • 2010

Abstract

Many effective and efficient algorithms for language identification exist, most of them based on supervised n-gram or word-dictionary methods. We propose instead a semi-supervised approach based on prototype semantics. Our method is aimed primarily at noise-rich environments in which only very small text fragments are available for analysis and no training data exists; it can even estimate the probable language affiliation of single words. We have integrated our prototype system into a larger web crawling and information management architecture and evaluated it in an experimental setup that includes datasets in 11 European languages.