Data-driven compound splitting method for English compounds in domain names

  • Authors:
  • Sanjeet Khaitan;Arumay Das;Sandeep Gain;Adithi Sampath

  • Affiliations:
  • Infospace, Bangalore, India;Infospace, Bangalore, India;Infospace, Bangalore, India;Infospace, Bangalore, India

  • Venue:
  • Proceedings of the 18th ACM conference on Information and knowledge management
  • Year:
  • 2009

Abstract

A significant amount of literature is available on compound splitting of long words, albeit for non-English languages, especially European ones. Not surprisingly, there has not been much work for English, as it is not a compounding language like some of its European counterparts. However, Internet domain names are in general compound English words, e.g. "bankofamerica.com". Compound splitting can be effectively employed to extract information from domain names. In this paper, a data-driven learning technique for splitting English compound words is described, which uses features such as normalized frequency, length of parts, and n-grams, among others. The splitting F-measure is higher than that of the published approaches. We applied this technique to a real-life web search application where the queries are mistyped domain names routed through sources like ISPs and browsers. Relevant and meaningful keywords were extracted and shown to the user as a value-added search option. Results show a very high click-through rate and increased commercial value.
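
To make the idea of frequency-based splitting concrete, here is a minimal sketch in Python. It is not the authors' learned model, which also draws on part length and n-gram features that the abstract does not specify. It simply picks the split whose parts have the highest product of normalized unigram frequencies, using dynamic programming. The word counts, the function names `split_compound` and `keywords_from_domain`, and the choice to take the first dot-separated label of the domain are all assumptions made for illustration.

```python
import math

# Hypothetical toy word counts (assumption); a real system would
# estimate these from a large corpus or query log.
WORD_COUNTS = {
    "bank": 500, "of": 5000, "america": 300, "a": 4000, "me": 800,
    "ban": 50, "kof": 1, "rica": 2, "am": 600, "erica": 20,
}
TOTAL = sum(WORD_COUNTS.values())
MAX_WORD_LEN = max(len(w) for w in WORD_COUNTS)


def log_prob(word):
    """Log of the normalized frequency of a known word."""
    return math.log(WORD_COUNTS[word] / TOTAL)


def split_compound(text):
    """Return the best-scoring split of text into known words, or None."""
    n = len(text)
    # best[i] holds (score, parts) for the prefix text[:i], or None
    # if that prefix cannot be covered by known words.
    best = [None] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            part = text[start:end]
            if best[start] is None or part not in WORD_COUNTS:
                continue
            score = best[start][0] + log_prob(part)
            if best[end] is None or score > best[end][0]:
                best[end] = (score, best[start][1] + [part])
    return best[n][1] if best[n] else None


def keywords_from_domain(domain):
    """Split the first label of a domain name into candidate keywords."""
    label = domain.lower().split(".")[0]  # simplification: ignores subdomains
    return split_compound(label)


print(keywords_from_domain("bankofamerica.com"))
```

With these toy counts, the competing splits such as "ban kof ..." or "bank of a me rica" score lower than "bank of america" because each extra or rare part multiplies in a small probability. A string that cannot be covered by the vocabulary yields `None`, which a real application would need to handle, for example by backing off to character n-grams.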