Data-driven compound splitting method for English compounds in domain names

Authors:
Sanjeet Khaitan;Arumay Das;Sandeep Gain;Adithi Sampath
Affiliations:
Infospace, Bangalore, India;Infospace, Bangalore, India;Infospace, Bangalore, India;Infospace, Bangalore, India
Venue:
Proceedings of the 18th ACM conference on Information and knowledge management
Year:
2009

Abstract

A significant amount of literature is available on compound splitting of long words, albeit for non-English languages, especially European ones. Not surprisingly, there has not been much work on English, as it is not a compounding language like some of its European counterparts. However, Internet domain names are in general compound English words, e.g. "bankofamerica.com". Compound splitting can be effectively employed to extract information from domain names. In this paper, a data-driven learning technique for splitting English compound words is described; among other features, it uses normalized frequency, length of parts, and n-grams. The splitting F-measure is higher than that of the published approaches. We applied this technique to a real-life web search application where the queries are mistyped domain names routed through sources such as ISPs and browsers. Relevant and meaningful keywords were extracted and shown to the user as a value-added search option. Results show a very high click-through rate and increased commercial value.
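
To make the idea of frequency-based splitting concrete, the following is a minimal sketch and not the method described in the paper. It assumes a tiny hand-made table of word counts (TOY_COUNTS, with invented values) and picks the segmentation whose parts have the highest total log normalized frequency, using dynamic programming. The names part_score and split_compound, and the penalty for unseen parts, are assumptions made for this illustration; the paper's other features (length of parts, n-grams) and its learning step are not modelled here.

```python
import math

# Toy word counts; hypothetical values chosen only for illustration.
TOY_COUNTS = {
    "bank": 500, "of": 9000, "america": 400,
    "a": 8000, "me": 3000, "an": 4000, "ban": 50, "rica": 5,
}
TOTAL = sum(TOY_COUNTS.values())


def part_score(part):
    """Log of the normalized frequency of a candidate part.

    Parts missing from the table get a heavy penalty that grows with
    their length (an assumption of this sketch).
    """
    count = TOY_COUNTS.get(part)
    if count is None:
        return -20.0 * len(part)
    return math.log(count / TOTAL)


def split_compound(text, max_part_len=20):
    """Return the highest-scoring split of text as a list of parts."""
    n = len(text)
    # best[i] holds (score, start index of the last part) for text[:i].
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_part_len), end):
            score = best[start][0] + part_score(text[start:end])
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers from the end of the string to recover the parts.
    parts = []
    end = n
    while end > 0:
        start = best[end][1]
        parts.append(text[start:end])
        end = start
    return parts[::-1]


if __name__ == "__main__":
    print(split_compound("bankofamerica"))
```

With this toy table, the label "bankofamerica" is split into "bank", "of" and "america", because splits such as "a" + "me" + "rica" contain a rarer part and score lower. A realistic system would replace the toy counts with frequencies estimated from a large corpus.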