A significant amount of literature is available on compound splitting of long words, albeit for non-English languages, especially European ones. Not surprisingly, there has been little work for English, as it is not a compounding language like some of its European counterparts. However, Internet domain names are, in general, compound English words, e.g. "bankofamerica.com". Compound splitting can therefore be employed effectively to extract information from domain names. In this paper, we describe a data-driven learning technique for splitting English compound words that uses, among others, features such as normalized frequency, length of parts, and n-grams. The splitting F-measure is higher than that of the published approaches. We applied this technique to a real-life web search application in which the queries are mistyped domain names routed through sources such as ISPs and browsers. Relevant and meaningful keywords were extracted and shown to the user as a value-added search option. Results show a very high click-through rate and increased commercial value.
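To make the idea of splitting a domain name into keywords concrete, here is a minimal sketch. It is not the learned model described in the abstract; it swaps in a simple unigram-frequency baseline solved by dynamic programming, using only the normalized-frequency feature. The count table `TOY_COUNTS`, the unknown-part penalty, and the function names are all hypothetical and chosen purely for illustration.

```python
import math

# Hypothetical toy unigram counts (an assumption for illustration);
# a real system would estimate these from a large corpus.
TOY_COUNTS = {
    "bank": 500, "of": 9000, "america": 400, "a": 8000, "me": 3000,
    "ban": 50, "rica": 5,
}
TOTAL = sum(TOY_COUNTS.values())


def log_prob(part):
    """Log of the normalized frequency of a candidate part.

    Unknown parts get a penalty proportional to their length
    (an arbitrary assumption, not a value from the paper).
    """
    count = TOY_COUNTS.get(part)
    if count is None:
        return -20.0 * len(part)
    return math.log(count / TOTAL)


def split_compound(text, max_len=12):
    """Return the highest-scoring segmentation of `text`.

    best[i] holds (score, start) for the best split of text[:i],
    where `start` is the index at which the last part begins.
    """
    n = len(text)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            score = best[start][0] + log_prob(text[start:end])
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers from the end to recover the parts.
    parts = []
    end = n
    while end > 0:
        start = best[end][1]
        parts.append(text[start:end])
        end = start
    return parts[::-1]


def keywords_from_domain(domain):
    """Split the first label of a domain name (assumes no 'www.' prefix)."""
    label = domain.lower().split(".")[0]
    return split_compound(label)


print(keywords_from_domain("bankofamerica.com"))
```

With the toy counts above, the call splits "bankofamerica.com" into "bank", "of", and "america". A learned splitter of the kind the abstract describes would presumably combine this frequency signal with further features such as part length and n-gram statistics rather than rely on raw counts alone.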