Learning to tokenize web domains

Authors:
Sriram Srinivasan;Sourangshu Bhattacharya
Affiliations:
Yahoo! Software Development India, Bangalore, India;Yahoo! Labs India, Bangalore, India
Venue:
Proceedings of the 20th international conference companion on World wide web
Year:
2011

References (2):
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning

Chinese segmentation and new word detection using conditional random fields
COLING '04 Proceedings of the 20th international conference on Computational Linguistics

Cited by (1):
Segmenting web-domains and hashtags using length specific models
Proceedings of the 21st ACM international conference on Information and knowledge management

Abstract

Domain Match is an Internet monetization product offered by web companies such as Yahoo!. It displays ads and search results when a user requests a webpage from a domain that does not exist or has no content. The product earns a significant amount of advertising revenue for major internet companies such as Yahoo! and receives millions of queries per day. Domain Match (DM) works by tokenizing the input domains and sub-folders into keywords and then displaying ads and search results retrieved for those keywords. In this poster, we describe a machine learning based solution that automatically learns to tokenize new domains, given a training dataset of domains and their tokenizations. We use positional frequency and parts of speech as features for scoring tokens. Token scores are combined using various scoring models. We compare two ways of training the models: simple gain-function-based training and large-margin training. Experimental results are encouraging.
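To make the tokenization step concrete, the following is a minimal sketch of score-based domain segmentation. It is not the method described in the poster: it swaps in a plain unigram log-frequency score and a dynamic program over split points, whereas the poster uses positional frequency and part-of-speech features with learned scoring models. The names `TOKEN_COUNTS`, `token_score`, and `tokenize`, the toy counts, and the unknown-token penalty are all assumptions made for illustration.

```python
import math

# Hypothetical toy counts; a real system would estimate these from training data.
TOKEN_COUNTS = {
    "cheap": 50, "flights": 40, "flight": 30, "s": 1, "che": 1,
    "ap": 2, "online": 60, "on": 80, "line": 45,
}


def token_score(token, counts, total):
    """Log relative frequency of a known token; unknown tokens are penalised
    in proportion to their length (an assumed, illustrative penalty)."""
    if token in counts:
        return math.log(counts[token] / total)
    return -10.0 * len(token)


def tokenize(domain, counts=TOKEN_COUNTS, max_len=12):
    """Return the highest-scoring split of `domain` into tokens.

    best[i] holds (score, tokens) for the best segmentation of domain[:i];
    a segmentation's score is the sum of its token scores.
    """
    total = sum(counts.values())
    n = len(domain)
    best = [(0.0, [])] + [(-math.inf, [])] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            token = domain[start:end]
            score = best[start][0] + token_score(token, counts, total)
            if score > best[end][0]:
                best[end] = (score, best[start][1] + [token])
    return best[n][1]


print(tokenize("cheapflights"))
print(tokenize("onlineflights"))
```

With these toy counts, the sketch prefers `cheap` + `flights` over splits that use rarer fragments such as `che` or `s`, because each additional low-frequency token lowers the summed score. A learned model of the kind the poster describes would replace `token_score` with a function of its features.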