Web scale NLP: a case study on URL word breaking

Authors:
Kuansan Wang;Christopher Thrasher;Bo-June Paul Hsu
Affiliations:
Microsoft Research, Redmond, WA, USA;Microsoft Research, Redmond, WA, USA;Microsoft Research, Redmond, WA, USA
Venue:
Proceedings of the 20th international conference on World wide web
Year:
2011



Abstract

This paper uses the URL word breaking task as an example to elaborate on what we identify as crucial in designing statistical natural language processing (NLP) algorithms for Web scale applications: (1) rudimentary multilingual capabilities to cope with the global nature of the Web, (2) multi-style modeling to handle the diverse language styles seen in Web content, (3) fast adaptation to keep pace with the dynamic changes of the Web, (4) minimal heuristic assumptions for generalizability and robustness, and (5) the possibility of efficient implementation with minimal manual effort, so that massive amounts of data can be processed at a reasonable cost. We first show that state-of-the-art word breaking techniques can be unified and generalized under the Bayesian minimum risk (BMR) framework that, using a Web scale N-gram, can meet the first three requirements. We discuss how the existing techniques can be viewed as introducing additional assumptions to the basic BMR framework, and describe a generic yet efficient implementation called word synchronous beam search. Testing the framework and its implementation in a series of large scale experiments reveals the following. First, the language style used to build the model plays a critical role in the word breaking task, and the style best suited to URL word breaking appears to be that of the document title, which yields the best performance. Models created from other language styles, such as document body, anchor text, and even queries, exhibit varying degrees of mismatch. Although all styles benefit from increased modeling power, which in our experiments corresponds to the use of a higher order N-gram, the gain is most noticeable for the title model. The heuristics proposed in prior work do contribute to word breaking performance for mismatched or less powerful models, but they are less effective and, in many cases, lead to poorer performance than the matched model with minimal assumptions. For the matched model based on document titles, an accuracy of 97.18% can already be achieved using a simple trigram model without any heuristics.
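To make the task concrete, the following is a minimal sketch of language-model word breaking with beam pruning. It is not the paper's word synchronous beam search or its BMR formulation. It assumes a tiny hypothetical unigram table (`TOY_COUNTS`) in place of a Web scale N-gram, a crude length-based penalty for unseen words, and illustrative names (`log_prob`, `break_words`, `beam_width`, `MAX_WORD_LEN`) that do not come from the paper.

```python
import math

# Hypothetical toy unigram counts; a real system would query a Web scale
# N-gram model (for example, one built from document titles).
TOY_COUNTS = {
    "web": 50, "scale": 20, "word": 40, "breaking": 10, "break": 15,
    "king": 12, "sword": 5, "s": 1, "a": 60, "case": 25, "study": 18,
}
TOTAL = sum(TOY_COUNTS.values())
MAX_WORD_LEN = 12  # longest candidate word considered (an assumption)


def log_prob(word):
    """Log-probability of a word under the toy unigram model."""
    count = TOY_COUNTS.get(word)
    if count is not None:
        return math.log(count / TOTAL)
    # Crude out-of-vocabulary penalty that grows with word length
    # (an assumption made for this sketch only).
    return math.log(1.0 / TOTAL) - 5.0 * len(word)


def break_words(text, beam_width=4):
    """Return the highest-scoring segmentation of `text` that survives pruning."""
    # beams[i] holds up to `beam_width` (score, words) hypotheses
    # that cover exactly the prefix text[:i].
    beams = [[] for _ in range(len(text) + 1)]
    beams[0] = [(0.0, [])]
    for end in range(1, len(text) + 1):
        candidates = []
        for start in range(max(0, end - MAX_WORD_LEN), end):
            word = text[start:end]
            for score, words in beams[start]:
                candidates.append((score + log_prob(word), words + [word]))
        # Keep only the best few hypotheses ending at this position.
        candidates.sort(key=lambda hyp: hyp[0], reverse=True)
        beams[end] = candidates[:beam_width]
    return beams[-1][0][1]


print(break_words("webscalewordbreaking"))
```

Swapping `log_prob` for a higher order N-gram score conditioned on the preceding words would bring the sketch closer to the setting described in the abstract, where the language style and the order of the model drive accuracy.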