This paper uses the URL word breaking task as an example to illustrate what we identify as crucial in designing statistical natural language processing (NLP) algorithms for Web scale applications: (1) rudimentary multilingual capabilities to cope with the global nature of the Web, (2) multi-style modeling to handle the diverse language styles seen in Web content, (3) fast adaptation to keep pace with the dynamic changes of the Web, (4) minimal heuristic assumptions for generalizability and robustness, and (5) the possibility of efficient implementation with minimal manual effort, so that massive amounts of data can be processed at a reasonable cost. We first show that state-of-the-art word breaking techniques can be unified and generalized under the Bayesian minimum risk (BMR) framework which, using a Web scale N-gram, can meet the first three requirements. We discuss how the existing techniques can be viewed as introducing additional assumptions to the basic BMR framework, and describe a generic yet efficient implementation called word synchronous beam search. Testing the framework and its implementation in a series of large scale experiments reveals the following. First, the language style used to build the model plays a critical role in the word breaking task, and the most suitable style for URL word breaking appears to be that of the document title, which yields the best performance. Models created from other language styles, such as document body, anchor text, and even queries, exhibit varying degrees of mismatch. Second, although all styles benefit from increased modeling power which, in our experiments, corresponds to the use of a higher order N-gram, the gain is most recognizable for the title model. Third, the heuristics proposed in prior work do contribute to word breaking performance for mismatched or less powerful models, but they are less effective for the matched model and, in many cases, lead to poorer performance than the matched model with minimal assumptions.
For the matched model based on document titles, an accuracy of 97.18% can already be achieved using a simple trigram model without any heuristics.
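To make the word breaking task concrete, the following is a minimal sketch of a beam search that extends every hypothesis by one word per step, in the spirit of the word synchronous beam search mentioned above. It is not the paper's implementation: the unigram table `TOY_LOGPROB`, its probability values, the unknown-word penalty, `MAX_WORD_LEN`, and the function names are all hypothetical, and the sketch simply returns the highest-scoring segmentation under this toy model rather than applying the BMR formulation or a Web scale N-gram.

```python
import math

# Toy unigram log-probabilities. These values are made up for illustration;
# the paper uses Web scale N-gram models instead.
TOY_LOGPROB = {
    "new": math.log(0.02),
    "york": math.log(0.01),
    "times": math.log(0.01),
    "time": math.log(0.015),
    "s": math.log(0.001),
}
MAX_WORD_LEN = 10  # assumed cap on the length of a candidate word


def word_logprob(word):
    """Score a candidate word; unknown strings get a per-character penalty."""
    if word in TOY_LOGPROB:
        return TOY_LOGPROB[word]
    return math.log(1e-6) * len(word)


def break_words(text, beam_width=8):
    """Return the best-scoring segmentation of text found by the beam search."""
    # A hypothesis is (log score, characters consumed, words so far).
    beam = [(0.0, 0, [])]
    finished = []
    while beam:
        candidates = []
        # Extend every surviving hypothesis by exactly one more word.
        for score, pos, words in beam:
            last_end = min(len(text), pos + MAX_WORD_LEN)
            for end in range(pos + 1, last_end + 1):
                word = text[pos:end]
                hyp = (score + word_logprob(word), end, words + [word])
                if end == len(text):
                    finished.append(hyp)
                else:
                    candidates.append(hyp)
        # Prune: keep only the highest-scoring partial hypotheses.
        candidates.sort(key=lambda h: h[0], reverse=True)
        beam = candidates[:beam_width]
    if not finished:
        return []
    return max(finished, key=lambda h: h[0])[2]


print(break_words("newyorktimes"))
```

With the toy table above, the call segments `newyorktimes` into `new`, `york`, and `times`. Replacing `word_logprob` with a higher order N-gram score conditioned on the preceding words would be the natural next step, since the abstract reports that higher order models improve accuracy.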