This paper introduces Netspeak, a Web service that assists writers in finding appropriate expressions. To provide statistically relevant suggestions, the service indexes more than 1.8 billion n-grams, n ≤ 5, along with their occurrence frequencies on the Web. When in doubt about a wording, a user can pose a query with wildcards inserted at the positions where she feels uncertain. Queries define patterns for which a ranked list of matching n-grams, along with usage examples, is retrieved. The ranking reflects the occurrence frequencies of the n-grams and informs about both absolute and relative usage. Given this choice of customary wordings, one can easily select the most appropriate; second-language speakers in particular can learn about style conventions and language usage. To guarantee response times within milliseconds, we have developed an index that takes occurrence probabilities into account, allowing for biased sampling during retrieval. Our analysis shows that the substantial speedup obtained with this strategy (a factor of 68) comes without significant loss in retrieval quality.
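The core idea of wildcard queries over a frequency-ranked n-gram table can be sketched as follows. This is a minimal, hypothetical illustration, not Netspeak's actual implementation: the toy table, its frequencies, and the restriction of the wildcard `?` to exactly one word are all assumptions for the sake of the example, whereas the real service indexes ~1.8 billion n-grams and supports a richer query syntax.

```python
import re

# Hypothetical toy n-gram table (n-gram -> assumed Web occurrence frequency).
# The real index holds more than 1.8 billion n-grams, n <= 5.
NGRAMS = {
    "waiting for you": 45200,
    "waiting on you": 8100,
    "waiting at you": 40,
}

def query(pattern, table=NGRAMS):
    """Match a wildcard pattern ('?' stands for exactly one word) against the
    n-gram table and return hits ranked by occurrence frequency, together with
    their relative usage among the matches (absolute and relative usage)."""
    # Turn the pattern into a regex: escape it, then let each '?' match one word.
    regex = re.compile("^" + re.escape(pattern).replace(r"\?", r"\S+") + "$")
    hits = [(ngram, freq) for ngram, freq in table.items() if regex.match(ngram)]
    total = sum(freq for _, freq in hits) or 1
    ranked = sorted(hits, key=lambda h: -h[1])
    return [(ngram, freq, freq / total) for ngram, freq in ranked]
```

For the pattern `"waiting ? you"`, all three toy entries match; the ranking puts `"waiting for you"` first with roughly 85% relative usage, which is the kind of signal a writer uses to pick the customary wording.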