Predicting the readability of short web summaries

Authors:
Tapas Kanungo; David Orr
Affiliations:
Yahoo! Labs, Santa Clara, CA; Yahoo! Labs, Santa Clara, CA
Venue:
Proceedings of the Second ACM International Conference on Web Search and Data Mining
Year:
2009

Abstract

Readability is a crucial presentation attribute that web summarization algorithms consider when generating a query-biased web summary. Readability is also an important component of real-time monitoring of commercial search-engine results: recent studies show that the readability of web summaries affects clickthrough behavior, and thus user satisfaction and advertising revenue. The standard approach to measuring readability is to collect a corpus of random queries and their corresponding search result summaries, have a human judge the readability of each summary, and report the average readability score. This process is time-consuming and expensive. Moreover, manual evaluation cannot be used within the real-time summary generation process. In this paper we propose a machine learning approach to the problem. We use a corpus of the kind described above and extract summary features that we think may characterize readability. We then estimate a model (a gradient boosted decision tree) that predicts human judgments given the features. This model can then be used in real time to estimate the readability of new (unseen) web search summaries, and it can also be used in the summary generation process. We present results on approximately 5000 editorial judgments collected over the course of a year and show examples where the model predicts the quality well and where it disagrees with human judgments. We compare the results of the model to previous models of readability, most notably Collins-Thompson-Callan, Fog, and Flesch-Kincaid, and find that our model shows substantially better correlation with editorial judgments, as measured by Pearson's correlation coefficient. The learning algorithm also provides the relative importance of the features used.
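The baseline comparison described in the abstract can be illustrated with a minimal sketch. It computes two of the classical readability formulas named above (the Flesch-Kincaid grade level and the Gunning Fog index) and Pearson's correlation coefficient, which could then be applied to a list of formula scores and a matching list of editorial judgments. This is not the authors' implementation: the function names, the regular-expression tokenizer, and the vowel-group syllable counter are assumptions made for illustration, and neither the Collins-Thompson-Callan model nor the gradient boosted decision tree is reproduced here.

```python
import math
import re


def count_syllables(word):
    """Crude heuristic (an assumption): one syllable per run of vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def text_stats(text):
    """Return (sentence count, word count, per-word syllable counts)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = [count_syllables(w) for w in words]
    # Treat text without terminal punctuation as a single sentence.
    return max(1, len(sentences)), len(words), syllables


def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level; higher means harder to read."""
    n_sent, n_words, syllables = text_stats(text)
    if n_words == 0:
        return 0.0
    return 0.39 * n_words / n_sent + 11.8 * sum(syllables) / n_words - 15.59


def gunning_fog(text):
    """Gunning Fog index; 'complex' words have three or more syllables."""
    n_sent, n_words, syllables = text_stats(text)
    if n_words == 0:
        return 0.0
    n_complex = sum(1 for s in syllables if s >= 3)
    return 0.4 * (n_words / n_sent + 100.0 * n_complex / n_words)


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)
```

Given a list of summaries and a parallel list of human readability scores, one would call `pearson([gunning_fog(s) for s in summaries], scores)` to obtain a baseline correlation. Both formulas score difficulty, so a negative correlation with readability judgments would be the expected sign.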