On measuring the lexical quality of the web

Authors:
Ricardo Baeza-Yates; Luz Rello
Affiliations:
Yahoo! Research & Web Research Group, Universitat Pompeu Fabra, Barcelona, Spain; NLP & Web Research Groups, Universitat Pompeu Fabra, Barcelona, Spain
Venue:
Proceedings of the 2nd Joint WICOW/AIRWeb Workshop on Web Quality
Year:
2012

Abstract

In this paper we propose a measure for estimating the lexical quality of the Web, that is, the representational aspect of textual web content. Our lexical quality measure is based on a small corpus of spelling errors, and we apply it to English and Spanish. We first compute the correlation of our measure with web popularity measures to show that it gives independent information, and then we apply it to different web segments, including social media. Our results shed light on the lexical quality of the Web and show that authoritative websites have several orders of magnitude fewer misspellings than the overall Web. We also present an analysis of the geographical distribution of lexical quality across English- and Spanish-speaking countries, as well as how this measure changes over about one year.
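
The abstract does not state the formula behind the measure. As a rough illustration only, the sketch below assumes one plausible form: the mean, over a small set of word pairs, of the ratio between the number of pages containing a misspelling and the number containing its correct spelling. The function name `lexical_quality`, the word pairs, and the page counts are all hypothetical and are not taken from the paper.

```python
from statistics import mean


def lexical_quality(counts):
    """Mean ratio of misspelled-form hits to correct-form hits.

    `counts` maps (misspelling, correct) -> (hits_misspelled, hits_correct).
    This ratio-based form is an assumption made for illustration;
    the abstract itself does not define the measure.
    """
    ratios = [
        wrong / right
        for wrong, right in counts.values()
        if right > 0  # skip pairs whose correct form never occurs
    ]
    if not ratios:
        raise ValueError("no usable word pairs")
    return mean(ratios)


# Hypothetical page counts, invented for this example only.
sample = {
    ("recieve", "receive"): (50, 1000),
    ("seperate", "separate"): (30, 1000),
}
score = lexical_quality(sample)  # lower values mean fewer misspellings
```

Under this assumed form, a score near zero would indicate that misspellings are rare relative to correct spellings in the sampled web segment, which makes scores comparable across segments of very different sizes.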