On measuring the lexical quality of the web

  • Authors:
  • Ricardo Baeza-Yates; Luz Rello

  • Affiliations:
  • Yahoo! Research & Web Research Group, Universitat Pompeu Fabra, Barcelona, Spain; NLP & Web Research Groups, Universitat Pompeu Fabra, Barcelona, Spain

  • Venue:
  • Proceedings of the 2nd Joint WICOW/AIRWeb Workshop on Web Quality
  • Year:
  • 2012

Abstract

In this paper we propose a measure for estimating the lexical quality of the Web, that is, the representational aspect of the textual web content. Our lexical quality measure is based on a small corpus of spelling errors, and we apply it to English and Spanish. We first compute the correlation of our measure with web popularity measures to show that it gives independent information, and then we apply it to different web segments, including social media. Our results shed light on the lexical quality of the Web and show that authoritative websites have several orders of magnitude fewer misspellings than the overall Web. We also present an analysis of the geographical distribution of lexical quality across English- and Spanish-speaking countries, as well as how this measure changes over a period of about one year.