Lexical quality as a proxy for web text understandability
Proceedings of the 21st international conference companion on World Wide Web
In this paper we propose a measure for estimating the lexical quality of the Web, that is, the representational aspect of textual web content. Our lexical quality measure is based on a small corpus of spelling errors, and we apply it to English and Spanish. We first compute the correlation of our measure with web popularity measures to show that it provides independent information, and then we apply it to different web segments, including social media. Our results shed light on the lexical quality of the Web and show that authoritative websites have several orders of magnitude fewer misspellings than the overall Web. We also present an analysis of the geographical distribution of lexical quality across English- and Spanish-speaking countries, as well as how this measure changes over about one year.
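A minimal sketch of the kind of measure the abstract describes: score a text by the fraction of tokens that match a small list of known misspellings. The misspelling list, tokenization, and scoring formula here are illustrative assumptions, not the paper's actual corpus or definition.

```python
import re

# Tiny illustrative misspelling list; the paper relies on a curated
# corpus of spelling errors, which this stands in for.
MISSPELLINGS = {"recieve", "teh", "definately", "occured"}

def lexical_quality(text: str) -> float:
    """Return 1 - (misspelled tokens / total tokens).

    1.0 means no known misspellings were found; lower values
    indicate poorer lexical quality.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 1.0
    errors = sum(1 for t in tokens if t in MISSPELLINGS)
    return 1.0 - errors / len(tokens)
```

For example, `lexical_quality("I will recieve teh package")` tokenizes to five words, two of which are in the misspelling list, giving a score of 0.6. A per-site score like this can then be compared against popularity measures or aggregated by country, as the abstract outlines.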