Testability and validity of WCAG 2.0: the expertise effect
The Web Content Accessibility Guidelines (WCAG) 2.0 separate testing into "machine" and "human" audits, and further classify human testability as "reliably human testable" or "not reliably testable"; it is human testability that is the focus of this paper. We investigated the likelihood that "at least 80% of knowledgeable human evaluators would agree on the conclusion" of an accessibility audit, in order to estimate the percentage of success criteria that can be described as reliably human testable and the percentage that cannot. We recruited twenty-five experienced evaluators to audit four pages for WCAG 2.0 conformance; the pages were chosen to differ in layout, complexity, and accessibility support, yielding a small but varied sample. We found that 80% agreement between experienced evaluators almost never occurred: average agreement was in the 70--75% range, while the error rate was around 29%. Trained but novice evaluators performing the same audits showed the same level of agreement as the experienced ones, but a 6--13% reduction in validity; the validity an untrained evaluator would attain can only be conjectured. Expertise appears to improve the ability to avoid false positives by 19%. Finally, pooling the results of two independent experienced evaluators would be the best option, capturing at most 76% of the true problems while producing only 24% false positives; any other independent combination of audits would achieve worse results. This means that an 80% agreement target is not attainable when audits are conducted without communication between evaluators, even by experienced evaluators working on pages similar to those used in this experiment; that the error rate even for experienced evaluators is relatively high; and that untrained accessibility auditors, be they developers or quality testers from other domains, would do much worse.
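To make the pooling idea concrete, the following is a minimal, hypothetical Python sketch (not the paper's analysis code) of how combining two evaluators' reported problems by union trades sensitivity against false positives. The problem IDs, evaluator names, and counts are invented for illustration only; they do not reproduce the study's data or its 76%/24% figures.

from itertools import combinations

# Ground truth: the set of real accessibility problems on a page
# (hypothetical IDs for illustration).
true_problems = {f"p{i}" for i in range(10)}

# Each evaluator reports a set of problems: some real (p*), some
# spurious (x* = false positives). All sets are invented.
reports = {
    "eval_A": {"p0", "p1", "p2", "p3", "p4", "x1"},
    "eval_B": {"p2", "p3", "p4", "p5", "p6", "x2"},
    "eval_C": {"p0", "p4", "p5", "p7", "x1", "x3"},
}

def validity(found):
    """Share of reported problems that are real (1 - false-positive share)."""
    return len(found & true_problems) / len(found)

def sensitivity(found):
    """Share of real problems that were captured."""
    return len(found & true_problems) / len(true_problems)

# Pool every pair of independent evaluators by taking the union of
# their reports: sensitivity can only grow or stay equal, but the
# pooled report also accumulates both evaluators' false positives.
for (a, fa), (b, fb) in combinations(reports.items(), 2):
    pooled = fa | fb
    print(f"{a}+{b}: captures {sensitivity(pooled):.0%} of true problems, "
          f"{1 - validity(pooled):.0%} of reports are false positives")

Under these assumed numbers, any pooled pair captures more true problems than either evaluator alone, which mirrors why pooling two independent experienced evaluators is the best option reported above, while the accumulated false positives explain why adding further independent audits eventually does worse.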