Online labor markets, such as Amazon's Mechanical Turk (MTurk), provide an attractive platform for conducting human subjects experiments: the relative ease of recruitment, low cost, and diverse pool of potential participants enable larger-scale experimentation and a faster experimental revision cycle than lab-based settings. However, because the experimenter gives up direct control over the participants' environments and behavior, concerns about the quality of data collected in online settings are pervasive. In this paper, we investigate the feasibility of conducting online performance evaluations of user interfaces with anonymous, unsupervised, paid participants recruited via MTurk. We implemented three performance experiments to re-evaluate three previously well-studied user interface designs, and conducted each experiment both in the lab and online with participants recruited via MTurk. The analysis of our results did not yield any evidence of significant or substantial differences between the data collected in the two settings: all statistically significant differences detected in the lab were also present on MTurk, and the effect sizes were similar. In addition, there were no significant differences between the two settings in raw task completion times, error rates, consistency, or the rates at which participants adopted the novel interaction mechanisms introduced in the experiments. These results suggest that MTurk may be a productive setting for conducting performance evaluations of user interfaces, providing a complementary approach to existing methodologies.