Intelligent assistants are handling increasingly critical tasks, but until now, end users have had no way to systematically assess where their assistants make mistakes. For some intelligent assistants, this is a serious problem: if the assistant is doing work that is important, such as assisting with qualitative research or monitoring an elderly parent's safety, the user may pay a high cost for unnoticed mistakes. This paper addresses the problem with WYSIWYT/ML (What You See Is What You Test for Machine Learning), a human/computer partnership that enables end users to systematically test intelligent assistants. Our empirical evaluation shows that WYSIWYT/ML helped end users find assistants' mistakes significantly more effectively than ad hoc testing. Not only did it allow users to assess an assistant's work on an average of 117 predictions in only 10 minutes, it also scaled to a much larger data set, assessing an assistant's work on 623 out of 1,448 predictions using only the users' original 10 minutes' testing effort.
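The abstract reports what WYSIWYT/ML achieves but not how it decides which of an assistant's predictions a user should examine first. As a rough, minimal sketch only (the prioritization criterion, function names, and data layout below are illustrative assumptions, not the authors' implementation), one way such a human/computer partnership could spend a 10-minute testing budget is to surface the predictions the classifier is least confident about and record the user's verdicts as a simple "testedness" measure:

    # Minimal sketch: rank an assistant's predictions for user testing by how
    # uncertain the classifier is about them (least confident first).
    # All names and the confidence-based criterion are assumptions for
    # illustration; they are not taken from the paper.
    from typing import Dict, List, Tuple

    Prediction = Tuple[str, str, float]  # (item_id, predicted_label, confidence)

    def prioritize_for_testing(predictions: List[Prediction],
                               budget: int) -> List[Prediction]:
        """Return the `budget` predictions the user should check first.
        Least-confident predictions are the most likely mistakes, so they
        get the user's limited attention first."""
        return sorted(predictions, key=lambda p: p[2])[:budget]

    def record_judgment(coverage: Dict[str, bool],
                        item_id: str, is_correct: bool) -> None:
        """Record the user's verdict; the aggregate acts as a rough measure
        of how much of the assistant's output has been tested."""
        coverage[item_id] = is_correct

    # Example: a short testing session covers only a slice of the data set.
    preds = [("msg-1", "work", 0.93), ("msg-2", "personal", 0.51),
             ("msg-3", "work", 0.78)]
    to_check = prioritize_for_testing(preds, budget=2)   # msg-2, then msg-3
    coverage: Dict[str, bool] = {}
    record_judgment(coverage, "msg-2", is_correct=False)

In this reading, the scaling result in the abstract (623 of 1,448 predictions assessed from 10 minutes of effort) would come from the system generalizing the user's recorded judgments to untested predictions, rather than from the user checking each one by hand.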