Abstract
Think-aloud protocols are a highly valued usability testing method for identifying usability problems. Despite their value, think-aloud sessions are often time-consuming and labor-intensive to analyze. Consequently, previous research has urged the community to develop techniques that support fast-paced analysis. In this work, we took a first step toward designing and evaluating machine learning (ML) models that automatically detect usability problem encounters from users’ verbalizations and speech features in think-aloud sessions. Recent research shows that subtle patterns in users’ verbalizations and speech features tend to occur when they encounter problems, so we examined whether these patterns can be used to improve the automatic detection of usability problems. We first conducted and recorded think-aloud sessions and then examined how different input features, ML models, test products, and users affect the detection of usability problem encounters. Our work uncovers several technical and user interface design challenges and sets a baseline for automating usability problem detection and integrating such automation into UX practitioners’ workflows.
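As a rough illustration of the idea of combining verbalization and speech features to flag problem encounters, the sketch below trains a tiny logistic-regression classifier on toy think-aloud segments. It is not the models or features evaluated in this work: the cue words, the segment keys (`text`, `pitch_hz`, `words_per_sec`), the scaling constants, and the training data are all hypothetical and chosen only to make the example self-contained.

```python
import math

# Hypothetical cue words; the abstract does not list the verbalization features used.
CUE_WORDS = ("hmm", "not", "where", "confused")


def featurize(segment):
    """Turn one think-aloud segment into a numeric feature vector.

    `segment` is a dict with hypothetical keys: 'text' (transcript),
    'pitch_hz' (mean pitch) and 'words_per_sec' (speech rate).
    """
    tokens = segment["text"].lower().split()
    cue_counts = [float(tokens.count(word)) for word in CUE_WORDS]
    # Scale the speech features roughly into [0, 1] so no single feature dominates.
    speech = [segment["pitch_hz"] / 300.0, segment["words_per_sec"] / 5.0]
    return cue_counts + speech


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, epochs=500, lr=0.5):
    """Fit logistic regression with plain stochastic gradient descent."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = sigmoid(sum(w * v for w, v in zip(weights, xi)) + bias)
            err = pred - yi
            weights = [w - lr * err * v for w, v in zip(weights, xi)]
            bias -= lr * err
    return weights, bias


def predict(model, segment):
    """Return True if the segment is classified as a problem encounter."""
    weights, bias = model
    x = featurize(segment)
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) >= 0.5


# Toy labelled segments (1 = problem encounter, 0 = no problem); invented for illustration.
TRAIN = [
    ({"text": "hmm where is the button", "pitch_hz": 210.0, "words_per_sec": 1.5}, 1),
    ({"text": "i am not sure i am confused", "pitch_hz": 220.0, "words_per_sec": 1.4}, 1),
    ({"text": "ok i click next and continue", "pitch_hz": 180.0, "words_per_sec": 3.0}, 0),
    ({"text": "now i type my name here", "pitch_hz": 175.0, "words_per_sec": 3.2}, 0),
]

model = train_logistic(
    [featurize(segment) for segment, _ in TRAIN],
    [label for _, label in TRAIN],
)
```

Calling `predict(model, segment)` on a new segment dict returns a boolean flag. A realistic pipeline would replace the hand-picked cue words with learned text representations and evaluate on held-out users and products rather than on the training segments.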