Automatic Detection of Usability Problem Encounters in Think-aloud Sessions

Published: 30 May 2020

Abstract

Think-aloud protocols are a highly valued usability testing method for identifying usability problems. Despite their value, analyzing think-aloud sessions is often time-consuming and labor-intensive. Consequently, previous research has urged the community to develop techniques to support fast-paced analysis. In this work, we took the first step toward designing and evaluating machine learning (ML) models that automatically detect usability problem encounters based on users’ verbalizations and speech features in think-aloud sessions. Inspired by recent research showing that subtle patterns in users’ verbalizations and speech features tend to occur when they encounter problems, we examined whether these patterns can be used to improve the automatic detection of usability problems. We first conducted and recorded think-aloud sessions and then examined the effect of different input features, ML models, test products, and users on the detection of usability problem encounters. Our work uncovers several technical and user interface design challenges and sets a baseline for automating usability problem detection and integrating such automation into UX practitioners’ workflows.

      • Published in

        ACM Transactions on Interactive Intelligent Systems  Volume 10, Issue 2
        June 2020
        155 pages
        ISSN:2160-6455
        EISSN:2160-6463
        DOI:10.1145/3403610
        Copyright © 2020 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 30 May 2020
        • Online AM: 7 May 2020
        • Revised: 1 February 2020
        • Accepted: 1 February 2020
        • Received: 1 May 2019

        Qualifiers

        • research-article
        • Research
        • Refereed
