DOI: 10.1145/3290605.3300356
research-article

How Data Science Workers Work with Data: Discovery, Capture, Curation, Design, Creation

Published: 02 May 2019

ABSTRACT

With the rise of big data, there has been an increasing need for practitioners in this space and an increasing opportunity for researchers to understand their workflows and design new tools to improve them. Data science is often described as data-driven, comprising unambiguous data and proceeding through regularized steps of analysis. However, this view focuses more on abstract processes, pipelines, and workflows, and less on how data science workers engage with the data. In this paper, we build on the work of other CSCW and HCI researchers in describing the ways that scientists, scholars, engineers, and others work with their data, through analyses of interviews with 21 data science professionals. We set five approaches to data along a dimension of interventions: data as given, as captured, as curated, as designed, and as created. Data science workers develop an intuitive sense of their data and processes, and actively shape their data. We propose new ways to apply these interventions analytically, to make sense of the complex activities around data practices.


Published in

CHI '19: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems
May 2019
9077 pages
ISBN: 9781450359702
DOI: 10.1145/3290605

Copyright © 2019 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery
New York, NY, United States


Acceptance Rates

CHI '19 Paper Acceptance Rate: 703 of 2,958 submissions, 24%
Overall Acceptance Rate: 6,199 of 26,314 submissions, 24%
