Data Science as an Interdiscipline: Historical Parallels from Information Science

dsj-2023-016


INTRODUCTION
Data science has been called the "sexiest job of the 21st century" (Davenport & Patil 2012). Data science has also been extensively critiqued by scholars across numerous fields. One particularly vivid critique labels data science as "machinic Neoplatonism," stating that data science techniques encourage and enable thoughtlessness in the context of decision-making and societal analysis (McQuillan 2018a). Other commentary on the nature of data science is similarly divergent. Data science has been characterized as being little more than statistics relabeled (Statistics Views 2013) while also being characterized as encompassing almost every kind of science (Fowler 2015).
Within this bubbling commentary, considerable debate can be found on almost every facet of data science. The one thing that most commentators agree on, however, is that data science must be characterized as having an interdisciplinary and/or metadisciplinary nature. To be doing data science, according to almost every description, one must be pulling tools, skills, algorithms, concepts, or data from multiple disciplinary or methodological frameworks. As one of the above quoted scholars phrased it, "The imaginary ideal data scientist is a Renaissance figure with a mastery of all these arts," referring to programming, statistics, mathematics, and data visualization, among other skills (McQuillan 2018a: 255). Looking across various academic and popular press descriptions of data science, significant differences can be found in the characterizations of the appropriate mélange of skills and tools that constitute a "data scientist." But the fundamental interdisciplinary nature of data science (the fact that people who do data science cross or transcend traditional disciplinary boundaries, tools, and methods) seems to be a consensus view.
This paper aims to build a lens for understanding the diversity, complexity, and interdisciplinarity or metadisciplinarity of data science by drawing lessons from the history of information science and its precursors. This analysis highlights the historical parallels between the emergence of data science in the 21st century and the emergence and evolution of information science over the past 100 years to provide insight into interdisciplinary challenges facing data science as a professional and academic endeavor.
This comparison is particularly timely. Debates about the disciplinary status of data science are growing within government, corporate, and higher education institutions. A number of recent consensus reports have been written to help shape the present and future of data science (Berman et al. 2016; National Academies of Sciences, Engineering, and Medicine [NASEM] 2017; NASEM 2018).
Disciplines are social, organizational, and institutional constructs that often emerge around nascent problems or topics where resources like funding and students are in a growth period (Jacobs 2013; Hammarfelt 2019). Such is certainly the case for data science.
The term interdisciplinarity can be a linguistic stand-in for modern, creative, and/or progressive ways of working (Madsen 2017). As such, interdisciplinarity can be a way of working and/or a way of talking. This tension is one manifestation of how interdisciplinarity can encompass many different things (Huutoniemi et al. 2010). Because of this variation in meanings, "good interdisciplinary work requires a strong degree of epistemological reflexivity" (Klein 1996: 214). Epistemological reflexivity may have value in moving data science toward becoming a "critical technical practice" (Agre 1997), that is, an area of work that actively examines and engages with its own limitations and inherent challenges. People working within and around information science have repeatedly debated the relative merits and drawbacks of interdisciplinarity and metadisciplinarity throughout the past century and up to the present (Borko 1968; Bates 1999; Buckland 2012; Madsen 2016). This paper begins by characterizing data science as an inter- and metadiscipline by highlighting a number of key features of recent research and professional work in the area. I then depict similar characteristics of information science and finish with a discussion of the following set of questions related to the interdisciplinary pros and cons of current data science:

METHODOLOGICAL APPROACH
This paper is based on a review of the literature in the information and data sciences related to interdisciplinarity. Many of the sources used in the characterizations below are personal narratives that present the perspective of a single individual. In the case of data science, some are white papers, blog posts, or opinion papers. In the case of information science, many relevant sources are papers published in peer-reviewed journals by prominent people in the field, including scholars, educators, and administrators. Any single perspective among these voices may have particular limitations or biases. Taken together, however, such sources prove extremely valuable for tracking the evolution of interdisciplinary research areas (Klein 1996). Personal narratives serve as indicators of the ways that particular issues were discussed at different points in time. These personal narratives have been compared and contrasted with relevant research papers appearing in peer-reviewed journals discussing the nature of information science and data science as disciplines and professions.
The method for gathering relevant peer-reviewed materials for this paper included systematic queries of article databases such as the Web of Science and Google Scholar for articles related to "information science," "data science," "interdisciplin*," and "metadisciplin*." These sources are useful to find relevant materials related to information science given the long history of the topic area, but they are less useful for tracing longer-term developments relevant to data science given its nascent development as a named entity. For example, as of February 10, 2023, the Web of Science Core Collection returns 14,448 total results when searching for the phrase "data science," of which only 26 date from 2009 or earlier. Of these, 19 are spurious hits, and 5 are book reviews or news items. Only 2 peer-reviewed articles discuss data science in a way that is close to current understandings: Cleveland (2001), discussed below, is a foundational article for the statistical aspects of data science. Mezey et al. (2001: 375) discuss a number of aspects of database analysis that "provide challenging tasks and opportunities for data science" but otherwise do not directly discuss data science itself. As a point of comparison, the Google NGram viewer, which quantifies usage of particular words or phrases across the Google Books corpus, shows almost zero use of the term "data science" through 2008, but there is significant year-over-year growth in use of the phrase since 2009 (https://books.google.com/ngrams/).
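The counts reported above can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the figures are those given in the text, while the variable names and category labels are shorthand introduced here and are not part of the original search procedure.

```python
# Counts reported in the text for the Web of Science Core Collection
# phrase search "data science" (as of February 10, 2023).
# Variable names and category labels are shorthand for this sketch.
total_results = 14_448
early_results = 26  # results dated 2009 or earlier

early_breakdown = {
    "spurious hits": 19,
    "book reviews or news items": 5,
    "peer-reviewed articles close to current understandings": 2,
}

# The three categories should account for every early result.
assert sum(early_breakdown.values()) == early_results

early_share = early_results / total_results
print(f"Share of results from 2009 or earlier: {early_share:.2%}")
for label, count in early_breakdown.items():
    print(f"{label}: {count} of {early_results}")
```

On these figures, results from 2009 or earlier make up well under 1 percent of the total, which illustrates why database queries alone are of limited use for tracing the earlier development of data science.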
Notably, as of February 2023, the Web of Science does not index any journals that include the term "data science" in the title. Thus, an additional method for finding relevant articles was to directly investigate journals that focus on data science but are not indexed by the Web of Science, either by using Google Scholar or by visiting the journals' websites and examining issues. Some specific journals that were investigated in this fashion are described further in the section that follows. Another method for finding materials relevant to this article's discussion, perhaps the most valuable, was citation chaining. Once a relevant article was found, following citation networks both forward and backward in time frequently resulted in the discovery of more relevant articles.
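Citation chaining of the kind described above can be thought of as a bounded traversal of a citation network in both directions. The sketch below is a hypothetical illustration and not the procedure actually used for this paper, which was carried out by hand; the function name, the `max_hops` parameter, and the toy data with placeholder labels are all assumptions made for the example.

```python
from collections import deque


def citation_chain(seeds, cites, max_hops=2):
    """Collect works reachable from the seeds by following citations
    backward (works a paper cites) and forward (works that cite it)."""
    # Build the forward direction by inverting the "cites" mapping.
    cited_by = {}
    for paper, references in cites.items():
        for ref in references:
            cited_by.setdefault(ref, set()).add(paper)

    found = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        paper, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        neighbors = set(cites.get(paper, ())) | cited_by.get(paper, set())
        for other in neighbors - found:
            found.add(other)
            queue.append((other, hops + 1))
    return found


# Toy citation data with placeholder labels, invented for illustration only.
# Each key cites the works in its list.
toy_cites = {
    "B": ["A"],
    "C": ["B"],
    "D": ["C"],
    "E": ["A"],
}
print(sorted(citation_chain(["B"], toy_cites, max_hops=1)))
```

In practice, each newly found item would still be judged for relevance before being followed further, a step that this sketch leaves out.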

DATA SCIENCE
At the time of this writing, historical literature related to data science is scant. The best chronological depictions of the development of data science are found in Cao (2017) and Press (2013). Both articles illustrate how the trajectory of what we now call data science can be dated back at least 50 years, encompassing developments in data analytics and visualization, statistics, database design, and other topics. Phillips (2019) points to other trends related to data gathering and analytics that extend back a century or more. The phrase data science, however, has only been in use for about 20 years. In this section, I highlight particular developments in the past two decades related to the emergence of data science, focusing on characterizations of data science's boundaries and participants.

CONCEPTIONS OF DATA SCIENCE
The recent growth of data science has been stimulated by the large volumes and varieties of data being made public on the internet via the explosive growth in digital technologies, such as personal computers, cell phones, social media, smart devices, and sensor networks. As the generation of data by these technologies has increased, the need for methods of storing, accessing, analyzing, and presenting data has also increased. Data science has emerged as a panoply of techniques, tools, and skills that can be applied to derive value (economical or intellectual) out of the growing piles of data. The concomitant need for people with skills to work in these areas within the commercial and public sectors has also been a significant driver of the growth of data science (Manyika et al. 2011).
From different points of view, data science can be viewed as (1) a proto-discipline, (2) a toolkit of analytical pipelines and platforms, (3) a bundle of transformative forces at work inside and outside the academy (Carson et al. 2016), or even (4) "a community of practice of data-driven scientists of whatever scientific discipline they ask questions about" (Scheider et al. 2020: 8). Statistician David Donoho's (2017) recent paper, "50 Years of Data Science," provides a useful starting point for this discussion about the scope of data science. This paper has been highly cited since its initial publication, and Donoho is a prominent figure in many discussions of data science. Donoho describes his view of six divisions of "data science" activity:
1. Data gathering, preparation, and exploration
2. Data representation and transformation
3. Computing with data
4. Data modeling
5. Data visualization and presentation
6. Science about data science
Donoho explicitly excludes an engineering component from his typology, namely, the activities involved in building systems to effectively deal with data, move data, and distribute data at different scales. Other commentators, however, call out infrastructure development as a core component of data science and computational work in general (Blanchette 2012; Fox & Hendler 2014; Gray, Gerlitz and Bounegru 2018).
As Donoho acknowledges, his typology builds on earlier works by the prominent statisticians Chambers (1993) and Cleveland (2001), who stimulated the idea of a "data science" by urging the field of statistics to broaden its focus beyond its traditional emphasis on theoretical analyses. Cleveland's paper, for example, lays out a precursor to Donoho's data science typology and depicts how statistics curricula could be expanded to better train people as "data scientists." The Journal of Data Science, launched in 2003 by two statisticians, provides a publication venue for statistically focused data science research very much in line with Cleveland's call, focusing on the applications of statistical methods in a variety of contexts. Many prominent articles about data science, however, including Donoho's, have been published in core statistics journals.
At around the same time as Cleveland's call for data science within the field of statistics, data science was also becoming a named entity in other sectors. CODATA, the Committee on Data of the International Council for Science (ICSU), published the first issue of its Data Science Journal in 2002. CODATA was established in 1966 "to promote throughout the world the evaluation, compilation and dissemination of data for science and technology and to foster international collaboration in this field" (CODATA 2012). The six founders of CODATA included chemists, physicists, and an engineer. The Data Science Journal was formed to facilitate the dissemination of scholarly work on topics related to the committee. The launch of the journal was also specifically motivated by disciplinary aspirations. As stated in a retrospective on the 45th anniversary of CODATA, "A journal gives identity to a discipline" (Lide & Wood 2012: 55, italics in original). The first editor of the journal, F. Jack Smith, outlined his view of the key topics of interest within the new discipline and journal:
… the study of the capture of data, their analysis, metadata, fast retrieval, archiving, exchange, mining to find unexpected knowledge and data relationships, visualization in two and three dimensions including movement, and management. Also included are intellectual property rights and other legal issues. (Smith 2006: 163)
As of 2023, the scope of the Data Science Journal had not varied significantly from Smith's initial focus (Rumble 2023; Smith 2023).
Mayernik Data Science Journal DOI: 10.5334/dsj-2023-016
The Data Science Journal's emphases were largely disjoint from Donoho's typology of data science and the goals of the aforementioned Journal of Data Science.
Some of the Data Science Journal's areas of emphasis fall into the engineering category that Donoho acknowledges but does not include in his typology, but some others are much further afield, such as the Data Science Journal's mention of legal issues related to data.
Data science is often depicted as a nexus of certain kinds of skills. Drew Conway's (2010) data science Venn diagram is a commonly referenced visualization for this view, in which data science is depicted as the amalgamation of (1) math and statistics knowledge, (2) "hacking skills," and (3) "substantive knowledge," referring to knowledge within a particular disciplinary specialization. Conway is careful to note that this Venn diagram is intended to apply to data science broadly, not necessarily any specific data scientist (Conway 2018). But others, such as Davenport and Patil (2012), take this view further by stating that computer science and statistical expertise are the defining features that distinguish a data scientist from a traditional scientist. Blei and Smyth (2017) also note this distinction between data scientists and "domain scientists," but they emphasize that the two groups should be partners (or integrated) whenever possible:
Crucially, the data scientist solves the problem iteratively and collaboratively with the domain expert. (We note they do not need to be two different people; the data scientist and domain expert could simply be two "hats" for the same person). (Blei & Smyth 2017: 8691)
This conceptual separation of regular (or domain) science from data science is in fact necessary for the data scientist to exist as a distinct type of person (Ribes 2018). There would be no need to create a new label like data scientist if there were no conceptual or practical distinction between what a data scientist does and what a typical researcher would be doing within chemistry, astronomy, or meteorology.
While the tools and methods used are one notable distinction, another could be that data scientists are expected to be able to apply their skills to data regardless of the disciplinary focus of those data. In other words, in the characterization of the above authors, data scientists are expected to be able to work with data for which they have no specific disciplinary training (Feder 2016), while domain scientists are only expected to be able to work with data from within their own discipline, such as chemistry, astronomy, or meteorology.

KEY CHARACTERISTICS OF DATA SCIENCE AS AN INTER- AND METADISCIPLINE
In looking at recent discussions of the trajectory of data science, three key issues related to interdisciplinarity repeatedly manifest: (1) the diversity in participants and communities, (2) the diffuse and contested boundaries of data science, and (3) the debated disciplinary status of data science. This section expands on these points.

The diversity in participants and communities
It is clear that data science, however bounded, is a topic area that encompasses many participants and communities and involves people with a multiplicity of skills and backgrounds. The statistics-centric view emphasizes the need for data scientists to be knowledgeable about data representation, transformation, modeling, and visualization. The data management and engineering conception of data science spans computational infrastructure building, metadata development, data retrieval and archiving, and intellectual property regimes for software and data products. Data science also reaches into every societal sector, including government, industry, nonprofit organizations, and higher education (Cao 2017; Carter & Sholler 2016). Many people who could be characterized as data scientists, however, do not fully identify as such, as noted by a recent survey of data scientists in academia, in which many respondents "somewhat" identified as a data scientist (Geiger et al. 2018).

The diffuse and contested boundaries of data science
With this diversity of people involved, few individuals follow the same path into the field. A former editor for the Data Science Journal hoped that the journal could serve as "a saloon for data scientists and experts in other fields" (Iwata 2008). As such, the boundaries between data science and other fields are porous. Commentators have drawn parallels between data science and numerous other disciplines, ranging from statistics and information science to computer science (Mattmann 2013) and journalism (Keegan 2016).
The diffuse and contested boundaries of data science manifest clearly as departments and schools jockey for position to own data science within academic institutions. Educational programs for data science are blooming, albeit in highly heterogeneous ways, which makes identifying any broad trends in curriculum development problematic (Wing et al. 2018). The US National Academies of Sciences, Engineering, and Medicine report on data science undergraduate curricula provides little closure around what should or should not be part of data science education (NASEM 2017: 33). The summary lists nine central conceptual areas within the scope of data science and asks, "Which key components should be included in data science curriculum, both now and in the future? How could these components be prioritized or best conveyed for differing types of data science programs?" The report does not attempt to answer these questions directly.
De Veaux et al. (2017), on the other hand, define an undergraduate curriculum for data science in great detail, encompassing mathematical and statistical components, as well as data modeling, description, and curation. Their proposed curriculum also includes a significant emphasis on communication, reproducibility, and data ethics. The EDISON Framework likewise breaks data science curricula into a number of competency areas, specifically (1) data analytics, (2) data engineering, (3) data management, (4) research methods and project management, and (5)
Many of the curriculum topics listed in the previous paragraph are already being taught within statistics, engineering, computer science, and information science programs. Data science students and instructors alike have diverse backgrounds, and it is common for instructors to be active practitioners, not tenured faculty (Kross & Guo 2019). One model is to create data science institutes as distinct entities while drawing faculty from multiple existing campus departments (Moore-Sloan Data Science Environments 2018). These institutes provide forums for building coalitions of faculty, student interest, and financial investments and provide testing grounds for broader data science undertakings across a campus (Carson et al. 2016).
Significant diversity exists, however, in how data science has been instituted within university structures. An intensive study conducted by the University of California, Berkeley, assessed 16 different options for providing organizational support for data science, including forming new schools or colleges, creating new divisions within existing schools, creating programs that are spread across multiple schools, and creating new research units or centers (Carson et al. 2016). Many universities, for-profit companies, and nonprofit organizations have also started online data science courses and certification programs (Fox et al. 2015; Pournaras 2017; Cao 2018b; Bezuidenhout et al. 2021). These online programs have been able to reach much larger numbers of students, including populations beyond the typical undergraduate student. These programs are responding to the need to scale up the number of graduates to meet employment demands in the private and public sectors (Miller & Hughes 2017).
The debated disciplinary status of data science
All these factors contribute to the contested disciplinary status of data science (Stodden 2020). Defining data is itself an area of active scholarly research, though mostly by philosophers and information scientists (Floridi 2005; Borgman 2015; Leonelli 2015; Furner 2017; Hjørland 2018). Many discussions of data science that are otherwise very comprehensive, such as Donoho (2017), Cao (2017), and the EDISON Project (2017), do not engage with the fundamental question of defining the core concept of the emerging field. Nonetheless, numerous definitions of data can be found, ranging from disciplinary or technology-centric perspectives to abstract conceptualizations (Furner 2016). The ubiquity of the concept of data, combined with its elusiveness, frames the ongoing debates about the formalization of data science as a discipline.
Here it is important to note the distinction between (a) a formally defined discipline and (b) sets of people who are interested in, working on, or conducting research related to a particular topic or phenomenon. The latter kinds of groups, which might be characterized from different points of view as "invisible colleges," "epistemic communities," or "communities of practice" (De Solla Price & Beaver 1966; Knorr-Cetina 1991; Wenger 1998), encompass groups of people who are connected via social and/or intellectual networks but may have different formal disciplinary affiliations. This distinction is important in relation to the question about whether the goals of data science should be to develop a "science with data" or a "science of data" (Fox & Hendler 2014). In the next section, I return to the question of the degree to which data science as a discipline will encompass broader areas of research that focus on data as a phenomenon of interest.

DISCUSSION
This section presents a discussion of the literature review and works through the three central questions of the paper in detail, focusing on the comparison between data science and information science. Table 1 presents high-level parallels between data science today and current and past information science. As shown in the table, the notion that there is an explosion of information and data that is outpacing our ability to manage, use, and understand them is not new to the "big data" or "data science" era. Rhetoric of "information overload" has been used to motivate new developments in information and data management techniques at least as far back as the early 20th century (Day 2009).
Information science became a distinct disciplinary and professional label in the 1960s. The prehistory of information science, however, centers on international efforts in the first half of the 20th century that focused on "documentation," the initial predominant name for the topic (Farkas-Conn 1990). After World War II ended, the interest and activity related to information and documentation increased dramatically. Large volumes of wartime research reports were released into the public domain after the war. In addition, the victorious Allied forces seized a huge number of government documents from Nazi Germany and other Axis countries (Richards 1994). The challenge of organizing these documents stimulated interest and activity in documentation, information organization, and information retrieval. Information and intelligence work related to the growing Cold War with the Soviet Union likewise stimulated growth in information research and professionalization (Johnson 2017; Burke 2018). Many organizations undertook information-related work during this time, and the number of information workers grew rapidly, including many scientists who encountered information work during the war (Burke 2007).
The information science educational ecosystem expanded through the 1970s, often (though not exclusively) through programs based in library schools. The library and information sciences coalesced enough during this period for a number of specializations to become prominent, if somewhat disconnected (White & McCain 1998). The following decades saw a serious retrenchment of the educational landscape, as over 20 library and information science schools closed or went through administrative realignments in the 1980s and 1990s (Ostler, Dahlin & Willardson 1995;Hildreth & Koenig 2002). This retrenchment slowed in the 2000s as the internet emerged as a social and technological phenomenon, causing renewed interest in information within governments, universities, and the private sector. In 2005, in the midst of the dot-com period, the "iSchool" caucus was formed by a group of nine library and information science schools (Larsen 2008). As of mid-2020, it contained 114 members across six tiers of membership. The iSchool membership is diverse, intellectually and programmatically. Some schools retained strong connections to the earlier information science focus areas, but many differ substantially from what an information science school looked like in prior decades (Wiggins & Sawyer 2012;Craig Finlay, Ni & Sugimoto 2018).
Throughout the past century, information science has demonstrated the same characteristics analyzed above for data science: (1) diversity in participants and communities, (2) diffuse and contested boundaries, and (3) debated disciplinary status. As shown in Table 1, as long as information science and its precursors have existed, there has been a diverse and clumpish mix of participants. This diversity of participants and intellectual approaches has provided a constant source of new ideas and contributors within information science, but it has inevitably engendered boundary arguments about what the discipline should (or should not) include. During the 1960s, information science emerged as a contested space, and it has continued to face boundary disputes to this day (Kline 2004;Burnett & Bonnici 2013).
Articulating and negotiating the unique value and niche of information science within the ecosystem of constitutive and related disciplines has been an ongoing challenge (Van House & Sutton 1996;Cox et al. 2012;Bates 2015). The diverse and evolving sets of participants and ongoing boundary challenges have repeatedly engendered debates about the disciplinary status of information science. Many commentators within these debates have noted that interdisciplines face continuous struggles to achieve power and legitimacy inside academic and government institutions that favor (implicitly or explicitly) traditional disciplines. Some question the wisdom of arguing for the field explicitly by championing its interdisciplinary nature (Buckland 2012;King 2006). In part, ongoing challenges in articulating the common thread(s) within information science stem from the elusiveness of information as a topic. Definitions and characterizations of the concept of information abound by individuals inside and outside of information science (Cornelius 2002;Capurro & Hjørland 2003).
What can be drawn from the parallels between the ongoing evolution of information science and the emergence of data science? The disciplinary ecosystems of the two fields are not identical. Marchionini (2023) argues that information science can be considered an academic discipline because it has developed distinct "principles, key research questions, and communities of practice that have given rise to subspecialties, professional standards, curricula and degrees; whereas data science at present consists of a set of techniques that have arisen out of allied fields such as statistics, computer science, and information science and is driven by applications and problems from a variety of endeavors of modern life." To build insight from this comparison, the following sections discuss three key questions regarding the future of data science. The intention behind these questions is to identify issues that are either already important in the data science landscape or are likely to become important in the near future. For the stakeholders involved in data science, there is benefit in discussing how these debates can be turned into productive discussions rather than having them manifest as impediments going forward.

1. WHAT WILL BE THE FOCAL POINTS AROUND WHICH "DATA SCIENCE" AND ITS STAKEHOLDERS COALESCE?
Given the general vagueness described above around the conceptualization of data within data science, why has the term data emerged as the focal point for this conglomeration of activity? Here, the historical comparison to information science may shed light. Information superseded documentation as the central concept of the field in the 1950s, but a considerable body of work since that time has argued that other concepts provide more theoretically robust entities, including "documents" (Frohmann 2004), "literatures" (White & McCain 1998), "relevance" (Furner 2004), or, more recently, people and their use of networked computers (Shaw 2019).
What then holds information as the central concept of the field? Certainly, the formalization of information science in the 1950s and 1960s was due in large part to the success of Shannon's "information theory" within the fields of mathematics and electrical signal processing (Shannon & Weaver 1949). Information theory, as developed by Shannon and many others (Aspray 1985), provided conceptual metaphors of information "senders," "channels," and "receivers" that persist within information science research and education to this day (Day 2000; Gorichanaz 2017). Perhaps equally important, the success of information theory brought attention and resources to the study of information. Governments, private foundations, and for-profit companies invested in a wide range of information-focused research during the postwar period (Geoghegan 2008). As such, it is tempting to attribute the movement from documentation to information science to status seeking, that is, adopting the term information to align preexisting bodies of work under the documentation label with emergent and highly prestigious research focused on information (Spang-Hanssen 1970/2001). Such alignments are inevitable and are certainly happening today in the movement to data science. But the information concept provides more than just status. The vagueness of the term provides affordances in how it can be used and understood. As Agre (1995) illustrated, "information" provides a neutral term that enables research and professional communities to make broad intellectual territorial claims without overaligning to any particular technology, institution, or knowledge area.
Many of these same characteristics can be seen in the centering of data within data science, namely, foundational metaphors, pragmatic alignment with trending topics, and a vagueness that enables broad territorial claims. Data certainly comes with a foundational metaphor, detailed by Rosenberg (2013) and Frické (2019), that is at the root of work in many fields: data being that which underlies facts, evidence, truth, and information. Rosenberg, Frické, and others (cf. Borgman 2015) point out conceptual problems with this metaphor, but it undoubtedly remains strong in most sectors of academia and society. On a practical level, the label data also serves as a pragmatic sign of alignment with emergent and resource-heavy research areas, such as big data, the Internet of Things, and social media analytics. Finally, like information, the term data lacks conceptual baggage that would tie it to any specific technology, institution, or knowledge area. This characteristic is at the root of the "data science as metadiscipline" commentary, namely, that its techniques (whether data organization, processing, analytics, or visualization) can have application in almost any setting.
As such, information and data serve a number of functions, even if they lack unifying conceptual clarity. Their conceptual vagueness presents both benefits and drawbacks. As Hjørland (2013) describes, both centripetal and centrifugal forces exist with regard to the formation of a coherent discipline based on such a diffuse topic. Even as this tension has caused ongoing practical and institutional challenges for people involved in the study of information, it has stimulated considerable intellectual advancement related to the understanding of information as a conceptual and theoretical entity. Whether the development of data science stimulates similar advances in the understanding of the concept of data remains a question for future research.
In its current formative period, data science is perhaps most coherent as a platter of methods and tools, not as a grouping of research or professional areas. Data science projects tend to be distinguished by the kinds of tools and methods used, not the disciplinary topic on which they are working (Saltz, Shamshurin & Connors 2017). As vividly shown in the "Periodic Table of Data Science" (Willems 2017), data scientists may engage with a variety of tools for data collection, cleaning, processing, analysis, archiving, and distribution, including programming languages like Python and R, frameworks for analyzing large data sets like Apache Hadoop and Pangeo, and machine learning and artificial intelligence approaches like decision trees and neural networks (Gil 2017; Ma 2023). As with any discipline, some specializations within data science will have little crossover with each other. It remains to be seen, however, whether data science specializations will continue to be structured around particular tool sets and methods, or whether particular theoretical developments, topical interests, or social problems will become more prominent focal points.

CAN DATA SCIENCE STAKEHOLDERS USE THE LACK OF DISCIPLINARY CLARITY AS A STRENGTH?
This leads to the next point of discussion. Conceptual advances in the understanding of data as a foundational concept are taking place, but as noted above, this work is largely being conducted by non-data scientists. This is one demonstration of the blurriness of the boundaries around data science. Boundaries between disciplines are always blurry. Scholars tend to interact most closely with people who work on similar topics and/or with similar methods, regardless of their disciplinary affiliation. Porous boundaries mean that new participants with diverse backgrounds will move into or across the field. This will inevitably lead to rediscovery or reinvention of particular ideas or approaches and periodic circularity in the topics of current interest. This trend has been noted by many prominent information scientists (Herner 1984; Soergel 1999; Bates 2004).
Such reinvention and circularity can stem from a lack of knowledge of historical predecessors, but it is also reflective of the ongoing nature of many information and data challenges. Some problems reemerge repeatedly, despite the best efforts of many experts. Within information science, it has long been known that information organization and retrieval methods that once worked well will break down if not regularly revisited due to changes in how language is used across space and time (Shera 1970). Data scientists encounter such circularity as they attempt to standardize data within and across organizations, leading to the well-documented fact that a significant portion of recurring data science work involves data wrangling and cleaning (Beaton et al. 2017; Kross & Guo 2019; Keller et al. 2020).
For information science, these characteristics have been viewed as problems that limit the field from gaining status within the broader ecosystem of academic disciplines (Hjørland 2013). In contrast, sociologist Jerry Jacobs (2013) has argued that innovation in the face of diffuse boundaries is what ensures the vitality of disciplines over time. Attempting to demark disciplinary boundaries is counterproductive when the grounds to claim such boundaries are uneven, as is the case with information and data science (Burnett & Bonnici 2013). Disciplinary boundaries are important to demark educational, professional, and funding institutions, but focusing too much on the need to form and define disciplinary boundaries implies a discourse of "weakness," which can cause unnecessary and repetitious debates about how to make the discipline stronger (Madsen 2016).
For data science, embracing porous boundaries by evincing openness to new ideas and people could be a means for continually refreshing the field and for broadening the diversity of data science participants generally. There are positive movements in this direction already. The Academic Data Science Alliance, for example, was created in part to advocate "for justice, diversity, equity, and inclusion of all backgrounds and lived experiences in data science and more broadly in academia" (Academic Data Science Alliance 2023). In another example, the development of the CARE principles has been an important motivator and signpost in bringing Indigenous voices into data-focused discussions (Carroll et al. 2020). This set of principles outlines key approaches to working with any data related to Indigenous peoples or communities, namely, that there should be collective benefit for the relevant Indigenous communities, that the Indigenous communities should have authority to control their data, that there is a responsibility to engage respectfully with Indigenous communities regarding any data collection or use, and that ethics (of the researchers and Indigenous people) should inform data use (Carroll et al. 2021). This has been an important addition to the discourse around data science over the past decade.
The extensive debates on these boundary issues have not "solved" the problems of interdisciplinarity within information science over the past 50-plus years. Views on extant or desired boundaries are inevitably dependent on one's viewpoint and will thus evolve in concert with the participants involved. But as noted in the discussion of question 1 above, these debates have been highly generative intellectually within information science. Studies focused on data and data science may be able to take a lesson from this duality, namely, that though the recurrence of such debates will cause frustration and occasional points of circular argumentation, new voices adding to these discussions have the potential to significantly advance understanding of the nature of data as a foundational concept.

Finally, understanding data in all of its facets requires coupling (or at least embracing) multiple kinds of research, development, and analytical methods. Buckland (2012) has argued that the ability to be "methodologically versatile" must be a calling card for studies of data and information. Some data phenomena can only be studied via statistical methods, while other phenomena can best be studied via engineering, bibliometric, survey, or ethnographic research methods. Qualitative and quantitative methods used in complementary ways may be more effective in solving data-related problems than either type of method individually (Aragon et al. 2022). Versatility to shift between or combine these methods allows those who work in interdisciplinary areas to be flexible in the face of new societal and technical developments (Szostak 2013). Because of this need for methodological flexibility and versatility, information or data science programs that emphasize only a single methodological approach may be less resilient over time.
The openness to new voices and the methodological versatility described in this section are particularly critical as society becomes ever more data driven. As of this writing, in mid-2020, questions about data are at the center of national and international politics (use of social media data for targeted election advertising), public health (COVID-19 disease data gathering, sharing, and analysis), and global environmental change (measurements and projections of climate change). Twenty years ago, Saracevic (1997: 20) noted that "contemporary information problems are too important to be left to any one discipline." To paraphrase this for today, contemporary data problems may be too important to be gathered under any one discipline or professional group.

CAN DATA SCIENCE FEED INTO AN "EMPOWERING PROFESSION"?
Given the broad importance of data within a range of societal sectors, there are many calls for data science stakeholders to embrace ethics as a core competency (Floridi & Taddeo 2016; Bowne-Anderson 2018; Cao 2018a). Machine learning, in particular, is under considerable scrutiny as a tool that can be used for ethically questionable purposes (McQuillan 2018b). These techniques may also produce unethical results, even with good intentions (Poirier 2021). If, as noted above, data scientists often work with "domain experts" or stakeholders in a client-like relationship, this connection with ethics will manifest on a day-to-day basis through questions about data bias, reliability, integrity, and quality. Instead of trying to deal with this issue indirectly by building better data products (e.g., visualizations or representations), data scientists have an opportunity to embrace the idea of becoming an "empowering profession" (Maack 1997). "Empowering professions" promote their clients' growth and competence. They do not withhold information or stand behind a bulwark of "expertise" in limiting what is shared with a client.
This does not mean that data scientists should be trying to train everybody else to be data scientists. Instead, it suggests that data scientists could promote data literacy as a means toward the personal empowerment of the people that they work with (Monroe-White 2022). "Data literacy" in this context refers to enabling clients to understand that the products of a data science project (whether machine learning outputs or data infrastructure developments) come with certain embedded assumptions, limitations, and ethical concerns. It also refers to helping clients understand that data are embedded within particular situational and relational contexts, both when collected and when analyzed (Wilkerson & Polman 2019). This also would involve finding ways to ensure transparency and interpretability of the outcomes of data science workflows, particularly when they are used for decision-making (Stoyanovich & Lewis 2019).
A move toward empowerment would seek to understand information and data in relation to concepts like vulnerability, trust, autonomy, and agency and would work to support people in approaching the use of technology, documents, information, and data from their own cultural viewpoints, personal interests, and social settings (Srinivasan 2017). As an example, Pierre (2019) studied social media use by children and showed how digital technologies serve as sources of social support, self-expression, and self-assurance, as much as (or more so than) tools for information, data, or knowledge creation. Data scientists are inevitably political and ethical actors, even if they do not intend or desire to be (Green 2018). Open research questions remain about the extent to which the data science profession embraces political, ethical, and empowerment-focused research agendas and professional norms. This may be critical to the future development of the field, as empowering clients-that is, enabling them to better understand their own data, and what can (or cannot) be done with them, without needing the data scientist to shepherd every step-will help data science to build a reputation for trustworthiness and social responsibility.

CONCLUSION
Information and data work must draw from multiple conceptual and practical domains. Understanding and using information and data involves articulations between people, their societies and institutions, and the technologies they create and use (Rayward 1996). Because of the vague and contested nature of data as a central organizing concept, new people, institutions, and technologies will continually enter the ecosystem of data science, engendering continual discussion of its disciplinary status. One response to this dynamic is to push for formalization of a discipline and profession with agreed-upon curricula, skills, and professional responsibilities. This may result in periodic stabilizations of disciplinary characteristics, but the discussion within this paper suggests that the definition of data science and its boundaries will be a source of contestation and debate for the foreseeable future. Similar issues have manifested in information science for a century or more and continue to be points of debate today.
Understanding the potential pitfalls in focusing too much effort on disciplinary formalization is critical for data science moving forward. Creating a discipline involves significant institutional work of establishing social and organizational support structures (Lenoir 1997). The new technologies and analytical methods depicted in Table 1, such as microfilm photography, early digital computers, the internet, and social media, emerged out of, and into, an interconnected web of social institutions (Agre 2003). If data science is to continue to grow as a distinct entity, attendant institutions must likewise be developed. These may include formally organized entities, such as professional associations or caucuses of academic programs. But institutional development also encompasses the emergence of professions and professional norms of conduct, processes for governing standards and tools, and the development of consortia that mediate institutional interactions (Mayernik 2016; Cutcher-Gershenfeld et al. 2017).
Many current initiatives assume that solidifying the boundaries around data science is possible or desirable. Examining these kinds of assumptions is central to building data science to be a "critical technical practice" (Agre 1997). The stakeholders who are developing the present and future of data science will need to examine the relative merits of embracing porous boundaries and methodological versatility, and they will have to deal with reinvention and circularity of central topics. Leaders in the many data science communities will also have to address whether ethics and empowerment are central to strengthening the foundation of the emerging field and associated set of professional roles. Finally, over time, funders, universities, and professional leaders will need to identify the kinds of institutional developments that will make data science more robust when it encounters the inevitable societal and technological changes of the next few decades.