Education for Real-World Data Science Roles (Part 2): A Translational Approach to Curriculum Development

This study reports on the findings from Part 2 of a small-scale analysis of requirements for real-world data science positions and examines three further data science roles: data analyst, data engineer and data journalist. The study examines recent job descriptions and maps their requirements to the current curriculum within the graduate MLIS and Information Science and Technology Masters Programs in the School of Information Sciences (iSchool) at the University of Pittsburgh. From this mapping exercise, model ‘course pathways’ and module ‘stepping stones’ have been identified, as well as course topic gaps and opportunities for collaboration with other Schools. Competency in four specific tools or technologies was required by all three roles (Microsoft Excel, R, Python and SQL), as well as collaborative skills (with both teams of colleagues and with clients). The ability to connect the educational curriculum with real-world positions is viewed as further validation of the translational approach being developed as a foundational principle of the current MLIS curriculum review process.

Received 20 October 2015 ~ Accepted 24 February 2016

Correspondence should be addressed to Liz Lyon, School of Information Sciences, University of Pittsburgh. Email: elyon@pitt.edu

An earlier version of this paper was presented at the 11th International Digital Curation Conference.

The International Journal of Digital Curation is an international journal committed to scholarly excellence and dedicated to the advancement of digital curation across a wide range of sectors. The IJDC is published by the University of Edinburgh on behalf of the Digital Curation Centre. ISSN: 1746-8256. URL: http://www.ijdc.net/

Copyright rests with the authors. This work is released under a Creative Commons Attribution (UK) Licence, version 2.0. For details please see http://creativecommons.org/licenses/by/2.0/uk/

International Journal of Digital Curation 2016, Vol. 11, Iss. 2, 13–26. DOI: 10.2218/ijdc.v11i2.417 (http://dx.doi.org/10.2218/ijdc.v11i2.417)

Introduction and Context
This paper reports on the findings of a study that is framed as an analysis of requirements for real-world data science positions. The study is the outcome of an exploration of current and future curriculum developments within the graduate MLIS program in the School of Information Sciences (iSchool) at the University of Pittsburgh. The study examines a suite of data science roles based on recent job descriptions and maps their requirements to the current curriculum. The study was conducted in two parts, each using the same methodology. Part 1 investigated three specific data science roles: data librarian, data archivist and data steward, and the findings have been reported elsewhere (Lyon, Mattern, Acker and Langmead, 2015). The current study forms Part 2 of the analysis and examines three further data science roles: data analyst, data engineer and data journalist.

We address the three research questions also explored in Part 1:
1. What are the skills, competencies, knowledge, experience and education required for the distinct data science roles?
2. How do these data science role requirements map to current curriculum topics and course offerings?
3. What opportunities emerge for new collaborations and partnerships in developing the data science curriculum?


Literature Review
The challenges in developing workforce capacity and capability for data science and data stewardship have been well-documented (Bakhshi, Mateos-Garcia and Whitby, 2014; BRDI, 2015), with an acknowledged data talent gap identified. In particular, there are new curriculum components associated with the range of emerging data science roles. A distinctive approach that draws on translational principles (Woolf, 2008) has been applied to data science education, 'from iSchool to marketplace' (Lyon and Brenner, 2015), and is adopted in both parts of this study. This recognises the need for higher education providers in the data science arena to take a pragmatic and market-aware view to ensure continuing relevance and compatibility with current workforce demands. Prior commentary on the three data science roles explored in this study highlights the different perspectives on their associated tasks and skills; this commentary includes perspectives on building data science teams (Patil, 2011), a brief review of three data science careers (Lee, 2014), an articulation of data scientist vs data analyst (Rivera and Haverson, 2014) and data scientists vs data engineers (Walker, 2013) and a handbook about data journalism (Gray, Bounegru and Chambers, 2011).
Consideration of data science roles from an educational perspective was addressed by Stanton, Palmer, Blake and Allard (2012) reporting on a workshop; they discussed the concept of a 'T-shaped professional' where broad data knowledge is complemented by deep knowledge in one of three areas (Data Curation, Analytics/Visualisation/Preservation, Networks/Infrastructure). An 'I-shaped' model was also proposed, which included domain knowledge at the base. The paper explores educational models and recommends a continuing education model beginning with an undergraduate degree (e.g. Computer Science, Information Science, Applied Statistics or Mathematics) or a graduate degree (e.g. MLIS). The student then moves on to acquire domain knowledge through an internship or on-the-job experience.

Methodology
The methodology applied was based on the qualitative workflow described in detail in Part 1 of this job analysis study (Lyon, Mattern, Acker and Langmead, 2015), and comprised the use of keyword searching of job banks to locate and select ten positions within the specified timeframe (i.e. the last 12 months) in each of the three data roles. The job bank used in Part 2 was indeed.com and the postings are listed in the Appendix. This step was followed by a content analysis of the job descriptions using a coding scheme for five categories: a) Education: academic qualifications; b) Experience: direct hands-on practice; c) Knowledge: understanding of/familiarity with topics/subjects/issues; d) Skills: ability to do an action well; e) Competencies: proficiency with specific tools/technologies/programming languages.
The requirements were identified, sorted and examined for patterns across the three roles. We designated requirements that appeared in at least three of the positions as 'Key Requirements.' The next step was to consider the graduate courses provided within the Masters in Library and Information Science (MLIS) Program and also by the Information Science and Technology Program in the School of Information Sciences, University of Pittsburgh, during academic year 2015-2016 to determine which options would support the requirements indicated in the job descriptions. From this mapping exercise, we were able to identify model 'course pathways' and module 'stepping stones'; it also informed our approach to meeting employer expectations in preparing iSchool students for real-world positions. The requirements mappings enabled the identification of course topic gaps and highlighted opportunities for collaboration with other Schools.
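As an illustration only, the 'Key Requirements' threshold and the requirement-to-course mapping described above can be sketched in a few lines of Python. The postings, codes and course names below are hypothetical placeholders rather than data from the study, and the analysis reported here was a manual qualitative content analysis, not a scripted one.

```python
from collections import Counter

# "At least three of the positions", as stated in the methodology.
KEY_THRESHOLD = 3


def key_requirements(postings, threshold=KEY_THRESHOLD):
    """Return the coded requirements found in at least `threshold` postings.

    `postings` is a list of sets; each set holds the (category, requirement)
    codes assigned to one job description during content analysis.
    """
    counts = Counter()
    for codes in postings:
        counts.update(set(codes))  # count each requirement once per posting
    return {req for req, n in counts.items() if n >= threshold}


def map_to_courses(requirements, course_topics):
    """Map each requirement to the courses covering it (empty list = gap)."""
    return {
        req: sorted(c for c, topics in course_topics.items() if req in topics)
        for req in requirements
    }


# Hypothetical coded postings (the study sampled ten per role).
postings = [
    {("Competencies", "SQL"), ("Skills", "writing")},
    {("Competencies", "SQL"), ("Competencies", "R")},
    {("Competencies", "SQL"), ("Skills", "writing"), ("Other", "background check")},
    {("Skills", "writing"), ("Competencies", "R")},
]

# Hypothetical course catalogue: course name -> requirements it supports.
course_topics = {
    "Hypothetical Course A": {("Competencies", "SQL"), ("Competencies", "R")},
    "Hypothetical Course B": {("Competencies", "SQL")},
}

keys = key_requirements(postings)
mapping = map_to_courses(keys, course_topics)
gaps = sorted(req for req, courses in mapping.items() if not courses)

print("Key requirements:", sorted(keys))
print("Course topic gaps:", gaps)
```

In this sketch a requirement with no supporting course is reported as a gap, which corresponds to the 'course topic gaps' identified through the mapping exercise.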

Results
Firstly, we note the large number of positions available in these three job categories at the point of sampling in October 2015. This is in stark contrast with at least one of the roles, the data archivist, which was analysed earlier in 2015 in Part 1 of this study. The majority of the positions in the sample were located within the private sector and came from a mix of large corporate businesses and smaller companies. There were relatively few positions within universities or other public sector bodies.
Detailed mappings of the requirements for the three roles in each of the five categories listed above are presented in Tables 1, 3 and 5. Note that competency in four specific tools or technologies was required by all three roles: a) Microsoft Excel, b) R, c) Python and d) SQL. Collaborative skills were also highlighted as a requirement in each of the three roles. Position requirements for the three roles referenced the ability to work well with both teams of colleagues and with clients; the latter reflects the business/corporate nature of the employers. In addition, three of the categories (Experience, Knowledge and Competencies) were found to have overlapping content within the job descriptions, i.e. the categories were blurred with no clear delineation between them. The results are therefore presented based on the best semantic matching (e.g. 'understanding of…' was interpreted as 'Knowledge of…'). A sixth category ('Other') was introduced to include additional requirements that did not fall under any of the five categories listed previously. An example is 'Security Requirements' and these are referenced under the appropriate role.

Data Analyst
The Data Analyst jobs seek candidates with a Bachelor's degree, but with no consistently specified subject domain. There was little emphasis on education within the job requirements (Table 1). In contrast, experience working as an analyst or in data analysis was repeatedly highlighted as a Key Requirement.
A broad range of additional experience appears frequently in the narrative of the job descriptions, including experience with data management, data acquisition or sourcing, and statistical work. There is also relatively limited emphasis on knowledge requirements for Analyst positions, though relevant domain knowledge was cited in some job descriptions. In contrast, a relatively broad range of skills was listed, with particular emphasis on writing, attention to detail and accuracy; time management and collaborative skills were also required. Whilst a range of competencies was specified, the primary requirement was for expertise with data analysis software tools such as R, SAS, Alteryx and Stata. In the additional Other Requirements category, we observed 'Background verification check' as a security requirement for some positions, but this was not designated a Key Requirement (i.e. it occurred in fewer than three Data Analyst positions).
Our recommended course pathways through the MLIS and Information Science and Technology Masters Programs for a prospective Data Analyst include the essential and desirable course stepping stones listed in Table 2. Additional courses from other University of Pittsburgh schools and departments are proposed.

Data Engineer
The Data Engineer positions seek candidates with a Bachelor's degree in Computer Science, Mathematics, Statistics or Information Systems as a preferred domain (Table 3). A degree in Business or Information Technology was specified in some positions. In other positions a Masters degree was preferred or an Advanced Certificate in areas such as Agile Systems, Big Data or Data Science. The three Key Requirements for experience for these positions were core data engineering/data processing/data warehousing or ETL (Extract/Transform/Load) capability. There was also an emphasis on data at scale (i.e. large IT implementations or large amounts of raw data). This experience is frequently quantified and is a primary requirement of these roles. However, there was little focus on knowledge requirements, beyond business intelligence and database technologies.
The ability to work collaboratively was highlighted in many positions, alongside written communication skills and the ability to solve problems or trouble-shoot in the working environment. Whilst a broad range of technical competencies was listed, there was a strong focus on Hadoop/MapReduce and associated technologies, such as Hive and Pig. Non-relational databases were also cited, including MongoDB and neo4j, accompanying requirements for a selection of programming and scripting languages.
In the additional Other Requirements category, we observed 'TS/SCI Polygraph' and 'Background verification check' as security requirements for some positions, but once again these were not designated as Key Requirements, as they occurred in fewer than three Data Engineer positions.

Data Journalist
The Data Journalist positions seek candidates with similarly substantive and quantified experience as a journalist or reporter (Table 5). Experience with statistical work, data visualization or graphics was also listed. However, there is no specific education requirement beyond a Bachelor's degree. Stated knowledge requirements are rare, although mathematics or statistics or a particular domain area relevant to the position were listed in some job descriptions. The skills requirements reflected those observed in the other two positions: oral and written communication skills, collaborative skills and an attention to detail. Time management/ability to meet deadlines was also a Key Requirement. The widest range of competencies was observed for the Data Journalist roles, encompassing programming and scripting languages, visualisation and graphics software, cartographic or mapping tools, web authoring, data analysis packages and database query methods (SQL). In the additional Other Requirements category, we observed that employers desired the submission of a portfolio, via either clippings or a link to a web-published portfolio. This was not designated as a Key Requirement as it occurred in fewer than three Data Journalist positions.
Our recommended course pathways through the MLIS and Information Science and Technology Masters Programs for a prospective Data Journalist include the essential, desirable and additional course stepping stones listed in Table 6.

Discussion
Whilst this is a modest study, the methodology has been effective in identifying the features and characteristics of each of the three positions investigated. The wealth of Data Analyst, Data Engineer and Data Journalist positions within the job bank searched is evidence of the continuing investment in and growth of data-driven markets and the accompanying huge demand for a workforce with the critical skills in these areas. The distribution of positions in the sample highlights the value placed on these data roles by private sector organisations; to some extent, universities and other public sector bodies appear to be slower in investing in these particular data roles.

Comparing the Data Roles
The results from this study indicate that these are three clearly differentiated data roles, but with overlapping requirements and a common core set of critical competencies and skills. The commonalities and differences in requirements have been summarised in a Venn diagram shown in Figure 1. The focus on quantified experience for a Data Engineer and a Data Journalist may reflect parallel foundations in professional practice: both fields have an established 'hands-on' approach with strong traditions of learning on-the-job. Similarly, the requirement for domain knowledge for a Data Analyst (e.g. in health or finance or aquatic sciences) and a Data Journalist may reflect their situation within a particular disciplinary field or sector, where an understanding of the established practices, politics and culture will be an advantage. Other bilateral commonalities, such as the requirement to source or acquire data for a Data Analyst and a Data Journalist, reflect the importance of being able to 'find data' from external sources, e.g. government datasets, for subsequent exploration, visualization and insight development.
The focus on large volumes of data or data aggregation observed in the data analyst and data engineer requirements highlights the importance of working at scale; many of the roles in the sample were based in very large multi-national companies with millions of clients generating huge data volumes through retail, business or leisure transactions (in other words, big data). The relevance of statistical skills for the Data Journalist roles in addition to the Data Analyst roles was a surprise; however, quantitative data is critical for both roles and statistical techniques provide the essential tools and protocols to demonstrate significance, trends and insights from the evidence base. The value of mathematics, statistics and quantitative thinking was identified in an earlier Data Science Venn Diagram (Conway, 2010). The suite of competencies required by each of the three roles (Python, R, Excel and SQL) forms a foundational 'technical data toolkit' and highlights the relevance of coding and querying proficiency. Demand for Python programming expertise was found to have increased by almost 100% in big data related positions in 2014 in an analysis of big data hiring trends (Columbus, 2014). Python, R and Excel were also highlighted as key tools for data analysts for the data wrangling process, 'the process of making data useful' (Kandel et al., 2011). However, these technical abilities need to be blended with other attributes such as research skills (Data Analyst), documentation skills (Data Engineer) and an ability to meet deadlines (Data Journalist). A blended or rounded set of skills was also highlighted as a desirable feature by UK business representatives in the Nesta Model Workers Report (Bakhshi et al., 2014).
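The comparison of the three roles can be read as a set computation. The following minimal Python sketch assumes a partial list of requirements transcribed from the prose of this section; it is not the full content of Figure 1 or of Tables 1, 3 and 5, and the short labels are illustrative shorthand.

```python
from collections import defaultdict

# Partial, illustrative requirement sets taken from the discussion above.
SHARED_CORE = {"Excel", "R", "Python", "SQL", "collaboration"}

roles = {
    "analyst": SHARED_CORE | {
        "domain knowledge", "data sourcing", "statistics",
        "data at scale", "research skills",
    },
    "engineer": SHARED_CORE | {
        "quantified experience", "data at scale", "documentation skills",
    },
    "journalist": SHARED_CORE | {
        "quantified experience", "domain knowledge", "data sourcing",
        "statistics", "meeting deadlines",
    },
}


def venn_regions(role_requirements):
    """Group requirements by the exact set of roles that list them.

    Returns a dict mapping a frozenset of role names (one Venn region)
    to the set of requirements that fall in that region.
    """
    regions = defaultdict(set)
    for req in set().union(*role_requirements.values()):
        members = frozenset(
            role for role, reqs in role_requirements.items() if req in reqs
        )
        regions[members].add(req)
    return dict(regions)


regions = venn_regions(roles)

# Print regions from the most widely shared to the role-specific ones.
for members in sorted(regions, key=lambda m: (-len(m), sorted(m))):
    print(" & ".join(sorted(members)), "->", sorted(regions[members]))
```

The region shared by all three roles corresponds to the common core, the two-role regions to the bilateral commonalities, and the single-role regions to the attributes that differentiate each role.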
The relative lack of commonality in requirements with the three roles previously studied is striking (Lyon, Mattern, Acker and Langmead, 2015). Whilst there are some requirement intersections (e.g. data management, relational databases and data visualization), overall these form two largely separate groups, each with three interrelated roles. However, within a data-intensive marketplace, the roles are interdependent: a Data Analyst, Journalist or Engineer requires high quality, curated data to work with, whilst a Data Librarian, Archivist, or Steward/Curator requires the data in their care to be used, wrangled and analysed to demonstrate their value. The paper demonstrates that the curriculum requirements for the data analyst and the data engineer roles are very well-matched to the iSchool curriculum, with potential collaborative opportunities with other academic schools, such as the School of Engineering, to enrich and supplement the iSchool offer. However, it can be argued that the delivery of the educational curriculum for the Data Journalist role may be best positioned within a school providing journalism, media, communication, English or creative writing programs, with the additional collaborative opportunities arising in reverse with iSchools. There is also further scope for the development of Advanced Certificates; whilst this was not identified as a Key Requirement, it was highlighted as a requirement for some positions across each of the roles. Such qualifications provide an effective route for up-skilling of current professionals.
The model pathways described in this study appear to be similar to the concept of a 'trajectory' posited by Furst, Isbell and Guzdial (2007), who also present a 'threads' approach to reviewing the Computer Science curriculum at The Georgia Institute of Technology in Atlanta. Threads takes a view beginning with courses or modules and leading out to generic career roles such as 'Practitioner' (software engineer); in contrast, the translational approach at the University of Pittsburgh begins with real-world roles and tracks back through the role requirements to the courses and modules offered by the graduate programs. Both methodologies have their value, since in each case they join up the educational offerings with career options, current workforce trends and future market demands.

Conclusions
Whilst this is a modest study, the methodology is transferable and may be applied within other iSchools and by other education providers. The findings emphasise the inter-disciplinary, blended or hybrid character of the curriculum requirements for the data science roles. Higher education providers will need to carefully customise and modify their curricula to optimally match these complex real-world requirements. However, there are significant opportunities to develop new partnerships, both across campus and beyond, to create exciting translational curricula to meet current and future data workforce capacity and capability challenges.

Table 1. Key requirements for data analyst.

Table 2. Course pathways for a data analyst.

Table 3. Key requirements for data engineer.

Our recommended course pathways through the MLIS and Information Science and Technology Masters Programs for a prospective Data Engineer include the essential, desirable and additional course stepping stones listed in Table 4.

Table 4. Course pathways for a data engineer.

Table 5. Key requirements for data journalist.

Table 6. Course pathways for a data journalist.