What issues are data scientists talking about? Identification of current data science issues using semantic content analysis of Q&A communities

Background Because of the growing involvement of communities from various disciplines, data science is constantly evolving and gaining popularity. The growing interest in data science-based services and applications presents numerous challenges for their development. Therefore, data scientists frequently turn to various forums, particularly domain-specific Q&A websites, to resolve the difficulties they encounter. Over time, these websites evolve into data science knowledge repositories. Analysis of such repositories can provide valuable insights into the applications, topics, trends, and challenges of data science. Methods In this article, we investigated what data scientists are asking by analyzing all posts to date on Data Science Stack Exchange (DSSE), a data science-focused Q&A website. To discover the main topics embedded in data science discussions, we used latent Dirichlet allocation (LDA), a probabilistic approach for topic modeling. Results As a result of this analysis, 18 main topics were identified that demonstrate the current interests and issues in data science. We then examined the topics’ popularity and difficulty. In addition, we identified the most commonly used tasks, techniques, and tools in data science. “Model Training”, “Machine Learning”, and “Neural Networks” emerged as the most prominent topics, while “Data Manipulation”, “Coding Errors”, and “Tools” were identified as the most viewed (most popular) topics. On the other hand, the most difficult topics were identified as “Time Series”, “Computer Vision”, and “Recommendation Systems”. Our findings have significant implications for many data science stakeholders who are striving to advance data-driven architectures, concepts, tools, and techniques.


INTRODUCTION
The volume and variety of data produced and shared in today's digital life cycle, also known as the "age of big data", is increasing exponentially on a daily basis. Over the last decade, big data has been at the forefront of the technology-driven revolution in computing ecosystems. Big data applications, including but not limited to search engines, social networks, e-commerce, and multimedia streaming services, have achieved unprecedented success (Assunção et al., 2015; Donoho, 2017; Vicario & Coleman, 2019). Analyzing data from Q&A platforms therefore has significant implications for understanding the current state of data science.
Taking into account the current background, this study aims to conduct an in-depth and comprehensive analysis of the common issues and challenges faced by data scientists, as well as to fill a gap in the literature. To that end, a semantic content analysis based on probabilistic topic modeling was applied to posts shared on DSSE, a data science-specific Q&A platform, over the last nine years, between 2014 and 2022 (Stack Exchange, 2022). The semi-automatic methodology proposed in this study includes a series of semantic topic modeling processes based on latent Dirichlet allocation (LDA), a probabilistic topic modeling algorithm used to automatically discover topics (Blei, Ng & Jordan, 2003). Using this methodology, we identified the main issues and themes that data scientists are discussing, as well as their underlying dependencies and indicators and long-term trends. In summary, we motivate and present the following research questions for our empirical study.
RQ1. What topics are discussed by data scientists?
RQ2. How do the data science topics evolve over time?
RQ3. How do the popularity and difficulty of the topics vary?
RQ4. What are the most commonly used tasks, techniques, and tools in data science?
RQ5. How do data science topics relate to data-driven technologies?

BACKGROUND AND RELATED WORK
To provide a better understanding of the contextual dimensions of data science, we base our research on three pillars and discuss relevant studies here under the headings of data science, Q&A communities, and topic modeling.

Studies related to data science
In recent years, the rapid development of data-oriented services and applications has increased the number of studies aimed at understanding the data science paradigms. The majority of domain-specific research focuses on data science concepts, architectures, practices, application areas, big data, and related fields (Donoho, 2017). These studies provided insights into the use of data science in academia and industry, as well as data science issues and challenges (Vicario & Coleman, 2019;Karbasian & Johri, 2020).
Previous studies have discussed the use and benefits of data science in various disciplines, ranging from business science to computer sciences, finance to medicine, bioinformatics to natural sciences (Donoho, 2017;Vicario & Coleman, 2019). In today's data-driven world, common data science application areas include business analytics, supply chain, social media analytics, smart cities, business logistics, business intelligence, recommendation systems, decision support systems, natural language processing, financial fraud detection, advertising, behavioral analytics, and manufacturing (Schoenherr & Speier-Pero, 2015;Vicario & Coleman, 2019;Sarker, 2021). Several studies have emphasized the close relationship of data science with big data, predictive analytics, and data-driven decision making (Cao, 2017;Sarker, 2021).
A number of studies, similar to our current study, have been conducted based on the analysis of posts from Q&A communities to investigate the issues and challenges of data science (Bagherzadeh & Khatchadourian, 2019; Hin, 2020; Karbasian & Johri, 2020). An empirical study was conducted in which data science-related posts on two Q&A websites, Stack Overflow and Kaggle, were analyzed using the topic modeling approach, and 24 data science-related discussion topics were identified (Hin, 2020). In another study, Karbasian & Johri (2020) used a topic modeling-based approach to analyze data science-specific posts from two popular Q&A communities, StackExchange and Reddit, and compared the discussion topics discovered on the two platforms. Furthermore, Bagherzadeh & Khatchadourian (2019) examined Stack Overflow posts to investigate big data-related issues, and as a result, they identified 30 topics. Our study differs from theirs in that we only analyze posts shared on DSSE, a data science-specific Q&A platform, and we propose a semi-automated methodology based on unsupervised machine learning for such analysis. Furthermore, our work complements theirs by adding dimension and depth.
Stack Overflow data has also been used in a number of studies that revealed discussion topics in specific subfields of software development such as testing (Kochhar, 2016), security (Yang et al., 2016), mobile development (Linares-Vásquez, Dit & Poshyvanyk, 2013; Rosen & Shihab, 2016), chatbot development (Abdellatif et al., 2020), IoT development, and machine learning (Alshangiti et al., 2019). As previously stated, a small number of studies have also been conducted based on the analysis of posts shared on Q&A platforms to explore the issues and challenges related to data science (Bagherzadeh & Khatchadourian, 2019; Hin, 2020; Karbasian & Johri, 2020). Our study expands on previous research by analyzing the comprehensive dimensions of posts shared on the DSSE platform, which only includes data science-related posts.

Topic modeling
Topic modeling is a generative approach for comprehending and summarizing the semantic content of large collections of documents (Blei, 2012; Gurcan et al., 2022b). It identifies the word groups that best describe the semantic map of the documents as separate topics (Blei, 2012). Topic modeling is becoming more popular for mining unstructured textual data due to its capabilities and benefits for semantic content analysis (Gurcan & Cagiltay, 2022). Moreover, it is actively used to reveal emerging trends in specific contexts of technical sciences such as software engineering, computer sciences, and information sciences (Silva, Galster & Gilson, 2021; Gurcan et al., 2022a). The amount of data produced in today's software repositories and social networks is growing at an exponential rate. Because of this rapid increase in unstructured data, semantic content analysis of this data has become difficult but important, even as it opens new research opportunities (Silva, Galster & Gilson, 2021).
From its inception to the present, topic modeling procedures have been successfully applied to various types of data, including textual documents, genetic data, web archives, log files, source codes, images, videos, forums, blogs, Q&A platforms, software repositories, and social networks (Blei, 2012; Silva, Galster & Gilson, 2021). Furthermore, topic modeling was employed to reveal the major trends in software engineering research (Mathew & Menzies, 2018; Gurcan et al., 2022b); to explore essential competencies and skills for software engineers by analyzing online job postings (Gurcan & Kose, 2017; Gurcan & Cagiltay, 2019); and to identify the issues software developers discuss in specific contexts of software engineering, such as IoT development, mobile development (Rosen & Shihab, 2016), testing (Kochhar, 2016), chatbot development (Abdellatif et al., 2020), security (Yang et al., 2016), and programming languages (Chakraborty et al., 2021).
In data science research, the LDA-based topic modeling method was also used to analyze data scientists' discussions on Q&A platforms (Bagherzadeh & Khatchadourian, 2019;Hin, 2020;Karbasian & Johri, 2020). Apart from these aforementioned studies, numerous studies based on topic modeling procedures have been conducted in subcontexts of various disciplines (Silva, Galster & Gilson, 2021). In conclusion, the effectiveness and suitability of the topic model approach for research in the technical sciences has increased our motivation to investigate data science issues using topic models.

METHOD
Data collection and extraction
To provide an objective methodology, we used a data dump that included all posts shared on Data Science Stack Exchange (DSSE), a data science-specific Q&A platform (Stack Exchange, 2022). The datasets created and analyzed during this current study are publicly available in the Internet Archive repository (Internet Archive, 2022) as a data dump in XML format. The first step was to download the most recent XML data dump (last updated on June 8, 2022) and parse it into a database. Each question post in the data dump contains various metadata elements such as a title, a body, tags, answers, comments, and other indicators (Yang et al., 2016).
From May 2014 to June 2022, this parsed data dump contained 70,283 posts (33,492 questions and 36,791 answers). Figure 1 depicts the number of data science-related posts in our experimental data set by year. According to Fig. 1, the number of questions and answers increased up until 2019 and then decreased. Although the number of answers is generally greater than the number of questions, it has been lower since 2020. It was also discovered that an average of 3,721 questions and 4,088 answers were shared on the DSSE platform each year.

Data preprocessing
Data preprocessing consists of a series of operations used to convert textual raw data into structured datasets. We considered the preprocessing steps that have been widely preferred in previous work to reduce noise and structure textual data in order to implement an efficient preprocessing (Rosen & Shihab, 2016;Yang et al., 2016;Uddin et al., 2021). Initially, we removed all non-semantic paragraphs and content from a post, such as code snippets marked with <code></code> and non-text blocks marked with HTML tags like <p></p> and <a></a>. In the second step, we removed stop words (e.g., "a", "an", "the", "with", etc.), numbers, punctuation, and non-alphabetic characters (Řehůřek & Sojka, 2011). In this way, we have removed words and characters from the posts that do not make sense on their own. As a result of this process, we only kept the meaningful words required for semantic topic modeling in the corpus. Following that, we used the lemmatization process to reduce the words to their roots while keeping the various meanings derived from the same root (Řehůřek & Sojka, 2011;Gurcan & Cagiltay, 2022).
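The preprocessing steps above (stripping code snippets and HTML tags, removing stop words and non-alphabetic characters, then lemmatizing) can be sketched as follows. This is a minimal illustrative pipeline, not the study's actual implementation: the tiny stop-word set and lemma map are placeholders for the full resources a library such as Gensim or NLTK would provide.

```python
import re

STOP_WORDS = {"a", "an", "the", "with", "is", "are", "of", "to", "in", "and", "my"}
# Tiny illustrative lemma map; the study used a full lemmatizer.
LEMMAS = {"training": "train", "models": "model", "errors": "error"}

def preprocess(post_html: str) -> list[str]:
    """Convert a raw post body into a list of meaningful word roots."""
    # 1. Drop code snippets, then strip all remaining HTML tags.
    text = re.sub(r"<code>.*?</code>", " ", post_html, flags=re.DOTALL)
    text = re.sub(r"<[^>]+>", " ", text)
    # 2. Keep alphabetic tokens only (removes numbers and punctuation).
    tokens = re.findall(r"[a-z]+", text.lower())
    # 3. Remove stop words, then lemmatize.
    return [LEMMAS.get(t, t) for t in tokens if t not in STOP_WORDS]

print(preprocess("<p>My models overfit during training!</p><code>fit(x)</code>"))
# → ['model', 'overfit', 'during', 'train']
```

The order matters: code blocks are removed before tag stripping so that source code inside posts never reaches the token stream.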

Implementation of LDA-based topic modeling
Topic modeling is a statistical and computational approach that uses unsupervised machine learning to extract latent semantic patterns from a large collection of documents. For text mining and natural language processing research, several topic modeling algorithms have been proposed, including latent semantic indexing (LSI), latent Dirichlet allocation (LDA), non-negative matrix factorization (NMF), Dirichlet multinomial regression (DMR), hierarchical latent Dirichlet allocation (HLDA), hierarchical Dirichlet process (HDP), dynamic topic model (DTM), and correlated topic model (CTM) (Řehůřek & Sojka, 2011; Gurcan & Cagiltay, 2022; Gurcan et al., 2022b). Unfortunately, the majority of these algorithms do not provide a widely accepted method for calculating the coherence score, which is used to estimate the optimal number of topics (Gurcan & Cagiltay, 2022). Among these algorithms, LDA provides the fundamental background of topic modeling and is widely regarded as one of the most effective techniques for discovering hidden semantic structures, known as "topics", in natural language text documents (Blei, 2012); it is also the most widely used topic modeling algorithm in text mining and natural language processing (Gurcan et al., 2022a). In this regard, in this study, we used latent Dirichlet allocation (LDA), a probabilistic model, for topic modeling-based semantic content analysis of a textual corpus of data science posts. The theoretical background and graphical model representations of the LDA algorithm are presented in the study by Blei, Ng & Jordan (2003), which introduced LDA (see Figures 1 and 7 in Blei, Ng & Jordan (2003)).
With the aim of implementing the LDA-based topic modeling procedures on our data science corpus, we used Gensim (Řehůřek & Sojka, 2011), a comprehensive library for text mining and topic modeling. The LDA topic modeling algorithm groups the preprocessed input posts into a list of topics based on the specified K (number of topics). Taking into account previous work, we used the C_V metric, which is included in the Gensim package (Řehůřek & Sojka, 2011), to calculate a coherence score for each K value and to determine the optimal number of K topics for our corpus. Consistent with previous work on topic modeling of short texts (Zuo et al., 2016; Gurcan et al., 2022b), the prior parameters α = 0.1 and β = 0.01 were used to fine-tune the distribution of topics per document and the distribution of words per topic, respectively. The LDA model was implemented using these prior parameters, with K values ranging from 10 to 40, increasing one at a time. Concurrently, as shown in Fig. 2, we calculated a coherence score (C_V) for each topic model implemented for each K (Řehůřek & Sojka, 2011). The maximum coherence score (C_V = 0.52589) was obtained for K = 18, revealing the optimal semantic consistency of the topics. As a result, we determined that 18 was the optimal number of topics.

EMPIRICAL STUDY
In this section, we answer our five research questions:
RQ1. What topics are discussed by data scientists?
RQ2. How do the data science topics evolve over time?
RQ3. How do the popularity and difficulty of the topics vary?
RQ4. What are the most commonly used tasks, techniques, and tools in data science?
RQ5. How do data science topics relate to data-driven technologies?
What topics are discussed by data scientists? (RQ1)

Motivation
Because of the rapid development of data science paradigms, many innovative architectures, techniques, and tools to support data science development have been developed. To keep up with this rapid change, data scientists must first understand the issues and challenges they face when employing these architectures, techniques, and tools. The number of empirical studies that comprehensively investigate the issues facing data scientists is quite limited. As a result, this type of analysis can aid in understanding the issues that data scientists are discussing.

Approach
We used LDA-based topic modeling to investigate the questions that data scientists are asking on the DSSE platform. LDA is a probabilistic and generative topic modeling technique that represents topics as a probability distribution over words in a corpus. In the method section, we describe in detail how we adapted and applied LDA-based topic modeling to our corpus. We discovered 18 topics as a result of this LDA analysis, with each topic represented by 30 descriptive keywords. Two independent domain-specific experts assessed the consistency of the topics described with descriptive keywords. Experts evaluated 20 random samples of posts for each of the 18 topics to see if the posts were consistent with the dominant topic to which they were assigned. The topics were then given names based on their keywords. The total percentages of each topic in the entire corpus were then calculated, taking into account the dominant topic to which each post is assigned. For example, if a topic has a rate of 10%, it is the dominant topic in 10% of all question posts, and those posts are assigned to that topic.
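The dominant-topic assignment and rate calculation described above can be sketched as follows. The per-post topic distributions here are hypothetical placeholders, not the study's actual model output; in practice each distribution would come from the fitted LDA model.

```python
from collections import Counter

def dominant_topic(doc_topic_probs):
    """Return the index of the topic with the highest probability for a post."""
    return max(range(len(doc_topic_probs)), key=doc_topic_probs.__getitem__)

# Hypothetical per-post topic distributions from an LDA model with 3 topics.
posts = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
]

counts = Counter(dominant_topic(p) for p in posts)
rates = {topic: count / len(posts) * 100 for topic, count in counts.items()}
print(rates)  # → {0: 50.0, 1: 25.0, 2: 25.0}
```

A topic's rate is thus the share of all question posts for which it is the dominant topic, matching the 10% example in the text.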

Finding
As a result, Table 1 shows the 18 topics discovered by LDA-based topic modeling, along with their topic names, descriptive keywords, and rates. As shown in Table 1, the topics (issues) are listed in descending order by percentage, and the keywords for each topic are also listed in descending order. Because the topics in Table 1 reveal major issues specific to data science, the terms topic and issue are used interchangeably throughout this article. The topics indicated that data scientists face a wide range of issues, from "Machine Learning" to "Model Training", "Neural Networks" to "Time Series", "NLP" to "Computer Vision". As a result, the top five most frequently asked topics were "Model Training", "Machine Learning", "Neural Networks", "NLP", and "Time Series". The least asked topics, on the other hand, were "Clustering", "Coding Errors" and "Dimensionality Reduction". Based on the issues discovered (see Table 1), we can conclude that machine learning and its subtasks such as "Model Training", "Neural Networks", "Feature Engineering", "Classification", and "Clustering" are dominant topics for data science. Furthermore, the topics of "NLP", "Computer Vision", and "Recommendation Systems" have emerged as the most common application areas of data science.
How do the data science topics evolve over time? (RQ2)

Motivation
Data science is a constantly evolving discipline that includes new or outdated tasks, architectures, techniques, and tools. Therefore, the interests, recommendations, and experiences of data scientists may change over time. From this perspective, we attempted to examine how the topics discussed by data scientists have evolved over time. Such an analysis will help to improve understanding of chronic and unsolvable data science issues, as well as generate new solutions to them.

Approach
At this stage, we examined how data science issues have evolved over time. To achieve this, we looked at the distribution of the number of questions for each topic over time. The annual number of questions for each topic was calculated. Then, for each year, we divided the number of questions per topic by the total number of questions for that year. In this way, we normalized the question distribution of each topic as a percentage for that year. In other words, we calculated the percentage of questions for each topic per year. We then subtracted the previous year's percentages from the current year's percentages to determine how much each topic changed in the current year compared to the previous year. Finally, we calculated the overall temporal trend of the topics by summing the annual percentage changes for each topic.

Finding
Table 2 shows the annual percentage changes, total trends, and trend directions for each topic, listed in descending order by overall percentage, and a number of inferences can be drawn from its rows about how the topics have evolved over time. Figure 3 depicts the total trend values of the topics in descending order to provide a broader understanding. Figure 3 demonstrates 11 topics with an increasing trend and seven topics with a decreasing trend. As seen in Fig. 3, "Model Training", "Neural Networks", "Regression Models", "Time Series", and "Computer Vision" are the top five topics with the most increasing trend, while "Tools", "Machine Learning", "Recommendation Systems", "NLP", and "Clustering" have the most decreasing trend.
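The year-over-year trend computation described in the approach above can be sketched as follows; the input counts are hypothetical, not the study's data.

```python
def topic_trend(questions_per_year: dict[int, dict[str, int]], topic: str) -> float:
    """Sum of year-over-year changes in a topic's share of all questions."""
    years = sorted(questions_per_year)
    # Normalize: a topic's question count as a percentage of that year's total.
    shares = {y: 100 * questions_per_year[y][topic] / sum(questions_per_year[y].values())
              for y in years}
    # Trend = sum over consecutive years of (current share - previous share).
    return sum(shares[b] - shares[a] for a, b in zip(years, years[1:]))

# Hypothetical annual question counts for two topics.
data = {
    2020: {"Model Training": 20, "Clustering": 80},
    2021: {"Model Training": 40, "Clustering": 60},
    2022: {"Model Training": 60, "Clustering": 40},
}
print(topic_trend(data, "Model Training"))  # → 40.0 (increasing trend)
print(topic_trend(data, "Clustering"))      # → -40.0 (decreasing trend)
```

Note that summing consecutive differences telescopes to the last year's share minus the first year's share, so the annual changes mainly add interpretive detail about when the shift happened.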
How do the popularity and difficulty of the topics vary? (RQ3)

Motivation
Our findings at RQ1 revealed the broad scope and diversity of data science issues. Data scientists discuss various technical issues at various levels in order to understand specific tasks, techniques, and tools and how they operate effectively. As a result, these issues are unlikely to have the same level of popularity and difficulty. Some questions are viewed numerous times, while others are hardly viewed at all. Identifying the popularity and difficulty of data science issues can help prioritize efforts to improve data science.

Approach
A question post on DSSE includes descriptive indicators such as the number of views, answers, accepted answers, score, favorites, and comments. After a user posts a question on DSSE, other users respond with answers. When the asker finds an answer that resolves their question, they mark it as the accepted answer for that question. In this way, the asker indicates that the problem has been resolved, and other users who encounter the same issue can view this solution. At this stage, we performed a series of computational analyses using these indicators to determine the difficulty and popularity of each topic. We began by calculating the number of questions assigned to each topic. Next, we divided the total number of views for each topic by the total number of questions for that topic to calculate the average number of views per question; we took this as the popularity of the topic. We then calculated the average number of accepted answers for each topic by dividing the number of accepted answers for that topic by the number of questions for that topic; we took this as the difficulty of the topic. Similarly, we calculated the average number of answers, favorites, comments, and scores for each topic. Since the average number of accepted answers ranges from 0 to 1, we presented this indicator as a percentage.
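The popularity and difficulty computations described above can be sketched as follows; the post records are hypothetical, not drawn from the DSSE dump.

```python
def topic_stats(posts):
    """posts: list of (topic, views, has_accepted_answer) tuples per question."""
    agg = {}
    for topic, views, accepted in posts:
        q, v, a = agg.get(topic, (0, 0, 0))
        agg[topic] = (q + 1, v + views, a + int(accepted))
    return {
        topic: {
            "popularity": views / questions,           # average views per question
            "difficulty": 100 * accepted / questions,  # % accepted (lower = harder)
        }
        for topic, (questions, views, accepted) in agg.items()
    }

# Hypothetical question records: (topic, view count, accepted answer?).
posts = [
    ("Time Series", 120, False), ("Time Series", 80, True),
    ("Tools", 900, True), ("Tools", 700, True),
]
stats = topic_stats(posts)
print(stats["Tools"]["popularity"])        # → 800.0 views per question
print(stats["Time Series"]["difficulty"])  # → 50.0 (% accepted answers)
```

Under this scheme a topic like "Time Series" with few accepted answers surfaces as difficult even if its view counts are low.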
Finding
Table 3 shows the calculated averages of the questions, views, answers, scores, comments, accepted answers, and favorites for each topic in order to highlight all dimensions of data scientists' interest in domain-specific issues. The topics in this table are listed in descending order of their percentages. The findings in Table 3 provide insight into the interests and perspectives of data scientists on various issues. Moreover, we calculated the popularity of the topics based on the average number of views and presented it in Fig. 4. As shown in Fig. 4, the top five most viewed (most popular) topics are "Data Manipulation", "Coding Errors", "Tools", "Regression Models", and "Neural Networks". On the other hand, the least viewed (least popular) topics are "Recommendation Systems", "Time Series", and "Computer Vision". Also, we determined the difficulty of the topics by considering the average number of accepted answers and presented them in Fig. 5. According to the difficulty metrics of the topics shown in Fig. 5, the most difficult topic is "Time Series", which has the lowest rate (21%) based on the number of accepted answers. The other most difficult topics are "Computer Vision", "Recommendation Systems", "NLP", and "Statistics".
What are the most commonly used tasks, techniques, and tools in data science? (RQ4)

Motivation
The growing interest in data-driven applications and services has resulted in an expansion and diversification of data science technologies. Innovative data technologies, which cover a wide range of tasks, methods, and tools, are now widely used in today's data science environments. As a result, the majority of data science issues and challenges are closely related to these tasks, techniques, and tools. It is highly likely that trends in data science technologies will evolve concurrently with the technological transformations experienced in data-driven ecosystems. Identifying the most commonly used tasks, techniques, and tools in data science can provide important insights into various dimensions of data science issues. An analysis of this nature will also help in understanding the relative popularity of data-driven technologies over time. Revealing useful and popular data technologies can help data scientists choose the appropriate tools to advance data science.

Approach
Each DSSE question post contains tags that provide context and background for that question. These tags are chosen and added to the question by the user who asked it. The tags are descriptive keywords that display data science-related themes, tasks, techniques, and tools that users associate with their questions. To extract a definitive set of all tags used in data science, we first separated the tags of each post into individual tags and calculated the tag frequencies across all posts. Following that, we identified the tags with the highest frequency across the entire corpus. Then, taking into account their annual frequency distributions, we calculated the annual percentages of these tags. We calculated how each tag changed in that year compared to the previous year by subtracting the percentage of each tag in the previous year from the percentage in the current year. In this way, we determined the percentage increase or decrease of tag frequencies for each year. We then calculated the overall trend for each tag over time by adding these annual changes. Finally, we identified the most commonly used tasks, techniques, and tools in data science by categorizing the tags according to their functions and contexts.
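The tag separation and frequency counting described above can be sketched as follows, assuming the angle-bracket tag encoding used in Stack Exchange data dumps (e.g., `<machine-learning><python>`); the sample questions are hypothetical.

```python
import re
from collections import Counter

def parse_tags(tag_field: str) -> list[str]:
    """Split a Stack Exchange tag string like '<machine-learning><python>'
    into individual tags."""
    return re.findall(r"<([^>]+)>", tag_field)

# Hypothetical tag fields from three question posts.
questions = [
    "<machine-learning><python>",
    "<deep-learning><python><keras>",
    "<machine-learning><classification>",
]
tag_counts = Counter(tag for q in questions for tag in parse_tags(q))
print(tag_counts["machine-learning"], tag_counts["python"])  # → 2 2
```

The same per-year normalization and year-over-year differencing applied to topics in RQ2 is then applied to these tag counts to obtain each tag's trend.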

Finding
We identified 665 unique tags used in data science by analyzing the tags of all posts in the corpus. We calculated the annual percentages and total percentages of the top 50 tags with the highest frequency and presented them in Table 4 in descending order of their percentages to provide a clearer understanding of the tags. Table 4 demonstrates the top 50 tags' key findings, including yearly rates (from 2014 to 2022), total rates, trend values, and trend directions. Figure 6 also shows the top 20 tags with the highest frequency and their percentages. According to Fig. 6, "machine-learning" is the most prominent tag, followed by "python", "deep-learning", "neural-network", and "classification". Consistent with the data science topics discovered by the LDA (see Table 1), machine learning-related tags such as "machine-learning", "deep-learning", and "neural-network" emerged as clearly dominant.
In order to demonstrate the temporal trends of these 50 tags, we also present the annual percentages, trend values, and trend directions of the tags in Table 4. In Fig. 7, we visualized the top 15 tags with increasing and decreasing trends to better illustrate the tag trends. As seen in Fig. 7, the prominent tags with an increasing trend are "deep-learning", "python", "tensorflow", "keras", and "lstm". On the other hand, the top tags with a decreasing trend are "data-mining", "machine-learning", "r", "text-mining", and "classification". Considering these trends, we can predict that deep learning will become a widespread research and application area for data scientists in the near future. It is an interesting finding that, while the trend of the "data-mining" and "machine-learning" tags is decreasing, the trend of "deep-learning" and related tags such as "tensorflow", "keras", and "lstm" is increasing. This finding clearly shows a shift in data science from "machine-learning" to "deep-learning".

The topics and tags discovered in the previous stage of our analysis revealed that data science issues cover a wide range of data science tasks, techniques, and tools. When the context and background of data science issues are thoroughly examined, it is clear that the vast majority of these issues and challenges are related to how data-driven tasks, techniques, and tools are used in data science. In order to provide an understanding of the tasks, techniques, and tools used in data science, we categorized the tags based on their functions and contexts. In this way, we classified tags into three categories: tasks, algorithms, and tools. Table 5 shows the top 25 tags for each category. The tasks, algorithms, and tools listed in Table 5 can be defined as the three pillars of data science. As shown in Table 5, the top five data science tasks are "machine-learning", "deep-learning", "classification", "nlp", and "time-series".
The dominance of "machine-learning" and "deep-learning" tasks in data science is especially noteworthy. The most commonly used algorithms in data science are identified as "neural-network", "cnn", "lstm", "random-forest", and "rnn". The top five data science tools are discovered to be "python", "keras", "scikit-learn", "tensorflow", and "r".
How do data science topics relate to data-driven technologies? (RQ5)

Motivation
Earlier stages of our analysis revealed that the issues discussed by data scientists cover a wide range of tasks, techniques, and tools used in data science. Analyzing the connections between data-driven technologies (tasks, techniques, and tools) and data science issues can lead to significant discoveries. In this way, the findings of such an analysis will contribute to a better understanding of data science issues and the advancement of data science. To achieve this, we expanded our analysis at this point to investigate correlations between data science issues and data-driven technologies.

Approach
At this stage, we added another process to our current analysis and attempted to correlate our findings in RQ1 and RQ4. We began by calculating the tag distribution for posts assigned to each topic discovered in RQ1. In RQ4, we explained how the tags for each post are parsed. We then identified the top 15 tags with the highest frequency for each topic. In this way, we determined which tasks, techniques, and tools are closely related to which data science issues.
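The per-topic tag aggregation described above can be sketched as follows; the posts and tag lists are hypothetical.

```python
from collections import Counter, defaultdict

def top_tags_per_topic(posts, n=15):
    """posts: list of (dominant_topic, [tags]) pairs.
    Returns the n most frequent tags for each topic."""
    by_topic = defaultdict(Counter)
    for topic, tags in posts:
        by_topic[topic].update(tags)
    return {topic: [t for t, _ in c.most_common(n)]
            for topic, c in by_topic.items()}

# Hypothetical posts already assigned to their dominant LDA topics.
posts = [
    ("Model Training", ["machine-learning", "overfitting"]),
    ("Model Training", ["machine-learning", "loss-function"]),
    ("Tools", ["python", "pandas"]),
]
print(top_tags_per_topic(posts, n=2)["Model Training"])
# → ['machine-learning', 'overfitting']
```

Cross-tabulating topics against tags in this way is what surfaces, for example, "machine-learning" as the top tag of the "Model Training" topic.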

Finding
As a result of this process, we identified the top 15 tags for each topic and presented them in Table 6, where the topics are listed in descending order by percentage. Likewise, the top 15 tags for each topic are sorted in descending order. As shown in Table 6, the tag "machine-learning" has the highest frequency for the topic "Model Training", while "random-forest" has the lowest. Figure 8 shows the first three topics and their tags to illustrate the significance of these tags for each topic. In this way, we discovered a wide range of data-driven technologies (e.g., tasks, algorithms, analysis tools, programming languages, and machine learning libraries) categorized under the 18 issues. According to the results in Table 6, "machine-learning" is the first tag in nine of the 18 topics. In other words, machine learning is clearly a dominant concept in data science issues. Python was seen as the first tag in three of these topics ("Data Manipulation", "Tools", and "Coding Errors"). This finding highlighted Python as the most popular programming language for data science. Apart from these, the tags "neural-network", "nlp", "time-series", "deep-learning", "classification", and "clustering" each appeared as the first tag in only one topic.

DISCUSSION
Data science comprises dynamic and competitive working environments in which tasks, paradigms, tools, technologies, skills, and experiences are constantly being updated and advanced. Our analysis identifies the questions most frequently asked by data scientists on the DSSE platform and investigates the dimensions that indicate their importance. Our findings revealed the data science issues and challenges as 18 distinct topics discovered by LDA. The findings of our study have important implications for understanding data scientists' thoughts, interactions, workflows, and issues in today's IT industry. We will now discuss these findings in detail. First, we found that the topic "Machine Learning" and its related processes emerge as the most dominant tasks and most common implementations for data scientists, suggesting that tool and application support for machine learning may need to mature further (Alshangiti et al., 2019; Hin, 2020). Other machine learning-related topics with high percentages, such as "Model Training", "Neural Networks", and "Feature Engineering", support this finding (see Table 1). We also frequently encountered machine learning-related keywords and tags in other topics: Table 6 shows that "machine-learning" is the first tag in nine of the 18 topics. Furthermore, "NLP", "Computer Vision", and "Recommendation Systems" were identified as prominent data science application areas (see Table 1) (Hin, 2020; Karbasian & Johri, 2020).
We also analyzed the temporal evolution of data science issues and discovered a number of noteworthy findings. "Model Training" is once again at the top of the list of topics with the fastest growing trend. Following that, the data science topics with the highest increasing trend were identified as "Neural Networks", "Regression Models", and "Time Series" (see Fig. 3). Furthermore, within application areas, "Computer Vision" is on the rise, while "NLP" and "Recommendation Systems" are on the decline (Liu et al., 2017). The strong increasing trend of "Model Training", "Neural Networks", and "Computer Vision" topics suggests that deep learning will gain a more leading position in the near future (Karbasian & Johri, 2020). These findings point to a significant shift from machine learning to deep learning (see Fig. 3) (Hin, 2020).
We extended our analysis with several indicators to better understand the most important, popular, and difficult issues for data scientists. The numbers of questions, views, and answers for a topic on the DSSE platform reveal various insights about it (Bagherzadeh & Khatchadourian, 2019). If a question has already been asked, it will not be asked again; instead, the user will see the previously asked question and its related answers. As a result, a topic's view count is an important indicator of its popularity. We discovered that the top three most popular topics were "Data Manipulation", "Coding Errors", and "Tools" (see Fig. 4). These three topics are the most widely discussed in data science. "Time Series" is the most difficult topic, with the lowest percentage of accepted answers, followed by "Computer Vision", "Recommendation Systems", "NLP", and "Statistics" (see Fig. 5) (Sarker, 2021).
Such common issues in data science that have yet to be resolved should be investigated further and supported by data scientists (Cao, 2017).
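The popularity and difficulty indicators used above can be expressed as a small per-topic aggregation. The sketch below assumes a hypothetical per-question schema (`topic`, `views`, `has_accepted_answer`); it illustrates the metrics rather than reproducing the study's code.

```python
from collections import defaultdict

def topic_indicators(questions):
    """Aggregate per-topic popularity and difficulty proxies.

    Popularity is proxied by the average view count; difficulty by a
    low accepted-answer ratio (the share of questions whose answers
    were accepted by the asker).
    """
    agg = defaultdict(lambda: {"views": 0, "accepted": 0, "n": 0})
    for q in questions:
        a = agg[q["topic"]]
        a["views"] += q["views"]
        a["accepted"] += int(q["has_accepted_answer"])
        a["n"] += 1
    return {
        topic: {
            "avg_views": a["views"] / a["n"],
            "accepted_ratio": a["accepted"] / a["n"],
        }
        for topic, a in agg.items()
    }

# Toy example:
questions = [
    {"topic": "Time Series", "views": 120, "has_accepted_answer": False},
    {"topic": "Time Series", "views": 80, "has_accepted_answer": True},
    {"topic": "Tools", "views": 500, "has_accepted_answer": True},
]
indicators = topic_indicators(questions)
# "Time Series": avg_views 100.0, accepted_ratio 0.5
```

Under these proxies, a topic such as "Tools" with high views ranks as popular, while a topic such as "Time Series" with a low accepted-answer ratio ranks as difficult.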

Implications of findings
Knowledge and experiences shared on Q&A platforms like DSSE should motivate researchers, practitioners, and developers to create documentation aimed at resolving common data science issues. Based on the empirical background, methodology, and findings of this study, we have derived notable implications and guidelines for data science stakeholders that will contribute to their understanding of the field. We hope that these implications will help data science communities with various profiles, such as developers, researchers, practitioners, educators, and enthusiasts. Developers can lead data science innovation by creating more specific tools and applications that address the current issues and needs of data scientists highlighted in our findings. One of the most popular topics is "Data Manipulation"; for such widespread issues, tool developers can create useful libraries or tools. Developers can also use our findings to improve data-driven tools or to choose which frameworks and libraries to support.
Identifying the questions that data scientists are asking on platforms like DSSE can help the research community better understand the challenges of data science. While all of the issues identified are important in their own right, our findings indicate that data science researchers should prioritize the most visible and difficult issues. Researchers can also use our methodology for experimental research and analysis in a variety of settings. Furthermore, data science educators can create more tutorials to assist in the training of data scientists and candidates, especially considering the difficulty of the topics as well as the most commonly used tasks, algorithms, and tools. Educators can keep their training programs and curricula aligned with current field trends, allowing them to provide up-to-date background to data scientists in training. Our findings can be used by DSSE and other Q&A platforms to develop new approaches for contextually tagging posts and better categorizing user posts. Data science enthusiasts and general readers may find our findings useful in keeping up with emerging developments and trends in the data science industry and ecosystems. More researchers in this field can also contribute to the improvement of data science processes. We hope that our findings, which highlight the challenges that data scientists face, will help to guide future research in this area.

CONCLUSIONS
This research aims to shed light on common issues and challenges encountered by data scientists. To that end, all posts shared on the DSSE platform were analyzed using LDA-based semantic topic modeling. Furthermore, the most commonly used data-driven technologies and their connections to data science issues were investigated. Our research methodology is based on the adaptation and implementation of LDA, an unsupervised generative approach for semantic topic modeling that is widely used in textual content analysis.
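As a rough sketch of how such an LDA pipeline can be set up (using scikit-learn here for brevity; the study's own preprocessing, library choice, and parameterization are not reproduced):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def fit_lda(documents, n_topics, seed=0):
    """Fit an LDA topic model over raw post texts.

    Bag-of-words vectorization followed by LDA; the preprocessing is
    deliberately simplified compared with a full study pipeline
    (no lemmatization, n-grams, or hyperparameter tuning).
    """
    vectorizer = CountVectorizer(stop_words="english")
    dtm = vectorizer.fit_transform(documents)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    doc_topics = lda.fit_transform(dtm)  # rows: per-document topic distributions
    return lda, vectorizer, doc_topics

# Toy corpus standing in for DSSE posts:
docs = [
    "training a neural network model with keras",
    "pandas dataframe groupby error in python",
    "time series forecasting with an lstm model",
]
lda, vectorizer, doc_topics = fit_lda(docs, n_topics=2)
# each row of doc_topics is a probability distribution over the topics
```

In the study itself, `n_topics` corresponds to the 18 topics discovered, and each post is assigned to its highest-probability topic.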
As a result of the LDA analysis, 18 topics were identified that demonstrate the current landscape of data science issues and trends. Among these topics, "Model Training", "Machine Learning", and "Neural Networks" were the most frequently asked. Furthermore, the most viewed (most popular) topics were "Data Manipulation", "Coding Errors", and "Tools". The most difficult topics were identified as "Time Series", "Computer Vision", and "Recommendation Systems".
One of our key discoveries is that data science issues and challenges are inextricably linked to data-driven technologies (tasks, techniques, and tools). While the trend for the "data-mining" and "machine-learning" tags is decreasing, the trend for deep learning-related tags such as "deep-learning", "tensorflow", "keras", and "lstm" is increasing. As the findings show, there has been a significant shift in data science from machine learning to deep learning. It was also determined that the most commonly used algorithms are "neural-network", "cnn", "lstm", "random-forest", and "rnn", and the most commonly used tools are "python", "keras", "scikit-learn", "tensorflow", and "r". Thus, by analyzing the issues discussed by data scientists, this study provides an in-depth understanding of this dynamic discipline.
These findings will help online communities with diverse profiles understand data science focuses and issues. Our findings have significant implications for the various data science stakeholders who are working to advance data science. Our findings can be used by tool builders to improve support and documentation, by developers to create data applications and libraries, and by educators to create modern training and curricula. By focusing on current issues, researchers can provide more solutions for data science. Our methodology can also be applied to other developer platforms like forums, blogs, and portals, as well as different Q&A platforms like Kaggle, Reddit, and Quora.