A Comprehensive Study of Groundbreaking Machine Learning Research: Analyzing highly cited and impactful publications across six decades

Machine learning (ML) has emerged as a prominent field of research in computer science and related fields, driving advancements in many other domains. As the field continues to evolve, it is crucial to understand the landscape of highly cited publications in order to identify key trends, influential authors, and significant contributions made thus far. In this paper, we present a comprehensive bibliometric analysis of highly cited ML publications. We collected a dataset of the top-cited papers from reputable ML conferences and journals, covering the period from 1959 to 2022. We employed various bibliometric techniques to analyze the data, including citation analysis, co-authorship analysis, keyword analysis, and publication trends. Our findings reveal the most influential papers, highly cited authors, and collaborative networks within the machine learning community. We identify popular research themes and uncover emerging topics that have recently gained significant attention. Furthermore, we examine the geographical distribution of highly cited publications, highlighting the dominance of certain countries in ML research. By shedding light on the landscape of highly cited ML publications, our study provides valuable insights for researchers, policymakers, and practitioners seeking to understand the key developments and trends in this rapidly evolving field.

The remaining sections of this paper are structured as follows. In Section 2, we outline the methods utilized for data gathering and the various search strategies employed for conducting the proposed bibliometric analysis study. Section 3 provides a comprehensive presentation of the results and discussion, including a detailed description of the bibliometric study and an analysis of the data collection pertaining to different ML research publications. Additionally, this section offers an overview of the discussion surrounding the top cited ML publications. Finally, in Section 4, we draw conclusions based on the findings of this study.

Data Collection and Search Strategy
In this study, we employed comprehensive data collection techniques to gather ML research publications from various sources. These sources include leading academic journals, conference proceedings, and other relevant scholarly outlets. By utilizing a diverse range of search strategies, we ensured the inclusiveness and representativeness of the collected dataset.
Moreover, for this study, we retrieved data from the Clarivate Analytics Web of Science Core Collection, specifically the online version of the Science Citation Index Expanded (SCI-EXPANDED). The dataset used in our analysis was updated on 18 May 2023, ensuring the inclusion of the most recent publications in the field. To construct a comprehensive dataset, we employed a carefully designed search strategy. We utilized quotation marks (" ") and the Boolean operator "or" to ensure that at least one search keyword appeared in terms of TOPIC (title, abstract, author keywords, and Keywords Plus). The search encompassed the period from 1900 to 2022, spanning over a century of ML literature.
Our search was primarily focused on the keyword "machine learning." However, to account for variations in terminology and to capture a broader range of relevant publications, we included additional terms such as "machine learned," "machine learn," "machine learners," "machines learning," "machining learning," "machine learner," "machine learnings," "machines learn," "machine learns," "machine learnable," "maching learning," "learning of machine," and "machin learning." Furthermore, we incorporated misspelled terms like "machine learnt" and "machine learnig" to account for potential typographical errors. Additionally, terms lacking spaces, such as "machine learningbased," "machine learningmethods," "machine learnin," "machine learningalgorithm," "machine learningclassifiers," and "machine learningmetrics," were included to capture relevant documents within the SCI-EXPANDED database. By adopting this comprehensive search approach, we aimed to ensure that our analysis results are as accurate and inclusive as possible, encompassing a wide range of documents related to the field of ML research.
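As an illustration of how such a query can be assembled, the following minimal sketch joins quoted terms with the Boolean operator OR inside a Topic (TS) clause. The function name `build_topic_query` is hypothetical and only a few of the listed terms are shown; it is not the exact query string used in this study.

```python
def build_topic_query(terms):
    """Quote each term and combine the terms with OR inside one TS=(...) clause."""
    quoted = [f'"{term}"' for term in terms]
    return "TS=(" + " OR ".join(quoted) + ")"


# A small subset of the search terms listed above (illustrative only).
terms = [
    "machine learning",
    "machine learned",
    "machine learnt",
    "machine learningbased",
]

query = build_topic_query(terms)
print(query)
```

The quotation marks force exact-phrase matching, and OR ensures that a record is retrieved when at least one phrase appears in its Topic fields.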

Assessing publication impact
To gauge the impact of publications in this study, we employed several citation indicators derived from the Web of Science Core Collection. These indicators provide valuable insights into the citation performance of individual publications. The following citation indicators were utilized:
• Cyear: This indicator represents the number of citations a publication received from the Web of Science Core Collection in a specific year. For example, C2022 denotes the citation count in the year 2022 (Ho, 2012).
• TCyear: The TCyear reflects the total number of citations a publication has received from the Web of Science Core Collection since its publication year up until the end of the most recent year (2022 in our study, TC2022) (Wang et al., 2011).
• CPPyear: The CPPyear stands for the average number of citations per publication up to the end of a particular year. Specifically, CPP2022 is calculated as TC2022 divided by TP, which represents the total number of publications (Ho, 2013).
The use of these citation indicators, namely Cyear, TCyear, and CPPyear, offers distinct advantages. Because they are fixed to the end of a given year, these indicators ensure consistency and repeatability in our analysis, compared to directly using the number of citations from the Web of Science Core Collection, which is updated continuously (Ho and Hartley, 2016a). To identify highly cited publications, we employed a criterion where publications with TC2022 of 100 or more were selected. This threshold allows us to focus on publications that have received substantial attention and recognition within the field (Ho, 2014a). By utilizing these citation indicators and criteria, we aim to identify and highlight the most influential and highly cited publications in the field of ML research. These indicators provide a quantitative measure of the impact and visibility of individual publications, offering valuable insights into the scholarly contributions that have significantly shaped the field of ML.
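The three indicators can be made concrete with a short sketch. The record layout (a publication year plus a dictionary of citations per year) and the sample numbers are assumptions made for illustration, not data from this study.

```python
def c_year(citations, year):
    """Cyear: citations received in one specific year."""
    return citations.get(year, 0)


def tc_year(citations, pub_year, year):
    """TCyear: citations from the publication year up to the end of `year`."""
    return sum(n for y, n in citations.items() if pub_year <= y <= year)


def cpp_year(docs, year):
    """CPPyear: TCyear summed over all documents, divided by TP."""
    total = sum(tc_year(d["citations"], d["pub_year"], year) for d in docs)
    return total / len(docs)


# Hypothetical documents; citations after 2022 are ignored by TC2022.
docs = [
    {"pub_year": 2019, "citations": {2019: 5, 2020: 40, 2021: 60, 2022: 45, 2023: 10}},
    {"pub_year": 2020, "citations": {2020: 10, 2021: 30, 2022: 50, 2023: 20}},
]

tc2022 = [tc_year(d["citations"], d["pub_year"], 2022) for d in docs]
highly_cited = [d for d, tc in zip(docs, tc2022) if tc >= 100]

print(tc2022)                # [150, 90]
print(cpp_year(docs, 2022))  # 120.0
print(len(highly_cited))     # 1
```

Because TC2022 stops at the end of 2022, the values do not change when the database is later updated, which is the repeatability property described above.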

Methodology and data analysis
In this study, we employed a meticulous approach to identify and analyze highly cited ML documents.
A total of 5,402 documents with a TC (total citations from the Web of Science Core Collection) of 100 or more were searched and retrieved from SCI-EXPANDED, covering the period from 1959 to 2022. The data used in our analysis were updated as of 18 May 2023. To conduct our analysis, we downloaded the full records of these highly cited documents from SCI-EXPANDED, along with the number of citations received in each year for each document. The downloaded data were then imported into Microsoft 365 Excel for further analysis. Manual coding was performed to enhance the data and extract relevant information (Li and Ho, 2008; Al-Moraissi et al., 2023).
Various functions available in Microsoft 365 Excel, such as Counta, Concatenate, Filter, Match, Vlookup, Proper, Rank, Replace, Freeze Panes, Sort, Sum, and Len, were utilized to process and analyze the data (Al-Moraissi et al., 2023). These functions enabled us to perform calculations, organize the data, and derive meaningful insights. Out of the initial 5,402 highly cited documents, we found 4,878 documents with a TC2022 of 100 or more, accounting for 90% of the initial set. This subset of documents formed the basis for our analysis. Additionally, to refine our search strategy and ensure the inclusion of relevant documents, we applied a filter known as the "front page" approach (Wang and Ho, 2011; Al-Moraissi et al., 2023). This approach involved considering the title, abstract, and author keywords as a filter for the search keywords in the Web of Science Core Collection's Topic (TS) field. As a result, we identified 4,851 documents (99% of the 4,878 documents) with the search keywords present in their "front page," establishing them as highly cited ML research publications.
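The two-stage selection (TC2022 threshold, then the "front page" filter) can be sketched as follows. The study itself used Microsoft 365 Excel; this Python version, its field names, and its plain substring matching are assumptions chosen for brevity.

```python
# A few of the search terms; the full list would be used in practice.
SEARCH_TERMS = ("machine learning", "machine learned", "machine learnt")


def on_front_page(record, terms=SEARCH_TERMS):
    """True if any search term occurs in the title, abstract, or author keywords.

    Keywords Plus is deliberately left out: it is part of the Topic field
    but not part of the "front page".
    """
    front_page = " ".join([
        record.get("title", ""),
        record.get("abstract", ""),
        " ".join(record.get("author_keywords", [])),
    ]).lower()
    return any(term in front_page for term in terms)


def select_highly_cited(records, threshold=100):
    """Keep records with TC2022 >= threshold whose front page matches."""
    return [r for r in records if r["tc2022"] >= threshold and on_front_page(r)]


# Hypothetical records for illustration.
records = [
    {"id": "A", "tc2022": 250, "title": "Machine learning for X",
     "abstract": "", "author_keywords": [], "keywords_plus": []},
    {"id": "B", "tc2022": 180, "title": "A survey of classifiers",
     "abstract": "We compare models.", "author_keywords": ["classification"],
     "keywords_plus": ["machine learning"]},
    {"id": "C", "tc2022": 60, "title": "Machine learning in Y",
     "abstract": "", "author_keywords": [], "keywords_plus": []},
]

selected_ids = [r["id"] for r in select_highly_cited(records)]
print(selected_ids)  # ['A']
```

Record B shows why the filter matters: it matches the Topic search only through Keywords Plus, so it is retrieved by the query but excluded from the final set.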
Figure 1 shows the representation for searching the highly cited machine learning publications in SCI-EXPANDED.

We obtained the journal impact factors (IF2022) from the Journal Citation Reports (JCR) published in 2022 to assess the impact of the journals where these highly cited publications were published. This information further contributes to our analysis of the influence and prestige of the respective journals.
By employing these rigorous methods and data analysis techniques, we aim to comprehensively and accurately examine highly cited ML research publications and their associated journal impact factors.
These findings serve as a valuable resource for understanding the prominence and impact of ML research in the scholarly community.

Authorship and affiliation handling
In the SCI-EXPANDED database, the designation "reprint author" is used to identify the corresponding author. However, for this study, we adopted the term "corresponding author" to refer to the author with whom correspondence regarding the publication can be made (Chiu and Ho, 2007).
It is important to note that in articles with unspecified authorship, single authors were considered both the first and corresponding authors (Ho, 2014b).
Similarly, for articles with unspecified corresponding institutions, the single institution listed was considered the first and corresponding-author institution (Ho, 2014b). In the case of articles from a single country, the country was classified as the first and corresponding-author country (Ho, 2014b). This approach ensures consistency and accuracy in assigning authorship and affiliations. For articles with multiple corresponding authors, institutions, and countries, all corresponding authors, their respective institutions, and countries were considered (Al-Moraissi et al., 2023). This allows for a comprehensive analysis that considers the contributions and affiliations of all relevant authors involved in the publication.
Additionally, a thorough verification process was conducted to address articles in the SCI-EXPANDED database where corresponding authors were listed with only addresses but not affiliation names. These articles were carefully examined, and the addresses were updated to include the corresponding affiliation names (Al-Moraissi et al., 2023). This step ensures that the affiliations of corresponding authors are accurately represented in our analysis. By employing these approaches to authorship and affiliation handling, we ensure that our analysis accurately represents the contributions of authors and their affiliated institutions, allowing for a comprehensive understanding of the research landscape in machine learning.

Affiliation classification and publication performance evaluation
In this study, we undertook a classification process for affiliations to ensure consistency and accuracy.
Affiliations from England, Scotland, Northern Ireland, and Wales were reclassified as being from the United Kingdom (UK) (Chiu and Ho, 2005). This consolidation allows for a more comprehensive analysis of the contributions from the UK. Moreover, affiliations initially listed as Yugoslavia were carefully checked and reclassified as being from Slovenia (Wambu et al., 2017). This adjustment ensures the correct attribution of publications to their respective countries and facilitates accurate evaluation. To evaluate the publication performance of countries and institutions, we applied six publication indicators, as outlined by Hsu and Ho (2014):
• TP (Total Number of Articles): This indicator represents the total number of articles published by a specific country or institution.
• IP (Number of Single-Country Articles or Single-Institution Articles): IP denotes the number of articles where the authors are from a single country (IPC) or a single institution (IPI).
• CP (Number of Internationally Collaborative Articles or Inter-Institutionally Collaborative Articles): CP signifies the number of articles resulting from international collaborations (CPC) or inter-institutional collaborations (CPI).
• FP (Number of First-Author Articles): FP refers to the number of articles in which an author from the country or institution is the first author.
• RP (Number of Corresponding-Author Articles): RP represents the number of articles in which an author from the country or institution is the corresponding author.
• SP (Number of Single-Author Articles): SP denotes the number of articles authored by a single author.
By utilizing these publication indicators, we gain valuable insights into the publication performance of countries and institutions, highlighting their level of collaboration, authorship patterns, and overall research output. This information aids in the assessment and comparison of their contributions to the field of machine learning.
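A minimal sketch of how the six indicators could be counted at the country level is given below, including the reclassification of England, Scotland, Northern Ireland, and Wales as the UK. The record fields and sample articles are hypothetical. The Yugoslavia reclassification is not automated here because the text describes it as a manual check.

```python
UK_PARTS = {"England", "Scotland", "Northern Ireland", "Wales"}


def normalize_country(name):
    """Map the four UK constituent countries to 'UK'; leave others unchanged."""
    return "UK" if name in UK_PARTS else name


def country_indicators(articles, country):
    """Count TP, IP (IPC), CP (CPC), FP, RP, and SP for one country."""
    out = {"TP": 0, "IP": 0, "CP": 0, "FP": 0, "RP": 0, "SP": 0}
    for a in articles:
        countries = {normalize_country(c) for c in a["countries"]}
        if country not in countries:
            continue
        out["TP"] += 1
        if len(countries) == 1:
            out["IP"] += 1   # single-country article
        else:
            out["CP"] += 1   # internationally collaborative article
        if normalize_country(a["first_author_country"]) == country:
            out["FP"] += 1
        if country in {normalize_country(c) for c in a["corresponding_countries"]}:
            out["RP"] += 1
        if a["n_authors"] == 1:
            out["SP"] += 1
    return out


# Hypothetical articles for illustration.
articles = [
    {"countries": ["England", "Scotland"], "first_author_country": "England",
     "corresponding_countries": ["Scotland"], "n_authors": 3},
    {"countries": ["USA", "Wales"], "first_author_country": "USA",
     "corresponding_countries": ["USA"], "n_authors": 2},
    {"countries": ["USA"], "first_author_country": "USA",
     "corresponding_countries": ["USA"], "n_authors": 1},
]

uk = country_indicators(articles, "UK")
usa = country_indicators(articles, "USA")
print(uk)
print(usa)
```

Note that the first article counts as a single-country (IP) article only after normalization; without it, an England and Scotland collaboration would be miscounted as international.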

Publication impact evaluation and Y-index
In addition to the six publication indicators, we also applied six citation indicators to evaluate the publication impact of countries and institutions (Ho and Mukul, 2021). One of the metrics used for evaluating the publication performance of authors is the Y-index. The Y-index, as defined by Ho (2012; 2014a), is denoted as the Y-index (j, h), where j is a constant related to the publication potential, determined by the sum of the first-author articles and corresponding-author articles. The parameter h is a constant associated with the publication characteristics and represents the polar angle indicating the proportion of corresponding-author articles to first-author articles.
The value of j indicates the contribution of the author as a first or corresponding author to the articles.
A higher value of j suggests a greater contribution by the author in terms of first-author and corresponding-author articles. The parameter h is interpreted as follows:
• h = π/2: This indicates an author who has solely published corresponding-author articles (j represents the number of corresponding-author articles).
• π/2 > h > π/4: This range signifies an author who has a higher proportion of corresponding-author articles compared to first-author articles (FP > 0).
• h = π/4: This value indicates an author with an equal number of first-author and corresponding-author articles (FP > 0 and RP > 0).
• π/4 > h > 0: This range signifies an author who has a higher proportion of first-author articles compared to corresponding-author articles (RP > 0).
• h = 0: This value indicates an author who has exclusively published first-author articles (j represents the number of first-author articles).
By applying the Y-index, we can evaluate the publication performance of authors, taking into account their contribution as both first authors and corresponding authors. The Y-index provides insights into the author's publication potential and the proportion of their contributions based on the types of articles they have published. This metric enables a comprehensive assessment of authors' publication impact, considering their roles and contributions to the scholarly literature in machine learning.
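The Y-index can be computed in a few lines. The sketch below assumes j = FP + RP and h = arctan(RP/FP), a form that is not written out explicitly above but that reproduces the listed cases and the (j, h) values reported later in the text, such as (9, 0.6747). The function name `y_index` is chosen for illustration.

```python
import math


def y_index(fp, rp):
    """Return (j, h) for FP first-author and RP corresponding-author articles.

    Assumed form: j = FP + RP and h = arctan(RP / FP). atan2 is used so that
    FP = 0 gives h = pi/2 without a division by zero; atan2(0, 0) is 0.0,
    which yields (0, 0) for authors with neither kind of article.
    """
    j = fp + rp
    h = math.atan2(rp, fp)
    return j, h


j, h = y_index(fp=5, rp=4)
print(j, round(h, 4))  # 9 0.6747

print(y_index(12, 12))  # equal counts: h is pi/4
print(y_index(0, 9))    # only corresponding-author articles: h is pi/2
print(y_index(8, 0))    # only first-author articles: h is 0.0
```

Two authors with the same h but different j lie on the same ray of the polar plot and differ only in publication potential.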

Results and Discussion
In this section, we present the findings and engage in a discussion of our bibliometric analysis study, which is centered on publications in the field of machine learning and their citation patterns. Through this analysis, we aim to uncover and examine the most highly cited publications within the field of machine learning. Through an exploration of citation trends and patterns, we can gain valuable insights into the influential works that have shaped the landscape of ML research. Citation analysis provides a quantitative measure of the impact and significance of scholarly publications. Similarly, by investigating the citation counts, we can identify the publications that have gained substantial attention and recognition within the academic community. Furthermore, analyzing the citation patterns allows us to identify the key works that have contributed to the advancement of ML research and have had a lasting impact on the field.
The subsequent sections of this study provide a detailed description of our bibliometric analysis approach and the findings derived from the analysis of the collected data. We present an overview of the top cited ML publications and discuss the implications of these findings for the field. Additionally, we explore the characteristics of these highly cited works, such as their authors, publication journals, and prevalent themes. By comprehensively examining the most cited publications, we aim to identify the influential authors, notable research directions, and emerging trends within the field of machine learning. These insights will contribute to the existing body of knowledge and guide researchers, practitioners, and decision-makers in understanding the influential works and the evolving landscape of ML research. In the following sections, we present our findings and discuss the bibliometric analysis results comprehensively, shedding light on the key contributions and trends within the highly cited ML publications.

Characteristics of document types
The approach described by Monge-Nájera and Ho (2017) to regenerate the characteristics of a document type involves utilizing two key metrics: the average number of citations per publication (CPPyear) and the average number of authors per publication (APP). This approach has been employed in the bibliometric analysis of highly cited articles published in SCI-EXPANDED, as documented by Ho and Shekofteh (2021) and Ho and Ranasinghe (2022). To apply this approach, we followed these steps:
• Define the document type: Specify the particular document type you wish to analyze based on the available data. It could be scientific articles, research papers, or another relevant category.
• Gather the necessary data: Obtain a dataset that provides the total number of citations (TCyear), the total number of publications (TP), and the total number of authors (AU) for each document of the chosen type. Ensure the dataset covers the desired time frame and corresponds to articles published in SCI-EXPANDED.
• Calculate the average number of citations per publication (CPPyear): Divide the total number of citations received up to the end of a given year (TCyear) by the total number of publications (TP). This calculation will yield the average number of citations per publication for the specified document type: CPPyear = TCyear / TP.
• Calculate the average number of authors per publication (APP): Divide the total number of authors (AU) by the total number of publications (TP) for the document type. This calculation will provide the average number of authors per publication: APP = AU / TP.
• Analyze the results: Examine the obtained values of CPPyear and APP to gain insights into the characteristics of the document type. A higher CPPyear suggests a greater average number of citations per publication, indicating a higher impact or visibility. Conversely, a higher APP signifies increased collaboration and multiple authorship within the document type.
• Contextualize with existing literature: Consider the findings of Monge-Nájera and Ho (2017), Ho and Shekofteh (2021), and Ho and Ranasinghe (2022) to compare and contextualize your results. This will enable validation of your regenerated characteristics and provide additional perspectives on the document type's attributes.
It is important to note that the specific methodologies and any additional factors accounted for in the aforementioned study steps may influence the precise results. Therefore, a thorough review and understanding of the methodologies employed in each related study are essential for meaningful comparisons and interpretations of any regenerated characteristics.
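The steps above can be sketched as a per-document-type summary. The field names and sample values are assumptions; the sketch also allows a document to carry more than one document type, which is why the shares of document types can add up to more than 100%.

```python
from collections import defaultdict


def summarize_by_type(docs):
    """Compute TP, share of all documents, CPP, and APP per document type."""
    acc = defaultdict(lambda: {"TP": 0, "TC": 0, "AU": 0})
    for d in docs:
        for doc_type in d["types"]:  # a document may have several types
            acc[doc_type]["TP"] += 1
            acc[doc_type]["TC"] += d["tc"]
            acc[doc_type]["AU"] += d["n_authors"]
    total = len(docs)
    return {
        t: {
            "TP": v["TP"],
            "share": v["TP"] / total,
            "CPP": v["TC"] / v["TP"],  # CPPyear = TCyear / TP
            "APP": v["AU"] / v["TP"],  # APP = AU / TP
        }
        for t, v in acc.items()
    }


# Hypothetical documents; "tc" stands for TC2022.
docs = [
    {"types": ["article"], "tc": 200, "n_authors": 4},
    {"types": ["article", "proceedings paper"], "tc": 300, "n_authors": 6},
    {"types": ["review"], "tc": 400, "n_authors": 2},
]

summary = summarize_by_type(docs)
for doc_type, row in summary.items():
    print(doc_type, row)
```

In this toy example the article share is 2/3 and the other two types are 1/3 each, so the shares sum to 4/3 because one document is counted under two types.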
In the analysis of SCI-EXPANDED, a total of 4,851 documents were found to contain search keywords in their "front page." These documents represent seven different document types specified in Table 1. Among the identified documents, there were 4,139 articles, accounting for 85% of the total. The average number of authors per publication (APP) for these articles was 5.8. Within the document types, reviews constituted 674 documents and had the highest CPP2022 value of 373. This high CPP2022 value could be attributed to a specific review titled "Gradient-based learning applied to document recognition" (Lecun et al., 1998), which had a total citation count (TC2022) of 24,111.
Among the 160 classic publications with a TC2022 of 1,000 or more (such as Long et al., 2014), 32 were categorized as reviews.Proceedings papers accounted for seven highly cited publications, followed by four editorial materials and one book chapter.
The CPP2022 value for reviews was found to be 1.3 times that of articles. By comparison, the corresponding ratios of review CPP to article CPP reported for highly cited medical-related documents, such as those on multiple sclerosis (Ho and Ranasinghe, 2022) and insulin resistance (Ho and Shekofteh, 2021), were lower, at 1.1 and 0.85, respectively. A total of 674 reviews were published across 382 journals. The journal "Expert Systems with Applications" published the most reviews, totaling 15. Additionally, it was observed that some documents could be categorized under multiple document types in the Web of Science Core Collection. For example, 180 proceedings papers, eight data papers, seven book chapters, and two retracted publications were also classified as articles. Therefore, the cumulative percentages in Table 1 may exceed 100% (Usman and Ho, 2020).
Furthermore, it is important to acknowledge that contributions can differ among various types of documents. Articles generally consist of sections such as introduction, methods, results, discussion, and conclusion. Based on this consideration, articles were chosen for closer analysis, specifically focusing on 4,139 highly cited ML articles published in English. It is worth mentioning that among these articles, one publication appeared in a bilingual journal, presenting content in both English and Estonian. This bilingual approach allowed for wider dissemination of the research findings to both English-speaking and Estonian-speaking audiences.

Characteristics of publication outputs
The study conducted by Ho and Shekofteh (2021) applied a correlation analysis between the annual number of highly cited articles (TP) and their CPPyear in the medical topic of multiple sclerosis. This analysis aimed to gain insights into the development trends and impacts of articles in this field. Figure 2 illustrates the distribution of these highly cited articles. Among the highly cited articles, a notable observation is that the highest number of articles, specifically 575, were published in 2018. Notably, it took about five years for the number of articles to reach its peak, which was considerably shorter than for the highly cited multiple sclerosis articles, which took 12 years to reach their peak. The field of ML has seen increased activity from authors and researchers. In 1959, a significant article titled "Some studies in machine learning using the game of checkers" (Samuel, 1959) had the highest CPP2022 value of 1,486. This suggests the influential impact of this article in the field of ML, which has gained prominence as a relatively new research area. These findings highlight the dynamics and growth of the field.
Similarly, in 1986, Farmer et al. introduced the article titled "The immune-system, adaptation, and machine learning," which garnered a high CPP2022 score of 886. The paper investigates the relationship between the immune system, adaptation, and machine learning. It explores how principles derived from the immune system, known for its vital role in safeguarding the body against pathogens, can be applied to the realm of machine learning. The authors aimed to harness the immune system's learning, memory, and pattern recognition capabilities to construct a dynamic model based on Jerne's network hypothesis. By combining concepts from immunology and ML, the research offers the potential for developing adaptive learning algorithms inspired by biological systems. This interdisciplinary approach provides valuable insights into the synergy between the immune system and ML, with the goal of advancing the field.

Web of Science Category and Journal
In 2022, the Journal Citation Reports (JCR) included a total of 9,510 journals that contained citation references across 178 categories in the SCI-EXPANDED section of Web of Science. Within these categories, there were 1,007 journals that published highly cited articles specifically related to machine learning. These articles were distributed among 152 Web of Science categories within the SCI-EXPANDED section. The largest share of the ML articles, 907 articles or 22% of the total of 4,139 articles, was published in the category of artificial intelligence computer science. Furthermore, 794 articles (19%) were published in the category of electrical and electronic engineering, 459 articles (11%) in information systems computer science, 429 articles (10%) in interdisciplinary applications computer science, and 300 articles (7.2%) in multidisciplinary sciences. It is evident that ML has gained significant attention and has been applied across a wide range of research fields. Considering the five journals with an impact factor (IF2022) of more than 100, the CA-A Cancer Journal for Clinicians (IF2022 = 254.7) had one article, the Lancet (IF2022 = 168.9) had two articles, the New England Journal of Medicine (IF2022 = 158.5) had two articles, the JAMA-Journal of the American Medical Association (IF2022 = 158.5) had five articles, and the BMJ-British Medical Journal (IF2022 = 105.7) had one article. These five journals were among the highest ranked in their respective categories, with the Lancet, the New England Journal of Medicine, the JAMA-Journal of the American Medical Association, and the BMJ-British Medical Journal securing the 1st, 2nd, 3rd, and 4th positions among 167 journals in the Web of Science category of general and internal medicine. Furthermore, the CA-A Cancer Journal for Clinicians ranked first not only in the category of oncology (241 journals) but also in the SCI-EXPANDED (9,510 journals).

Publication performances: countries and institutions
The significant contributions of two authors, namely the first author and the corresponding author, in a research article have been widely acknowledged (Riesenberg and Lundberg, 1990). Within the SCI-EXPANDED dataset, seven highly cited ML articles (0.17% of the total 4,139 highly cited articles) did not have affiliations listed. On the other hand, a total of 4,132 highly cited articles were published by authors affiliated with 102 different countries. Among these, 2,460 articles (60% of 4,132) were single-country articles published by authors from 58 countries, with a CPP2022 of 294. Additionally, 1,672 internationally collaborative articles (40%) were published by authors from 102 countries, with a CPP2022 of 268. The results indicated that internationally collaborative research had a slightly lower citation impact in the highly cited ML domain.
To compare the productivity of different countries, six publication indicators and six related citation indicators (CPP2022) were utilized (Ho and Mukul, 2021). Table 3 presents the findings for the top 15 productive countries, each having more than 100 highly cited articles. Notably, Egypt ranked 35th with 25 articles and emerged as the most productive country in Africa. Among the publication indicators, the USA led in all six categories: TP (1,932 highly cited articles, 47% of the total), IPC (1,045 articles, 42% of single-country articles), CPC (887 articles, 53% of internationally collaborative articles), FP (1,418 articles, 34% of first-author articles), RP (1,480 articles, 36% of corresponding-author articles), and SP (72 articles, 39% of single-author articles). When comparing the top 15 productive countries, France achieved the highest CPP2022 in three categories: TP (453), FP (595), and RP (570). Canada had the highest CPP2022 of 526 for IPC, while Japan had the highest CPP2022 of 580 for CPC. With four articles in the SP category, Australia attained the highest CPP2022 of 1,138. These findings shed light on the productivity and citation impact of different countries in the field of highly cited ML research.
According to Ho (2012), the institution of the corresponding author in a research article often represents either the study's home base or the paper's origin. In terms of institutions, 1,288 highly cited ML articles (31% of the total 4,132 articles) were attributed to single institutions, with a CPP2022 of 322. On the other hand, 2,844 articles (69%) were the result of institutional collaborations, with a CPP2022 of 266. These findings suggest that, among highly cited ML articles, inter-institutional collaboration was more common but was associated with a lower average citation impact than single-institution work.
Table 4 provides an overview of the top 15 productive institutions and their respective characteristics.
• Access to Industry: Being situated in close proximity to tech hubs like Silicon Valley (Stanford) and the Boston-Cambridge area (MIT), these universities have strong connections with industry leaders and startups. This proximity offers opportunities for collaborations, internships, and access to real-world datasets and challenges, enabling researchers to address practical problems and apply their findings to industry applications.
• Academic Reputation: MIT and Stanford have long-standing reputations for excellence in education and research. Their strong academic standing attracts top-tier students and researchers from around the world. The high-calibre talent pool and rigorous academic programs foster an environment conducive to producing impactful research outcomes.
These factors and a strong institutional commitment to research and innovation contribute to MIT and Stanford's success in early ML research and their ongoing prominence in the field.

Publication performances: authors
In the domain of highly cited articles related to ML, the average number of authors per publication (APP) was 5.8. The maximum number of authors in a single article was 346 (Abolfathi et al., 2018).
Out of the 4,139 articles with available author information, the majority, accounting for 65% of the total, were published by groups of two to five authors. Specifically, there were 782 highly cited articles (19% of the total) written by groups of 3 authors, 685 articles (17%) by groups of 4 authors, 657 articles (16%) by groups of 2 authors, and 549 articles (13%) by groups of 5 authors.
Table 5 presents
Out of the total 4,139 highly cited ML articles, a vast majority of 4,125 articles (99.7% of the total) included information about both the first author and corresponding author in SCI-EXPANDED. These articles were analyzed using the Y-index as a metric. The 4,125 highly cited ML articles involved a total of 18,234 authors. Among them, 13,399 authors (73% of the total) had no first-author or corresponding-author articles, resulting in a Y-index value of (0, 0). Additionally, 1,365 authors (7.5%) exclusively published corresponding-author articles, with an h value of π/2. Furthermore, 133 authors (0.73%) had more corresponding-author articles than first-author articles, with π/2 > h > π/4 (FP > 0). Meanwhile, 2,001 authors (11%) contributed an equal number of first-author and corresponding-author articles, resulting in an h value of π/4 (FP > 0 and RP > 0).
In contrast, 128 authors (0.70%) published more first-author articles than corresponding-author articles, with π/4 > h > 0 (RP > 0). Finally, 1,208 authors (6.6%) exclusively published first-author articles, yielding an h value of 0. These analyses were conducted to explore the authorship patterns and productivity of highly cited ML articles based on the Y-index.
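The six author groups above follow directly from comparing FP and RP. The sketch below classifies an author with integer comparisons (avoiding floating-point angle tests) and re-derives the percentages from the counts reported in the text; the function name and group labels are chosen for illustration.

```python
def classify_author(fp, rp):
    """Assign an author to a Y-index group from FP and RP counts."""
    if fp == 0 and rp == 0:
        return "(0, 0)"
    if fp == 0:
        return "h = pi/2"
    if rp == 0:
        return "h = 0"
    if rp > fp:
        return "pi/2 > h > pi/4"
    if rp == fp:
        return "h = pi/4"
    return "pi/4 > h > 0"


# Author counts per group as reported in the text.
group_counts = {
    "(0, 0)": 13399,
    "h = pi/2": 1365,
    "pi/2 > h > pi/4": 133,
    "h = pi/4": 2001,
    "pi/4 > h > 0": 128,
    "h = 0": 1208,
}

total_authors = sum(group_counts.values())
print(total_authors)  # 18234

for group, n in group_counts.items():
    print(f"{group}: {n} authors, {100 * n / total_authors:.2f}%")

print(classify_author(5, 4))   # pi/4 > h > 0
print(classify_author(9, 11))  # pi/2 > h > pi/4
```

The group counts sum to the 18,234 authors stated above, which is a quick consistency check on the reported distribution.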
The polar coordinate plot in Figure 3 shows the distribution of the Y-index (j, h) for the top 41 potential authors in highly cited ML research, where j ≥ 8. Each point on the plot represents the Y-index (j, h) of one or more authors. For instance, authors Chen, Kononenko, Wang, Chicco, Yuan, Ishibuchi, Zhang, and Wang have a Y-index of (8, π/4), while Chen and Koutsouleris have a Y-index of (9, 0.6747). Among these potential authors, Zhou has the highest publication potential in highly cited ML articles, with a Y-index of (24, π/4), followed by Pham, with a Y-index of (20, 0.8851). Authors Ramprasad, Deo, Chen, and seven others share the same j-value of 8, indicating that they have the same publication potential in highly cited ML research.
Authors Zhou (24, π/4), Onan (14, π/4), Blankertz (12, π/4), Hu (12, π/4), Jung (10, π/4), Verrelst (10, π/4), Schuld (10, π/4), and Zhang with seven other authors (8, π/4) are all located on the diagonal line representing h = π/4. This indicates that they share publication characteristics but differ in their publication potential. Zhou has the highest publication potential with a j-value of 24, followed by Onan with a j-value of 14, Blankertz and Hu with a j-value of 12, Jung, Verrelst, and Schuld with a j-value of 10, and Zhang with seven other authors with a j-value of 8. Similarly, authors Lee (9, π/2) and Ramprasad (8, π/2) are located on the y-axis representing h = π/2, indicating that they share the same publication characteristics. However, Lee has a greater publication potential than Ramprasad.
Note that the authorship analysis may be biased by different authors sharing the same name or by the same author publishing under different names over time, which can affect the accuracy of the findings (Chiu and Ho, 2007).

The top ten most frequently cited articles in machine learning research
The total citations (TC) of articles are periodically updated in the Web of Science Core Collection.
To improve the accuracy of bibliometric studies, the total number of citations from the Web of Science Core Collection from the publication year until the end of 2022 (TC2022) was used to minimize bias, as suggested by Wang et al. (2011). Among the analyzed articles, 908 articles (22% of 4,139 articles), 3,375 articles (82% of 4,117 articles with abstracts in SCI-EXPANDED), and 1,511 articles (51% of 2,984 articles with author keywords in SCI-EXPANDED) contained search keywords in their title, abstract, and author keywords, respectively.
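To make the citation indicators concrete, the sketch below derives TC2022 from a per-year citation history. It also computes C2022, which is used later in this section and is assumed here to mean the citations received in 2022 alone. The citation counts are invented for illustration and do not describe any real article.

```python
def total_citations(citations_by_year, publication_year, end_year=2022):
    """TC_year: citations from the publication year through end_year."""
    return sum(
        count
        for year, count in citations_by_year.items()
        if publication_year <= year <= end_year
    )


def citations_in_year(citations_by_year, year=2022):
    """C_year: citations received in a single year (assumed definition)."""
    return citations_by_year.get(year, 0)


# Invented citation history for a hypothetical article published in 2019.
history = {2019: 3, 2020: 12, 2021: 30, 2022: 41, 2023: 55}

tc2022 = total_citations(history, publication_year=2019)
c2022 = citations_in_year(history)
print(f"TC2022 = {tc2022}, C2022 = {c2022}")
```

Because citations after 2022 are excluded, TC2022 stays fixed even as the running total in the database keeps growing.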
Table 6 presents the top ten most frequently cited articles in ML research. Among these articles, one had search keywords in its title, nine in their abstracts, and two in their author keywords. One of the most frequently cited and referred-to articles concerning the paper "Statistical Comparisons of Classifiers over Multiple Data Sets" is the extension by Garcia and Herrera (2008). Building upon the foundation laid by earlier research, the authors extended the scope of statistical comparisons to encompass pairwise evaluations among all classifiers. They also introduced novel statistical tests and procedures designed to assess performance variations among classifiers across many datasets. This extension enhances the previous framework by delivering a more detailed analysis of classifier performance.
The article contributes to the field of ML by addressing the need for rigorous statistical comparisons when evaluating and selecting classifiers. By considering all possible pairwise comparisons, the authors provide a comprehensive approach to assessing the relative performance of classifiers on diverse datasets. This extension has practical implications for researchers and practitioners in ML, as it offers a robust methodology for making informed decisions about classifier selection and performance evaluation. Overall, Garcia and Herrera's study presented a valuable extension to the existing framework for comparing classifiers over multiple datasets, enhancing the statistical analysis and providing a more comprehensive understanding of classifier performance.
In the course of our bibliometric analysis of ML, we identified five influential and widely cited articles within the field. These articles, published in various years, have played a crucial role in advancing ML research and have significantly impacted the field. Although they may not be among the top ten most frequently cited, they have contributed substantially and shaped the machine learning landscape. The following subsections give brief details of these articles and their notable contributions.
Scikit-learn: Machine learning in Python (Pedregosa et al., 2011)
This article has attracted considerable attention and citations since its publication in 2011. It introduced the scikit-learn library, a powerful and widely used ML toolkit in Python. The article presented a comprehensive overview of the library's capabilities, providing researchers and practitioners with a valuable resource for developing ML models and experiments. Its impact lies in enabling widespread adoption and facilitating the development of ML algorithms and applications. The article was published by 16 authors from ten institutions in France, Japan, Germany, the USA, and the UK, with a C2022 of 7,359 (rank 1st in highly cited ML research) and a TC2022 of 29,958 (rank 1st). It is not only the most frequently cited but also the most impactful article in ML research. Its citations increased sharply after publication, taking it to the top of ML research.
Dropout: A Simple Way to Prevent Neural Networks from Overfitting (Srivastava et al., 2014)
The article was published by five authors from the University of Toronto in Canada, with a C2022 of 3,587 (rank 4th) and a TC2022 of 19,590 (rank 3rd).
Random forests (Breiman, 2001)
The article was published by Breiman from the University of California, Berkeley in the USA, with a C2022 of 4,689 (rank 2nd) and a TC2022 of 14,127 (rank 4th). Figure 4 shows an increasing citation trend after its publication, rising sharply in recent years to reach second place in ML research.
Breiman's article introduced the concept of random forests, a powerful ensemble learning method.
Random forests combine multiple decision trees to create a robust and accurate prediction model.
This article significantly influenced the ML field by presenting a scalable and efficient approach for handling complex datasets. Random forests have since become a widely adopted and successful technique in various domains, showcasing their practical relevance and impact.
Though not among the most frequently cited, these five articles have made significant contributions to the field of ML. Their impact lies in introducing innovative techniques, frameworks, and evaluation methods, which have paved the way for further research and advancements. Through their notable contributions, they have shaped the landscape of ML and continue to influence the work of researchers and practitioners in the field.

Strengths and limitations of the study
In our current investigation, we have taken several steps to mitigate potential limitations. Firstly, we have made efforts to minimize data bias by ensuring that our selection of publications is not skewed towards any particular source, publication type, or author. This safeguards the generalizability of our analysis results. Additionally, we have addressed the issue of publication lag by including recent publications to avoid temporal bias. Notably, our dataset exclusively comprises publication records extracted from the Clarivate Analytics Web of Science Core Collection, ensuring the use of high-quality publications and citations and thus reducing the risk of omitting relevant highly cited and impactful ML publications due to limitations in data sources or incomplete records.
However, it is crucial to acknowledge that achieving a completely flawless study is a challenging task, and our work is not exempt from potential limitations. One possible limitation is citation self-selection: authors tend to cite works that align with their own research, which may lead to an overestimation of certain influential papers. Furthermore, the evolving nature of the machine learning field, along with changes in terminology and subfields, may affect the identification of highly impactful papers across different decades. Interdisciplinary collaboration is common in ML research, making it difficult to categorize papers within traditional boundaries, which can occasionally lead to misclassifications.
The assessment of impact and significance is inherently somewhat subjective, as different scholars may have varying criteria for defining groundbreaking research. Additionally, our analysis may not fully account for temporal biases in citation patterns, such as the natural tendency for older papers to accumulate more citations over time. Lastly, our study's focus on "highly cited" and "impactful" publications might inadvertently exclude important yet less-cited works that have exerted a significant influence on the field. We recognize that these limitations provide opportunities for future research, and we hope that our findings will inspire and inform further exploration in this evolving area of research.

Conclusion and Future Research Directions
In this bibliometric exploration, we carried out a comprehensive analysis of highly cited and high-impact ML publications. Using bibliometric analysis techniques, we sought to uncover key trends, influential authors, top journals, and significant themes within this thriving field.
Our analysis shed light on the landscape of ML research, providing valuable insights into the direction of its progress. By examining citation patterns, co-authorship networks, and publication trends, we identified the most influential research papers and collaborations that have shaped the field of ML research. Moreover, we gained a deeper understanding of the key research areas, methodologies, and applications that have garnered substantial attention and citations within the research community, particularly among ML researchers.
The findings of our study contribute to the existing body of knowledge on ML research and its applications across a wide range of fields. They offer valuable insights for researchers, practitioners, and decision-makers in identifying seminal works and understanding the evolving landscape of the field. These insights can inform future research directions, foster collaborations, and guide the development of innovative approaches and methodologies within the field of ML.
While this bibliometric exploration has provided significant insights, several avenues for future research can further enhance our understanding of highly cited and high impact ML publications.
Some potential directions for future research include:
• Fine-grained analysis: Conducting a more granular analysis of specific subfields within ML to uncover unique trends and influential publications within each domain.
• Citation analysis over time: Examining the temporal dynamics of citations to identify emerging research trends, changes in citation patterns, and the evolution of highly influential works in ML.
• Collaboration analysis: Investigating the collaborative networks and patterns of co-authorship within highly cited publications to identify key research groups and their contributions to the field.
• Cross-disciplinary analysis: Exploring the interdisciplinary nature of ML research by analyzing citations and collaborations across multiple domains, such as finance and natural language processing.
• Text and topic modeling: Applying text mining and topic modeling techniques to extract and analyze the main themes, research topics, and emerging concepts within highly cited ML publications.
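As one illustration of the collaboration analysis listed above, co-authorship links can be counted from author lists using only the Python standard library. This is a minimal sketch with invented author names, not the procedure used in this study.

```python
from collections import Counter
from itertools import combinations


def coauthor_pairs(author_lists):
    """Count how many articles each pair of authors has written together."""
    pairs = Counter()
    for authors in author_lists:
        # Sort and de-duplicate so (A, B) and (B, A) count as the same pair.
        for pair in combinations(sorted(set(authors)), 2):
            pairs[pair] += 1
    return pairs


# Invented author lists for three hypothetical articles.
articles = [
    ["Author A", "Author B", "Author C"],
    ["Author B", "Author A"],
    ["Author C", "Author D"],
]

pairs = coauthor_pairs(articles)
for (first, second), n in pairs.most_common(3):
    print(first, "-", second, ":", n)
```

The resulting pair counts could serve as edge weights in a co-authorship network, although real data would first need author-name disambiguation.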
By pursuing these research directions, we can better understand the landscape of highly cited and high-impact ML publications. This knowledge will further facilitate advancements in the field and contribute to developing cutting-edge ML techniques and applications.

Conflicts of interest/Competing interests -Not applicable.
Funding -Not applicable.
Ethics approval -Not applicable.
Consent to participate -Not applicable.
Consent for publication -All authors agreed to the publication of the paper.
TP: total number of highly cited articles; %: percentage of articles in all highly cited machine learning articles; IF2022: journal impact factor in 2022; APP: average number of authors per article; CPP2022: average number of citations per paper (TC2022/TP).
TP: number of total highly cited articles; TP R (%): total number of articles and the percentage of total articles; IPC R (%): rank and percentage of single-country articles in all single-country articles; CPC R (%): rank and percentage of internationally collaborative articles in all internationally collaborative articles; FP R (%): rank and the percentage of first-author articles in all first-author articles; RP R (%): rank and the percentage of corresponding-author articles in all corresponding-author articles; SP R (%): rank and the percentage of single-author articles in all single-author articles; CPP2022: average number of citations per publication (CPP2022 = TC2022/TP); N/A: not available.

Figure 1. Schematic for searching the highly cited machine learning publications in SCI-

Figure 2. Number of highly cited machine learning articles and their average number of citations per publication.
patterns within the medical topic of multiple sclerosis and the role of ML in shaping the research landscape.

A recent study by Ho (2021) proposed using the average number of citations per publication (CPPyear) and the average number of authors per publication (APP) as key indicators to characterize journals within a specific research topic. Table 2 presents information on the top 11 most productive journals that published 40 or more highly cited articles, including their journal impact factors, CPP2022, and APP. The Journal of Machine Learning Research, with an impact factor of 5.177 in 2021, published the highest number of highly cited articles, amounting to 88. These articles accounted for only 2.1% of the total 4,139 highly cited articles, indicating that ML is a relatively new research topic in various fields. Among the top 11 productive journals, the highly cited ML articles published in the Journal of Machine Learning Research achieved the highest CPP2022, reaching 1,041. Interestingly, three of the top ten most cited articles, authored by Pedregosa et al. (2011), Srivastava et al. (2014), and Demsar (2006), were published in the Journal of Machine Learning Research. In contrast, the journal Expert Systems with Applications (IF2022 = 8.5) had a significantly lower number of highly cited articles, with only 190. The average number of authors per publication (APP) varied across journals, ranging from 11 in Nature Communications to 3.1 in Expert Systems with Applications.
Among them, five institutions were located in the United States, three in the United Kingdom, two in China, and one each in Canada and Singapore. Notably, Zagazig University in Egypt, the University of KwaZulu-Natal in South Africa, and Cairo University in Egypt emerged as the most productive institutions in Africa, each publishing six highly cited articles and ranking 360th in the dataset. The Massachusetts Institute of Technology (MIT) in the United States led in four of the six publication indicators. It achieved a TP (Total Productivity) of 127 highly cited articles, accounting for 3.1% of the total 4,132 highly cited articles. Additionally, MIT had a CPI (Collaboration Productivity Index) of 102 articles, representing 3.6% of the 2,844 inter-institutionally collaborative articles. In terms of FP (First-author Productivity), MIT contributed 70 articles, making up 1.7% of the 4,132 first-author articles. Similarly, MIT had an RP (Corresponding-author Productivity) of 75 articles, which accounted for 1.8% of the 4,124 corresponding-author articles. In contrast, Stanford University in the United States secured the top position in terms of the Institutional Productivity Index (IPI) with 27 articles, accounting for 2.1% of the total 1,288 single-institution articles. The University of Wisconsin, also in the United States, emerged as the leader in Single-author Productivity (SP) with 47 articles, representing 3.2% of the 186 single-author articles.
Among the top 15 productive institutes listed in Table 4, the University of Washington in the United States demonstrated outstanding performance. It achieved a TP of 53 articles and a CPI of 47 articles, with the highest CPP2022 values of 939 and 1,038 for TP and CPI, respectively. However, the University of Toronto in Canada excelled in terms of the Institutional Productivity Index (IPI), First-author Productivity (FP), and Corresponding-author Productivity (RP). With an IPI of six articles, an FP of 20 articles, and an RP of 23 articles, the University of Toronto had CPP2022 values of 3,494, 1,321, and 1,173 for IPI, FP, and RP, respectively. Furthermore, the University of California, Berkeley in the United States, with three articles in the Single-author Productivity (SP) category, achieved an exceptional CPP2022 of 4,804.
It is noteworthy that both MIT and Stanford University are renowned for their contributions to research, including early ML research. Several factors contribute to their reputation in this field:
• Strong Faculty: MIT and Stanford have attracted and cultivated world-class faculty members in various disciplines, including computer science and artificial intelligence. These faculty members often conduct groundbreaking research and attract top talent, contributing to the overall research excellence of these institutions.
• Research Funding: Both institutions have a history of securing substantial research funding, which allows researchers to pursue ambitious projects. Adequate funding provides the resources needed for conducting experiments, accessing data, and developing innovative algorithms, giving them a competitive edge in producing impactful research.
• Collaborative Environment: MIT and Stanford foster a collaborative research environment, encouraging interdisciplinary collaboration among researchers, students, and industry partners. This promotes knowledge sharing and the cross-pollination of expertise from different fields, leading to innovative solutions and breakthroughs.
Table 5 presents the top 15 productive authors who have contributed 15 or more highly cited ML articles. Among them, Muller emerged as the most productive author, with 35 highly cited articles, including one as the first author and 15 as the corresponding author. Zhou published 21 articles, of which 12 were as the first author. Additionally, Onan published seven articles, five of them single-author articles. Among the 15 productive authors, Zhou achieved the highest CPP2022 (average number of citations per publication) values across all categories, with 461 for all highly cited articles, 361 for first-author articles, and 491 for corresponding-author articles. Only three of the top 15 authors, Zhang, Von Lilienfeld, and Liu, had single-author articles. Notably, eight of the 15 productive authors, including Zhou, Pham, Zou, Zhang, Muller, Liu, Bui, and Von Lilienfeld, were recognized as top authors in terms of publication potential, as evaluated by the Y-index.
However, they exhibit different publication characteristics. Ramprasad has published only eight corresponding-author articles with an h-value of π/2, while Deo has published more corresponding-author articles than first-author articles with an h-value of 1.030. Chen and the other seven authors have an equal number of first-author and corresponding-author articles, resulting in an h-value of π/4. Manavalan has published more first-author articles than corresponding-author articles, with an h-value of 0.5404.

Figure 3. Top 41 authors with Y-index (j ≥ 8)
In their publication, "An extension on 'Statistical Comparisons of Classifiers over Multiple Data Sets' for all pairwise comparisons," Garcia and Herrera (2008) introduced an extension to a previously established statistical framework for evaluating the performance of multiple classifiers across various datasets. This article appeared in the Journal of Machine Learning Research, volume 9, pages 2677-2694.
The Dropout article (Srivastava et al., 2014) had a C2022 of 3,587 (rank 4th) and a TC2022 of 19,590 (rank 3rd). This article introduced the concept of dropout regularization in deep learning. Dropout is a technique that mitigates overfitting by randomly dropping units during training. The article demonstrated the effectiveness of dropout in improving generalization and preventing overfitting in deep neural networks. This contribution has become a cornerstone in machine learning, enabling the training of more robust and accurate deep-learning models.

Figure 4. The citation histories of the ten highly cited machine learning articles

Table 1. Citations and authors according to document type.
TP: number of publications; AU: number of authors; APP: average number of authors per publication; TC2022: the total number of citations from Web of Science Core Collection since publication year to the end of 2022; CPP2022: average number of citations per publication (TC2022/TP).

Table 2. The top 11 most productive journals with 40 highly cited articles or more.

Table 3. Top 15 productive countries with more than 100 highly cited articles.