Introduction

Scientific-purpose software as a non-traditional academic output

Software is an essential component in the ecosystem of modern Science, particularly in those disciplines that follow a data-driven paradigm, guided by the ongoing generation, availability, and consumption of high volumes of scientific data (Hey, Tansley & Tolle 2009; Li & Yan, 2018). Today, software is used in all stages of academic work (Howison et al., 2015), from annotating preliminary ideas to processing large volumes of data or disseminating research results. Among the vast amount of software available in the scientific endeavor, we can distinguish between general-purposeFootnote 1 and scientific-purpose software.

Scientific-purpose applications consist of software explicitly designed to assist with non-trivial scientific tasks (for example, VOSviewer,Footnote 2 CitNet ExplorerFootnote 3 or Gephi,Footnote 4 to name a few). Scientific software can play important roles in processes related to data collection, management, formatting, analysis, modelling, simulation, prediction, visualization, and dissemination (Howison et al., 2015; Pan et al., 2017), becoming essential in the scientific discovery process (Pradal et al., 2013). Thus, scientific software has a direct effect on the validity of scientific results, since replacing the software could in turn lead to replacing an underlying procedure or logical assumption (Hannay et al., 2009; Howison & Herbsleb, 2011; Li et al., 2017; Yang et al., 2018).

This rising importance of software in the scientific process has prompted the perception of scientific-purpose software as a research product in its own right. Research funding agencies are increasingly funding the development of scientific-purpose software (Howison et al., 2015), as well as recognizing software creation as a valid outcome in some grant applications (Piwowar, 2013); examples include the U.S. National Science Foundation (NSF)Footnote 5 and the U.K. Research Excellence Framework (REF)Footnote 6 (Pan et al., 2018).

Tracking citations to scientific-purpose software

While it is commonly accepted that there is no need to mention/cite general-purpose software (Pan et al., 2019), it is recommended that scientific publications using scientific-purpose software should mention it (Niemeyer et al., 2016). The main reasons to encourage the citation of scientific-purpose software include credit allocation, reproducibility, transparency, and discovery (Smith, Katz, & Niemeyer 2016).

Previous literature has provided a significant body of knowledge about the lack of formal mentions of software in scientific publications. For example, Howison and Bullard (2016) found that between 31% and 43% of textual mentions of software also involved formal citations. Pan et al. (2016) discovered that more than 30% of the software mentions in articles published in PLoS ONE in 2014 received no formal citations. Park and Wolfram (2019) found that research software was rarely cited in Clarivate Analytics’ Data Citation Index (DCI). This under-citedness of software varies both by discipline (Pan et al., 2016) and by the nature (commercial or freeware) of the software (Howison & Bullard, 2016; Pan et al., 2019). Moreover, the mentions of software often lack sufficient information about the software employed (e.g., version, access, crediting information, etc.) (Howison & Bullard, 2016).

Although the academic community has initiated diverse actions, such as proposing best practices for software citation (Hafer & Kirkpatrick, 2009; Howison & Bullard, 2016; Niemeyer et al., 2016), working groups (e.g., the FORCE11 Software Citation Working Group,Footnote 7 the FORCE11 Software Citation Implementation Working GroupFootnote 8 and the WSSSPE Software Credit Working Group; Katz et al., 2016), publisher guidelines (e.g., the American Astronomical Society Policy Statement on SoftwareFootnote 9) or informal statements, such as the Science Code ManifestoFootnote 10 and The Research Software Impact Manifesto,Footnote 11 the diversity of ways of referring to software and the still pending standardized citation guidelines (e.g., standardized citation styles and publishers sometimes contradict each other) make citation counts a limited metric for the proper traceability of scientific-purpose software (Pan et al., 2019).

Textual approaches to track academic software usage

Since citation metrics have shown only a limited applicability to measure software usage in academic settings, it becomes necessary to establish alternative methods to measure the usage of scientific-purpose software, and to obtain evidence about its influence and impact (Hannay et al., 2009; Pan et al., 2018). Thus, the identification of textual mentions of software in the text of scientific papers has been a quite common approach to capture the impact of software on science (Pan et al., 2016).

Different efforts have been made to measure scientific software text-mention patterns in publications at different levels: a) disciplines, such as Biology (Howison & Bullard, 2016; Yang et al., 2018) and Library and information sciences (Pan et al., 2019); b) multidisciplinary journals, such as PLoS ONE (Pan et al., 2015, 2016); c) programs stored in software repositories (Thelwall & Kousha, 2016); and d) specific software applications, such as Geant4 toolkit (Pia et al., 2009), R packages (Li & Yan, 2018; Li et al., 2017, 2019), and bibliometric mapping software (Citespace, VOSviewer and Histcite) (Pan et al., 2017, 2018).

Tracking the use of software via text-mentions introduces some methodological challenges, which might limit the identification of software names in large texts (Du et al., 2021). First, there may be different ways to invoke the same software: a software project name (e.g., in GitHub), the URL of the software’s official website, the URL of the repository where it is hosted, mentions of unpublished manuscripts about the software, users’ manuals, etc. In addition, we can find synonyms or even translations into other languages. Consequently, textual mentions are highly polymorphous (Cronin et al., 1988). Second, common words used as software names may also represent other objects, due to the polysemy of textual mentions. Third, complex software applications might have parts (specific packages, modules, etc.) that complicate their clear identification through text-mentions.

Using software website URLs as traceable objects

One way to limit this complexity and to operationalize the tracking of software is to consider the URL of the software website as the traceable object. The mentioning of URLs has been extensively studied in the webometrics field as an established technique to measure the online importance and impact of websites (Orduna-Malea & Alonso-Arroyo, 2017; Park & Thelwall, 2003; Thelwall, 2004). Likewise, the URL also stands out as a central piece for altmetric studies, as this universal identifier is often used as a digital object (mainly via DOI URLs) representing research publications mentioned on online social media platforms (Wouters et al., 2019).

Tracking URLs presents fundamental advantages over the tracking of texts. First, a URL provides a unique and unequivocal element to identify the software. For example, the text ‘vosviewer.com’ can only refer to the software VOSviewer, thus reducing polymorphism and polysemy. Second, the URL is an actionable element that allows users to navigate from the source document (the document where the software is mentioned) to the target (the document published/hosted in the software’s website), establishing and making explicit a relation between these documents. Third, considering that mentioning a URL is more time-consuming than simply mentioning the software, it can be argued that this action might be related to a more conscious informational purpose (e.g., sources transparency, facilitating resources for readers, etc.), where links are oriented to navigational issues (Halavais, 2008). Fourth, the URL not only univocally represents the digital object (in this case, the software) but also represents the whole website (and all contents hosted inside) where the software is available. For a website, a wide range of metrics (e.g., traffic, visibility, size, etc.) is available, which can report information related to the consumption of and interest in the software.

VOSviewer as a case study

The free bibliometric software VOSviewerFootnote 12 is analyzed as a case study. VOSviewer was developed by Nees Jan van Eck and Ludo Waltman at Leiden University's Centre for Science and Technology Studies (CWTS). The application was launched in 2010, and formally introduced through a software paper –a scientific publication describing and analyzing the software (Smith, Katz & Niemeyer 2016)– published in Scientometrics (Van Eck & Waltman, 2010). This publication is the most cited article in the journal, according to both Scopus (1,621 citations) and Web of Science-all databases (1,431), as of August 2020.

Its ease of use and multiple features (including specific clustering and natural language processing techniques) made VOSviewer popular not only in the Scientometrics community but also in other disciplines where science maps are used. Given its simplicity as a software product (code and related material are all centralized and available on a website), the broad interest in it, its diverse audience, and its relevance to the Scientometric research community, VOSviewer constitutes an excellent case study to test the proposed approach and to support an illustrative discussion like the one presented in this paper.

Aim of the study and analytical framework

The objective of this study is to illustrate a webometric and altmetric method to determine the use of and interest in scientific-purpose software, taking VOSviewer as a case study. The main purpose of this work is to design an analytical framework aimed at studying scientific software’s impact metrics by collecting large amounts of data from multiple online data sources. This analytical framework will be made explicit by means of an evaluation technical sheet, which will include a wide list of URL-based metrics specifically and formally defined to measure the use of scientific software. This analytical framework is based on the mentioning of the software’s main URL across different scenarios (academic publications, web at-large, and social media), each of which is operationalized by an online data source (Google Scholar, Majestic, and Twitter, respectively).

In Appendix 1 we describe more specifically the three data sources chosen for this study. The rest of the paper is structured as follows: Sect. 2 describes the methodological approach, Sect. 3 presents the main results, Sect. 4 discusses them, and Sect. 5 offers the main conclusions.

Methodological approach

We analyze the mentioning of the official URL of VOSviewer (www.vosviewer.com) in three scenarios: academic publications, web at-large, and social media. Each scenario is characterized by the following five elements (Fig. 1):

  • Source: the database where the use of the software is tracked (e.g., Majestic, Twitter, etc.).

  • Object: an element that represents the software (e.g., the software name, the software website).

  • Event: an action carried out by users or applications that triggers the appearance or invocation of the object (e.g., a software name’s text-mention, a link to the software website).

  • Tracking metrics: a measure which captures the use of the object through a certain event (e.g., the number of software name’s text mentions).

  • Context metrics: a measure which captures characteristics of the users that generate the events (e.g., users’ gender, location, language, etc.).
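The five elements above can be read as the fields of a small record type. The following is only a minimal sketch in Python: the class name `Scenario`, its field names, and the wording of the event strings are illustrative assumptions, while the scenario names, sources, and tracked object are those named in this paper.

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    """One measurement scenario (hypothetical container, for illustration only)."""
    name: str            # e.g. "academic publications"
    source: str          # database where the use of the software is tracked
    tracked_object: str  # element that represents the software
    event: str           # action that triggers the appearance of the object
    tracking_metrics: list = field(default_factory=list)  # filled in later
    context_metrics: list = field(default_factory=list)   # filled in later


# The three scenarios of the study; all of them track the same object.
SCENARIOS = [
    Scenario("academic publications", "Google Scholar", "vosviewer.com",
             "URL mention in a publication"),
    Scenario("web at-large", "Majestic", "vosviewer.com",
             "URL mention in a webpage"),
    Scenario("social media", "Twitter", "vosviewer.com",
             "URL mention in a tweet"),
]
```

Keeping the object fixed across scenarios while the source and event vary is what allows the same metric definitions to be applied to publications, webpages, and tweets.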

Fig. 1 Scenarios for software use measurement: academic literature (Google Scholar), websites (Majestic) and social media (Twitter). AF = Analytical Framework; S1AF1 = Analytical Framework 1 belonging to Scenario 1

URL mentions in academic literature via Google Scholar

Academic publications containing the URL string “vosviewer.com” were retrieved from Google Scholar using the Publish or Perish v7 softwareFootnote 13 (patent documents included). To do this, the direct query “vosviewer.com” was performed, excluding URL prefixes (https and www) to improve recall as much as possible. The search was not limited to a specific period in order to retrieve all publications in Google Scholar, regardless of the year of publication, language or document type.

The search returned 1,190 records as of 31 March 2020. These records were subsequently exported including the following bibliographic fields: publication author(s), publication title, publication year, publication source, publication URL, and the number of citations received. Due to the unsupervised indexing process carried out by Google Scholar, bibliographic errors were found (Orduna-Malea, Martín-Martín & Delgado López-Cózar 2017), including multiple records for the same publication and incomplete/erroneous authorship.Footnote 14 To solve this limitation, the data were manually cleansed and multiple copies were merged, yielding a final set of 988 records. The document type (journal article, book, book chapter, conference proceeding, working papers and reports) was directly determined from the information contained in the publication source field. All dubious cases were manually checked.

The publication URL was used to manually access each of the publications. The search functionalities of web browsers (for HTML publications) and Adobe Acrobat (for PDF publications) were used to locate each URL mention in each of the publications, according to the available full text format. The publication section where the string “vosviewer.com” appeared (introductory sections, method, conclusions, or references) was manually checked and annotated throughout this process. When no clear structure (IMRaD type) was found, the category ‘unstructured’ was assigned.
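In this study the location of each URL mention was checked manually. As an automated analogue, the per-section counting could be sketched as follows; the function name, the section labels, and the sample texts are hypothetical, and the pattern is a simple substring match that would also match longer host names ending in the same string.

```python
import re

# Matches the domain with or without a scheme or "www." prefix, ignoring case.
URL_PATTERN = re.compile(r"(?:https?://)?(?:www\.)?vosviewer\.com", re.IGNORECASE)


def count_mentions_by_section(sections):
    """Return {section label: number of URL mentions} for one publication.

    `sections` maps a section label (e.g. "method", "references") to its text.
    """
    return {label: len(URL_PATTERN.findall(text)) for label, text in sections.items()}


# Invented sample texts, for illustration only.
example = {
    "method": "Maps were built with VOSviewer (https://www.vosviewer.com).",
    "references": "VOSviewer manual. www.vosviewer.com; see also VOSviewer.com",
    "conclusions": "No link here.",
}
counts = count_mentions_by_section(example)
```

A publication whose text has no recognizable section labels would simply be passed as a single entry, mirroring the ‘unstructured’ category used above.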

During this process, full text access was not possible for 51 publications. No URL mentions were found for 32 publications (due to Google Scholar parsing errorsFootnote 15), and document typology could not be properly determined for 18 publications (due to lack of information on the full texts available), thus resulting in a final set of 887 publications mentioning ‘vosviewer.com’.
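The record accounting described above can be checked with a few lines of arithmetic. The variable names are illustrative; the figures are those reported in the text.

```python
# Figures reported in the text above.
retrieved = 1190       # records returned by Google Scholar
after_cleaning = 988   # after manual cleansing and merging of duplicates
no_full_text = 51      # full text not accessible
no_url_found = 32      # URL absent (Google Scholar parsing errors)
unknown_type = 18      # document type could not be determined

# Records dropped or merged during cleansing.
merged_or_removed = retrieved - after_cleaning

# Publications remaining after the three exclusions.
final_set = after_cleaning - no_full_text - no_url_found - unknown_type
```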

URL mentions in webpages via Majestic

Link data related to the VOSviewer website was gathered from MajesticFootnote 16 through the site explorer feature (historic index). To do this, the direct root domain query “vosviewer.com” was carried out, obtaining a total of 17,261 mentioning webpages belonging to 1,330 distinct websites as of 12 April 2020. To characterize those webpages including a URL mention of “vosviewer.com”, additional web metrics related to each of these webpages were also directly obtained from Majestic (see Table 1 for a detailed description of these metrics), including the IP address, website language, and flow metrics (Trust Flow and Citation Flow)Footnote 17 (Jones, 2012). These flow metrics are meant to capture some idea of the “prestige” or reputation of the linking URLs.Footnote 18

Table 1 Categories used to typify the purpose of those tweets mentioning the software’s URL (vosviewer.com)

Finally, each website was categorized. To this end, a bottom-up process based on a previous work oriented to the categorization of scientific-related websites (Orduna-Malea 2021) was carried out. First, each website was accessed and manually classified according to its functional nature. A total of 46 academic-related categories were identified, whilst all non-academic related websites (casinos, adult content, etc.) were discarded. Second, an external researcher with expertise in website classification was asked to carry out an inter-coder reliability test on a random sample of 10% of the academic-related websites. The percentage of agreement achieved was 80%, and the Krippendorff's alpha (nominal) achieved was 0.92, which is considered acceptable.
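For reference, both reliability measures can be computed directly from the two coders' labels. The sketch below assumes exactly two coders, nominal categories, and no missing labels; the function names and the toy labels are hypothetical and are not the data of this study.

```python
from collections import Counter


def percent_agreement(coder_a, coder_b):
    """Share of units to which both coders assigned the same category."""
    return sum(x == y for x, y in zip(coder_a, coder_b)) / len(coder_a)


def krippendorff_alpha_nominal(coder_a, coder_b):
    """Krippendorff's alpha for two coders, nominal data, no missing values."""
    n = 2 * len(coder_a)                       # total number of assigned labels
    totals = Counter(coder_a) + Counter(coder_b)
    disagreements = sum(x != y for x, y in zip(coder_a, coder_b))
    observed = 2 * disagreements / n           # observed disagreement D_o
    expected = sum(                            # expected disagreement D_e
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n * (n - 1))
    if expected == 0:                          # only one category was ever used
        return 1.0
    return 1 - observed / expected


# Toy labels: the coders disagree on the last of four websites.
a = ["blog", "blog", "journal", "journal"]
b = ["blog", "blog", "journal", "blog"]
agreement = percent_agreement(a, b)
alpha = krippendorff_alpha_nominal(a, b)
```

Because alpha corrects the raw agreement for the agreement expected by chance, it is the more conservative of the two figures whenever the coders disagree at all.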

It should be noted that only URLs at the web domain level were considered for website categorization. As blogs from generic blog providers (under ‘wordpress.com’, ‘blogspot’ or similar web domains) employ generic web domains provided by the blog service, several different blogs may be linked to the same web domain. In these cases, no academic-related category has been assigned, as individual blogs under generic blog providers have not been checked.

URL mentions in tweets via Twitter

All tweets containing the string “vosviewer.com” until 31 March 2020 were gathered from Twitter using the TweetDeck dashboard application,Footnote 19 without any time restriction. A corpus of 267 tweets mentioning the URL “vosviewer.com” was finally retrieved.

For each tweet, the username, date of publication, tweet text, number of replies, likes, and retweets were obtained. Likewise, the tweet type (original tweet, retweet, and reply) and the inclusion of images or videos were also collected.

Finally, each tweet was categorized according to its main purpose. To this end, a bottom-up process was carried out. First, the content of each tweet was accessed and manually classified according to the general purpose perceived, considering the words, textual signs, visual signs, multimedia, and tweet type. Second, all categories identified were grouped, standardized, and defined, resulting in a total of 10 general categories (see Table 1). Third, a second classification round was performed to reassign the standardized category to each tweet. Fourth, an external researcher with expertise in tweet classification was asked to carry out an inter-coder reliability test on a random sample of 10% of the tweets. The percentage of agreement achieved was 80%, and the Krippendorff's alpha (nominal) achieved was 0.76, which is considered acceptable. This test was used to reclassify a few tweets and improve the definition of categories.

At the user-level, all users providing likes, retweets and replies to each of the original 267 tweets were also obtained, and categorized (female, male, institutional, unknown). The social authorityFootnote 20 of each Twitter user was obtained from Followerwonk.Footnote 21 This metric recursively measures the prestige of a Twitter account based on the prestige of the followers who retweet the tweets by the given account.Footnote 22 Specifically, it includes the following three components: a) the retweet rate of a few hundred of the measured user’s last non-@mention tweets; b) a time decay to favor recent activity versus old activity; c) other data for each user (such as follower count, friend count, and so on) that are optimized via a regression model trained to retweet rate. Social authority metrics score from 0 (no authority) to 100 (maximum authority).Footnote 23

Different networks (connecting the user who publishes a tweet with the user who likes, retweets, or replies to the original tweet) were generated with Gephi v0.9.1.Footnote 24 Finally, data was statistically analyzed with XLStat.Footnote 25
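The networks in this study were built and measured in Gephi. Purely as an illustration of the underlying quantities, a directed user-to-user network and two of its basic statistics could be sketched as follows; the function name and the user names are hypothetical, and the density formula assumes no self-loops.

```python
def network_summary(edges):
    """Summarise a directed network given as (tweet_author, engaging_user) pairs.

    Duplicate pairs are collapsed, so an edge means that the second user
    liked, retweeted, or replied to at least one tweet by the first user.
    """
    unique_edges = set(edges)
    nodes = {user for edge in unique_edges for user in edge}
    n, m = len(nodes), len(unique_edges)
    density = m / (n * (n - 1)) if n > 1 else 0.0  # share of possible directed edges
    average_degree = m / n if n else 0.0           # mean out-degree (= mean in-degree)
    return {"nodes": n, "edges": m, "density": density, "average_degree": average_degree}


# Toy data: one author attracts most of the likes.
likes = [
    ("author_a", "user_1"),
    ("author_a", "user_2"),
    ("author_a", "user_3"),
    ("author_b", "user_1"),
]
summary = network_summary(likes)
```

A network dominated by a single author, as in the toy data, yields a low density and a low average degree even when that author receives many likes.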

Data processing and metrics

At the end of the process, a total of 887 academic publications, 17,261 webpages and 267 tweets are considered. The overall process followed is illustrated briefly in Fig. 2.

Fig. 2 Data gathering process for publications (Google Scholar), webpages (Majestic) and tweets (Twitter)

As we can observe in Fig. 2, each scenario operates with one specific source, each providing a different set of mentioning events: publications, webpages, and tweets. From each of these three document bodies, several context metrics can be obtained.

Based on the software diffusion indicators (paper diffusion breadth and journal diffusion breadth) proposed by Pan et al. (2018), the following tracking and context metrics have been obtained (Table 2). These are grounded on diffusion breadth metrics (number of elements mentioning the software) and impact breadth metrics (attention achieved by elements mentioning the software). Elements can be authors, journals, webpages, countries, languages, etc.
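A diffusion breadth metric is, in essence, a count of distinct mentioning elements. A minimal sketch, assuming mention records stored as dictionaries; the function name, field names, and toy values are hypothetical.

```python
def diffusion_breadth(mentions, key):
    """Number of distinct elements (publications, journals, websites, etc.)
    that mention the software, ignoring records where the field is empty."""
    return len({record[key] for record in mentions if record.get(key)})


# Toy records, for illustration only.
mentions = [
    {"publication": "P1", "journal": "Journal A"},
    {"publication": "P2", "journal": "Journal A"},
    {"publication": "P3", "journal": "Journal B"},
    {"publication": "P4", "journal": ""},  # e.g. a thesis without a journal
]
publication_breadth = diffusion_breadth(mentions, "publication")
journal_breadth = diffusion_breadth(mentions, "journal")
```

The same function applies unchanged to any other element type (authors, countries, languages) by changing the key.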

Table 2 List of tracking and context metrics used to measure the software use

For the sake of clarity, in this work a webpage will refer to any document displayed to a user in a web browser, regardless of its format, and represented by a URL (e.g., vosviewer.com/vosviewer.php). Likewise, a website will refer to a collection of webpages linked together in a coherent fashion, also represented by a URL which hierarchically nests all related webpages’ URLs (e.g., vosviewer.com).
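Under these definitions, the website of a webpage can be derived from the webpage URL. The sketch below is a simplification: the function name is hypothetical, only a leading "www." is stripped, and other subdomains are kept as they are.

```python
from urllib.parse import urlsplit


def website_of(webpage_url):
    """Return the website (host without a leading "www.") of a webpage URL."""
    # urlsplit only recognises the host when a scheme or "//" precedes it.
    if not webpage_url.startswith(("http://", "https://", "//")):
        webpage_url = "//" + webpage_url
    host = urlsplit(webpage_url).hostname or ""  # hostname is lower-cased
    return host[4:] if host.startswith("www.") else host


pages = [
    "vosviewer.com/vosviewer.php",
    "https://www.vosviewer.com/download",
    "http://VOSviewer.com",
]
sites = {website_of(page) for page in pages}
```

All three webpage URLs collapse to the same website, which is the unit counted by the website-level metrics.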

Results

URL mentions in academic literature via Google Scholar

A total of 1,144 publication URL mentions from 887 different publications (publication diffusion breadth) were found. Most publications (79%) include only one URL mention, although a few publications include up to ten URL mentions, denoting the strong importance of the software in those publications (Fig. 3). In any case, the publication URL mention intensity is low (average of 1.29 URL mentions per publication).

Fig. 3 Histogram showing the number of publications (y-axis) according to the number of mentions to the VOSviewer’s URL (vosviewer.com) (x-axis)

The publication impact breadth achieves an i-10 index of 188 (and an i-100 index of 22), showing that a significant number of publications mentioning the software’s URL achieve citation-based impact. However, 36.4% of this corpus of URL mentioning documents had not received any citations at the time of data gathering. The publication diffusion breadth has increased over the years, especially in 2019 (Fig. 4). Most of these publications are journal articles (70.1%), while other categories such as theses and Master’s theses (10.8%) and conference papers (9.2%) also show a remarkable presence (Fig. 5).
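The i-n indices used here count the items that have received at least n citations (or, for tweets, likes or retweets). A minimal sketch with invented citation counts; the function name is an assumption.

```python
def i_n_index(counts, n):
    """Number of items with at least `n` citations (or likes, retweets, etc.)."""
    return sum(1 for count in counts if count >= n)


# Invented citation counts for seven mentioning publications.
citations = [0, 3, 10, 15, 120, 9, 100]
i10 = i_n_index(citations, 10)
i100 = i_n_index(citations, 100)
uncited_share = citations.count(0) / len(citations)
```

Reporting the share of uncited items alongside the i-n indices shows both ends of the distribution, as done in the paragraph above.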

Fig. 4 Number of publications indexed in Google Scholar mentioning the VOSviewer’s URL (vosviewer.com) over the years

Fig. 5 Number of publications indexed in Google Scholar mentioning the VOSviewer’s URL (vosviewer.com) and total number of URL mentions included, according to the publication type

The references section is the most frequent location where VOSviewer’s URL mentions were found (30.8% of all URL mentions found), followed by the methodology section (29.7%) and results (18.1%). In addition, 108 documents (providing 144 URL mentions) did not exhibit a standard structure (Fig. 6). These results suggest a preference for mentioning VOSviewer’s URL as part of the bibliographic references, and as a methodological item in the mentioning publications.

Fig. 6 Number of publication URL mentions to the VOSviewer’s URL (vosviewer.com) according to the publication section where located. Note: excluding documents retrieved but without mentioning ‘vosviewer.com’ (N = 32), and documents without full text access (N = 51)

Mentioning journal articles come from 499 different academic journals (source diffusion breadth), of which 16% belong to the Library and Information Sciences field. The number of URL mentions per journal is highly skewed (431 journals appear with just one publication each including at least one URL mention). Scientometrics, JASIST, and Journal of Informetrics are the principal sources (Table 3), all of them core journals in the Library and Information Sciences field, the area in which the VOSviewer software has been applied.

Table 3 List of academic journals according to the number of mentions to the VOSviewer’s URL (vosviewer.com)

At the author level, a total of 2,130 authors are found as author(s) or co-author(s) of publications including at least one mention to the software’s URL (author diffusion breadth). The authors who have mentioned the software’s URL the most times are shown in Table 4, where the co-developers of the software as well as other eminent researchers in the field of Scientometrics can be distinguished.

Table 4 List of authors according to the number of mentions provided to the VOSviewer’s URL (vosviewer.com)

URL mentions in websites via Majestic

VOSviewer’s official website has accumulated 21,440 website URL mentions since 2014 (no data is available before this date in Majestic), of which 99.5% appear in internal pages (only 110 URL mentions come from websites’ homepages).

These website URL mentions come from 17,261 webpages (webpage diffusion breadth) belonging to 1,330 websites (website diffusion breadth). The website impact breadth of this corpus of mentioning websites is high (119 of these websites receive hyperlinks from at least 100,000 different external websites, while 244 of these websites receive hyperlinks from at least 10,000 different websites).

The number of new webpages mentioning VOSviewer’s URL increases over time. From 2015 onwards, the monthly average of new mentioning webpages is 246.7. During 2019, this average value increases to 467.9 (Fig. 7). The monthly average of new mentioning websites is 17.8 (Fig. 8).

Fig. 7 Number of webpages (both total accumulated–line–and monthly–bars) mentioning VOSviewer’s URL (vosviewer.com) over time

Fig. 8 Number of websites (both total accumulated–line–and monthly–bars) mentioning VOSviewer’s URL (vosviewer.com) over time

The web authority of this corpus of websites mentioning the software’s URL is diverse. 136 websites (10.2%) achieve a Trust Flow score equal to or greater than 50 (out of 100), and 181 websites (13.6%) achieve a Citation Flow score equal to or greater than 50 (also out of 100). However, most mentioning websites achieve lower scores, especially Trust Flow scores (926 mentioning websites achieve scores lower than 10), and are thus websites with low web authority (Fig. 9).

Fig. 9 Distribution of Trust Flow and Citation Flow scores for those websites mentioning VOSviewer’s URL (vosviewer.com)

The nature of mentioning websites is also diverse. About 28.5% of these 1,330 websites (379) correspond to academic-related websites (academic website diffusion breadth), whereas 23.8% (317) are dubious websites related to casinos, gambling, online bets, and even pornographic websites.

It is worth noting that a significant number of websites were not available two months after data gathering (473 websites, containing 2,906 mentioning webpages). Consequently, they could not be categorized. Likewise, 95 websites (containing 215 mentioning webpages) were parkedFootnote 26 (Orduna-Malea, 2021), and 76 domains (208 mentioning webpages) automatically redirected to dubious websites.

Nevertheless, the total percentage of mentioning webpages from non-academic websites was small (5.1%; 877 mentioning webpages), while URL mentions from academic-related webpages were the majority (74.1%; 12,799 mentioning webpages). Therefore, the incidence of non-academic websites on the overall online impact of VOSviewer’s website is limited (Fig. 10).

Fig. 10 Distribution of the number of mentioning webpages by mentioning websites

At the webpage level, URL mentions from academic-related webpages to VOSviewer’s URL come mainly from personal blogs (5,352 webpages), academic information products (1,883), and research groups (1,450). At the website level, universities and academic journals stand out (101 and 32 URL mentions, respectively). Most websites (91.9%) provide just one URL mention to the software. In addition, non-academic websites have been included by way of illustration (Table 5). Of these, 473 websites have expired, 95 were parked and 76 redirected to other web locations. Other significant non-academic categories included websites with tricks for SEO professionals (44), non-academic companies (40) and mentions from online message groups (57 links from 34 websites).

Table 5 Ranking of website categories according to the number of academic webpages and websites mentioning VOSviewer’s URL (vosviewer.com)

The language has been identified for 71% of all 17,261 webpages mentioning the software’s URL, covering 26 different languages (webpage language diffusion breadth). URL mentions come mainly from webpages written in English (5,894) and French (5,216). Analyzing the geo-location of the IP addresses of each website’s web domain, we find 3,849 webpages from 91 websites hosted in the Netherlands (which is consistent, as it is the country where VOSviewer’s developers work). However, only 46 webpages written in Dutch were identified (Table 6). A similar issue is found with Germany: 539 webpages from 94 different websites are geo-located in this country, but only 21 webpages are written in German. The use of English in most academic web environments may explain these results.

Table 6 Ranking of languages according to the number of webpages mentioning VOSviewer’s URL, written in the corresponding language

The IP address geo-location has identified webpages placed in 54 different countries (webpage location diffusion breadth), mainly from websites located in the United States (53.9%; 717 websites) and the United Kingdom (11%; 146 websites) (Fig. 11). At the level of mentioning webpages, Argentina (255 webpages from 3 websites) and Indonesia (113 webpages from 3 websites) stand out due to the existence of specific websites with many mentioning webpages (mainly from ‘r020.com.ar’, an Argentinean website dedicated to Library and information sciences resources with 249 mentioning webpages, and ‘dasaptaerwin.net’, an Indonesian personal academic website providing 113 mentioning webpages, respectively).

Fig. 11 Distribution of webpages mentioning VOSviewer’s URL (vosviewer.com) according to the website’s web domain IP address location

67.1% of webpages mentioning the software’s URL link directly to VOSviewer’s homepage. In addition, we can find a significant number of URL mentions linking to specific sections of the software’s website, especially maps created by the software as use examples, and the page where the software is available to download (Table 7).

Table 7 VOSviewer’s website specific URLs most mentioned by external webpages

URL mentions in tweets via Twitter

A total of 267 tweets containing a URL mention to the VOSviewer’s official website have been identified in the period (tweet diffusion breadth). As all tweets include just one URL mention to the software’s URL, the tweet diffusion breadth and the total tweet URL mention counts are the same.

This corpus of mentioning tweets has generated a further engagement of 89 replies (reply diffusion breadth), 646 retweets (retweet diffusion breadth), and 1,109 likes (like diffusion breadth) (Table 8). The tweet impact breadth is limited; the like-based impact breadth achieves an i10-index of 26 (i100-index of 1) and the retweet-based impact breadth achieves an i10-index of 15 (i100-index is null). The low number of replies makes the reply-based impact breadth null (i10-index is 0; i1-index is 65).

Table 8 Number of tweets mentioning the software’s URL (vosviewer.com) over the years, and the engagement achieved

75% of the 267 tweets gathered are original tweets, while 20% (53) are replies. Conversely, retweets containing URL mentions are scarce (5%; 14 tweets). Most tweets (80%) do not include images, and only two include media.

Original tweets related to both diffusion (21.3% of all tweets) and software recommendation (16.9%) are the most frequent. VOSviewer’s URL is also frequently mentioned in conversations about the generation of bibliometric maps (17.2%). However, tweets related to new version releases and use examples are those achieving the largest engagement, both in terms of number of likes and retweets received (Fig. 12).

Fig. 12 Distribution of tweets mentioning VOSviewer’s URL (vosviewer.com) according to the category of each tweet. Note: due to the low volume of the number of replies, this metric has been excluded

As regards tweet authorship, 200 users have been identified (total author diffusion breadth), of which 56% are male, 30% female, and 12% institutional accounts (Table 9). The communities of users giving likes (liking author diffusion breadth of 748) and retweets (retweeting author diffusion breadth of 428) follow a similar gender distribution, except for institutional profiles, which participate less in the generation of likes.

Table 9 Gender analysis of the users who tweeted, retweeted, and liked tweets mentioning the VOSviewer software’s website

The overall community of attention, that is, the users who have published at least one tweet mentioning VOSviewer’s URL (200 users), can be characterized as having a high level of Twitter reputation, with a median social authority value of 47 (given that only a few users in the world achieve a score of 100, this median value is considered substantial), and a significant number of followers and total tweets published (Table 10). However, the distribution of tweets per author is skewed, as only 18 users have published more than one tweet mentioning VOSviewer’s URL.

Table 10 Descriptive statistics related to users mentioning the VOSviewer software’s URL (vosviewer.com) (n = 200)

The network of VOSviewer’s community of attention is a very sparse user (tweet creator) to user (like generator) network (Fig. 13 top), with one large node, a few significant nodes, and many small nodes involved in sporadic interactions (one user generating a tweet and a few users liking it). This network (with low density, low average degree, and a large diameter considering the number of nodes involved) depends heavily on the Twitter activity of one user (van Eck, one of the co-developers of VOSviewer). In addition, strongly unbalanced user behavior is detected: users who receive a great number of likes do not give likes to other users who include a URL mention in their tweets, reflecting a lack of interactivity in this community of users. The network based on retweets (Fig. 13 bottom) shows similar patterns, being small, sparse, and dependent on van Eck’s activity.
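The structural properties mentioned above (density, average degree, weighted in-degree, and lack of reciprocal interaction) can be sketched for a small directed network. The sketch assumes the standard directed-graph definitions: density is edges divided by n(n-1), and average degree is edges divided by nodes. The edge list, the user names, and the function `network_summary` are hypothetical. The source does not state how the reported network measures were computed, so this is not necessarily the authors' procedure.

```python
from collections import defaultdict

# Hypothetical (liker, tweet_author, number_of_likes) edges; not the study's data.
edges = [
    ("u1", "hub", 5),
    ("u2", "hub", 3),
    ("u3", "hub", 1),
    ("u4", "u5", 1),
    ("u5", "u4", 2),
]


def network_summary(edges):
    """Basic measures of a directed, weighted user-to-user network."""
    nodes = set()
    weighted_in = defaultdict(int)  # total likes received per user
    pairs = set()                   # distinct directed (source, target) pairs
    for src, dst, weight in edges:
        nodes.update((src, dst))
        weighted_in[dst] += weight
        pairs.add((src, dst))
    n, m = len(nodes), len(pairs)
    # Share of directed edges whose reverse edge also exists.
    reciprocal = sum(1 for (s, d) in pairs if (d, s) in pairs)
    return {
        "nodes": n,
        "edges": m,
        "density": m / (n * (n - 1)) if n > 1 else 0.0,
        "average_degree": m / n if n else 0.0,
        "reciprocity": reciprocal / m if m else 0.0,
        "weighted_in_degree": dict(weighted_in),
    }


summary = network_summary(edges)
print(summary)
```

In a network dominated by one account, most of the weighted in-degree concentrates on a single node, while low reciprocity reflects that users receiving likes rarely return them.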

Fig. 13 Network of users providing likes (top) and retweets (bottom) to tweets mentioning the software’s URL (vosviewer.com). Node color and node size correspond to the weighted InDegree value of each node. Note: due to the low number of replies, this metric has been excluded

Fig. 14 Example of webometric technical evaluation sheet for VOSviewer software. Note: object: URL (vosviewer.com); event: URL mentions

Combining scenarios

Each of the three scenarios analyzed (academic literature, websites and Twitter) provides a complementary story about the use and interest in VOSviewer. Each scenario is determined by the different available context metrics provided in the corresponding analytical framework, each of which covers specific document bodies (academic publications, webpages and tweets, respectively).

The academic literature scenario allows checking the use of the software in publications. This way, VOSviewer’s diffusion breadth can be determined in terms of the number of publications, sources or authors mentioning the software’s URL, and the intensity of this mentioning event. Moreover, taking the analysis to a greater level of detail, the appearance of mentions in different sections (beyond the references) can inform about the nature of the mentions. Thus, URL mentions can appear in the method section (as part of describing procedures and tasks performed), results (mainly to show data created with the software to illustrate direct findings related to objectives), and introductory sections (mainly to supplement literature reviews).

The web scenario has made it possible to determine the interest in VOSviewer through webpages linking to the software’s URL. This way, VOSviewer’s diffusion breadth can be determined in terms of the number of webpages and websites mentioning the software, and the number of different languages and countries from which the software is mentioned. Moreover, the appearance of URL mentions in specific academic-related webpages can potentially inform about the interest in the software in academic spaces beyond publications.

The social media scenario has made it possible to check the use of and interest in academic software through tweets linking to the software’s URL. This way, diffusion breadth can be determined in terms of the number of tweets and users mentioning the software’s URL. Moreover, the appearance of URL mentions in different tweets allows detailed analyses at the tweet level, such as the type of tweet (distinguishing replies, as part of discussions, and retweets, for mere diffusion purposes, from original tweets), the motivation of the tweet (e.g., software releases, use examples, etc.) and the engagement of the tweet (number of likes and retweets achieved by tweets mentioning the software’s URL). In addition, the data capture attention characteristics of the audience interested in the software, such as users’ likes. The conversational nature of Twitter can also be useful in detecting networks of users who may potentially employ the software, favoring the interaction and circulation of scientific knowledge across different communities of attention (Díaz-Faes et al., 2019).

All metrics gathered from each of the scenarios can be combined in technical analytical sheets (Fig. 14) to characterize the software’s use in a multidimensional way. It should be highlighted that all data are captured externally in a non-invasive way, without requiring webmaster access privileges, which allows the approach to be used to evaluate any other software in a similar fashion, as long as the software has an unambiguous URL.
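One possible way to hold such a sheet in memory is sketched below. The class and field names (`ScenarioRecord`, `EvaluationSheet`, `tracked_object`, `flat`) are hypothetical and do not reproduce the layout of Fig. 14. Only the Twitter scenario is filled in, using the counts reported in the text, and the other scenarios would be added in the same way.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ScenarioRecord:
    """Metrics gathered for one scenario (field names are illustrative)."""
    scenario: str
    source: str
    event: str
    metrics: Dict[str, int] = field(default_factory=dict)


@dataclass
class EvaluationSheet:
    """All scenario records for one tracked object, e.g. a software URL."""
    tracked_object: str
    records: List[ScenarioRecord] = field(default_factory=list)

    def add(self, record: ScenarioRecord) -> None:
        self.records.append(record)

    def flat(self) -> Dict[str, int]:
        """Flatten every metric into 'scenario.metric' keys for comparison."""
        return {
            f"{r.scenario}.{name}": value
            for r in self.records
            for name, value in r.metrics.items()
        }


sheet = EvaluationSheet(tracked_object="vosviewer.com")
sheet.add(ScenarioRecord(
    scenario="Twitter",
    source="Twitter",
    event="URL mentions",
    # Counts taken from the Twitter results reported in the text.
    metrics={"tweets": 267, "authors": 200, "replies": 89,
             "retweets": 646, "likes": 1109},
))
print(sheet.flat())
```

Keeping each scenario as a separate record preserves the source and event behind every metric, which matters when sheets for different objects are compared.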

The data included in the evaluation sheet are intended to give a detailed overview of the software’s diffusion breadth. However, these results need to be compared with those of other digital objects to determine whether the diffusion obtained is outstanding or not.

Discussion

The suitability of the approach presented in this case study depends mainly on the sources employed and the object selected. In this work, three sources have been selected to exemplify the proposed approach (Google Scholar, Majestic, and Twitter). The coverage, accuracy and availability of these sources determine the comprehensiveness and suitability of tracking and context metrics.

The main advantages of all three sources are their coverage and the wide variety of tracking metrics available. Google Scholar covers academic publications regardless of type, language, and discipline, enabling the discovery of mentions in a large number of publications and sources not available in other bibliographic sources. Majestic covers millions of webpages regardless of webpage type, language, or location, enabling the discovery of mentions in the global open online sphere, where academics and practitioners also participate. Twitter covers millions of tweets, enabling the discovery of mentions in the Twittersphere (the total universe of Twitter users and their publishing/following/reacting habits), which allows measuring interactivity between users (e.g., discussion threads or engagement).

Conversely, all these sources show limitations regarding availability, accuracy, and volatility. Google Scholar does not offer data export facilities (the Publish or Perish software helps, but it is also subject to Google Scholar’s restrictions). It also introduces errors in the bibliographic records that require manual examination, which in turn impairs the analysis of large sets of publications. Moreover, unlike other bibliographic databases, Google Scholar is not a cumulative database, and documents can be unindexed if they stop meeting Google Scholar’s technical requirements (Delgado López-Cózar, Orduña-Malea & Martín-Martín 2019). Majestic operates with highly volatile data, as webpages are continuously changing. For example, 4,365 URL mentions of VOSviewer came from one specific personal blog (culturalibre.ca). Most of these hyperlinks were deleted just a few months after data gathering, and the whole domain was later disabled. Therefore, web data should be treated as dynamic and fluid impact instead of current impact. Moreover, Majestic data is offered under a paid license (which limits its use for large-scale research endeavors), and its flow metrics are composite indicators whose full methodology is unknown, jeopardizing their transparency. Apart from the volatility of the data, Twitter is also subject to the limitations derived from compliance with its terms of use together with the General Data Protection Regulation (GDPR) as regards demographic data of users. Availability is also a problem, as the public Twitter API is not sufficient to carry out large data analyses and full API functionalities are offered under different paid licenses. The recent Twitter academic research API (Footnote 27) might partially solve these limitations.

The object used to represent the software is another important aspect of the proposed approach. In this work, the object corresponds to the URL of the software’s official website. The choice of a URL as a traceable digital object has important advantages. For example, mentions of the URL can be unequivocally identified, avoiding the polysemy and synonymy of natural language. In addition, URLs have specific search filters that ease data retrieval in a multitude of sources. These characteristics allow the generation of faster, simpler and more refined information retrieval systems. Moreover, URLs also allow online navigation (and, therefore, web traffic), which potentially facilitates software usage and makes them an effective way of mentioning software for scientific purposes. Finally, a wide variety of URL metrics is available that either do not exist for other types of text mention or, in some cases, would require much more complex computation to obtain.
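The claim that URL mentions can be identified unequivocally can be illustrated with a minimal sketch that matches on the host name rather than on free text. The regular expression, the function `url_mentions`, and the sample sentence are assumptions made for illustration. They are not the retrieval filters offered by Google Scholar, Majestic, or Twitter.

```python
import re
from urllib.parse import urlsplit

# Simplified pattern: a scheme or "www." followed by non-space,
# non-bracket characters (an assumption; real URL grammars are broader).
URL_PATTERN = re.compile(r"(?:https?://|www\.)[^\s<>()\[\]]+", re.IGNORECASE)


def url_mentions(text: str, domain: str) -> list:
    """Return the URLs in text whose host is domain or one of its subdomains."""
    target = domain.lower()
    hits = []
    for raw in URL_PATTERN.findall(text):
        candidate = raw.rstrip(".,;:!?")  # drop trailing sentence punctuation
        if not candidate.lower().startswith(("http://", "https://")):
            candidate = "http://" + candidate  # urlsplit needs a scheme to find the host
        host = (urlsplit(candidate).hostname or "").lower()
        if host == target or host.endswith("." + target):
            hits.append(candidate)
    return hits


sample = (
    "Maps were built with VOSviewer (https://www.vosviewer.com/download). "
    "See also WWW.VOSviewer.com and http://notvosviewer.com/x."
)
hits = url_mentions(sample, "vosviewer.com")
print(hits)
```

Comparing host names avoids false positives from look-alike domains, whereas matching the software's name alone would also pick up unrelated uses of the word.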

Despite the benefits of using URL as an object, the following limitations (and threats) should be acknowledged (Table 11):

Table 11 URL-based text mentions limitations

Software can be mentioned by authors for a wide range of reasons (as an instrument, an artifact, a scientific protocol, a method, or just an example; Howison & Bullard 2016; Li et al., 2019) regardless of the type of text mention used (URL, software’s name, etc.). The publication sections where the mention appears and the category of tweets where the software is mentioned have been studied in this work precisely as exploratory signals. Further qualitative techniques would add more context to the Twitter conversations, which may enhance the technical evaluation sheet obtained for the software. The taxonomy of tweets is based only on the collected tweets, but a more generic taxonomy would be advisable, including other potential categories or subcategories (e.g., software awards, datacamps, dedicated conferences and meetings, official use of scientific software in syllabi, mentions and reviews from professional specialized media, diffusion of books dedicated to the software, forks, etc.) not considered in this analysis but foreseeably necessary for other types of software.

VOSviewer has been used as a case study to exemplify the method. The manageable number of URL mentions found, together with its online presence, makes it an adequate choice to test a wide number of metrics per analytical framework.

Conclusion

A webometric analytical approach to track scientific software use and interest has been proposed in this work. This approach is based on the definition of scenarios, analytical elements, analytical frameworks, sources, objects, events, tracking metrics and context metrics. The operationalization of the approach has been exemplified by analyzing one specific academic software (VOSviewer) and object (VOSviewer’s official URL).

Results show that the different analytical frameworks provide useful information about the usage of scientific software. They expand the notion of usage from use in research publications to the dissemination of, and interest in, scientific software within the research community, and illustrate how this usage information is relevant to fully comprehending the broader influence of scientific software on the whole research ecosystem.

The Google Scholar scenario has shown VOSviewer as a research resource, whilst the Majestic and Twitter scenarios have shown the interest in VOSviewer as an information resource and a conversational resource, respectively.

Finally, the approach proposed in this study can be expanded by adding new scenarios and new analytical frameworks. Each analytical framework can also be expanded by considering additional sources, objects, events, tracking metrics and context metrics. In addition, data from different scenarios can be combined to create new indicators and to show added-value information (e.g., an integrated publication timeline for each specific event tracked).
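The integrated timeline mentioned as an example could be built along the following lines. This is only a sketch: the `(scenario, year)` event tuples and the function `integrated_timeline` are hypothetical, and the study does not specify how such an indicator would be constructed.

```python
from collections import defaultdict

# Hypothetical URL-mention events as (scenario, year) pairs; not the study's data.
events = [
    ("academic literature", 2019),
    ("web", 2019),
    ("Twitter", 2019),
    ("Twitter", 2020),
    ("academic literature", 2020),
    ("academic literature", 2020),
]


def integrated_timeline(events):
    """Count events per year and scenario, returned in chronological order."""
    timeline = defaultdict(lambda: defaultdict(int))
    for scenario, year in events:
        timeline[year][scenario] += 1
    return {year: dict(timeline[year]) for year in sorted(timeline)}


timeline = integrated_timeline(events)
for year, counts in timeline.items():
    print(year, counts)
```

Aligning the scenarios on a shared time axis would make it possible to see, for instance, whether activity on Twitter precedes or follows mentions in publications.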

In fact, the approach proposed in this study can be used to track any research object that can be identified by a specific URL (e.g., software, scientific conferences, presentations, online courses, videos, scientific exhibitions, research projects, academic websites, etc.), expanding the analytical scope of the scientometric toolset by incorporating a novel digital dimension through methods that draw from the fields of Webometrics and Altmetrics.