Link-based approach to study scientific software usage: the case of VOSviewer

Orduña-Malea, Enrique; Costas, Rodrigo

doi:10.1007/s11192-021-04082-y

Open access
Published: 10 July 2021

Volume 126, pages 8153–8186, (2021)
Scientometrics

Abstract

Scientific software is a fundamental player in modern science, participating in all stages of scientific knowledge production. Software occasionally supports the development of trivial tasks, while in other instances it determines procedures, methods, protocols, results, or conclusions related to the scientific work. The growing relevance of scientific software as a research product with value of its own has triggered the development of quantitative science studies of scientific software. The main objective of this study is to illustrate a link-based webometric approach to characterize the online mentions of scientific software across different analytical frameworks. To do this, the bibliometric software VOSviewer is used as a case study. Considering VOSviewer’s official website as a baseline, online mentions of this website were counted in three different analytical frameworks: academic literature via Google Scholar (988 mentioning publications), webpages via Majestic (1,330 mentioning websites), and tweets via Twitter (267 mentioning tweets). Google Scholar mentions show how VOSviewer is used as a research resource, whilst mentions in webpages and tweets show the interest in VOSviewer’s website from an informational and a conversational point of view. The results show that URL mentions can be used to gather all sorts of online impacts related to non-traditional research objects, like software, thus expanding the analytical scientometric toolset by incorporating a novel digital dimension.

Introduction

Scientific-purpose software as a non-traditional academic output

Software is an essential component in the ecosystem of modern Science, particularly in those disciplines that follow a data-driven paradigm, guided by the ongoing generation, availability, and consumption of high volumes of scientific data (Hey, Tansley & Tolle 2009; Li & Yan, 2018). Today, software is used in all stages of academic work (Howison et al., 2015), from annotating preliminary ideas to processing large volumes of data or disseminating research results. Among the vast amount of software available in the scientific endeavor, we can distinguish between general-purpose^{Footnote 1} and scientific-purpose software.

Scientific-purpose applications consist of software explicitly designed to assist with non-trivial scientific tasks (for example, VOSviewer,^{Footnote 2} CitNet Explorer^{Footnote 3} or Gephi,^{Footnote 4} to name a few). Scientific software can play important roles in processes related to data collection, management, formatting, analysis, modelling, simulation, prediction, visualization, and dissemination (Howison et al., 2015; Pan et al., 2017), becoming essential in the scientific discovery process (Pradal et al., 2013). Thus, scientific software has a direct effect on the validity of scientific results, since replacing the software could in turn lead to replacing an underlying procedure or logical assumption (Hannay et al., 2009; Howison & Herbsleb, 2011; Li et al., 2017; Yang et al., 2018).

This rising importance of software in the scientific process has prompted the perception of scientific-purpose software as a research product in its own right. Research funding agencies are increasingly funding the development of scientific-purpose software (Howison et al., 2015), as well as accepting software creation as a valid outcome in some grant applications (Piwowar, 2013), as is the case for the U.S. National Science Foundation (NSF)^{Footnote 5} and the U.K. Research Excellence Framework (REF)^{Footnote 6} (Pan et al., 2018).

Tracking citations to scientific-purpose software

While it is commonly accepted that there is no need to mention/cite general-purpose software (Pan et al., 2019), it is recommended that scientific publications using scientific-purpose software should mention it (Niemeyer et al., 2016). The main reasons to encourage the citation of scientific-purpose software include credit allocation, reproducibility, transparency, and discovery (Smith, Katz, & Niemeyer 2016).

Previous literature has provided a significant body of knowledge about the lack of formal mentions of software in scientific publications. For example, Howison and Bullard (2016) found that between 31 and 43% of textual software mentions also involved formal citations. Pan et al. (2016) discovered that more than 30% of the software mentions in 2014 in articles published in PLoS ONE received no formal citations. Park and Wolfram (2019) found that research software was rarely cited in the Clarivate Analytics’ Data Citation Index (DCI). This under-citedness of software varies both by discipline (Pan et al., 2016) and by the nature (commercial or freeware) of the software (Howison & Bullard, 2016; Pan et al., 2019). Moreover, the mentions of software often lack sufficient information about the software employed (e.g., version, access, crediting information, etc.) (Howison & Bullard, 2016).

Although the academic community has initiated diverse actions, such as proposing best software citation practices (Hafer & Kirkpatrick, 2009; Howison & Bullard, 2016; Niemeyer et al., 2016), working groups (e.g., FORCE11 Software Citation Working Group,^{Footnote 7} FORCE11 Software Citation Implementation Working Group^{Footnote 8} and WSSSPE Software Credit Working Group; Katz et al., 2016), publisher guidelines (e.g., the American Astronomical Society Policy Statement on Software^{Footnote 9}) or informal statements, such as the Science Code Manifesto^{Footnote 10} and The Research Software Impact Manifesto,^{Footnote 11} the diversity of ways of referring to software and the still pending standardized citation guidelines (e.g., standardized citation styles and publishers sometimes contradict each other) make citation counts a limited metric for the proper traceability of scientific-purpose software (Pan et al., 2019).

Textual approaches to track academic software usage

Since citation metrics have shown only a limited applicability to measure software usage in academic settings, it becomes necessary to establish alternative methods to measure the usage of scientific-purpose software, and to obtain evidence about its influence and impact (Hannay et al., 2009; Pan et al., 2018). Thus, the identification of textual mentions of software in the text of scientific papers has been a quite common approach to capture the impact of software on science (Pan et al., 2016).

Different efforts have been made to measure scientific software text-mention patterns in publications at different levels: a) disciplines, such as Biology (Howison & Bullard, 2016; Yang et al., 2018) and Library and information sciences (Pan et al., 2019); b) multidisciplinary journals, such as PLoS ONE (Pan et al., 2015, 2016); c) programs stored in software repositories (Thelwall & Kousha, 2016); and d) specific software applications, such as Geant4 toolkit (Pia et al., 2009), R packages (Li & Yan, 2018; Li et al., 2017, 2019), and bibliometric mapping software (Citespace, VOSviewer and Histcite) (Pan et al., 2017, 2018).

Tracking the use of software via text-mentions introduces some methodological challenges, which might limit the identification of software names in large texts (Du et al., 2021). First, there may be different ways to invoke the same software: a software project name (e.g., in GitHub), the URL of the software’s official website, the URL of the repository where it is hosted, mentions of unpublished manuscripts about the software, users’ manuals, etc. In addition, we can find synonyms or even translations into other languages. Consequently, textual mentions are highly polymorphous (Cronin et al., 1988). Second, common words used as software names may also represent other objects, due to the polysemy of textual mentions. Third, complex software applications might have parts (specific packages, modules, etc.) that make their clear identification through text-mentions difficult.

Using software website URLs as traceable objects

One way to limit this complexity and facilitate the operationalization of the tracking of software is to consider the URL of the software website as the traceable object. The mentioning of URLs has been extensively studied in the webometrics field as an established technique to measure the online importance and impact of websites (Orduna-Malea & Alonso-Arroyo, 2017; Park & Thelwall, 2003; Thelwall, 2004). Likewise, the URL stands out as a central piece in altmetric studies, since this universal identifier is often used as a digital object (mainly via DOI URLs) representing research publications mentioned on online social media platforms (Wouters et al., 2019).

Tracking URLs presents fundamental advantages over the tracking of texts. First, a URL provides a unique and unequivocal element to identify the software. For example, the text ‘vosviewer.com’ can only refer to the software VOSviewer, thus reducing polymorphism and polysemy. Second, the URL is an actionable element that allows users to navigate from the source document (the document where the software is mentioned) to the target (a document published/hosted in the software’s website), establishing and making explicit a relation between these documents. Third, considering that mentioning a URL is more time-consuming than simply mentioning the software, it can be argued that this action might be related to a more conscious informational purpose (e.g., sources transparency, facilitating resources for readers, etc.), where links are oriented to navigational issues (Halavais, 2008). Fourth, the URL not only represents univocally the digital object (in this case, the software) but also represents the whole website (and all contents hosted inside) where the software is available. As a website, a wide range of metrics (e.g., traffic, visibility, size, etc.) are available, which can report information related to the consumption of and interest in the software.

VOSviewer as a case study

The free bibliometric software VOSviewer^{Footnote 12} is analyzed as a case study. VOSviewer was developed by Nees Jan van Eck and Ludo Waltman at Leiden University's Centre for Science and Technology Studies (CWTS). The application was launched in 2010, and formally introduced through a software paper –a scientific publication describing and analyzing the software (Smith, Katz & Niemeyer 2016)– published in Scientometrics (Van Eck & Waltman, 2010). This publication is the most cited article in the journal, according to both Scopus (1,621 citations) and Web of Science-all databases (1,431), as of August 2020.

Its ease of use and multiple features (including specific clustering and natural language processing techniques) made VOSviewer popular not only in the Scientometrics community but also in other disciplines where science maps are used. Given its simplicity as a software product (code and related material are all centralized and available on a website) and the broad interest in and diverse audience of this software, VOSviewer constitutes an excellent case study to test the proposed approach. Moreover, VOSviewer is a relevant and well-known software for the Scientometric research community. All these features make VOSviewer an ideal case study for an illustrative discussion like the one presented in this paper.

Aim of the study and analytical framework

The objective of this study is to illustrate a webometric and altmetric method to determine the use of and interest in a case study of scientific-purpose software: VOSviewer. The main purpose of this work is to design an analytical framework aimed at studying scientific software’s impact metrics by collecting large amounts of data from multiple online data sources. This analytical framework will be made explicit by means of an evaluation technical sheet, which will include a wide list of URL-based metrics specifically and formally defined to measure the use of scientific software. This analytical framework is based on the mentioning of the software’s main URL across different scenarios (academic publications, web at-large, and social media), each of which is operationalized by an online data source (Google Scholar, Majestic, and Twitter, respectively).

In Appendix 1 we describe more specifically the three data sources chosen for this study. The rest of the paper is structured as follows: Sect. 2 describes the methodological approach, Sect. 3 presents the main results, Sect. 4 discusses them, and Sect. 5 draws the main conclusions.

Methodological approach

We analyze the mentioning of the official URL of VOSviewer (www.vosviewer.com) in three scenarios: academic publications, web at-large, and social media. Each scenario is characterized by the following five elements (Fig. 1):

Source: the database where the use of the software is tracked (e.g., Majestic, Twitter, etc.).
Object: an element that represents the software (e.g., the software name, the software website).
Event: an action carried out by users or applications that trigger the appearance or invocation of the object (e.g., a software name’s text-mention, a link to the software website).
Tracking metrics: a measure which captures the use of the object through a certain event (e.g., the number of software name’s text mentions).
Context metrics: a measure which captures characteristics of the users that generate the events (e.g., users’ gender, location, language, etc.).
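The five elements above can be held in one small record per scenario. The following is a minimal sketch under stated assumptions: the class name, the field names, and the specific metric labels are illustrative choices, not a schema defined by the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """One analytical scenario, described by the five elements listed above."""
    name: str
    source: str              # database where the use of the software is tracked
    tracked_object: str      # element that represents the software
    event: str               # action that invokes the object
    tracking_metrics: tuple  # measures of the use of the object
    context_metrics: tuple   # measures describing who generates the events


# Hypothetical encoding of the three scenarios; metric labels are illustrative.
SCENARIOS = (
    Scenario("academic publications", "Google Scholar", "vosviewer.com",
             "URL mention in a publication",
             ("publication URL mentions",),
             ("document type", "journal", "author")),
    Scenario("web at-large", "Majestic", "vosviewer.com",
             "URL mention in a webpage",
             ("website URL mentions",),
             ("language", "IP location", "Trust Flow", "Citation Flow")),
    Scenario("social media", "Twitter", "vosviewer.com",
             "URL mention in a tweet",
             ("tweet URL mentions",),
             ("user gender", "social authority")),
)

for scenario in SCENARIOS:
    print(f"{scenario.name}: {scenario.source} -> {scenario.tracked_object}")
```

Keeping the object and the event as explicit fields makes it easy to swap the URL for another traceable object (for instance, the software name) without touching the rest of the record.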

URL mentions in academic literature via Google Scholar

Academic publications containing the URL string “vosviewer.com” were retrieved from Google Scholar using the Publish or Perish v7 software^{Footnote 13} (patent documents included). To do this, the direct query “vosviewer.com” was performed, excluding the protocol (https) and the ‘www’ prefix to improve recall as much as possible. The search was not limited to a specific period in order to retrieve all publications in Google Scholar, regardless of the year of publication, language or document type.

The search returned 1,190 records as of 31 March 2020. These records were subsequently exported including the following bibliographic fields: publication author(s), publication title, publication year, publication source, publication URL, and the number of citations received. Due to the unsupervised indexing process carried out by Google Scholar, bibliographic errors were found (Orduna-Malea, Martín-Martín & Delgado López-Cózar 2017), including multiple records for the same publication and incomplete/erroneous authorship.^{Footnote 14} To overcome this limitation, the data were manually cleansed and multiple copies were merged, yielding a final set of 988 records. The document type (journal article, book, book chapter, conference proceeding, working papers and reports) was directly determined from the information contained in the publication source field. All dubious cases were manually checked.
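The cleansing described above was manual, but a first automated pass can flag likely duplicates for review. The sketch below is an assumption-laden illustration: the record fields (`title`, `year`, `citations`), the normalized-title key, and the rule of keeping the highest citation count are all hypothetical choices, not the procedure followed in the study.

```python
import re


def normalize_title(title):
    """Lowercase a title and collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def merge_duplicates(records):
    """Merge records sharing the same normalized title and year.

    When copies are merged, the highest citation count is kept
    (an assumption made for this sketch).
    """
    merged = {}
    for record in records:
        key = (normalize_title(record["title"]), record["year"])
        if key not in merged:
            merged[key] = dict(record)
        else:
            kept = merged[key]
            kept["citations"] = max(kept["citations"], record["citations"])
    return list(merged.values())


# Made-up sample records, for illustration only.
sample = [
    {"title": "Mapping science: a case study", "year": 2018, "citations": 12},
    {"title": "Mapping Science - A Case Study", "year": 2018, "citations": 15},
    {"title": "Another bibliometric analysis", "year": 2019, "citations": 0},
]
cleaned = merge_duplicates(sample)
print(len(sample), "->", len(cleaned))
```

Any automated merge of this kind would still need the manual check of dubious cases that the text describes.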

The publication URL was used to manually access each of the publications. The search functionalities of web browsers (for HTML publications) and Adobe Acrobat (for PDF publications) were used to locate each URL mention in each of the publications, according to the available full text format. The publication section where the string “vosviewer.com” appeared (introductory sections, method, conclusions, or references) was manually checked and annotated throughout this process. When no clear structure (IMRaD type) was found, the category ‘unstructured’ was assigned.
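Searching for the bare string, without protocol or ‘www’ prefix, catches every written variant of the URL. A minimal sketch of such a count over a full text follows; the sample sentence and the path in it are made up for illustration, and the study itself located mentions manually with browser and PDF search tools.

```python
import re

# Bare domain string: matches with or without "https://" and "www.".
URL_PATTERN = re.compile(r"vosviewer\.com", re.IGNORECASE)


def count_url_mentions(text):
    """Count occurrences of the URL string in a full text."""
    return len(URL_PATTERN.findall(text))


# Made-up sample text containing three written variants of the URL.
sample_text = (
    "Maps were built with VOSviewer (https://www.vosviewer.com). "
    "The manual is at www.VOSviewer.com/documentation; see also vosviewer.com."
)
print(count_url_mentions(sample_text))
```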

During this process, full text access was not possible for 51 publications. No URL mentions were found for 32 publications (due to Google Scholar parsing errors^{Footnote 15}), and document typology could not be properly determined for 18 publications (due to lack of information in the available full texts), thus resulting in a final set of 887 publications mentioning ‘vosviewer.com’.
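The size of the final set follows directly from the figures reported above, as this short check shows (the dictionary labels are paraphrases of the three exclusion reasons):

```python
merged_records = 988  # records left after merging duplicates

# Publications excluded during the full-text check, by reason.
exclusions = {
    "no full-text access": 51,
    "no URL mention found": 32,
    "undetermined document type": 18,
}

final_set = merged_records - sum(exclusions.values())
print(final_set)  # 887
```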

URL mentions in webpages via Majestic

Link data related to the VOSviewer website was gathered from Majestic^{Footnote 16} through the site explorer feature (historic index). To do this, the direct root domain query “vosviewer.com” was carried out, obtaining a total of 17,261 mentioning webpages belonging to 1,330 distinct websites as of 12 April 2020. To characterize those webpages including a URL mention of “vosviewer.com”, additional web metrics related to each of these webpages were also directly obtained from Majestic (see Table 1 for a detailed description of these metrics), including the IP address, website language, and flow metrics (Trust Flow and Citation Flow)^{Footnote 17} (Jones, 2012). These flow metrics are meant to capture some idea of the “prestige” or reputation of the linking URLs.^{Footnote 18}

Table 1 Categories used to typify the purpose of those tweets mentioning the software’s URL (vosviewer.com)

Finally, each website was categorized. To this end, a bottom-up process based on a previous work on the categorization of scientific-related websites (Orduna-Malea 2021) was carried out. First, each website was accessed and manually classified according to its functional nature. A total of 46 academic-related categories were identified, whilst all non-academic related websites (casinos, adult content, etc.) were discarded. Second, an external researcher with expertise in website classification was asked to carry out an inter-coder reliability test on a random sample of 10% of the academic-related websites. The percentage of agreement achieved was 80%, and the Krippendorff's alpha (nominal) achieved was 0.92, which is considered acceptable.
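Both reliability figures can be computed from the two coders' labels. The sketch below implements percentage agreement and Krippendorff's alpha for the simplest setting, assumed here: two coders, nominal categories, and no missing labels. The function names and the sample labels are illustrative and are not taken from the study's data.

```python
from collections import Counter


def percent_agreement(coder_a, coder_b):
    """Share of units that both coders assigned to the same category."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty set of units")
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return matches / len(coder_a)


def krippendorff_alpha_nominal(coder_a, coder_b):
    """Krippendorff's alpha for two coders, nominal data, no missing values."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty set of units")
    n = 2 * len(coder_a)                          # total number of pairable values
    totals = Counter(coder_a) + Counter(coder_b)  # uses of each category overall
    disagreements = sum(a != b for a, b in zip(coder_a, coder_b))
    observed = 2 * disagreements / n              # observed disagreement D_o
    expected = (n * n - sum(c * c for c in totals.values())) / (n * (n - 1))  # D_e
    if expected == 0:
        raise ValueError("alpha is undefined when only one category is used")
    return 1 - observed / expected


# Made-up labels for four units, for illustration only.
coder_1 = ["blog", "blog", "journal", "journal"]
coder_2 = ["blog", "blog", "journal", "blog"]
print(percent_agreement(coder_1, coder_2))
print(krippendorff_alpha_nominal(coder_1, coder_2))
```

Alpha corrects the raw agreement for the agreement expected by chance given how often each category is used, which is why the two figures generally differ.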

It should be noted that only URLs at the web domain level were considered for website categorization. As blogs from generic blog providers (under ‘wordpress.com’, ‘blogspot’ or similar web domains) employ generic web domains provided by the blog service, several different blogs may be linked to the same web domain. In these cases, no academic-related category has been assigned, as particular blogs under generic blog providers have not been checked.

URL mentions in tweets via Twitter

All tweets containing the string “vosviewer.com” until 31 March 2020 were gathered from Twitter using the TweetDeck dashboard application,^{Footnote 19} without any time restriction. A corpus of 267 tweets mentioning the URL “vosviewer.com” was finally retrieved.

For each tweet, the username, date of publication, tweet text, number of replies, likes, and retweets were obtained. Likewise, the tweet type (original tweet, retweet, and reply) and the inclusion of images or videos were also collected.

Finally, each tweet was categorized according to its main purpose. To this end, a bottom-up process was carried out. First, the content of each tweet was accessed and manually classified according to the general purpose perceived, considering the words, textual signs, visual signs, multimedia, and tweet type. Second, all categories identified were grouped, standardized, and defined, resulting in a total of 10 general categories (see Table 1). Third, a second classification round was performed to reassign the standardized category to each tweet. Fourth, an external researcher with expertise in tweet classification was asked to carry out an inter-coder reliability test on a random sample of 10% of the tweets. The percentage of agreement achieved was 80%, and the Krippendorff's alpha (nominal) achieved was 0.76, which is considered acceptable. This test was used to reclassify a few tweets and improve the definition of categories.

At the user-level, all users providing likes, retweets and replies to each of the original 267 tweets were also obtained, and categorized (female, male, institutional, unknown). The social authority^{Footnote 20} of each Twitter user was obtained from Followerwonk.^{Footnote 21} This metric recursively measures the prestige of a Twitter account based on the prestige of the followers who retweet the tweets by the given account.^{Footnote 22} Specifically, it includes the following three components: a) the retweet rate of a few hundred of the measured user’s last non-@mention tweets; b) a time decay to favor recent activity versus old activity; c) other data for each user (such as follower count, friend count, and so on) that are optimized via a regression model trained to retweet rate. Social authority metrics score from 0 (no authority) to 100 (maximum authority).^{Footnote 23}

Different networks (connecting the user who publishes a tweet with the user who likes, retweets, or replies to the original tweet published) were generated with Gephi v0.9.1.^{Footnote 24} Finally, the data were statistically analyzed with XLStat.^{Footnote 25}
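Before a network tool can draw such graphs, the interactions have to be turned into an edge list. The sketch below shows one possible way to do that; the edge direction (from the engaging user to the tweet author), the CSV column names, and the usernames are assumptions made for illustration and should be checked against the import format of the tool actually used.

```python
import csv
import io
from collections import Counter


def build_edges(interactions):
    """Count directed edges (engaging user -> tweet author) per interaction kind."""
    edges = Counter()
    for author, engaging_user, kind in interactions:
        edges[(engaging_user, author, kind)] += 1
    return edges


def edges_to_csv(edges):
    """Serialize the weighted edge list as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Kind", "Weight"])
    for (source, target, kind), weight in sorted(edges.items()):
        writer.writerow([source, target, kind, weight])
    return buffer.getvalue()


# Made-up interactions: (tweet author, engaging user, interaction kind).
interactions = [
    ("user_a", "user_b", "like"),
    ("user_a", "user_b", "like"),
    ("user_a", "user_c", "retweet"),
    ("user_d", "user_a", "reply"),
]
edges = build_edges(interactions)
print(edges_to_csv(edges))
```

Keeping the interaction kind on each edge allows separate like, retweet, and reply networks to be filtered from the same file.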

Data processing and metrics

At the end of the process, a total of 887 academic publications, 17,261 webpages and 267 tweets are considered. The overall process followed is briefly illustrated in Fig. 2.

As we can observe in Fig. 2, each scenario operates with one specific source, each providing a different set of mentioning events: publications, webpages, and tweets. From each of these three document bodies, several context metrics can be obtained.

Based on the software diffusion indicators (paper diffusion breadth and journal diffusion breadth) proposed by Pan et al. (2018), the following tracking and context metrics have been obtained (Table 2). These are grounded on diffusion breadth metrics (number of elements mentioning the software) and impact breadth metrics (attention achieved by elements mentioning the software). Elements can be authors, journals, webpages, countries, languages, etc.
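Diffusion breadth reduces to counting distinct elements among the mentions. A minimal sketch follows, assuming each mention is stored as a record with one field per element type; the field names, the `mention_intensity` helper, and the sample data are illustrative.

```python
def diffusion_breadth(mentions, element):
    """Number of distinct elements (publications, journals, ...) with a mention."""
    return len({mention[element] for mention in mentions})


def mention_intensity(mentions, element):
    """Average number of mentions per distinct element."""
    return len(mentions) / diffusion_breadth(mentions, element)


# Made-up mentions: one record per URL mention found.
mentions = [
    {"publication": "P1", "journal": "J1"},
    {"publication": "P1", "journal": "J1"},
    {"publication": "P2", "journal": "J1"},
    {"publication": "P3", "journal": "J2"},
]
print(diffusion_breadth(mentions, "publication"))
print(diffusion_breadth(mentions, "journal"))
print(round(mention_intensity(mentions, "publication"), 2))
```

The same two functions apply unchanged to webpages, websites, tweets, or users, since only the element field changes.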

Table 2 List of tracking and context metrics used to measure the software use

For the sake of clarity, in this work a webpage will refer to any document displayed to a user in a web browser, regardless of its format, and represented by a URL (e.g., vosviewer.com/vosviewer.php). Likewise, a website will refer to a collection of webpages linked together in a coherent fashion, also represented by a URL which nests hierarchically all related webpages’ URLs (e.g., vosviewer.com).
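The distinction between the two levels can be made operational by splitting a URL into its host and its path. The following sketch uses Python's standard `urllib.parse`; treating the host without the ‘www’ prefix as the website is a simplification assumed here, and it does not resolve registered domains for sites hosted under shared providers.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Return (website, path) for a URL written with or without a scheme."""
    if "://" not in url:
        url = "//" + url  # lets urlsplit recognise the host part
    parts = urlsplit(url)
    host = parts.hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return host, parts.path


def is_website_level(url):
    """True when the URL points at the website root rather than an inner page."""
    _, path = split_url(url)
    return path in ("", "/")


print(split_url("vosviewer.com/vosviewer.php"))
print(is_website_level("https://www.vosviewer.com/"))
```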

Results

URL mentions in academic literature via Google Scholar

A total of 1,144 publication URL mentions from 887 different publications (publication diffusion breadth) were found. Most publications (79%) include only one URL mention, although a few publications include up to ten URL mentions, denoting a strong importance of the software in those publications (Fig. 3). In any case, the publication URL mention intensity is low (average of 1.29 URL mentions per publication).

The publication impact breadth achieves an i-10 index of 188 (and an i-100 index of 22), showing that a significant number of publications mentioning the software’s URL achieve citation-based impact. By contrast, 36.4% of this corpus of URL-mentioning documents had not received any citations at the time of data gathering. The publication diffusion breadth has increased over the years, especially in 2019 (Fig. 4). Most of these publications are journal articles (70.1%), while other categories such as theses and Master's theses (10.8%) and conference papers (9.2%) also show a remarkable presence (Fig. 5).
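An i-n index counts the items that reach at least n on some attention measure, such as citations here or likes and retweets for tweets. A minimal sketch with made-up citation counts:

```python
def i_n_index(counts, n):
    """Number of items with at least n citations (or likes, retweets, ...)."""
    return sum(1 for count in counts if count >= n)


def share_uncited(counts):
    """Fraction of items that received no citations at all."""
    return sum(1 for count in counts if count == 0) / len(counts)


# Made-up citation counts for six publications.
citations = [0, 0, 3, 10, 12, 150]
print(i_n_index(citations, 10))
print(i_n_index(citations, 100))
print(round(share_uncited(citations), 3))
```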

The references section is the most frequent location where VOSviewer’s URL mentions were found (30.8% of all URL mentions found), followed by the methodology section (29.7%) and results (18.1%). In addition, 108 documents (providing 144 URL mentions) did not exhibit a standard structure (Fig. 6). These results suggest a preference to mention VOSviewer’s URL as part of the bibliographic references, and as a methodological item in the mentioning publications.

Mentioning journal articles come from 499 different academic journals (source diffusion breadth), out of which 16% belong to the Library and Information Sciences field. The distribution of URL mentions per journal is highly skewed (431 journals appear with just one publication each including at least one URL mention). Scientometrics, JASIST, and Journal of Informetrics are the principal sources (Table 3), all of them core journals in the Library and Information Sciences field, the area in which the VOSviewer software has been applied.

Table 3 List of academic journals according to the number of mentions to the VOSviewer’s URL (vosviewer.com)

At the author level, a total of 2,130 authors are found as author(s) or co-author(s) of publications including at least one mention of the software’s URL (author diffusion breadth). The authors who have mentioned the software’s URL most often are shown in Table 4, where the co-developers of the software as well as other eminent researchers in the field of Scientometrics can be distinguished.

Table 4 List of authors according to the number of mentions provided to the VOSviewer’s URL (vosviewer.com)

URL mentions in websites via Majestic

VOSviewer’s official website has accumulated 21,440 website URL mentions since 2014 (no data is available before this date in Majestic), out of which 99.5% appear in internal pages (only 110 URL mentions come from websites’ homepages).

These website URL mentions come from 17,261 webpages (webpage diffusion breadth) belonging to 1,330 websites (website diffusion breadth). The website impact breadth of this corpus of mentioning websites is high (119 of these websites receive hyperlinks from at least 100,000 different external websites, while 244 of these websites receive hyperlinks from at least 10,000 different websites).

The number of new webpages mentioning VOSviewer’s URL increases over time. From 2015 onwards, the monthly average of new mentioning webpages is 246.7. During 2019, this average value increases to 467.9 (Fig. 7). The monthly average of new mentioning websites is 17.8 (Fig. 8).

The web authority of this corpus of websites mentioning the software’s URL is diverse. 136 websites (10.2%) achieve a Trust Flow score equal to or greater than 50 (out of 100), and 181 websites (13.6%) achieve a Citation Flow score equal to or greater than 50 (also out of 100), while most mentioning websites achieve lower scores, especially for Trust Flow (926 mentioning websites achieve scores lower than 10), and are thus websites with low web authority (Fig. 9).

The nature of mentioning websites is also diverse. About 28.5% of these 1,330 websites (379) correspond to academic-related websites (academic website diffusion breadth), whereas 23.8% (317) include dubious websites related to casinos, gambling, online bets, and even pornographic websites.

It is worth noting that a significant number of websites were not available two months after data gathering (473 websites, containing 2,906 mentioning webpages). Consequently, they could not be categorized. Likewise, 95 websites (containing 215 mentioning webpages) were parked^{Footnote 26} (Orduna-Malea, 2021), and 76 domains (208 mentioning webpages) automatically redirected to dubious websites.

Notwithstanding, the total percentage of mentioning webpages from non-academic websites was small (5.1%; 877 mentioning webpages), while URL mentions from academic-related webpages were the majority (74.1%; 12,799 mentioning webpages). Therefore, the incidence of non-academic websites on the overall online impact of VOSviewer’s website is limited (Fig. 10).

At the webpage level, URL mentions of VOSviewer’s URL from academic-related webpages come mainly from personal blogs (5,352 webpages), academic information products (1,883), and research groups (1,450). At the website level, universities and academic journals stand out (101 and 32 URL mentions, respectively). Most websites (91.9%) provide just one URL mention of the software. In addition, non-academic websites have been included by way of illustration (Table 5). Of these, 473 websites have expired, 95 were parked and 76 redirected to other web locations. Other significant non-academic categories included websites with tricks for SEO professionals (44), non-academic companies (40) and mentions from online message groups (57 links from 34 websites).

Table 5 Ranking of website categories according to the number of academic webpages and websites mentioning VOSviewer’s URL (vosviewer.com)

The language has been identified for 71% of all 17,261 webpages mentioning the software’s URL, covering 26 different languages (webpage language diffusion breadth). URL mentions come mainly from webpages written in English (5,894) and French (5,216). Analyzing the geo-location of the IP addresses of each website’s web domain, we find 3,849 webpages from 91 websites hosted in the Netherlands (which is coherent, as it is the place where VOSviewer’s developers work). However, only 46 webpages written in Dutch were identified (Table 6). A similar issue is found with Germany, where 539 webpages from 94 different websites are geo-located, but only 21 webpages are written in German. The use of English in most academic web environments may explain these results.

Table 6 Ranking of languages according to the number of webpages mentioning VOSviewer’s URL, written in the corresponding language

The IP address geo-location has identified webpages placed in 54 different countries (webpage location diffusion breadth), mainly from websites located in the United States (53.9%; 717 websites) and the United Kingdom (11%; 146 websites) (Fig. 11). At the level of mentioning webpages, Argentina (255 webpages from 3 websites) and Indonesia (113 webpages from 3 websites) stand out due to the existence of specific websites with many mentioning webpages (mainly from ‘r020.com.ar’, an Argentinean website dedicated to Library and information sciences resources with 249 mentioning webpages, and ‘dasaptaerwin.net’, an Indonesian personal academic website providing 113 mentioning webpages, respectively).

67.1% of webpages mentioning the software’s URL link directly to VOSviewer’s homepage. In addition, we can find a significant number of URL mentions linking to specific sections of the software’s website, especially maps created by the software as use examples, and the page where the software is available to download (Table 7).

Table 7 VOSviewer’s website specific URLs most mentioned by external webpages

URL mentions in tweets via Twitter

A total of 267 tweets containing a URL mention to the VOSviewer’s official website have been identified in the period (tweet diffusion breadth). As all tweets include just one URL mention to the software’s URL, the tweet diffusion breadth and the total tweet URL mention counts are the same.

This corpus of mentioning tweets has generated a further engagement of 89 replies (reply diffusion breadth), 646 retweets (retweet diffusion breadth), and 1,109 likes (like diffusion breadth) (Table 8). The tweet impact breadth is limited; the like-based impact breadth achieves an i10-index of 26 (i100-index of 1) and the retweet-based impact breadth achieves an i10-index of 15 (i100-index is null). The low number of replies makes the reply-based impact breadth null (i10-index is 0; i1-index is 65).

Table 8 Number of tweets mentioning the software’s URL (vosviewer.com) over the years, and the engagement achieved

75% of the 267 tweets gathered are original tweets, while 20% (53) are replies. Conversely, retweets containing URL mentions are scarce (5%; 14 tweets). Most tweets (80%) do not include images, and only two include media.

Original tweets related to diffusion (21.3% of all tweets) and software recommendation (16.9%) are the most frequent. VOSviewer’s URL is also frequently mentioned in conversations about the generation of bibliometric maps (17.2%). However, tweets about new version releases and use examples achieve the largest engagement, in terms of both likes and retweets received (Fig. 12).

As regards tweet authorship, 200 users have been identified (total author diffusion breadth), 56% of which are male, 30% female, and 12% institutional accounts (Table 9). The communities of users giving likes (liking author diffusion breadth of 748) and retweets (retweeting author diffusion breadth of 428) follow a similar gender distribution, except for institutional profiles, which participate less in the generation of likes.

Table 9 Gender analysis of the users who tweeted, retweeted, and liked tweets mentioning the VOSviewer software’s website

The overall community of attention of users who have published at least a tweet mentioning VOSviewer’s URL (200 users) can be characterized as having a high level of Twitter reputation, with a social authority median value equal to 47 (given that only few users in the world achieve a score of 100, this median value is considered substantial), and with a significant number of followers and total tweets published (Table 10). However, tweeting shows a skewed distribution of tweets per author, as only 18 users have published more than one tweet mentioning VOSviewer’s URL.

Table 10 Descriptive statistics related to users mentioning the VOSviewer software’s URL (vosviewer.com) (n = 200)

The network of the community of attention of VOSviewer is a very sparse user (tweet creator) to user (like generator) network (Fig. 13, top), which shows one large node, a few significant nodes, and many small nodes involved in sporadic interactions (one user generating a tweet and a few liking it). This network (with low density and average degree, and a large diameter considering the number of nodes involved) shows the dependence on the Twitter activity of one user (van Eck, one of the co-developers of VOSviewer). In addition, strongly unbalanced user behavior is detected: users who receive a great number of likes do not give likes to other users who include a URL mention in their tweets, reflecting a lack of interactivity in this community. The network based on retweets (Fig. 13, bottom) shows similar patterns, being small, sparse, and dependent on van Eck’s activity.
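The structural properties mentioned above (density, average degree, and the lack of reciprocal likes) can be sketched for a small directed network. This is an illustration under stated assumptions: edges run from the user giving a like to the author of the tweet, density is the usual directed-graph ratio m / (n(n − 1)), and the user names and the function `network_stats` are hypothetical rather than part of the study.

```python
def network_stats(edges: set[tuple[str, str]]) -> dict[str, float]:
    """Basic statistics of a directed user-to-user network.

    Each edge (liker, author) means that `liker` liked a tweet by `author`.
    """
    nodes = {user for edge in edges for user in edge}
    n, m = len(nodes), len(edges)
    # Directed density: observed edges over all possible ordered pairs.
    density = m / (n * (n - 1)) if n > 1 else 0.0
    # Average out-degree, which equals the average in-degree in a directed graph.
    avg_degree = m / n if n else 0.0
    # Share of edges whose reverse edge also exists (mutual likes).
    reciprocal = sum(1 for (u, v) in edges if (v, u) in edges)
    reciprocity = reciprocal / m if m else 0.0
    return {
        "nodes": n,
        "edges": m,
        "density": density,
        "avg_degree": avg_degree,
        "reciprocity": reciprocity,
    }


# Toy network: three users like tweets by one hub account, plus one isolated pair.
toy_edges = {("a", "hub"), ("b", "hub"), ("c", "hub"), ("d", "e")}
stats = network_stats(toy_edges)
```

A reciprocity of zero in such a network corresponds to the unbalanced behavior described above, in which highly liked users do not return likes.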

Combining scenarios

Each of the three scenarios analyzed (academic literature, websites and Twitter) provides a complementary story about the use and interest in VOSviewer. Each scenario is determined by the different available context metrics provided in the corresponding analytical framework, each of which covers specific document bodies (academic publications, webpages and tweets, respectively).

The academic literature scenario allows checking the use of the software in publications. This way, VOSviewer’s diffusion breadth can be determined in terms of the number of publications, sources or authors mentioning the software’s URL, and the intensity of this mentioning event. Moreover, driving the analysis to a greater detail, the appearance of mentions in different sections—beyond the references—can inform about the nature of the mentions. Thus, URL mentions can appear in the method section (as part of describing procedures and tasks performed), results (mainly to show data created with the software to illustrate direct findings related to objectives), and introductory sections (mainly to supplement literature reviews).

The web scenario has allowed determining VOSviewer’s interest through webpages linking to the software’s URL. This way, VOSviewer diffusion breadth can be determined in terms of the number of webpages and websites mentioning the software, and the number of different languages and countries from which the software is mentioned. Moreover, the appearance of URL mentions in specific academic-related webpages can potentially inform about the interest of software in academic spaces beyond publications.

The social media scenario has allowed checking the academic software’s use and interest through tweets linking to the software’s URL. This way, diffusion breadth can be determined in terms of the number of tweets and users mentioning the software’s URL. Moreover, the appearance of URL mentions in different tweets allows detailed analyses at the tweet-level, such as the type of tweet (distinguishing replies—as part of discussions—and retweets—for mere diffusion purposes—from original tweets), the motivation of the tweet (e.g., software releases, use examples, etc.) and the engagement of the tweet (number of likes and retweets achieved by tweets mentioning the software’s URL). In addition, data captures attention characteristics of the audience interested in the software, for example users’ likes. The conversational nature of Twitter can also be useful in detecting user networks who may potentially employ the software, favoring the interaction and circulation of scientific knowledge across different communities of attention (Díaz-Faes et al., 2019).

All metrics gathered from each of the scenarios can be combined in technical analytical sheets (Fig. 14) to characterize the software’s use in a multidimensional way. Notably, all data are captured externally in a non-invasive way, without webmaster access privileges, so the same procedure can be used to evaluate any other software, as long as the software has an unambiguous URL.
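One possible way to assemble such a sheet programmatically is sketched below. The counts are those reported in this study for each scenario, whereas the class `ScenarioSheet`, the function `flatten`, and the metric labels are hypothetical choices made only for illustration; they do not reproduce the layout of Fig. 14.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenarioSheet:
    """Metrics gathered for one scenario from one source."""
    scenario: str
    source: str
    metrics: dict[str, int]


# Counts as reported in the text; labels are illustrative.
sheets = [
    ScenarioSheet("academic literature", "Google Scholar",
                  {"mentioning publications": 988}),
    ScenarioSheet("web", "Majestic",
                  {"mentioning websites": 1330, "countries": 54}),
    ScenarioSheet("social media", "Twitter",
                  {"mentioning tweets": 267, "tweeting users": 200,
                   "replies": 89, "retweets": 646, "likes": 1109}),
]


def flatten(items: list[ScenarioSheet]) -> dict[str, int]:
    """Merge all scenarios into a single 'source: metric' table."""
    return {
        f"{s.source}: {name}": value
        for s in items
        for name, value in s.metrics.items()
    }


evaluation_sheet = flatten(sheets)
```

Prefixing each metric with its source keeps the origin of every figure visible once the scenarios are merged.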

Data included in the evaluation sheet is aimed at giving a detailed overview of software’s diffusion breadth. Obviously, we do need to compare these results with other digital objects to determine whether the results obtained show an outstanding diffusion or not.

Discussion

The suitability of the approach presented in this case study depends mainly on the sources employed and the object selected. In this work, three sources have been selected to exemplify the proposed approach (Google Scholar, Majestic, and Twitter). The coverage, accuracy and availability of these sources determine the comprehensiveness and suitability of tracking and context metrics.

The main advantages of all three sources are their coverage and the wide variety of available tracking metrics. Google Scholar covers academic publications regardless of type, language, and discipline, enabling the discovery of mentions in a larger number of publications and sources than other bibliographic sources offer. Majestic covers millions of webpages regardless of webpage type, language, or location, enabling the discovery of mentions in the global open online sphere, where academics and practitioners also participate. Twitter covers millions of tweets, enabling the discovery of mentions in the Twittersphere (the total universe of Twitter users and their publishing/following/reacting habits), which allows measuring interactivity between users (e.g., discussion threads or engagement).

Conversely, all these sources show limitations in availability, accuracy, and volatility. Google Scholar does not offer data export facilities (the Publish or Perish software helps, but it is also bound by Google Scholar’s requirements). It also introduces errors in the bibliographic records that require manual examination, which in turn hampers the analysis of large sets of publications. Moreover, unlike other bibliographic databases, Google Scholar is not a cumulative database, and documents can be de-indexed if they stop meeting Google Scholar’s technical requirements (Delgado López-Cózar, Orduña-Malea & Martín-Martín 2019). Majestic operates with highly volatile data, as webpages are continuously changing. For example, 4,365 URL mentions to VOSviewer came from one specific personal blog (culturalibre.ca). Most of these hyperlinks were deleted just a few months after data gathering, and the whole domain was later disabled. Therefore, web data should be treated as dynamic and fluid impact instead of current impact. Moreover, Majestic data is offered under a paid license, which limits its use for large-scale research endeavors, and its flow metrics are composite indicators whose full methodology is unknown, which compromises their transparency. Leaving aside the volatility of the data, Twitter also faces limitations derived from compliance with its terms of use and with the General Data Protection Regulation (GDPR) as regards users’ demographic data. Availability is also a problem, as the public Twitter API is not sufficient for large-scale data analyses and full API functionality is offered under different paid licenses. The recent Twitter academic research API^{Footnote 27} might partially address these limitations.

The object used to represent the software is another important aspect of the proposed approach. In this work, the object corresponds to the URL of the software’s official website. The choice of a URL as a traceable digital object has important advantages. For example, the mentions to the URL can be unequivocally identified, avoiding the polysemy and synonymy of natural language. In addition, URLs have specific search filters to ease data retrieval in a multitude of sources. These characteristics allow the generation of faster, simpler and more refined information retrieval systems. Moreover, URLs also allow online navigation (and, therefore, web traffic), which potentially facilitates software usage, and constitutes an effective type of mentioning software for scientific purposes. Finally, we can find a wide variety of URL metrics, which are not available for other textual metrics or, in some cases, obtaining them would require much more complex computing needs.
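The difference between a URL mention and a plain text-mention of the software’s name can be made concrete with a small sketch. The patterns below are an assumption about how such matching might be done, not the retrieval procedure used by the sources in this study; the function name `classify_mention` is hypothetical, and subdomains other than `www` are deliberately not matched.

```python
import re

# URL mention: vosviewer.com, optionally preceded by a scheme and/or "www.".
# The lookbehind rejects other hosts that merely end in "vosviewer.com".
URL_MENTION = re.compile(
    r"(?<![\w.-])(?:https?://)?(?:www\.)?vosviewer\.com\b",
    re.IGNORECASE,
)
# Name mention: the software's name as a standalone word, in any casing.
NAME_MENTION = re.compile(r"\bvosviewer\b", re.IGNORECASE)


def classify_mention(text: str) -> str:
    """Return 'url', 'name', or 'none' for a passage of text."""
    if URL_MENTION.search(text):
        return "url"
    if NAME_MENTION.search(text):
        return "name"
    return "none"


examples = [
    "Maps were created with VOSviewer (https://www.vosviewer.com).",
    "importados para o software VosViewer com o intuito de esboçar a rede",
    "The maps were drawn by hand.",
]
labels = [classify_mention(t) for t in examples]
```

The second example contains the name followed by the Portuguese word "com", which is a text-mention rather than a URL mention; requiring the literal dot in the pattern keeps the two cases apart.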

Despite the benefits of using URL as an object, the following limitations (and threats) should be acknowledged (Table 11):

Table 11 URL-based text mentions limitations

Software can be mentioned by authors for a wide range of reasons (as an instrument, an artifact, a scientific protocol, a method, or just an example; Howison & Bullard, 2016; Li et al., 2019) regardless of the type of text-mention used (URL, software’s name, etc.). The publication sections where the mention appears and the category of tweets where the software is mentioned have been studied in this work precisely as exploratory signals. Further qualitative techniques would add more context to the Twitter conversations, which may enhance the software’s technical evaluation sheet obtained. The taxonomy of tweets is based only on the collected tweets, but a more generic taxonomy would be advisable, including other potential categories or subcategories (e.g., software awards, datacamps, dedicated conferences and meetings, official use of scientific software in syllabi, mentions and reviews from professional specialized media, diffusion of books dedicated to the software, forks, etc.) not considered in this analysis but likely to be necessary for other types of software.

VOSviewer has been used as a case study to exemplify the method. The manageable number of URL mentions found together with its online presence has made its choice an adequate one to test a wide number of metrics per analytical framework.

Conclusion

A webometric analytical approach to track scientific software use and interest has been proposed in this work. This approach is based on the definition of scenarios, analytical elements, analytical frameworks, sources, objects, events, tracking metrics and context metrics. The operationalization of the approach has been exemplified by analyzing one specific academic software (VOSviewer) and object (VOSviewer’s official URL).

Results show that the different analytical frameworks provide useful information about the usage of scientific software, expanding the notion of usage of scientific software in research publications to dissemination and interest of scientific software in the research community, illustrating how this usage information is relevant to fully comprehend the broader influence of scientific software in the whole research ecosystem.

The Google Scholar scenario has shown VOSviewer as a research resource, whilst the Majestic and Twitter scenarios have shown the interest of VOSviewer as an informational and a conversational resource, respectively.

Finally, the approach proposed in this study can be expanded by adding new scenarios and new analytical frameworks. Each analytical framework can also be expanded by considering additional sources, objects, events, tracking metrics and context metrics. In addition, data from different scenarios can be combined to create new indicators and to show added-value information (e.g., an integrated publication timeline for each specific event tracked).

As a matter of fact, the approach proposed in this study can actually be used to track any research object that can be enclosed in a specific URL (e.g., software, scientific conferences, presentations, online courses, videos, scientific exhibitions, research projects, academic websites, etc.), effectively expanding the analytical scope of the scientometric toolset by incorporating a novel digital dimension through methods that draw from the fields of Webometrics and Altmetrics.

Notes

General-purpose applications are those originally developed for general usage, which can also be applied to assist and support some trivial scientific tasks, such as writing documents, sending e-mails, making video calls or giving presentations (Soito & Hwang, 2016). These applications (for example, Microsoft Word) have no effect on the validity of scientific results and can be easily replaced by other similar solutions (Pan et al., 2019).
https://www.vosviewer.com.
https://www.citnetexplorer.nl.
https://gephi.org.
https://www.nsf.gov/pubs/policydocs/pappg20_1/nsf20_1.pdf.
https://www.ref.ac.uk/media/1092/ref-2019_01-guidance-on-submissions.pdf.
https://www.force11.org/group/software-citation-working-group.
https://www.force11.org/group/software-citation-implementation-working-group.
http://journals.aas.org/policy/software.html.
http://sciencecodemanifesto.org/.
https://www.software.ac.uk/blog/2016-10-06-publish-or-be-damned-alternative-impact-manifesto-research-software.
https://www.vosviewer.com.
https://harzing.com/resources/publish-or-perish.
For example, one author field found was: < I Kellevezir, G Özdağoğlu, M Damar… > , and a manual inspection revealed a missing author. The field was completed as follows: < I Kellevezir, G Özdağoğlu, M Damar, A Özdağoğlu > . In other cases, author names and surnames were altered. For example, the author field < VA Vasco López, M Moreno Mejía, PA Reyes Gavilán… > was updated to < M Moreno Mejía, P Reyes Gavilán, V Vasco López, A Aroca Mejía, N Herrera > .
For example, the sentence “formatados e importados para o software VosViewer com o intuito de esboçar a rede de conexão dos termos pesquisados” does not include a URL mention but a text-mention.
https://majestic.com.
Table 2 includes formal definitions of these two metrics.
The incorporation of this metric only serves to illustrate the relevance of characterizing linking websites by their “prestige”; it represents neither a validation of this metric (which at best must happen in future research) nor a recommendation to incorporate it as a fixed element of the proposed analytical framework.
https://tweetdeck.twitter.com.
https://followerwonk.com/social-authority.
https://followerwonk.com.
https://moz.com/blog/social-authority.
As before, the incorporation of this metric only serves to illustrate the relevance of characterizing Twitter users by their “prestige” or “social media capital” (see Díaz-Faes, Bowman, & Costas 2019); it represents neither a validation of this metric (which at best must happen in future research) nor a recommendation to incorporate it as a fixed element of the proposed analytical framework.
https://gephi.org.
https://www.xlstat.com.
A parked web domain is a domain name which has been registered but it is not associated with any service (email, website, etc.). Sometimes, a dummy webpage is artificially introduced to avoid a blank page on the browser.
https://developer.twitter.com/en/portal/petition/academic/is-it-right-for-you.
https://scholar.google.com.
https://majestic.com.
http://www.webometrics.info/en.
https://twitter.com.
https://www.similarweb.com/website/twitter.com.

References

Bruns, A., Weller, K., Zimmer, M., & Proferes, N. J. (2014). A topology of Twitter research: Disciplines, methods, and ethics. Aslib Journal of Information Management, 66(3), 250–261.
Cronin, B., Snyder, H. W., Rosenbaum, H., Martinson, A., & Callahan, E. (1998). Invoked on the Web. Journal of the American Society for Information Science, 49(14), 1319–1328.
Delgado López-Cózar, E., Orduna-Malea, E., & Martín-Martín, A. (2019). Google Scholar as a data source for research assessment. In W. Glänzel, H. Moed, U. Schmoch, & M. Thelwall (Eds.), Springer handbook of science and technology indicators (pp. 95–127). Springer.
Delgado López-Cózar, E., Orduna-Malea, E., Martín-Martín, A., & Ayllón, J. M. (2017). Google Scholar: The big data bibliographic tool. In F. J. Cantú-Ortiz (Ed.), Research analytics: Boosting university productivity and competitiveness through scientometrics (pp. 59–80). Taylor and Francis.
Díaz-Faes, A., Bowman, T. D., & Costas, R. (2019). Towards a second generation of ‘social media metrics’: Characterizing Twitter communities of attention around science. PLoS ONE, 14(5), e0216408. https://doi.org/10.1371/journal.pone.0216408
Du, C., Cohoon, J., Lopez, P., & Howison, J. (2021). Softcite dataset: A dataset of software mentions in biomedical and economic research publications. Journal of the Association for Information Science and Technology. https://doi.org/10.1002/asi.24454
Gusenbauer, M. (2019). Google Scholar to overshadow them all? Comparing the sizes of 12 academic search engines and bibliographic databases. Scientometrics, 118(1), 177–214.
Hafer, L., & Kirkpatrick, A. E. (2009). Assessing open source software as a scholarly contribution. Communications of the ACM, 52(12), 126–129.
Halavais, A. (2008). The hyperlink as organizing principle. In J. Turow & L. Lokman (Eds.), The hyperlinked Society: Questioning connections in the digital age (pp. 39–55). The University of Michigan Press.
Hannay, J. E., MacLeod, C., Singer, J., Langtangen, H. P., Pfahl, D., & Wilson, G. (2009). How do scientists develop and use scientific software? Proceedings of the 2009 ICSE workshop on software engineering for computational science and engineering, SECSE 2009, 1–8. https://ieeexplore.ieee.org/abstract/document/5069155.
Haustein, S., Bowman, T. D., & Costas, R. (2016). Interpreting “altmetrics”: Viewing acts on social media through the lens of citation and social theories. In C. Sugimoto (Ed.), Theories of informetrics and scholarly communication (pp. 372–406). De Gruyter Saur.
Hey, T., Tansley, S., & Tolle, K. M. (Eds.) (2009). The fourth paradigm: Data-intensive scientific discovery. Redmond, WA: Microsoft Research. https://www.microsoft.com/en-us/research/wp-content/uploads/2009/10/Fourth_Paradigm.pdf.
Howison, J., & Bullard, J. (2016). Software in the scientific literature: Problems with seeing, finding, and using software mentioned in the biology literature. Journal of the Association for Information Science and Technology, 67(9), 2137–2155.
Howison, J., & Herbsleb, J. D. (2011). Scientific software production: incentives and collaboration. Proceedings of the ACM 2011 conference on computer supported cooperative work –CSCW ’11, 513–522. https://doi.org/10.1145/1958824.1958904
Howison, J., Deelman, E., McLennan, M. J. M., Da Silva, R. F., & Herbsleb, J. D. (2015). Understanding the scientific software ecosystem and its impact: Current and future measures. Research Evaluation, 24(4), 454–470.
Jansen, B. J., Jung, S.G., & Salminen, J. (2020). Data Quality in Website Traffic Metrics: A Comparison of 86 Websites Using Two Popular Analytics Services. http://www.bernardjjansen.com/uploads/2/4/1/8/24188166/traffic_analytics_comparison.pdf.
Jones, D. (2012). Flow Metrics™ will change the way you look at links. Majestic Blog. https://blog.majestic.com/development/flow-metrics.
Katz, D. S., Choi, S.-C. T., Niemeyer, K. E., et al. (2016). Report on the third workshop on sustainable software for science: Practice and experiences (WSSSPE3). https://arxiv.org/abs/1602.02296.
Li, K., Chen, P. Y., & Yan, E. (2019). Challenges of measuring software impact through citations: An examination of the lme4 R package. Journal of Informetrics, 13(1), 449–461.
Li, K., & Yan, E. (2018). Co-mention network of R packages: Scientific impact and clustering structure. Journal of Informetrics, 12(1), 87–100.
Li, K., Yan, E., & Feng, Y. (2017). How is R cited in research outputs? Structure, impacts, and citation standard. Journal of Informetrics, 11(4), 989–1002.
Lepori, B., Aguillo, I. F., & Seeber, M. (2014). Size of web domains and interlinking behavior of higher education institutions in Europe. Scientometrics, 100(2), 497–518.
Niemeyer, K. E., Smith, A. M., & Katz, D. S. (2016). The challenge and promise of software citation for credit, identification, discovery, and reuse. Journal of Data and Information Quality, 7(4), 1–5.
Orduna-Malea, E. (2021). Dot-Science Top Level Domain: Academic websites or dumpsites? Scientometrics, 126(4), 3565–3591. https://doi.org/10.1007/s11192-020-03832-8
Orduna-Malea, E. (2020). Investigando con Twitter: una mirada según el Reglamento General de Protección de Datos. In Francisca Ramón-Fernández (Ed.). Marco jurídico de la ciencia de datos (pp. 331–378). Valencia: Tirant lo Blanch.
Orduna-Malea, E., & Alonso-Arroyo, A. (2017). Cybermetric techniques to evaluate organizations using web-based data. Chandos Publishing.
Orduna-Malea, E., Ayllón, J. M., Martín-Martín, A., & Delgado López-Cózar, E. (2015). Methods for estimating the size of Google Scholar. Scientometrics, 104(3), 931–949.
Orduna-Malea, E., Martín-Martín, A., & Delgado López-Cózar, E. (2017). Google Scholar as a source for scholarly evaluation: A bibliographic review of database errors. Revista Española De Documentación Científica, 40(4), 1–33.
Orduna-Malea, E., & Regazzi, J. J. (2014). US academic libraries: Understanding their web presence and their relationship with economic indicators. Scientometrics, 98(1), 315–336.
Ortega, J. L. (2014). Academic search engines: A quantitative outlook. Elsevier.
Ovadia, S. (2009). Exploring the potential of Twitter as a research tool. Behavioral & Social Sciences Librarian, 28(4), 202–205.
Pan, X., Cui, M., Yu, X., & Hua, W. (2017). How is CiteSpace used and cited in the literature? An analysis of the articles published in English and Chinese core journals. ISSI 2017–16th International conference on Scientometrics and Informetrics. http://issi-society.org/proceedings/issi_2017/2017ISSI%20Conference%20Proceedings.pdf.
Pan, X., Yan, E., & Hua, W. (2016). Disciplinary differences of software use and impact in scientific literature. Scientometrics, 109(3), 1–18.
Pan, X., Yan, E., Cui, M., & Hua, W. (2018). Examining the usage, citation, and diffusion patterns of bibliometric mapping software: A comparative study of three tools. Journal of Informetrics, 12(2), 481–493.
Pan, X., Yan, E., Cui, M., & Hua, W. (2019). How important is software to library and information science research? A content analysis of full-text publications. Journal of Informetrics, 13(1), 397–406.
Pan, X., Yan, E., Wang, Q., & Hua, W. (2015). Assessing the impact of software on science: A bootstrapped learning of software entities in full-text papers. Journal of Informetrics, 9(4), 860–871.
Park, H. W., & Thelwall, M. (2003). Hyperlink analyses of the World Wide Web: A review. Journal of computer-mediated communication. https://doi.org/10.1111/j.1083-6101.2003.tb00223.x
Park, H., & Wolfram, D. (2019). Research software citation in the Data Citation Index: Current practices and implications for research software sharing and reuse. Journal of Informetrics, 13(2), 574–582.
Pia, M. G., Basaglia, T., Bell, Z. W., & Dressendorfer, P. V. (2009). Geant4 in scientific literature. IEEE Nuclear Science Symposium Conference Record, 189–194. https://ieeexplore.ieee.org/document/5401810.
Piwowar, H. A. (2013). Value all research products. Nature, 493, 159.
Pradal, C., Varoquaux, G., & Langtangen, H. P. (2013). Publishing scientific software matters. Journal of Computational Science, 4(5), 311–312.
Smith, K. (2020). 58 Incredible and Interesting Twitter Stats and Statistics. Brandwatch. https://www.brandwatch.com/blog/twitter-stats-and-statistics.
Smith, A. M., Katz, D. S., & Niemeyer, K. E. (2016). Software citation principles. PeerJ Computer Science, 2, e86. https://peerj.com/articles/cs-86/.
Soito, L., & Hwang, L. J. (2016). Citations for Software: Providing identification, access and recognition for research software. IJDC, 11(2), 48–63.
Stewart, B. (2017). Twitter as method: Using Twitter as a tool to conduct research. In L. Sloan & A. Quan-Haase (Eds.), Social media research methods (pp. 251–266).
Thelwall, M. (2004). Link Analysis: An information science approach. Elsevier.
Thelwall, M. (2006). Interpreting social science link analysis research: A theoretical framework. Journal of the American Society for Information Science and Technology, 57(1), 60–68.
Thelwall, M., & Kousha, K. (2016). Academic software downloads from google code. Information Research, 21(1). http://informationr.net/ir/21-1/paper709.html#.XzelJ-gzbIU.
Van Eck, N., & Waltman, L. (2010). Software survey: VOSviewer, a computer program for bibliometric mapping. Scientometrics, 84(2), 523–538.
Williams, S. A., Terras, M. M., & Warwick, C. (2013). What do people study when they study Twitter? Classifying Twitter related academic papers. Journal of Documentation, 69(3), 384–410.
Wouters, P., Zahedi, Z., & Costas, R. (2019). Social media metrics for new research evaluation. In W. Glänzel, H. F. Moed, U. Schmoch, & M. Thelwall (Eds.), Springer handbook of science and technology indicators (pp. 687–713). Springer.
Yang, B., Rousseau, R., Wang, X., & Huang, S. (2018). How important is scientific software in bioinformatics research? A comparative study between international and Chinese research communities. Journal of the Association for Information Science and Technology, 69(9), 1122–1133.

Acknowledgements

Special thanks to Ariadna Fernández-Planells for the Twitter intercoder reliability test, and to Cristina Font-Julián for the website intercoder reliability test.

Funding

RC was partially funded by the South African DST‐NRF Center of Excellence in Scientometrics and Science, Technology, and Innovation Policy (SciSTIP).

Author information

Authors and Affiliations

Department of Audiovisual Communication, Documentation and History of Art, Universitat Politècnica de València, Valencia, Spain
Enrique Orduña-Malea
Centre for Science and Technology Studies (CWTS), Leiden University, Leiden, The Netherlands
Rodrigo Costas
DST‐NRF Centre of Excellence in Scientometrics and Science, Technology and Innovation Policy, Stellenbosch University, Stellenbosch, South Africa
Rodrigo Costas

Authors

Enrique Orduña-Malea
Rodrigo Costas

Contributions

Rodrigo Costas is a member of the Distinguished Reviewers Board of Scientometrics, and is employed at the Centre for Science and Technology Studies (CWTS), which hosts the VOSviewer software. However, he is not involved in its development. All authors contributed to the study conception and design. Data collection and analysis were performed by Enrique Orduña-Malea. The manuscript was written by all authors. All authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Enrique Orduña-Malea.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary material 1 (xlsx 4,164 KB)

Appendix 1 Data sources used in the study

In this section we discuss the main data sources selected for the three scenarios and the main methodological approach.

Google Scholar

Google Scholar^{Footnote 28} is a freely accessible academic search engine launched in 2004 and aimed at facilitating the discovery of academic literature worldwide (Ortega, 2014). To accomplish this overarching goal, Google Scholar employs user agents (web applications) that automatically discover and scan websites by following hyperlinks from one webpage to another (Delgado López-Cózar et al., 2017). This way, Google Scholar parses the entire academic web (websites of universities, scientific publishers, repositories, aggregators, library catalogues, and any other web spaces where it might find academic-like materials).

Google Scholar indexes, in a mostly unsupervised manner, every scholarly document it finds as long as it meets a set of technical requirements, thus covering a whole range of academic document types (books, book chapters, journal articles, conference articles, teaching materials, theses, posters, presentations, reports, patents, etc.), from a wide range of disciplines (including Arts, Engineering, Humanities and Social Sciences) and regardless of their language (Delgado López-Cózar et al., 2019).

These operating characteristics make Google Scholar the largest bibliographic database in the world today (Delgado López-Cózar et al., 2019; Gusenbauer, 2019; Orduna-Malea et al., 2015; Ortega, 2014), and therefore, a suitable source to measure software text-mentions in the academic literature.

Majestic

Majestic^{Footnote 29} is a link intelligence tool launched in 2008 by Majestic-12 and oriented to massive link analysis and search engine optimization (SEO). While Google Scholar indexes academic documents, Majestic indexes URLs, providing a wide range of tailored metrics to determine the impact and authority of these URLs (and the webpages they represent) on the Web, especially the number of webpages including text mentions to the URL analyzed.

Currently, Majestic is one of the most comprehensive sources of web data, declaring 2,482 billion unique URLs indexed in its historic database (coverage from 2015 to 2020) and 947 billion unique URLs indexed in its fresh database (last five months), as of 21 January 2021. This database includes all kinds of websites (blogs, portals, wikis, fora, etc.) from all kinds of users (personal websites, company websites, organizations, institutions, etc.).

Given its functionalities and coverage, Majestic has been used as a data source in webometrics contributions (Jansen, Jung & Salminen, 2020; Lepori et al., 2014; Orduna-Malea & Regazzi, 2014) and it is also used as a data source in the Ranking Web of universities.^{Footnote 30} For these reasons, Majestic constitutes an authoritative source to measure software text-mentions on the Web-at-large.

Twitter

Twitter^{Footnote 31} is a microblogging service created in March 2006 by Jack Dorsey, Evan Williams, Biz Stone, and Noah Glass. The main feature of this platform is the instant creation and publication of short messages called tweets (originally up to 140 characters, a limit that was expanded to 280 in 2017), in which diverse elements can be embedded (static or moving images, hyperlinks, hashtags and mentions of other users). Readers can also interact with these messages in different ways: expressing that they find a tweet interesting (like), spreading it to their contacts (retweet) or giving a direct answer (reply), thus generating discussion threads (a conversational feature). In a complementary way, users build social networks by following the publications and activities of other users. They can also follow topics of interest (through hashtags), make user lists or even communicate with other users through private direct messages. Twitter also offers a large number of metrics both at the user level (e.g., number of tweets published, number of followers achieved, user demographic data [gender, location, interests…]) and at the tweet level (e.g., number of likes, retweets and replies received, tweet demographic data [who created the tweet, when, where…]).

Twitter generates approximately 700 million tweets per day (as of January 2020). It is estimated that since its inception, around 1.3 billion accounts have been created (Smith, 2020), maintaining some 330 million active monthly users and 145 million daily users. The Twitter website is visited by 6.54 billion users (both registered and unregistered) monthly according to December 2020 data provided by the analytics tool SimilarWeb.^{Footnote 32}

Given the huge amount of data generated on Twitter, this platform has been widely used not only as a source of data in the scientific literature but also as an object of study in its own right (Bruns et al., 2014; Ovadia, 2009; Stewart, 2017; Williams et al., 2013). For these reasons, Twitter constitutes an authoritative source to measure text mentions of software on social media.
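As an illustration of how the tweet-level metrics described in this appendix could be combined with URL mentions, the following minimal Python sketch aggregates likes, retweets and replies over the tweets that embed a hyperlink to a software website. The `Tweet` record and its field names are hypothetical and do not correspond to Twitter's API schema. The host name and the sample data are likewise assumptions made here for illustration.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class Tweet:
    """Hypothetical tweet record; field names are illustrative, not an API schema."""
    author: str
    likes: int = 0
    retweets: int = 0
    replies: int = 0
    expanded_urls: list = field(default_factory=list)


def links_to(tweet, target_host):
    """True if any embedded hyperlink points at target_host or one of its subdomains."""
    for url in tweet.expanded_urls:
        host = (urlsplit(url).hostname or "").lower()
        if host == target_host or host.endswith("." + target_host):
            return True
    return False


def summarize(tweets, target_host):
    """Aggregate tweet-level metrics over the tweets that mention the target website."""
    hits = [t for t in tweets if links_to(t, target_host)]
    return {
        "tweets": len(hits),
        "users": len({t.author for t in hits}),
        "likes": sum(t.likes for t in hits),
        "retweets": sum(t.retweets for t in hits),
        "replies": sum(t.replies for t in hits),
    }


# Invented sample data, for illustration only.
sample = [
    Tweet("user_a", likes=3, retweets=1, replies=0,
          expanded_urls=["https://www.vosviewer.com/"]),
    Tweet("user_a", likes=0, retweets=2, replies=1,
          expanded_urls=["https://vosviewer.com/download"]),
    Tweet("user_b", likes=5, retweets=0, replies=0,
          expanded_urls=["https://example.org/"]),
    Tweet("user_c", likes=1, retweets=0, replies=2),
]

summary = summarize(sample, "vosviewer.com")
print(summary)
```

The sketch works on already expanded link targets because hyperlinks in tweets are commonly shortened, so matching on the visible tweet text alone could miss mentions.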

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Orduña-Malea, E., Costas, R. Link-based approach to study scientific software usage: the case of VOSviewer. Scientometrics 126, 8153–8186 (2021). https://doi.org/10.1007/s11192-021-04082-y

Received: 24 February 2021
Accepted: 14 June 2021
Published: 10 July 2021
Issue Date: September 2021
DOI: https://doi.org/10.1007/s11192-021-04082-y
