VisDict: Improving Communication Via a Visual Dictionary in a Science Gateway

Effective communication is vital for academic project success, particularly in multidisciplinary teams with diverse backgrounds and disciplines. Misunderstandings can arise from differing interpretations of terms, which may go unnoticed. VisDict aims to bridge this gap by creating a visual dictionary within a science gateway to facilitate clear communication between workflow providers and domain researchers. This approach translates computational science concepts for researchers in fields such as physics and biology. This article delves into our method for building the visual dictionary, the insights gained from curating initial entries, and future plans for automated expansion and illustration of relevant terms.

Computational workflows and workflow management systems are an invaluable asset for several research areas with complex computational methods, such as the Laser Interferometer Gravitational-Wave Observatory (LIGO) project, 1 genomic analyses, and high-energy physics. Workflow projects inherently involve multidisciplinary collaboration, making communication of plans and details among researchers from various disciplines and backgrounds a challenging and error-prone task. The VisDict project addresses effective communication between research domains and workflow providers through the development of a visual dictionary. The current prototype dictionary includes terms with a definition per research domain. Each definition is drawn from a citable resource and/or other well-established dictionaries, such as the Oxford Dictionary of Biology. With the notion that "a picture is worth a thousand words," we believe that presenting concepts in figures contributes to the understanding of a term and its definition, enriching the experience and making it easier to grasp how the definitions and concepts differ among disciplines. Ontologies fulfill a similar goal by increasing the understanding of terms and are well established in many domains. The gap VisDict fills is a focus on visualization and the better understanding of terms that comes with a graphical representation.
Our strategy begins by carefully curating 11 entries across computer science, biology, and physics to serve as the foundation for the visual dictionary. The manual approach is time consuming and not scalable to fill a dictionary. Thus, we created scripts to automate the selection of terms and suggestions for the visualizations. In this article, we explain these steps as well as how we envision continuing to curate the dictionary and the VisDict science gateway, which will enable the community to vote on definitions and visualizations.

BACKGROUND
Organizers of training for research facilitators recognize the challenge of effective communication between domain researchers and computational scientists. The virtual residency for research facilitators provides professional skill development, with a particular focus on cyberambassadors' training sessions designed to enhance communication between facilitators and domain researchers. VisDict's dictionary complements this training approach, expanding the tools available to participants. This work is inspired by dictionary illustrations "whereby a word is explained by pointing to an object" and by the utility of visual dictionaries in cross-disciplinary or multilanguage learning. 2 One study found that graphic illustrations in dictionaries had a "statistically significant effect on reception" over unillustrated entries: an explanation alone was significantly less successful (54%) than when color pictures (80%) or line drawings (77%) were present. 3

SELECTION OF INITIAL TERMS
To test this concept, we selected a first set of example terms related to computational workflows whose definitions may differ depending on the domain in which they are referenced. This allowed us to demonstrate the variety of perspectives and the potential power of the visual dictionary technique. One of the terms selected was workflow itself. Even computationally, this word can have a variety of meanings. All of the workflow definitions involve some sequence of tasks that can be executed to achieve a desired result. Some workflow patterns can branch or have multiple parallel portions (as in computer science and particle physics), whereas others are linear, more like what some would define as a pipeline, another of our included terms. Readers from one field may not be aware that the interpretation of a word differs in another field; exhibiting these differences is one of the main purposes of our dictionary.
The 11 terms for the original dictionary template were chosen to describe computational workflows or their components. Some describe the workflow itself ("workflow," "pipeline," "sequence"), while others include elements of workflows ("read," "code") and terms related to computation ("scale," "site"). Some of these terms are especially interesting because they can be both nouns and verbs, also with different meanings. This initial set represents the richness of this vocabulary and poses a good first test of the value of a visual dictionary. Figure 1 illustrates the dictionary entry for the verb "to map" showing the different definitions and figures fitting for each domain.
For this first set, the definitions were extracted manually from trusted sources, or authored by our group. The images representing the terms were also manually drawn or sourced. This level of curation is not sustainable for a compendium of hundreds of terms.
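A manually curated entry of this kind lends itself to a simple, serializable record: one term with a definition, a source, and a figure per research domain. The field names and sample definitions below are illustrative assumptions, not the project's actual schema:

```python
import json

# Hypothetical schema for one VisDict entry: a term with a definition,
# a cited source, and a figure path per research domain.
entry = {
    "term": "map",
    "part_of_speech": "verb",
    "domains": {
        "computer science": {
            "definition": "Apply a function to every element of a collection.",
            "source": "authored by project group",
            "figure": "figures/map_cs.svg",
        },
        "biology": {
            "definition": "Determine the relative positions of genes on a chromosome.",
            "source": "Oxford Dictionary of Biology",
            "figure": "figures/map_bio.svg",
        },
    },
}

# Serialize for hosting, then read back, as a round-trip check.
serialized = json.dumps(entry, indent=2)
restored = json.loads(serialized)
```

A structure like this also anticipates the automated pipeline described next, which targets the same JSON format.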

AUTOMATING TERM SEARCH AND VISUALIZATION SUGGESTIONS
To broaden our scope of terms, we designed and prototyped an open source toolbox for data capture and analysis of text in research papers. a The toolbox uses Scrapy b and XPath to extract content from digital libraries. Relevant information, such as the frequency of words that appear in different sections of an article, is then analyzed with Yake! e for keyword extraction. Yake! is a lightweight, unsupervised, automatic keyword extraction method that relies on statistical features of the text of a single document to select its most important keywords. It has been shown to outperform state-of-the-art methods on collections of different sizes, languages, and domains. Extracted words are lemmatized (i.e., the different inflected forms of a word are grouped together) and scored based on their frequency and their relevance to other words in the text. Figure 2 shows the 100 most relevant keywords extracted from the set of research papers. We are currently working on identifying relevant keywords and on defining a data structure to represent the extracted knowledge. The extracted data will then be formatted as JavaScript Object Notation (JSON) files and hosted on the project's science gateway.
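The extraction step can be illustrated with a deliberately simplified, standard-library stand-in for Yake!: count lemmatized word frequencies and assign each lemma a score where, as in Yake!, a lower score means a more relevant keyword. The stopword list and the suffix-stripping "lemmatizer" here are crude placeholders for what the real toolbox does:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "are", "for"}

def crude_lemma(word: str) -> str:
    """Very rough stand-in for lemmatization: strip common suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def extract_keywords(text: str, top: int = 5) -> list[tuple[str, float]]:
    """Score lemmatized words by frequency; lower score = more relevant,
    mirroring Yake!'s scoring convention."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(crude_lemma(w) for w in words if w not in STOPWORDS)
    total = sum(counts.values())
    # Invert relative frequency so the most frequent lemma scores lowest.
    scored = ((w, 1 - c / total) for w, c in counts.items())
    return sorted(scored, key=lambda pair: pair[1])[:top]

text = "Workflows define tasks. A workflow executes tasks in sequence."
keywords = extract_keywords(text)
```

Note how "workflows" and "workflow" collapse into one lemma, so the repeated concept ranks highly even though no single surface form dominates the text.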
Next we use existing ontologies from different domains, e.g., ontologies for biology, 6 to inform keywords and important terms and concepts for VisDict. Some ontologies might have overlap between different domains, but most are tailored to their specific use cases. Ontologies can guide the extension to keywords that might be connected to different subdomains in a research domain and provide a wider selection of relevant terms.
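One way to use an ontology for this expansion is to model it as a graph of related terms and collect everything within a few hops of a seed keyword. The toy ontology fragment below is invented for illustration; a real implementation would traverse an actual domain ontology:

```python
from collections import deque

# Toy ontology fragment as an adjacency list of related terms.
# Terms and links are illustrative, not drawn from a real ontology.
ontology = {
    "workflow": ["pipeline", "task"],
    "pipeline": ["stage"],
    "task": ["job", "code"],
    "job": [],
    "stage": [],
    "code": [],
}

def expand_keywords(seed: str, depth: int = 2) -> set[str]:
    """Breadth-first traversal collecting terms within `depth` hops of the seed."""
    found, frontier = {seed}, deque([(seed, 0)])
    while frontier:
        term, d = frontier.popleft()
        if d == depth:
            continue  # do not expand beyond the hop limit
        for neighbor in ontology.get(term, []):
            if neighbor not in found:
                found.add(neighbor)
                frontier.append((neighbor, d + 1))
    return found

related = expand_keywords("workflow")
```

Bounding the traversal depth keeps the suggestions close to the seed concept, so subdomain-specific terms surface without pulling in the whole ontology.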
We experimented with finding visualizations using various image search application programming interfaces (APIs) and have shared a Jupyter notebook f that searches by keyword and downloads images from different APIs. The sources/APIs used in the notebook are GoogleImagesSearch by Google Cloud, the Google Images API by SerpApi, simple_image_download from PyPI, and the selenium chromedriver packages in Python. To maintain objectivity and avoid bias, we search and download images from multiple sources rather than relying on just one. Our script searches for a given number of image uniform resource locators (URLs) for a particular keyword and downloads the images whose URLs are common across all the sources. When we compared the results of the different searches, there was significant overlap between the images, although the different search algorithms return results in a different order. To download roughly 20-25 common images with the script, we have to search for approximately 50 images per source. Each of the searches supports configurations that can refine selections. The search algorithms are configured to search only for freely available images and to skip images with a copyright; this information is normally included in the metadata. We could also select only images over a certain size or resolution. Based on experience from the current runs, the searches use their default configurations.
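The keep-only-common-URLs step amounts to a set intersection over the per-source result lists. The sketch below assumes the per-source search results have already been fetched into lists of URLs (the sample data is hypothetical, not real API output):

```python
def common_image_urls(results_by_source: dict[str, list[str]],
                      limit: int = 25) -> list[str]:
    """Keep only URLs returned by every source, preserving the order
    of the first source and capping the result at `limit` images."""
    sources = list(results_by_source.values())
    shared = set(sources[0]).intersection(*map(set, sources[1:]))
    return [url for url in sources[0] if url in shared][:limit]

# Hypothetical search results from three of the sources named above.
results = {
    "google_cloud": ["u1", "u2", "u3", "u4"],
    "serpapi": ["u2", "u1", "u5"],
    "simple_image_download": ["u1", "u3", "u2"],
}
urls = common_image_urls(results)
```

This also shows why roughly 50 results per source are needed: the intersection shrinks the candidate set considerably, so over-fetching is required to end up with 20-25 common images.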
We then set up focus groups of students who will vote on visualizations for a set of terms. As a start, we empirically set the maximum number of images to 50, and we are exploring how many images it takes to create a well-designed basis set for selections.

VOTING ON TERMS AND VISUALIZATIONS
After the first phase of defining dictionary entries via focus groups, as described in the previous section, we will continue to build content through a mixture of a science gateway with a target audience and a citizen science approach. We plan to develop a science gateway based on HUBzero, 5 which allows adding terms, their definitions, and visualizations. We will facilitate voting for the preferred definitions and illustrations, which will be used to guide ranked poster competitions. The selection of terms and the voting steps will be guided by project members, and in the long term by a group of dictionary curators, to make sure that terms, definitions, and figures fit the community and do not contain any misleading or offensive material.
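At its core, the voting step reduces to tallying ballots per term and domain and ranking the competing candidates. The record layout below is an assumption for illustration; in the gateway, HUBzero would provide the actual storage and user handling:

```python
from collections import Counter

# Hypothetical vote records: (term, domain, candidate_id), one per user vote.
votes = [
    ("map", "biology", "def-A"),
    ("map", "biology", "def-B"),
    ("map", "biology", "def-A"),
    ("map", "physics", "def-C"),
]

def rank_candidates(votes, term, domain):
    """Return candidate IDs for one term/domain pair, most-voted first."""
    tally = Counter(c for t, d, c in votes if t == term and d == domain)
    return [candidate for candidate, _ in tally.most_common()]

ranking = rank_candidates(votes, "map", "biology")
```

The same tally could feed the poster-competition rankings, with curators reviewing the top candidates before they become the displayed entry.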
It will be crucial to have procedures for creating content and voting that are easy to handle and interesting. Experience with projects on Zooniverse, g a citizen science platform, shows that people are keen to contribute to science projects if they feel it is rewarding, both because they are supporting a good cause and because the interaction is enjoyable. Zooniverse has more than 2.5 million users and more than 716 million classifications supporting research. We are developing a concept to integrate aspects of gaming, such that users collect points and see a ranked list, or to combine voting with challenges for users to overcome. This is still a work in progress and requires more research and evaluation.

OUTLOOK
The next step in the project will be forming a focus group to explore the ideal number of figures for voting and to fine-tune our scripts for automatic term selection and figure collection. The dictionary will always need a human-in-the-loop (HITL) and community-in-the-loop (CITL) approach to ensure that entries in the dictionary are beneficial and correct. HITL often refers to human labor training a model; in our case, we employ a CITL (also known as society-in-the-loop) strategy for identifying image relevance because we acknowledge the role of membership in disciplinary communities as central to our project (physics, biology, and computational science). For our visual dictionary strategy to be successful, disciplinary perspective will be crucial for image selection so that images resonate in their respective communities; otherwise, one discipline's definition (e.g., computer science) could overwhelm the others (e.g., physics), which might have fewer practitioners or image labelers/selectors.
Our goal is to use more machine learning and artificial intelligence possibilities for task automation in a follow-up project. We will analyze how valuable tools like ChatGPT would be and how to integrate them into the dictionary. Furthermore, we will look into defining a main ontology connecting the different domains in the workflow area to enrich the dictionary via semantic measures.