Telescope: an interactive tool for managing large-scale analyses from mobile devices

In today's world of big data, computational analysis has become a key driver of biomedical research. Recent exponential growth in the volume of available omics data has reshaped the landscape of contemporary biology, creating demand for a continuous feedback loop that seamlessly integrates experimental biology techniques and bioinformatics tools. High-performance computational facilities are capable of processing considerable volumes of data, yet often lack an easy-to-use interface to guide the user in supervising and adjusting bioinformatics analysis in real-time. Here we report the development of Telescope, a novel interactive tool that interfaces with high-performance computational clusters to deliver an intuitive user interface for controlling and monitoring bioinformatics analyses in real-time. Telescope was designed to natively operate with a simple and straightforward interface using Web 2.0 technology compatible with most modern devices (e.g., tablets and personal smartphones). Telescope provides a modern and elegant solution to integrate computational analyses into the experimental environment of biomedical research. Additionally, it allows biomedical researchers to leverage the power of large computational facilities in a user-friendly manner. Telescope is freely available at https://github.com/Mangul-Lab-USC/telescope.


Introduction
Exponential growth in the volume of available omics data has reshaped the landscape of contemporary biology, creating demand for a continuous feedback loop that seamlessly integrates experimental biology and bioinformatics 1,2. Life science and biomedical researchers must choose from an unprecedented diversity of software tools and datasets designed for analyzing increasingly large outputs from modern genomics and sequencing technologies, which are supported by high-performance cluster infrastructures 3. Scientific discovery in academia and industry now relies on the seamless integration of bioinformatics tools, omics datasets, and large clusters 4-8.
Many life science and biomedical researchers lacking computational training must now learn how to use computational tools in order to process data from their experiments or to seek broad patterns in omics data. Ideally, any bioinformatics analysis tool should provide an easy-to-use interface through which the researcher can run and monitor each analysis of omics data 9. A friendly user interface for omics tools would also enable researchers with a limited computational background to monitor and adjust their analyses without intervention. The lack of user interface management tools poses an obstacle for novice users performing analyses on high-performance computing clusters 10. Connecting to a cluster often involves a multi-step process and requires generating SSH keys or other forms of authentication, and the necessity of using the UNIX command line for each step may discourage potential users 3. Moreover, most bioinformatics tools require the researcher to spend a large amount of time manually adjusting and supervising actively running analytical tasks (referred to as jobs) via the command line within a computational pipeline. Today's high-performance computational facilities are capable of processing considerable volumes of data, but a new bottleneck has developed: their user interfaces demand fluent knowledge of the command line in order to manipulate analyses in real time.
There is a pressing need to seamlessly integrate bioinformatics analysis into the experimental analysis performed by biomedical researchers, in order to expand research opportunities to individuals lacking a computational background and to reduce the time burden on any researcher who uses a computational pipeline. One example in this direction is the Galaxy Project, which provides a friendly and interactive interface for deploying simple bioinformatics pipelines 11. Despite its many advantages, the Galaxy Project lacks a flexible interface for managing analytical tasks, and many parameters related to allocating computational resources are predefined (i.e., the number of processes is hard-coded) 12.
Bridging the gap between bioinformatics and biological experimentation requires an on-the-fly job management application fronted by a user-friendly interface 13. We developed Telescope to address this challenge. Telescope leverages common and familiar technologies to provide a user-friendly interface for managing jobs from any mobile device without compromising flexibility for advanced users. For example, Telescope allows users to track with their smartphones any bioinformatics tool (e.g., GATK 14) or jobs submitted by specific platforms (e.g., Galaxy Project 14,15), displaying in real-time the partial outputs, warnings, and error messages associated with each job. Telescope includes the following functionality:
• tracking the progress and performance of actively running bioinformatics tools;
• displaying in real-time the current output of an active job;
• interacting with the computational cluster with minimal effort, allowing cancellation and/or rescheduling of jobs with different parameters, or new job queuing;
• using statistics archived from previous jobs to estimate the resources necessary for future jobs.
Telescope is designed to natively operate with a simple and straightforward interface based on Web 2.0 technology that is compatible with most modern devices (e.g., tablets and smartphones).
Moreover, Telescope assumes little from the server side: the existence of a scheduling system (e.g., Sun Grid Engine, SLURM 16) and an SSH connection, both elements featured in virtually all cluster systems dedicated to high-performance computing. As no further assumptions are made, Telescope is tuned to interfere as minimally as possible with cluster performance. We successfully tested Telescope at UCLA's campus-wide computational cluster 17.

Several existing tools can be used to create and monitor jobs using a web-based interface, but they support only specific programming languages or processing pipeline formats. For example, Luigi 23 is a Python module that can be used to manage jobs via the internet. Airflow 24 allows the creation of DAGs (Directed Acyclic Graphs) that specify a pipeline for processing tasks; it also provides a user interface that allows users to visualize the processing status of the jobs specified by the DAGs. Compared to these tools, Telescope is more general because its main objective is to leverage the common existence of scheduling systems (e.g., SGE) on clusters. Thus, Telescope is neither designed for nor restricted to a specific programming language or processing pipeline format. Telescope was initially developed to work with SGE, but it is designed to be configurable for other scheduling systems.
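The scheduler interaction described above can be illustrated with a minimal sketch: invoking `qstat` on the cluster head node over SSH and parsing its plain-text output into job records. The column layout assumed here is SGE's default `qstat` table, and the host name and field names are illustrative; Telescope's actual message exchange may differ.

```python
# Minimal sketch of scheduler polling of the kind Telescope relies on.
# Assumes the default column layout of SGE's `qstat`; not Telescope's
# actual implementation.
import subprocess

def parse_qstat(text):
    """Parse the plain-text output of `qstat` into a list of job records."""
    jobs = []
    for line in text.strip().splitlines()[2:]:  # skip header row and separator
        fields = line.split()
        if len(fields) < 5:
            continue
        jobs.append({
            "job_id": fields[0],
            "priority": float(fields[1]),
            "name": fields[2],
            "user": fields[3],
            "state": fields[4],  # e.g. "r" (running), "qw" (queued/waiting)
        })
    return jobs

def poll_cluster(host):
    """Run `qstat` on the cluster head node over an existing SSH setup."""
    out = subprocess.run(["ssh", host, "qstat"],
                         capture_output=True, text=True, check=True)
    return parse_qstat(out.stdout)
```

Because the parsing step is independent of the transport, the same function works whether the output arrives via `ssh`, a persistent connection manager, or a test fixture.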
Finally, several tools enable an interactive approach to building and executing bioinformatics analysis tasks, but lack a function that allows the user to remotely monitor jobs. Jupyter Notebooks, an open-source web application that supports the creation and sharing of live code and data visualizations, allow users to connect to clusters and run jobs using web browsers 25,26. However, the Jupyter Notebooks system does not allow the user to monitor jobs from a mobile device.

Methods
Telescope is comprised of several main components (Figure 1).

Connection Manager. The Connection Manager maintains an SSH connection to the computational cluster; Telescope then leverages this connection to exchange discrete messages with the cluster server. As the messages are encrypted using the industry-standard SSH protocol, Telescope is able to gather information without compromising the user's privacy. The Connection Manager also stores any SSH keys provided by the user.

Local Database. The Local Database keeps records of all monitored jobs. An entry is created for each job and archives the job id, job name, and user login. The Local Database also stores information regarding the requested resources (e.g., number of cores requested, memory requested), the current status of the job, and relevant metrics (e.g., elapsed time, peak memory). The stored attributes can be configured for different scheduling systems (Table S1 lists the attributes in the table Job assuming a cluster with SGE). These records are retained over time to support job statistics and analytics. As this data is aggregated, the average memory and elapsed time for a given bioinformatics pipeline may be extracted as a function of the input parameters.

Rate Limiter. Telescope limits the rate of incoming user requests, which prevents an overload of the system running Telescope and, more importantly, of the target computational cluster. (Rate limiting is a common technique used to prevent denial-of-service (DoS) attacks 27.) Each user request must pass through this limiter before reaching the Telescope Core.
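The Local Database described above could be sketched as a single relational table. The column names and types below are illustrative, based on the attributes mentioned in the text; the exact schema used by Telescope may differ (see Table S1).

```python
# Illustrative sketch of the Local Database's job table, using SQLite.
# Column names are assumptions based on the attributes described in the
# Methods text, not Telescope's actual schema.
import sqlite3

def init_db(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS job (
            job_id      TEXT PRIMARY KEY,
            job_name    TEXT,
            user_login  TEXT,
            cores_req   INTEGER,  -- number of cores requested
            mem_req_gb  REAL,     -- memory requested (GB)
            status      TEXT,     -- e.g. running, finished, failed
            elapsed_sec REAL,     -- elapsed wall-clock time
            max_mem_gb  REAL      -- peak memory observed
        )""")
    return conn

def record_job(conn, **attrs):
    """Insert or update one monitored job's record."""
    conn.execute(
        "INSERT OR REPLACE INTO job VALUES (:job_id, :job_name, :user_login,"
        " :cores_req, :mem_req_gb, :status, :elapsed_sec, :max_mem_gb)", attrs)
    conn.commit()
```

Retaining these rows over time is what makes the aggregate statistics (average memory and elapsed time per pipeline) a simple SQL query.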

When the current rate of job requests exceeds the maximum threshold, additional user requests are sent to the Cache, which maintains the results of the user's last requests, rather than to the Telescope Core. In addition, Telescope applies an exponential back-off algorithm that increases the time interval before the system will accept another request from the same user.
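The limiting and back-off behavior described above can be sketched as follows. The window size, threshold, and base delay are made-up values for illustration; a full implementation would also enforce the back-off interval and route rejected requests to the Cache.

```python
# Illustrative sketch of per-user rate limiting with exponential back-off,
# as described in the text. Thresholds and delays are arbitrary examples.
import time

class RateLimiter:
    def __init__(self, max_per_window=5, window_sec=1.0, base_backoff=0.5):
        self.max_per_window = max_per_window
        self.window_sec = window_sec
        self.base_backoff = base_backoff
        self.requests = {}  # user -> timestamps of recent accepted requests
        self.strikes = {}   # user -> consecutive over-limit request count

    def allow(self, user, now=None):
        """True if the request may reach the Telescope Core; False if it
        should be answered from the Cache instead."""
        now = time.monotonic() if now is None else now
        recent = [t for t in self.requests.get(user, [])
                  if now - t < self.window_sec]
        if len(recent) >= self.max_per_window:
            self.strikes[user] = self.strikes.get(user, 0) + 1
            return False
        self.strikes[user] = 0
        recent.append(now)
        self.requests[user] = recent
        return True

    def backoff(self, user):
        """Wait time before the same user's next request is accepted;
        doubles with each consecutive over-limit request."""
        strikes = self.strikes.get(user, 0)
        return 0.0 if strikes == 0 else self.base_backoff * 2 ** (strikes - 1)
```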
User Interface. Users interact with Telescope through a mobile-friendly web interface (Figure 2). User authentication when logging into Telescope is performed via the OAuth protocol 28, which conducts verification using the user's existing accounts from popular internet services.

Security. Because Telescope handles private information and SSH keys, we designed the system to leverage industry standards for data handling and mitigation of vulnerabilities. Stored SSH private keys are encrypted using PBKDF cryptography, as recommended by the Public-Key Cryptography Standards (RFC 8018) 29. In cases where a private key is compromised, Telescope users may initiate a key revocation policy. Telescope currently supports SSH key revocation by deleting the compromised SSH fingerprints and updating the revocation list, a procedure that covers most Linux distributions. If a custom security policy is required by a user or cluster administration team, Telescope's modular implementation can be easily tailored by Telescope administrators.
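Password-based key derivation of the kind recommended by RFC 8018 is available in the Python standard library; the sketch below shows derivation of a symmetric key from a user passphrase. The iteration count is an illustrative choice, and production code would use a vetted cryptography library for the actual encryption of the key material.

```python
# Sketch of PBKDF2 key derivation (RFC 8018) as could be used to protect
# stored SSH keys. Parameters are illustrative, not Telescope's actual
# configuration; the encryption step itself is out of scope here.
import hashlib
import os

def derive_key(password, salt, iterations=600_000):
    """Derive a 32-byte symmetric key from a user passphrase."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt,
                               iterations, dklen=32)

def new_salt():
    """A fresh random salt, generated per stored key."""
    return os.urandom(16)
```

Because the derivation is deterministic given the passphrase, salt, and iteration count, the stored key can be re-derived for decryption without the key itself ever being written to disk in the clear.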

Discussion
Telescope interacts with the computational resources directly, at the operating system level, and spares the user from learning in-depth computer science material or devoting substantial time to manually interacting with the computational pipeline. Telescope is domain agnostic and can be used by anyone performing extensive computational analyses (e.g., deep learning, large-scale simulations for climate research).
Data retained in the database could support analytics and generate insights about job behavior, enabling users to predict resource allocation and forecast computation time. We are working on expanding the prediction feature with a simple, automated mechanism based on regular expressions that allows users to attach tags to jobs that can later be used for aggregations and analytics. For instance, data from previous jobs of read alignment tools (Figures S1-S2) (Supplemental Note 1) stored in table Job (Table S1) could have been tagged with the tool name and number of reads. Telescope would then be able to estimate the expected elapsed time and maximum amount of memory required to run these tools as a function of the number of reads. In the future, we plan to systematically collect information about the computational resources of the jobs run through Telescope. We will use the recorded information to develop and provide a template that allows users to choose potential software and processing types to make better choices for resource usage.
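The estimation idea described above can be sketched with an ordinary least-squares fit of elapsed time against the number of reads for one tagged tool. Treating the relationship as linear is an illustrative assumption; the archived records would come from the table Job.

```python
# Sketch of resource estimation from archived job records: fit elapsed
# time as a linear function of the number of reads. The linear model is
# an illustrative assumption, not Telescope's actual predictor.
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

def estimate_elapsed(history, n_reads):
    """history: list of (num_reads, elapsed_hours) pairs for one tool tag."""
    a, b = fit_line([h[0] for h in history], [h[1] for h in history])
    return a * n_reads + b
```

The same fit, applied to peak-memory records, would yield the memory estimate mentioned in the text.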
Telescope has an intuitive user interface and places minimal requirements on the computational cluster, making the tool appealing both to users lacking a computational background, who often face a steep learning curve to operate computational resources, and to experienced users, who often manage a large number of jobs and repetitive tasks. As computational clusters run Unix-based operating systems, Telescope does not completely eliminate interaction with command-line prompts, but it lowers the bar needed to effectively run and monitor bioinformatics analyses at scale.
We observed that Telescope users who are new to Unix operating systems are able to check the status of a job and look for warning and error messages within seconds: as fast as opening their web browsers and connecting to Telescope. By addressing the challenges inherent to learning the command line, Telescope was designed to invite users with any level of computational experience to join the bioinformatics community.

The development of Telescope demonstrates that the current model, in which bioinformatics analyses are outsourced with no control during job execution (for example, the use of pre-cut pipelines wrapped in graphical user interfaces), is inefficient and prevents biomedical investigators from harnessing the true potential of their computational tools in the wet lab environment. While Telescope does not directly improve the runtime performance of bioinformatics tools, the application increases the accessibility of biomedical data analyses to the scientific community and provides all users with a tool for improving work productivity.
Real-time tracking allows biomedical researchers to access partial results, before the analytical task has been completed on a large dataset, and to identify potential problems with the analysis or sequencing experiment.
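One simple way to surface such partial results, sketched below, is to return the tail of a still-growing job log and flag lines that look like warnings or errors. The function names and the keyword heuristic are illustrative, not Telescope's actual implementation.

```python
# Sketch of surfacing partial results from a running job: read the tail
# of its (possibly growing) log file and flag suspicious lines. The
# keyword heuristic is an illustrative assumption.
def tail_lines(path, n=10):
    """Return the last `n` lines of a log file."""
    with open(path, "r") as f:
        return f.readlines()[-n:]

def find_problems(lines):
    """Flag lines that look like warnings or errors in partial output."""
    keywords = ("error", "warn")
    return [l.rstrip() for l in lines
            if any(k in l.lower() for k in keywords)]
```

In a remote setting, the same logic applies after fetching the log tail over the existing SSH connection.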
The ideas and results presented in this study represent a contribution toward mitigating the digital divide in contemporary biology. By offering real-time job management tracking and control over computational clusters even on mobile devices, Telescope can help researchers accomplish a seamless feedback connection between bioinformatics and experimental work with minimal performance interference.

Declarations
Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Availability of data and materials
The software presented in this paper is freely available at https://github.com/Mangul-Lab-USC/telescope.
Telescope is registered in the bio.tools and SciCrunch.org databases as https://bio.tools/Telescope and RRID SCR_017626, respectively.

Tables
Table S1. Local Database schema, explicitly listing attributes, their corresponding data types, and descriptions. These attributes correspond to the information provided by the qstat function of the Sun Grid Engine scheduling system.

Figures
Figure S1. Comparison of the runtime (measured by CPU time in hours) for each tool against the size of each sample (measured by the number of reads).

Figure S2. Comparison of the RAM (measured in gigabytes) used by each tool against the size of each sample (measured by the number of reads).

Supplemental Note 1
A number of choices can affect a job's impact on cluster resources, including the specific bioinformatics analysis tool used and the size of the input omics dataset. These choices influence the runtime (Figure S1) and the RAM (Figure S2) required by the job. The number of reads was calculated by considering two Illumina sequencing paired ends as a single read.