Non-negative tensor factorization workflow for time series biomedical data

Summary Non-negative tensor factorization (NTF) enables the extraction of a small number of latent components from high-dimensional biomedical data. However, NTF requires many steps, which is a hurdle to implementation. Here, we provide a protocol for TensorLyCV, an easy to run and reproducible NTF analysis pipeline using Snakemake workflow management system and Docker container. Using vaccine adverse reaction data as an example, we describe steps for data processing, tensor decomposition, optimal rank parameter estimation, and visualization of factor matrices. For complete details on the use and execution of this protocol, please refer to Kei Ikeda et al.1


Highlights
TensorLyCV framework to perform NTF Rank optimization using the masking approach to analyze high-dimensional biomedical data

Snakemake and Docker workflow to analyze vaccine adverse reaction data
This protocol describes a procedure for performing non-negative tensor factorization (NTF) on timeseries biomedical data using a workflow called TensorLyCV.NTF requires multiple steps: software installation, data reshaping, rank estimation, decomposition at the optimal rank, and visualization of the tensor components.While the decomposition itself can be performed by existing Python 2 and R 3 packages such as TensorLy 4 and nnTensor, 5 the other processes require a combination of multiple packages, which is a hurdle for researchers unfamiliar with computational analysis.To make the NTF processes easily executable and reproducible, TensorLyCV uses Snakemake 6 workflow management system and Docker 7 container.TensorLyCV frees users from cumbersome software installation and version control, allowing them to reproducibly perform NTF procedures on their own data by changing only a few arguments.
NTF is a topic modeling approach that can decompose the dynamics of biomolecules, biological signals, and clinical symptoms into a small number of time-evolving components.By describing biological phenomena as an ensemble of components, associations with clinical outcomes can be efficiently examined and testable hypotheses about the underlying mechanisms can be generated.][10][11][12][13][14] We recently applied NTF on time-series adverse reaction data after receiving the BNT162b2 mRNA COVID-19 vaccine. 1 In this study, four time-evolving components were extracted to identify different component-specific associations with background factors and post-vaccination antibody titers.
One of the technical challenges in NTF is rank optimization.In unsupervised learning, such as NTF, the number of components (i.e., ranks) must be predetermined, but the number of potential components is often not obvious, especially for biomedical data.Traditionally, the elbow method 15,16 is often used for rank optimization, but in practice, a sharp elbow may not exist, and the optimal rank cannot be unambiguously determined.Instead, we adopted a masking approach in Ikeda K. et al.'s study. 1 In the masking approach, a certain percentage of the original data elements are masked as noise, and a rank that minimizes the error in the task of imputation of masked elements with NTF, is chosen as the optimal rank..8][19] TensorLyCV offers the option of rank optimization using a masking approach in addition to the elbow method.
Here, we present a comprehensive protocol for the NTF analysis using vaccine adverse reaction data from the recent study 1 as an illustrative example and demonstrate the step-by-step data reshaping, data analysis, and visualization process.

Computer system
To reproduce the analysis in Ikeda K. et al.'s study, 1 here we use our original Snakemake workflow TensorLyCV (https://github.com/kokitsuyuzaki/TensorLyCV) to perform TensorLy, 4 which is a Python2 package to execute tensor decomposition methods including NTF.This is because the most time-consuming part was the rank estimation with different rank parameters and initial values 1 and therefore a computing environment that enables distributed computing such as GridEngine/ Slurm is recommended.TensorLyCV is designed to work whether on a local machine such as a laptop or in a distributed computation environment just by switching the arguments.We have already confirmed that TensorLyCV properly worked on the four computers below.In the following sections, we will show the computational time on the Linux Server.

Timing: 1 h
In this section, we describe how to install required software.
For preprocessing section later, download and install the latest R 3 in CRAN (https://www.r-project.org).

Note:
The tools needed will vary depending on the user's data.Here, we recommend the following R packages.

Note:
The following code line is in R language, to be inputted into R-console window.TensorLyCV is intended to be run by the snakemake command.Snakemake is a Python package.Therefore, after installing Python first, install Snakemake by using some Python package manager such as pip, conda, or mamba.To download and install Snakemake, follow the instructions on the installation page (https://snakemake.readthedocs.io/en/stable/getting_started/installation.html).
All necessary tools to perform TensorLyCV (e.g., TensorLy) have already been pre-installed in a Docker container (https://hub.docker.com/r/koki/tensorlycv_component).Snakemake uses Singularity 20 to convert the Docker container image into a local executable file before using it for calculations.This allows users to perform their calculations in any computing environment (e.g., cluster machine) without being aware of their user privilege.To download and install Singularity, follow the instructions on the installation page (https://docs.sylabs.io/guides/3.0/user-guide/installation.html#).
Finally, download TensorLyCV and change the working directory.The following code line is in Bash script, to be inputted into terminal window.

Note:
The binary files of Python, Singularity, and R to be downloaded and how to download them will vary depending on the machine's operation system (i.e., Windows, macOS, or Linux).
To download the appropriate binary file, follow the instructions on each installation page above.

Timing: 1min
In this section, we describe how to download data for demonstration.
In TensorLyCV, the input data is assumed to be a binary file containing NumPy 22 multi-dimensional array saved by numpy.save.The vaccine adverse reaction datasets can be downloaded below.The following code line is in Bash script, to be inputted into terminal window.

STEP-BY-STEP METHOD DETAILS Data preprocessing
Timing: 30 min In this section, we describe how to preprocess the demonstration data.
First, we show the procedure for constructing tensor data.A tensor can be considered a generalization of a matrix.For example, a third-order tensor is a three-dimensional array that stores values in the depth way in addition to the vertical and horizontal ways.When saving such high-dimensional data (e.g., 3D) as a 2D Excel spreadsheet, several data types can be considered.Here, we introduce the procedures for constructing tensor data from three data types.In subsequent demonstrations, we will use only a portion of the data taken from Ikeda K. et al. 1 The following code lines in this section are in R language, to be inputted into R-console window.
Type 1: Wide short data: The first type is wide short matrix data, with rows for subjects and columns for combinations of days and symptoms (https://figshare.com/ndownloader/files/38365868, Figure 1).To load this Excel file in R, use read_excel 22 function as follows: Then, a set of matrices stratified by days is created with the map 22 function, and they are stacked in the depth direction with the abind 23 function to construct a third-order tensor called vaccine_tensor.
Type 2: Tall narrow data: The second type is tall narrow data also known as "tidy" data 22 (https:// figshare.com/ndownloader/files/38362235Figure 2).To load this Excel file in R, use read_excel function as follows: Then, a tidy dataset stratified by days is created with the map function, reshaped as a matrix form by pivot_wider 22 and as.matrix, and they are stacked in the depth direction with the abind function to construct a third-order tensor called vaccine_tensor.
Note: In Ikeda, K. et al.'s study, 1 we added the process for filtering the subjects with few observations (i.e., subjects whose observed elements are under 30%).This process is the summation of a 3D tensor for a given dimension and transforming it to 1D and can be easily described by using the einsum 24 function, which is inspired by Einstein's summation.
5. Save tensor data as a NumPy binary file.

Non-negative tensor factorization
Timing: two days In this section, we describe how to perform TensorLyCV.

Protocol
TensorLyCV consists of 14 rules, and once a rule is successfully executed, the downstream rule is then executed, and this procedure is repeated to ensure that all calculations are properly finished (Figures 4 and 5).The most time-consuming one of these rules is "tensorly_w_mask", which performs the rank estimation with masking approach (Figure 6).To accelerate this step, user has some options to perform TensorLyCV like below.The following code lines in this section are in Bash script, to be inputted into terminal window.
On a local machine such as a laptop, run TensorLyCV as follows: CRITICAL: The above is a code that performs a series of NTF analyses at once, including rank estimation using masking approach, decomposition at the optimal rank, and visualization of the decomposition results.The argument -j is the number of CPU cores to be used in Snakemake.The arguments input and outdir are the input file and output directory, respectively.The arguments associated with rank estimation are rank_min, rank_max, trials, and n_iter_max, meaning a decomposition from rank 1 to 10 with 50 random trials using different initial values for each rank, up to 1,000 iterations before convergence on each random trial, and ratio is an argument related to the masking approach, meaning the percentage of elements to be masked as noise.Previous research 1 has shown that varying the percentage of noise by 5%, 10%, 20%, and 30% does not affect the rank estimation, but noise above 40% makes the decomposition unstable and should be avoided.The argument mem_gb is the memory usage.-use-singularity is the argument to use Docker container via Singularity.
On a distributed environment with Slurm, run TensorLyCV as follows: CRITICAL: The arguments other than -cluster are the same as for GridEngine, but the command enclosed in "" after -cluster has been changed for Slurm.
Because TensorLyCV itself is also dockerized, if Docker is available, user can perform TensorLyCV by docker command as follows: CRITICAL: In this case, the installation of Snakemake and Singularity can be omitted.

EXPECTED OUTCOMES
If TensorLyCV runs successfully, many files and directories are generated in the directory specified by the outdir argument (e.g., Figure 7).The directory tensorly contains all the results of TensorLy with different ranks and initial values.The directory plot contains only the images of the trials with the smallest reconstruction error in all ranks.The directory benchmarks contains the calculation time and memory usage of all the processes in TensorLyCV and are used to generate the report (Figure 6).The directory logs contains the log files of all the processes in TensorLyCV and are referenced during debugging if the calculation fails for some reason.designed to select the rank with the lowest median in Figure 8B, where rank = 4 is chosen as the optimal rank.
Figure 9A shows the individual scores of the components.Higher scores indicate stronger symptoms for the individual.Here, the component in the second row has the highest score for most individuals.
Figure 9B shows what kind of symptoms are included in the component: the symptoms in the first and fourth rows include a variety of systemic symptoms, while the second row component is composed almost exclusively of pain at the inoculation site and the third row component is composed of whole-body muscle pain.Figure 9C represents the timing of the components; the second and third row components show similar temporal changes, with the strongest on the first day after both the first and second inoculation, lasting about 3 days.On the other hand, the first row component is very strong only one day after the second inoculation, while the fourth row component appears mainly after the second inoculation as well, peaking two days after inoculation and persisting for seven days.

LIMITATIONS
The bottleneck in TensorLyCV workflow is the masking approach (i.e., tensorly_w_mask).Assuming the same processing time for each TensorLy calculation, the order of computational time is OðRTIÞ, where R is the number of rank parameters, T is the number of random trials with different initial values, and I is the number of iterations to converge the calculation.When we perform TensorLyCV against the vaccine adverse-reaction data, 1 R is 10, T is 50, and I is 1,000, each computation time for the masking approach ranges from 1 to 5 min and the total calculation time was about 15 h.If a user's data is larger than ours, the total calculation time will be larger.See the troubleshooting section for tips on speeding up TensorLyCV.

TROUBLESHOOTING Problem 1
Unable to perform TensorLyCV on M1/M2 Mac (related to step-by-step method details, non-negative tensor factorization step).

Potential solution
TensorLyCV is based on Singularity to use Docker containers (related to before you begin, software installation step), but Singularity is still unstable on M1/M2 Mac as of January 26, 2023.To be able to perform TensorLyCV on M1/M2 Mac, user has two options: installation of required tools by conda/ mamba command manually and performing snakemake command without -use-singularity option, or using TensorLyCV by Docker.Detailed procedures are described below. https://github.com/kokitsuyuzaki/TensorLyCV/blob/main/README_AppleSilicon.md.

Problem 2
TensorLyCV crashed with the message "The data tensor contains negative elements..." (related to step-by-step method details, non-negative tensor factorization step).

Potential solution
The tensor data has negative elements.NTD supposes that the input data has no negative elements.Therefore, in such a case, user must convert the tensor data into non-negative tensor data in some ways such as taking absolute values or translation to make the minimum value 0.

Problem 3
TensorLyCV crashed with Python's MemoryError (related to step-by-step method details, non-negative tensor factorization step).

Figure 1 .
Figure 1.Type 1: Wide short data Rows indicate subjects and columns indicate symptoms and days.

Figure 2 .
Figure 2. Type 2: Tall narrow data (also known as "tidy" data) Rows indicate subjects and columns indicate all the variables corresponding to each subject.

Figure 3 .
Figure 3. Type 3: Multiple sheets data After stratifying Type 1 data by days, multiple matrices are saved in each sheet.

Figure 4 .
Figure 4. Rules and the descriptions in TensorLyCV

Figure 6 .
Figure 6.Computational time of each rule in TensorLyCV The x-axis indicates the calculation time of each rule and the y-axis indicates rules in TensorLyCV.This figure is generated by Snakemake's -report option (cf.https://github.com/kokitsuyuzaki/TensorLyCV/blob/main/src/report.sh).

Figure 8 .
Figure 8. Reconstruction errors of NTF Reconstruction errors calculated as the mean squared error (MSE) for all elements without mask (A) and using the masking approach (B).The x-axis indicates rank parameters and the y-axis indicates the reconstruction errors.

Figure 9 .
Figure 9. Tensor components identified with NTF Since the original vaccine adverse reaction data is a tensor consisting of three axes: subject, symptom, and time, the subject module (A), symptom module (B), and time module (C) are obtained.Each component is represented by the outer product of the modules in a horizontal row.