Data stream dataset of SARS-CoV-2 genome

As of May 25, 2020, the novel coronavirus disease (COVID-19) had spread to more than 185 countries/regions, with more than 348,000 deaths and more than 5,550,000 confirmed cases. In bioinformatics, one crucial task is the analysis of the virus nucleotide sequences, and data stream techniques and algorithms are one way to approach it. However, to make this approach feasible, the nucleotide sequence strings must first be transformed into a numerical stream representation. This dataset therefore provides four kinds of data stream representation (DSR) of SARS-CoV-2 virus nucleotide sequences. It contains the DSR of 1557 instances of the SARS-CoV-2 virus, 11540 instances of other viruses from the Virus-Host DB dataset, and three instances of Riboviria viruses from NCBI (Betacoronavirus RaTG13, bat-SL-CoVZC45, and bat-SL-CoVZXC21).


Subject: Biochemistry, Genetics and Molecular Biology (General)
Specific subject area: Bioinformatics
Type of data

Value of the Data
• These data are useful because they provide a numeric representation of the COVID-19 epidemic virus (SARS-CoV-2), which makes it possible to use data stream algorithms.
• Researchers in bioinformatics, computing science, and computing engineering can benefit from these data because the numeric representation lets them apply stream algorithms and techniques such as TEDA (Typicality and Eccentricity Data Analytic), TEDA-Cloud, TEDA-Cluster, and TEDA-Class to genomic information.
• Experiments that apply analytic stream techniques to SARS-CoV-2 virus genomic information can use this dataset.
• These data offer a simple way to evaluate the SARS-CoV-2 virus genome with stream algorithms.
• Unlike conventional bioinformatics techniques, which are based on dynamic programming (such as BLAST and others), this approach allows techniques that are common in other areas to be used to find similarities between genome sequences.

Data Description
This work presents a dataset of data stream representations (DSR) of SARS-CoV-2 virus nucleotide sequences. The dataset contains two kinds of data: the raw data and the processed data. The raw data consist of 1557 instances of the SARS-CoV-2 virus genome collected from the National Center for Biotechnology Information (NCBI) [1], 11540 instances of other viruses from the Virus-Host DB [2,3], and three other specific viruses also collected from NCBI (Betacoronavirus RaTG13, bat-SL-CoVZC45, and bat-SL-CoVZXC21). These three viruses have high similarity with SARS-CoV-2 [4,5]. The processed data consist of four kinds of DSR, called Direct Mapping (DM), DM with Chaos Game Representation (DM-CGR), k-mers mapping (kMersM), and k-mers mapping with CGR (kMersM-CGR). k-mers counting is a frequency metric used in bioinformatics. Other k-mers datasets are presented in [6-8].
In the Chaos Game Representation (CGR) [8], the genome sequence is transformed into a two-dimensional signal, and this signal is then passed through an infinite impulse response (IIR) filter [9]. The result of CGR is a signal that expresses the density of the bases and, at the same time, the transitions between bases, because the IIR filter is a system with memory. CGR can be used as a signature of the genome sequence. With the k-mers representation [10], the genome can be transformed into a 1D or 2D vector that holds the number of occurrences of each base (the frequency of the bases). k-mers can also be used as a signature of the genome sequence. In this manuscript, however, the genome sequence is transformed into linear stream data, a form that can be used with stream algorithms. Another important aspect of this dataset is that CGR is applied not to the whole sequence but to each group of k bases (with k-mers or not). This strategy maintains the statistical characteristics and reduces the size of the stream.
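The view of CGR as a filter with memory can be illustrated with a minimal sketch. It assumes the conventional CGR recursion, in which each new point is the midpoint between the previous point and the corner assigned to the current base. The corner assignment in `CORNERS` and the function name `cgr_trajectory` are hypothetical choices made for this illustration; the source does not state them, and the dataset may use a different convention.

```python
# Hypothetical corner assignment (an assumption, not taken from the dataset).
CORNERS = {
    "A": (-1.0, -1.0),
    "C": (-1.0, 1.0),
    "G": (1.0, 1.0),
    "T": (1.0, -1.0),
}


def cgr_trajectory(sequence, start=(0.0, 0.0)):
    """Return the CGR point reached after each base of the sequence.

    Each step applies p[n] = 0.5 * p[n-1] + 0.5 * corner[n], a first-order
    recursive (IIR) filter driven by the corner coordinates of the bases.
    """
    x, y = start
    points = []
    for base in sequence.upper():
        cx, cy = CORNERS[base]  # raises KeyError for symbols other than A, C, G, T
        x = 0.5 * x + 0.5 * cx
        y = 0.5 * y + 0.5 * cy
        points.append((x, y))
    return points


points = cgr_trajectory("ACGT")
print(points[-1])  # last point depends on all four bases, not only on "T"
```

Because every point is computed from the previous one, the final coordinates encode the order of the bases as well as their identity, which is the memory property described above.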

Experimental design, materials, and methods
The streams are based on the nucleotide sequence, $\mathbf{s}$, expressed as
$$\mathbf{s} = [s_1, \dots, s_n, \dots, s_N]$$
where $N$ is the length of the sequence and $s_n$ is the $n$-th nucleotide of the sequence.
For DM and DM-CGR, the nucleotide sequence, $\mathbf{s}$, is grouped into sub-sequences of $k$ bases. The sub-sequences are stored in the matrix $\mathbf{B}$, whose $i$-th vector, $\mathbf{b}_i$, is the $i$-th group of $k$ nucleotides, that is
$$\mathbf{b}_i = [b_{i,1}, \dots, b_{i,j}, \dots, b_{i,k}]$$
where $b_{i,j}$ is the $j$-th nucleotide of the $i$-th group.
For DM, the sub-sequences stored in $\mathbf{B}$ are transformed into a sequence of integer values, $\mathbf{c}$, which is the DM stream stored in the dataset. The $i$-th value of the stream, $c_i$, is calculated from the group $\mathbf{b}_i$ with the mapping function $f_{\text{map}}(\cdot)$ characterized by Eqs. (7) and (8).
For DM-CGR, the stream is characterized by the vector $\mathbf{a}$, where $\mathbf{a}_i$ is the $i$-th value of the CGR. In CGR (see [11,12]), each element $\mathbf{a}_i$ is a two-dimensional value expressed as
$$\mathbf{a}_i = [a^x_i, a^y_i]$$
where $a^x_i$ and $a^y_i$ are the x-axis and y-axis coordinates in two-dimensional space, respectively. The values of the CGR are calculated using the functions $f^x_{\text{CGR}}(\cdot)$ and $f^y_{\text{CGR}}(\cdot)$. The function $f^x_{\text{CGR}}(\cdot)$ calculates the x-axis value of the CGR (see Eqs. (12)-(14)), and the function $f^y_{\text{CGR}}(\cdot)$ calculates the y-axis value. The initial conditions of their intermediate values, at $j = 0$, are $p^x_{i,0} = \alpha_x$ and $p^y_{i,0} = \alpha_y$ [11,12]. The dataset was generated with $\alpha_x = 0$ and $\alpha_y = 0$.
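The DM and DM-CGR procedures described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the groups are taken as consecutive, non-overlapping blocks of k bases with an incomplete tail dropped; `BASE_DIGIT` and the base-4 reading in `dm_value` stand in for the mapping function, whose actual definition is not reproduced here; and `CORNERS` with the midpoint rule in `cgr_value` follows the conventional CGR. All names are hypothetical.

```python
# Assumed base-to-digit mapping (hypothetical; the dataset's mapping may differ).
BASE_DIGIT = {"A": 0, "C": 1, "G": 2, "T": 3}
# Assumed CGR corners (hypothetical).
CORNERS = {"A": (-1.0, -1.0), "C": (-1.0, 1.0), "G": (1.0, 1.0), "T": (1.0, -1.0)}


def group_bases(sequence, k):
    """Split the sequence into consecutive, non-overlapping groups of k bases.

    Bases left over at the end (fewer than k) are dropped, which is an
    assumption of this sketch.
    """
    usable = len(sequence) - len(sequence) % k
    return [sequence[i:i + k] for i in range(0, usable, k)]


def dm_value(group):
    """Map one group to an integer by reading it as a base-4 number."""
    value = 0
    for base in group:
        value = 4 * value + BASE_DIGIT[base]
    return value


def cgr_value(group, alpha=(0.0, 0.0)):
    """Return the CGR point after the last base of the group.

    The recursion restarts from alpha for every group, matching the
    initial condition at j = 0 with alpha_x = alpha_y = 0.
    """
    x, y = alpha
    for base in group:
        cx, cy = CORNERS[base]
        x, y = 0.5 * (x + cx), 0.5 * (y + cy)
    return (x, y)


sequence = "ACGTTGCAAC"
k = 4
groups = group_bases(sequence, k)                # two groups, tail "AC" dropped
dm_stream = [dm_value(g) for g in groups]        # one integer per group
dm_cgr_stream = [cgr_value(g) for g in groups]   # one 2D point per group
print(groups, dm_stream, dm_cgr_stream)
```

With this grouping, each stream holds one value per k bases, so a sequence of N bases yields roughly N/k stream samples.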
For kMersM and kMersM-CGR, the nucleotide sequence, $\mathbf{s}$, is grouped into k-mers sub-sequences [13,14], which are stored in the matrix $\mathbf{H}$. The kMersM stream is a sequence of integer values calculated from the k-mers stored in $\mathbf{H}$ with the mapping function $f_{\text{map}}(\cdot)$ characterized by Eqs. (7) and (8).
The kMersM-CGR stream is stored in the vector $\mathbf{z}$, where $\mathbf{z}_i$ is the $i$-th value of the CGR. Each $i$-th element $\mathbf{z}_i$ is a two-dimensional value expressed as
$$\mathbf{z}_i = [z^x_i, z^y_i]$$
where $z^x_i$ and $z^y_i$ are the x-axis and y-axis coordinates in two-dimensional space, respectively. The values of the CGR are calculated using the functions $f^x_{\text{CGR}}(\cdot)$ (see Eqs. (12)-(14)) and $f^y_{\text{CGR}}(\cdot)$ (see Equation
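A minimal sketch of the kMersM idea follows. It assumes that the k-mers are overlapping windows that slide one base at a time, which is the usual meaning of the term but is not confirmed by the text above, and it reuses a base-4 reading as a stand-in for the mapping function. The names `kmers`, `kmer_value`, and `BASE_DIGIT` are hypothetical.

```python
# Assumed base-to-digit mapping (hypothetical; the dataset's mapping may differ).
BASE_DIGIT = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmers(sequence, k):
    """Return all overlapping k-mers, sliding the window one base at a time."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmer_value(kmer):
    """Map one k-mer to an integer by reading it as a base-4 number."""
    value = 0
    for base in kmer:
        value = 4 * value + BASE_DIGIT[base]
    return value


sequence = "ACGTAC"
k = 3
H = kmers(sequence, k)                          # rows of the k-mers matrix
kmersm_stream = [kmer_value(h) for h in H]      # one integer per k-mer
print(H, kmersm_stream)
```

Under this sliding-window assumption, consecutive stream values share k - 1 bases, so the stream has N - k + 1 samples for a sequence of N bases.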

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.