Mixing Up Contrastive Learning: Self-Supervised Representation Learning for Time Series

The lack of labeled data is a key challenge for learning useful representations from time series. At the same time, an unsupervised representation framework capable of producing high-quality representations could be of great value: it would enable transfer learning, which is especially beneficial for medical applications, where data are often abundant but labeling is costly and time consuming. We propose an unsupervised contrastive learning framework that is motivated from the perspective of label smoothing. The proposed approach uses a novel contrastive loss that naturally exploits a data augmentation scheme in which new samples are generated by mixing two data samples with a mixing component. The task in the proposed framework is to predict the mixing component, which is utilized as a soft target in the loss function. Experiments demonstrate the framework's superior performance compared to other representation learning approaches on both univariate and multivariate time series and illustrate its benefits for transfer learning on clinical time series.


Introduction
Learning a useful representation of time series without labels is a challenging task. Nevertheless, time series are a typical data type in numerous domains where the lack of labeled data is a common challenge. Particularly in the medical domain there can often be an abundance of data but labeling can be costly and challenging (Ching et al., 2018). Learning useful representations from unlabeled data would be of great benefit in such scenarios. In particular, it could enable transfer learning for clinical time series. Transfer learning is the practice of transferring knowledge from a source domain to a target domain (Pan and Yang, 2010). Such a technique enables researchers to exploit large unlabeled datasets to train more robust and precise systems on small labeled datasets.
Learning useful representations is an active area of research in machine learning (Bengio et al., 2013; Jing and Tian, 2019), with encouraging results in recent works on image representation learning (Chen et al., 2020; He et al., 2020; Grill et al., 2020). Many of these recent works use contrastive learning to learn useful features, exploiting prior information about noise invariances in the image data. However, time series constitute a highly heterogeneous data source, and invariances can differ completely between datasets.
Contrastive learning is a type of self-supervised representation learning where the task is to discriminate between different views of a sample, and where the different views are created through data augmentations that exploit prior information about the structure in the data. Data augmentation is typically performed by injecting noise into the data. Recent advances in contrastive learning have been particularly prominent for image data, as there exists a wide range of augmentation schemes (Shorten and Khoshgoftaar, 2019; Chen et al., 2020) that are suitable for natural images. On the other hand, data augmentation for time series based on the injection of noise can be more challenging because of the heterogeneous nature of time series data and the lack of generally applicable augmentations. This paper introduces a novel self-supervised learning framework that naturally exploits a recent data augmentation scheme called mixup (Zhang et al., 2018). The mixup data augmentation scheme creates an augmented sample through a convex combination of two data points and a mixing component. Such an approach allows for a natural generation of new data points, as augmented samples are generated through a combination of samples from the data distribution. In the proposed framework, the task is to predict the strength of the mixing component based on the two data points and the augmented sample, which is motivated by recent research on label smoothing (Müller et al., 2019). Label smoothing refers to the concept of adding noise to the labels, such that the targets are no longer hard 0 and 1 targets, but soft targets in the range between 0 and 1. This has been shown to increase performance and reduce overconfidence in deep learning-based approaches (Müller et al., 2019). The proposed framework shows encouraging results when evaluated on the UCR and UEA databases and compared to a number of baselines.
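To make the label smoothing concept concrete, the following snippet shows how standard label smoothing (not yet the proposed method) softens a hard one-hot target; the function name and ε value are illustrative:

```python
import numpy as np

def smooth_targets(y_onehot, eps=0.1):
    """Standard label smoothing: mix the one-hot target with the uniform
    distribution over the K classes, so the targets lie strictly between
    0 and 1 but still sum to 1."""
    k = y_onehot.shape[-1]
    return (1.0 - eps) * y_onehot + eps / k

hard = np.array([0.0, 1.0, 0.0])
soft = smooth_targets(hard, eps=0.1)
# soft is approximately [0.033, 0.933, 0.033]
```

The proposed framework replaces these fixed smoothed targets with the mixing component of the augmentation itself, as described in the following sections.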
Furthermore, we show how the proposed method can be used to enable transfer learning for clinical time series. Experiments illustrate that self-supervised pre-training can increase both performance and convergence speed for deep learning-based classification of clinical time series.
Our contributions are:

1. A novel contrastive learning framework that is motivated through the concept of label smoothing and is based on predicting the amount of mixing between data points.

2. An extensive evaluation of the proposed method with comparison to a number of baselines.

3. A demonstration of how the proposed method enables transfer learning for clinical time series, which leads to an increase in performance when classifying electrocardiograms.

Mixup Contrastive Learning
We outline the proposed framework for contrastive representation learning of time series. We propose a new contrastive loss that naturally exploits the information from the data augmentation procedure. Before we present our new contrastive learning framework, we introduce some notation. Our presentation will be based on univariate time series (UTS), but is also extended to multivariate time series (MTS) in the experiments. Let a UTS, x, be defined as a sequence of real numbers ordered in time, x = {x(t) ∈ R|t = 1, 2, · · · , T }, where t denotes each time step and T denotes the length of the UTS. Vectorial data will be denoted in lowercase bold x.
A common approach to contrastive learning is to use a neural network-based encoder to transform the data into a new representation (Chen et al., 2020). The encoder is trained by passing different augmentations of the same sample through the encoder and a projection head, before applying a contrastive loss. The goal of contrastive learning is to embed similar samples in close proximity by exploiting the invariances in the data. After training, the task dependent projection head is discarded and the encoder is kept for down-stream tasks.
The data augmentation scheme used to create different views of the same sample is crucial for learning a useful representation. However, care must be taken when determining the set of transformations to apply. The potential invariances of time series are rarely known in advance, and incautious application can result in a representation where dissimilar samples are embedded in close proximity (Oh et al., 2018). For instance, a transformation such as rotation, which is commonly applied to natural images, can completely change the nature of a time series by altering its trend.
In this work, we opt for a data augmentation scheme based on creating new samples through convex combinations of training examples, referred to as mixup (Zhang et al., 2018). Given two time series x_i and x_j drawn randomly from our training data, an augmented training example can be constructed as:

x̃ = λx_i + (1 − λ)x_j. (1)

Here, λ ∈ [0, 1] is a mixing parameter that determines the contribution of each time series in the new sample, where λ ∼ Beta(α, α) and α ∈ (0, ∞). The distribution of λ for different values of α is illustrated in Figure 1. The choice of this augmentation scheme is motivated by the desire to avoid tuning a noise parameter for each specific dataset; instead, new data samples are generated automatically from the dataset itself. Moreover, the information in the mixing parameter λ can be exploited to produce the novel contrastive loss described in the following section. In a nutshell, the proposed framework transforms the task from predicting hard 0 and 1 targets to predicting soft targets λ and 1 − λ. This is motivated by recent research on label smoothing, which has shown that such regularization can lead to increased performance and less overconfidence in deep learning (Müller et al., 2019).
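The augmentation step can be sketched in a few lines, assuming the time series are NumPy arrays (function and variable names are our own):

```python
import numpy as np

def mixup_augment(x_i, x_j, alpha=0.2, rng=None):
    """Create an augmented time series as the convex combination
    x_tilde = lam * x_i + (1 - lam) * x_j, with the mixing parameter
    lam drawn from a Beta(alpha, alpha) distribution (Equation 1)."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)
    return lam * x_i + (1.0 - lam) * x_j, lam
```

For small α (e.g. the α = 0.2 used in the experiments), the Beta distribution concentrates near 0 and 1, so most augmented samples stay close to one of the two originals.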

A Novel Contrastive Loss for Unsupervised Representation Learning of Time Series
We propose a new contrastive loss function that naturally exploits the information in the mixing parameter λ. At each training iteration, a new λ is drawn randomly from a beta distribution, and two minibatches of size N, {x_1^(1), · · · , x_N^(1)} and {x_1^(2), · · · , x_N^(2)}, are drawn randomly from the training data. Applying Equation 1, the two minibatches are used to create a new minibatch of augmented samples, {x̃_1, · · · , x̃_N}. All three minibatches are passed through the encoder, f(·), which transforms the data into a new representation, {h_1^(1), · · · , h_N^(1)}, {h_1^(2), · · · , h_N^(2)}, and {h̃_1, · · · , h̃_N}, that can be used for down-stream tasks. Next, the new representations are transformed into a task-dependent representation, {z_1^(1), · · · , z_N^(1)}, {z_1^(2), · · · , z_N^(2)}, and {z̃_1, · · · , z̃_N}, by the projection head, g(·), where the contrastive loss is applied. The framework is illustrated in Figure 2. Using this notation, our proposed contrastive loss for a single instance i is applied on the representation produced by the projection head and is defined as:

ℓ_i = −λ log [ exp(D_C(z̃_i, z_i^(1))/τ) / Σ_{k=1}^N ( exp(D_C(z̃_i, z_k^(1))/τ) + exp(D_C(z̃_i, z_k^(2))/τ) ) ]
    − (1 − λ) log [ exp(D_C(z̃_i, z_i^(2))/τ) / Σ_{k=1}^N ( exp(D_C(z̃_i, z_k^(1))/τ) + exp(D_C(z̃_i, z_k^(2))/τ) ) ],

where D_C(·, ·) denotes the cosine similarity and τ denotes a temperature parameter, as in recent works on contrastive learning (Chen et al., 2020). We refer to this loss as the MNT-Xent loss (the mixup normalized temperature-scaled cross-entropy loss). The proposed loss changes the task from identifying the positive pair of samples, as in standard contrastive learning, to predicting the amount of mixing. Moreover, neural networks are known to be overly confident in predictions far from the training data (Hein et al., 2019), but the proposed loss discourages overconfidence, since the model is tasked with predicting the mixing factor instead of making a hard 0 or 1 decision.
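A NumPy sketch of this loss for one minibatch triple, assuming the soft-target cross-entropy form described above; z1 and z2 are the projection-head outputs of the two minibatches and z_mix those of the augmented minibatch (all names are our own):

```python
import numpy as np

def cosine_similarity(a, b):
    """Pairwise cosine similarity between rows of a (N, d) and b (N, d)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def mnt_xent_loss(z1, z2, z_mix, lam, tau=0.5):
    """MNT-Xent sketch: for each augmented sample z_mix[i], the softmax
    (over all 2N candidates from the two minibatches) of its similarity
    to the sources z1[i] and z2[i] is trained toward the soft targets
    lam and 1 - lam via cross-entropy."""
    n = z_mix.shape[0]
    logits = np.concatenate(
        [cosine_similarity(z_mix, z1), cosine_similarity(z_mix, z2)], axis=1
    ) / tau                                                 # shape (N, 2N)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(n)
    per_sample = -(lam * log_prob[idx, idx] + (1 - lam) * log_prob[idx, n + idx])
    return per_sample.mean()
```

As with the standard NT-Xent loss, the denominator normalizes over all candidates in both minibatches; only the targets change from hard to soft.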

Experiments and Results
We evaluate the proposed framework on an extensive number of both UTS and MTS datasets, and compare against well-known baselines. We also demonstrate how the proposed methodology enables transfer learning for clinical time series.

Evaluating Quality of Representation
A common approach for evaluating the usefulness of an unsupervised contrastive learning framework is to train a simple classifier on the learned representation (Zhang et al., 2017; Caron et al., 2018). We use a 1-nearest-neighbor (1NN) classifier to evaluate the quality of different representations, as suggested in previous work. This choice is motivated by the simplicity of the 1NN classifier, which requires no training and minimal hyperparameter tuning. Furthermore, the 1NN classifier is highly dependent on the representation to achieve good performance and is therefore a good indicator of the quality of the learned representation. The proposed methodology, referred to as mixup contrastive learning (MCL), is evaluated on the UCR archive, which consists of 128 UTS datasets, and the UEA archive, which consists of 30 MTS datasets.

Fig. 2. The proposed framework. Two minibatches are sampled randomly from the data and combined using Equation 1. All samples are passed through an encoder f(·), resulting in a representation that can be used for down-stream tasks. Next, this representation is transformed using a projection head g(·) into a representation where the proposed contrastive loss is applied.

We compare with several baselines that span different types of time series representations:

• Handcrafted features (HC): Extract the maximum, minimum, variance, and mean value of each time series. This is an elementary and well-known approach that acts as a simple baseline.
• Raw input features (ED): Using the raw time series as input without any alterations. This will demonstrate if it is beneficial to transform the time series.
• Autoencoder features (AE): A deep learning-based baseline using an autoencoder framework. We use the same network as for the proposed method, with a mirrored version of the encoder as the decoder. The model is trained using a mean squared error reconstruction loss for 250 epochs. Autoencoder-based learning of features from time series with a reconstruction loss is a typical approach in the literature (Zhu et al., 2021; Kieu et al., 2019).

• Contrastive learning features (CL): A deep learning-based baseline based on the widely used SimCLR framework (Chen et al., 2020). We use the same network as for the proposed method, but with different data augmentation and the standard contrastive loss of Chen et al. (2020) instead of the mixup contrastive loss. We consider two data augmentation techniques: Gaussian noise with a variance of 0.25 (CL (σ = 0.25)) and dropout noise with a dropout rate of 0.25 (CL (ρ = 0.25)). These noise parameters represent an average amount of noise suitable for most datasets.

Fig. 3. Accuracy on each dataset from the UCR and UEA databases. Each point represents the accuracy on one dataset, with the baseline along the vertical axis and MCL along the horizontal axis. The diagonal line indicates where the two methods perform equally; points above the line indicate that the baseline performs better, and points below it indicate that MCL performs better. The figure shows that the proposed method outperforms the baselines, as the majority of points lie below the diagonal line.

We use the fully convolutional network (FCN) proposed by Wang et al. (2017) as the encoder f for all contrastive learning approaches reported in this work. The FCN consists of three convolutional layers, each followed by batch normalization (Ioffe and Szegedy, 2015) and a rectified linear unit activation function, and an adaptive average pooling layer. The convolutional layers consist of 128, 256, and 128 filters from the first to the third layer. This choice is motivated by the FCN's strong performance on a number of time series benchmark tasks (Wang et al., 2017) and its simplicity. Specifically, the encoder representation is the output of the average pooling layer.
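A PyTorch sketch of this encoder, together with the two-layer projection head described in the text; the kernel sizes (8, 5, 3) are an assumption taken from the original FCN design of Wang et al. (2017), as they are not restated here:

```python
import torch
import torch.nn as nn

class FCNEncoder(nn.Module):
    """FCN encoder f: three 1-D convolutional blocks with 128, 256, and
    128 filters, each followed by batch normalization and ReLU, then
    adaptive average pooling over time. Kernel sizes are assumed from
    Wang et al. (2017)."""

    def __init__(self, in_channels=1):
        super().__init__()
        self.blocks = nn.Sequential(
            nn.Conv1d(in_channels, 128, kernel_size=8, padding=4),
            nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 256, kernel_size=5, padding=2),
            nn.BatchNorm1d(256), nn.ReLU(),
            nn.Conv1d(256, 128, kernel_size=3, padding=1),
            nn.BatchNorm1d(128), nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool1d(1)

    def forward(self, x):                              # x: (batch, channels, time)
        return self.pool(self.blocks(x)).squeeze(-1)   # (batch, 128)

class ProjectionHead(nn.Module):
    """Projection head g: two linear layers with 128 units separated by
    a ReLU; discarded after training."""

    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, h):
        return self.net(h)
```

After training, only FCNEncoder is kept: its pooled output h is the representation used for down-stream tasks.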
For the projection head g, we use a two-layer neural network with 128 neurons in each layer, separated by a rectified linear unit non-linearity, inspired by Chen et al. (2020). All models are optimized using the ADAM optimizer (Kingma and Ba, 2014) for 1000 epochs, with the temperature parameter τ set to 0.5, as suggested by Chen et al. (2020), and the α parameter set to 0.2, as suggested by Zhang et al. (2018). Statistical significance is determined using a pairwise t-test, where bold numbers indicate significance at a significance level of 0.05. The accuracy and ranking of the learned features (AE, CL, and MCL) are based on the average across 5 training runs at the last epoch. Table 1 summarizes the results; per-dataset results are displayed in the Appendix. Table 1 shows that the simple HC baseline results in poor performance, even compared to no transformation of the input (ED). Furthermore, the learned CL features give comparable results to the ED features, while the AE features give a slight improvement over the ED features. However, the learned features based on the proposed framework give the best performance on both the univariate and the multivariate datasets. Figure 3 shows a per-dataset accuracy comparison of the proposed method with all baselines. Each point in Figure 3 represents the accuracy on one dataset from the UCR and UEA databases, with the baseline along the vertical axis and MCL along the horizontal axis. The diagonal line indicates where the two methods perform equally; points above this line indicate that the baseline performs better, and points below it indicate that MCL performs better. Figure 3 clearly shows that the majority of points lie below the diagonal line, illustrating the superior performance of the proposed method. Lastly, Figure 4 shows a boxplot of the accuracy across all datasets in the UCR and UEA databases. The figure corroborates Table 1 and illustrates that the proposed method outperforms all baselines.
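The 1NN evaluation protocol used throughout this section can be sketched as follows (a plain NumPy implementation; since 1NN requires no training, the representation quality is all that matters):

```python
import numpy as np

def one_nn_accuracy(train_repr, train_labels, test_repr, test_labels):
    """Evaluate a representation with a 1-nearest-neighbor classifier:
    each test sample is assigned the label of its closest training
    sample in Euclidean distance."""
    # Pairwise squared Euclidean distances, shape (n_test, n_train).
    d2 = ((test_repr[:, None, :] - train_repr[None, :, :]) ** 2).sum(-1)
    pred = train_labels[d2.argmin(axis=1)]
    return (pred == test_labels).mean()
```

The representations passed in here are the encoder outputs h, not the projection-head outputs, since the projection head is discarded after training.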

Transfer Learning for Clinical Time Series
We perform transfer learning for classification of electrocardiogram (ECG) datasets with a limited amount of training data, which is a typical scenario for many clinical time series datasets. First, we train an encoder using the proposed contrastive learning framework on a pretext task where a larger amount of data is available. We consider different domains for the pretext task, but with a similar amount of data. The pretext task datasets are Synthetic Control (Synthetic), Swedish Leaf (Dissimilar), and ECG5000 (Similar), all obtained from the UCR archive. Next, we use the weights of the encoder to initialize the weights of a supervised model, in this case the FCN, and train the model using the standard procedure. Additionally, a baseline is included where the weights are randomly initialized using He normal initialization (He et al., 2015).
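A minimal sketch of the weight transfer step, assuming the pretrained contrastive encoder and the encoder inside the supervised FCN share the same architecture (function name is our own):

```python
import copy
import torch.nn as nn

def init_from_pretrained(pretrained_encoder: nn.Module,
                         supervised_encoder: nn.Module) -> nn.Module:
    """Initialize the encoder of the supervised model with the weights of
    the contrastively pretrained encoder; the remaining classifier layers
    keep their random (e.g. He normal) initialization."""
    supervised_encoder.load_state_dict(
        copy.deepcopy(pretrained_encoder.state_dict()))
    return supervised_encoder
```

After this initialization, the supervised model is trained with the standard procedure, exactly as the randomly initialized baseline.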
The results of the transfer learning experiments are presented in Table 2. Using the pretrained weights obtained through the proposed contrastive learning framework leads to improved performance on most datasets. For ECG200, the random initialization gives the highest performance. This might be a result of ECG200 having the most training samples of the four datasets. Furthermore, Figure 5 shows how the accuracy evolves during training, and demonstrates how using pretrained weights can lead to faster convergence and increased performance compared to random initialization. Also note that the models with weights pretrained on the similar and dissimilar domains display a degree of overfitting after 50 epochs. At this point in the training, the loss has begun to saturate. Therefore, we believe that this overfitting might be a result of the model being too strongly fitted to the pretext task, which hurts performance on the down-stream task. Such challenges could be addressed through techniques such as early stopping (Girosi et al., 1995) or heavier regularization, which we consider a direction for future research. Next, the results in Table 2 indicate that the domain of the pretext task is important for the quality of the pretrained weights. Surprisingly, a pretext task from a dissimilar domain yields results comparable to those from a similar domain. It is natural to assume that a pretext task within a similar domain would be beneficial, but it is important to also consider the complexity of the data in the pretext task. In this case, the Swedish Leaf dataset is more complex, as it has more classes and a more erratic nature compared to the periodic ECG5000 dataset. This might result in the encoder learning filters that can process more complicated data and generalize better to different tasks. Moreover, using the encoder trained on synthetic data also increased performance on some datasets, which indicates that useful information can be extracted even from generated data.
This can be helpful for tasks with little data and no suitable pretext task, as one can generate data and learn filters to initialize the model, which might lead to a better representation.

Fig. 5. Accuracy of an FCN with different encoder initializations on the CinCECGTorso test data. Scores are averaged over 5 independent training runs. The figure shows that pretrained weights obtained with the proposed framework lead to faster convergence and increased performance.

Discussion and Conclusion
In this work, we have focused on contrastive learning of time series representations through the injection of noise, motivated by the recent success of contrastive learning on image data. However, a different line of research for contrastive learning of time series representations uses temporal information to discriminate between samples. Most recently, Franceschi et al. (2019) achieved promising results by combining a convolutional neural network encoder with a novel triplet loss, where temporal information was used to perform negative sampling. Banville et al. (2019) proposed a self-supervised learning approach where an informative representation was obtained by predicting whether time windows are sampled from the same temporal context or not. Hyvarinen and Morioka (2016) proposed a time-contrastive learning principle that uses the nonstationary structure of the data to learn a representation where optimal discrimination of time segments is encouraged, and demonstrated how time-contrastive learning can be related to nonlinear independent component analysis. Hyvärinen et al. (2019) also proposed a generalized contrastive learning framework with connections to nonlinear independent component analysis. Exploiting temporal information can be beneficial when such information is discriminative, but can encounter challenges when faced with periodic data, where noise-based approaches might succeed. We envision that our noise-based approach can be combined with temporal-based contrastive learning to reap the benefits of both approaches, and consider such a combination a promising line of future research. Lastly, a possible direction to improve the transfer learning part of our work is to include memory-based merging of features, as proposed by Ding et al. (2020). Such an approach could allow samples from the source and target domains to be merged and potentially increase performance.
This paper introduced a novel self-supervised framework for time series representation learning. The framework exploits a recent augmentation technique called mixup, in which new samples are generated through convex combinations of data points. The proposed framework was evaluated on numerous datasets with encouraging results. Furthermore, we demonstrated how the proposed framework enables transfer learning for clinical time series with good results. We believe that our proposed framework can be a useful approach for time series representation learning.

Acknowledgement
This work was financially supported by the Research Council of Norway (RCN), through its Centre for Research-based Innovation funding scheme (Visual Intelligence, grant no. 309439), and Consortium Partners. The work was further partially funded by RCN FRIPRO grant no. 315029, RCN IKTPLUSS grant no. 303514, and the UiT Thematic Initiative "Data-Driven Health Technology".

Tables A.3, A.4, and A.5 display the accuracy of all methods evaluated in the article on all datasets of the UCR and UEA databases. For the learning-based methods (AE, CL, and MCL), the scores represent the average accuracy across five independent training runs. Results for all 5 training runs and ranks on individual datasets are available at https://github.com/Wickstrom/MixupContrastiveLearning along with code.