TimeMatch: Unsupervised Cross-Region Adaptation by Temporal Shift Estimation

The recent developments of deep learning models that capture complex temporal patterns of crop phenology have greatly advanced crop classification from Satellite Image Time Series (SITS). However, when applied to target regions spatially different from the training region, these models perform poorly without any target labels due to the temporal shift of crop phenology between regions. Although various unsupervised domain adaptation techniques have been proposed in recent years, no method explicitly learns the temporal shift of SITS and thus provides only limited benefits for crop classification. To address this, we propose TimeMatch, which explicitly accounts for the temporal shift for improved SITS-based domain adaptation. In TimeMatch, we first estimate the temporal shift from the target to the source region using the predictions of a source-trained model. Then, we re-train the model for the target region by an iterative algorithm where the estimated shift is used to generate accurate target pseudo-labels. Additionally, we introduce an open-access dataset for cross-region adaptation from SITS in four different regions in Europe. On our dataset, we demonstrate that TimeMatch outperforms all competing methods by 11% in average F1-score across five different adaptation scenarios, setting a new state-of-the-art in cross-region adaptation.


Introduction
Today, the availability of satellite image time series (SITS) data is rapidly increasing. For instance, the twin Sentinel-2 satellites provide imagery of the entire Earth every two to five days [1]. A frequent acquisition of images is crucial for vegetation-related remote sensing applications such as crop type classification [2,3]. Multi-temporal data enables capturing the phenological development of crops (i.e., the progression of crop growth), a key dimension to discriminate each crop type [4]. Recently, the increasing availability of SITS along with advances in deep learning has led to crop classifiers with temporal neural architectures using convolutions [5,6], recurrent units [7][8][9][10], self-attention [11,12], or combinations thereof [13,14].
These crop classification models achieve impressive performance by capturing the temporal structure of the problem but rely on the existence of a large amount of labeled training data. While unlabeled SITS are plenty, access to labels in the region of interest (the target domain) is often either costly or otherwise unavailable. A possible solution is to train a model in a region with labels available (the source domain) and apply it to the unlabeled target region. However, when the two regions are geographically different, the dissimilarity between the source and target data distributions can cause a source-trained model to perform poorly when applied to the target region [15][16][17].
Addressing the distributional shift problem to adapt a source-trained model to an unlabeled target domain is in machine learning known as unsupervised domain adaptation (UDA) [15,18,19]. Here, we consider the cross-region UDA problem for SITS [20], where we are provided with labeled data from a source region and unlabeled data from a target region. In this setting, the source and target data distributions differ due to changes in local conditions, such as the soil, climate, and farmer practices, which cause spectral and temporal shifts [15].
Addressing the temporal shift is of particular importance when adapting crop classifiers to new regions, as we illustrate in Figure 1. While crops in different regions have similar growth patterns, the timing of key growth stages, such as the peak of greenness, is shifted along the temporal axis. As crops are classified primarily by their unique growth patterns, the temporal shift may cause misclassifications when a source-trained model is applied to a target region. For example, the shift in time could cause the phenology of spring barley to appear similar to that of winter barley in the target. Thus, accounting for the temporal shift is a key factor in cross-region adaptation.
A possible approach could be to train models that are invariant to temporal shifts, such as by applying random temporal shifts to the training data. However, as the temporal shift could be the main feature that separates two crop types, shift-invariant models have reduced classification ability compared to shift-variant models.
Another approach is to apply existing UDA methods. Typically, these methods address domain adaptation by constraining the classifier to operate on domain-invariant features [21]. This is achieved by training the classifier to perform well on the source domain while minimizing a divergence measure between features extracted from the source and the target domains [20,22,23]. While these methods have been successfully applied in various applications [19,24], they do not directly account for the temporal shift in SITS and have thus been reported to provide limited benefits in cross-region UDA [25]. More recently, self-training methods have emerged as a promising alternative to domain-invariant methods [26][27][28][29][30]. Self-training iteratively generates pseudo-labels [31] for the target domain and then uses them to retrain the model with target data. To account for noisy pseudo-labels caused by the domain shift, these methods typically incorporate a refinement step where the noise is reduced in various ways, such as with generative models [28] or learned confusion matrices [29]. Still, no method considers the particular case of SITS where the pseudo-label noise is caused by a temporal shift.
In this paper, we propose TimeMatch, a self-training method for cross-region UDA where we directly account for the temporal shift. TimeMatch consists of two components: (i) the temporal shift estimation and (ii) the TimeMatch learning algorithm.
Estimating the temporal shift directly from the target data is difficult, as the lack of labels hinders e.g. the comparison of class-wise vegetation indices as in Figure 1. To address this, we propose an unsupervised method where we estimate the temporal shift from target to source with a source-trained model. First, we obtain the softmax predictions of the model when input target data with different temporal shifts. Then, we choose the temporal shift with high prediction confidences across a diverse set of classes. We show that this approach corresponds well to the actual climatic differences between the two regions. Moreover, as correctly classified examples tend to have higher prediction confidence [32], the estimated shift enables us to generate more accurate pseudo-labels in the target domain for self-training.
In TimeMatch learning, we therefore use self-training to adapt a model to the target domain. We propose an iterative algorithm where we alternate between temporal shift estimation and re-training the model for the target domain by learning from both labeled source data and pseudo-labeled target data. By doing so, the model learns discriminative target features for accurate crop classification in the target region.
Lastly, we present the TimeMatch dataset, a challenging new open-access dataset for training and evaluating cross-region models on SITS with over 300,000 annotated parcels from four different regions in Europe. Evaluated on this dataset, our approach outperforms all competing methods by 11% F1-score on average across five different cross-region UDA experiments.
In summary, our contributions are as follows:
• We propose a method for estimating the temporal shift between a labeled source region and an unlabeled target region to reduce their temporal discrepancy.
• We propose TimeMatch, a novel UDA method designed for the cross-region problem of SITS, where crop classification models are adapted to an unlabeled target region by self-training on temporally shifted data for improved performance compared to existing methods. Our source code is available at https://github.com/jnyborg/timematch.
• We release the TimeMatch dataset [33], a new dataset for training and evaluating cross-region UDA models on SITS from four different European regions.
This paper is organized as follows. Section 2 describes the existing literature related to our work. Section 3 describes the proposed method for temporal shift estimation and the TimeMatch learning algorithm. Section 4 presents our dataset and the experimental setup, and Section 5 the experimental results. Lastly, Section 6 concludes this work.

Related Work
TimeMatch is related to the existing work in unsupervised domain adaptation of learning domain-invariant features, time-series adaptation, cross-region adaptation, and self-training.

Domain-Invariant Methods
The predominant approach in UDA is to train the classifier to rely only on domain-invariant features [21,24]. To this end, several works consider adversarial training [22,34,35]. In domain adversarial neural networks (DANN) [22,34], the feature extractor is adversarially trained to produce domain-invariant features that are indistinguishable by a domain discriminator. Conditional domain adversarial networks (CDAN) [35] improves upon DANN by conditioning the domain discriminator on classifier predictions in addition to features to enable the alignment of multimodal data distributions.
Another approach is to align the feature distributions directly by minimizing a divergence measure. Choices for the divergence measure include maximum mean discrepancy (MMD) [23], correlation alignment [36], and optimal transport [37,38]. Recently, JUMBOT [38] achieved state-of-the-art UDA results by using mini-batch unbalanced optimal transport to minimize the domain discrepancy of joint deep feature and label distributions.
While domain-invariant methods achieve strong results on computer vision datasets, they do not explicitly handle the temporal dimensions of SITS data and time series in general.

Time-Series Unsupervised Domain Adaptation
Few methods tackle the challenge of time series UDA. Current methods for time series are typically also based on learning domain-invariant features [39][40][41]. Recurrent domain adversarial neural network (R-DANN) and variational recurrent adversarial deep domain adaptation (VRADA) explore long short-term memory and variational recurrent neural networks as feature extractors, respectively, and learn domain-invariant features using the DANN method [39]. Likewise, the convolutional deep domain adaptation model for time series data (CoDATS) learns domain-invariant features with a temporal convolutional network with the DANN method [40]. However, while these methods are effective at learning domain-invariant features for time series, they are not designed to learn the temporal shift present in SITS.

Cross-Region Crop Classification
Lucas et al. [25] report that existing UDA methods, including existing domain-invariant methods [42,43], perform poorly when applied to cross-region UDA of SITS due to the temporal shift problem and the change in class distribution between the two regions. Recently, Wang et al. [20] proposed the phenology alignment network (PAN) as the first method for cross-region UDA of SITS. PAN learns domain-invariant features with MMD [23] and a feature extractor consisting of gated recurrent units and self-attention. Still, as PAN learns domain-invariant features, the temporal shift problem is not directly addressed.

Self-Training Methods
Semi-supervised learning (SSL) is a similar task to domain adaptation, but where the labeled and unlabeled data are sampled from the same data distribution [44]. Many SSL methods are based on pseudo-labeling [31] (also called self-training [45]), where the model's own high-confidence predictions are used as labels for the unlabeled samples. In Mean Teacher [46], the model assumes a dual role as teacher and student. The student is updated by gradient descent with pseudo-labels generated by the teacher, whereas the teacher is updated by an exponential moving average (EMA) of student parameters to reduce pseudo-label noise. FixMatch [45] generates pseudo-labels for weakly-augmented inputs, and uses confident pseudo-labels to self-train the model on strongly-augmented inputs, regularizing the model to output consistent pseudo-labels for random augmentations of the input.
Recently, self-training has emerged for UDA as an alternative to domain-invariant methods [26][27][28][29][30]. By learning from both labeled source data and pseudo-labeled target data, self-training methods implicitly encourage feature alignment for each class without restricting the model to operate on domain-invariant features. However, since the domain shift often results in increased pseudo-label noise compared to SSL, existing methods introduce various refinement methods to reduce the noise, such as co-training [47], tri-training [26], conditional generative models [28], or confidence regularization [30]. Recently, Adversarial-Learned Loss for Domain Adaptation (ALDA) [29] proposes to refine the pseudo-labels with a noise-correcting domain discriminator.
Similar to this line of work, our approach is based on self-training. By directly accounting for the temporal shift, we can temporally align the target SITS with that of the source, which enables the generation of more accurate pseudo-labels compared to existing self-training methods that do not.

TimeMatch
In this section, we describe our proposed method TimeMatch for cross-region UDA. We begin by formally defining the problem setting, followed by an overview of how TimeMatch addresses it. We then give the details of the two TimeMatch components: temporal shift estimation and TimeMatch learning.

Problem Setting
In crop classification, the input is a sequence of satellite images (x_i, τ_i) of length T_i to be classified into one of the K crop classes. In object-based classification, which we focus on in this work, each x_i ∈ R^{T_i×N_i×C} contains a sequence of N_i pixels of C spectral bands within a homogeneous, agricultural plot of land, which we refer to as a parcel. Each sequence of acquisition dates τ_i is typically represented by the day-of-year [8,12], which makes it possible to account for the irregular temporal sampling of most satellites. The goal of the crop classification task is to learn a model that predicts class probabilities p(y|(x_i, τ_i)) ∈ R^K, typically learned with supervision from labels y ∈ {1, . . . , K}.
In this work, we consider the problem of cross-region UDA. We are given a source domain D_s = {(x_i^s, τ_i^s, y_i^s)}_{i=1}^{n_s} of n_s labeled SITS and a target domain D_t = {(x_i^t, τ_i^t)}_{i=1}^{n_t} of n_t unlabeled SITS. We assume both the source and target domains consist of SITS acquired over a single year (January 1 to December 31) and in geographically different locations. The two domains can be associated with different data distributions, as changes in local conditions, e.g. soil, weather, climate, or farmer practices, cause domain discrepancies [15]. In this work, we focus on the domain discrepancies caused by temporal shifts (Section 1). Although not explicitly addressed in this work, other sources of discrepancy may also occur. For example, the local topography or soil conditions could impact not only the temporal development of crop growth, but also the spectral values, which could change the spectral signature of the same crop type in different regions.

Figure 2: Overview of TimeMatch. Both the student and teacher are pre-trained on the source domain. Temporal shift estimation: we input shifted target data to the teacher model and obtain its predictions for each shift. We then score each shift by the confidence and diversity of the teacher predictions, and the shift with the best score is output as the temporal shift estimate δ t→s, with δ s→t = −δ t→s. TimeMatch learning: the teacher generates pseudo-labels for unlabeled target data shifted by δ t→s. Then, the student is updated for (non-shifted) target data using the pseudo-labels, and for source data shifted by δ s→t using the available source labels. As a result, the student is adapted to the target domain with both generated target labels and actual source labels. After the student parameters have been updated with gradient descent, the teacher parameters are updated as an exponential moving average (EMA) of the student parameters. As both models adapt to the temporal shift of the target domain, the best shift for pseudo-labeling with the teacher changes and must be re-estimated. The EMA ensures the teacher adapts slowly, which enables δ t→s to be re-estimated only once per epoch for improved training efficiency and pseudo-label accuracy.
Because of these data discrepancies, models which are trained with the labeled source domain can fail when applied to the unlabeled target domain [16], thus hindering the large-scale application of crop classifiers. To address this, our goal is to adapt a model trained on D_s to make accurate predictions on D_t. We do so by explicitly estimating the temporal shift between the two regions to generate accurate pseudo-labels for D_t. Then, we re-train the model with target data using the pseudo-labels, thereby adapting the model to the spectral and temporal properties of the target region. We note that the classes in the source may not be exactly the same as the classes in the target. This complicates UDA, which typically assumes a closed-set setting [48], where the sets of classes in the source and target domains are equal. For simplicity, we focus on a closed-set setting by adapting a classifier trained for the main K − 1 crop types in the source region, plus an "unknown" class containing all remaining source data. This ensures that all target examples can be classified as either one of the K − 1 crop classes or "unknown".
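As a rough illustration of this closed-set setup, the raw source labels can be remapped so that the K − 1 most frequent crop types keep a dedicated class and everything else collapses into "unknown". The helper below is a hypothetical sketch, not the released pre-processing code:

```python
from collections import Counter

def build_label_map(source_labels, k):
    """Map raw crop labels to indices {0, ..., k-1}, reserving
    index k-1 for the catch-all "unknown" class."""
    # The k-1 most frequent source crop types keep their own index.
    top = [label for label, _ in Counter(source_labels).most_common(k - 1)]
    return {label: (top.index(label) if label in top else k - 1)
            for label in set(source_labels)}

# Toy example: with k=3, wheat and corn are kept, the rest become "unknown".
raw_labels = ["corn", "corn", "wheat", "wheat", "wheat", "barley", "flax"]
label_map = build_label_map(raw_labels, k=3)
```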

Approach Overview
Here we give an overview of how TimeMatch addresses the cross-region UDA problem before describing the full details. A visual presentation of TimeMatch is given in Figure 2. We aim to estimate the temporal shift between the source and target regions to reduce their domain discrepancy (see Figure 1). We represent the temporal shift by a scalar δ t→s ∈ Z (as a number of days), here in the direction from target to source. Note that the shift in the opposite direction is obtained by δ s→t = −δ t→s, so we only have to estimate one shift. To shift the target domain by δ t→s, we write τ t + δ t→s, meaning δ t→s is added element-wise to each target day-of-year. With our proposed method for temporal shift estimation (Section 3.3), we obtain estimates for δ t→s and δ s→t.
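Applying a shift is just an element-wise addition to the day-of-year positions. A minimal numpy sketch (the dates and the 14-day shift are made-up values):

```python
import numpy as np

def shift_dates(tau, delta):
    """Shift every acquisition day-of-year in `tau` by `delta` days."""
    return tau + delta

tau_t = np.array([12, 37, 61, 140, 233])  # example target days-of-year
delta_t_to_s = 14                          # hypothetical target-to-source shift
tau_shifted = shift_dates(tau_t, delta_t_to_s)
# The opposite direction is simply the negated shift: delta_s_to_t = -delta_t_to_s.
```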
In TimeMatch learning (Section 3.4), we use δ s→t to construct a target-shifted source domain D s→t = {(x_i^s, τ_i^s + δ s→t, y_i^s)}_{i=1}^{n_s}, which has reduced domain discrepancy to the unlabeled target domain D t due to the temporal alignment. We therefore use self-training to learn from the labeled D s→t and unlabeled D t. To do so, TimeMatch learning unifies temporal shift estimation with the loss function of FixMatch [45] and the exponential moving average (EMA) training of Mean Teacher [46], as we explain next.
We first obtain source-trained parameters by training a crop classifier with D s. We then duplicate the trained classifier into two models: the teacher and the student. Our TimeMatch learning algorithm aims to adapt both the teacher and the student to the new target region with self-training. The teacher generates pseudo-labels for the target domain to train the student, and the knowledge learned by the student is then updated back to the teacher, thus improving the pseudo-labels used to train the student itself. We generate pseudo-labels by using δ t→s to create an adapted target domain D t→s = {(x_i^t, τ_i^t + δ t→s)}_{i=1}^{n_t}. As D t→s is temporally aligned with D s, the source-initialized teacher generates more accurate pseudo-labels for D t→s than for D t. The student is then trained with labeled D s→t and pseudo-labeled D t via the FixMatch loss [45], thereby leveraging both the available source labels and the target pseudo-labels to adapt the student to the target domain.
After updating the student, the teacher is updated via an EMA of the student parameters. As the two models adjust to the temporal shift of the target domain, the best shift δ t→s for pseudo-labeling with the teacher gradually moves to zero during TimeMatch learning. To adjust to the changing shift and ensure the pseudo-labels are consistently accurate, it is necessary to re-estimate the temporal shift of the teacher as it learns. However, repeating temporal shift estimation is computationally expensive, and drastically increases training time if done each training iteration. Therefore, in Section 3.4.3, we discuss how EMA training alleviates this issue by enabling the re-estimation to be done only once per epoch.
Next, we first describe our method for estimating the temporal shift before describing the loss function and learning algorithm of TimeMatch learning.

Temporal Shift Estimation
Estimating the temporal shift directly from the data is difficult, as labels are not available in the target domain. Without labels, we cannot separate the target data into each crop type, which prevents the computation of e.g. vegetation indices to compare the source and target phenology of each crop type directly.
Instead, we propose to estimate the temporal shift by calculating statistics on the predictions of a source-trained model when input temporally shifted target data. By doing so, we estimate the shift that aligns the target data with the source crop phenology learned by a model, leveraging the classification ability of the trained model to estimate the shift from unlabeled data. Another benefit of this approach is that it enables re-estimation of the best temporal shift for pseudo-labeling as the learned phenology of the model changes from source to target in TimeMatch learning.
One possible value to measure is the confidence of the model predictions. Intuitively, when a source-trained model is applied to correctly shifted target data, it should output more confident predictions than for incorrectly shifted target data. As correctly classified examples tend to have more confident predictions than wrongly classified or out-of-distribution examples [32], we argue that a confident temporal shift indicates a better alignment of the target domain with the source, which results in accurate pseudo-labels and reduced domain discrepancy.
The confidence of a model for a particular shift δ t→s can be measured by the expected entropy:

H̄(δ t→s, θ) = E_{(x^t, τ^t) ∼ D t}[ H( p_θ(y | (x^t, τ^t + δ t→s)) ) ]    (1)

where H denotes the entropy, here computed over the predictions of the model θ when input temporally shifted target data sampled from D t.
To estimate a temporal shift with entropy, Equation 1 should be computed for each possible shift δ t→s ∈ {−∆, −∆ + 1, . . . , ∆}, and the estimated shift is then the one with lowest entropy. Here, ∆ defines the maximum possible shift (in days) to estimate between the source and target regions.
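This grid search can be sketched in a few lines of numpy. Here `predict(delta)` is a stand-in for running the source-trained model on target data shifted by `delta`; the toy model below is constructed so that confidence peaks at a shift of +7 days:

```python
import numpy as np

def expected_entropy(probs):
    """Mean Shannon entropy of a batch of softmax predictions, shape (N, K)."""
    eps = 1e-12
    return float((-np.sum(probs * np.log(probs + eps), axis=1)).mean())

def estimate_shift_by_entropy(predict, max_shift):
    """Pick the shift in {-max_shift, ..., max_shift} whose predictions
    have the lowest expected entropy."""
    shifts = range(-max_shift, max_shift + 1)
    return min(shifts, key=lambda d: expected_entropy(predict(d)))

# Toy stand-in model: predictions grow more confident near delta = +7.
def toy_predict(delta):
    conf = np.exp(-abs(delta - 7))
    probs = np.full((16, 4), (1.0 - conf) / 4)
    probs[:, 0] += conf  # probability mass concentrates when aligned
    return probs

best_shift = estimate_shift_by_entropy(toy_predict, max_shift=15)
```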
However, due to the class imbalance of SITS, relying on expected entropy alone could result in choosing a shift where the model outputs confident predictions for only the most frequent classes while ignoring the less frequent classes. This would hinder the adaptation of the model for the less frequent target classes. To address this problem, the diversity of the predicted marginal distribution should also be considered in the estimation. The marginal is given by:

p_θ(y) = E_{(x^t, τ^t) ∼ D t}[ p_θ(y | (x^t, τ^t + δ t→s)) ]    (2)

that is, the expected predictions of the model (parameterized by θ) when input shifted target data.
Ideally, the marginal distribution should match the class distribution of the target domain, as this indicates a shift where the model predicts a diverse set of classes according to their actual frequency. But since target labels are unavailable, so is the target class distribution. Instead, inspired by metrics for evaluating image generative models, we consider two options to address this: the Inception score [49] (IS), and the activation maximization score [50] (AM). Both metrics consider the entropy and marginal of a pre-trained model, but IS scores the marginal distribution by its similarity to a uniform distribution, whereas AM uses the actual class distribution.
As these metrics were originally proposed to evaluate the quality of generated images, we describe next how we repurpose them for temporal shift estimation. Finally, we describe an algorithm where IS is used to initialize the temporal shift for estimating the target class distribution with pseudo-labels and enable a better temporal shift estimate with AM.

Inception Score
IS is computed for a temporal shift δ by:

IS(δ, θ) = exp( E_{(x^t, τ^t) ∼ D t}[ D_KL( p_θ(y | (x^t, τ^t + δ)) ∥ p_θ(y) ) ] )    (4)

where D_KL(· ∥ ·) is the KL-divergence between two distributions, here the conditional distribution p_θ(y | (x^t, τ^t + δ)) and the marginal distribution p_θ(y) predicted with model parameters θ. Higher values of IS indicate a better δ: when the conditional and marginal distributions are different, this corresponds to a temporal shift where the former has low entropy (i.e., the model is confident) and the latter has high entropy (i.e., the model predicts a diverse set of classes). Hence, the temporal shift δ t→s is estimated by:

δ t→s_IS(θ s) = argmax_{δ ∈ {−∆, ..., ∆}} IS(δ, θ s)    (5)

where the estimated temporal shift maximizes IS for a source-trained model parameterized by θ s when applied to target data.
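Over a batch of softmax outputs, IS reduces to an exponentiated mean KL divergence against the batch-average prediction. A small numpy sketch with toy predictions (not model outputs):

```python
import numpy as np

def inception_score(probs):
    """IS for a batch of softmax predictions probs (N, K): exp of the
    mean KL divergence between each conditional p(y|x) and the
    marginal p(y), estimated as the batch-average prediction."""
    eps = 1e-12
    marginal = probs.mean(axis=0)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Confident AND diverse predictions score high (close to K) ...
is_diverse = inception_score(np.eye(3))
# ... while confident predictions collapsed onto one class score ~1.
is_collapsed = inception_score(np.tile([1.0, 0.0, 0.0], (3, 1)))
```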

AM Score
A shortcoming of IS is that the highest score is achieved when p_θ(y) is uniform [51], which corresponds to an even distribution of classes in the target domain. For SITS, where the class distribution is often highly imbalanced, this may cause IS to estimate a suboptimal shift. AM [50] addresses this issue by taking the actual target class distribution C^t into account:

AM(δ, θ, C^t) = E_{(x^t, τ^t) ∼ D t}[ H( p_θ(y | (x^t, τ^t + δ)) ) ] + D_KL( C^t ∥ p_θ(y) )    (6)

AM consists of two terms: the first term is an entropy term on the conditional distribution, and the second is the KL-divergence between the underlying class distribution C^t and the marginal distribution. Lower values of AM indicate a better δ, as the model is confident in its predictions and the actual class distribution of the data matches the predicted distribution of classes. The temporal shift δ t→s is estimated by:

δ t→s_AM(θ s, C^t) = argmin_{δ ∈ {−∆, ..., ∆}} AM(δ, θ s, C^t)    (7)

where the estimated temporal shift minimizes AM.
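The AM score is similarly short to compute. In the sketch below the class priors and predictions are toy values; note how confident predictions whose frequencies match the imbalanced priors score near zero, while confident but uniformly spread predictions are penalized:

```python
import numpy as np

def am_score(probs, class_dist):
    """AM score for softmax predictions probs (N, K) against known
    class priors class_dist (K,): mean conditional entropy plus
    KL(class_dist || marginal). Lower is better."""
    eps = 1e-12
    entropy = (-np.sum(probs * np.log(probs + eps), axis=1)).mean()
    marginal = probs.mean(axis=0)
    kl = np.sum(class_dist * (np.log(class_dist + eps) - np.log(marginal + eps)))
    return float(entropy + kl)

priors = np.array([0.7, 0.2, 0.1])  # imbalanced target class distribution
# 7/2/1 confident predictions match the 0.7/0.2/0.1 priors exactly.
matched = np.array([[1.0, 0, 0]] * 7 + [[0, 1.0, 0]] * 2 + [[0, 0, 1.0]])
mismatched = np.eye(3)              # confident but uniformly spread
am_matched = am_score(matched, priors)
am_mismatched = am_score(mismatched, priors)
```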

Algorithm for Estimating Temporal Shift
While AM is more accurate at estimating the temporal shift, it requires knowledge of the target class distribution C^t, which is not available. To address this, we propose to approximate the target class distribution for AM by pseudo-labels obtained with IS. We show our approach in Algorithm 1. First, we use IS (Equation 5) to estimate an initial shift δ t→s (line 3). This initial estimate allows us to shift the target domain so that more accurate pseudo-labels can be generated with a source-trained model. We then use the pseudo-labels to estimate the target class distribution Ĉ^t (lines 4-5). Finally, we re-estimate the temporal shift more accurately with AM and Ĉ^t (line 6).

Algorithm 1: Temporal Shift Estimation
1 Input: Source-trained parameters θ s, unlabeled target domain D t, maximum shift ∆
2 Initialize estimated target class distribution Ĉ^t ← 0
3 Estimate initial temporal shift δ t→s ← δ t→s_IS(θ s) (Eq. 5)
4 Compute pseudo-labels ŷ_i ← argmax_y p_θs(y | (x_i^t, τ_i^t + δ t→s)) for each (x_i^t, τ_i^t) ∈ D t
5 Estimate Ĉ^t_y ← (1/n t) Σ_i 1[ŷ_i = y] for y ∈ {1, . . . , K}
6 Estimate temporal shift δ t→s ← δ t→s_AM(θ s, Ĉ^t) (Eq. 7)
7 Output: Temporal shift δ t→s
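The two-stage estimation can be sketched end-to-end as below. Here `predict(delta)` is a stand-in for the source-trained model applied to target data shifted by `delta`, and the toy model is constructed so the true shift is +5 days; the scoring helpers are minimal re-implementations for illustration:

```python
import numpy as np

EPS = 1e-12

def inception_score(probs):
    marginal = probs.mean(axis=0)
    kl = np.sum(probs * (np.log(probs + EPS) - np.log(marginal + EPS)), axis=1)
    return np.exp(kl.mean())

def am_score(probs, class_dist):
    entropy = (-np.sum(probs * np.log(probs + EPS), axis=1)).mean()
    marginal = probs.mean(axis=0)
    return entropy + np.sum(
        class_dist * (np.log(class_dist + EPS) - np.log(marginal + EPS)))

def estimate_shift(predict, max_shift, num_classes):
    """Sketch of Algorithm 1: IS picks an initial shift, pseudo-labels
    estimate the target class distribution, and AM refines the shift."""
    shifts = range(-max_shift, max_shift + 1)
    delta_is = max(shifts, key=lambda d: inception_score(predict(d)))  # Eq. 5
    pseudo = predict(delta_is).argmax(axis=1)                          # line 4
    c_hat = np.bincount(pseudo, minlength=num_classes) / len(pseudo)   # line 5
    return min(shifts, key=lambda d: am_score(predict(d), c_hat))      # Eq. 7

# Toy stand-in model: confidence peaks when delta matches the true shift.
rng = np.random.default_rng(0)
TRUE_SHIFT, K, N = 5, 3, 60
labels = rng.integers(0, K, size=N)

def toy_predict(delta):
    conf = np.exp(-abs(delta - TRUE_SHIFT))
    probs = np.full((N, K), (1.0 - conf) / K)
    probs[np.arange(N), labels] += conf
    return probs

delta_hat = estimate_shift(toy_predict, max_shift=10, num_classes=K)
```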

TimeMatch Learning
With our method for estimating the temporal shift, we can reduce the domain discrepancy between the source and target domains. The TimeMatch learning algorithm uses the temporal shift to train the student model for the target domain from teacher-generated pseudo-labels via the FixMatch loss [45] and EMA training [46]. We present the complete TimeMatch algorithm in Algorithm 2, and describe the details of each step in the following.

Pre-training on the Source Domain
As we rely on the teacher to generate pseudo-labels to train the student, it is important to obtain a good initialization for both models. Additionally, temporal shift estimation requires a source-trained model. Thus, we first use the labeled source domain to obtain source-trained model parameters θ s. Given a batch of labeled source data from D s, we optimize the following loss function:

L s = (1/B) Σ_{i=1}^{B} L( p_θ(y | (x_i^s, τ_i^s)), y_i^s )    (8)

where L(·, ·) is a classification loss (e.g. cross-entropy or focal loss [52]) and B the batch size. After pre-training, we initialize the parameters of the student θ and the teacher θ′ from θ s (line 2).

TimeMatch Loss
The TimeMatch loss consists of two terms: a supervised loss L s→t applied to the adapted source domain D s→t and an unsupervised loss L t applied to the unlabeled target domain D t . Our loss is based on the FixMatch loss [45]. To regularize the model to predict consistent pseudo-labels on randomly augmented versions of the same inputs, FixMatch applies two types of augmentation functions: weakly-augmented a(·) and strongly-augmented A(·), corresponding to simple and extensive augmentations of the input. We describe the form of augmentations we use for a(·) and A(·) in Section 4.4.
Let δ s→t and δ t→s be the temporal shifts estimated by Algorithm 1 using the teacher (lines 5-7).

Algorithm 2: TimeMatch
1 Input: Labeled source domain D s, unlabeled target domain D t, source-trained parameters θ s, total epochs n and iterations m, pseudo-label threshold ε, trade-off value λ, EMA decay rate α, learning rate η
2 Initialize student parameters θ ← θ s and teacher parameters θ′ ← θ s
3 Initialize estimated target class distribution Ĉ^t ← 0
4 for epoch = 1 to n do
5   Estimate temporal shift δ t→s with Algorithm 1 using the teacher θ′
6   if epoch = 1 then
7     δ s→t ← −δ t→s
8   for iteration = 1 to m do
9     Sample a batch S from D s and a batch T from D t
10    With S shifted by δ s→t, compute source loss L s→t (Eq. 9)
11    For each example in T shifted by δ t→s, generate teacher prediction q_i^t and pseudo-label ŷ_i^t (Eq. 10 and 11)
12    With T and confident pseudo-labels ŷ_i^t with max(q_i^t) > ε, compute target loss L t (Eq. 12)
13    Update student by gradient: θ ← θ − η∇_θ(L s→t + λL t)
14    Update teacher by EMA: θ′ ← αθ′ + (1 − α)θ

To compute the supervised loss, we use:

L s→t = (1/B) Σ_{i=1}^{B} L( p_θ(y | (A(x_i^s), τ_i^s + δ s→t)), y_i^s )    (9)

using source labels y_i^s to update the student θ on strongly-augmented source data shifted by δ s→t. This loss makes it possible for the student to learn the target phenology from shifted source data (line 10).

To generate pseudo-labels for the target domain, we obtain the predicted class distribution from the teacher when input source-shifted target data:

q_i^t = p_θ′(y | (a(x_i^t), τ_i^t + δ t→s))    (10)

where the teacher θ′ is input a weakly-augmented target sample, shifted by δ t→s. Then, we use

ŷ_i^t = argmax_y q_i^t    (11)

as the pseudo-label (line 11). The student θ is then updated on strongly-augmented target data for confident pseudo-labels (line 12):

L t = (1/B) Σ_{i=1}^{B} 1[max(q_i^t) > ε] · L( p_θ(y | (A(x_i^t), τ_i^t)), ŷ_i^t )    (12)

where 1 is the indicator function and ε is the confidence threshold for using a pseudo-label. With this loss, the student is trained with target data using pseudo-labels. The total loss minimized by the student in TimeMatch is:

L all = L s→t + λ L t    (13)

where λ is a scalar hyperparameter controlling the trade-off between the supervised and the unsupervised loss (line 13).
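The loss computation described above can be sketched in numpy, with model outputs replaced by hand-written probability arrays (in practice these would come from student and teacher forward passes on augmented, shifted batches; the threshold and trade-off values are illustrative):

```python
import numpy as np

def ce(probs, labels):
    """Cross-entropy of predicted probabilities against integer labels."""
    return -np.log(probs[np.arange(len(labels)), labels] + 1e-12)

def timematch_loss(student_src, y_src, teacher_tgt, student_tgt,
                   threshold=0.9, lam=1.0):
    loss_src = ce(student_src, y_src).mean()            # supervised loss on shifted source
    pseudo = teacher_tgt.argmax(axis=1)                 # teacher pseudo-labels
    mask = teacher_tgt.max(axis=1) > threshold          # confidence gate
    loss_tgt = (mask * ce(student_tgt, pseudo)).mean()  # unsupervised target loss
    return loss_src + lam * loss_tgt                    # total loss

student_src = np.array([[0.8, 0.2], [0.3, 0.7]])      # student on shifted source
y_src = np.array([0, 1])                               # source labels
teacher_tgt = np.array([[0.95, 0.05], [0.55, 0.45]])   # teacher on shifted target
student_tgt = np.array([[0.9, 0.1], [0.5, 0.5]])       # student on original target

loss = timematch_loss(student_src, y_src, teacher_tgt, student_tgt)
```

With the default threshold of 0.9, only the first target sample's pseudo-label is confident enough to contribute; raising the threshold to 0.99 masks out both, leaving only the supervised term.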

EMA training and re-estimating temporal shift
By optimizing L all , the student and teacher are trained only for the target phenology, as L s→t shifts the time of the source to the target, while L t keeps the target in its original time. This loss enables a source-trained model to adapt to the crop phenology of the target domain.
However, by doing so, the source domain is gradually "forgotten", and as a result, it becomes unnecessary to apply the temporal shift δ t→s for pseudo-labeling the target domain with the teacher. This causes δ t→s to gradually move to zero during TimeMatch learning. Thus, if δ t→s is fixed to the same shift, the target samples will be wrongly shifted, which results in incorrect pseudo-labels. To address this, we re-estimate the temporal shift for the teacher during TimeMatch learning. As Algorithm 1 chooses the shift based on the confidence and diversity of model predictions, re-estimating the temporal shift with the teacher ensures the generated pseudo-labels remain accurate during training.
However, if the teacher is a direct copy of the student, the model will rapidly adapt to the target domain, which requires the temporal shift to be re-estimated every few iterations. But doing so drastically increases training time, as Equation 7 requires forwarding a large sample of target data for each possible temporal shift. We address this by introducing EMA training, where the teacher is slowly updated via an EMA of the student parameters (line 14):

θ′ ← αθ′ + (1 − α)θ    (14)

where α is a decay rate. By choosing α close to 1, we reduce the rate at which the teacher adapts to the target domain, enabling the re-estimation of δ t→s to be done only once each epoch (line 5). Moreover, by averaging model weights via EMA, we also obtain less noisy pseudo-labels [46].
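The EMA update itself is one line per parameter. A small sketch with made-up parameter arrays, showing how the teacher only slowly closes the gap to the student:

```python
import numpy as np

def ema_update(teacher, student, alpha=0.99):
    """Teacher <- alpha * teacher + (1 - alpha) * student, per parameter.
    alpha close to 1 makes the teacher a slowly moving average."""
    return {name: alpha * teacher[name] + (1 - alpha) * student[name]
            for name in teacher}

teacher = {"w": np.zeros(4)}
student = {"w": np.ones(4)}   # pretend the student has moved to all-ones
for _ in range(100):
    teacher = ema_update(teacher, student, alpha=0.99)
# After 100 steps the teacher has closed only 1 - 0.99**100 (~63%) of the gap.
```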

By re-estimating the temporal shift, the teacher and the shift can both evolve jointly during training, resulting in better pseudo-labels for improved cross-region adaptation. Note that δ s→t is not re-estimated (line 7). The first shift estimate represents the shift of the data, whereas the re-estimated shift represents the shift of the teacher. By fixing δ s→t to the initial estimate, the source domain is kept aligned with the target domain during training, which enables semi-supervised learning.

Figure 3: Overview of the PSE+LTAE architecture [12,53]. Given SITS of an agricultural parcel, the PSE module processes each time step independently by embedding a random sample of pixels, producing a sequence of embeddings e i . The observation dates τ i , to which we add temporal shifts, are input to the model by adding their positional encoding to e i . The result is temporally processed by LTAE into a single embedding o i , which is passed to the classifier.

Dataset and Materials
This section presents the TimeMatch dataset [33] and the materials for our experiments. We first introduce the crop classification model we use, followed by a description of the dataset and its pre-processing. Then, we describe the competitors and our implementation. Our source code is publicly available, and contains the implementation of TimeMatch and the competitors, a link to download our dataset, and the full experimental results: https://github.com/jnyborg/timematch.

Network Architecture
As our model, we use the existing object-based crop classifier PSE+LTAE introduced by Sainte Fare Garnot et al. [12,53]. The network consists of two modules: the pixel-set encoder (PSE) and the lightweight temporal attention encoder (LTAE). See Figure 3 for an overview.
The PSE module handles the spatial and spectral context of SITS. Given the SITS of an agricultural parcel, PSE samples a random pixel-set of size S among the N i available pixels within the parcel. The PSE is efficient compared to, e.g., convolutions, which are time- and memory-consuming when applied to irregularly sized parcels. As spatial information is lost by this sampling, the PSE supports an optional extra input with various geometrical properties of the given parcel, such as its area. We do not input this extra feature to avoid biasing the model towards the shapes of parcels in the source region, which typically vary with local farming practices. Thus, we only input the sequence x i ∈ R Ti×Ni×C , which is then embedded by the PSE for each time step independently.
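The pixel-set sampling step can be sketched as follows. This is an illustrative NumPy sketch; sampling with replacement for parcels smaller than S is our assumption:

```python
import numpy as np

def sample_pixel_set(parcel_pixels, s=64, rng=None):
    """Sample a random pixel-set of size S from the N_i pixels of a parcel.

    parcel_pixels: array of shape (T_i, N_i, C).
    Returns an array of shape (T_i, s, C). If the parcel has fewer than s
    pixels, we assume sampling with replacement.
    """
    rng = rng or np.random.default_rng()
    n = parcel_pixels.shape[1]
    idx = rng.choice(n, size=s, replace=n < s)  # same pixel subset at every time step
    return parcel_pixels[:, idx, :]
```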
The LTAE module [53] handles the temporal context by applying self-attention [54] with modifications to output a single embedding. It improves the accuracy and computational efficiency of the original TAE [12] through a channel grouping strategy and a learnable master query. The observation dates τ i are input to LTAE by encoding the days with sinusoidal positional encoding [54] and adding the result to the output of PSE. As the positional encoding does not support negative inputs, we handle negative temporal shifts by offsetting each τ i by the maximum temporal shift ∆. Given the sequence of PSE embeddings and the encoded τ i , LTAE outputs a single embedding o i , which is then classified by a multi-layer perceptron to produce class probabilities p(y|(x i , τ i )) ∈ R K .
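The sinusoidal encoding of observation days, including the offset by ∆ that keeps shifted days non-negative, can be sketched as follows. This is an illustrative NumPy sketch; the 10000 base and sin/cos interleaving follow the standard formulation of [54]:

```python
import numpy as np

def positional_encoding(days, d_model=128, max_shift=60):
    """Sinusoidal positional encoding of (possibly shifted) observation days.

    Negative temporal shifts are supported by offsetting every day value by
    the maximum temporal shift `max_shift` (Delta), keeping inputs >= 0.
    Returns an array of shape (len(days), d_model).
    """
    pos = np.asarray(days, dtype=np.float64) + max_shift  # offset by Delta
    i = np.arange(d_model // 2)
    angles = pos[:, None] / (10000.0 ** (2 * i / d_model))
    enc = np.zeros((len(pos), d_model))
    enc[:, 0::2] = np.sin(angles)  # even channels: sine
    enc[:, 1::2] = np.cos(angles)  # odd channels: cosine
    return enc
```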

The TimeMatch Dataset
The TimeMatch dataset [33] contains SITS from Sentinel-2 Level-1C products in top-of-atmosphere reflectance. Four Sentinel-2 tiles are chosen in various climates: 33UVP (Austria), 32VNH (Denmark), 30TXT (mid-west France), and 31TCJ (southern France), abbreviated as AT1, DK1, FR1, and FR2, respectively. A map of the tiles is shown in Figure 4. We use all available observations with cloud coverage ≤ 80% and coverage ≥ 50% between January 2017 and December 2017. Figure 5 shows the resulting acquisition dates for the four tiles. We leave out the atmospheric bands (1, 9, and 10), keeping C = 10 spectral bands. The 20m bands are bilinearly interpolated to 10m.
For ground truth data, we retrieve geo-referenced parcel shapes and their crop type labels from the openly available Land Parcel Identification System (LPIS) records in Denmark, France, and Austria. We select 15 major crop classes in Europe and label any remaining parcels as unknown. Figure 6 shows the selected classes and their frequency in each tile. We pre-process the parcels by applying 20m erosion and removing all parcels with an area of less than 1 hectare. This reduces label noise by removing pixels near the border of parcels, which are often less representative of the given crop class than the pixels in the middle, and by removing small or thin polygons, which are typically miscellaneous classes such as field borders. The SITS are pre-processed for object-based classification by cropping the pixels within each parcel to input sequences x i ∈ R Ti×Ni×10 . Each input is then randomly assigned to the train/validation/test sets of each Sentinel-2 tile by a 70%/10%/20% ratio. Note that this process assumes knowledge of parcel shapes in the target region. If this is not available, TimeMatch may instead be applied for pixel-based classification by inputting single pixels (S = 1) to PSE+LTAE.

We choose five different cross-region tasks (written as "source→target"): DK1→FR1, DK1→FR2, DK1→AT1, FR1→DK1, and FR1→FR2. When a Sentinel-2 tile is chosen as source, all labels of the train and validation sets are available for training. When a tile is the target region, no labels are available, except for the final evaluation on the test set. In contrast, many existing UDA methods assume that a labeled validation set is available for the target domain and use it during training, e.g. to select the best model [23,29,34,35]. However, this assumption is unrealistic: if labels were available in real-world scenarios, they would be better used for training the model.
Instead, we report all cross-region UDA test results with the model output at the end of training. Still, hyperparameters must be chosen with a labeled validation set. Thus, we tune hyperparameters with the target validation set for only one task, DK1→FR1, and apply the found hyperparameters to all remaining tasks (as done in [38]).

The class distributions between regions differ significantly, and there may not be enough examples of a crop type in the source region for a model to learn its classification. Thus, when pre-training models on source data, we only use the subset of crop types with at least 200 examples in the source region (as indicated by the dashed line in Figure 6). The remaining classes are set as "unknown". When evaluating on the target data, we report results on the same selection of source classes regardless of their frequency in the target.
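The class-selection rule above can be sketched in a few lines of Python. The function name and the relabeling-to-"unknown" mechanics are our illustration of the described rule:

```python
from collections import Counter

def select_source_classes(source_labels, min_count=200, unknown="unknown"):
    """Keep crop classes with at least `min_count` source examples;
    relabel all other parcels as `unknown` (sketch of the class selection)."""
    counts = Counter(source_labels)
    kept = {c for c, n in counts.items() if n >= min_count and c != unknown}
    relabeled = [c if c in kept else unknown for c in source_labels]
    return kept, relabeled
```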

Comparisons
Baselines. We consider the following baseline methods:

• Source-Trained is PSE+LTAE trained on the source domain and applied to the target domain without domain adaptation. This result represents the lower-bound cross-region performance of the model.
• Target-Trained is PSE+LTAE trained with labeled target data using the same classes as the source-trained. We note that by training with the source classes, which is required for comparison, infrequent classes may not be learned properly which increases the variance of the results. This result can be seen as the upper bound cross-region performance if labels were available in the target region.
Competing UDA Methods. We compare TimeMatch to five existing UDA methods. We reproduce these methods for SITS by replacing the original feature extractor with PSE+LTAE. For domain-invariant methods, we align the LTAE feature vector input to the final classifier (i.e., o i in Figure 3), similar to the original approach in these methods. We compare to the following methods:

• FixMatch [45] is TimeMatch without the temporal shift estimation. As this is a semi-supervised learning (SSL) method, it shows whether UDA or SSL is more beneficial for cross-region adaptation.
• MMD [23] learns domain-invariant features by minimizing the maximum mean discrepancy metric.
• DANN [22] uses a domain classifier to learn domain-invariant features with adversarial training.
• CDAN+E [35] improves upon DANN by conditioning the domain classifier on the classification output and minimizing an entropy loss on target data.
• ALDA [29] is a self-training method where pseudo-labels are refined by a noise-correcting domain discriminator. This method is in essence the most similar to TimeMatch.
• JUMBOT [38] learns domain-invariant features by a discrepancy measure based on optimal transport.

We note that the time-series domain adaptation methods R-DANN, VRADA [39], and CoDATS [40] also employ DANN to align the features extracted by temporal network architectures. Thus, the only difference between these methods and the DANN approach mentioned here is the backbone architecture, which in our case is the temporal model PSE+LTAE. PAN [20], a UDA method for SITS, learns domain-invariant features by minimizing the MMD loss for a temporal crop classification network. Unfortunately, we were unable to gain access to the source code of PAN for comparison. As an alternative, we include the MMD comparison, which is similar to PAN, except the crop classifier is changed to PSE+LTAE.
ShiftAug. To verify the benefits of estimating the temporal shift compared to training models that are invariant to temporal shifts, we implement a simple data augmentation technique for training shift-invariant models, which we name ShiftAug. During training, ShiftAug uniformly samples δ ∼ U(−∆, ∆) for each training example and shifts the example by (x i , τ i + δ). By extending the training data to contain all valid temporal shifts with uniform probability, ShiftAug enables training models with invariance towards shifts. Note that ShiftAug is incompatible with the temporal shift estimation presented in Algorithm 1, which requires a shift-variant model.
We implement all competing methods with and without ShiftAug. This reveals the degree to which existing methods can implicitly learn shift-invariance.
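ShiftAug itself reduces to a one-line augmentation. A minimal NumPy sketch, where only the observation dates are shifted and the pixel values are untouched:

```python
import numpy as np

def shift_aug(x, dates, max_shift=60, rng=None):
    """ShiftAug: sample one random temporal shift per training example.

    x: pixel values (unchanged); dates: observation days tau_i.
    The shift delta is drawn from U(-max_shift, max_shift).
    """
    rng = rng or np.random.default_rng()
    delta = rng.integers(-max_shift, max_shift + 1)
    return x, dates + delta
```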

Implementation Details
All experiments are implemented in PyTorch [55] and train on a single NVIDIA 1080 Ti GPU. Our implementation is based on the source code of PSE+LTAE [53].
Source-training. To initialize models on the labeled source domain, we follow the original training approach of PSE+LTAE [12]. We train for 100 epochs with the Adam [56] optimizer with an initial learning rate of 0.001, decayed with a cosine annealing schedule [57]. We use a weight decay of 0.0001, batch size 128, and focal loss with γ = 1. Inputs are normalized to [0, 1] by dividing by the maximum 16-bit pixel value 2^16 − 1. The best source-trained model is selected using the source validation set. We augment the inputs by randomly sub-sampling 30 time steps. The pixel-set size of PSE is set to S = 64 during training. The same setup is used for the target-trained model. For the final evaluation, we do not sample time steps or pixels, and instead input all available time steps (T = T i ) and pixels (S = N i ) for each example to the model. This ensures deterministic test results, and we also observe slightly improved results by doing so.
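The input normalization and temporal sub-sampling augmentation above can be sketched as follows. This is an illustrative NumPy sketch of the described preprocessing, not the released code:

```python
import numpy as np

def preprocess_train(x, dates, n_steps=30, rng=None):
    """Training-time preprocessing sketch: normalize 16-bit reflectance to
    [0, 1] and randomly sub-sample `n_steps` time steps, keeping the
    observation dates aligned with the sampled steps."""
    rng = rng or np.random.default_rng()
    x = x.astype(np.float64) / (2 ** 16 - 1)  # divide by max 16-bit value
    t = x.shape[0]
    idx = np.sort(rng.choice(t, size=min(n_steps, t), replace=False))
    return x[idx], dates[idx]
```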
ShiftAug. When training with ShiftAug, all training data (both source and target) are randomly shifted during training as described in Section 4.3. ShiftAug is disabled during evaluation.
TimeMatch. We use the same training setup as the source-trained model, but train for 20 epochs with a lower initial learning rate of 0.0001. We define an epoch as 500 iterations to fix the frequency at which the temporal shift is re-estimated. We use a maximum temporal shift of ∆ = 60 days, as we did not observe shifts greater than 2 months for our dataset in Europe. We set the trade-off hyperparameter λ = 2.0 in Equation 13, the EMA decay rate α = 0.9999, and the pseudo-label threshold τ = 0.9. A sensitivity analysis of these hyperparameters is provided in Section 5.4. For the FixMatch [45] augmentations, we use the identity function for the weak augmentation a(·) in Equation 10 and randomly sub-sample time steps for the strong augmentation A(·) in Equations 9 and 12. These are used for simplicity, and we leave the use of more advanced augmentations for SITS to future work. At each iteration, we sample two mini-batches of size 128, one from the source and one from the target, to calculate the TimeMatch objective in Equation 13. We use a class-balanced mini-batch sampler for the source domain to ensure each source mini-batch contains roughly the same number of samples for each class. This reduces the class imbalance problem for the source domain and improves performance. Additionally, we apply domain-specific batch normalization [58][59][60] by forwarding the source and target mini-batches separately instead of concatenated. This ensures the batch normalization [61] statistics are calculated separately for each domain, for improved adaptation.
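Domain-specific batch normalization amounts to nothing more than two separate forward passes. A simplified PyTorch sketch (the helper name is ours) that makes the effect visible on a bare BatchNorm layer:

```python
import torch
import torch.nn as nn

def forward_domain_specific(model, x_source, x_target):
    """Forward source and target mini-batches separately, so that in
    training mode the BatchNorm statistics are computed per domain
    rather than over the concatenated batch."""
    feats_s = model(x_source)  # BN batch statistics from the source batch only
    feats_t = model(x_target)  # BN batch statistics from the target batch only
    return feats_s, feats_t
```

Forwarding the concatenation instead would normalize both domains with pooled statistics, blurring the domain-specific distributions.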
Competing Methods. We re-implement the competitors MMD, DANN, and CDAN+E following the domain adaptation library in [62], and use the original source code for ALDA [29] and JUMBOT [38]. FixMatch [45] is implemented as TimeMatch without the temporal shift estimation, with the teacher generating pseudo-labels and the student as the final model. All methods are initialized from a source-trained model. ShiftAug versions are initialized from the corresponding ShiftAug source-trained model, and we continue to use ShiftAug during training. As in TimeMatch, we train for 20 epochs and tune the hyperparameters of these methods on the task DK1→FR1. The full details can be found in our source code.

Table 1 shows the performance obtained with our approach and the re-implemented baselines and competitors. We report the mean and standard deviation of macro F1-scores, calculated from the results of three runs with different dataset splits.

Main Results
We observe that source-trained models transfer very poorly to new target regions, with an average F1-score of 39% on target data. In comparison, target-trained models on the same classes achieve 82% on average. We observe that training shift-invariant models with ShiftAug improves domain generalization, leading to an increased average score of 49%. This greatly motivates addressing the temporal shift in UDA.
Existing UDA methods, however, only slightly increase the performance of source-trained models, with the best result obtained by CDAN+E [35] with 47%. By incorporating our ShiftAug, we observe a performance boost across all evaluated methods, indicating that existing methods are unable to implicitly handle the temporal shift.
Our approach TimeMatch, where we explicitly estimate the temporal shift, outperforms all competing methods by 11% on average, and their ShiftAug variants by 5%. This shows that accounting for the temporal shift is a key component for the cross-region adaptation problem of SITS. Moreover, the shift-variant approach of TimeMatch outperforms the shift-invariance strategy. We hypothesize that training for shift-invariance may complicate crop classification, as the classification of certain crop types is shift-variant. For example, spring barley and winter barley develop similarly over time, but shifted in time, as also discussed in Section 1.
Comparing TimeMatch to the results of the target-trained model, we observe that our approach, without any target labels, recovers a significant part of the highest achievable performance if target labels were available, but we also find that there is room for improvement. From our results, we see that methods which explicitly account for the temporal shift, such as TimeMatch and the ShiftAug variants of competing methods, generally outperform methods which do not. We therefore believe that further improvements can be gained by considering stronger forms of temporal alignment than shifts, such as class-wise alignments or time warping. We leave this interesting direction to future work.
Lastly, we highlight the results of the semi-supervised learning method FixMatch [45]. This method is similar to TimeMatch, but without temporal shift estimation. We observe that without ShiftAug, FixMatch obtains results worse than the source-trained model. This indicates semi-supervised learning cannot address the cross-region task alone. With ShiftAug, however, the results are greatly improved on average. Interestingly, the performance is worse without ShiftAug for all tasks except FR1→FR2. Here, the source and target regions are the most geographically close, and as a result, the temporal shift is also closer to zero (see the top-left table in Figure 4). This indicates that ShiftAug (controlled by ∆) trades better long-range classification results for worse short-range results. In contrast, by estimating the temporal shift directly, TimeMatch does not have this issue and outperforms shift-invariance at both short and long distances.

Analysis of Temporal Shift Estimation
In Figure 7a, we show the change in the overall accuracy of a source-trained model when applied to target data with different temporal shifts for DK1→FR1. We also show the change in the entropy, IS, and AM scores of the model. We observe a significant increase in accuracy by temporally shifting the target data. Calculating the entropy, IS, and AM statistics from the predictions of the model works well as an unlabeled proxy for accuracy. We aim to estimate the shift with the highest accuracy (dashed blue line) to obtain the highest quality pseudo-labels. For the shown example, the minima of both entropy and AM correspond to the best shift. However, we find AM to be the most consistent across different adaptation tasks.
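The shift search itself is a simple grid search over candidate shifts. A minimal NumPy sketch, using mean prediction entropy as the scoring criterion for brevity (the paper also considers the IS and AM scores and finds AM the most consistent); the function names are ours:

```python
import numpy as np

def estimate_shift(predict, target_data, dates, max_shift=60, step=1):
    """Grid search over candidate temporal shifts (sketch of the idea in
    Algorithm 1). `predict(x, dates)` returns an (N, K) array of class
    probabilities; we return the shift with the lowest mean entropy."""
    best_shift, best_score = 0, np.inf
    for delta in range(-max_shift, max_shift + 1, step):
        probs = predict(target_data, dates + delta)  # forward shifted target data
        entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1).mean()
        if entropy < best_score:
            best_shift, best_score = delta, entropy
    return best_shift
```

Because every candidate shift requires a forward pass over a large target sample, this search is exactly the costly step that EMA training amortizes to once per epoch.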
In Figure 7b, we show the rate at which the estimated temporal shift for the teacher goes to zero during TimeMatch learning when training with different EMA decay rates. When the shift changes, the previous estimate becomes suboptimal for generating accurate pseudo-labels. We address this by re-estimating the temporal shift during training. We observe that low decay rates (e.g. 0.99) require the shift to be re-estimated after a few iterations, which is inefficient. In comparison, a decay rate of 0.9999 allows us to re-estimate the shift only once every epoch.
The table in the upper left corner of Figure 4 shows the initial temporal shifts estimated by our method. We find the estimated shifts are connected to the climatic differences between the regions. In the other direction, the opposite is true, and indeed, we estimate a negative temporal shift of −35 days. Note that the two opposite estimates differ by 3 days due to estimation variance. Here, the two source-trained models used to estimate the temporal shift in each direction are trained on two completely separate source regions, yet their estimated shifts are still roughly inverses. This indicates that the temporal shift learned by these models is connected to the phenological properties of their respective source regions.

Ablation Study
To better understand how TimeMatch is able to obtain state-of-the-art results, we perform an ablation study of its different components on the task DK1→FR1. We report the results in Table 2. We first study the impact of EMA training. Instead of EMA, we set the teacher as a direct copy of the student (No EMA). We observe that training without EMA introduces a significant drop in F1-score. This shows that EMA is important to ensure high pseudo-label accuracy. Setting δ s→t = 0 disables the temporal shift of the source domain, and the student is trained with datasets with different temporal shifts. We observe a significant decrease in F1-score as a result. Disabling the balanced mini-batch sampler for the source domain also degrades performance. If the model is trained with class-imbalanced source data, the teacher will make biased pseudo-labels for the samples from the target domain [64]. This hinders the TimeMatch learning process, as pseudo-labels for infrequent classes in the source domain are less likely to be generated for the target. By applying a balanced mini-batch sampler for the source, we address this problem by ensuring each source batch contains roughly the same number of samples for each category. Estimating the temporal shift with IS or entropy instead of AM results in a slight performance drop. Domain-specific batch normalization is simple to implement, as it just requires forwarding source and target batches separately instead of concatenated. Disabling this component results in a small average performance loss with notably higher variance.

Sensitivity Analysis
Here we study the sensitivity of the TimeMatch hyperparameters. The results are shown in Figure 10. Higher values of α lead to better results, with a decay rate of 0.9999 being the best. However, increasing it to 1.0, so the teacher is never updated, results in a drop in F1-score, as the teacher cannot benefit from the knowledge learned by the student. The confidence threshold τ controls the trade-off between the quality and quantity of pseudo-labels. A threshold of 0.9 gives the best F1-score, and further increasing it to 0.95 drops performance as a result of too few pseudo-labels, which particularly hurts the less frequent classes. Finally, the trade-off parameter λ controls the importance of the source domain loss L s→t relative to the target domain loss L t . We observe that this hyperparameter is less important than the other two, but setting λ = 2.0 gives the best results.

Visual Analysis
Finally, we visualize the ability of TimeMatch to learn discriminative features for the target domain. In Figure 8, we visualize t-SNE [63] embeddings of target domain features from the source-trained, CDAN+E (the best competitor on average), TimeMatch, and target-trained models on the task DK1→FR1. The colors of the points represent their class (black is the unknown class). With TimeMatch, the target features are better clustered into their respective classes compared to CDAN+E, which does not separate features much better than the source-trained model. The target-trained plot shows the best possible learned features when training with all available target labels. Even with labels, the classes are not perfectly separated, e.g. for unknown/meadow or winter triticale/winter wheat. Figure 9 shows example parcel predictions in a small area for the source-trained and TimeMatch models compared to the ground truth. The colors represent the same classes as before. We observe a large class confusion for the source-trained model, in particular between winter barley (blue) and winter wheat (dark pink), which are also not separated well in Figure 8. Without using any target labels, TimeMatch resolves this issue, resulting in predictions that better resemble the ground truth.

Conclusion
This paper presented TimeMatch, a novel cross-region adaptation method for SITS. Unlike previous methods that solely match the feature distributions across domains, TimeMatch explicitly captures the underlying temporal discrepancy of the data by estimating the temporal shift between two regions. Through TimeMatch learning, we adapt a crop classifier trained in a labeled source region to an unlabeled target region. This is achieved by a learning algorithm that combines temporal shift estimation with self-training, where target pseudo-labels are generated using the estimated temporal shift from target to source. Lastly, we presented the TimeMatch dataset, a new large-scale cross-region UDA dataset with SITS from four different regions in Europe. Evaluated on this dataset, TimeMatch outperforms all existing approaches by 11% F1-score on average across five different adaptation tasks, setting a new state of the art in unsupervised cross-region adaptation. While this demonstrates that TimeMatch reaches strong results, there is still a gap with the performance obtained by fully supervised approaches. To overcome this limitation, we hypothesize that stronger temporal alignments, e.g. class-wise alignments or time warping, could further improve the performance. Another possibility is to perform domain adaptation across both time and space, which in addition to the temporal aspect also brings new considerations, such as the change in parcel shapes over time and crop rotations. We hope our proposed method and released dataset will encourage the remote sensing community to consider the challenging cross-region adaptation problem and its temporal aspect.