Learned Representation of Satellite Image Series for Data Compression

Abstract: Real-time transmission of satellite video data is one of the fundamentals in the applications of video satellite. Making use of the historical information to eliminate the long-term background redundancy (LBR) is considered to be a crucial way to bridge the gap between the compressed data rate and the bandwidth between the satellite and the Earth. The main challenge lies in how to deal with the variant image pixel values caused by the change of shooting conditions while keeping the structure of the same landscape unchanged. In this paper, we propose a representation learning based method to model the complex evolution of the landscape appearance under different conditions by making use of the historical image series. Under this representation model, the image is disentangled into the content part and the style part. The former represents the consistent landscape structure, while the latter represents the conditional parameters of the environment. To utilize the knowledge learned from the historical image series, we generate synthetic reference frames for the compression of video frames through image translation by the representation model. The synthetic reference frames can highly boost the compression efficiency by changing the original intra-frame prediction to inter-frame prediction for the intra-coded picture (I frame). Experimental results show that the proposed representation learning-based compression method can save an average of 44.22% bits over HEVC, which is significantly higher than that using references generated under the same conditions. Bitrate savings reached 18.07% when applied to satellite video data with arbitrarily collected reference images.


Introduction
Currently, video satellite remote sensing has become a new trend in smart city development. Low-Earth-orbit satellites are being launched in increasing numbers to record and monitor the landscape from space, which drives ever more extensive applications of satellite video, such as smart transportation and disaster management. Compared to hyperspectral imaging, satellite video can continuously observe a specific place, which helps to detect and respond to unusual activities in time. However, owing to the gap between the large volume of the data stream and the real-time demands of remote sensing data analysis, remote surveillance applications in smart cities are greatly restricted.
Unlike conventional city surveillance videos, satellite videos are usually taken at high resolution, for example 12,000 × 5000 for the Jilin-1 satellite videos. One frame can cover a large surface in great detail, but this results in a large amount of video data, which is hard to transmit promptly over the current transmission channel between the satellite and the Earth. Even when the videos are compressed with the latest coding standard, high-efficiency video coding (HEVC) [1], the real-time analysis demands still cannot be met [2]. More efficient compression techniques specific to satellite videos are in high demand.
To achieve a high compression ratio for satellite images and videos, dictionary-based sparse representation methods have been proposed. Benefiting from optimization algorithms such as K-SVD [3][4][5] and from low-rank matrix recovery theory [6][7][8], those methods train a dictionary to represent satellite data sparsely. Through constructing and optimizing the dictionary model, the remote sensing data can be efficiently represented by a few atoms of the dictionary. Another type of representative work uses the image prior that exists in historical images. Liu et al. [9] proposed to use a collection of historical remote sensing images as prior knowledge and introduced image feature extraction and registration to compress images. Tao et al. [10] further combined the same prior knowledge with Bayesian dictionary learning to compress remote sensing images.
Xiao et al. [2,11] discovered a new type of redundancy that exists in multi-source satellite videos, named long-term background redundancy (LBR). This type of redundancy is caused by the similar background shared by images and videos captured at the same land location at different times. As the number of video clips shot over the same area increases, the redundancy across multiple video clips becomes significant. In their work, the LBR is eliminated by creating a long-term background referencing library containing high-definition, geometrically registered images of the entire area. Synthetic frames are then generated through geometric matching, radiometric adjustment, and quality adjustment, and are used as references to eliminate the LBR. The long-term background referencing library has shown its ability to represent the historical background prior of the same landscape. However, the radiometric adjustment and quality adjustment can only change the tone of the whole image and cannot deal with the appearance change caused by seasonal or diurnal variation. The synthetic frame therefore cannot achieve pixel-wise similarity to the target frame, which reduces the compression gain obtained from the synthetic reference frame.
In this paper, we offer a representation learning-based method to model the complex evolution of the landscape appearance under different conditions by making use of the historical image series. Each series contains multi-temporal images at the same location, namely providing different landscape appearances under various daily and seasonal conditions. The aim is to utilize the learned representation model to generate a more precise reference frame to the target frame from historical images. The challenge lies in how to deal with the variant image pixel values caused by the change of shooting conditions while keeping the structure of the same landscape unchanged. Inspired by the disentangled representation learning [12][13][14][15][16], we propose a disentangled representation learning of satellite image series. The description of the same landscape is determined by separating the stable structure from the changing environment. Based on that, the reference frame for the current target frame is generated by combining the stable structure extracted from the reference library and the current environmental parameters obtained from the current target frame. The synthetic reference frame can be adopted into the compression framework for satellite video compression proposed by Xiao et al. [2].
Experiments are conducted on real video clips from video satellites to evaluate the performance of the proposed method. The results reveal that the proposed method could achieve 44.22% bitrate savings on average over the main profile of HEVC compared with 30.95% bitrate savings in [2].
There are three main contributions of this work: (1) We propose a new representation model for the remote sensing image series, which can model the consistent landscape, as well as the variational shooting conditions.
(2) We integrate the learned representation of historical images into the satellite video compression framework through the generation of reference frames.
(3) Our proposed video compression method outperforms the state-of-the-art compression scheme.
The remainder of this paper is organized as follows: Section 2 provides a literature review regarding related work. A detailed introduction of the proposed synthetic reference image generation method is illustrated in Section 3. Section 4 reports our experimental results, and Section 5 concludes the paper.

Related Work
Our work is related to current video compression methods, representation methods for historical image series, and multi-modal image-to-image translation methods. Therefore, we review the related work from these three aspects.

Compression Method on Satellite Videos
In recent years, the video form of satellite data, which can capture the correlation between images, has gained increasing popularity. To compress satellite videos, general video compression methods have been applied on video satellites. For example, the video compression standard H.264 [17] was equipped on Skysat [18], which was launched by Skybox and captured 1280 × 720 (720P) satellite videos. The satellite Jilin-1 was outfitted with the latest video compression standard, HEVC. Those video compression standards aim to eliminate the spatial-temporal redundancy that exists within and between the frames of one video, and cannot remove the long-term background redundancy across multi-source satellite videos. Xiao et al. [2] and Wang et al. [11] proposed historical knowledge library-based video compression methods. They constructed a long-term background reference library composed of high-resolution historical images. By adjusting the historical images to the target images for reference, those methods greatly increase the compression ratio, especially for the intra-coded picture (I frame). However, they adopted linear models in the pixel domain, such as radiometric adjustment and quality adjustment, to translate the historical image, which cannot deal with the complex change of the appearance of the same landscape.

Compression Method of Image Series
Similar to satellite videos, an image set also contains a large number of similar images with common objects or backgrounds. These images are usually captured by multiple shooting devices at various times from different positions. To compress this type of data, several image series representation methods were proposed [19][20][21][22][23]. Yue et al. [19] proposed to retrieve the most similar historical data from the cloud as a reference. Shi et al. [20] proposed to sort the images by mining the similarities between them, reorganize the image series into a video sequence, and then code it like a video. To compress repetitive moving objects in surveillance videos, Xiao et al. [24,25] proposed to construct a vehicle knowledge library and generate a synthetic reference object based on perspective transformation and residual compression. Chen et al. [26,27] further improved this work by lighting compensation and texture mapping. Those methods provide ways to represent relations between multiple historical image series. Still, their computational complexity is high, and the actual differences between the historical images of satellite videos make it hard to find pixel-level similarities for satellite data compression.

Image-to-Image Translation Method
The color correction of the registered historical image can be regarded as the image-to-image translation. That is to say, images taken at any time of a location can be generated from a historical image of that location. The image structure remains unchanged while the image color is changed. Recently, Sanchez et al. [28], Gonzalez-Garcia et al. [29], Zhu et al. [30,31], and Huang et al. [12] presented work that dealt with multi-modal output. In particular, Zhu et al. [31] invented a CycleGAN model [30] to address the diversity of output. Sanchez et al. [28] proposed a cross-domain autoencoder model that integrated the variational autoencoder (VAE) and the generative adversarial network (GAN). Those methods performed well for artistic image translation. However, they often generate structural artifacts, which makes it difficult to apply them directly for generating reference images. According to the features of satellite videos, we propose a new translation model using satellite images to learn the reference frame generation task.

Preliminary on Reference Image Generation
Preliminary studies on using prior knowledge to improve the compression efficiency mainly focus on using images as references. In such work, the reference image can be an original historical image [32,33] or a synthesized image [24,26,34]. In our previous work [2], we used the current Google Map image as the historical reference. The reference was adjusted to generate the reference image for the compression of the target frame. Before introducing the proposed method of this paper, we first briefly review how the reference frame was generated in our previous work. In this paper, we improve the reference frame generation to provide a reference that is more similar to the target frame and thus improves the coding efficiency.
The overall framework is presented in Figure 1a. The whole process of generating the reference frame consisted of four steps: image selection, geometric matching, radiometric adjustment, and quality adjustment. After selecting the geographically corresponding historical image from the image library, we conducted geometric matching from the historical image to the target frame $I_f$, so that the reference image was geometrically fit to the target frame. Then, radiometric adjustment was used to compensate for the color difference between the reference and the target frame. After that, the quality of the reference image (e.g., its blurriness) was adjusted to match that of the target image, a difference that might be caused by the imaging platform. After this step, the reference image was ready for the encoding process.
Search for historical image: The searching process is shown in Figure 2. The longitude and latitude of the center of the captured area were used to locate the current position. According to the coordinates, the focal length of the camera, and the flight altitude of the satellite, the range of the captured area could be roughly obtained, which is shown as the orange area in Figure 2. Taking the location error into account, the coverage of the historical image was made a little larger than that of the captured area. The final searched historical image from Google Earth is shown as the blue area in Figure 2 and is denoted as $I_h$.
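As a rough illustration of how the captured range can be estimated from the coordinates, the camera focal length, and the flight altitude, the sketch below uses a simple pinhole-camera approximation for a nadir-looking sensor. The function names, the sensor dimensions, and the 10% safety margin are assumptions made for illustration; they are not taken from the original system.

```python
import math


def ground_footprint(altitude_m, focal_length_m, sensor_w_m, sensor_h_m, margin=0.1):
    """Approximate ground coverage (width, height in metres) of a nadir view.

    Pinhole model: ground size = altitude * sensor size / focal length.
    `margin` enlarges the window to absorb location error (assumed value).
    """
    width = altitude_m * sensor_w_m / focal_length_m
    height = altitude_m * sensor_h_m / focal_length_m
    return width * (1 + margin), height * (1 + margin)


def search_window(lat_deg, lon_deg, width_m, height_m):
    """Turn a footprint centred at (lat, lon) into a lat/lon bounding box."""
    m_per_deg_lat = 111_320.0  # approximate metres per degree of latitude
    m_per_deg_lon = m_per_deg_lat * math.cos(math.radians(lat_deg))
    dlat = height_m / 2 / m_per_deg_lat
    dlon = width_m / 2 / m_per_deg_lon
    # (south, west, north, east)
    return (lat_deg - dlat, lon_deg - dlon, lat_deg + dlat, lon_deg + dlon)
```

The returned bounding box would then be used to query the historical image library; the actual query interface is outside the scope of this sketch.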
Geometric matching: Due to the offset of the shooting angle and the location error, there is a certain positional deviation between the historical image $I_h$ and the captured frame $I_f$. Thus, feature registration was applied to $I_h$. Specifically, the Scale-Invariant Feature Transform (SIFT) algorithm was employed to extract feature points from $I_h$ and $I_f$. Each feature point consists of a feature descriptor and a position, denoted as $(f_{I_h}, p_{I_h})$ and $(f_{I_f}, p_{I_f})$ for $I_h$ and $I_f$, respectively. Candidate correspondences were found by calculating the Euclidean distance between the feature descriptor pairs $f_{I_h}$ and $f_{I_f}$. The RANdom SAmple Consensus (RANSAC) matching scheme was then applied to estimate an affine transformation matrix $M$ from $I_h$ to $I_f$. Finally, the registered background reference $I_g$ was obtained by applying $M$ to $I_h$.
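The matching and RANSAC steps can be sketched in plain Python as follows. This is a minimal illustration, not the original implementation: descriptors are assumed to be already extracted (SIFT itself is not reimplemented), matching is a plain nearest-neighbour search, and the iteration count, pixel tolerance, and all function names are assumed for the example.

```python
import math
import random


def match_descriptors(desc_h, desc_f):
    """For each descriptor of I_h, index of its nearest descriptor of I_f."""
    return [min(range(len(desc_f)), key=lambda j: math.dist(d, desc_f[j]))
            for d in desc_h]


def solve_affine(src, dst):
    """Exact affine map (a, b, tx, c, d, ty) from three point pairs, or None."""
    (x1, y1), (x2, y2), (x3, y3) = src
    det = x1 * (y2 - y3) - y1 * (x2 - x3) + (x2 * y3 - x3 * y2)
    if abs(det) < 1e-9:
        return None  # the three source points are collinear

    def fit(v1, v2, v3):
        # Cramer's rule for a*x + b*y + t = v at the three source points
        a = (v1 * (y2 - y3) - y1 * (v2 - v3) + (v2 * y3 - v3 * y2)) / det
        b = (x1 * (v2 - v3) - v1 * (x2 - x3) + (x2 * v3 - x3 * v2)) / det
        t = (x1 * (y2 * v3 - y3 * v2) - y1 * (x2 * v3 - x3 * v2)
             + v1 * (x2 * y3 - x3 * y2)) / det
        return a, b, t

    return fit(*(p[0] for p in dst)) + fit(*(p[1] for p in dst))


def apply_affine(m, p):
    """Map point p = (x, y) with the affine parameters m."""
    a, b, tx, c, d, ty = m
    return (a * p[0] + b * p[1] + tx, c * p[0] + d * p[1] + ty)


def ransac_affine(pairs, iters=200, tol=2.0, seed=0):
    """Keep the affine model that agrees with the most (src, dst) pairs."""
    rng = random.Random(seed)
    best, best_inliers = None, []
    for _ in range(iters):
        sample = rng.sample(pairs, 3)
        m = solve_affine([s for s, _ in sample], [d for _, d in sample])
        if m is None:
            continue
        inliers = [(s, d) for s, d in pairs
                   if math.dist(apply_affine(m, s), d) <= tol]
        if len(inliers) > len(best_inliers):
            best, best_inliers = m, inliers
    return best, best_inliers
```

In practice the winning model is usually refined by a least-squares fit over all inliers; that refinement is omitted here to keep the sketch short.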
Radiometric adjustment: To adjust the color difference between the target frame and the reference image, a transfer model [35] was used. Different from the original lαβ space, the video frames were recorded in the YUV space, in which the first channel was lightness and the other two channels were color components. We modified the transfer model into the YUV space by:
$$p_c^{i} = \frac{\sigma_f^{i}}{\sigma_g^{i}} \left( p_g^{i} - \mu_g^{i} \right) + \mu_f^{i}, \quad i \in \{Y, U, V\},$$
where $p_c^{i}$ and $p_g^{i}$ are the color values in the reference image after radiometric adjustment $I_c$ and the previous reference image $I_g$, respectively. $\mu_g^{i}$ and $\sigma_g^{i}$ are the mean and standard deviation of channel $i$ in $I_g$, and $\mu_f^{i}$ and $\sigma_f^{i}$ are the mean and standard deviation of the YUV values from the target frame $I_f$. By using this model, the color of the reference $I_g$ was adjusted according to the color statistics of the current frame.
Quality adjustment: To solve the problem that the Google reference usually has higher quality than the video frames, the quality of the reference image $I_c$ should be adjusted. An isotropic 2D Gaussian blur filter was used to simulate the quality degradation of the video frame. The mean of the Gaussian distribution was set to zero. The standard deviation $\sigma$ was determined by minimizing the pixel value differences between $I_c$ and $I_f$:
$$\sigma = \arg\min_{\sigma} \sum_{k} \left| G(\sigma) \ast B_k^{r} - p_k^{c} \right|,$$
where $B_k^{r}$ is the $k$th block from $I_c$ and $p_k^{c}$ is the $k$th corresponding pixel value from the current frame. $G(\sigma)$ is the Gaussian kernel, whose values are defined by the parameter $\sigma$ as follows:
$$G(\sigma)(u, v) = \frac{1}{2\pi\sigma^{2}} \exp\left( -\frac{u^{2} + v^{2}}{2\sigma^{2}} \right),$$
where $(u, v)$ is the offset from the kernel center.
After the Gaussian blur, we obtained the final reference image I r , which will be used for the prediction in the encoding framework.
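The radiometric and quality adjustments can be illustrated with a small pure-Python sketch. It assumes that the radiometric step is a per-channel mean and standard deviation transfer (the usual form of the transfer model in [35]) and that σ is picked from a list of candidates by the sum of absolute differences; the function names, the 5 × 5 kernel size, and the candidate grid are illustrative assumptions.

```python
import math
from statistics import fmean, pstdev


def transfer_channel(ref, target):
    """Give one channel of `ref` the mean and standard deviation of `target`."""
    mu_g, sd_g = fmean(ref), pstdev(ref)
    mu_f, sd_f = fmean(target), pstdev(target)
    scale = sd_f / sd_g if sd_g > 0 else 1.0
    return [scale * (p - mu_g) + mu_f for p in ref]


def gaussian_kernel(sigma, radius=2):
    """Zero-mean isotropic 2D Gaussian, normalised so the weights sum to 1."""
    offsets = range(-radius, radius + 1)
    weights = [[math.exp(-(u * u + v * v) / (2 * sigma * sigma)) for u in offsets]
               for v in offsets]
    total = sum(map(sum, weights))
    return [[w / total for w in row] for row in weights]


def blurred_center(block, kernel):
    """Blurred value at the block centre: kernel-weighted sum over the block."""
    return sum(b * k for brow, krow in zip(block, kernel)
               for b, k in zip(brow, krow))


def best_sigma(blocks, pixels, candidates, radius=2):
    """Candidate sigma whose blur best matches the target pixel values."""
    def cost(sigma):
        kernel = gaussian_kernel(sigma, radius)
        return sum(abs(blurred_center(b, kernel) - p)
                   for b, p in zip(blocks, pixels))
    return min(candidates, key=cost)
```

Each block passed to `best_sigma` must have the same size as the kernel, here (2 · radius + 1) squared pixels, and `transfer_channel` would be called once for each of the Y, U, and V channels.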

Representation Learning-Based Reference Image Generation
The inter-prediction in the existing framework of video compression highly depends on the similarity of pixels in each coding-block. However, the radiometric and quality adjustment in the previous method is a global process, which cannot provide an optimal reference for every coding-block. Therefore, instead of using only one historical image for the reference, we propose to utilize a series of historical images to improve the quality of the reference image. Let $L = \{I^{1}, I^{2}, \ldots, I^{l}\}$ denote a series of $l$ historical images taken at the same location. Those images share the same landscape, while the shooting conditions change. Therefore, we attempt to learn a representation of the image series. The representation consists of a content feature and a style feature for each image. The representation model learns to extract the same content feature within an image series of a location, but uses different style features to represent the individual variations. In this way, the content feature records the consistent landscape structure of a specific location, and the style feature records the variations.
Take a target frame $I_f$ and its co-located image series $L_f = \{I_f^{1}, I_f^{2}, \ldots, I_f^{l}\}$ for example. $L_f$ exists at both the encoding side and the decoding side of the compression. At the encoding side, the target frame $I_f$ is input into the representation model to obtain its style feature. The style feature is then transmitted to the decoder side. Then, at the decoder side, the content feature of this location is extracted from any image of $L_f$, which should be the same as the content feature of the target frame. It is combined with the transmitted style feature to reconstruct the reference image $I_r$.
The advantage of this representation is that the shared content feature can be treated as the redundancy among the images of the same location. Namely, it does not need to be transmitted. The style code records the individual variation of target frame, which needs to be transmitted for the reconstruction of the reference image. However, compared to the content feature, the length of the style feature is very short, which is friendly for encoding.

Network Architecture
In order to learn a disentangled representation of an image series, we chose to use an unsupervised disentanglement model similar to the one proposed in [12]. In the model, one image is decomposed into a content domain and a style domain. We assumed that images taken at any time in region c can be generated from the image at another time. In other words, the two images should have the same content feature, and they can be converted to each other after exchanging their style features. Therefore, we propose a model to learn the mappings $M_1: x \rightarrow y$ and $M_2: y \rightarrow x$. As is shown in Figure 3, the content feature extracted from image x and the style feature extracted from image y are passed through the decoder to generate the reconstruction of image y, denoted as $\hat{y}$. The reconstruction of image x is generated in a similar process. The content encoder, style encoder, and the decoder are shared between images. In this way, the content encoder learns to extract the common information (the landscape) between remote sensing images, whilst the style encoder learns to record the independent information (other factors).
To enforce the representation learning of the historical image series, evaluations on the common content features and the quality of reconstructed images are employed. A discriminator function $D: I \rightarrow \{0, 1\}$ is introduced to tell whether the generated image fits the distribution learned from the historical images.

Objective Function
To train the representation learning network, the loss function is composed of several terms. The shared content feature loss is used to enforce the content encoder to learn the common features. The reconstruction loss is used to apply the learning of other factors and to ensure that the encoders and decoders are inverses. Moreover, the adversarial loss is used to restrict the distribution of translated images to the image distribution of historic images.
• Shared content feature loss: For two images x and y of the same location, the shared content features $Enc_c^{x}$ and $Enc_c^{y}$ must be identical, i.e., $Enc_c^{x} = Enc_c^{y}$. Therefore, our goal is to minimize the $L_1$ distance between them. The loss function of the content encoder is defined as:
$$L_c = \left\| Enc_c^{x} - Enc_c^{y} \right\|_1.$$
• Reconstruction loss: The reconstruction loss contains two terms. The first term is the self-reconstruction of each image: the disentangled image x should be recoverable from its own content feature $Enc_c^{x}$ and style feature $Enc_s^{x}$, in which case the encoder and decoder form an auto-encoder model. Taking x for example, the self-reconstruction loss penalizes the difference between x and $Dec(Enc_c^{x}, Enc_s^{x})$; the loss for image y is defined in the same way. The second term is a cross-reconstruction loss. The exclusive (style) feature only contains the particular information of each image. Therefore, the reconstruction of image x from the content feature of y, $Dec(Enc_c^{y}, Enc_s^{x})$, should be similar to x, and for the sake of a disentangled representation the cross-reconstruction loss of x penalizes the difference between them. The overall reconstruction loss $L_{rec}$ is the sum of the two reconstruction loss terms.
• Adversarial loss: In order to force the distribution of the reconstructions to be close to the distribution of real images, PatchGAN [36] was adopted for our model. The decoder can be regarded as the generator, and it is trained to reconstruct images that are classified as true by the discriminator, i.e., by $Disc(Dec(Enc_c, Enc_s))$. The generator tries to maximize the loss while the discriminator tries to minimize it. The overall adversarial loss $L_{GAN}$ is the sum of the losses of both reconstructed images x and y.
• Total loss: Our model is trained jointly in an end-to-end learning manner. The objective is to minimize the following total loss:
$$L = L_{GAN} + w_{rec} L_{rec} + w_c L_c,$$
where $w_{rec}$ and $w_c$ are the constant weights of the corresponding losses.
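To make the structure of the objective concrete, the following sketch evaluates the loss terms on plain Python lists. Several points are assumptions rather than details from the paper: the total loss is taken to be the weighted sum of the adversarial, reconstruction, and content terms, the reconstruction terms use a mean absolute difference, and the encoders and decoder are passed in as arbitrary callables. The default weights 100 and 50 are the values reported in the implementation details.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equally sized vectors."""
    return sum(abs(u - v) for u, v in zip(a, b)) / len(a)


def content_loss(enc_c_x, enc_c_y):
    """Shared content feature loss: L1 distance between content features."""
    return l1_distance(enc_c_x, enc_c_y)


def reconstruction_loss(x, y, enc_c, enc_s, dec):
    """Self- plus cross-reconstruction terms for an image pair (x, y).

    The L1 form of each term is an assumption made for this sketch.
    """
    self_x = l1_distance(x, dec(enc_c(x), enc_s(x)))
    self_y = l1_distance(y, dec(enc_c(y), enc_s(y)))
    # Cross terms: content taken from the other image, style from the image itself
    cross_x = l1_distance(x, dec(enc_c(y), enc_s(x)))
    cross_y = l1_distance(y, dec(enc_c(x), enc_s(y)))
    return self_x + self_y + cross_x + cross_y


def total_loss(l_gan, l_rec, l_c, w_rec=100.0, w_c=50.0):
    """Weighted sum of the adversarial, reconstruction, and content terms."""
    return l_gan + w_rec * l_rec + w_c * l_c
```

With a perfectly disentangled pair (identical content features and exact reconstructions), both `content_loss` and `reconstruction_loss` evaluate to zero, so only the adversarial term remains.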

Representation Learning-Based Image Translation
The procedure of image translation based on the learned representation model is shown in Figure 4. We first obtain the selected historical image after geometric matching, $I_g$. The trained content encoder then extracts the shared content feature $Enc_c^{I_g}$ from it, and the style encoder extracts the style feature $Enc_s^{I_f}$ from the target frame $I_f$. Both features are passed through the decoder, generating the reconstruction $Dec(Enc_c^{I_g}, Enc_s^{I_f})$. Since the reconstruction contains both the structure of $I_g$ and the style of $I_f$, the generated image $I_r$ is used as the reference image.
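The data flow of this translation step, split between the sending and the receiving side, can be sketched as below. The three `toy_*` functions are deliberately trivial stand-ins (content is the deviation from the mean, style is the mean brightness) chosen only so the example runs; they are hypothetical and bear no relation to the trained networks.

```python
def encoder_side(target_frame, style_encoder):
    """Sending side: only the short style code of the target frame is sent."""
    return style_encoder(target_frame)


def decoder_side(registered_reference, style_code, content_encoder, decoder):
    """Receiving side: content of the library image + received style -> I_r."""
    return decoder(content_encoder(registered_reference), style_code)


# Toy stand-ins for the learned networks (hypothetical, for illustration only)
def toy_content_encoder(img):
    mean = sum(img) / len(img)
    return [p - mean for p in img]


def toy_style_encoder(img):
    return sum(img) / len(img)


def toy_decoder(content, style):
    return [c + style for c in content]
```

For example, with a registered reference `[10.0, 20.0, 30.0]` and a target frame `[110.0, 125.0, 128.0]`, the generated reference keeps the spatial pattern of the library image while adopting the brightness level of the target frame.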

Improvement of the Compression Scheme
A video clip contains intra-coded pictures (I frames), predicted pictures (P frames), and bidirectional predicted pictures (B frames). Typically, the I frame is compressed by intra-frame coding, which costs a higher bitrate than inter-frame coding. The other two types of frame are usually compressed by inter-frame coding. Particularly for satellite video, the spatial correlation is weak, while the long-term temporal correlation is strong due to the slowly moving background. Generally, the bitrate cost of an I frame is about 2-10 times that of a P frame. Therefore, using the generated background reference as the reference frame to remove the LBR of I frames can significantly improve the coding efficiency. The encoding procedure is sketched in Figure 5. Technically, the proposed representation learning model was trained on the server with the historical image series. The model was stored on both the encoding side (the satellite) and the decoding side (the Earth). For each target I frame, one historical image covering the shooting area was selected and geometrically transformed to the I frame with a transformation matrix M. Then, the image passed through the content encoder of the representation model to extract the content feature of this area. In the meantime, the target I frame went through the style encoder of the representation model to obtain the style feature. Afterward, the content feature and the style feature were input into the decoder of the representation model to reconstruct the reference image. The translation procedure needed only one forward step. We adopted the method in [37] for encoding the I frame with the reference image. Besides the output bitstream of the frame, we also needed to transmit the transformation matrix and the style code to the decoding side; both were compressed using the Lempel-Ziv-Welch method [38].
Figure 5. The framework of our proposed coding scheme. We use the learned representation model to generate the reference frame. With the synthetic reference frame, we adopt inter-frame compression for both the intra-coded picture (I frame) and the predicted picture (P frame).
At the decoding side, the selected image was firstly geometrically transformed using the transformation matrix. Then, its content feature was extracted by the content encoder of the representation model, which would be combined with the transmitted style feature to reconstruct the reference image. By using the reference image, the target frame could be decoded.
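The side information sent along with the bitstream (the transformation matrix and the style code) is small, and the text states that it is compressed with the Lempel-Ziv-Welch method. The sketch below shows a textbook LZW coder applied to such a payload. The float32 packing layout and the function names are assumptions for illustration, and the output codes are left as a list of integers rather than packed into bits.

```python
import struct


def pack_side_info(matrix, style_code):
    """Serialise affine parameters and a style code as little-endian float32."""
    values = list(matrix) + list(style_code)
    return struct.pack("<%df" % len(values), *values)


def lzw_compress(data):
    """Textbook LZW over bytes; returns a list of integer codes."""
    table = {bytes([i]): i for i in range(256)}
    current, codes = b"", []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate
        else:
            codes.append(table[current])
            table[candidate] = len(table)
            current = bytes([byte])
    if current:
        codes.append(table[current])
    return codes


def lzw_decompress(codes):
    """Inverse of lzw_compress."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            entry = previous + previous[:1]  # code defined by this very step
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        table[len(table)] = previous + entry[:1]
        previous = entry
    return b"".join(out)
```

With six affine parameters and an eight-dimensional style code, the uncompressed payload under this assumed layout is 56 bytes per I frame.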
If the geometric structure of the current frame changed from the historical images, the current encoding process would not be affected since the wrongly generated area of the synthesized reference would not be used in the inter-frame prediction. It would result in a little decrease of the compression ratio.
In order to update the library when the structure changed, we compared the content features of the current frame and the selected historical image using a cross-correlation metric computed between $Enc_c^{x}$, the content feature from the current frame, and $Enc_c^{ref}$, the content feature from the selected historical image.
If the difference of the content features exceeded a threshold τ, the newly reconstructed frame from the encoded bitstream would be added to the historical image series. The parameter τ was set based on the deviations among the content features of image series with the same geometric structure. We used the reconstructed frame from the encoded bitstream to update the library because only the reconstructed frame is available on the Earth for library updating.
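A possible form of this update rule is sketched below. The exact metric is not recoverable from the text, so the sketch assumes a zero-mean normalized cross-correlation over flattened content features and defines the difference as one minus that correlation; both choices, as well as the function names, are assumptions.

```python
import math


def normalized_cross_correlation(a, b):
    """Zero-mean normalized cross-correlation of two flattened feature maps."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    num = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    den = math.sqrt(sum((u - mean_a) ** 2 for u in a)
                    * sum((v - mean_b) ** 2 for v in b))
    return num / den if den > 0 else 0.0


def maybe_update_library(library, enc_c_cur, enc_c_ref, reconstructed_frame, tau):
    """Append the reconstructed frame when the content features differ too much.

    The difference is taken as 1 - NCC (an assumption); returns True on update.
    """
    difference = 1.0 - normalized_cross_correlation(enc_c_cur, enc_c_ref)
    if difference > tau:
        library.append(reconstructed_frame)
        return True
    return False
```

Because only the decoded frame is available on the ground, `reconstructed_frame` here stands for the frame decoded from the bitstream, not the original capture.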

Implementation Details
We implemented this model using the PyTorch toolbox and optimized the translation net and the discriminator using the Adam algorithm [39] with β 1 = 0.5, β 2 = 0.999, and a learning rate of 0.0001 following [40]. In the training period, we used a batch size of eight, and the training was stopped after 1,000,000 iterations. The loss weights $w_{rec}$ and $w_c$ were set to 100 and 50, respectively. We set the dimension of the style code to eight, as used across all datasets in [12]. We adopted the model trained on the SYNTHIA dataset [41] by Huang et al. [12] as model initialization and then fine-tuned the model on our data. Random mirroring was applied during training. For each image pair, we kept the resolution unchanged, but randomly cropped a patch of 256 × 256 as input to train our model. We trained our model on a server with a single Nvidia Titan X Pascal, and training took roughly five days. The trained representation model was about 115 megabytes. For the reader's convenience, we show the training process of our model in Algorithm 1.
In the test phase, the content feature and the style feature were extracted by the trained representation model from the target frame without cropping. The model ran at 0.171 s per frame (s/f) on an Nvidia Titan X Pascal for images of size 3840 × 2160. We also tested the model on an Nvidia Jetson TX2, which was selected as one part of an embedded system developed for a small satellite set to launch in the year 2020. The Nvidia Jetson TX2 contains four ARM Cortex A57 cores and one GPU with 256 CUDA cores. The model ran at 0.97 s/f on the Nvidia Jetson TX2 for images of size 3840 × 2160.
In the experiment, we conducted the data compression tests based on the standard HEVC codec for different purposes (details shown in Table 1). Our method could also be integrated into other codecs since we mainly provided synthetic reference images. The implementation was based on the low-delay configuration of an HEVC test model HM16.20 [42]. This implementation was compared to the unmodified HEVC codec to test the effectiveness of the proposed method on satellite video compression. The rough reference image without translation and the synthesized reference image generated by [2] were also compared to evaluate the effectiveness of our method. The testing was implemented on a server with eight Intel i5 CPU of 2.6 GHz and an Nvidia Titan X Pascal.

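Algorithm 1. Training procedure of our proposed model.
1: while iterations $t < T_{train}$ do
2:   Sample and crop a mini-batch of patch pairs (x, y) from training data;
3:   Extract the content and style features to get $(Enc_c^{x}, Enc_s^{x}) = Enc(x)$ and $(Enc_c^{y}, Enc_s^{y}) = Enc(y)$;
4:   Generate the self-reconstructions $\bar{x} = Dec(Enc_c^{x}, Enc_s^{x})$ and $\bar{y} = Dec(Enc_c^{y}, Enc_s^{y})$;
5:   Generate the cross-reconstructions $\hat{x} = Dec(Enc_c^{y}, Enc_s^{x})$ and $\hat{y} = Dec(Enc_c^{x}, Enc_s^{y})$;
6:   Calculate $L_c$ by $Enc_c^{x}$ and $Enc_c^{y}$;
7:   Calculate $L_{rec}$ by $(x, \bar{x})$ and $(y, \bar{y})$;
8:   Calculate $L_{cross}$ by $(x, \hat{x})$ and $(y, \hat{y})$;
9:   Calculate $L_{GAN}$ by $(x, \bar{x})$, $(y, \bar{y})$, $(x, \hat{x})$ and $(y, \hat{y})$ (Equation (7));
10:  Update the encoder and decoder with $L_c$, $L_{rec}$, $L_{cross}$ and $L_{GAN}$;
11:  Update the discriminator with $L_{GAN}$;
12: end while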

Datasets
We evaluated our proposed method on a subset of satellite video clips captured by Jilin-1. The image series from historical Google images were used for training our translation networks. The two datasets are described in detail as follows:
• Video clips from satellites: There were five video clips from the satellite Jilin-1 over the seaport of Valencia (Figure 6a), the airport of Atlanta (Figure 6b), a railway station of Munich (Figure 6c), a park of Madrid (Figure 6d), and the urban center of Valencia (Figure 6e). Those video clips were cropped from the original 12,000 × 5000 resolution and had a unified size of 3840 × 2160 with 300 frames at 10 fps.
• Image series from historical Google images: To show the evolution of the landscape, we employed the historical images of the same landscape from Google Earth as the training data. The image series contained 5000 images, with an average of ten images for each landscape. Example images of the test landscapes from the last five years are shown in Figure 7. They show that the structure of the same landscape was almost unchanged, but the appearance changed with the environment.
In the experiment, Bjøntegaard delta PSNR (BD-PSNR) and Bjøntegaard delta rate (BD-Rate) [43] were utilized as the metrics for the objective evaluation of coding performance.
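For readers unfamiliar with the metric, BD-Rate is commonly computed by fitting a cubic polynomial to log-rate as a function of PSNR for each codec and averaging the gap between the two fits over the shared PSNR range. The sketch below implements that classic four-point variant in pure Python; the function names are illustrative, and the rate and PSNR values used in the usage note are made up for demonstration, not results from the paper.

```python
import math


def fit_cubic(xs, ys):
    """Coefficients c0..c3 of the cubic through exactly four (x, y) points."""
    n = 4
    rows = [[x ** j for j in range(n)] + [y] for x, y in zip(xs, ys)]
    for col in range(n):  # Gaussian elimination with partial pivoting
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, n):
            factor = rows[r][col] / rows[col][col]
            for k in range(col, n + 1):
                rows[r][k] -= factor * rows[col][k]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        rest = sum(rows[r][k] * coef[k] for k in range(r + 1, n))
        coef[r] = (rows[r][n] - rest) / rows[r][r]
    return coef


def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Average bitrate change (%) of the test codec at equal PSNR.

    Expects four rate/PSNR points per codec; negative values are savings.
    """
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    span = hi - lo

    def mean_log_rate(rates, psnrs):
        # Fit log10(rate) against PSNR shifted to start at the lower bound
        coef = fit_cubic([p - lo for p in psnrs], [math.log10(r) for r in rates])
        # Integrate the cubic over [0, span] and divide by the interval length
        return sum(c * span ** (j + 1) / (j + 1) for j, c in enumerate(coef)) / span

    diff = mean_log_rate(rate_test, psnr_test) - mean_log_rate(rate_anchor, psnr_anchor)
    return (10 ** diff - 1) * 100
```

As a sanity check, a test codec that needs exactly half the anchor's bitrate at every PSNR level yields a BD-Rate of about -50%, and comparing a codec with itself yields 0.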

Intermediate Results from the Background Reference Generation
We compared our representation learning-based image translation method with the following three representative methods to show its effectiveness:

•
Color transform [2]: manipulating the color image by imposing the mean and standard deviation of the style image onto the content image.
• AdaIN [40]: aligning the mean and variance of the content features with those of the style features by the adaptive instance normalization (AdaIN) layer.

• PhotoWCT [43]: transferring the style of the reference image to the content image by a pair of feature transforms, whitening and coloring.

Figures 8 and 9 show the intermediate results of background reference generation from the historical Google images for the park and urban center clips. The historical Google images clearly differed from the target frames of the satellite videos in shooting angle, color, and texture details. After geometric matching, the reference was nearly aligned with the target frame. The color transform from [2] adjusted the color according to the statistics of the target frame, but a color deviation remained. AdaIN and PhotoWCT performed well at color transfer; however, they generated structural artifacts (e.g., distortions on object boundaries), especially in complex scenes such as satellite videos. We also translated the reference image to the target frame with our proposed method, which represents the consistent content feature by a high-dimensional spatial map and the varying style by a low-dimensional vector. The intermediate results show that the proposed image translation method successfully handled the changes in landscape appearance caused by the changing environment.

(b) reference image I_g after geometric matching; (c) reference image from I_g after color transform [2]; (d) reference image from I_g after adaptive instance normalization (AdaIN) [40]; (e) reference image from I_g after PhotoWCT [43]; (f) reference image from I_g after image translation using the proposed method.

Figure 9. Reference images of the clip urban center. (a) Sample frame I_f from satellite video clip urban center; (b) reference image I_g after geometric matching; (c) reference image from I_g after color transform [2]; (d) reference image from I_g after AdaIN [40]; (e) reference image from I_g after PhotoWCT [43]; (f) reference image from I_g after image translation using the proposed method.
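To make the first two baselines listed above concrete, the following is a minimal NumPy sketch of statistics matching. It assumes the color transform is applied per channel directly on the pixel array (the color space used in [2] is not stated here), and that AdaIN operates on feature maps of shape (C, H, W) produced by some encoder that is not shown. The function names `color_transform` and `adain` are illustrative and do not come from the compared implementations.

```python
import numpy as np


def color_transform(content, style, eps=1e-8):
    """Impose the per-channel mean and standard deviation of `style`
    onto `content`. Both inputs are float arrays of shape (H, W, C)."""
    c_mean = content.mean(axis=(0, 1), keepdims=True)
    c_std = content.std(axis=(0, 1), keepdims=True)
    s_mean = style.mean(axis=(0, 1), keepdims=True)
    s_std = style.std(axis=(0, 1), keepdims=True)
    # Normalize the content statistics, then rescale to the style statistics.
    return (content - c_mean) / (c_std + eps) * s_std + s_mean


def adain(content_feat, style_feat, eps=1e-5):
    """Adaptive instance normalization on feature maps of shape (C, H, W):
    align the channel-wise mean and variance of the content features
    with those of the style features."""
    c_mean = content_feat.mean(axis=(1, 2), keepdims=True)
    c_std = np.sqrt(content_feat.var(axis=(1, 2), keepdims=True) + eps)
    s_mean = style_feat.mean(axis=(1, 2), keepdims=True)
    s_std = np.sqrt(style_feat.var(axis=(1, 2), keepdims=True) + eps)
    return s_std * (content_feat - c_mean) / c_std + s_mean
```

Both functions perform the same operation, once on pixels and once on features. Because only global first- and second-order statistics are matched, neither can correct spatially varying appearance differences, which is consistent with the residual color deviation noted above for the color transform.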

Experiments with Satellite Video Clips
The compression results of the proposed method compared with HEVC are presented in Table 2. In the experiment, we compared three methods to analyze how effectively our method generates good background references: (1) NoTran, which uses reference images generated with geometric matching only; (2) ColorTran, which additionally applies the color transform method [2]; and (3) DeepTran, the proposed method.
In general, our method achieved an average bitrate saving of 44.22%, compared with 30.95% for ColorTran and 18.07% for NoTran. This indicates that the similarity between the background reference and the target frame strongly affects the compression gain for satellite video. Across video clips with different content, the highest bitrate reduction appeared for the seaport and airport, which contain large areas of uniform texture such as sea and aircraft runways. As also observed in [2], prediction was less efficient in regions where a projection difference remained after geometric matching.
The rate-distortion (RD) curves for the tested satellite video clips are shown in Figure 10 and are consistent with the results in Table 2. The RD curves of the reference-based compression methods (NoTran, ColorTran, and DeepTran) showed a significant improvement at low bitrates. The improvement diminished as the bitrate increased, especially for NoTran, which even fell below HEVC for the railway station and urban center videos. The curves of the proposed method lay above the other curves in the four video clips, indicating the general effectiveness of the proposed method in reducing the bitrate of satellite videos.
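The text here does not state how the average bitrate savings were derived from the RD curves. One common way to summarize the gap between two RD curves in codec comparisons is the Bjontegaard delta rate (BD-rate), and the sketch below assumes that metric purely for illustration. The function name `bd_rate` and the sample rate and PSNR values are hypothetical and are not taken from Table 2 or Figure 10.

```python
import numpy as np


def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard delta rate in percent.

    Negative values mean the test codec needs fewer bits than the
    anchor for the same PSNR. Each curve needs at least four points.
    """
    log_a = np.log10(np.asarray(rate_anchor, dtype=float))
    log_t = np.log10(np.asarray(rate_test, dtype=float))
    psnr_a = np.asarray(psnr_anchor, dtype=float)
    psnr_t = np.asarray(psnr_test, dtype=float)

    # Fit log-rate as a cubic polynomial of PSNR for each curve.
    poly_a = np.polyfit(psnr_a, log_a, 3)
    poly_t = np.polyfit(psnr_t, log_t, 3)

    # Integrate both fits over the PSNR range the two curves share.
    lo = max(psnr_a.min(), psnr_t.min())
    hi = min(psnr_a.max(), psnr_t.max())
    int_a = np.polyval(np.polyint(poly_a), hi) - np.polyval(np.polyint(poly_a), lo)
    int_t = np.polyval(np.polyint(poly_t), hi) - np.polyval(np.polyint(poly_t), lo)

    # Average log-rate difference, converted back to a percentage.
    avg_diff = (int_t - int_a) / (hi - lo)
    return (10.0 ** avg_diff - 1.0) * 100.0


# Hypothetical RD points (kbps, dB); the test curve uses half the bits
# of the anchor at every quality level, so the result is about -50.
anchor_rate = [1000.0, 2000.0, 4000.0, 8000.0]
anchor_psnr = [32.0, 35.0, 38.0, 41.0]
test_rate = [r * 0.5 for r in anchor_rate]
print(bd_rate(anchor_rate, anchor_psnr, test_rate, anchor_psnr))
```

A single BD-rate figure averages over the whole shared quality range, so it can hide the behavior described above, where the gain is large at low bitrates and shrinks (or reverses for NoTran) at high bitrates.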

Discussion
The image reconstruction results show that the quality of the reconstructed reference is significantly improved compared with color adjustment alone. This is probably owing to the high representation capability of deep networks. The common content feature extracted by the network effectively captures the geometric structures shared across a series of images, whereas the style features record the variational factors that cause the images to change. As expected, the better reconstructed reference image in turn reduces the compression bitrate, since a reference that is more similar to the target frame naturally leads to a higher compression ratio.

Conclusions
This paper proposed a representation learning-based satellite video compression method. The key idea was to use the prior knowledge embedded in historical images captured at multiple times to help compress the current video data, since the structure of the landscape does not always change. To deal with the pixel value changes caused by daily illumination or seasonal variation, a representation model was trained to represent the distribution of the images. Given two images of the same location, the representation model outputs the same content features and different style features. The former was considered as the LBR, which did not need to be transmitted; the latter was the conditional parameters, which needed to be compressed and transmitted. By using the synthetic reference image as the reference frame, the proposed method could save an average of 44.22% bits over HEVC, which was significantly higher than that using references generated under the same conditions. Bitrate savings reached 18.07% when applied to satellite video data with arbitrarily collected reference images.