Cloud-based storage and computing for remote sensing big data: a technical review

ABSTRACT The rapid growth of remote sensing big data (RSBD) has attracted considerable attention from both academia and industry. Despite the progress of computer technologies, conventional computing implementations have become technically inefficient for processing RSBD. Cloud computing is effective in activating and mining large-scale heterogeneous data and has been widely applied to RSBD over the past years. This study performs a technical review of cloud-based RSBD storage and computing from an interdisciplinary viewpoint of remote sensing and computer science. First, we elaborate on four critical technical challenges resulting from the scale expansion of RSBD applications, i.e. raster storage, metadata management, data homogeneity, and computing paradigms. Second, we introduce state-of-the-art cloud-based data management technologies for RSBD storage. The unit for manipulating remote sensing data has evolved due to the scale expansion and use of novel technologies, which we name the RSBD data model. Four data models are suggested, i.e. scenes, ARD, data cubes, and composite layers. Third, we summarize recent research on the application of various cloud-based parallel computing technologies to RSBD computing implementations. Finally, we categorize the architectures of mainstream RSBD platforms. This research provides a comprehensive review of the fundamental issues of RSBD for computing experts and remote sensing researchers.


Introduction
The accumulation of historical archives and the advancement of sensors in recent years have led to an explosion of remote sensing datasets (Toth and Jóźków 2016; Zhu et al. 2019), which are often regarded as remote sensing big data (RSBD) (Ma et al. 2015) or big remotely sensed data (Casu et al. 2017). With the launch of Landsat 9 on 27 September 2021, the Landsat series of satellites has been continuously observing Earth for nearly 50 years (Masek et al. 2020; Roy et al. 2014). The Sentinel satellites from the European Space Agency (ESA) had acquired 24.87 petabytes of remote sensing data by the end of 2020 (Drusch et al. 2012). Series of high-resolution remote sensing satellites such as SPOT (French 'Satellite pour l'Observation de la Terre'), Gaofen (Chinese high-resolution Earth imaging satellites), and IRS (Indian Remote Sensing Satellites) have been successively launched for various applications. Satellite data can be better leveraged and explored through the efforts of international organizations such as the Global Earth Observation System of Systems (Mhawish et al. 2021). RSBD has profoundly advanced remote sensing science, enabling a global perspective and a long-term historical view to re-conceptualize Earth. It not only expands the spatiotemporal scope of the study area but also stimulates a revolution in remote sensing methodology. Over the past several decades, remote sensing research has gradually developed from qualitative remote sensing based on the statistical models of digital signal processing to quantitative remote sensing characterized by the consideration of physical models (Asrar et al. 1985; Liang 2003). Recently, remote sensing research has entered the data-driven era (Hey, Tansley, and Tolle 2009; Zhang et al. 2019), e.g. machine learning (Jordan and Mitchell 2015) and deep learning (Lecun, Bengio, and Hinton 2015).
Most of all, with the progress of RSBD, an increasing number of researchers and engineers are working with RSBD, effectively contributing to research on global sustainable development, global climate, food security, natural disasters, agriculture, etc. (Allen et al. 2021;Gray et al. 2020;Moon, Kim, and Chan 2019;Neal et al. 2019).
Although RSBD has a promising future, its technical implementation remains difficult, and identifying the current technical challenges facing RSBD remains a broad issue. Yang et al. (2017) identified eleven main challenges for implementing RSBD, including data storage, transmission, analysis, architecture, and quality. In addition, Chi et al. (2016) found three common challenges: proper data identification, big data computing, and collaboration. These challenges have mainly been induced by the dramatic increase in data volume, which far exceeds the capacity of conventional computing technologies. For instance, Hansen et al. (2013) mapped global forest gains and losses from 2000 to 2012 at a 30 m spatial resolution, processing 20 terapixels of data. The manipulation of such massive datasets requires abundant human and material resources. As a result, only a few leading research institutions or companies have had access to RSBD, making it difficult to leverage RSBD and seriously restricting its development.
Cloud computing is a big data service delivered through the Internet (Yu et al. 2017; Armbrust et al. 2010). It originated from E-commerce and social networks (Sakr et al. 2011) and has been widely applied to RSBD over the past several years (Varghese and Buyya 2018). Applications include Google Earth Engine (GEE) (Tamiminia et al. 2020), Microsoft Planetary Computer ('Planetary Computer' 2022), and Earth on Amazon Web Services (AWS) ('Data and Information Access Services' 2021). Cloud computing is underpinned by big data technology and mainly consists of five service models: Software as a Service (SaaS), Platform as a Service (PaaS), Infrastructure as a Service (IaaS), Data storage as a Service (DaaS), and Function as a Service (FaaS) (Dillon, Wu, and Chang 2010). These services differ from previous computing technologies (e.g. high-performance computing) and make RSBD more accessible to the public. First, cloud computing typically relies on a set of commodity machines, is usually priced as 'pay-per-use', and supports the elastic, on-demand expansion of resources (Wang et al. 2010). As a result, the cost is much lower than that of sophisticated and expensive high-performance computers (Gupta et al. 2013). Second, cloud computing is delivered through the Internet, which facilitates the open sharing of remote sensing data and research, thereby promoting the FAIR principles (findable, accessible, interoperable, and reusable) (Wilkinson et al. 2016). Third, a robust big data ecosystem has formed around cloud computing after years of development, shielding users from the technicalities of massive computing. Thus, cloud computing helps RSBD researchers and engineers focus on algorithms and analysis rather than being hindered by computer technology (Saxena et al. 2020). Fourth, cloud computing is well suited to data-intensive computing such as RSBD applications (Yang et al. 2019).
The combination of cloud computing and remote sensing has proven to facilitate and promote RSBD, attracting growing interest within remote sensing research as a potential solution for large-scale spatiotemporal analysis.
Previous studies have evaluated the progress of cloud-based RSBD in terms of acquisition, storage, computing, analysis, transmission, and visualization (Ma et al. 2015; Chi et al. 2016; Zhang, Zhou, and Luo 2021; Wang and Yan 2020) and in specific applications (Sarker et al. 2020; Balti et al. 2020; Qu et al. 2020). In this research, we performed a broad technical review of cloud-based RSBD storage and computing and summarized the key architectures of cloud-based RSBD platforms. Section 2 discusses four concerns posed by RSBD storage and computing, namely raster storage, metadata management, data homogeneity, and the computing paradigm. Sections 3 and 4 review the available cloud-based technologies and our current understanding of RSBD storage and computing. Finally, Section 5 concludes the review. Overall, four data models, two computing types, four processing models, and five types of RSBD platform architectures are identified and discussed in this research, which broadly assesses state-of-the-art RSBD technologies and helps readers explore advanced RSBD.

Raster storage
The volume of remote sensing data entered the petabyte era long ago and is moving towards the exabyte and zettabyte eras. Conventionally, raster data is stored as arrays in multiple file formats, including the Hierarchical Data Format (HDF) (MODIS), GeoTIFF (Landsat), and JPEG 2000 (Sentinel-2). The size of a single dataset generally ranges from megabytes to terabytes. In addition, some cloud-based platforms store raster data as tiles in lightweight data formats (Yao et al. 2020) following grid discretization with Discrete Global Grid Systems (DGGS) for fast visualization and online computing. These lightweight formats include Portable Network Graphics (PNG) and the Joint Photographic Experts Group (JPEG) image format.
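As a concrete (and common, though simpler than a full DGGS) example of grid discretization, web platforms often use the Web Mercator 'slippy map' tiling scheme, in which a geographic coordinate maps to x/y tile indices at a given zoom level. The following sketch shows that standard mapping; it is illustrative and not tied to any particular platform mentioned above.

```python
import math

def lonlat_to_tile(lon: float, lat: float, zoom: int) -> tuple:
    """Map a WGS84 coordinate to x/y indices of the Web Mercator tile grid."""
    n = 2 ** zoom  # number of tiles along each axis at this zoom level
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y

def parent_tile(x: int, y: int) -> tuple:
    """A tile at zoom z is covered by the tile at z-1 whose indices are halved."""
    return x // 2, y // 2
```

Because each tile is a small, fixed-size raster, it can be encoded as a lightweight PNG/JPEG object and fetched independently, which is what makes this layout suitable for fast visualization.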
The unprecedented increase in remote sensing data poses severe challenges for RSBD raster data storage. First, the volume of RSBD far exceeds the capacity of standalone storage hardware, such as block storage or a redundant array of independent disks (RAID) (Gomes, Queiroz, and Ferreira 2020). Distributed file systems (DFS, introduced in Section 3.1) can preserve petabytes of data (Lü et al. 2011). However, the storage cost is extremely high for both individuals and government departments. For instance, the United States Geological Survey (USGS) considered charging for access to widely used sources of remote sensing data (e.g. Landsat) in 2018 to recover costs from users (Popkin 2018). Second, the retrieval of array data generates costly operations due to the specificity of remote sensing data structures, leading to decreased I/O efficiency and increased access latency. These operations differ from conventional big data I/O operations and are rarely optimized by existing big data technologies (Zhao et al. 2018). For example, in time-series analysis, remote sensing data are often scattered across many individual files/objects, resulting in numerous random accesses and expensive data transformations (Extract-Transform-Load). Consequently, data storage schemes need further development to bridge big data storage technologies and remote sensing data.

Metadata management
Remote sensing data comes with complex and vital metadata. Full utilization of metadata is valuable for ensuring the reliability and quality of raster data. The management of remote sensing datasets must rely on metadata such as the band, resolution, capture time, etc. In addition, complete metadata describes essential information for tracing raster data quality, such as cloudiness and solar altitude, ensuring the robustness and reliability of subsequent analysis (Barsi et al. 2019).
Several bottlenecks limit the use of complete metadata. First, there are substantial metadata entries for remote sensing datasets, and their quantity often exceeds the capacity of conventional metadata management systems. For example, the European Space Agency (ESA) Sentinel-2 product includes hundreds of metadata fields. Second, metadata formats differ. Generally, the metadata for remote sensing data is stored in the form of key-value pairs. However, some metadata, such as cloud masks and pixel quality assessments, are stored as vectors or rasters. Unfortunately, traditional metadata storage technologies do not support the management of such unstructured metadata. Third, the metadata structures of remote sensing data acquired from different sources are heterogeneous and lack a unified standard, leading to semantic ambiguities between datasets (Closa et al. 2019). Therefore, standards and management systems, such as the National Aeronautics and Space Administration's (NASA) Unified Metadata Model ('NASA Unified Metadata Model' 2022), need to be formulated to ensure multi-source remote sensing data interoperability. These standards must also be adapted to the novel RSBD applications emerging in the era of cloud computing (e.g. tile data retrieval, metadata queries). It is therefore crucial to migrate big data technologies to remote sensing metadata management (Al-Yadumi et al. 2021).
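The key-value heterogeneity described above can be illustrated with a minimal normalization sketch: per-product field names are mapped onto a unified schema. The alias table below is only indicative (the listed field names resemble Landsat MTL and Sentinel-2 fields, but real products carry many more fields, nested structures, and raster/vector metadata that this sketch ignores).

```python
# Illustrative alias table: canonical field -> per-product key names.
ALIASES = {
    "acquisition_time": {"DATE_ACQUIRED", "SENSING_TIME"},
    "cloud_cover": {"CLOUD_COVER", "CLOUDY_PIXEL_PERCENTAGE"},
    "platform": {"SPACECRAFT_ID", "SPACECRAFT_NAME"},
}

def normalize(metadata: dict) -> dict:
    """Map heterogeneous per-product metadata keys onto a unified schema."""
    unified = {}
    for canonical, aliases in ALIASES.items():
        for key in aliases:
            if key in metadata:
                unified[canonical] = metadata[key]
                break  # first matching alias wins
    return unified
```

A real implementation would also validate units and value domains per field; the point here is only that cross-product queries become possible once keys share one schema.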

Homogeneity
The homogeneity of input data is essential for data mining (Yu et al. 2017), while heterogeneity is a common feature of big data (Wu et al. 2014). Section 2.2 discussed the heterogeneous characteristics of metadata; however, the heterogeneity of raster data is more complex and more consequential, especially for multi-source remote sensing data (Zhan et al. 2018; Pastor-Guzman et al. 2020). Specifically, we assess the homogeneity of remote sensing raster data in terms of two aspects.
The first aspect of homogeneity is the consistency of the physical characteristics and quality of remote sensing data, such as the spectral meaning (e.g. central wavelength), processing level, accuracy, resolution, and projection. These characteristics can affect the accuracy and robustness of any subsequent analysis. Typically, heterogeneous data can be fixed during pre-processing (Young et al. 2017). However, pre-processing large amounts of data is difficult because some pre-processing steps still require manual intervention and cannot be executed in a fast, parallelized fashion (Rittger et al. 2021; Wei, Chang, and Bai 2020). The second aspect concerns the integrity and continuity of remote sensing data in both the temporal and spatial dimensions. Spatiotemporal continuity is necessary for remote sensing analysis to ensure greater accuracy over a more extensive scale (Kuo et al. 2018). A remote sensing analysis with discontinuous data can lead to inconsistent results. Nevertheless, continuity is commonly unattainable over a large region of interest due to the specificity of remote sensing data acquisition modes and plans (Figure 1).
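A minimal sketch of the first aspect is a grid-compatibility check run before stacking datasets: two rasters can be combined pixel-for-pixel only if they share a CRS and resolution and their origins differ by a whole number of pixels. The `Grid` model below is a deliberate simplification (real checks would also compare spectral definitions, processing level, and data quality).

```python
from typing import NamedTuple

class Grid(NamedTuple):
    crs: str        # e.g. an EPSG code string
    res: float      # pixel size in CRS units
    origin: tuple   # (x, y) of the upper-left corner

def compatible(a: Grid, b: Grid, tol: float = 1e-9) -> bool:
    """Check that two raster grids can be stacked pixel-for-pixel:
    same CRS, same resolution, origins offset by an integer pixel count."""
    if a.crs != b.crs or abs(a.res - b.res) > tol:
        return False
    dx = (a.origin[0] - b.origin[0]) / a.res
    dy = (a.origin[1] - b.origin[1]) / a.res
    return abs(dx - round(dx)) < tol and abs(dy - round(dy)) < tol
```

Datasets failing such a check must first be reprojected/resampled, which is exactly the pre-processing burden the paragraph above describes.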
Improving the types of homogeneity mentioned above is challenging and involves both remote sensing science and big data processing. Specific data pre-processing theories and algorithms are needed for technical support. Recent research has made great efforts toward improving the homogeneity of remote sensing data, e.g. spatiotemporal data fusion (Zhu et al. 2018) and multi-source remote sensing data harmonization (Claverie et al. 2018). Additionally, corresponding computational technologies are needed to implement rapid homogenization for large remote sensing datasets (Gao et al. 2022).

Parallel computing
Parallel computing simultaneously implements computation by dividing the main computation into smaller processes (Almasi and Gottlieb 1989). Parallel remote sensing computing lies at the core of RSBD and enables the full exploitation of big data's scalable computing capability. Previous reviews have described the processing in remote sensing computing as (Chi et al. 2016):

Y = F(X)    (1)

where Y denotes the computing results, X is the input remote sensing dataset, and F is the mapping function that transfers the input datasets into the result. There is no need to consider parallelized computing for a single-threaded application scenario. Therefore, Eq. (1) can be simplified as:

Y = f(X)    (2)

where f refers to a single-threaded remote sensing algorithm. Eq. (2) is the most basic case. However, in a big data scenario, the scale of X may be huge and beyond the capacity of a standalone computing node. In this case, the computation needs to be processed simultaneously with more computational resources to reduce the time cost. However, parallelization execution strategies differ greatly among algorithms. Therefore, it is not easy to find a generalized computational framework, such as Eq. (1), that can be adapted to all remote sensing analysis algorithms.
RSBD analysis can be grouped into two types, data-separable computing and data-inseparable computing, to further decompose the problem and investigate different solutions. Data-inseparable computing cannot be parallelized by partitioning the data. This type of computing requires a large amount of global information from the whole dataset, as in unsupervised classification, principal component analysis (PCA), etc. A parallel processing method that simply divides the dataset will produce artifacts at the tile edges (Lassalle et al. 2015). Parallel computing methods for such analyses are usually individualized; in other words, it is difficult to generalize a parallel computing paradigm for all data-inseparable remote sensing algorithms. Google Earth Engine's 'spatial aggregations data distribution model' pre-implements several data-inseparable algorithms. Each algorithm is implemented individually and transparently by Earth Engine using the MapReduce computing paradigm (introduced in Section 4) (Gorelick et al. 2017).
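To see why such algorithms need individualized treatment, consider a global percentile (a toy stand-in for heavier global steps such as PCA's covariance estimation): no single tile can compute it, so each partition must emit a summary and a custom reduce step must merge the summaries. A minimal sketch, assuming integer-valued pixels so a histogram summary is exact:

```python
from collections import Counter

def tile_histogram(tile):
    """Stage 1 (runs independently per partition): summarize a tile as a histogram."""
    return Counter(tile)

def global_percentile(histograms, q):
    """Stage 2 (custom reduce): merge per-tile histograms, then read off the
    q-th global percentile -- a statistic no single tile can compute alone."""
    merged = Counter()
    for h in histograms:
        merged.update(h)
    total = sum(merged.values())
    target = q * total
    acc = 0
    for value in sorted(merged):
        acc += merged[value]
        if acc >= target:
            return value
```

The per-algorithm design burden lies in choosing a summary (histogram, partial covariance, cluster sufficient statistics, ...) that makes the global result exactly recoverable, which is why these methods resist a single generic paradigm.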
On the contrary, data-separable computation can be decomposed into a series of independent subtasks by partitioning the datasets. In other words, little external information is needed while processing each partition, and:

Y = F(f(x_0), f(x_1), . . ., f(x_n))    (3)

where f is the processing algorithm for a sub-partition, F is the algorithm for integrating the partial results into the complete output Y, x_i is a sub-partition of X, and X = {x_0, x_1, . . ., x_n}. Data-separable computing is well supported by cloud computing and is known as embarrassingly parallel or pleasingly parallel computing in computer science (Barcelona-Pons et al. 2019). This type of computing has been widely applied in RSBD using quantitative remote sensing, artificial intelligence, etc. For example, Pekel et al. integrated the computing power of 10,000 computers to map 30 m global water bodies over almost 30 years based on an expert system classifier (Pekel et al. 2016). Ni et al. extracted 10 m rice-growing areas in northeast China using machine learning (Ni et al. 2021). Xie et al. produced 30 m annual irrigation maps for the United States from 1997 to 2017 based on MODIS and Landsat data using a random forest classifier (Xie and Lark 2021). All of the above studies relied on the parallelization of data-separable computing. Additionally, the studies used Earth Engine's 'image tiling data distribution model' for spatial partitioning and 'streaming collections' for temporal partitioning (Gorelick et al. 2017).
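The separable pattern above can be sketched in a few lines: each partition x_i is processed independently by f, and F merges the partial results. The per-tile "classifier" here is a toy threshold count, and a thread pool stands in for what a cloud platform would distribute across nodes.

```python
from concurrent.futures import ThreadPoolExecutor

def f(tile):
    # Per-partition algorithm f: a toy classifier counting 'water' pixels
    # (the 0.5 threshold and the values are purely illustrative).
    return sum(1 for pixel in tile if pixel > 0.5)

def F(partials):
    # Combination algorithm F: merge the independent per-tile results into Y.
    return sum(partials)

def run(partitions):
    # Y = F(f(x_0), f(x_1), ..., f(x_n)); partitions never exchange data.
    with ThreadPoolExecutor() as pool:
        return F(pool.map(f, partitions))
```

Because no partition needs another's data, the same code scales from a thread pool to thousands of machines with only the executor changing.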
Overall, data-inseparable computing requires custom implementation, while data-separable computing can be implemented on generic processing paradigms. However, despite the similarities in parallel computing paradigms, each specific analysis differs in its data partitioning strategy, analysis algorithm, and combination algorithm. In addition, the data partitioning scheme and the parallel algorithm are strongly coupled. Therefore, a unified framework is urgently needed to regulate and constrain remote sensing algorithms and their distributed execution paradigms.

Challenges in the DIKW hierarchy
This section has introduced the four primary challenges facing RSBD, which are expected to be resolved with cloud-based approaches. There are significant differences between RSBD and traditional data processing: RSBD involves both remote sensing technology and computer science, which fully illustrates its multidisciplinary nature. Traditionally, remote sensing science has only been applied at limited scales and needs to be re-examined to support large-scale applications.
In addition, computer science and technology have traditionally been oriented toward conventional business and need to be tailored to remote sensing to support the management and mining of RSBD. This cross-fertilization perspective is closely related to the four challenges. Rowley (2007) defined the Data-Information-Knowledge-Wisdom (DIKW) hierarchy, a concept that helps explain the relationships between the four identified challenges and RSBD (Figure 2). DIKW Data corresponds to raw remote sensing data, which is associated with the challenges of data storage and metadata management. DIKW Information corresponds to data that conforms to homogeneity, such as the data cube or analysis ready data (ARD) ('CEOS' 2022). Homogeneity must be addressed when transforming DIKW Data into DIKW Information. The parallel computing problem exists both in transforming DIKW Data into DIKW Information and in transforming DIKW Information into DIKW Knowledge. Finally, DIKW Knowledge is transformed into DIKW Wisdom using human intelligence, which assists real-world decisions and practices. The process of transforming and formalizing wisdom from knowledge remains challenging for RSBD.

Cloud-based big data storage
Currently, there are five leading cloud computing and big data technologies applicable to RSBD storage, including the Object Storage System (OSS), Distributed File System (DFS), Relational Database Management System (RDBMS), NoSQL, and array database management systems (array DBMSs).
The Object Storage System (OSS) manages data in the form of objects, each identified by a globally unique identifier. In particular, the RESTful API allows data access via HTTP, which means an object can easily be accessed from anywhere on the network. In addition, OSS can manage additional metadata for data descriptions. The Distributed File System (DFS) stores files across a cluster of commodity machines; the Hadoop Distributed File System (HDFS) is a representative example (Shvachko et al. 2010). DFS supports more comprehensive interfaces and features in comparison to OSS (Weil et al. 2006). However, DFS can suffer from primary-node bottlenecks, which restrict the upper limit of scaling to some extent. For example, the metadata of the Hadoop Distributed File System is stored in the primary node's memory, which restricts the number of files that can be stored (Shvachko et al. 2010). In addition, the files stored in DFS can only be accessed through the mounted hosts, which is not as flexible as OSS.
Figure 2. Relationship between the DIKW pyramid and the four major concerns.
The relational database management system (RDBMS) is a widely used database model (Codd 1970). RDBMS is oriented toward transactional operations and focuses on the properties of atomicity, consistency, isolation, and durability (ACID). The reliability and stability of RDBMS have been greatly improved over decades of development. Some RDBMS, such as PostgreSQL, can manage spatial data and have been widely used for remote sensing metadata management. However, there are apparent bottlenecks in the load capacity of a standalone RDBMS. NewSQL, a cloud-based distributed form of RDBMS, was proposed to enhance the scalability of traditional RDBMS for massive structured data (Pavlo and Aslett 2016); Google Spanner is an example of this technology (Corbett et al. 2013).
The onset of Web 2.0 drove an increasing need to manage large amounts of unstructured data, which gave rise to NoSQL, e.g. MongoDB, HBase (Vora 2011), and Google Bigtable (Chang et al. 2006). Unlike RDBMS, NoSQL does not support transactional operations and ACID properties. Instead, NoSQL emphasizes the principles of consistency, availability, and partition tolerance (CAP) (Gray and Reuter 1992; Cattell 2010), thus improving concurrency, efficiency, and horizontal scalability. The various NoSQL technologies have distinct technical characteristics suited to different application scenarios. They are generally classified into four types: wide-column, key-value, document, and search engine. More detailed reviews of NoSQL can be found in the literature (Davoudian, Chen, and Liu 2018; Guo and Onstein 2020).
Array database management systems (array DBMSs) are a type of scientific database dedicated to the storage and management of array-like scientific data (Zalipynis 2021). Array DBMSs are often grouped as NoSQL. However, we decided to introduce them individually due to their natural affinity for remote sensing and geospatial data (Zalipynis 2020). Array DBMSs support SQL-like queries and operations on arrays (e.g. resampling and aggregations). Such advanced manipulations are essential for remote sensing data management because they simplify data retrieval and pre-processing (Zalipynis 2021;Appel et al. 2018). In addition, array DBMSs generally optimize I/O through the underlying technology, which is beneficial for online RSBD computing. For example, TileDB optimizes the performance of concurrent I/O and sparse arrays by turning multiple random-writes into a single sequential write (Papadopoulos et al. 2016). Furthermore, some advanced array DBMSs support horizontal scaling. For example, SciDB (Brown 2010) and RasDaMan (Baumann et al. 2018) support distributed deployment, and TileDB supports share-nothing cloud computing architecture, storing files as AWS (Amazon Web Services) S3 objects in the cloud. However, importing massive scientific data into an array DBMS may be time-consuming. In addition, as far as we know, no cloud services directly provide array DBMS services, and users can only build an array DBMS through IaaS.
Users can build most of the storage technologies introduced above using IaaS. In addition, cloud services also provide out-of-the-box storage services (SaaS). SaaS helps users focus more on business by avoiding database maintenance. Table 1 identifies the open-source technologies corresponding to different storage technologies and mainstream cloud computing SaaS products.

Cloud-based data storage for RSBD
Massive remote sensing data storage consists of raster data and metadata storage.

Raster storage
Currently, remote sensing raster data is generally stored in cloud-optimized data formats in an OSS or DFS. In addition, it can be stored in NoSQL databases in the form of tiles or in an array DBMS in the form of arrays. Cloud-optimized data formats are designed to improve I/O performance in cloud storage. As mentioned earlier, OSS does not support file opening and writing operations, which is inconvenient for computing. For example, even if only a portion of the data is accessed, the complete remote sensing dataset must be downloaded, resulting in high redundant overhead. Cloud-optimized formats for remote sensing data, such as Zarr and Cloud Optimized GeoTIFF (COG), have emerged to improve the performance of RSBD storage. Among them, COG, a GeoTIFF format optimized for cloud computing and storage, has been widely adopted ('Cloud Optimized GeoTIFF' 2022). When accessing a COG within OSS, a 16-kilobyte header is parsed first; subsequently, a portion of the remote sensing data can be read on demand without downloading the entire dataset. Furthermore, a COG file retains the original data resolution and embeds internal overviews at lower resolutions, significantly improving the data retrieval efficiency of web-based applications. As a result, COG improves retrieval efficiency in both DFS and OSS. COG is attracting increasing attention; for example, it replaced GeoTIFF in 2021 as the standard data format for Landsat Collection 2.
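The on-demand access pattern above boils down to byte-range planning: once the header reveals each internal tile's (offset, size), a client can merge the needed tiles into a few HTTP Range requests instead of downloading the file. In practice this is handled transparently (e.g. by GDAL's remote-file access); the sketch below only illustrates the range-merging idea, with the (offset, size) pairs assumed to come from a parsed header.

```python
def plan_ranges(tiles, gap=0):
    """Merge overlapping/adjacent byte spans for a set of (offset, size) tile
    entries, so a few Range requests fetch exactly the tiles needed.
    `gap` optionally coalesces ranges separated by small holes."""
    spans = sorted((off, off + size) for off, size in tiles)
    merged = [list(spans[0])]
    for start, end in spans[1:]:
        if start <= merged[-1][1] + gap:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    # Return inclusive ranges, matching HTTP 'Range: bytes=start-end' semantics.
    return [(s, e - 1) for s, e in merged]
```

Allowing a small `gap` trades a few wasted bytes for fewer round trips, which is usually the right trade against object-storage latency.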
OSS can store remote sensing tile and raster data, and cloud-optimized data formats, especially COG, are becoming major storage formats for OSS. Amazon, Microsoft, and Google currently use OSS to store a large amount of remote sensing data ( Table 2). The data stored in an OSS can be easily shared through the Internet and leveraged for analysis and visualization, promoting open sharing and the use of remote sensing data. Users can efficiently access the data with little charge and without considering data management and server maintenance, greatly promoting FAIR principles.
DFS is a mature big data storage technology and the dominant storage system of RSBD platforms. Distributed file systems cannot share data as easily as OSS but provide advantageous functions such as append writes and modifications. A computing platform has no apparent requirement for direct data sharing, but it has a clear need for functions such as append writes and random reads. For example, Digital Earth Australia (successor of the Australian Geoscience Data Cube) stores Landsat archives in the Lustre system (Braam 2019) within the Australian National Computational Infrastructure (Lewis et al. 2017). The JRC Earth Observation Data and Processing Platform (JEODPP) stores remote sensing datasets in EOS, a DFS designed for the European Organization for Nuclear Research (Peters, Sindrilaru, and Adde 2015; Soille et al. 2018), and Earth Engine stores a large amount of remote sensing data in Google Colossus.

NoSQL supports the storage of large amounts of unstructured data and can therefore preserve RSBD raster data. Wide-column NoSQL is suitable for storing vast amounts of unstructured data such as tiles. GeoTrellis (Kini and Emanuele 2014), a Spark-based RSBD computation engine, stores remote sensing tiles and vector geospatial data in wide-column NoSQL. However, wide-column NoSQL lacks support for data indexing, especially the spatial indexes needed for remote sensing data. Therefore, users must carefully design the row keys to enable spatiotemporal queries. In-memory key-value NoSQL databases store data as key-value pairs in distributed memory and can cache intermediate data for online remote sensing computation. For example, Earth Engine stores the cached data from its services in an in-memory database to reduce secondary access latency (Gorelick et al. 2017). However, the volume of remote sensing data can far exceed memory capacity. Thus, in-memory key-value NoSQL is not suitable for persisting remote sensing data.
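One common row-key design for such spatiotemporal queries interleaves the bits of the tile's x/y indices (a Morton or Z-order code), so that tiles close in space tend to be adjacent in key order, and appends the date so a single prefix scan returns one tile's time series. The key layout below is hypothetical, not taken from any specific system discussed here.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Morton (Z-order) code: interleave the bits of two tile indices so
    spatially nearby tiles get numerically nearby codes."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
    return z

def row_key(x: int, y: int, date: str) -> str:
    """Hypothetical wide-column row key: zero-padded spatial code, then date.
    A scan over the spatial prefix reads the full time series of one tile."""
    return f"{interleave_bits(x, y):010d}:{date}"
```

Zero-padding keeps lexicographic key order consistent with numeric code order, which is what makes prefix scans over a spatial neighborhood efficient in wide-column stores.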
Document NoSQL provides spatial indexing capability and can support the storage of numerous individual datasets with more comprehensive capabilities. Wang et al. (2019) and Cheng et al. (2020) implemented storage systems for RSBD based on MongoDB. They stored the data in MongoDB after further slicing and achieved the management of remote sensing data based on MongoDB's rich query capability. Overall, NoSQL supports more advanced functions than OSS or DFS, such as spatiotemporal data management. It has obvious advantages in RSBD application scenarios, but its cost is much greater than that of DFS or OSS. As far as we know, petabyte-level NoSQL-based RSBD storage still requires further research and exploration.
Unlike other storage systems that store remote sensing data as files, array DBMSs store and manipulate remote sensing data as arrays. Array DBMSs optimize efficiency based on the underlying storage technology. More importantly, array DBMSs support high-level array manipulation for managing remote sensing data, including data storage, metadata management, indexing, etc. In other words, an array DBMS can be used as an RSBD data management system with a few additional components. For example, EarthServer (Baumann et al. 2016) implements the storage of a large amount of remote sensing data using RasDaMan (Baumann et al. 1998). Furthermore, some array DBMSs can process remote sensing data (e.g. reprojection, resampling) and have been applied in tandem with machine learning (Ordonez, Zhang, and Johnsson 2019). However, array DBMS-based RSBD data management is still in its infancy. One of the major challenges is the cost of ingesting data into an array DBMS. Specifically, all raw datasets must be pre-processed into the unified format recognized by each type of array DBMS, which is a time-consuming process (Lewis et al. 2017).

Metadata storage & management
RSBD metadata storage and management are mainly based on NoSQL, RDBMS, and NewSQL.
RDBMS is suitable for storing and managing remote sensing metadata and is the mainstream technical approach for RSBD management systems. For example, several systems, including that of Zhou et al. (2021), have stored remote sensing metadata in distributed MySQL and PostgreSQL systems. The Open Data Cube stores metadata in PostgreSQL (Killough 2018). In our case, we managed the metadata of petabytes of remote sensing datasets (ten million metadata entries) with a standalone PostgreSQL instance. However, there is an upper limit to the storage capacity of RDBMS. Therefore, RDBMS is only suitable for the rapid construction of structured remote sensing metadata storage for medium-sized datasets. Cloud-native NewSQL overcomes these scalability problems and is ideal for storing large metadata volumes. For example, Earth Engine uses Spanner as one of its data management tools (Gorelick et al. 2017). However, open-source NewSQL technologies and the corresponding cloud services are still maturing.
NoSQL can store unstructured data such as heterogeneous remote sensing metadata (Guo and Onstein 2020). In RSBD storage systems, data and metadata are rarely modified after they are ingested into the database. Therefore, compared to RDBMS, NoSQL's lack of ACID support is acceptable for RSBD management systems. Search engines such as Solr and Elasticsearch are a type of NoSQL that supports full-text search. Search engine NoSQL builds inverted indexes in memory to achieve high performance and robust full-text indexing. Fan et al. (2017) stored remote sensing metadata in SolrCloud and implemented a full-text search. Their approach supports advanced functions such as fuzzy queries and adapts well to the complex structures of remote sensing metadata. However, implementing such storage systems with search engine NoSQL is costly. Wide-column and document NoSQL are also used for metadata storage. For example, Earth Engine adopted the Bigtable storage system (Gorelick et al. 2017). Wang et al. (2019) and Cheng et al. (2020) used MongoDB to store both raster data and metadata to achieve integrated data/metadata storage.
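The inverted-index principle behind search-engine NoSQL can be illustrated with a minimal, self-contained sketch; the metadata records, field names, and values below are purely hypothetical:

```python
from collections import defaultdict

# Hypothetical scene metadata entries (fields and values are illustrative only).
metadata = [
    {"id": "LC08_2021_120_034", "platform": "Landsat-8", "cloud_cover": 12,
     "description": "surface reflectance scene over coastal wetlands"},
    {"id": "S2A_2021_T50TMK", "platform": "Sentinel-2A", "cloud_cover": 3,
     "description": "level-2A scene over agricultural plains"},
]

# Build an in-memory inverted index: token -> set of record ids.
index = defaultdict(set)
for record in metadata:
    for token in record["description"].lower().split():
        index[token].add(record["id"])

def search(*terms):
    """Return ids of records whose description contains all query terms."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()

print(search("scene", "coastal"))  # -> {'LC08_2021_120_034'}
```

Production systems such as Solr add tokenization, ranking, and fuzzy matching on top of this basic structure, but the token-to-documents mapping is the core idea.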
Data model: scene, ARD, data cube, composite layer
The ultimate purpose of data storage is to prepare data for analysis, and thus, homogeneity must be considered. The homogeneity of raster data is not prominent in the case of small-scale remote sensing analysis, and computing is mainly implemented within a scene by a standalone processing node. There is an increasing requirement for advanced remote sensing data organization due to the expansion of spatiotemporal scales, which we propose as a data model related to data organization, data structure, and data production methods. It is necessary to adopt a suitable remote sensing data model for large-scale analysis within cloud computing for specific application scenarios (algorithms, parallel computing strategies, etc.). In the past few years, several data organization schemes have been developed for RSBD analysis, including scenes, ARD, data cubes, and composite layers (Figure 3).
Scenes are the most basic organization for remote sensing data and have been widely applied during the past several decades. Conventionally, satellites collect remote sensing data in strips and transmit them to the ground segment. The ground segment processes the remote sensing data through corrections and evaluations and partitions them according to a regular grid (e.g. the Military Grid Reference System adopted by Sentinel). These pre-processed data are the most common form of remote sensing data, corresponding to the remote sensing images fetched from data providers (e.g. USGS, ESA Copernicus). The use of cloud computing greatly diminishes the cost of acquiring scene data. In addition, Cloud Optimized GeoTIFF (COG) technology enhances the efficiency and flexibility of remote sensing data access.
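COG's efficiency comes from storing imagery in fixed-size internal tiles that a client can fetch selectively (e.g. via HTTP range requests) instead of downloading the whole scene. A minimal sketch of the tile-index arithmetic involved, assuming a hypothetical 512 x 512 internal block size:

```python
def tiles_for_window(col_off, row_off, width, height, block=512):
    """Return (tile_col, tile_row) indices of the internal COG tiles
    intersecting a requested pixel window, so only those byte ranges
    need to be fetched."""
    first_col, first_row = col_off // block, row_off // block
    last_col = (col_off + width - 1) // block
    last_row = (row_off + height - 1) // block
    return [(c, r)
            for r in range(first_row, last_row + 1)
            for c in range(first_col, last_col + 1)]

# A 500 x 500 window starting at pixel (1000, 300) touches four 512 x 512 tiles.
print(tiles_for_window(1000, 300, 500, 500))  # -> [(1, 0), (2, 0), (1, 1), (2, 1)]
```

Libraries such as GDAL perform this bookkeeping internally; the sketch only shows why partial reads of a scene are cheap once the data are internally tiled.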
Analysis Ready Data (ARD) (Potapov et al. 2020) was initiated by the Committee on Earth Observation Satellites (CEOS) ('CEOS' 2022). The original intent of ARD was to lower the threshold for users to leverage the data by reorganizing discrete datasets into regular blocks with a fixed size, resolution, and projection, thus minimizing data processing and correction (Dwyer et al. 2018). ARD must be radiometrically and geometrically corrected, and evaluated for quality at the pixel level using a uniform standard to achieve homogeneous physical characteristics (Frantz 2019). In addition, ARD is commonly reorganized into globally unified Discrete Global Grid Systems in the form of tiles. For example, the USGS produces Landsat ARD using the latest Collection 2 archive standard, and Zhong et al. (2021) developed an ARD data product based on GaoFen satellite data.
Figure 3. Relationships between scenes, ARD, data cubes, and composite layers.
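The tile arithmetic of a fixed ARD grid can be sketched as follows; the grid origin below is an illustrative assumption, and the tile size is loosely modelled on the 150 km (5000 pixels x 30 m) tiles of the USGS Landsat ARD grid rather than its exact parameters:

```python
def ard_tile_id(x, y, origin_x=-2565585.0, origin_y=3314805.0, tile_size=150000.0):
    """Map a projected coordinate (metres) to an (h, v) ARD tile index.
    The origin is hypothetical; tile_size mimics 150 km ARD tiles."""
    h = int((x - origin_x) // tile_size)   # horizontal index grows eastward
    v = int((origin_y - y) // tile_size)   # vertical index grows southward
    return h, v

# A point 310 km east and 20 km south of the grid origin falls in tile (2, 0).
print(ard_tile_id(-2565585.0 + 310000.0, 3314805.0 - 20000.0))  # -> (2, 0)
```

Because every observation is snapped to the same fixed tiles, any two ARD acquisitions over the same tile are directly comparable pixel by pixel, which is what makes the later stacking into data cubes straightforward.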
Most previous RSBD research stacked multiple datasets into a 'composite' before analysis according to specific rules, which we term the composite layer (Thorp and Drajat 2021). The composite layer refers to layer-like datasets produced by pre-processing and combining all available data over a certain spatiotemporal range, such as remote sensing data products (Gong et al. 2020). We borrowed the term 'layer' from geographic information systems (GIS) to emphasize that there is one and only one value for each pixel in the region of interest. The composite layer is characterized by its ideal homogeneity and is thus the best input data model for non-time-series analysis. It is also the remote sensing data model closest to the 'Information' stage of the DIKW (data-information-knowledge-wisdom) hierarchy. For small-scale applications, scenes or ARD can be approximated as the composite layer, especially when the scale of the region of interest is comparable to that of scenes or ARD tiles. However, at large scales, full data coverage is not guaranteed for medium- and high-resolution remote sensing data because of the data acquisition model or long revisit periods. Therefore, the differences between scenes and layers are more prominent in large-scale applications.
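The one-value-per-pixel property of the composite layer can be sketched with a toy compositing rule; here a per-pixel maximum over valid observations (as in a maximum-value composite), with all data illustrative:

```python
NODATA = None

def composite(stack, rule=max):
    """Collapse a time stack of equally sized 2-D grids into a single
    composite layer: one value per pixel, chosen by `rule` among the
    valid (non-NODATA) observations."""
    rows, cols = len(stack[0]), len(stack[0][0])
    out = [[NODATA] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            valid = [layer[r][c] for layer in stack if layer[r][c] is not NODATA]
            if valid:
                out[r][c] = rule(valid)
    return out

# Three observations of a 2x2 area; None marks cloudy/missing pixels.
t1 = [[0.2, None], [0.5, 0.1]]
t2 = [[0.4, 0.3], [None, 0.2]]
t3 = [[None, 0.6], [0.1, None]]
print(composite([t1, t2, t3]))  # -> [[0.4, 0.6], [0.5, 0.2]]
```

Note that the rule discards all but one observation per pixel, which is exactly the information loss relative to the data cube discussed below.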
Recently, the data cube has gradually become the focus of RSBD research, especially cloud-based RSBD applications (Lewis et al. 2016). The data cube reorganizes and stacks ARD along a time dimension and dissects the data according to a regular grid. A data cube can be a sparse collection of time-series data or a dense mosaic of the best available values over time. In contrast to the composite layer, which discards multiple available values, the data cube aggregates all available ARD datasets over time to approximate the layer as closely as possible. It is the best input for large-scale remote sensing analysis (especially time-series analysis). However, the data cube tends to be sparse in medium- and high-resolution remote sensing applications. Xu et al. (2022) further extended the concept of the data cube to improve homogeneity and proposed Computation Ready Data (CRD). CRD considers the continuity of remote sensing data and diverse computational needs, which promotes the use of interpolation and spatiotemporal fusion algorithms to fill missing data in the data cube. CRD further reorganizes data according to computational needs and bridges the gap between the data model and computation.
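The gap-filling idea behind making a sparse data cube computation ready can be sketched for a single pixel's time series; the linear interpolation below is only one of the filling strategies mentioned above, and the values are illustrative:

```python
def fill_time_series(values):
    """Linearly interpolate missing (None) values in a pixel's time
    series. Leading/trailing gaps are filled by the nearest valid
    observation; a fully empty series is returned unchanged."""
    known = [(i, v) for i, v in enumerate(values) if v is not None]
    if not known:
        return values[:]
    out = []
    for i, v in enumerate(values):
        if v is not None:
            out.append(v)
            continue
        left = max(((j, w) for j, w in known if j < i), default=None)
        right = min(((j, w) for j, w in known if j > i), default=None)
        if left is None:
            out.append(right[1])
        elif right is None:
            out.append(left[1])
        else:
            (j0, v0), (j1, v1) = left, right
            out.append(v0 + (v1 - v0) * (i - j0) / (j1 - j0))
    return out

filled = fill_time_series([0.2, None, None, 0.8, None])
print([round(v, 2) for v in filled])  # -> [0.2, 0.4, 0.6, 0.8, 0.8]
```

After filling, every time step has a value, so downstream algorithms can treat the cube as a dense array.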
As shown in Figure 3, the relationships between the data models can be summarized as follows.
1. The data volume and information quantity decrease from left to right. The scene datasets retain the most information and the largest volume. The generation of ARD filters out poor-quality data and reduces the data volume. The data cube screens out data outside a specific spatial and temporal range according to certain conditions, which further decreases the available data. Ultimately, the composite layer generally reduces the dimensionality of the data cube and therefore possesses the smallest data volume. The smaller the data volume, the more convenient it is for transmission and sharing. Consequently, the composite layer is the ideal data model for data propagation.
2. There is a significant difference in the complexity of data production between data models. The production of ARD from raw scene data involves time-consuming and computation-intensive processing, such as rigorous radiometric and geometric correction. In contrast, the creation of the data cube or composite layer is mostly a reorganization of datasets with relatively low computational complexity. The cost of such data-intensive processing has been significantly reduced by cloud computing technologies (e.g. COG). Hence, it is not efficient to produce ARD on demand, whereas the online construction of the data cube can be quickly implemented with cloud computing (Giuliani et al. 2020).
3. Any processing will cause a loss of accuracy, and the accuracy loss differs between data models. The loss of accuracy caused by the correction process in ARD production is acceptable and unavoidable in most RSBD applications (Qiu et al. 2018). However, the production of data cubes or composite layers affects accuracy relative to the original datasets. In most cases, data cubes transform the original datasets with different projections and resampling, and composite layers filter values according to customized rules. Therefore, storing datasets in the form of data cubes or composite layers can lead to irreversible losses in accuracy compared with ARD.
4. Homogeneity gradually increases from left to right. ARD datasets possess homogeneous physical characteristics, while data cubes and composite layers approach spatiotemporal continuity. Additionally, layers and data cubes are closer to the array data model in computer science. Therefore, both are suitable for implementing remote sensing analysis.
In this section, the current cloud-based storage technologies are introduced and cloud-based remote sensing data storage technologies are reviewed. Finally, we provide insights into the data models for RSBD applications. Table 3 summarizes the remote sensing data storage schemes and data models for fifteen systems and studies produced from 2016 to 2021. The conclusions are drawn as follows.
1. NoSQL, DFS, array DBMSs, and OSS can store raster data, while NoSQL and RDBMS can manage remote sensing metadata. Figure 4 further summarizes the characteristics of the four types of NoSQL databases other than array DBMSs.

2. For cloud-based RSBD raster storage, OSS is the mainstream solution for sharing open data, while DFS is the primary solution for RSBD platforms. In the context of cloud computing, public cloud-based OSS services can significantly lower the cost of RSBD management. Therefore, RSBD systems in recent years have been more inclined to use OSS. NoSQL and array DBMSs have great potential, but there are still limitations such as a high cost of use.
3. Both NoSQL and RDBMS can be used to implement cloud-based RSBD metadata storage, and the choice of technology depends on the specific application scenario. On the one hand, NoSQL can manage the complete archive of remote sensing metadata, which is difficult to achieve with RDBMS (Cheng et al. 2020). On the other hand, the functionality and performance of RDBMS have been widely demonstrated. For example, RDBMS is more cost-effective for managing small- and medium-volume RSBD, whereas NoSQL is more applicable for larger RSBD systems.
4. Data models are becoming critical for RSBD applications. The cost of producing ARD is generally high and results in an acceptable loss of accuracy (D'Andrimont et al. 2021). Therefore, it is appropriate to store remote sensing data as ARD in the cloud for common applications. In contrast, data cubes and composite layers have less production overhead, suffer from significant accuracy loss, and possess a much smaller data volume than scenes (Sudmanns et al. 2020). Therefore, it is better to produce data cubes or composite layers on demand in the cloud before propagation. Most importantly, the composite layer and data cube models are more suitable for implementing remote sensing computing due to their homogeneity.

Cloud-based big data processing
Technologies for cloud computing and big data are complex, specialized issues that are beyond this work's scope. This section introduces three active and promising processing technologies for RSBD computing: simple batch, MapReduce, and array-based processing (Figure 5). In addition, we introduce a lightweight virtualization technology known as containerization.

Simple batch processing
Simple batch processing refers to a simple processing model where each task is independent of others. This method has been used for more than twenty years. Some examples include the Portable Batch System (PBS), Azure Batch, and AWS Batch (Casado and Younas 2015;Henderson 1995). Typically, users pre-define the execution program and execute a series of identical computing tasks in a batch. Each input set corresponds to a processing instance and outputs a result. Each independent computing task does not affect other computing tasks, and such tasks can be executed asynchronously. Simple batch processing has been widely used in offline batch processing. However, simple batch processing cannot accomplish complex analysis because it does not support inter-task communication.
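A minimal sketch of the simple batch paradigm using Python's standard library; the per-scene job is a hypothetical stand-in for a real processing step such as atmospheric correction:

```python
from concurrent.futures import ThreadPoolExecutor

def process_scene(scene_id):
    # Stand-in for an independent per-scene job (e.g. atmospheric correction).
    return scene_id, f"{scene_id}_corrected"

scenes = [f"scene_{i:03d}" for i in range(8)]

# Each input corresponds to one processing instance; because the tasks do
# not communicate, they can be submitted and executed asynchronously.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(process_scene, scenes))

print(results["scene_003"])  # -> scene_003_corrected
```

Batch services such as AWS Batch or PBS apply the same pattern at cluster scale: a pre-defined program, a queue of independent inputs, and one output per input.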

MapReduce processing
MapReduce (Wang et al. 2010) is a popular batch processing model for cloud and distributed processing that was first announced by Google in 2004 (Dean and Ghemawat 2008). MapReduce adopts ideas from functional programming and uses two core operators, Map and Reduce, to implement per-record and aggregate operations, respectively. In 2010, Zaharia et al. developed Spark, a memory-based distributed processing engine built on the MapReduce model (Zaharia et al. 2010). Spark supports a richer set of operators and multiple programming languages for flexible batch processing. In addition, Spark optimizes scheduling by building directed acyclic graphs (DAGs) for workflows and employing a 'lazy' mechanism: some of Spark's operators only record the requested transformations and defer the actual computation so that the execution route can be optimized. Spark preserves intermediate data with a memory-based data model called Resilient Distributed Datasets (RDDs). RDDs can improve computation efficiency by up to a hundred times compared with MapReduce (Zaharia et al. 2010), especially for tasks with multiple iterations such as machine learning and deep learning (Lunga et al. 2020). However, the large data volume of data-intensive computations often exceeds memory capacity, leading to decreased efficiency. Cloud providers offer services that help users quickly implement MapReduce jobs: users can build their own MapReduce clusters on virtual machines or directly use cloud-hosted Hadoop or Spark services, such as Amazon Elastic MapReduce (Amazon EMR) and Google Dataproc.
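The Map and Reduce operators can be illustrated without any cluster framework; the toy records below stand in for observations distributed across workers, and the per-tile statistic (mean cloud cover) is illustrative:

```python
from functools import reduce
from itertools import groupby

# Toy records: (tile_id, cloud cover %). In a real job these would be
# spread across many workers; plain Python illustrates the paradigm.
records = [("t1", 20), ("t2", 60), ("t1", 40), ("t2", 80), ("t1", 60)]

# Map: emit (key, (sum, count)) pairs for each record.
mapped = [(tile, (cc, 1)) for tile, cc in records]

# Shuffle: group intermediate pairs by key.
grouped = {k: [v for _, v in g]
           for k, g in groupby(sorted(mapped), key=lambda kv: kv[0])}

# Reduce: aggregate each key's values into a mean cloud cover.
sums = {tile: reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), vals)
        for tile, vals in grouped.items()}
means = {tile: s / n for tile, (s, n) in sums.items()}

print(means)  # -> {'t1': 40.0, 't2': 70.0}
```

Frameworks like Hadoop and Spark distribute exactly these three phases (map, shuffle, reduce) across machines and handle failures and data locality for the user.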

Array-based processing
Array-based processing is a computational technology for scientific array data (Lu, Appel, and Pebesma 2018). Well-known libraries such as NumPy and OpenCV can implement a variety of complex processes for array data. However, such software is limited by the resources of a single machine and cannot be efficiently applied to very large arrays. Recent technologies implement array manipulation with parallel batch processing for large-scale arrays, e.g. Dask (Rocklin 2015), the computational functions of Earth Engine, and array DBMSs. Earth Engine adopts FlumeJava (Chambers et al. 2010), a MapReduce processing engine, to manipulate remote sensing data as array-like Collections or Images. However, from the usage point of view, array-based processing differs from MapReduce processing: users do not need to deal with the underlying parallel processing implementation, such as task scheduling, but directly use arrays as the processing object. At present, array-based processing is still developing and has some known flaws. Array DBMSs lack the flexibility for implementing user-defined functions, and performance optimization is still not as robust as in traditional batch processing technologies (Mehta et al. 2017). However, most scientific data, including remote sensing data, mainly consist of arrays, and the direct manipulation of arrays can shield users from many underlying problems. Therefore, we believe that array-based processing will play an increasingly important role in scientific big data processing in the future. All three processing technologies introduced above can implement large-scale remote sensing analysis. In Table 4 we briefly summarize the mainstream open-source technologies and the related cloud-based solutions.
However, batch processing technologies require remote sensing analysis algorithms to be rewritten, which hinders researchers from performing scientific analysis based on RSBD to some extent (Mehta et al. 2017; Camara et al. 2016).
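The blockwise strategy used by Dask-style array processing can be sketched with the standard library: the array is split into chunks, per-chunk partial aggregates are computed (here in parallel threads), and the partials are combined into a global statistic. The data and chunk size are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(1, 101))          # stand-in for one band of a large raster
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]

def partial(chunk):
    # Per-chunk partial aggregate: (sum, count).
    return sum(chunk), len(chunk)

# Chunks are processed independently, then combined; this is the essence
# of blockwise array computation behind Dask and similar engines.
with ThreadPoolExecutor() as pool:
    parts = list(pool.map(partial, chunks))

total, count = map(sum, zip(*parts))
print(total / count)  # -> 50.5
```

The user-facing API of an array engine hides the chunking entirely: one writes `array.mean()` and the engine generates and schedules the per-chunk tasks.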

Containerization
Containerization is one of the core concepts of cloud-native computing (Li 2019; Pelle et al. 2019). It is widely used in cloud-based processing such as FaaS and serverless applications. Containerization is a virtualization technology that packages algorithms into lightweight containers together with the runtime environments they need, e.g. Docker (Merkel 2014). This technology allows the stable execution of various remote sensing algorithms in different host environments, which improves the portability of remote sensing algorithms by decoupling them from the host machine (Xu et al. 2022). The technology is essential for cloud-based RSBD because it can port remote sensing algorithms from the local environment to the cloud. Containers can be leveraged jointly with the big data processing technologies mentioned above (e.g. batch processing). In addition, they can be managed by container orchestration platforms in the cloud. Kubernetes, one of the most famous open-source container orchestration platforms, was developed by Google and contributed to the Cloud Native Computing Foundation in 2015 (Bernstein 2014). Borg (Verma et al. 2015), Google's internal predecessor of Kubernetes, is used for resource scheduling and load balancing within Earth Engine.

Cloud-based computing for RSBD
As introduced in Section 2.4, RSBD applications can be grouped into two types, data-separable computing and data-inseparable computing. We introduce cloud-based computing for RSBD from these two perspectives.

Data-separable computing
Data-separable computing covers most remote sensing analysis applications, such as pixel-based and tile-based analysis, and has a simple, highly feasible parallelization strategy. The following example illustrates the processing paradigm. Bishop-Taylor et al. (2021) extracted the coastal zone changes for Australia from 1988 to 2019. The study partitioned the region of interest into sub-regions by space and then produced data cube datasets for each sub-region. Subsequently, the shoreline changes from 1988 to 2019 in each sub-region were extracted using simple batch processing. Finally, the study combined all sub-regions and obtained the complete coastline change for Australia. This example outlines most of the routines for data-separable computing with a simple batch, including (1) partitioning the region of interest, (2) constructing data cubes for each partition, (3) computing each partition, and (4) combining the partitions. This kind of computing has been widely applied in RSBD, especially in data pre-processing and mapping. Each partition's implementation can be considered an individual computing task, which corresponds to the computing paradigm of simple batch processing. In addition, other processing models, such as MapReduce, can perform such computing as well.
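The four-step routine can be sketched as follows; the region, grid size, and per-partition 'analysis' are all illustrative stand-ins:

```python
def partition(bbox, n):
    """Step 1: split (xmin, ymin, xmax, ymax) into an n x n grid of sub-regions."""
    xmin, ymin, xmax, ymax = bbox
    dx, dy = (xmax - xmin) / n, (ymax - ymin) / n
    return [(xmin + i * dx, ymin + j * dy, xmin + (i + 1) * dx, ymin + (j + 1) * dy)
            for j in range(n) for i in range(n)]

def analyze(sub):
    # Steps 2-3 stand-in: build the per-partition input and compute on it
    # (e.g. shoreline extraction); here it just reports the sub-region's area.
    xmin, ymin, xmax, ymax = sub
    return (xmax - xmin) * (ymax - ymin)

subregions = partition((110.0, -45.0, 155.0, -10.0), 3)   # illustrative extent
results = [analyze(s) for s in subregions]                # independent tasks
combined = sum(results)                                   # step 4: merge
print(len(subregions), round(combined, 6))  # -> 9 1575.0
```

Because each `analyze` call touches only its own sub-region, the list comprehension can be replaced by any batch executor or MapReduce job without changing the algorithm.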

Data-inseparable computing
There are dependencies between data-inseparable computing tasks, and thus, the input data should be homogeneous, such as data cubes or composite layers. Currently, data-inseparable computing is mainly processed using MapReduce and array-based processing. Implementing parallel remote sensing processing algorithms on MapReduce provides flexibility (Chebbi et al. 2018): MapReduce and Spark offer a range of flexible operators from which complex computational pipelines can be built to implement diverse parallelized remote sensing analyses. A number of studies have implemented various data-inseparable remote sensing computations based on MapReduce-like technologies, such as K-means clustering (Chebbi, Boulila, and Farah 2016), parallelized mosaics (Jing et al. 2017), deep learning (Sun et al. 2019), and object-based segmentation. GeoTrellis (Kini and Emanuele 2014) is a Spark-based MapReduce processing technology oriented to remote sensing data processing that provides a set of APIs for remote sensing analysis. The MapReduce paradigm has also been widely used in the spatial information domain, e.g. SpatialHadoop (Eldawy and Mokbel 2015), GeoSpark (Yu, Wu, and Sarwat 2015), and GeoMesa (Hughes et al. 2015). However, there are still some known shortcomings in implementing data-inseparable computing with MapReduce. Compared with high-performance computing and message passing interface programming, MapReduce technologies attempt to shield the user from the underlying programming issues as much as possible; nevertheless, remote sensing researchers still must deal with complicated manual parallel computing issues. The performance of some memory-based MapReduce technologies, such as Spark, is considerable, yet they are not well suited for 'data-intensive' RSBD analysis (Makrani et al. 2018): memory-based processing requires a large memory capacity to handle large remote sensing datasets, but yields only a slight improvement for algorithms with low iterative computation requirements.
Apart from MapReduce processing, array-based processing is also suitable for data-inseparable computing. Arrays, especially composite layers, are the principal unit of remote sensing analysis. Array-based processing generally pre-defines diversified APIs to manipulate arrays in parallel. The built-in array operations and machine learning algorithms can be used for remote sensing analysis, such as vegetation index extraction and classification (Villarroya and Baumann 2020). For example, Earth Engine is dedicated to remote sensing analysis with many algorithms specializing in remote sensing science, such as time series analysis (Hamunyela et al. 2020) and cloud detection algorithms (Qiu, Zhu, and He 2019). Array-based processing shields remote sensing researchers from the underlying implementation of parallel computing, thus helping them focus more on the computation itself. However, current array-based processing still has specific problems. Most array-based processing technologies, such as array DBMSs or Dask, are not designed for remote sensing applications, and the provided APIs do not support professional remote sensing processing; therefore, additional computing implementations are required as post-processors. For example, Pagani and Trani (2018) constructed a data system with RasDaMan and implemented subsequent remote sensing analysis in R. Furthermore, the highly packaged APIs reduce the flexibility of implementing user-defined algorithms; for example, users cannot extend Earth Engine with analyses it does not support. Fortunately, this problem might be gradually alleviated with the development of open-source array-based processing technologies. Furthermore, the efficiency of current array-based processing (e.g. Dask) is lower than that of MapReduce (e.g. Spark), with some additional restrictions (Fu et al. 2020). Finally, current array-based processing services are mainly maintained by the open-source community (e.g. Pangeo), restricting the available computational resources.
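The need for inter-task communication in data-inseparable computing can be made concrete with one iteration of K-means clustering: per-partition (map-side) statistics must be merged across all partitions before the centroids can be updated, so no partition can finish on its own. All values below are illustrative:

```python
# Pixel values split across two partitions, plus initial cluster centres.
partitions = [[0.1, 0.2, 0.9], [0.15, 0.8, 0.95]]
centroids = [0.0, 1.0]

def assign(partition, centroids):
    """Map step: per-partition partial (sum, count) for each cluster."""
    stats = [[0.0, 0] for _ in centroids]
    for v in partition:
        k = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
        stats[k][0] += v
        stats[k][1] += 1
    return stats

# Reduce step: merge partial statistics from EVERY partition, then update.
partials = [assign(p, centroids) for p in partitions]
merged = [[sum(p[k][0] for p in partials), sum(p[k][1] for p in partials)]
          for k in range(len(centroids))]
centroids = [s / n for s, n in merged]
print([round(c, 4) for c in centroids])  # -> [0.15, 0.8833]
```

The map step parallelizes cleanly, but the centroid update is a global reduction, which is exactly the kind of cross-task dependency that MapReduce's shuffle phase (or an array engine's internal scheduler) exists to handle.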

RSBD platforms
The RSBD platform is the best practice for RSBD applications. No RSBD storage or computing system can work in isolation; instead, RSBD computing should work closely with storage systems, and platforms should implement RSBD applications by integrating storage and computing. Figure 6 summarizes the five architectures of RSBD platforms as characterized by the data model. The four components from left to right are storage, the data model, processing, and output.
1. Type 1 platforms parallelize the computation of scenes using simple batch processing and data-separable computing. The outcome of such platforms is delivered in the form of scenes. In terms of architecture, Type 1 platforms are composed of two parts, the data storage system and the simple batch processing system. The data storage system is responsible for providing data services to the processing system, and the simple batch processing system manages the analysis algorithms and maintains execution tasks. Such platforms were widely used in early batch remote sensing data production and analysis. For example, the European Space Agency's G-POD project pre-populates many algorithms for remote sensing scenes and offers batch processing services.
2. Type 2 platforms are the mainstream technology route adopted for large-scale remote sensing analysis and computation. They consist of three main parts: the data storage system, the data cube production system, and the simple batch processing system. Type 2 platforms are similar to Type 1 platforms in that they support data-separable computing. In contrast, this type of platform processes the data into data cubes as the input and then outputs data cube datasets. For example, JEODPP divides the computational tasks of the data cube into independent subtasks and then uses HTCondor to implement multi-task batch processing (Soille et al. 2018; Corbane et al. 2017).
3. Type 3 platforms adopt MapReduce processing and support data-inseparable computing. For example, the ScienceEarth platform processes remote sensing data as a data cube and implements large-scale remote sensing analysis using Spark (Xu et al. 2022; Xu et al. 2020). Though MapReduce is powerful, it is not friendly to remote sensing researchers: users must deal with detailed parallel computing configuration issues by themselves, which prevents the widespread use of MapReduce in RSBD applications. Therefore, such platforms have a high difficulty threshold for their implementation and use.
4. Type 4 platforms use array-based processing as the data processing system and support data-inseparable computing. Such platforms consist of three parts: the data storage system, the data cube production system, and the array-based processing system. Specifically, the data cube production system is responsible for processing remote sensing scene data into array-like data types (e.g. data cubes and composite layers), while the array-based processing system pre-defines many high-level APIs for the analysis of large arrays.
5. Type 5 platforms use array DBMSs as the core component and support data-inseparable computing. An array DBMS can store and manage massive remote sensing data with distributed systems and, more importantly, can internally implement array-based processing. Therefore, an array DBMS can independently constitute an RSBD platform. For example, Kuo et al. (2018) built an RSBD platform based on SciDB, which supported shared-memory parallelization (SMP) and distributed-memory parallelization (DMP). However, only simple array processing is supported by current array DBMS technology; therefore, other computing systems are needed in most cases to achieve complex analysis and computation of remote sensing data (Pagani and Trani 2018).
In addition to the five types of platforms, some studies further package technologies on top of existing RSBD platforms. For example, BACI offers web-based services built on Google Earth Engine (Poortinga et al. 2018). OpenEO proposes a unified API for accessing the data management and computational resources of different platforms (Schramm et al. 2021). OpenEO exposes data and services as virtual data cubes, allowing compatible Earth observation cloud services to be compared in depth rather than accessed directly.
This section identifies four significant cloud-based processing technologies and provides insights into data-separable and data-inseparable computing for RSBD. In addition, we summarize the five major architectures of current RSBD platforms in terms of their data models, data storage, and type of processing. As shown in Table 5, we summarize the representative platforms according to their technology routes. The conclusions are provided below.
1. Simple batch processing has been widely applied in RSBD applications, especially in data-separable computing, while data-inseparable computing is becoming a popular topic in RSBD. Data-inseparable computing can be implemented using MapReduce or array-based processing. Although current array-based processing is in its infancy, this promising technology may lead to the next generation of remote sensing processing.
2. There are five principal types of RSBD platforms. Type 1 and Type 2 architectures can only handle the most basic data-separable computing. Type 3 platforms have the best scalability for various algorithms but demand a certain proficiency in computer technologies. Type 4 platforms are the most advanced implementation at present; however, they require high construction costs and pose significant technical challenges. Type 5 platforms have reasonable prospects but are restricted by the development of array DBMSs.
3. Regardless of the type of RSBD platform, no single platform can comprehensively handle all diversified RSBD applications (Ni et al. 2021;Xie and Lark 2021;Chen et al. 2021). Under these circumstances, multiple platforms should be jointly leveraged. Thus, data must be transported between platforms (Lu and Wang 2021;Arvor et al. 2021;Brombacher et al. 2020), and a standard RSBD data model is necessary. Data cubes are a promising approach for meeting this standard.

Conclusion
The joint promotion of space technology, remote sensing science and technology, and computer technology has enabled humans to enter a new era of Remote Sensing Big Data (RSBD). RSBD is the best means to realize global remote sensing analysis and will become the backbone of Big Earth Data, making additional contributions to sustainable human development (Guo et al. 2017; Guo et al. 2021). This research introduces state-of-the-art technologies and research trends concerning RSBD storage and computing. In addition, this study provides a preliminary overview of the basic issues of RSBD for computing experts and remote sensing researchers, especially those who work with large-scale remote sensing research and applications. We hope to stimulate readers' interest in RSBD through this research. However, RSBD is a broad topic, and we could not thoroughly review it from every perspective. Therefore, some issues (e.g. open data, data security, confidentiality, visualization, etc.) are not covered. Additionally, despite our extensive literature references, there are inevitably controversial representations and claims in the manuscript. Remote sensing data mainly consist of raster data and metadata. Many studies have already accumulated valuable research on RSBD storage. RSBD storage technology has achieved satisfactory results in moving data from a single machine to clusters and from the local environment to the cloud. Among these, cloud-optimized remote sensing data storage technology and cloud storage technologies represented by OSS, NewSQL, and NoSQL will become mainstream technical solutions for RSBD storage and management. Data homogeneity is necessary for large-scale analysis. In this regard, current RSBD technology mainly adopts four data models with different homogeneity characteristics: scenes, ARD, data cubes, and composite layers.
The data cube has good compatibility with cloud computing and can support RSBD analysis through homogeneous multi-dimensional remote sensing data. We expect that this data model will become the mainstream RSBD data model in the future. According to the computational paradigm, RSBD computing can be divided into data-separable and data-inseparable computing. Data-separable computing has better parallelism and remains the dominant computation type in RSBD analyses. On the other hand, data-inseparable computing is a current hot topic. There are three mainstream cloud-based big data technologies for remote sensing data analysis: simple batch processing, MapReduce processing, and array-based processing. Simple batch processing has been widely used. The MapReduce-based parallel computing paradigm can be applied to more complex remote sensing analysis applications. Moreover, array-based processing provides an easy-to-use and promising technical tool for remote sensing scientists. In this review, five types of RSBD platform architectures were summarized. Type 4 has the most advanced architecture, adopting the data cube model and array-based processing.
The multidisciplinary methodologies of RSBD are growing rapidly, and RSBD platforms have already played important roles in various fields. In the future, however, the integration of satellite, airborne, ground-based, geospatial, and even socioeconomic data will be needed to produce more effective solutions to real-world problems, which will bring additional challenges. Novel real-time data processing paradigms, artificial intelligence algorithms, and innovative tools are expected to be assembled into RSBD platforms to extract more of the desired information from remote sensing data. The emergence of Federated Learning, a novel distributed processing paradigm, guarantees data security during training and RSBD analysis. In addition, recent advancements in deep learning models provide a promising approach for RSBD interpretation over large scales. GPU-accelerated RSBD platforms are an ideal host for neural network architectures and open datasets. The high-quality information extracted at a global scale will provide a new impetus for improving RSBD methodologies and platforms in remote sensing. In addition, this information will aid the scientific community in assessing global disaster risk, monitoring climate change, and addressing the United Nations Sustainable Development Goals (SDGs).

Disclosure statement
No potential conflict of interest was reported by the author(s).