NDRank: optimised parallel search for weather analogues

ABSTRACT Global meteorology data are now widely used in various areas, but one of their applications, weather analogue search, still requires exhaustive scans of the whole historical dataset. We present two optimisations for state-of-the-art weather analogue search algorithms: a parallelization and a heuristic search. The heuristic search (NDRank) limits the final number of results and performs an initial search on a lower-resolution dataset to find candidates that, in a second phase, are locally validated against the full-resolution data. These optimisations were deployed in the Cloud and evaluated with ERA5 data from ECMWF. The proposed parallelization attained speedups close to optimal, and NDRank attains additional speedups higher than 4. NDRank can be applied to any parallel search, adding similar speedups. A substantial number of executions returned the same set of analogues as the existing exhaustive search, and most of the remaining results presented a numerical value difference lower than 0.1%. The results demonstrate that it is now possible to search for weather analogues faster (even compared with parallel searches) with little to no error. Furthermore, NDRank can be applied to existing exhaustive searches, providing faster results with a small reduction in the precision of the results.


Introduction
Multidimensional arrays are data structures where each element, composed of multiple values, is indexed with multiple dimensions. Currently, one common use of multidimensional data is in the representation and storage of spatio-temporal data, where the dimensions are time, latitude, and longitude. Some datasets also have an additional altitude dimension. In these datasets, for each data point, multiple values (observations, parameters, or simulation results) may be available. This data structure is widely used to represent physical-world data in astronomy (Zhang et al., 2012), climatology, meteorology, or geophysics because the data collected in these fields fit an array-like model well (Stonebraker, Duggan, et al., 2013) and provide a suitable representation of the physical world.
One of the possible uses for these available data is the finding of weather analogues for various applications, either to predict the future evolution of some weather parameter or simply to find past periods of time with similar weather evolution. Analogue forecasting (Bergen & Harnack, 1982; Toth, 1989) is a weather forecasting method in which a prediction for a time period is made, in its simplest implementation, by looking for a moment in the past when the atmosphere presented a similar weather state. For the prediction of the weather for a certain forthcoming period, the weather analogue algorithm first searches the historical data for a time period where the relevant features follow the searched pattern. After finding an analogue in the past, the prediction of the weather for the forthcoming period can be inferred from the evolution of the past weather following the found weather analogue.
There are different approaches to performing the comparison between a time period and historical data, with respect to the selection of features or the similarity calculation, but all resort to an exhaustive search of the historical data. All available algorithms need to compare the features of the input period with those of every time instance in the past. Since this requires access to all the historical data, it is a brute-force approach. These algorithms return a list of past weather analogues sorted by similarity. With this type of search algorithm, requiring access to all the records in the time dimension, the increase in the size of the data becomes a problem. To handle the increase in search time, new optimisation techniques have been developed, such as parallelization of the search, new comparison and similarity algorithms, or even statistical methods to define similar weather patterns.
The concept of weather analogue can now be used separately from the main objective of weather forecasting, in applications that handle activities related to or affected by weather conditions. Some human activities, such as energy production or consumption, are tightly related to the weather conditions, and their prediction and study require access to past periods with similar weather characteristics and evolutions. The use of these past data (weather analogues) will be application dependent (for instance, for the validation of machine learning models), but potentially fast access to past data is preferred over data precision.
Weather multidimensional datasets tend to be massive in volume due to the addition of relevant features, an increase in the time and space covered by observation data, and the addition of modelling results on a global scale. For example, ERA5's data (Hersbach et al., 2017) for 1 year can reach the scale of terabytes. This scale becomes problematic with respect to at-rest storage, but more importantly, it increases the search time when using exhaustive, brute-force approaches in multiyear historical datasets.
The sheer amount of data and the type of search for weather analogues render generic Database Management Systems (DBMS) unusable for this application. These DBMS are optimised to execute queries based on indexes and not to perform searches based on feature values. As such, even array-oriented DBMS, such as SciDB (Zhang et al., 2011) and RasDaMan (Baumann et al., 1998), are unfit (Stonebraker, Brown, et al., 2013) and inefficient in the execution of weather analogue searches. All this leads to the development of specific applications targeted at weather analogue searches and supported by low-level data storage (such as GRIB (Dey et al., 2007) and NetCDF (Rew & Davis, 1990)) of the multidimensional data.
Independently of the internal data organization, programming language, or even parallelization techniques, most implementations still require iterating over the whole historical dataset to find the most similar analogues. Since these searches and comparisons are done on feature values, data locality and organization may allow the development of more efficient or directed searches.
The work here presented is driven by two observations on the way weather analogues are used and on the characteristics of the underlying weather data. With respect to the use of weather analogue results, although the algorithms return a list of analogues ordered by similarity, not all the resulting data are used: only the most similar are useful. Even when implementing algorithms that limit results to a certain similarity threshold, it is still necessary to search the whole historical dataset to guarantee that the returned results include the most similar analogues.
The second observation is that data usually used in weather analogues has high locality, i.e. adjacent data (in space and time) for a certain feature varies continuously and with low derivative. In weather data, there is a high probability that the neighbour positions do not differ significantly in value, since locations that are physically close to one another tend to be correlated and with small variations between them. The same observation can be made on the evolution of weather-related features in time for a certain position since, as time passes, the values of features usually change gradually and continuously. Taking this into consideration, if the resolution of the historical data (on the spatial or temporal dimension) is reduced, it is still possible to observe the same overall evolution and characteristics of the features on this new smaller dataset.
In this work, we apply the previously presented ideas to relax search procedures and to optimise execution time. This application results in a heuristic algorithm called NDRank: it performs searches on lower resolution datasets and only returns a user-defined number of results in two sequential phases. The first phase performs an exhaustive search on a lower resolution dataset that returns a list of analogues sorted by similarity. The best results from the first phase are used in the second phase to perform a local similarity calculation based on the high-resolution dataset. Besides the NDRank algorithm, we also present the results of parallelizing the search with a split of work in the time dimension.
The implementation of the proposed optimisations presents promising initial results. Using realistic datasets (with respect to features, size, and searches) we obtained close to optimal speedups with the parallelization, and we obtained further performance gains with the application of the search heuristic.
This document starts by detailing the problem of searching for weather analogues from a computer science perspective. Then it presents the proposed solutions (the parallelization of Brute-force and our new NDRank solution) and describes a general architecture of distributed systems capable of executing the Parallel Brute-force and NDRank. In the end, the evaluation process and discussion of the obtained results are presented.

Weather analogues for weather forecasting
Analogue forecasting is a weather forecasting method (Iseh & Woma, 2013) that consists in searching on past historical weather data for time periods that resemble the current weather state: the weather analogues. The forecast can be inferred by the evolution of the weather following the returned analogue.
If, for example, one wishes to make a weather prediction of what is going to happen in the next few days, by using the weather analogue approach, the current weather state is used to search in historical data for a period (known as a weather analogue) that resembles the current state. The weather's evolution after the weather analogue can be used to predict the current weather state's evolution.
As a forecasting method, weather analogues can predict and use features available on the dataset, use different similarity algorithms, or limit the search area, but always require a large multidimensional time-series dataset that details the state of the atmosphere for an extended period of time. From a computer science perspective, searching for weather analogues requires a specific input in a multidimensional time-series format and calculates the similarity value for every existing match (in a brute-force fashion) in the historical dataset.
The pseudo-code for the current implementation is presented in Algorithm 1 and will be referred to as Brute-force from now on.
Here, we generalize the similarity calculation algorithm and do not present any optimisation that can be done on the selection of the features or geographical area. Nonetheless, in all current implementations, the loop in lines 5 to 8 in Algorithm 1 iterates the whole dataset to compare all weather analogues with the searching sample. The function in line 6 computes the similarity between the input data and the subset of the past data. Some of the algorithms currently used to calculate similarity are Pearson correlation coefficient (PCC) and root mean square deviation (RMSD) (Mo et al., 2014).
The computational complexity of current implementations is O(mn), where n is the length of the historical data and m is the length of the input period (the analogue to be found). Due to the linear increase of search time with the increase of historical data, there has been some work developed to computationally optimise this approach.
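The O(mn) behaviour can be sketched in a few lines of Python with NumPy. This is an illustrative reconstruction, not the paper's implementation (which uses Xarray); the array shapes and the choice of RMSD as the similarity measure follow the text above.

```python
import numpy as np

def rmsd(a, b):
    # Root mean square deviation, one of the similarity measures
    # mentioned in the text (lower means more similar).
    return float(np.sqrt(np.mean((a - b) ** 2)))

def brute_force_search(history, pattern, top_n=None):
    # history: (n, lat, lon) historical data; pattern: (m, lat, lon)
    # input period. Every time window is scored, giving the O(mn)
    # cost described above.
    n, m = history.shape[0], pattern.shape[0]
    results = []
    for t in range(n - m + 1):                 # outer loop over time
        results.append((rmsd(history[t:t + m], pattern), t))
    results.sort()                             # sort by similarity
    return results[:top_n] if top_n is not None else results
```

A request for the top result over a dataset that contains the pattern itself returns that window with a score of zero.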
One of the proposed optimisations reduces the overall computational complexity: Yang and Alessandrini (2019) managed to modify the code inside the outer loop by replacing the sum-of-products operation with a convolution, thus reducing the complexity to O(n log(n)). Here, the inner loop reduces its complexity from O(m) to O(log(n)) but, in the main loop, the need to traverse the whole historical dataset remains. Another optimisation, by Raoult et al. (2018), tries to tackle the brute-force problem directly by creating a group of fingerprints that index the whole dataset. The main idea is to, prior to any search, create fingerprints of the whole historical dataset by means of wavelet compression (Raoult et al., 2018). Then, a search is executed by creating a fingerprint of the input and matching it against the collection of fingerprints. This match will then point to a position in the original dataset. This strategy looks promising, but it has some limitations.
Although most uses of weather analogues do not need the whole set of search results (only those with the best similarity), it is impossible to infer the location of the best analogues or to stop the search after finding an analogue with a sufficiently high similarity.
If the objective of the analogue search is the direct prediction of a certain variable, techniques exist to improve the precision of the results or even reduce the prediction time. Analogue Ensembles (AnEn) (Delle Monache et al., 2013) provide more precise forecasting, but still require a search of the whole historical observation datasets (Fanfarillo et al., 2021), so speeding up this search step remains of great importance. PAnEn parallelizes such a search using common parallel programming libraries (OpenMP and MPI) to take advantage of supercomputers' large storage and processing power. Another approach relies entirely on probabilistic methods to implement the forecasting (Fanfarillo et al., 2021); here, the precision of the results is worse than the classical Analogue Ensemble, but the execution times and memory requirements are much lower.
With the exception of statistical forecasting methods, all others require a search of the whole historical observation datasets. Even with the proposed optimisations, the cost of iterating over all the spatio-temporal dimensions is the major limiting factor for execution speed. Parallelization of this search exists, but requires access to dedicated supercomputer infrastructures. The access to this specialized computational infrastructure and the size of the data still limit the use of analogue searches in the private sector, in areas such as renewable energy production, management, or markets.

Proposed solution and architecture
With the need for weather prediction in new applications and the increase in the size and complexity of data, further optimisation of this type of search is of high value to the current state of the art, as all current strategies for finding weather analogues require processing and access to the whole dataset in the time dimension.
However, the current developments in the search for weather analogues are focused on reducing errors and applicability to new problems. Work on increasing the performance and scalability of these algorithms is limited and still allows further improvements.
In this work, we propose two different ways to optimise the search for weather analogues. The first corresponds to a classical parallelization where the data are independently distributed by different concurrent processes and the partial results are aggregated at the end. The second corresponds to the definition of a heuristic that takes into consideration the similarity of local data to reduce the search space.

Parallelization of Brute-force
When used by weather analogue algorithms, weather data can be easily and efficiently partitioned in the time dimension, a perfect fit for the divide-and-conquer paradigm of parallelization. Each parallel task searches for the pattern in an independent subset of the original data and, after all tasks have found their matches, the partial results are aggregated to produce the final list of weather analogues. Each parallel task receives as input the pattern to be searched and the limits of the time interval assigned to it. For example, if the dataset has 1 year of data and there are 4 processes, the first 3 months are assigned to one process, the second 3 months to another, and so on.
In theory, this solution can attain close to ideal speed-ups, reducing the processing time in line with the number of parallel processes, because each process will handle an evenly split subset of the data and the number of processing operations performed by each process does not change much. The list of analogues ordered by similarity can be produced by simply concatenating the partial results and re-ordering them on a central node, without much overhead.
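The split and merge steps above can be sketched as follows. This is a minimal illustration (function names are ours, not the paper's): the dataset's time axis is divided into near-even contiguous intervals, and the per-worker result lists are concatenated and re-sorted centrally.

```python
def split_time_ranges(n_steps, n_workers):
    # Divide [0, n_steps) into near-even contiguous intervals,
    # one per worker (the Time Interval strategy).
    base, extra = divmod(n_steps, n_workers)
    ranges, start = [], 0
    for w in range(n_workers):
        end = start + base + (1 if w < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges

def merge_partial_results(partials, top_n):
    # Concatenate the per-worker (similarity, timestamp) lists and
    # re-order them on the central node.
    merged = sorted(r for part in partials for r in part)
    return merged[:top_n]
```

For one year of 3-hourly data (2920 time instances) and 4 workers, each worker receives exactly 730 instances.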
One issue that may affect the algorithm's performance is related to the cost of transferring the data to each parallel process. Due to the dimension and size of data and network latency, if the data are explicitly transferred to each process at each execution, the communication delay may reduce the attained speed-up.

NDRank -heuristic guided solution
All existing optimisations for weather analogue searches try to decrease the execution time of the similarity calculation code, leaving the main loop to iterate the whole dataset. With an increase in the temporal length of the historical data, even with existing optimisations, the search time will increase. The previously proposed parallelization may reduce the overall search time, but the number of operations remains the same, allowing a novel optimisation to be applied to provide further gains.
Even if the search returns a limited number of results (those with the highest similarity), the algorithm still needs to perform a full search of the whole dataset, because it has no information on where the best results may be, thus requiring the search to continue up to the last time instance. The only viable way to reduce the number of iterations of the loop over the time dimension is to reduce the number of time instances in the historical dataset.
Here, we propose a heuristic search (called NDRank) that takes into consideration that:
• not all returned values are used in further processing; only a limited number with the highest similarity is useful, and
• most weather-related data values for a certain position in time and space are related to those of their neighbours.
The heuristic search is composed of two phases and returns a list of analogues with a length previously defined by the user. The first phase of the algorithm searches for weather analogues on a lower resolution version of the data. The first increase in efficiency will come from the use of a dataset with fewer time instances. The reduction of the time dimension resolution can be accomplished by aggregating and averaging each feature for every position with the nearby time instances. All other spatial dimensions can remain unaltered, or if deemed necessary, can also suffer a similar resolution decrease.
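The time-resolution reduction described above (aggregating and averaging nearby time instances) can be sketched with NumPy; the function name and the trimming behaviour are our illustrative choices, not the paper's.

```python
import numpy as np

def reduce_time_resolution(data, factor):
    # Aggregate and average each feature over groups of `factor`
    # consecutive time instances; the spatial dimensions are left
    # unaltered. Trailing instances that do not fill a full group
    # are dropped.
    n = (data.shape[0] // factor) * factor
    return data[:n].reshape(-1, factor, *data.shape[1:]).mean(axis=1)
```

With Xarray (the library used in the implementation), the equivalent operation would presumably be expressed as `ds.coarsen(time=factor, boundary="trim").mean()`.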
Due to the temporal relation of data in consecutive observations, the variation of data in the time dimension is gradual, and the lower resolution dataset will still maintain the broad evolution of every feature. The feature value evolution is still visible without the variations in a smaller period.
The first phase returns a list of candidates, sorted by similarity. This list points to the time periods where a suitable weather analogue (with the highest level of similarity) may exist. If all the returned candidate time intervals were evaluated against the original high-resolution dataset, no gains would be attained, so in our approach only a limited number of candidates is evaluated, corresponding to the user-provided number of useful and necessary weather analogues. The gains from the reduction of the number of operations performed on the whole historical dataset (first phase) outweigh the cost of the additional similarity calculation for this limited number of time periods.
The proposed solution (called NDRank) can then be described as Algorithm 2.
The production of the candidate list (line 3 in Algorithm 2) uses the same algorithm as the Brute-force approach, but searches on a low-resolution dataset, thus yielding a lower processing time. After finding the candidates, only the best are processed and validated against the original dataset (lines 5 to 8).
This newly defined solution has a computational complexity of O(N′M + RM), where N′ is the number of time instances in the low-resolution dataset, M is the size of the input (in its time dimension), and R is the number of desired results. Although it may look like the complexity of this solution is higher than that of the original Brute-force, it is necessary to take into consideration that the N′ in NDRank is much lower than the N of all other available solutions, due to the reduction of the dataset's resolution. Furthermore, the value of R can also be orders of magnitude lower than N.
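The two phases and their O(N′M + RM) cost can be illustrated with a self-contained sketch. This is a simplified reconstruction under stated assumptions: RMSD is used as the similarity measure, and each candidate's start time is mapped back to full resolution simply by multiplying by the reduction factor (the paper describes a local validation, whose exact neighbourhood is not detailed here).

```python
import numpy as np

def rmsd(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

def ndrank(history_hi, history_lo, pattern_hi, pattern_lo, n_results, factor):
    # Phase 1: exhaustive O(N'M) scan of the low-resolution dataset.
    m_lo = pattern_lo.shape[0]
    phase1 = sorted(
        (rmsd(history_lo[t:t + m_lo], pattern_lo), t)
        for t in range(history_lo.shape[0] - m_lo + 1))
    candidates = phase1[:n_results]        # keep only the R best

    # Phase 2: O(RM) validation of each candidate against the
    # full-resolution dataset.
    m_hi = pattern_hi.shape[0]
    results = []
    for _, t_lo in candidates:
        t_hi = t_lo * factor               # map back to full resolution
        window = history_hi[t_hi:t_hi + m_hi]
        if window.shape[0] == m_hi:
            results.append((rmsd(window, pattern_hi), t_hi))
    results.sort()
    return results
```

When the low-resolution dataset is produced by averaging pairs of time instances, a pattern copied from the full-resolution data is recovered exactly by the two-phase search.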
Since the first phase uses an unmodified Brute-force approach, any of the optimisations proposed in the literature or the parallelization described in Section 3.1 can easily be applied. Although reducing the time resolution of the original dataset will work for most applications, it is also possible to apply the same idea to the spatial dimensions and reduce the spatial resolution, obtaining further gains when searching in large areas. The creation of the low-resolution dataset should be done offline, before the execution of any search. The selection of the dimensions to process and the level of resolution decrease should be evaluated and defined so that their use offers the highest execution time gains without adding erroneous results.

Architecture
Since both approaches here presented complement each other, Figure 1 presents the architecture for the parallelization of both Brute-force and NDRank.
The proposed solutions (Figure 1) are based on the master-worker pattern and are independent of the parallel programming language, supporting execution infrastructure, and data storage. The supporting execution infrastructures should provide mechanisms to launch parallel tasks, to communicate the initial parameters between the master and the workers, to access distributed and partitioned data storage, and to synchronize/coordinate the workers (such as a message queue or barrier).
The Brute-force parallelization (Figure 1(a)) follows the classical Master-Worker pattern, where the master is responsible for receiving the input from the user (or other client) and sending it to all existing workers. Then it waits for a response from all the worker tasks and merges the results into a single response. In this case, each worker will only search on its partition of the original dataset.
In the NDRank parallelization (Figure 1(b)), worker tasks are responsible for executing the Brute-force search for weather analogues in the low-resolution dataset and, in the second stage, a verification of the analogue candidates in the original full-resolution dataset. With respect to execution control, besides the master task, there is also a merger task that operates after the termination of the Brute-force search on the low-resolution dataset and coordinates the start of the second-stage search on the original high-resolution dataset (corresponding to line 4 in Algorithm 2). In Figure 1(b), the merger is represented by a separate task supported by a message queue, but a simple barrier, code on the master, or other mechanisms to transfer information between the workers and the merger could be used instead.
The Brute-force and NDRank parallelization components (shown in Figure 1) also differ in the datasets accessed by each worker. In the Brute-force case, only the original full-resolution dataset is partitioned, so that each worker task processes an independent and similar amount of data, while in the NDRank parallelization, each worker accesses its own full-resolution and low-resolution dataset partitions.
In the parallelization of the Brute-force solution, a simple master-worker architecture is implemented and follows the regular bag-of-tasks processing steps: (1) The master process starts all workers and sends them the request information; (2) Each worker process executes a Brute-force search on a subset of the dataset; (3) The master process is idle, waiting for the termination of every worker process; (4) Whenever each worker finishes, it sends to the master the results; (5) After all workers have finished, the master sorts the results and returns them.
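Steps (1) to (5) above can be sketched as follows. This is only an illustration of the control flow: threads stand in for the worker virtual machines (the actual implementation communicates over gRPC), and `search_fn` is a placeholder for the per-partition Brute-force search.

```python
from concurrent.futures import ThreadPoolExecutor

def master_search(partitions, pattern, search_fn, top_n):
    # (1) start one task per data partition and send it the request;
    # (2)-(4) each task runs the search on its subset and returns its
    # partial results while the master waits;
    # (5) the master sorts the merged results and returns them.
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        futures = [pool.submit(search_fn, part, pattern) for part in partitions]
        partials = [f.result() for f in futures]   # master waits here
    return sorted(r for part in partials for r in part)[:top_n]
```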
The heuristic solution is more complex and works as follows: (1) The master process starts all workers and sends them the request information; (2) Each worker process executes a Brute-force search on a subset of the low-resolution dataset; (3) The master process is idle, waiting for the termination of every worker process; (4) Each worker process, after finding all analogues on its subset of the low-resolution dataset, submits them to the queue; (5) The merger process reads all the results from the queue and merges them into a final list of candidates (line 4 in Algorithm 2); (6) The merger then submits the candidates back to the queue; (7) The worker processes read the merged results and search for those analogues on the full-resolution dataset; (8) After processing all the candidates, each worker submits its results to the master and terminates; (9) After every worker has finished, the master sorts the partial results and returns them.
The proposed architecture can also implement the non-parallelized versions of Brute-force and NDRank if only one worker is available, and as such can be deployed on a single computer, not requiring the use of a parallel cluster of computers.

Data distribution
The strategy used to distribute each portion of the dataset to the various worker processes depends on the situation. Both the Brute-force search and the search on NDRank first phase are sequential searches that evolve by looking at the whole dataset. Therefore, the dataset can be distributed and partitioned with a contiguous Time Interval strategy as can be seen in Figure 2(a), where each worker process receives all the dates corresponding to a contiguous period. Since all parallel tasks are independent and the work to process each period is constant independently of the search pattern, the work distribution will also be even.
For the second phase of NDRank, a different data partition strategy must be chosen to keep a good load balancing across the worker nodes. Since the location of the candidates is unknown and not necessarily evenly distributed in time, a Time Interval distribution would require some workers to process more candidates.
To solve this problem, instead of giving a time interval of the dataset to each node, the dataset is distributed in a round-robin fashion, timestamp by timestamp, across the workers. Here, the data are partitioned depending on the time of day and not on the date, as illustrated in Figure 2(b). For instance, a year of data with hourly time resolution would be distributed among 12 processes so that each hour of all the days in the dataset is assigned to only one worker node (e.g. worker 0 will have 00:00 AM and 12:00 PM, worker 1 will have 01:00 AM and 01:00 PM, etc.).
Since each worker is going to have a portion of every day, month, and year existing in the dataset, its data are not continuous, and, after each worker processes a candidate against its hourly data, it is necessary for the master process to aggregate the partial data. This approach requires an additional partition of the data and an aggregation of the partial similarity computations, but again, guarantees an even distribution of work among all parallel processes.
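The round-robin assignment on the time of day can be sketched in a few lines (the function name is illustrative):

```python
from datetime import datetime, timedelta

def assign_by_hour(timestamps, n_workers):
    # Round-robin on the time of day: hour h goes to worker
    # h % n_workers, so every worker holds a slice of every day,
    # month, and year in the dataset.
    partitions = [[] for _ in range(n_workers)]
    for ts in timestamps:
        partitions[ts.hour % n_workers].append(ts)
    return partitions
```

With hourly data and 12 workers, each worker ends up with exactly two hours of every day (e.g. worker 0 holds 00:00 and 12:00), matching the example in the text.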

Implementation
The technologies selected for this preliminary version of the Brute-force and NDRank parallelization should be cost-effective, efficient, and widely available outside academia and the research community. As such, the typical parallel processing environments (such as OpenMP and MPI) were discarded and replaced by technologies more oriented to cloud computing. This change also allows the evaluation of the classical parallelization of weather analogue searches in a novel environment.
The master, worker, and merger processes were all implemented in Python 3, using the Xarray (Xarray core developers, 2014) library, and the communication between the master process and the workers was implemented using gRPC (gRPC Authors, 2013) on top of plain TCP sockets for the transfer of the input parameters. The implementation of the queue is based on a Kafka (Apache Software Foundation, 2022) cluster with a single broker.
These technologies are easily installed on a single computer but are also available in current cloud infrastructures. The development of the first versions proved to be simple, and the resulting system can be executed in existing parallel infrastructures and is close to production quality.
These systems were deployed in the Google Cloud Platform infrastructure (Google, 2008), where each worker is executed in a single virtual machine. The launch of the virtual machines and corresponding worker processes is implemented using disk images, preconfigured with Packer (HashiCorp, 2013). Each subset of the original and low-resolution data is stored in Persistent disks that are attached to each virtual machine, thus reducing data transfer times.

Data preparation
The dataset used in NDRank is based on the ECMWF ERA5 (Hersbach et al., 2017) dataset and, for a single year, its size ranges in the order of tens of gigabytes. It encompasses data for the whole Earth with a spatial resolution of 0.1° × 0.1° and a temporal resolution between 1 and 6 h. For each data cell, multiple weather features can be selected. For some of these features, values at different altitudes are also available.
The spatial references used by ERA5 are either a Reduced Gaussian Grid or Spherical Harmonics (Range Weather Forecasts, 2022). Since these grids do not use the latitude and the longitude as spatial coordinates, in order to simplify the resolution reduction process (the process of creating the low-resolution dataset from the original one), all requested data were processed and converted into a Regular Gaussian grid.
Since the similarity values calculated with the Pearson Correlation Coefficient did not differ significantly between the returned weather analogues when the raw parameter values were used, it was decided to use anomaly values instead of the real feature values. To create a more significant variation between time instances, both datasets were further processed to store anomaly values rather than the parameters' raw values.
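Converting raw values to anomalies amounts to subtracting a climatological baseline. The paper does not detail the exact baseline used, so the sketch below makes an assumption: each time instance is grouped by a caller-supplied key (e.g. day of year and hour), and the group mean is subtracted.

```python
import numpy as np

def to_anomalies(data, keys):
    # data: (time, ...) array of raw feature values; keys: one
    # hashable climatology key per time instance, e.g.
    # (day_of_year, hour). Each value becomes its deviation from
    # the mean of all instances sharing the same key.
    out = np.empty(data.shape, dtype=float)
    groups = {}
    for i, k in enumerate(keys):
        groups.setdefault(k, []).append(i)
    for idx in groups.values():
        out[idx] = data[idx] - data[idx].mean(axis=0)
    return out
```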
For some data files, no data were available at the 6:00 AM time instance, resulting in grids filled with zeros at that time instance. It was therefore decided to process the dataset to remove these empty grids.
In order to optimise the data transfer, the datasets were partitioned a priori into a fixed number of subsets, corresponding to the number of worker processes, as described in Section 3.4. For testing, this required experimenting with various numbers of subsets (the same as the number of worker processes), but in a production environment, this number of subsets can be fixed. For the Brute-force parallelization and the first phase of NDRank, the data were partitioned as described in Section 3.4. These partitions should be created offline, as soon as new data from ECMWF are made available to the public.

Evaluation and results
The evaluation process has two main objectives: understand the attained speedups of both solutions and evaluate whether NDRank heuristic results are correct and can replace the existing Brute-force algorithms.
To achieve this goal, a set of requests was executed on the original Brute-force, the parallel solution, and the heuristic version using four different datasets. For the parallel versions (Brute-force and NDRank), the same set of requests was executed with a varying number of worker processes (from 1 to 6). The list of results returned by the heuristic version was then compared with the top N results from Brute-force, verifying the correctness of the heuristic implementation.
For each request, the execution time is measured by the master process from the point at which the input parameters are sent to the worker processes to the point where all the results are received and merged by the master process. The master process writes the search results to separate CSV files, while the execution time per request is recorded in a separate file.

Environment
Both solutions were deployed on the Google Cloud Platform in the region us-east4-a. All worker processes were executed on n2-standard-2 (Google, 2022b) compute instances (at a cost of $0.0875 per hour (Google, 2022c)), and both the master processes and the Kafka processes were deployed on c2d-highcpu-2 (Google, 2022a) compute instances (at a cost of $0.06754 per hour (Google, 2022c)).
Each portion of the dataset processed by each worker is stored in separate SSD Persistent Disks that are attached to each worker at run-time. This file storage solution reduces the data access times.
As the main goal of the evaluation process is to study NDRank's performance, and not to find the most suitable hardware for it, these specifications were chosen to provide a good performance/cost ratio and to be a feasible deployment option. As such, the selected environment (hardware and supporting infrastructure) may not be the one offering the best performance, but it represents the widely available and easily accessible resources that companies outside academia and research institutions have access to on the cloud. This environment and set of technologies were also selected because professionals with the skills to put the proposed solution into operation on this type of cluster (virtual machines launched on a cloud provider's infrastructure) are becoming increasingly available.

Data and queries
In order to evaluate NDRank, the data used in the evaluation was extracted from the ECMWF ERA5 (Hersbach et al., 2017) dataset. Four testing datasets were used for the evaluation process (Table 1). Datasets 1 and 3 had a total of 7 parameters (such as Snow Depth and Sea Surface Temperature) and were organised in time intervals of 3 h. The other two datasets (2 and 4) had a total of 6 parameters (such as Mean surface net short-wave radiation flux, and Mean top net short-wave radiation flux, clear sky) and only had bi-daily data, available at 12 PM and 6 PM.
The differences between Datasets 1 and 2 and Datasets 3 and 4 are the time periods they cover and their resolution, and consequently the overall data size. Datasets 1 and 2 represent the period between January 1980 and December 2020.
After downloading and processing the data with the methods mentioned in Section 3.6, Dataset 1 had a size of 787 GB and Dataset 2 a total size of 293 GB. To reduce the duration of the whole evaluation process, instead of using both datasets at their full resolution, new versions with a resolution 4 times lower in both the latitude and longitude dimensions were created, leading to Datasets 1 and 2 having sizes of approximately 163 GB and 35 GB, respectively.
Datasets 1 and 2 allow the execution of the tests over large periods of time in an acceptable time frame. Nonetheless, there is also some value in executing a separate evaluation process on data with full resolution. To serve this purpose, Datasets 3 and 4 were created. They have the same parameters as Datasets 1 and 2, respectively, but cover a two-year period, from January 1980 to December 1981, having total sizes of 38.4 GB and 14 GB, respectively.
The first phase of the heuristic searches for candidates on low-resolution datasets. For each of the experiments, these were created by applying a further resolution reduction: latitude and longitude were both reduced by a factor of 4, and the time dimension by a factor of 2.
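The downsampling method is not detailed in this section; a minimal sketch, assuming the reduction is done by plain striding over a grid with a hypothetical (time, lat, lon) layout, could be:

```python
import numpy as np

def reduce_resolution(grid, time_factor=2, space_factor=4):
    """Build the low-resolution dataset for NDRank's first phase by
    keeping every time_factor-th timestamp and every space_factor-th
    latitude/longitude point of a (time, lat, lon) grid."""
    return grid[::time_factor, ::space_factor, ::space_factor]
```

Striding divides the number of first-phase operations by roughly time_factor x space_factor^2, which is where the single-process NDRank gains reported below come from.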
In the evaluation process, 10 different time periods were searched for weather analogues in each dataset. Each period spanned several time instances, ranging from 3 to 7 days. To add variability to the searches, for each period, two distinct parameters were searched in each dataset:
• Snow Depth (sd) and Sea Surface Temperature (sst) for Datasets 1 and 3;
• Mean surface net short-wave radiation flux (msnswrf) and Mean top net short-wave radiation flux, clear sky (mtnswrfcs) for Datasets 2 and 4.
For the heuristic solution, the requested number of results was 200.

Results
In order to retrieve the execution times, 20 searches were performed for each dataset, solution, and number of worker processes. The average of the 20 execution times was used to calculate the performance. Figure 3 presents the execution times and speedups for the parallelization of the Brute-force. This parallelization attained speedups close to the ideal, with values ranging from 5.52 to 5.94 when using 6 processes.
The results of the heuristic solution are presented in Figure 4. Here, besides the execution time for one process (Brute-force and NDRank), the graphics also present the results for the parallelization of the NDRank.
Comparing the results using only one process (Brute-force versus NDRank), it is possible to see that on all datasets the NDRank heuristic search is faster than the Brute-force, and that in three of the datasets the speedups are close to 4. These gains come from a reduction in the number of operations, due to the reduced-resolution dataset, even with the overhead of the second phase.
The second observation is that the parallelization of the two phases of the heuristic code also renders gains: further gains are attained with any number of parallel workers. The speedup grows with the addition of up to six workers: from 3.94 to 15.19 (3.8 times faster) in Dataset 2, and from 3.86 to 45.58 (11.8 times faster) in Dataset 3. It is also possible to observe that in Datasets 3 and 4 the speedups from parallelizing the heuristic are superoptimal.
Dataset 4, being the smallest, can give a broad indication of the minimum gains. The gains from the parallelization of the Brute-force are also close to optimal, but the gains from the non-parallelized heuristic are low; further parallelization allows a substantial reduction of the processing time. When comparing Dataset 1 (163 GB) and Dataset 2 (35 GB), the gains from the heuristic and its parallelization are also lower on the smaller dataset.
To evaluate the precision of the results produced by NDRank, it is necessary to compare the analogues returned by NDRank with the best ones returned by the Brute-force approach. In our experiments, a perfect execution of NDRank would return 200 results that exactly match those returned by the Brute-force approach. In this experiment, for each dataset, we executed 20 searches with different inputs (using both NDRank and Brute-force). Table 2 presents how similar the results of NDRank are to those of the corresponding Brute-force search. To create this table, the results of the same NDRank and Brute-force search were sorted by decreasing PCC, and analogues with the same index were compared and defined as similar. Each line presents the results of 20 different executions, and each cell contains the number of executions whose highest-PCC analogues are similar (up to the percentage in the header row) to the best Brute-force analogues. For instance, executions whose results are a total match with those of Brute-force appear in column 100%, while executions where only the first 50% of the results are similar (i.e. executions where the analogue at the 100th position differs) appear in column 50%. For Dataset 3, not every execution matched the results returned by Brute-force: for instance, there are two executions where the analogues in the first position (with the highest PCC) differ from the corresponding Brute-force ones, while 12 searches returned the same results as the Brute-force and in 4 searches the first 100 analogues were the best ones. In Dataset 4, the number of experiments that did not produce the same results as the Brute-force is even higher, but a finer comparison of the similarity between these results and those of the Brute-force approach is needed.
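The positional comparison behind Table 2 can be sketched as follows, assuming each result is a hypothetical (timestamp, pcc) pair; the function counts how many leading positions of the two rankings hold the same analogue:

```python
def count_matching_prefix(ndrank_results, brute_results):
    """Sort both result lists by decreasing PCC and count the number of
    leading positions where both rankings contain the same analogue
    (identified here by its timestamp)."""
    a = sorted(ndrank_results, key=lambda r: r[1], reverse=True)
    b = sorted(brute_results, key=lambda r: r[1], reverse=True)
    matches = 0
    for (ts_a, _), (ts_b, _) in zip(a, b):
        if ts_a != ts_b:
            break
        matches += 1
    return matches
```

Under this scheme, an execution whose 200 results all match falls in column 100%, while one whose first 100 analogues match (and whose 100th-position analogue differs) falls in column 50%.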
To further compare the results returned by NDRank with those of the Brute-force search, it is necessary to calculate the differences with respect to the PCC of the specific analogues found. Figure 5 shows the modulus of the difference between the resulting analogues of NDRank and Brute-force. The values presented on the Y-axis correspond to the PCC difference between the analogues (in the NDRank and Brute-force results) with the closest PCC.
Only data for Datasets 3 and 4 are shown because for Datasets 1 and 2, the graphics would only show a flat line at the origin since the resulting sets are exactly the same for NDRank and Brute-force.
The flat lines with a modulus of zero correspond to similar analogues found with both methods. For Dataset 3, the NDRank search for analogues on the Sea Surface Temperature (Figure 5(b)) returned exactly the same results as the Brute-force. In the other searches (Figure 5(a), (c) and (d)), most of the results were common to both methods. In Dataset 3, even the executions that render 0% of similar top analogues (column 0% in Table 2 and the green line in Figure 5(a)) have most analogues in common with the results from Brute-force.
With respect to Dataset 4, for some of the experiments the first differing analogue has a high PCC, but the differences between the remaining analogues are rather low. Most of the differences are in the order of 10⁻³ and the others are in the order of 10⁻². Only one search for analogues (using the Mean top net short-wave radiation flux, clear sky parameter, in Figure 5(d)) presented a PCC difference in the order of 10⁻¹.
Comparing the similarity of the results between NDRank and Brute-force, we concluded that for all the searches, when the results are not identical, most of them differ from the corresponding Brute-force results only in the fourth decimal place (e.g. 0.9994 vs. 0.9993).

Discussion
In terms of the speedup obtained by both parallel approaches, a significant gain can be verified. Due to the time dimension, there are also some differences between the results for Datasets 1 and 2 and those for Datasets 3 and 4.
The parallelization of the Brute-force system presented a near-optimal speedup for all datasets. A smaller dataset would normally cause a worse speedup, because the parallel component is smaller and, therefore, the cross-talking factor and the serial components have a more significant impact; however, this is not the case for the parallelization of the Brute-force system, since the speedups for Datasets 2 and 4 are still close to optimal.
The attained speedups are also in line with those of existing distributed-memory parallel solutions. PAnEn exhibits a similar evolution of the speedup for the analogue search: on the NCAR Cheyenne supercomputer the attained speedups were 1.8 and 3.3 (for 2 and 4 tasks), while the implemented and deployed parallel version attained on average 1.92 and 3.8 (also for 2 and 4 tasks). For the experiments done, the efficiency of the parallelization is on average 95%. This is in line with applying Amdahl's law to different ratios of serial/parallel code, rendering about 1% to 5% of non-parallelizable code. These results show that the proposed parallel version and the selected technologies scale well.
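Amdahl's law relates the measured speedup to the non-parallelizable fraction of the code; a small helper (the function name is ours) makes the comparison above concrete:

```python
def amdahl_speedup(serial_fraction, n_workers):
    """Maximum speedup predicted by Amdahl's law for a given
    non-parallelizable fraction of the code and number of workers."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_workers)
```

For instance, with 6 workers a 1% serial fraction bounds the speedup at about 5.71, while a 5% serial fraction bounds it at 4.8, bracketing the efficiencies observed in the experiments.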
The gains obtained in Datasets 1 and 2 when applying NDRank with only one process are significant in most executions, because the heuristic solution spends most of its time executing a Brute-force search, but on a much smaller dataset. The reduction in the spatial (4 times smaller in latitude and longitude) and temporal (2 times smaller) dimensions reduces the number of operations and the execution time of this portion of the code.
The reduction of these dimensions' resolution decreases the size of the datasets, but it also reduces the data input time and the memory footprint of the application, thus adding performance gains. On the smaller datasets (Datasets 2 and 4), these gains are lower, and the second phase has a larger effect on the final speedup: the speedups from applying the heuristic to these datasets are lower (when compared with Datasets 1 and 3) because the duration of the second phase remains constant.
The effect of the size of the dataset on the speedup can also be seen in the parallelization of the heuristic: the smaller the dataset (and the original processing time), the lower the speedups when adding more parallel workers. This can be seen when comparing Dataset 1 with 2 and Dataset 3 with 4. Since the low-resolution dataset in Dataset 2 is far smaller than the original, the serial part of the code and the cross-talking factor are now more noticeable: the code executed by the merger process and by the master process (when merging the results). Since Dataset 4 has a lower time resolution (two measurements per day), the time gains are also smaller: the effects of the non-parallelized code are more prevalent and overcome the gains of the parallelization, and (since the number of requested results was 200) the second phase still searches a large portion of the original dataset, further overwhelming the gains obtained by the first-phase search.
The gains from parallelizing NDRank on Dataset 3 (Figure 4(c)) are superoptimal: on Dataset 3 the search time for one process is 20.06 min, while for six processes it is 1.7 min, thus rendering a parallelization speedup of 11.8. This speedup of 11.8 is higher than the maximum expected value of 6 because of a side effect of the reduction of the data loaded by each process. Although the code is the same and the total number of operations split by the processes is constant, the way the data are loaded is different: with one process, every time a candidate is verified, a 1.6-GB file is loaded, while with 6 processes each file is only 300 MB. Since each process now loads and manages a smaller portion of the dataset, it is managed more efficiently by the underlying library, and there is a reduction in the data loading and management overhead.
The variety of datasets (with respect to the type of data, dimensions, and size) and the attained results indicate that, for large datasets, the possible speedups will be in line with, or even better than, those presented in this work. Larger datasets will either shrink the relative weight of the sequential part (in the case of the parallelization) or reduce the number of operations (in the case of the NDRank first phase).
Regarding correctness, the heuristic architecture always returned a perfect set of results for Datasets 1 and 2. The main factor behind the quality of the heuristic results comes mostly from how the system processes the candidates in the second phase. As the low-resolution dataset had half the time resolution of its original dataset, whenever a candidate pointed to a specific timestamp, all the immediately neighbouring timestamps were also investigated in the second phase. For example, if the low-resolution search pointed to the 3rd of January 1981 at 6:00 AM, the timestamps at 3:00 AM and 9:00 AM would also be investigated. This behaviour guarantees that the number of time instances investigated in the second phase is much larger than the final number of requested results.
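This neighbour expansion can be sketched as follows, assuming candidates are indices on the low-resolution time axis and the time resolution was halved (the function and parameter names are illustrative, not the paper's implementation):

```python
def expand_candidates(candidate_indices, time_factor=2, n_timestamps=None):
    """Map each first-phase candidate back to the full-resolution time
    axis and also include its immediate neighbours, so the second phase
    inspects the instants removed by the time-resolution reduction."""
    expanded = set()
    for idx in candidate_indices:
        centre = idx * time_factor
        for t in (centre - 1, centre, centre + 1):
            if t >= 0 and (n_timestamps is None or t < n_timestamps):
                expanded.add(t)
    return sorted(expanded)
```

For example, a candidate pointing at 6:00 AM on a 3-hour grid also schedules 3:00 AM and 9:00 AM for validation in the second phase.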
This was not the case for Datasets 3 and 4, where a significant difference in the results can be observed, even under the same request configurations as for Datasets 1 and 2. This demonstrates that the data being used always plays a role in the quality of the returned results. Although the heuristic does not always return the same set of results as Brute-force, it returns a set of results with similarities close to those of the Brute-force analogues.
The complexity of the heuristic solution looks worse than that of the original version and of some of the optimisations in the literature, but this does not affect the overall execution time, because the first phase is performed on a lower-resolution dataset and the second phase is limited to a small number of iterations, thus leading to a lower total number of operations.
Although the proposed solution was implemented and evaluated using cloud-based tools and infrastructures, it could easily be implemented using other technologies targeted at classical high-performance computing, such as MPI (Gropp et al., 1996) and OpenMP (Dagum & Menon, 1998), or at distributed computing infrastructures, such as Hadoop (The Apache Software Foundation, 2006) and Dask (Anaconda Inc., 2015).
Besides the reduction of the overall execution time (measured by the user from the beginning to the end of the search), the application of NDRank and its parallelization also reduces the total execution time across workers. By comparing the speedup results of the NDRank parallelization with the optimal speedup, we conclude that the attained values are higher. These higher values (for instance, speedups between 11 and 45 for 6 nodes) show that the total cost of resource allocation on a cloud infrastructure is reduced between 2 and 7 times.
These results make the solution more attractive for users that only have access to cloud infrastructures, where payment is required for renting processing nodes.

Conclusion and future work
Multidimensional arrays are a common representation in many scientific problems. They present challenges in their representation and processing as dataset sizes increase. Data collection in the physical world is growing rapidly, making the management of these arrays an increasingly frequent problem.
These challenges are clearly noticeable in the cases where search operations must be executed on top of these arrays. Without considering any context of the data used, a search operation is forced to evaluate every possible match. When data are difficult to represent efficiently and massive in size, any basic search operation becomes slow and ineffective.
A perfect example of this is the search for weather analogues, a search operation that tries to find a moment in time that resembles the current weather state. In our solution, we presented NDRank, an algorithm that takes advantage of the characteristics of weather datasets and uses a heuristic to guide the search over the original dataset. This heuristic first executes a Brute-force search on a dataset with a lower resolution and then uses those results to determine which regions should be searched in the original dataset.
Complementary to NDRank, a parallel implementation was also presented. The proposed parallelization, in conjunction with NDRank, presented positive performance results and very satisfactory correctness in its responses. NDRank leveraged the reduction of the data resolution to decrease execution times; it was possible to verify that the reduction of the time-dimension resolution had a strong positive impact on performance, and that the reduction of the spatial resolution only yielded gains in scenarios where retrieving data from disk to memory was a large bottleneck.
The results look promising: the analogues returned by NDRank were similar in quality to those of the Brute-force (or presented minimal differences, in the order of 0.1%), and the speedups were high. Furthermore, executing NDRank on a cloud infrastructure reduces resource allocation costs, making these results relevant for users that do not have access to dedicated parallel computing infrastructures and instead use the cloud for running their searches. This first version of NDRank can be further optimised (with respect to data management and access in the second phase) to attain additional gains.
Since our approach differs significantly from, and complements, other optimisations, it is our understanding that they can be integrated to add further gains. The optimisations described in the literature target the way similarity is calculated, and all of them require an exhaustive search over the whole historical dataset; the NDRank approach proposed here can be integrated into those solutions (Delle Monache et al., 2013; Hu et al., 2021), adding further gains at the expense of a slight reduction of the results' precision. Nonetheless, the use of NDRank can be configured (with respect to the number of returned analogues and the resolution reduction), tested, and the results' precision validated before being applied to any specific application.