Premonition Net, A Multi-Timeline Transformer Network Architecture Towards Strawberry Tabletop Yield Forecasting

Yield forecasting is a critical first step necessary for yield optimisation, with important consequences for the broader food supply chain, procurement, price negotiation, logistics, and supply. However, yield forecasting is notoriously difficult and often inaccurate. Premonition Net is a multi-timeline, time-sequence-ingesting approach towards processing the past, the present, and premonitions of the future. We show how this structure combined with transformers attains critical yield forecasting proficiency towards improving food security, lowering prices, and reducing waste. We find data availability to be a continued difficulty; however, using our premonition network and our own collected data, we attain yield forecasts 3 weeks ahead with a testing set RMSE loss of ~0.08 across our latest season.


Introduction
Precise and accurate yield forecasting is a key component in Fresh Produce (FP) Supply Chain Management (FSCM), since it plays a critical role in price negotiations, logistics, and scheduling. In particular, accurate yield estimates are required a minimum of 3 weeks ahead (in the strawberry domain), a period we call the horizon (Fig. 1), so that adequate time can be given to bidding, labour timetabling, logistics, and procurement. However, forecasting FP is incredibly difficult, especially over a 3-week horizon where any number of variabilities can exist, such as environmental fluctuations. Dealing with the latter would necessitate some weather forecasting to be considered, which is a problem in its own right. Instead, we show how good yield forecasting can be, improving upon current practices while leaving future works to delve specifically into weather forecasting.
Yield forecasting is difficult in particular due to the lack of data needed to develop forecasting models. Such data is mostly non-existent, or incredibly difficult to attain, largely because of the difficulty around data collection, the perceived sensitivity with which this data is held, and the lack of clear benefits to the digital collection of such data. We also see resistance to the positive dynamic impetus of modernisation requiring a departure from growers' previous fixed practices.
FP optimisation is of global strategic importance since horticulture and agriculture are some of the biggest producers of greenhouse gases, such that there can be a significant benefit to optimising production or minimising waste. In the UK, our government has committed to reducing greenhouse gas emissions to net zero by 2050, and agriculture has been expressly named as a key contributor of greenhouse gases in the United Nations Climate Change Conference 2021 (COP26). Inaccurate forecasting, or more specifically under/overestimation, leads to food waste and destruction costs or to importing FP from abroad. Assuming the cause of this discrepancy/variability is adverse weather conditions, those same weather conditions will have affected geographically proximate growing sites. In the UK, given our size, climate discrepancies usually mean fruit must be imported from abroad to meet any given procurement contract, as all the neighbouring growing sites will have suffered the same adverse environmental conditions and thus under-production.
* Corresponding author. E-mail address: gonoufriou@lincoln.ac.uk (G. Onoufriou).
Other works (outlined in more detail in Section 1.1) have sought to solve the lack of data availability in agriculture using satellite/remote-sensing data, together with various machine learning, statistical, and some deep learning techniques. In this paper, we show how we can collect data at some scale but with local/high granularity, including fruit images, weather conditions, and irrigation data. Here we focus specifically on strawberry tabletop yields and how we can predict them. We exemplify this approach at our Riseholme strawberry tabletop/polytunnel growing site and employ this data to create accurate forecasts with this 3-week horizon/window to meet the needs of the bidding and procurement process. We do all this in collaboration with Berry Gardens Growers (BGG), one of the UK's largest soft and stone fruit producers, and with their direction on industry standards to keep as close to the typical expectations as reasonably possible. We also have fortnightly visits by agronomists to ensure we are growing the strawberries satisfactorily.
[Figure caption: Past (purple-pink), present (blue) and premonition (yellow) timelines/windows overlaid on a rough reference depiction of strawberry yields through the years of 2020 and 2021, along with temperature, depicting the point of prediction relative to (at the seam of) horizon and history.]
We use this data in various neural network architectures in Section 2 and evaluate their performance in Section 4, since the literature would suggest that deep learning approaches are the most performant even for FP. Of these new architectures, we showcase our Premonition Network which seeks to improve upon current tabular/sequence prediction approaches using all three forms of context, the past, the present, and the premonition of the future. We use the past to learn the overarching distribution, we use the present to set some scale and granularity, and we use the premonition for variability from the standard distribution.

Related work
There are relatively few works in strawberry yield prediction using deep learning; instead, the majority focus on statistical machine learning, and almost none refer to privacy considerations (Maskey et al., 2019; van der Velde and Nisini, 2019; Jafari et al., 2020; Bouras et al., 2021; Gastli et al., 2021; Hopf et al., 2022; Paudel et al., 2021; Zhu et al., 2022; Bali and Singla, 2022). However, several papers have stressed that a lack of data availability (Pearson et al., 2019; Durrant et al., 2021, 2022), or more specifically a high expense of acquisition, significantly hinders the smooth application of state-of-the-art neural networks towards the creation of powerful forecasting models (Chen et al., 2019; Maskey et al., 2019; Jafari et al., 2020; Gastli et al., 2021; Nassar et al., 2020). Many of the aforementioned papers choose to tackle this lack of data by using satellite imagery, although in some cases they use the California Strawberry Commission data paired with the California Irrigation Management Information System (CIMIS). Unfortunately, the data mentioned in these papers are behind multiple walls, and the CIMIS data is currently unavailable from the original source. While we were able to find an excerpt of the CIMIS data elsewhere, we were unable to find the full dataset, making it very difficult to compare against.
Many different proposals for methods of predicting/forecasting yield (generically) exist. Some use classical machine learning (e.g. Paudel et al., 2021), while others, such as Nassar et al. (2020), use neural networks, in their specific case a mixture of CNNs, LSTMs, GRUs and some attention heads. However, all emphasise the need for better forecasting systems as demand increases and supply decreases due to global factors such as (but not limited to) COVID-19 and the Russia-Ukraine war. Current yield forecasting methods are highly archaic; often they are as simple as forecasting the average of the last few years' yields, or simple linear models based on heat hours. One such example is the European Commission's MARS Crop Yield Forecasting System (MCYFS), which has purportedly seen no improvement in its forecasting performance since 2006 and uses no machine learning. Lastly, the work by Paudel et al. (2021) shows that machine learning can already at the very least match (at the start of the season) or beat existing large-scale traditional crop yield forecasting systems such as the aforementioned MCYFS.
From 2006 to 2015 the MCYFS had a median MAE of 0.379, 0.368, and 0.570 for soft wheat, durum wheat, and grain maize respectively (van der Velde and Nisini, 2019). The most performant forecasts for this system appear to be sunflower yields at 0.162 MAE. However, the assessment carried out by van der Velde and Nisini does not state over what period these yield predictions are made, whether that be a few days, weeks, or months ahead, making this a difficult comparison to make. It is also apparent that forecasting is becoming increasingly difficult with the higher degree of variability in climate conditions, as the performance of this largely static forecasting system seems to be in slow decline (van der Velde and Nisini, 2019).
More sophisticated and bespoke machine learning for strawberry tabletop forecasting still has a long way to go, with only a handful of applications appearing in the literature. However, as previously stated, data is incredibly difficult to attain in this domain. Nassar et al. (2020) appear to show that compound deep learning models outperform standalone deep learning models and traditional machine learning models. Nevertheless, as with much work in this space, it is difficult to garner any concrete comparable statistics. From one of their diagrams (14) we believe we can see their most performant model, which they call Attention-ConvLSTM2D, producing an MAE loss of roughly 0.14 or 14% MAPE. While we do not have access to the same data as they have, we have seen even simple GRU models attain similar performance on our strawberry tabletop. However, we believe we can improve this performance on our own data by means of attention, as their paper would also suggest, but instead of standalone attention heads we intend to use a much more complex and performant transformer model.
Transformers as proposed by Vaswani et al. (2017) are state-of-the-art neural network components for sequence-to-sequence problems. Strawberry yield prediction is such a problem, thus we are keen to implement and use them in this scenario, having used other methods to varying degrees of success in the past (Onoufriou et al., 2020, 2021a). We also note that, in contrast to our previous techniques, transformers and their attention heads can help focus the neural network on the parts of the data that are most important, thus reducing the need for quite as much data compared to equivalently complex neural networks.
In short, yield forecasting is essential for improving food security and sustainable development (Zhu et al., 2022). Yield estimation is difficult due to a lack of data availability and thus a lack of research using modern data-hungry techniques in this domain (Chen et al., 2019; Maskey et al., 2019; Jafari et al., 2020; Gastli et al., 2021; Nassar et al., 2020). Most attempt to solve this data shortfall by using remote sensing, or by using a select few difficult-to-attain datasets like the California Strawberry Commission data (Jafari et al., 2020; Nassar et al., 2020; Zhu et al., 2022). Few works have applied modern deep learning/neural networks successfully to agriculture, especially strawberries; the majority use either old neural network forms or do not use neural networks at all.

Material and methods
We have collected 3 years of strawberry tabletop data at our Riseholme campus. This data comprises 2 polytunnels, each with 5 rows of strawberry tabletop, each tabletop being 20 m long. Thus, in total, we had 200 m of strawberry tabletop over any single season. Over these rows we had two different June-bearing varieties at any one time from Driscoll's Zara, Katrina, and Malling Centenary. Fig. 3 shows the two varieties chosen for the 2021 growing season from the aforementioned three; as can be seen, their performance, while similar, differs in that Katrina is expected to output more total yield in any given picking session on average. The data capture devices we employed for this strawberry tabletop provided:
• Irrigation data from the tabletop irrigation system. This includes features describing the nutrients, moisture levels, soil temperature, input irrigation, and irrigation runoff, with a sample rate of 1 sample per 2 min.
• Environmental data from a central weathervane, which collected information about temperature, humidity, wind direction, wind speed, solar radiance, and precipitation, with a sample rate of 1 sample per 15 min.
• Yield weight and quality data from our strawberry picking team, with a sample rate of 2 full picks per row per week.

Data wrangling
One of the biggest challenges when working with any time-series dataset is to ensure synchronicity. Since all 3 data sources are sampled at different, sometimes overlapping, intervals, it was necessary to re-sample the datasets to achieve synchronisation. We opted to synchronise over 15-minute intervals to match the weathervane data. We later downsampled the synchronised data to a much more manageable 4-hour interval when feeding it into our multi-timeline transformer (MTT).
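As an illustration of the downsampling step, the following is a minimal sketch that averages 15-minute samples into 4-hour buckets. The aggregation by mean, the alignment of buckets to midnight, and the function name `downsample_mean` are assumptions made for this example; the text above does not state how the downsampling was performed.

```python
from datetime import datetime, timedelta
from statistics import fmean


def downsample_mean(samples, bucket=timedelta(hours=4)):
    """Average (timestamp, value) samples into fixed-width time buckets.

    Assumption: buckets are aligned to midnight of the first sample's day
    and each bucket is summarised by its arithmetic mean.
    """
    if not samples:
        return []
    samples = sorted(samples)
    origin = samples[0][0].replace(hour=0, minute=0, second=0, microsecond=0)
    buckets = {}
    for timestamp, value in samples:
        index = (timestamp - origin) // bucket  # timedelta // timedelta -> int
        buckets.setdefault(index, []).append(value)
    return [(origin + i * bucket, fmean(vals)) for i, vals in sorted(buckets.items())]


# Eight hours of synthetic 15-minute readings collapse into two 4-hour rows.
start = datetime(2021, 6, 1)
readings = [(start + timedelta(minutes=15 * i), float(i)) for i in range(32)]
for timestamp, mean_value in downsample_mean(readings):
    print(timestamp.isoformat(), mean_value)
```

Each 4-hour bucket replaces sixteen 15-minute samples, which is what makes the sequences fed to the network so much shorter.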
One of the other challenges when working with any data is missing or unrepresentative samples. Unfortunately, in real-world scenarios we always expect to capture some missing or inaccurate data, especially when humans are necessarily involved in the process. We chose to use a forward-fill strategy whereby any missing values are filled with the last known values. The only features not forward-filled are ones that are sampled too infrequently to be able to reasonably forward-fill them. This means any missing values in yields, for instance (which are collected twice weekly), are removed as we cannot reasonably infer them from neighbouring values.
Now that we have a regular dataset with no missing values we can begin example extraction as per Fig. 1. We create hopping windows that end on/are aligned to observed yield outcomes in the current/predicted-for year. The window lengths we chose are 21 days for the premonition, 12 weeks for the present, and, as the past, the cumulative period of both combined in the previous year. This way we have information on adverse weather forecasts, current strawberry performance, and performance of strawberries at the same site last year. We then create time sequences using the expected date ranges, the historic data, and the points at which we have specific outcomes for fruit yields. This meant we formed roughly 2 examples for every week in the growing season.
We then further split this data by row into training (2, 3, 4, 6, 7, 8, 10) and testing (1, 5, 9) sets, while further subdividing the training set into training and validation using k-fold cross-validation where $k$ = … with a batch size of 32, which resulted in 10 batches. We held out the two final shuffled batches as a per-epoch validation set. We split in this manner to ensure there is no overlap between training and testing sequences, and because it enables us to have a full multi-year view, since there are not enough years of data with which to hold out.
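The window lengths described above can be turned into concrete date ranges. The following is a minimal sketch of one reading of that description: the premonition window ends on the observed yield date, the present window ends at the point of prediction 21 days earlier, and the past window covers the same combined span one calendar year earlier. The function names and the handling of 29 February are assumptions for illustration only.

```python
from datetime import date, timedelta

PREMONITION = timedelta(days=21)  # the forecasting horizon
PRESENT = timedelta(weeks=12)     # recent history before the prediction point


def one_year_earlier(d):
    """Shift a date back one calendar year (29 Feb maps to 28 Feb)."""
    try:
        return d.replace(year=d.year - 1)
    except ValueError:
        return d.replace(year=d.year - 1, day=28)


def timeline_windows(yield_date):
    """Return (start, end) date ranges for the three timelines of one example."""
    prediction_point = yield_date - PREMONITION
    present_start = prediction_point - PRESENT
    return {
        "premonition": (prediction_point, yield_date),
        "present": (present_start, prediction_point),
        # Assumed reading: same combined span, shifted into the previous year.
        "past": (one_year_earlier(present_start), one_year_earlier(yield_date)),
    }


windows = timeline_windows(date(2021, 7, 1))
for name, (start, end) in windows.items():
    print(f"{name:12s} {start} -> {end} ({(end - start).days} days)")
```

Calling `timeline_windows` once per observed pick gives one training example per yield outcome, consistent with roughly two examples per week of the season.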
Finally, we normalised our dataset feature-wise using a basic linear transformation, Eq. (1), where the desired normalised value of feature $x$ at timestep $t$ post normalisation, $x'^{\langle t \rangle}$, lies in $[a, b]$. We chose our range to be $[-1, 1]$. We inverted our results to real values using the inversion Eq. (2).
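The normalisation and its inversion can be sketched as follows, assuming Eq. (1) is the usual min-max scaling into a target range and Eq. (2) is its algebraic inverse. That reading, along with the function names and the example values, is an assumption; only the target range of [-1, 1] comes from the text.

```python
def normalise(x, lo, hi, a=-1.0, b=1.0):
    """Map x from the observed range [lo, hi] linearly onto [a, b]."""
    if hi == lo:
        raise ValueError("constant feature: min-max scaling is undefined")
    return a + (x - lo) * (b - a) / (hi - lo)


def invert(x_norm, lo, hi, a=-1.0, b=1.0):
    """Map a normalised value in [a, b] back to the real range [lo, hi]."""
    return lo + (x_norm - a) * (hi - lo) / (b - a)


# Hypothetical temperature readings, normalised feature-wise.
temperatures = [5.0, 15.0, 25.0]
lo, hi = min(temperatures), max(temperatures)
scaled = [normalise(t, lo, hi) for t in temperatures]
restored = [invert(s, lo, hi) for s in scaled]
print(scaled)
print(restored)
```

The minimum and maximum of each feature need to be stored from the training data so that forecasts can later be inverted back to real yield values.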

Architecture
As can be seen in Fig. 4, our MTT consists of 3 differently parameterised transformers merged together using a dense layer. Thus our architecture is comprised of 3 encoders, 3 decoders and a dense layer.
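To make the merging step concrete, the following is a minimal sketch in plain Python of how three timeline outputs could be concatenated and reduced to one forecast by a linear layer. The three transformers are not implemented here; their outputs are stand-in lists, and the name `dense_merge`, the weights, and the bias are hypothetical values for illustration only.

```python
def dense_merge(past_out, present_out, premonition_out, weights, bias=0.0):
    """Concatenate the three timeline outputs and return their weighted sum."""
    merged = [*past_out, *present_out, *premonition_out]
    if len(weights) != len(merged):
        raise ValueError("need exactly one weight per concatenated input")
    return sum(w * x for w, x in zip(weights, merged)) + bias


# Stand-in outputs from the past, present and premonition transformers.
forecast = dense_merge([1.0], [2.0], [3.0], weights=[0.5, 0.25, 0.25])
print(forecast)
```

In a trained network the weights would be learned, letting the model decide how much each timeline contributes to the final yield forecast.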

Encoder and decoder
As is standard for transformer networks, it is necessary to decide upon some form of positional encoding (Vaswani et al., 2017). In our case we use the standard fixed positional encoding, where even feature indices are encoded using Eq. (3) and odd feature indices are encoded using Eq. (4):
$PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d}\right)$ (3)
$PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d}\right)$ (4)
where $pos$ is the position of the timestep in the sequence, $i$ indexes pairs of features, and $d$ is the length of the feature vector.
This positional encoding is then added to the feature vector to allow the neural network some context into the order of inputs. There was no need to form a tokenised input embedding since we already have a distinct feature space described in our feature vector directly from the tabular sequences.
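A minimal sketch of the fixed sinusoidal encoding of Vaswani et al. (2017), added directly to a tabular feature sequence, might look as follows. It assumes sine is applied to even feature indices and cosine to odd ones, with the usual base of 10000; the function names and the toy sequence are illustrative only.

```python
import math


def positional_encoding(seq_len, d_model, base=10000.0):
    """Return a seq_len x d_model table of fixed sinusoidal encodings."""
    table = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(d_model):
            # Features 2k and 2k+1 share the same frequency.
            angle = pos / base ** (2 * (i // 2) / d_model)
            table[pos][i] = math.sin(angle) if i % 2 == 0 else math.cos(angle)
    return table


def add_positional_encoding(sequence):
    """Add the encoding element-wise to a sequence of feature vectors."""
    table = positional_encoding(len(sequence), len(sequence[0]))
    return [[x + p for x, p in zip(row, enc)] for row, enc in zip(sequence, table)]


# Toy sequence: 3 timesteps, 4 features each (all zeros for clarity).
encoded = add_positional_encoding([[0.0] * 4 for _ in range(3)])
for row in encoded:
    print([round(v, 4) for v in row])
```

Because the table depends only on position and feature index, it can be computed once and reused for every sequence of the same shape.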

Dense
The dense layer is a simple linear layer with enough weights to form the weighted sum of the concatenated transformer outputs, producing a single output value as in Eq. (5).
Towards gathering data we employed our own data collection pipeline on our Riseholme strawberry tabletop site; the respective yields of this site can be seen in Fig. 2. All of this data is streamed into MongoDB and accessed using aggregation pipelines to help speed up the transformation process.

Weight initialisation
For weight initialisation, we used the default PyTorch Kaiming uniform initialisation as defined in Algorithm 1 for leaky-ReLU (Nair and Hinton, 2010; Radford et al., 2015).
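As a rough illustration of the initialisation named above, the following plain-Python sketch follows the commonly documented Kaiming uniform recipe: compute a gain from the rectifier's negative slope, derive a bound from the chosen fan, and sample uniformly within that bound. The function signature, the explicit `fan_in`/`fan_out` arguments, and the seeded generator are assumptions for this example rather than the exact code used in this work.

```python
import math
import random


def kaiming_uniform(fan_in, fan_out, negative_slope=0.01, mode="fan_in", rng=None):
    """Sample a fan_out x fan_in weight matrix and return it with its bound."""
    if mode not in ("fan_in", "fan_out"):
        raise ValueError("mode must be 'fan_in' or 'fan_out'")
    rng = rng or random.Random()
    fan = fan_in if mode == "fan_in" else fan_out
    gain = math.sqrt(2.0 / (1.0 + negative_slope ** 2))  # leaky-ReLU gain
    bound = gain * math.sqrt(3.0 / fan)                  # uniform limit
    weights = [[rng.uniform(-bound, bound) for _ in range(fan_in)]
               for _ in range(fan_out)]
    return weights, bound


weights, bound = kaiming_uniform(64, 32, rng=random.Random(0))
print(f"bound = {bound:.4f}, shape = ({len(weights)}, {len(weights[0])})")
```

Using the fan-in mode keeps the variance of activations roughly constant in the forward pass, whereas the fan-out mode does the same for gradients in the backward pass.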
Algorithm 1 Kaiming uniform weight initialisation using leaky-ReLU with the fan-in method, where:
• $a$ (default 0 for ReLU, or 0.01 for leaky-ReLU) is the negative slope of the rectifier used after this layer.
• $W$ is a randomised weight matrix with mean 0 and variance 1 (shape e.g. (64, 32)).
• mode is a flag that selects the value of fan according to whether the method is being used for feedforward or backpropagation (e.g. if mode = fan-in then fan = 64, else fan = 32, given the previous example matrix $W$).
Our RMSE loss allows us to penalise large errors more heavily than small errors on our continuous yield forecast. In particular, we seek to reduce the network's tolerance for larger single errors: even if the total error were the same, an error that is particularly peaked in one prediction would result in the growers having to import fruit that particular week. We would much rather be consistently out by a known amount than have almost perfect performance one week and then large errors the next.
As is commonly the case, we use adaptive moment estimation (ADAM; Kingma and Ba, 2014) as our neural network optimiser, as it has been shown to be more performant than using first-order or second-order moments alone and is by and large the de facto standard. We calculated our first-order moments as $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$, with bias correction $\hat{m}_t = m_t / (1 - \beta_1^t)$, and our second-order moments as $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$, with bias correction $\hat{v}_t = v_t / (1 - \beta_2^t)$, where $g_t$ is the gradient at step $t$.
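The preference for consistent over peaked errors can be seen in a small worked example. The sketch below compares two hypothetical sets of four weekly forecast errors with the same total absolute error; the error values are invented purely for illustration, and RMSE is used as the squared-error measure reported elsewhere in this paper.

```python
import math


def mae(errors):
    """Mean absolute error."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    """Root mean squared error."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


consistent = [0.1, 0.1, 0.1, 0.1]  # always out by the same known amount
peaked = [0.4, 0.0, 0.0, 0.0]      # perfect three weeks, badly wrong once

for name, errors in (("consistent", consistent), ("peaked", peaked)):
    print(f"{name:10s} MAE = {mae(errors):.3f}  RMSE = {rmse(errors):.3f}")
```

Both sets have the same MAE, but the peaked set has twice the RMSE, so a squared-error loss steers training away from the single large miss that would force growers to import fruit in that week.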

Models
We primarily focused on two different types of model. The first is one holistic model that learned from all of the training rows using random subsets for training and validation (Fig. 5). The second is an ensemble of smaller, weaker predictors, each trained on a smaller subset of the training data, which we compare to each other to attain simple certainty metrics. We deem such metrics invaluable towards building trust in the models and enabling re-investigation of uncertain scenarios. We split the training data into 3-row sets of tabletop for each ensemble member. Each ensemble member is equivalent to the base MTT, including weight initialisation, loss function, and optimiser.
Table 1 footnote (a): These are estimates and may not be representative of any grower or agronomist specifically but are instead ballpark figures for illustration based on information from our industry partners.
Overall this means there was a one-row overlap between the first-second and second-third MTT. The results of our two current attempted approaches, along with our past approaches and the expected forecasting performance of growers and agronomists, can be seen in Table 1.
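The row split and a simple ensemble certainty metric can be sketched as follows. The exact assignment of rows to members and the use of the standard deviation across members as the certainty measure are assumptions for illustration; the text only states that members use 3-row sets with a one-row overlap between neighbouring members.

```python
from statistics import fmean, pstdev

TRAINING_ROWS = [2, 3, 4, 6, 7, 8, 10]


def overlapping_row_sets(rows, size=3, overlap=1):
    """Split rows into consecutive sets that share `overlap` rows."""
    step = size - overlap
    return [rows[i:i + size] for i in range(0, len(rows) - size + 1, step)]


def ensemble_forecast(member_predictions):
    """Return the mean forecast and the spread across ensemble members."""
    return fmean(member_predictions), pstdev(member_predictions)


print(overlapping_row_sets(TRAINING_ROWS))

# Hypothetical normalised forecasts from the three ensemble members.
mean, spread = ensemble_forecast([0.5, 0.6, 0.7])
print(f"forecast = {mean:.3f}, spread = {spread:.3f}")
```

A large spread between members would flag a forecast as uncertain and worth re-investigating, while a small spread suggests the members agree.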

Results
As can be seen in Table 1, our primary MTT, which can forecast three weeks ahead within 8% RMSE, is a large improvement over current capabilities: forecasts by agronomists not only vary wildly from agronomist to agronomist (14 to 30%) but also rely on specialist human presence and are less accurate than our current model. However, a large caveat is that our model was created with intensive/high-quality environmental and yield data, on a small site compared to typical industrial settings.
The results shown in Table 1 and Fig. 6 are a significant step forward in the prediction of strawberry yields; however, there are some weaknesses to our approach and the yield outcomes. Firstly, our ensemble is significantly underperforming, especially since a single predictor trained on the whole dataset beats the ensemble significantly. This is likely due to data. With almost three times the parameters in total, we suspect the ensemble requires more training data to learn adequately, yet each member receives roughly 1/3 of the total training data. However, as time progresses and more data becomes available to us over more seasons, we believe this ensemble will outperform the single MTT while enabling ensemble-based certainty estimation. Secondly, and most difficult, is the data itself.
While we are fortunate to have access to our Riseholme campus and the strawberry tabletop site, there is still a lack of data available for use. This relatively small site means we likely have not learned some of the more complex variances present on larger sites, where the sensors' immediate environment might be significantly different to another area of the growing site some distance away. In such scenarios the data might be significantly less representative of the conditions experienced by the strawberries.

Discussion
Our strawberry dataset while covering 200 m of strawberries is still limited. Commercial sites in comparison have hectares of such crops, meaning our 200 m is not as representative of larger sites with more intra-crop variability. However, as previously mentioned data availability is scarce making it practically very difficult to collect hectares of data, not least due to actual or perceived data sensitivity by the respective growers. In spite of this, while there may need to be some adjustments to account for more intra-crop variability of these larger sites our neural networks perform well given the data availability. While the sites are smaller and easier to learn, they also have fewer data to do so, which we believe to be a fair trade-off with no loss in difficulty between their larger sites and our smaller site.
Our dataset has a high degree of similarity between rows. While there is some inter-row variance, there is still a risk of overfitting since even if the neural network cannot see row 1, for instance, it may be able to infer the yields of row 1 from the previously trained/known yields of row 2. We would ideally have liked to split by time, claiming one whole season as a completely separate testing set with none of those rows being trained on. However, due to the reality of strawberry seasonality, and the fact that there are only so many seasons over which it was possible to collect data, we had to split in such a way as to give the neural networks context for at least two seasons from start to finish. This is necessary because the current methods of strawberry prediction in industry are largely based on the occurrences of the last season. As such we attempted to base our methods on existing techniques, and intuitively the performance of the strawberries last year will be related to the current season's performance unless some large shift in methods between the seasons occurs.
Fig. 3 shows a significant number of zero/near-zero values. This is due to the slow start at the beginning of every season, as shown in Fig. 2.
In our data collection, we still recorded fruitless strawberry picking sessions to account for some strawberry plant varieties producing for longer periods in the growing season, whereas others started later. This is significant as the total berries one would expect to harvest over the season is affected. In particular, for the 2021 season, we experienced a very slow start with very low yields when compared to the 2020 season. Later in the season we may also experience zero/near-zero values; these are difficult to distinguish from actual low values and bad picking sessions. One way that we might have distinguished them is by assuming the harvesting effect causes all temporally adjacent picks of the same row to have diminishing returns.
Our MTT used an interval of 4 h despite our data being synchronised over 15-minute intervals. This was a tradeoff between data density (thus model complexity), and data availability. Since we only had a finite number of concrete outcomes that we observed we had to limit the complexity and weights of the model so that it could train its fewer weights with what limited data we had for concrete observations. In contrast, if we had used a data density of 15-minute intervals we would have had to have significantly larger weight matrices being backpropagated from the limited number of observed yield values. If, however, we found ourselves with large hectare scale datasets with many more observed outcomes we could tune the model to be more complex to leverage this data, allowing the model to understand much more complex relationships like the aforementioned expected intra-crop variability.
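The effect of the sampling interval on sequence length can be worked through directly. The sketch below counts the timesteps in each timeline window at both intervals, using the window lengths given earlier (21 days, 12 weeks, and their combined span); the dictionary and function names are illustrative only.

```python
from datetime import timedelta

WINDOWS = {
    "premonition": timedelta(days=21),
    "present": timedelta(weeks=12),
}
WINDOWS["past"] = WINDOWS["premonition"] + WINDOWS["present"]


def timesteps(window, interval):
    """Number of whole sampling intervals that fit in a window."""
    return window // interval


for name, window in WINDOWS.items():
    fine = timesteps(window, timedelta(minutes=15))
    coarse = timesteps(window, timedelta(hours=4))
    print(f"{name:12s} 15 min: {fine:6d} steps   4 h: {coarse:4d} steps")
```

Every sequence is 16 times shorter at the 4-hour interval, which is the reduction in input size that the trade-off above relies on.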
It may also be noted that we use a simple missing-data imputation strategy, namely forward-fill, which involves filling missing values with the last known value. This was chosen as we mostly only incurred individual or relatively sparse missing data. In larger sites one might expect to find entire regions that have some data unavailability for some time, meaning more advanced data-filling strategies may be necessary under such conditions. However, on our site, since the missing values were relatively sparse, the forward-fill strategy is sufficient to allow us to leverage the data in spite of any missing observations or features. The only notable exception is that of yield values. Since yield values were recorded sparsely, a single missing value carries much greater significance. Thus any such missing values are excluded entirely. Thankfully we had very few such missing values.
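The two treatments of missing data described above can be sketched in a few lines. This is a minimal plain-Python illustration in which `None` marks a missing reading; the function names and sample values are hypothetical.

```python
def forward_fill(values):
    """Replace each missing value with the last known value before it.

    Leading gaps stay missing because there is nothing earlier to copy.
    """
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def drop_missing_yields(picks):
    """Exclude picking sessions with no recorded yield instead of filling them."""
    return [(day, weight) for day, weight in picks if weight is not None]


# Densely sampled sensor feature: sparse gaps are forward-filled.
print(forward_fill([None, 1.0, None, None, 2.5, None]))

# Sparsely sampled yields: the missing pick is removed entirely.
print(drop_missing_yields([("Mon", 3.2), ("Thu", None), ("Mon", 4.1)]))
```

Forward-filling a yield would silently repeat the previous pick several days later, which is why the sparse yield series is filtered rather than filled.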
Due to the data scarcity, we used fixed positional encoding as opposed to learned encoding. This means no gradients need to be propagated into a learned positional encoding, so no additional parameters have to be trained from our limited data. This is sufficient since in the original transformer paper (Vaswani et al., 2017) fixed positional encoding and learned positional encoding result in similar performance.
Finally, we chose to use a tri-transformer architecture merged using a dense fully connected layer. We did this to allow the neural network to train separate contextualising units for each potential timeline. This way we can easily conceptualise the timelines as follows. The past's purpose is to have a broad view of the relationship between the features and the expected outcomes. This is important as we want to ensure the network has context for how yields are expected to turn out given past scenarios. The present serves to contextualise how this current specific season or crop is performing, such that it can later be related to what has happened in the past. The future timeline/transformer adds mitigations and adverse effects, such that high expected fluctuations can be considered at the merging layer.

Conclusions
In this study, we propose a new multi-timeline transformer architecture that outperforms many other forms of neural networks (including CNNs, RNNs, LSTMs, and GRUs) in forecasting strawberry yields, even on small datasets. Multi-timeline transformers are very capable of learning from the past, the present, and the premonition of the future, even when these use similar approaches to human forecasters. There has been little work in forecasting strawberries using state-of-the-art deep learning methods, and all of the works that do exist struggle with data availability. Data is clearly the principal problem, as we need more data to develop good machine learning models. Therefore, we need to encourage the sharing and collection of more and better data so that more impactful research can be performed. With more data, we can properly test ensemble models and similarly data-hungry models.
Regarding our future work, one key area would involve implementing certainty metrics that do not require the use of ensembles, so that we can keep the number of neural network parameters down. This would reduce the data necessary to train more complex models. We also seek to make transformers that are abelian compatible, such that we can use some of the previously proposed fully homomorphically encrypted (FHE; Gentry and Halevi, 2010) deep learning methods with these currently incompatible but performant transformers (Onoufriou et al., 2021a, b).
Lastly, we seek to find ways in which to make our data available for wider use. Currently that is not possible due to contractual constraints, which were necessary to enable us to collect this data with industrial varieties in the first instance. However, we seek to remedy this in future.