The challenges of using live-streamed data in a predictive digital twin

A key component of a digital twin is the monitored data, communicated from the physical system to the virtual representation, for visualization and simulation. Yet little attention has been paid to the practicalities of working with live-streamed data. Strategies are required for providing continuous access and storage, for processing data of indeterminate quality, and for ensuring long-term sustainability of data communication systems. This paper describes the design of the data communication infrastructure, 3D visualization and data interpretation tools, and model development and implementation, for an operational digital twin. Conclusions are drawn pertaining to the informativeness of streamed data, which is key to successful digital twin development and operation. The paper highlights the need for further research into techniques for ensuring data quality and forecasting efficacy. Digital twins operated commercially are rarely described in detail; the discussion provides a basis for future collaborative development of a consistent operational and interactive framework.


Introduction
The increasing availability of high-resolution, detailed data streams of physical systems facilitates the development of digital twins for system operation and fault detection. However, little attention has been paid in the literature to the challenges of assimilating engineering models with real-time streamed data, as against carefully curated data. In this paper, we highlight and address the crucial issues arising from the use of real-time data, in the context of a digital twin developed for optimizing the energy and environment of an underground urban farm.
It is necessary to define the term digital twin in the context of recent advances. To some extent, digital twins have existed for years: in buildings, asset management and HVAC control systems actuate heating and cooling based on live thermostat readings, and for weather, the Met Office continuously updates its models based on real-time measurements at thousands of weather stations. The original early twenty-first century vision of a digital twin was of a virtual representation of a system that aids the understanding of the relationships between a physical system and its underlying information (Grieves 2014). At the time, data streams were largely manually collected and paper based, and the concept was a vision for the future. Fast forward to the current day and the technology exists to make that vision a reality.
Grieves and Vickers (2017) identified a number of key concepts surrounding the digital twin in the manufacturing process, and differentiated between five scales of twin over a product life cycle, from a digital representation of a prototype containing sufficient information to build the corresponding physical realization, to a complete digital twin of the physical product within its operating environment. Jones et al. (2020) recently performed a thematic analysis of published work and found that the bulk of the papers came from the manufacturing/production field. Rasheed, San, and Kvamsdal (2020) reviewed the recent status of methodologies for the construction of digital twins from a modelling perspective and provided recommendations for the different types of stakeholder, from industry and academia through to society and education of the new generations. Verdouw et al. (2021) also reviewed the use of digital twins but in a smart farming context, focussing on the purpose and benefits of the twin. They suggested different typologies of digital twin for different purposes. The typologies vary from the simplest Monitoring Digital Twin, which provides a visual representation and insights into the state of the physical entity to inform human interpretation and action, to a fully Autonomous Digital Twin, which fully controls the behaviour of the physical entity without the need for human intervention. A Prescriptive Digital Twin is somewhere in between, prescribing actions but not implementing them automatically; the actuator is the so-called 'human-in-the-loop' (Jones et al. 2020). Verdouw et al. (2021) also defined a Predictive Digital Twin as a separate typology, being a simulated projection of the future states of the virtual representation, but acknowledged that any of the other typologies can also have predictive capabilities.
In 2017, Gartner (2017) suggested that by 2021, half of large industrial companies would be using digital twins. Many commercial offerings are now advertised, with companies involved including IBM (IBM 2023), Microsoft (Microsoft 2023) and Oracle (Oracle 2023), and for the built environment IES-VE (IES-VE 2023) and Bentley Systems (Bentley Systems 2023). There is a growing wealth of academic literature discussing the digital twin paradigm, and experimental case studies that highlight the potential pitfalls for implementation of digital twins in a commercial setting. In the manufacturing context, recent examples consider the use of digital twins in friction stir welding (Roy et al. 2020), die-cutting (Wang, Lee, and Angelica 2021), assembly processes (Yi et al. 2021) and automation (Guerra-Zubiaga et al. 2021). Wang, Lee, and Angelica (2021) highlight the difficulties in developing digital twins for existing machinery owing to the lack of Internet of Things (IoT) capability, but successfully prototype a digital twin of a conventional die cutting machine. Guerra-Zubiaga et al. (2021) present case studies for automation of two different experimental production line systems, highlighting the need to improve connectivity of each component. The potential for improved assembly efficiency is demonstrated by Yi et al. (2021) on an experimental digital-twin based satellite assembly system. They emphasize the difficulties in gathering, processing, and storing data in real time. In the built environment there is potential for an increasing proliferation of digital twin technology in line with increases in monitoring and smart metering in buildings. Lu et al. (2020) developed a digital twin demonstrator at the University of Cambridge encompassing both building and city scale and focussing on operation and maintenance. They highlighted key issues when dealing with data from disparate sources around data quality, heterogeneity, synchronization and integration. Another prototype at the city scale is described by Dembski et al. (2020). Their digital twin of the city of Herrenberg, Germany, allowed investigation of air quality issues, testing of potential alternative solutions, and facilitated involvement of public participation for decision making by consensus. A building-level case that has been fully operational for over a year is described by Peng et al. (2020). In this hospital building, a digital twin system has been deployed that incorporates data from more than 20 management systems.
Managers can access the detailed status of the whole hospital visually and receive timely operational suggestions generated by the digital twin. The authors highlighted the need to develop the system further to safely implement automatic control of medically critical machines, and to develop protocols for ensuring data privacy issues do not compromise system function.
Despite different industries having a different focus, there is a growing consensus that as a minimum, a digital twin comprises a physical entity that is changing over time, a virtual representation of the physical entity (typically a numerical or statistical model), and data connections that feed monitored data from the physical entity to the virtual representation (Wright and Davidson 2020). For maximum benefits to be realized, it should also include a means to interpret the monitored data and feed information back that impacts on system operation in a beneficial way. Following this interpretation, we identify five components required to specify a digital twin:
(1) the physical entity itself, i.e. the product or system that exists in reality,
(2) a monitoring system that measures the state of important parameters of the physical system,
(3) a virtual representation of the physical system, incorporating:
• infrastructure that can receive and process incoming data from the monitoring system, and
• procedures for incorporating the monitored data to update the virtual representation to match the physical entity,
(4) simulation tools for interpretation of the data communicated to the virtual representation, and
(5) actuation systems for implementing operational changes as a result of data processing and simulation outputs.
Of these, the first four are essential and the last component facilitates use of the digital twin to improve the performance of the physical entity. A visualization platform is not an essential part of the virtual representation, but can prove an invaluable tool for interpretation of the data and enabling interactive use of the digital twin.
Given this structure, at the heart of a digital twin lie the data that are monitored for the physical system and communicated to the virtual system. To date there has been little discussion of the practical aspects of integrating monitored data within the visualization and models of a digital twin. While visualization of the data may in itself be a relatively straightforward task, there are many practical concerns around data access, communication and storage that require addressing. Equally, value can be added to the data by different visualizations or by conversion into actionable information, and different stakeholders may require different types of data and visualization tools, so the data must be usable by models of different type and resolution.
Our experience of working with live data has highlighted many aspects which tend to be marginalized in discussions yet are of critical importance for the development of efficient data management strategies. In the current paper we focus on the challenges of developing a digital twin from the perspective of working with live-streamed, as against carefully curated, data to inform the operation of the physical system. We demonstrate these challenges through a digital twin, the Crop Research Observation Platform (CROP), developed for managing the indoor environment of a hydroponic farm in London, UK (Jans-Singh et al. 2020). The urban-integrated Controlled Environment Agriculture (CEA) system considered here is an interesting example because it involves complex processes for which monitoring and simulation provide practical and actionable information. The potential benefits of being able to track and forecast farm conditions in a hydroponic system include reduced water and nutrient usage, optimization of environmental conditions for maximum crop growth, and management of algae and pathogen growth (Sreedevi and Santosh Kumar 2020). Post-harvest, there is great potential for a digital twin to illuminate the factors affecting perishable product quality at point of sale (Defraeye et al. 2021).
In the following section, the crucial factors associated with utilizing and communicating live streamed data are described. Section 3 then presents the case study example and explores how the challenges identified in Section 2 have been managed. Lessons concerning the informativeness of data, learnt from the development and context of CROP, are explored in more detail in Section 4 and finally the results are discussed and conclusions are drawn for more general and wider development of Digital Twins.

Challenge of live-streamed data
Dealing with live-streamed monitoring data must address a number of practicalities, as discussed by Batini et al. (2009). These include data access and storage in addition to data issues such as quality and utility. From the viewpoint of an operational digital twin, the sustainability and extendability of data channels are key, but all of these issues come at a cost and require a diverse set of skills for development and maintenance. In this section, each of these topics is discussed in more detail.

Access
How and when the data will be accessed from the monitoring devices and transmitted to the digital twin platform is a key component of a digital twin. Data access can be facilitated by transmission of data from the monitoring devices to an on-line platform via an intermediate base station. Access to the on-line platform is typically via an Application Programming Interface (API) which can be interrogated to acquire the data required, as sketched below. While in theory this should be simple (and in many cases is), proprietary devices may restrict access to their own commercial platform and resist requests to communicate externally. The result is that with an injudicious choice of providers, a company wishing to visualize, for example, 5 different data streams may end up with 5 different platforms that do not communicate with each other. This is potentially unwieldy and inefficient.
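To make this concrete, the snippet below sketches a minimal polling loop against a generic REST API of the kind described; the endpoint URL, token and JSON structure are hypothetical placeholders rather than those of any particular commercial platform.

```python
import time
import requests

API_URL = "https://api.example-platform.com/v1/readings"  # hypothetical endpoint
API_TOKEN = "your-token-here"  # credential issued by the platform

def fetch_latest_readings(since_iso):
    """Request all readings recorded after `since_iso` (an ISO 8601 timestamp)."""
    response = requests.get(
        API_URL,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        params={"since": since_iso},
        timeout=30,
    )
    response.raise_for_status()  # surface HTTP errors rather than ingesting bad data
    return response.json()["readings"]  # assumed response structure

while True:
    readings = fetch_latest_readings("2022-01-01T00:00:00Z")
    # ... write readings to the digital twin database here ...
    time.sleep(600)  # poll at 10 min, matched to the sensors' reporting interval
```

In practice, the polling interval must also respect any download limits imposed by the platform, as discussed next.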
Frequency of access must also be matched to the requirements for data collection for different functions within the digital twin. Again, different commercial platforms may restrict the number of data downloads over a specific timeframe, which may result in incompatibilities between different data sources.

Storage
The accessed data need to be stored in a secure repository, able to accept incoming data streams from different sources and accessible from the digital twin platform. The best storage solution is likely to be dependent on the individual context of the digital twin. Cloud-based platforms, e.g. Microsoft Azure, have the benefits of secure access, back-up, copious storage capacity and access to cloud computing services, but at a cost. An alternative would be to use a stand-alone server on-site, but this would also require maintenance and back-up, again with associated costs.
The volume of storage required is also worthy of consideration, in particular for constantly streaming data at short time intervals. As the data accumulate over time, the volume of storage must remain adequate, and suitable protocols must be in place for archiving historic data.

Quality
Particularly important are data accuracy, completeness, consistency and timeliness, i.e. the extent to which the data are sufficiently up to date for a specific task (Batini et al. 2009), together with the interventions required to address any failings in quality.
The quality of the data may be directly affected by physical factors such as sensor failure, sensor drift or power supply failure, but may also be affected by factors such as the frequency at which updates are sent to the digital twin platform from the different sensors. Interventions may be necessary when poor quality data are identified. These interventions may be data-driven, e.g. recognizing incorrect or missing numbers and substituting replacement values, or process-driven, e.g. modifying the processes using the data to accommodate incorrect values.
The accuracy of the incoming data is affected by the reliability and calibration status of the individual sensors. Incorrect values can arise from errors in the sensors; for example, inaccuracies of this type are typically observed in monitored electricity consumption data, where the meter may fail to register a value in one time period and register a double value in the next time period. Drift in sensor values can also occur and care must be taken to ensure correct calibration. As an example, this is particularly the case for CO2 sensors; typically these sensors self-calibrate using an automatic baseline calibration approach, in which it is assumed that at some point during each day the CO2 level is at a baseline level (equivalent to the outdoor air at around 400 ppm), and the sensor is calibrated by equating the lowest daily value to this baseline figure (CO2Meter.com 2022). Clearly this is not valid for a situation in which the baseline figure is rarely achieved.
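To illustrate why automatic baseline calibration fails here, the sketch below reconstructs the essence of the correction such sensors apply: each day's minimum reading is forced to an assumed outdoor baseline of around 400 ppm. This is an illustrative reconstruction, not vendor firmware; in a space where CO2 rarely falls to ambient levels, the correction systematically distorts the readings.

```python
import pandas as pd

OUTDOOR_BASELINE_PPM = 400  # assumed ambient CO2 concentration

def automatic_baseline_calibration(co2: pd.Series) -> pd.Series:
    """Illustrative ABC correction: shift each day so its minimum equals the baseline.

    `co2` is a timestamp-indexed series of raw readings. If the monitored space
    never actually reaches ambient CO2, this daily shift is spurious.
    """
    daily_min = co2.groupby(co2.index.date).transform("min")
    return co2 - daily_min + OUTDOOR_BASELINE_PPM
```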
Missing data are another important issue, and visualization of the data can be used to illustrate when missing data arise. Whereas when working with historical data it may be possible to use sophisticated statistical techniques to impute data to fill in any gaps, when working with data in close to real time a more pragmatic data-driven approach may be more suitable. For example, random missing data points may simply be excluded from a data set, assumed to be the same as the previous monitored value, or filled in using linear interpolation; a sketch of these options follows this paragraph. When selecting a solution it is necessary to ensure that the approach used is suitable in the context of the use to which the data are being put. If the missingness becomes extensive, however, alternative process-driven approaches may be more suitable, i.e. models may need to be developed which can accommodate different levels of missingness without significant impact on the model output. Examples might be modification of a model, such that it no longer relies on data known to be of dubious quality, or calibration of a model using an alternative, more reliable data source.
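A minimal sketch of the pragmatic options named above, using pandas; which method is appropriate depends on the data stream and on the use to which the data are put.

```python
import pandas as pd

def fill_gaps(series: pd.Series, method: str = "interpolate") -> pd.Series:
    """Fill isolated missing values in a regularly sampled, timestamp-indexed series."""
    if method == "drop":
        return series.dropna()  # simply exclude missing points
    if method == "previous":
        return series.ffill()  # assume the same as the previous monitored value
    if method == "interpolate":
        return series.interpolate(method="time")  # linear interpolation in time
    raise ValueError(f"unknown method: {method}")
```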
Data may also be incomplete. Unlike missing data for which it may be possible to fill the gaps, incomplete data refers to values that are either not being monitored or not transmitted to the digital twin, but may be required by a model. An example might be operational data recorded locally for the use of the system operators. It is of key importance to identify all significant parameters and establish appropriate communication pathways.
Ensuring that the incoming data are fully understood and characterized is an important step in ensuring consistency across the different data streams. One initial challenge when working with live streamed data is ensuring that timestamps are consistent across all monitored data streams, in particular the methodology for implementation of seasonal changes which can differ for different sensors. The timeliness of the data depends on whether the update frequency is adequate for the function and purpose of the embedded models. Again, this is very much dependent on the nature of the digital twin and the decision-making tasks it serves.
The potential timeliness is dependent on the latency for data ingestion, processing and communication. This affects the speed at which data are transmitted to the database, and substantial delays must be avoided if the data are required in (close to) real-time. The variation in transmission speed for diverse datasets will impact on the choice and implementation of the database type, with a relational database suitable for data that can be aligned in time, but a non-relational database required for unstructured data.

Utility
The utility of the data lies at the very heart of how a digital twin is designed and intended to be used. This may be for monitoring system health, fault identification, optimizing operation, forecasting future conditions, or more extensive modelling of system optimization. A digital twin is primarily intended to be useful for the system operators but can also provide opportunities and information for a large variety of stakeholders. The twin must essentially become a repository of knowledge about that particular system such that by gathering data in one place and making it securely accessible, any number of different uses of the data can be facilitated.
Data utility is also dependent on how informative the data are under the context of interest. This is crucially important when the data are being used to either derive statistical models, or calibrate numerical models to ensure continuous comparability of the model to the physical system. In previous studies, Menberg, Heo, and Choudhary (2019) demonstrated that different selected datasets from the same data source contained varying levels of information, and hence gave results of varying success rate when used for calibration of a heat pump model. As this is an issue that is better discussed in more depth through examples, more details are provided in Section 4.

Sustainability
Ensuring the ongoing functionality of a digital twin is an essential part of the development. The robustness relies on different components:
• the continuing function and accuracy of the sensors and sensor locations,
• the continuing function of the data communication system,
• the continuing function of the data storage system,
• the continuing function and accessibility of the computing platform on which the storage system is hosted,
• the continuing compatibility of all software components, and
• updating of the decision-making models to reflect changes or updates to the physical system.
Maintaining these requires an ongoing commitment, from a technical, time and financial perspective, even with the best of commercially hosted data storage systems. The components most likely to change significantly are the sensors, sensor locations and communication system. This should be anticipated in the development of a digital twin, and functionality provided for the updating of the sensor network.
Platform technologies should also be expected to evolve over time, with the potential for incompatibilities to arise within the digital twin infrastructure. Models also can become outdated and software updates can lead to unforeseen consequences. Essentially the digital twin is itself a dynamic entity, and just as the physical system will undergo changes over time, the digital twin will also necessarily require changes in order to maintain compatibility with the physical system.
It is also important to appreciate that events triggering changes in the digital twin may be the same or different from events triggering changes in the physical system. Ensuring sustainability requires monitoring of all three constituent systems to anticipate and manage any nonconformities that arise.

Extendability
Linked to the sustainability, it is also necessary to consider both the ease with which changes to the physical system will be assimilated into the digital twin, and also the facilities provided for exploration of metrics derivable from the data. As a digital twin develops over time, new sources of data and new models may be acquired and there needs to be scope to include these in the digital twin platform. In addition it could be anticipated that as a digital twin evolves the stakeholders may suggest new uses. A benefit of having multiple models on the same platform is that warnings can potentially be created using more than one model, facilitating cross verification and therefore more confidence in the digital twin outputs. Having the ability to extend the platform functionality to accommodate new uses and models should be a priority for the design of a digital twin.
Equally, the position of the digital twin within an ecosystem of digital twins requires consideration. In future, federations of digital twins may be joined together by data shared securely, as envisaged in the Gemini Principles developed by the Centre for Digital Built Britain (Bolton et al. 2018). This will require an information management framework to be developed and adhered to by all participants.

Development team and stakeholders
It may seem unusual to include personnel in a discussion of factors influencing live data streams versus curated data-sets. However, the development and operation of a digital twin require a great many different types of skill. Design and installation of monitoring systems, development of the communication infrastructure, setting up the digital twin platform technology, development of models of the physical system, and interpretation of outputs into decisions are all components requiring specialisms that need to be integrated for the successful development of a digital twin. Add to that the needs of different stakeholders interacting with the twin, and it is clear that a diverse team is required, not just for development, but also for ensuring sustainability of the digital twin system going forwards.

Cost
The costs of developing a digital twin are difficult to quantify. As discussed above, a diverse team is required for digital twin development and ongoing operation, and this personnel requirement is itself a significant cost.
There are up-front and on-going costs associated with collecting and communicating data, arising primarily from hardware purchase and ongoing platform subscription costs. The cost of hardware purchases is very dependent on the nature of the monitoring required but may include some or all of the sensors, Wi-Fi infrastructure and a station for transmitting the data to the platform of choice. If the data are to be transmitted and stored via cloud computing services there will be an associated on-going cost. All of these need addressing to ensure the long-term viability of the digital twin platform.

The Growing Underground farm
A comprehensive description of the Growing Underground (GU) farm layout and operational strategy is given by Jans-Singh et al. (2020), so for brevity only the pertinent facts are included here. The farm, operated by Zero Carbon Farms Ltd, is located 33 m below Clapham High Street in London in previously disused tunnels designed for use as air raid shelters in the 1940s. The site presents specific operational challenges due to its location and history. Despite this, the farm produces 12 times more per unit area than a traditional greenhouse in the UK and uses just 4 times the energy per unit area. Crops produced are high value microgreens such as peashoots, basil and salad rocket which are distributed locally and across the city.
The GU site consists of two parallel tunnels, spanning approximately 400 m on two levels between Clapham North and Clapham Common Underground stations. Irrigation tanks are contained in the lower tunnels and the productive farm occupies a floor area of approximately 519 m² on the upper level. In the growing area there are two parallel aisles of benches with 2 m wide trays stacked 4 high, giving a total productive area of 528 m². In addition to the main farm area there are zones for seed preparation and propagation, and a research and development (R&D) zone in a separate tunnel running parallel to the propagation zone, in which growing conditions and crop selection can be tested and optimized.
The GU farm is illustrated schematically in Figure 1. Access to the farm is provided via a spiral staircase surrounding a lift shaft that runs down to both the upper and lower levels of the farm. The lift shaft is a fundamental part of the ventilation system; air is drawn in through the spiral staircase and circulated through the farm via vents in the ceiling of the upper tunnels and floor of the lower tunnels. Air is expelled from the farm through a parallel air shaft. There are two 30 W fans, one at each end of the farm tunnels providing ventilation for the farm itself and the tunnels beyond.
The prime function of the farm is growing crops, and to this end it is important to maintain environmental conditions that are optimal for the plants, specifically the temperature and relative humidity need to be kept within certain ranges. Regulation of environmental conditions is carried out manually; the operators have a number of options available including changing settings on the ventilation system, door operation, and relocation of portable fans to increase air speed in vulnerable locations. There are also dehumidifiers in place to aid reduction of air moisture content.
The lighting system is a critical component of the farm as the lights not only provide photosynthetically active radiation (PAR) for crop growth, but also generate heat. The grow lights are typically operated overnight so as to provide the most heat when the external conditions are at their coldest. The location of the farm so far underground means that thermal conditions of the surrounding ground are relatively stable, with a background temperature estimated to be around 14 °C (Mortada, Choudhary, and Soga 2015). The additional heat provided by the lights brings the temperature up to suitable conditions for plant growth.
In Figure 1, the locations of the environmental sensors are indicated. The monitoring system consists of a wireless sensor network (WSN) which sends data in real time to a server. The proof-of-concept of the system prior to 2020 was described in detail by Jans-Singh et al. (2020). Since then, the sensors have been updated and the system developed further. At the time of writing there are 26 sensors, comprising:
• 15 temperature and relative humidity sensors reporting at 10 min intervals,
• 2 PAR sensors reporting at 10 min intervals,
• 1 air velocity sensor reporting at 10 min intervals,
• 2 temperature probes for the growing medium reporting at 10 min intervals,
• 2 CO2 sensors reporting at 10 min intervals, and
• 2 energy meters, reporting electricity consumption for the ventilation system and lighting every half hour.
Temperature and relative humidity sensors are located throughout the farm, with one at each end and three at the centre of the main farm, positioned on the bottom, middle and top benches to capture the vertical stratification. There are 2 in the propagation zone, one in the fridge, one in the tunnel beyond the main farm and 6 located in the R&D area. The 2 PAR sensors and temperature probes are located in the R&D area, and there is 1 CO2 sensor in the main farm and one in the R&D area.
The increasing ease of deploying sensor networks and accessing cloud storage means that systems do not need to be locked in to one data source. For example, the data from the temperature and relative humidity, air velocity, PAR and CO2 sensors are all transferred via an Aranet Gateway base station to a data platform, Zensie 30 MHz (Zensie 2022), despite the sensors themselves being procured from different manufacturers and suppliers. Also transmitted are current weather data from Meteomatics (Meteomatics 2022). The proprietary 30 MHz platform provides an accessible Application Programming Interface (API) for connection to the CROP digital twin. The 2 energy meters transfer data to another platform, operated by Stark (Stark 2022), which also provides information that is accessible to CROP.
From a pragmatic viewpoint, it is optimal to separate the provision and maintenance of the sensor network and associated hardware from that of the digital twin platform. The hardware is part of the physical system and hence needs to be actively maintained by facility managers, whereas maintenance and development of the digital twin platform requires a software engineering skillset.
The digital twin presented in this paper demonstrates the potential of a platform which can draw different data sources together into one place, integrate both statistical and engineering models and update them continuously using live-streamed data, and provide meaningful data interpretations and forecasting tools.

The CROP digital twin infrastructure
Despite being a unique physical system, the CROP digital twin has been developed using a generalizable approach, the foundations of which are as follows:
• live-streamed data provide the basis for the digital twin,
• extraction of information from the data is essential in order to maximize usefulness, and this requires models of the physical system,
• visualization of the model outputs delivers actionable information to the operator at a requisite time resolution,
• the context of the system defines the nature of the required information, and
• physics-based predictive models provide complementary information that can further advise decision making.
Following this foundation, the CROP digital twin comprises key components which can be generalized to any digital twin for the built environment. At its core is a database which stores the data that come from the GU environmental monitoring system and the energy consumption meters. Functions have been developed to interrogate the relevant APIs (Application Programming Interfaces) for the different data sources and import the data into the database. The database stores data in a query-able format and can only be accessed by selected users and through IP verification. A front-end web platform communicates with the database and gives an immediately intuitive visual picture of the environmental conditions and energy consumption in the farm, including display on a 3D visualization platform. This is the front-end with which users can view meaningful statistics, interact with the data, and download them. At the back end is the code which governs the entire infrastructure, creates the database, and handles automatic data ingestion, collation and updating for modelling and analysis (The Alan Turing Institute 2022). Also observable on the platform are the model outputs: the 3D visualization model with an interface to locate the data sources spatially, and to visualize the outputs from the predictive models forecasting farm performance. The final component is the system that runs models in real time based on the latest available data, i.e. the virtual machines running the models. The structure of the CROP digital twin is illustrated in Figure 2.
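As a sketch of the ingestion step, assuming a deliberately simplified single-table schema (the actual CROP schema, available in the open-source repository, is richer), readings fetched from a platform API might be written to the database as follows.

```python
from sqlalchemy import Column, DateTime, Float, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Reading(Base):
    """Simplified stand-in for the CROP sensor-reading tables."""
    __tablename__ = "readings"
    sensor_id = Column(String, primary_key=True)
    timestamp = Column(DateTime, primary_key=True)
    value = Column(Float)

engine = create_engine("postgresql://user:password@host/cropdb")  # hypothetical DSN
Base.metadata.create_all(engine)

def ingest(rows):
    """Insert a batch of readings (a list of dicts keyed by column name)."""
    with Session(engine) as session:
        session.add_all([Reading(**row) for row in rows])
        session.commit()
```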
CROP is hosted on the Microsoft Azure cloud computing platform (Microsoft 2022) and is implemented using a well established software stack. First and foremost, the digital twin provides an accessible repository for various data streams. The data are stored on Azure in a PostgreSQL database, an established and well-supported open-source database system. To date no limits have been imposed on the size of the database so there has been no requirement to archive the historically oldest data from the database. In addition to the numeric data, the database has been developed to include crop growth data which incorporates both categorical and numeric information.
The CROP software is predominantly written in Python (Python Software Foundation 2022), using Flask (The Pallets Projects 2022) and SQLAlchemy (SQLAlchemy 2022) for the backend and API, and Jinja (The Pallets Projects 2022) and Bootstrap (Bootstrap 2022) for frontend presentation. Docker (Docker Inc. 2022) is used as a means of deploying web applications and automated ingress functions on Azure. Though there are options for many different software stacks, this configuration allows flexibility, adaptability and easy version control. The combination of SQLAlchemy, Flask and Jinja allows the backend and API to be developed entirely in one language (Python), which simplifies the development process. In that way there is a core system to which different front-end applications can modularly attach and be used to visualize the real-time information.
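A minimal sketch of how these pieces compose: a Flask route queries the PostgreSQL database through SQLAlchemy and hands the result to a Jinja template. The route name, table and template here are illustrative, not the actual CROP code.

```python
from flask import Flask, render_template
from sqlalchemy import create_engine, text

app = Flask(__name__)
engine = create_engine("postgresql://user:password@host/cropdb")  # hypothetical DSN

@app.route("/sensors/<sensor_id>")
def sensor_page(sensor_id):
    """Serve the most recent day of 10-min readings for one sensor."""
    with engine.connect() as conn:
        rows = conn.execute(
            text(
                "SELECT timestamp, value FROM readings "
                "WHERE sensor_id = :sid ORDER BY timestamp DESC LIMIT 144"
            ),
            {"sid": sensor_id},
        ).all()  # 144 rows = 24 h of 10-min readings
    return render_template("sensor.html", sensor_id=sensor_id, readings=rows)
```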
In our case we developed two front-end systems, both presented via the same webapp. The dashboards are developed in Python and Javascript, and are used to show the raw data and analytics from the farm's sensors by location, and from simulated forecasts (described further in Section 3.3). On top of this, a Unity 3D representation of the farm was developed to interactively show live data (Unity Technologies 2022). Unity is a game engine which can produce content in WebGL, and can therefore be used for the production of interactive online 3D content.
The advantage of this design approach is that there are two independently configurable systems. First, there is a classic dashboard interface based on HTML standards, which can be enhanced at will, meaning that new charts or models can be added as simply as extra pages in an HTML index. Secondly, the Unity 3D interface adds context to the dashboard by showing the conditions of the farm in real time, in an 'identical' digital replica. This dual approach integrates the Building Information Modelling (BIM) data from the farm with the Internet of Things (IoT) data coming from the sensors and the prediction data arriving from the models.
The CROP digital twin infrastructure is open source and available to be used as a basis for the development of other digital twins (The Alan Turing Institute 2022).

Interactivity and usability
The infrastructure described above provides the foundation for usability. The current evolution of the CROP digital twin comprises an enhanced visualization platform, with embedded modelling capability that generates information which can be used to inform operational decisions. The 3D visualization of the farm and sensor values gives an immediate view of environmental conditions within the farm, shows the locations of the sensors on the 3D model, and integrates BIM data for the future development of facility management via the digital twin. However, this does not in itself present a substantially greater benefit than that offered by energy management systems. What adds value are the models embedded within the platform, which are continuously updated with the live-streamed data and provide outputs that track the health of the sensing infrastructure, forecast future environmental conditions, and allow the user to perform 'what-if' scenario evaluations to assess the impact of potential operational changes. The utility of these models is an important measure of the performance of the digital twin.

Data visualization and interpretation
A key aspect of CROP is the ability to visualize the environmental conditions throughout the farm at any moment in time. The Unity 3D model includes a 3D representation of the farm layout with the locations of the sensors, and a user is able to zoom in to different sections of the farm, identify sensor locations, observe environmental conditions in (close to) real time, and download historic data (Figure 3).
The environmental conditions at any given time may or may not indicate that action must be taken. Interpretation of the data is required, and it saves the operator time if the data can be presented in different ways that facilitate straightforward assessment of the impact. Specifically, visualization of current and recent historical conditions facilitates more informative interpretation of the data. However, while plotting interpretative or aggregated data may seem like an obvious task, typical data platforms are not designed for data aggregation and interpretation.
What is important from the point of view of growing crops is not so much the instantaneous environmental conditions, but more the cumulative conditions over the growing period. For the micro-salads grown at GU, which are typically harvested after 7-10 days in the farm, the consistency of the environmental conditions over the previous week is a significant factor; for example, if the temperature has regularly dropped below the optimal range then crops will be less well advanced, and operational conditions may need to be modified to ensure harvesting is feasible on the required date. In order to provide this information, the data are processed and presented visually as illustrated in Figure 4. For each of the monitored locations, the current temperature, relative humidity, and vapour pressure deficit (VPD, calculated from the temperature and relative humidity) levels are presented in large bold font, with smaller values indicating the maximum and minimum values observed at that location in the previous 6 h. In addition, graphics are provided that illustrate, over both the previous 24 h and the previous 7 days, how many hours the environmental conditions have been within certain ranges. The ranges are colour-coded according to the desired range, but the absolute values differ according to the location; the figure illustrates how the ranges are revealed when the user's mouse hovers over the specific section of the graphic. In the example given, over the previous week at the back-farm location the VPD has been above 600 Pa for 60 of the week's 168 h.
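For reference, the VPD shown on the dashboard can be derived from temperature and relative humidity as sketched below. The Tetens approximation used here for saturation vapour pressure is one standard choice; the exact formulation used in CROP is not specified in this paper.

```python
import math

def vapour_pressure_deficit(temp_c, rh_percent):
    """VPD in Pa from air temperature (deg C) and relative humidity (%)."""
    # Saturation vapour pressure in Pa (Tetens approximation, over water)
    svp = 610.78 * math.exp(17.27 * temp_c / (temp_c + 237.3))
    return svp * (1.0 - rh_percent / 100.0)

# At 24 deg C and 80% RH the VPD is roughly 600 Pa, the threshold
# highlighted in the dashboard example described above.
print(round(vapour_pressure_deficit(24.0, 80.0)))  # ~597
```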
The simplest way of using this information is to notify the user when conditions are sub-optimal. The visualization illustrated in Figure 4 colours red when the temperature has been too high for any period in the previous 24 h or 7 days and gives an immediate warning that conditions are not right. In addition, this approach gives visual information concerning sensor health, as the graphic is coloured grey where data are missing indicating the sensor is not transmitting data as required.
Also given are plots that show how consistent the spatial distribution of conditions has been over the previous week ( Figure 5). As might be expected and observed in Figure 5(a) the temperature is consistently hotter at the top than the bottom of the benches. Perhaps more surprisingly there is a consistent difference between the relative humidity between the front and the back of the farm, with the back of the farm being considerably more humid ( Figure 5(b)). This is significant and requires monitoring, as too high a humidity can be detrimental to the crops.

Data-centric forecasts
The continuous streaming of monitored data lends itself nicely to statistical analysis and forecasting, as data-centric methods facilitate identification of changes to farm operations (e.g. a new lighting schedule, a changing ventilation pattern) by learning from the data. Two different types of forecasting model are described in detail by Jans-Singh et al. (2020). The models forecast temperature, which exhibits distinct diurnal variation affected by the heat input from the lights, and seasonal fluctuations. The first model is a static model using a seasonal autoregressive integrated moving average approach (SARIMA), trained on temperature data alone. For comparison, a dynamic linear model trained using energy data as a proxy for light operation, in addition to the temperature data, was also developed. The comparison illustrated the benefits of using the energy data as an additional parameter to train the models: as the temperatures in the farm are strongly correlated with the heat input from the grow lights, additional information related to the light operation improves the forecast results when deviations from standard operation occur. However, this dynamic model has a much longer run time, hence for inclusion within the digital twin an extended version of the SARIMA model was developed, which includes the energy demand as an exogenous input (SARIMAX), thereby combining the best attributes of both models (Hyndman and Athanasopoulos 2021).
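A minimal sketch of a SARIMAX fit of this kind using statsmodels is given below; the model orders are placeholders (the fitted orders are not reported here), and a 24 h seasonality is assumed for hourly data.

```python
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

def fit_and_forecast(temp, energy, energy_future, horizon=48):
    """Fit temperature with lighting energy as exogenous input; forecast `horizon` hours.

    `temp` and `energy` are aligned hourly pandas Series over the training window;
    `energy_future` covers the forecast horizon (e.g. the scheduled lighting energy).
    """
    model = SARIMAX(
        temp,
        exog=energy,
        order=(1, 0, 1),               # placeholder non-seasonal order
        seasonal_order=(1, 0, 1, 24),  # assumed daily seasonality for hourly data
    )
    fitted = model.fit(disp=False)
    forecast = fitted.get_forecast(steps=horizon, exog=energy_future)
    return forecast.predicted_mean, forecast.conf_int()
```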
The SARIMAX model is trained using the previous 240 days of data and the forecasts are updated twice a day, with the updated forecast uploaded to the database and plotted on the dashboard. The analysis has been set up to run the forecast at 4 am and 4 pm every day, using data up to 12 h in the past and forecasting 48 h into the future; so, running at 4 am, the forecast actually runs from 4 pm the previous day for 48 h. The schedule has been set up in this way as the information that the forecast provides is typically only useful to the farm operators twice a day: at the start of the working day, so that preventative measures can be taken to minimize sub-optimal environmental conditions, and at the end of the working day, so that changes can be made if the forecast requires intervention overnight.
This approach works well, as the plots on the dashboard show the evolving conditions for comparison against the forecast, and an operator can see both the forecast conditions and assess how closely the actual conditions match the forecast. Not only does this forewarn an operator of rising temperatures, but it also helps to identify potential faults where temperatures are not following an expected pattern, for example if the grow lights do not turn off automatically when scheduled. Figure 6 illustrates the output of the SARIMAX model compared against monitored data. The grey continuous line is the monitored data, plotted for two days before the forecast time and continuing in real time for comparison. The forecast, with the mean shown as the red dot-dash line and the upper and lower bounds indicated with the pale pink lines and shaded in grey, runs from 4 pm for two days into the future. Under these conditions the forecast gives a good prediction of future temperatures.
As this is a data-centric model, the quality of the data is a critical factor affecting the prediction result. An initial task is to align the data by timestamp and aggregate the data if necessary to an appropriate format. Setting up the appropriate procedures for the different data streams can be time consuming, particularly when, as in this case, the data arrive at different times and in different temporal formats. In CROP the cleaned data are aligned by UTC time and are aggregated to hourly values.
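A sketch of this alignment step with pandas, assuming timezone-aware raw streams arriving at 10-min and half-hourly intervals respectively; both are converted to UTC and aggregated to hourly values before being joined.

```python
import pandas as pd

def align_hourly(temperature: pd.Series, energy: pd.Series) -> pd.DataFrame:
    """Align a 10-min temperature stream and a half-hourly energy stream on UTC hours."""
    temperature = temperature.tz_convert("UTC").resample("1h").mean()
    energy = energy.tz_convert("UTC").resample("1h").sum()  # energy accumulates
    return pd.concat({"temperature": temperature, "energy": energy}, axis=1)
```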
Once cleaned, specific ongoing issues for this model are the accuracy and completeness of the data. In assessing the forecasting methodologies, Jans-Singh et al. (2020) imputed missing temperature measurements in the historical data using a random forest algorithm with neighbouring sensor data, outside weather conditions and the hour of the day. The relative humidity missing data were imputed using a normal multivariate regression using neighbouring relative humidity sensor values and external conditions. For working with the live-streamed data, however, a more pragmatic data-driven method is required. For temperature forecasting, a seasonal ARIMA model does not require missing data to be imputed, and so the simplest approach is not to impute missing temperature data. This is arguably the correct approach, and the ARIMA package in R will generate a forecast even with missing data not imputed (Hyndman and Athanasopoulos 2021). For example, Figure 7(a) shows the temperature forecast for 48 h following a training period of 240 days (using a seasonal ARIMA model without the exogenous input). In the training data, approximately 15% of the data are missing, with just over 48 h missing prior to the forecast date of 22nd October. The monitored data are shown in blue and the mean forecast as a dashed red line with the 90% confidence limits shaded in red. The forecast is clearly influenced by the data for the preceding 48 h and follows the downward trend. But given our prior knowledge of the farm conditions, we know that the patterns of temperature are quite regular and that temperatures below 20 °C are unlikely. For comparison, Figure 7(b) shows the temperature imputed using a simple calculation procedure: the monitored temperature data are grouped according to the hour of the day, the day of the week and a 'pseudo-month', equivalent to 30 days of data. Any missing data points are filled in according to these groupings (shown in green on the figure). This approach works provided the missingness is random (i.e. not every Monday at 3 pm) and that no more than 30 days of data are missing. The new forecast temperature illustrated on Figure 7(b) is more realistic based on our understanding of the farm operation, and indeed is closer to the actual monitored temperature shown in blue.
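The grouping imputation described above might be implemented as in the sketch below, taking the 'pseudo-month' as successive 30-day blocks of the record; this is a reconstruction of the described procedure rather than the exact CROP code.

```python
import pandas as pd

def impute_by_grouping(temp: pd.Series) -> pd.Series:
    """Fill gaps with the mean of the matching (hour, weekday, pseudo-month) group.

    Valid only while the missingness is random (e.g. not every Monday at 3 pm)
    and no more than about 30 days of data are missing.
    """
    idx = temp.index
    pseudo_month = (idx - idx[0]).days // 30  # successive 30-day blocks
    keys = [idx.hour, idx.dayofweek, pseudo_month]
    group_means = temp.groupby(keys).transform("mean")
    return temp.fillna(group_means)
```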

Scenario evaluation
The data-centric model can only forecast if operational conditions remain the same or can be inferred from the monitored data. In order to assess the impact of changes to operation, a simulation model is required. For CROP, we are using a greenhouse energy simulation (GES) model (Ward, Jans Singh, and Choudhary 2018) specifically modified to represent the tunnel conditions. The GES 1D model simulates the heat and mass transfer through a vertical slice of the central section of the farm. Optimal conditions for crop growth in the tunnel require maintenance of stable temperatures and relative humidity. The primary heat source is the LED grow lights, and the two routes for heat exchange from the tunnel are conduction to the surrounding ground and air exchange through the ventilation system, both of which are dependent on the temperature differential. Air temperature and relative humidity are calculated as a function of time, by solving heat and mass exchange equations developed for the tunnel geometry, subject to the deep soil temperature and the temperature, moisture content and CO 2 concentration of the incoming ventilation air. A full description of the model is given in Ward, Jans Singh, and Choudhary (2018).
The beauty of a physics-based simulation tool is that it can be used both to illuminate non-measurable processes and to simulate a change in environmental or operational conditions. The GES model can be used to assess 'what if' scenarios, such as 'what happens if there is a heatwave', 'what happens if an extra dehumidifier is added', or 'what happens if the lighting schedule is changed'? Typically, a physics-based model of this type would be developed for the site in question, calibrated using experimental data, and the output validated against monitored conditions. Within a digital twin, to ensure that at any time the model is as good a representation of the physical system as possible, the model must be calibrated continuously using the live data. If all system parameters are measurable, this is relatively tractable; however, for the GES model there are a number of parameters that cannot be measured directly, but instead must be inferred from monitored data to ensure comparability of the model output with the environmental conditions. A tool for continuous calibration of this model using a particle filter approach has been developed for implementation in CROP (Ward et al. 2021).
Heat and mass transfer within the GU tunnels is particularly dependent on the ventilation rate and the velocity at which air travels across wet surfaces, such as the plants and growing medium. The ventilation system within the tunnels is antiquated and, while controlled manually using a dial, there is little information available to quantify the actual total ventilation rate in the tunnels in air changes per hour. Similarly, the air velocity in the tunnels is directly impacted not only by fan location and operation, but also by non-measurable factors such as the presence of farm operatives, movement of trays and plant growth. The particle filter methodology for continuous calibration of the GES model takes as its input the streamed external weather conditions and heat input from the lights, together with the monitored relative humidity data at the centre of the farm. A particle filter is a sequential Bayesian inference technique, in which a large sample of possible parameter estimates is filtered according to their likelihood given some data. The particles are possible values of the uncertain model parameters, here ventilation rate, internal air velocity, and a length scale parameter. As a first step, a large number of particles, or sets of possible parameter values, are sampled from the possible ranges of values for each parameter. The GES model is then analysed using the external weather as input and the simulated relative humidity is calculated for each particle. Based on the monitored RH data, the likelihood of each particle is assessed, and the particles are re-sampled taking the new likelihood into account, generating a new sample where each particle has an equal probability of being the 'true' set of values. The process is repeated at the next timestep: the likelihoods of the new particle set are updated using the next value of the monitored data and the values predicted by the model using the parameter values from the most recent particles, and the particles are then re-sampled with the new likelihoods. This repeating process results in a set of parameter values continuously updating in line with the monitored data, ensuring the best agreement of the model with the data.
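To make the procedure concrete, a compact bootstrap particle filter over the three uncertain parameters is sketched below. The prior ranges and the likelihood noise scale are assumed values for illustration, and `run_ges` is a hypothetical stand-in for a call to the GES model.

```python
import numpy as np

rng = np.random.default_rng(42)
N = 1000  # number of particles

# Step 1: sample particles from assumed prior ranges for
# (ventilation rate [ACH], internal air velocity [m/s], length scale [m]).
particles = np.column_stack([
    rng.uniform(0.5, 10.0, N),
    rng.uniform(0.05, 1.0, N),
    rng.uniform(0.1, 5.0, N),
])

def filter_step(particles, rh_observed, weather, sigma=2.0):
    """One assimilation step: weight each particle by likelihood, then resample."""
    # Step 2: run the model for each particle (run_ges is a hypothetical
    # stand-in for evaluating the GES model with those parameter values).
    rh_predicted = np.array([run_ges(p, weather) for p in particles])
    # Step 3: Gaussian likelihood of each particle given the monitored RH.
    weights = np.exp(-0.5 * ((rh_predicted - rh_observed) / sigma) ** 2)
    weights /= weights.sum()
    # Step 4: resample with replacement so each surviving particle is an
    # equally probable estimate of the 'true' parameter set.
    keep = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[keep]
```

Repeating `filter_step` at each new monitored value yields the continuously updating parameter set described above.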
To illustrate the process, Figure 8 shows the output of the GES model both before and after the calibration starts. Monitored data are shown as discrete black dots, whereas the simulation output is shown as a solid red line. The first half of the analysis is uncalibrated, with calibration starting after 10 days, after which calibration occurs every 3 h. In the top figure, although in the uncalibrated section of the analysis the predicted relative humidity (shown as the solid red line) is in reasonably good agreement with the mean monitored relative humidity, there are peaks in RH that are not predicted by the model. The relative humidity values used for the calibration are identified on the plots as blue crosses. After calibration starts, the calibrated model shows a closer agreement with the monitored data, both for the relative humidity and the temperature (bottom plot).
Within CROP, the GES model is included in order to facilitate scenario evaluation with respect to the following parameters: (1) ventilation rate, (2) number of dehumidifiers, and (3) time at which the lights are switched on/off. Figure 9 shows the potential impact of changes made to the operational settings as calculated using the GES model. Prior to February 15th at 12 am, the GES model is calibrated as detailed above. With the calibration points incorporated every 3 h, the simulated relative humidity shown in red matches the monitored relative humidity shown in grey closely, and the agreement between the simulated and monitored temperature is good. The solid red line shows the output from the calibrated GES model, with the grey shaded area indicating the 90% confidence limit of the simulation output, i.e. the region between the 5% and 95% quantiles of the outputs calculated using the sampled particles. From the 15th February at 12 am onwards there are two scenarios; the first is a 'Business As Usual' (BAU) scenario, continuing the red solid line and grey shaded confidence limits. For this it is assumed that the calibrated parameter values are constant at the values derived at the forecast timepoint, i.e. February 15th at 12 am. The second scenario is a selected operational change. The blue dashed line shows the scenario output; here the scenario comprises a reduction in ventilation rate to 1 ACH, no change to the number of dehumidifiers, and a shift in the timing of the lights so that they turn on 3 h earlier. When compared against the BAU simulation, the scenario exhibits a higher temperature and relative humidity, as might be expected for a reduction in ventilation rate. The impact of a change to the timing of the lights is also clearly visible, as the temperature is predicted to start to rise at an earlier time of day corresponding to the earlier switch-on time of the grow lights.
Monitored relative humidity data are used in the calibration of the GES model, and as for the temperature forecasting tool described in Section 3.3.2, the model output will be dependent on the quality of the data. Considering missing data, there are two sources of data for which different approaches have been used. The first is the external weather data interacting with the farm via the ventilation system. If data are missing from this dataset, linear interpolation is currently used to fill in the gaps. This approach is suitable for random, isolated missing values, but should longer periods of missing data be observed, it may be necessary to impute more accurately. The second source is the monitored relative humidity values used as calibration data points. Here, if a value is missing, it is assumed to be the same as the previous value. This effectively means that the model remains in the same state until a new, different, relative humidity value is obtained.
This model is also impacted directly by information that is not currently automatically communicated to CROP, for example the quantity of operational lights and lighting operation times, the number of dehumidifiers continuously in operation, the crop density in the farm and the ventilation system settings. Some system parameters can be inferred from the monitored data, as exemplified by the use of electricity consumption data as a proxy for lighting operation times in the ARIMA model. But without additional information, it is difficult to infer the number of lights or dehumidifiers operational, or crop densities, directly from the electricity consumption. Currently these parameters are updated manually; this is one area for improvement as the digital twin is developed further.

The informativeness of data
The previous sections set out the challenges of working with live streamed data, and the way in which the CROP digital twin is set up to meet those challenges. In this section we explore the informativeness of the available data and the ways in which exploitation of that informativeness defines the purpose of the digital twin.

Converting data into useful information
Data on their own may not be immediately useful. Consider the value of a single temperature data point in the GU farm: in isolation the single data point is meaningless, but it becomes useful when compared against the history of values at that location, against other data points in different locations at the same time, or against known acceptable limits of temperature at that location. Any one of these comparisons converts the data to information that can be acted upon. And while some comparisons may be made automatically by an experienced operator observing that data point, for example knowing whether the value is acceptable without having to look up the acceptable range, other comparisons require computational processing of the data to extract the information. Automatic processing and visualization provide the information without requiring further processing on the part of the operator. This facilitates efficient decision making. Particularly useful is the facility to immediately view whether the conditions in the farm have been optimal for plant growth over the previous week, which helps the farm operatives to estimate whether harvest targets will be met on time.
Fundamentally the usefulness of the data is affected by the data quality, particularly timeliness, accuracy and completeness. The absolute and relative significance of these three factors may be different for different stakeholders.
The timeliness of the data depends on whether the update frequency is adequate for the function and purpose of the visualization and embedded models, and it is most efficient to use the maximum time period that still facilitates adequate updating across all uses. For example, updating daily may be adequate for oversight of a process but hourly data may be required for operational purposes; for an immediate view of current conditions, visualization may need to be in as close to real time as possible. For different stakeholders, the frequency at which data are required may also be different, with system operators requiring data at far more regular intervals than the business team or software engineers who may only require data when the system is performing sub-optimally or failing.
To ascertain the adequacy of the update frequency for a model, the purpose of the model is key. In simple terms, the ambition of a platform that responds and provides visualizations in real time imposes specific model requirements with respect to data access and run time, but if the results of the model are not needed in real time then the requirements for timeliness may be less onerous. In CROP, the data-centric ARIMA forecasting model runs quickly (< 5 min elapsed time) and it would be feasible to run it every 10 min if necessary, corresponding to the arrival of the environmental monitoring data in the database. However, the electricity consumption data are only monitored every half-hour, immediately limiting the frequency at which the forecast can be made. But considering the purpose and function of the model, as described in Section 3.3.2, the information that the forecast provides is typically only useful to the farm operators twice a day. As a consequence, the SARIMAX model has been scheduled to run twice a day and the update frequency of the data is more than adequate. Calibration of the GES model is designed to ensure that the scenario evaluation tool uses a model that simulates the response to a system change as closely as possible to the physical system response. To this end, it is sufficient to run the calibration once a day, so that a recently calibrated model is available whenever a scenario evaluation is requested. The calibration is run every morning at 5 am and the calibrated parameters are saved for use in subsequent scenario evaluations.
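A schedule of this kind can be expressed compactly. The sketch below uses the APScheduler library with placeholder job functions; the forecast run times are assumptions, as the paper only states that the model runs twice a day, and the scheduling mechanism actually used by CROP is not specified here:

```python
# Illustrative scheduling sketch using APScheduler; the job functions and
# the forecast run times are placeholders, not the CROP implementation.
from apscheduler.schedulers.blocking import BlockingScheduler

def run_sarimax_forecast():
    print("running SARIMAX forecast")  # placeholder for the forecasting job

def run_ges_calibration():
    print("running GES calibration")   # placeholder for the calibration job

sched = BlockingScheduler()
# Forecast twice a day, matching how often the output is useful to operators
# (the hours 6 and 18 are assumed for illustration).
sched.add_job(run_sarimax_forecast, "cron", hour="6,18")
# Calibrate once a day at 5 am so a recently calibrated model is available.
sched.add_job(run_ges_calibration, "cron", hour=5)
sched.start()
```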
In CROP, identification of incorrect values is not currently attempted. Incorrect values can arise from errors in the sensors; inaccuracies of this type may typically be observed in the monitored electricity consumption data, where the meter may fail to register a value in one time period and register a double value in the next. Drift in sensor values can also occur, and calibration of the sensors is an essential part of system maintenance.
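Although no such screening is implemented in CROP, a simple rule for the meter failure mode described above might look like the following sketch; the window length and tolerance are assumptions:

```python
# Hypothetical screening rule, not part of CROP: flag a meter reading that is
# roughly double the typical value and preceded by a missing/zero interval,
# the failure mode described in the text.
import pandas as pd

def flag_double_counts(kwh: pd.Series, tol: float = 0.25) -> pd.Series:
    """Return a boolean mask marking suspected double-counted readings."""
    # Typical consumption over roughly one day of half-hourly readings.
    typical = kwh.rolling(window=48, min_periods=8).median()
    prev_empty = kwh.shift(1).fillna(0) == 0
    looks_double = (kwh - 2 * typical).abs() <= tol * typical.abs()
    return prev_empty & looks_double

readings = pd.Series([5.1, 5.0, 0.0, 10.2, 5.2, 4.9] * 8)
print(readings[flag_double_counts(readings)])
```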
Missing data are a key issue. The missingness is in itself important, as it may be an indicator of sensor and system health, of use to farm operators and software engineers. To notify the farm operatives, the enhanced visualization on the CROP dashboard indicates when there are gaps in the streamed temperature and relative humidity data. The previous sections have described the methodologies employed in CROP to minimize the impact of missing data on the models. This impact depends on the data stream and the use of the data. For example, missing temperature data from the sensors have a direct impact on the results of the temperature forecasting model, whereas gaps in the weather data are less significant, as the external weather only affects the internal conditions indirectly.
CROP currently uses four different approaches in different missing data situations:
• the enhanced visualization identifies missing data and displays the number of missing values in grey,
• in the forecasting model, the monitored data are grouped according to the hour of the day and the day of the week over the previous 30 days, and missing data are filled in according to these groupings (sketched below),
• missing weather data in the scenario evaluation (GES) model are infilled using linear interpolation, and
• if monitored relative humidity values used for calibration of the GES model are missing, the value is assumed to be the same as the previous value.
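A minimal sketch of the second approach, assuming half-hourly data in a pandas series; filling with the group mean is an assumption, as the fill statistic is not specified:

```python
# Sketch of hour-of-day / day-of-week imputation over a 30-day window.
# Filling with the group mean is an assumption for illustration.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2023-01-16", "2023-02-15", freq="30min")
temp = pd.Series(20.0 + 2.0 * rng.standard_normal(len(idx)), index=idx)
temp.iloc[100:110] = np.nan  # simulate a communication gap

# Group by (day of week, hour of day) and replace gaps with the group mean
# computed over the rest of the 30-day window.
groups = [temp.index.dayofweek, temp.index.hour]
temp_filled = temp.fillna(temp.groupby(groups).transform("mean"))

print(temp.isna().sum(), "->", temp_filled.isna().sum())
```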
The infrastructure of the digital twin, illustrated in Figure 2, provides the mechanisms by which the live streamed data are communicated and incorporated into the digital twin platform and models. However, it is also essential to have procedures in place for identification, and notification, of system or operational changes. If these data are incomplete, there can be a significant impact on operation of physics-based models such as the GES model used in CROP, as incorrect system information can lead to significant discrepancies between simulation output and the real physical performance.

The importance of context
More nuanced than the availability or otherwise of monitored and system data is the informativeness embedded in the data that are available. The useful information that can be extracted from the data depends directly on the system context. In the GU farm, information that can be directly extracted from the streamed data relates to the environmental conditions, and includes both exceedance of limits and trends over time. Correlating these data with crop growth rates and yields provides valuable insights that can improve farm efficiency and reduce waste (Jans-Singh et al. 2019). A different system will require different information; for example, a digital twin of a fluid pump system might require information related to flow rates and fluid levels. It is necessary to ensure that the data collected facilitate generation of the required contextual information, i.e. the parameters monitored must either provide information directly relevant to the system function, or be used in predictive models that provide information related to the future functioning of the system. Equally, the same data can offer different insights to different stakeholders. Energy consumption data are a case in point, essential for quantifying the farm running costs but equally useful as a proxy for the operational state of the lighting system.
The GES model calibration results in calibrated values for the ventilation rate and internal air speed that, when substituted back into the model, ensure the best match of the model to the monitored relative humidity data. Relative humidity was chosen as a calibration parameter as it depends on both temperature and air moisture content, and is primarily affected by bulk and local air movement. The relative simplicity of the GES model compared with the complexity of the physical system means that an exact match is not expected, but the calibration process ensures that the model output is as close to reality as possible. However, the temperatures output by the model are not solely dependent on the ventilation rate and internal air speed; far more significant is the operational schedule of the lights, as the lighting is the principal heat source in the farm. Using relative humidity alone as a calibration parameter calibrates just the portion of the temperature behaviour that is dependent on the global and local air exchange; ensuring that the temperatures output by the model are close to the monitored values also requires that certain operational parameters, namely the number of lights in use, are specified accurately.
For model calibration it is important that the data used for the calibration contain sufficient information across the parameter space. It has previously been demonstrated that the level of useful information for identification of calibration hyperparameters and model bias can be very dependent on the subset of data selected (Menberg, Heo, and Choudhary 2019). In the case considered here, i.e. calibration of the GES model, the time at which the calibration is performed affects parameter inference and hence the output of the calibrated model. For example, consider Figure 10. Figure 10(a) shows, for two scenarios, the parameters inferred from a calibration performed every 12 h, i.e. the calculated values of ventilation rate, N, and internal air speed, IAS. The two scenarios have different starting times, the first starting at 5 am - so calibrating at 5 am and 5 pm daily - and the second at 8 am, so calibrating at 8 am and 8 pm daily. In the figure, the mean inferred parameter value is indicated by a line, with the shaded area representing the 90% confidence interval. In the first scenario (red dotted line), the monitored relative humidity values used for the calibration at 5 pm are typically quite different from the values at 5 am. Using calibration values that fluctuate significantly results in inferred parameters that fluctuate similarly. By comparison, the monitored relative humidity values at 8 pm (blue dashed line) are typically similar to the values at 8 am, i.e. there is much less fluctuation from one calibration point to the next. Figure 10(a) illustrates this difference; the fluctuation in the inferred parameter value is particularly noticeable for the internal air speed for the 5 am/5 pm calibration, whereas calibration at 8 am/8 pm results in a much smoother variation in the parameter values. The particle filter methodology used here is better able to follow smooth trends, and this is reflected in the inferred parameters. Figure 10(b) illustrates the impact of using each of these sets of inferred parameters in the GES model to calculate relative humidity and temperature. The simulation based on calibration at 5 am and 5 pm shows poor agreement with the monitored relative humidity (red dotted line), whereas the simulation based on calibration at 8 am and 8 pm shows much better agreement (blue dashed line). This is a result of incorporating less noisy parameter values into the GES differential equation solver, as a smoother variation in parameter values is computationally less demanding to solve.
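For orientation, one particle filter update per calibration point can be sketched as follows. The priors, propagation noise, observation noise and the toy observation model simulate_rh are all illustrative stand-ins; the actual GES model is far richer than this surrogate:

```python
# Minimal bootstrap particle filter sketch for sequential calibration of two
# GES-style parameters (ventilation rate N, internal air speed IAS) against
# monitored relative humidity. `simulate_rh` is a toy stand-in for the GES model.
import numpy as np

rng = np.random.default_rng(42)
N_PARTICLES = 500

def simulate_rh(n_ach, ias, rh_prev):
    # Toy surrogate: ventilation pulls internal RH towards an assumed
    # external RH of 70%; internal air speed damps the response.
    return rh_prev + n_ach * (70.0 - rh_prev) * 0.05 / (1.0 + ias)

# Initialise particles from broad priors over the two parameters.
particles = np.column_stack([
    rng.uniform(0.5, 10.0, N_PARTICLES),   # ventilation rate N (ACH)
    rng.uniform(0.05, 1.0, N_PARTICLES),   # internal air speed IAS (m/s)
])

rh_obs = [82.0, 80.5, 79.0, 78.2]  # fabricated 3-hourly RH calibration points
rh_prev = 83.0
sigma_obs = 1.0                    # assumed observation noise (% RH)

for rh in rh_obs:
    # 1. Propagate: jitter particles (random-walk parameter evolution).
    particles += rng.normal(0.0, [0.1, 0.02], particles.shape)
    particles = np.clip(particles, [0.1, 0.01], [15.0, 2.0])
    # 2. Weight: Gaussian likelihood of the observed RH under each particle.
    rh_sim = simulate_rh(particles[:, 0], particles[:, 1], rh_prev)
    weights = np.exp(-0.5 * ((rh - rh_sim) / sigma_obs) ** 2)
    weights /= weights.sum()
    # 3. Resample: multinomial resampling proportional to the weights.
    idx = rng.choice(N_PARTICLES, N_PARTICLES, p=weights)
    particles = particles[idx]
    rh_prev = rh
    print(f"RH={rh}: N_mean={particles[:, 0].mean():.2f}, "
          f"IAS_mean={particles[:, 1].mean():.2f}")
```

The 5% and 95% quantiles of the model outputs over the resampled particles would give confidence bands of the kind shown in Figures 9 and 10.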
To counteract the impact of the choice of calibration time on the output, it is possible to calibrate more frequently than every 12 h. In this way, extreme swings between consecutive relative humidity calibration values are avoided. Figure 11 compares the model parameters inferred, and the model output, when the calibration is run every 3 h against the calibrations run every 12 h. As can be seen, the calibrated internal air speed still exhibits significant fluctuations, but the relative humidity simulated by the GES model now matches the monitored data more closely. The calibrated ventilation rate is higher in this example than observed when calibrating every 12 h, leading to slightly reduced temperatures as more heat is exchanged with the farm exterior.
A measure of the quality of the calibration can be obtained by calculating the errors between the calibrated model output and the monitored data. Table 1 shows the mean bias error (MBE) and the root mean squared error (RMSE) for the temperature and relative humidity simulations over a number of different calibration options for a specific simulation period. The lowest temperature RMSE is observed for the 12-hourly calibration at 8 am and 8 pm, which also has a small MBE. Considering relative humidity, the smallest MBE is observed for the 12-hourly calibration at 8 am and 8 pm. However, the lowest RMSE is observed for the 3-hourly calibration, and this is currently the strategy deployed within CROP.
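For reference, the two error measures follow the standard definitions over the n simulation timesteps; the sign convention (simulated minus monitored) is an assumption, as the paper does not state it:

```latex
\mathrm{MBE}  = \frac{1}{n}\sum_{i=1}^{n}\left(y^{\mathrm{sim}}_{i} - y^{\mathrm{obs}}_{i}\right),
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y^{\mathrm{sim}}_{i} - y^{\mathrm{obs}}_{i}\right)^{2}}
```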
Again, the context of the simulations must be considered when making decisions about the most suitable calibration strategy. For the purposes of CROP, the GES model is used to simulate scenario changes, and so while it is important that the model is as representative of reality as possible - which is ensured by the calibration - it is the difference in simulation outputs between the scenarios of interest that is of primary importance.

Discussion
The experience of developing the CROP digital twin has provided genuine insight into the practical challenges of gathering, processing, assimilating, and displaying diverse data sets. It is worthwhile reflecting on the lessons learned, and highlighting points that might benefit similar projects.
A key observation from this project is that development of a digital twin requires a diverse skill set within the development team. Expertise in data visualization, digital communications and application development is needed alongside the context-specific expertise. In a research environment, added to these are the complementary skills driving the development of the digital twin to maximize its benefits for all stakeholders. CROP has been developed both as an operational tool and as a research project by the University of Cambridge in partnership with the Alan Turing Institute and Zero Carbon Farms Ltd. The team has comprised engineering researchers and software engineers in addition to farm management and operational staff. As a result, the data communicated through the twin are of interest to a wide range of stakeholders for different purposes: efficient farm operation is key, but the operation of all aspects of the digital twin is also of interest to the development team, and the environmental and crop parameters are of interest to the research team.
It is also clear that there are a number of practical and analytical challenges when working with live-streamed data. The practical issues around accessing and communicating the data have been solved here using a suite of open source software, as described in Section 3.2 and accessible from a GitHub repository (GitHub 2020; The Alan Turing Institute 2022). With real-time data, the timeliness of database updates is an issue that needs to be considered in context. Here, the frequency of database updates has been tuned to the lowest frequency of the commercial platform updates; this is adequate for the case described here, as explained in Section 4.
It is important to explore and understand the informativeness embedded in the data being communicated to the digital twin. As described in Section 4, data informativeness impacts the twin in several different ways:
• by transforming the information into visualizations that can be immediately interpreted by system operators, timely interventions can be enacted,
• data-centric models require informative data if they are to generate realistic forecasts. When data are missing, it may still be possible to generate a forecast from the incomplete dataset; however, if there is sufficient information in previous data to facilitate imputing the gaps, imputation may be the better option, generating more realistic forecasts as observed in Figure 7, and
• datasets selected for calibration of mathematical models must contain sufficient information to enable robust calibration of the selected parameters.
Ensuring that the data being collected are sufficiently informative for diverse applications is a key requirement of digital twin design and development.
Working with live streamed data, as opposed to curated historical data, is one of the key challenges of the digital twin, and is particularly significant when implementing data-centric models. The models implemented here were initially developed using curated historical data; an important step in developing the models was cleaning the historical data and imputing values to fill any gaps, and considerable effort was expended in ensuring that the imputation was robust. With live streamed data, methodologies have to be developed to manage the quality of the data in close to real time in order to maintain model utility. In the case described here, the principal issue affecting data quality is missing data owing to breakdowns in data communication.
As detailed in Section 4, four approaches have been used within CROP to deal with different types of missing data. These approaches are pragmatic, in that they have been shown to work under the ongoing conditions in the farm, but they are also fragile: extended gaps in the data degrade the quality of model outputs and hence model utility, an issue that will be addressed in future studies. Equally, identification of incorrect values is not currently attempted; further development is required to implement strategies to identify values that are out of range or exhibiting unexpected drift over time.
Out-of-date information also poses issues, especially for parameters that must be updated manually in the digital twin. Automation of this process is a priority for future development.

Conclusions
In this paper, the practical challenges of using live streamed data in a digital twin have been explored. Since usable integration of data and models lies at the heart of a digital twin, this topic touches on all aspects of the twin, from the architecture and composite software infrastructure selected at the design stage, to the development and deployment of predictive modelling tools. The CROP digital twin developed for the Growing Underground farm in London, UK, has been described in some detail in order to give context to the challenges of working with live streamed data. This physical system has benefited from a digital twin solution that draws together data from diverse sources into one central database, from which the data are accessed and visualized on the CROP web-based platform.
A fundamental requirement of a digital twin is to communicate the outputs in manners that provide actionable information to the end users. To this end, CROP incorporates a number of features, both for enhanced visualization of current and historic data, and forecasting of future conditions. For example, a summary of the environmental conditions over previous time periods for each zone of the farm helps the farm operators to estimate yields of the crops growing within the zones. The forecasting models are used to avoid adverse conditions such as overheating or poor ventilation.
The CROP digital twin has reached a stage of development at which the farm operators rely upon it for efficient operation of the farm environment. Enhanced operational functionality planned for future developments includes expanded spatial monitoring and incorporation of crop yield data. This will facilitate exploration of the relationships between farm location, environmental conditions, and crop yield for the different types of crop grown in the farm.
From a research viewpoint there are several avenues to be explored. The open source development of the CROP platform infrastructure (available on GitHub) can be extended to wider applications across built environments (The Alan Turing Institute 2022). Future research will investigate the transferability of the infrastructure to other situations in which environmental control of the built environment is critical. As this paper discusses, one of the key challenges when working with data in real time is ensuring data quality. Improving ways of dealing with missing data has already been mentioned in Section 5 and will be a key priority for future research. In addition, methods for continuous validation of the statistical forecasting models have yet to be developed; future research will explore both the historic accuracy of the forecasts and the efficacy of different forecasting methods in this context.
Examples in the literature of digital twins operating in a commercial setting are scarce. The contribution of this paper is an open discussion of the practical challenges faced, and the solutions selected, when developing a digital twin in the built environment. Future projects would benefit greatly from a wider discussion of these challenges, and from the development of a consistent infrastructure framework that can support the development and operation of digital twins.