Energy and AI



Introduction
The energy landscape for the Low Voltage (LV) network is undergoing rapid changes. Energy no longer flows in one direction from a substation to consumers; consumers are now able to export energy produced from self-generation back to the network. Furthermore, with the social imperative of electrifying transport and heat/gas networks, demand for electricity will increase, elevating the risks to the LV networks, all the more so when the predicted increase in demand may be higher than the network capacity [1,2]. This has motivated the use of smart meter (SM) data for network management. However, expecting complete, error-free smart meter data from every node in the network is unrealistic, due to both operational issues (technical roll-out issues) and privacy concerns. Also, many operators manage very large and distributed networks of assets that vary in type and condition (i.e. varied cable ratings and state of health (SOH) of distribution network assets). Due to the age and complexity of the legacy LV network, there are also information gaps with respect to the asset base, i.e. knowledge about the exact cable types installed in each specific location may be unavailable, further complicating network operational and planning decisions.
Yet, for distribution network operators (DNOs), smart meter data can present important opportunities for managing their networks, in particular in estimating the voltage distributions across all points in their networks. Voltage fluctuations are a key concern, as DNOs have a legal duty to ensure that voltage excursions, at all nodes in their networks, remain within a tight legal limit or operating condition set by the regulator. This is because voltage drops/surges may lead to, e.g., malfunctioning of some connected electrical appliances. In recent years, this responsibility has been made much harder by the roll-out of new loads (e.g. distributed EV charging) and embedded generation, such as the increasing penetration of rooftop solar panels or micro-renewable generation. Hence, it is important for DNOs to identify points/areas in the distribution network that are 'at risk' from voltage fluctuations, by having accurate tools for estimating such fluctuations, and to do so based on the often only partial smart meter data available. Several network operators, for example Scottish Power (SP) Energy Networks, have outlined ambitious digitalisation strategies [3] to allow them to leverage large-scale smart meter data to address these challenges, and to allow energy networks to enable higher penetration of low-carbon generation and demand technologies.
Against this background, recent advances in machine learning, and in particular deep learning techniques, provide key opportunities to extract information from very large-scale data streams, and their potential use by power system operators is only now beginning to be explored. Another key tool required for Active Network Management (ANM) of LV networks is a Power or Distribution System State Estimation (PSSE or DSSE) tool, which estimates and simulates the most likely state of the network [4,5]. For LV network ANM, PSSE can be used to determine how best to manage the energy import from Distributed Energy Resources (DER) [6][7][8][9][10][11], the scheduling of Electric Vehicle (EV) charging [12][13][14], and/or network reconfiguration, ensuring that the solution proposed by ANM meets the constraint limits of the network.
Recently, a number of works have begun to explore the potential of smart meter data for a variety of applications related to LV and MV network management. In this vein, Huang et al. [15] use smart meter data to address the problem of interval state estimation in LV distribution systems, while Pappu et al. [16] use such data for topology identification of LV distribution grids. Gahrooei et al. [17] propose a new pseudo load profile determination approach in LV distribution networks based on frequency-based clustering of customers, using load data from their smart meters. Cataliotti et al. [18] deal with the problem of placement of measurement devices for load flow analysis in MV smart grids, while Jiang [19] considers data-driven fault location in electric power distribution systems with distributed generation. Liao et al. [20] propose a novel group lasso method to estimate the topology of urban MV and LV distribution grids, while Procopiou et al. [21] develop a method for decentralised control of residential storage in PV-rich MV-LV networks that makes use of smart meter data from a real MV feeder in Australia. Finally, Fang et al. [22] develop a statistical approach to guide phase swapping in LV networks where smart meter data from customers is scarce, a situation the authors argue is typical for LV networks. While there are very useful elements in all these papers, to our knowledge, none of this prior work directly addresses the challenge we consider in this paper: predicting voltage distributions across LV networks using smart meter data under data availability and customer privacy constraints.

Challenges and motivation
Since the start of the SM roll-out and Advanced Metering Infrastructure (AMI) installations, privacy has been a key concern for both consumers and regulators [23]. There are justifiable fears that high-granularity electricity demand (in particular, power load) data can be used to profile individual customers' behaviour in their homes, allowing intrusive information to be inferred about their daily routines and lifestyle. In the UK, the Office of Gas and Electricity Markets (OFGEM), the body charged with developing regulations for the UK's energy sector, has indicated that energy demand data with a granularity of less than a 1-month interval must be considered personal, and hence protected by more stringent privacy provisions [24]. As a result, OFGEM requires justification when UK distribution network operators (DNOs) request access to high-granularity energy demand data. In practice, this means that when DNOs request access to such data, they will incur high data management costs to ensure that the security of their customers' data is maintained during transfer, in use and in storage, and that no unauthorised third-party access is possible. To overcome this concern, a number of methods have been proposed in the literature that aim to anonymise and mask customer energy usage, ranging from the use of energy storage systems [25,26] to data aggregation across multiple properties [27]. These methods, however, can impair the ability to estimate the state of the network, specifically how voltage is distributed across the network. Data privacy concerns therefore create a trade-off between performance and privacy: how can the DNO predict the voltage distribution and its associated risk without high-granularity power data?
Aside from the privacy concerns discussed above, the smart meter data challenges can be split into two: (i) current data challenges and (ii) future data challenges. One of the current data challenges results from the voluntary nature of smart meter installation. Customers are not legally required to install a smart meter when offered one by their utility company, and a considerable number of customers choose not to do so. This can result in blind spots in the network, which creates the need for pseudo-measurements in PSSE and DSSE analysis [28,29]. Furthermore, as indicated above, power demand data may not be available at high granularity. Another piece of information that is critical for PSSE and DSSE analysis, and is often unavailable, is the phase identification. Its absence can affect the output of the analysis. Phase identification should be performed a priori, and the methods proposed to achieve this require full smart meter coverage of the LV network to be effective.
In the future, DNOs are likely to encounter additional big data challenges if each household is to provide its smart meter data. Smart meters in the UK by default capture and transmit half-hourly power and voltage data. The granularity of voltage data can increase up to one reading per second if required. The question for DNOs is: do they need all the available data for their LV network management? The more data is requested, the larger the data volumes and the higher the data management costs. So the optimisation challenge DNOs face is: can the risk of local out-of-bounds voltage excursions be calculated using data from only some key monitoring locations on the network?
To address the challenges identified, this paper proposes a Deep Learning Neural Network (DLNN) architecture to predict how voltage is distributed on an LV circuit one time step ahead, using minimal smart meter data or data from key locations only. We define an LV circuit as a group of customers that share the same source (closed fuse at the secondary substation). An LV network is a group of LV circuits within a specific area. From the LV network operational perspective, the smart meter data can indicate how voltage is distributed across the LV circuit. This helps in predicting the likelihood of risks on the circuit.
Without knowledge of the network topology, it is difficult to profile customer energy behaviours from high-granularity voltage data, unlike power data, which directly reveals the energy consumption of each domestic consumer at each point in time. Because of this, in the UK, OFGEM does not impose similar restrictions on the transfer of high-granularity voltage data to the DNOs. This suggests that novel machine learning and PSSE techniques need to be developed that can make efficient use of this voltage data, without requiring additional data points from high-granularity power data. The paper therefore evaluates the impact on prediction of including or excluding the high-granularity power demand data deemed personal. The paper also discusses the effectiveness of the DLNN in predicting the voltage distribution even at locations with no smart meters. This addresses the limitations of the current voluntary nature of smart meter installations, which has resulted in blind spots across the network.
If all customers are to install a smart meter, the large volume of data will result in increased complexity and cost for the associated data analysis and management. To reduce this cost, the paper proposes the method of identifying key locations for which smart meters are required to ensure effective prediction. The key locations within an LV circuit are the first customer on the LV circuit and the customers located at the start and at the end of each branch. A compressed tree representation of the LV network defined as the asset path tree is proposed to identify the key locations.
The remainder of this paper is organised as follows. Section 3 outlines the problem setting and discusses the use of existing DSSE techniques presented in the literature, motivating our proposed method. Section 4 describes how the DLNN predicts the voltage distribution across the LV circuit, and the asset path tree that identifies the key locations on the circuit that are significant for the prediction of the voltage distribution. Section 5 describes the results from our evaluation. Section 6 concludes the paper.

Problem setting & existing work
DSSE tools are often used to simulate and estimate the voltage distribution across the LV network for many energy scenarios. DSSE assumes that power demand data from all customers in the circuit are available at high granularity, half-hourly or finer. In field studies, the customers involved have granted permission for their high-granularity personal energy demand data to be accessed [30,31]. However, not all customers are willing to grant such access. As indicated in Section 1, because of privacy concerns, individual power demand data at high granularity may not be provided by all customers.
To overcome these limitations, pseudo-measurements were suggested in, e.g. [28,29]. The key disadvantage of pseudo-measurements is the potential error propagation from the pseudo-measurement to the output of the DSSE, error which can increase the level of uncertainty of the results, rendering the analysis not very useful in practice [32]. Furthermore, the uncertainty with regards to which phase the customers are connected to will also affect the quality of the results.
Nearly all domestic electricity users are connected to the LV circuit using a single-phase cable. These individual phases are taken from the three-phase mains cable. One of the key identifiable challenges for LV network management is the missing customer phase information. Identification of customer phase is an active area of research, with voltage clustering [33,34] and energy data correlation [35,36] being the most common methodologies. The latter technique is more suitable if high-granularity power demand data, at every half hour or less, are available from all customers. The algorithms presented in this line of work are often not applicable in real settings, because of the high likelihood of incomplete smart meter coverage in the network.
A new approach is therefore required to predict the voltage distribution using only the available information provided, specifically, what is the predicted voltage at a specific point of the LV circuit given the available voltages provided at other points on the circuit. This paper proposes the use of Deep Learning Neural Network (DLNN) to do so.
There are several reasons for choosing deep learning neural networks (DLNNs) for this problem. First, DLNNs have the ability to deal with very large, potentially unstructured datasets, such as smart meter data, which is large-scale and distributed. They have a proven track record in many other real-life domains where learning has been applied to large datasets, including many energy applications. Moreover, unlike other supervised learning methods, they do not require extensive feature engineering, which would be expensive and time consuming in this application domain. For example, it is hard to say a priori which input signals (e.g. combinations of voltages/load data from which locations) are needed to make good predictions; however, learning using DLNNs can be used to guide this process. This ability to output good predictions from data, without the need to invest the engineering input and set-up time that other ML approaches may require, is a key advantage.
By providing the ability to predict the voltages across the LV circuit, i.e. the voltage distribution, we are able to predict the risk of voltage constraint violations. We also evaluate the accuracy of prediction for varying degrees of observability. This addresses both the consequences of the current voluntary nature of smart meter installation, and the use of key identified locations, which aims to minimise the need to collect data from all smart meters because of the potentially high cost of future big data management.
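As a hedged illustration of how predicted voltages could be turned into a constraint-violation flag, the sketch below checks per-CCP predictions against a voltage band. The band used here, 230 V +10%/-6% (216.2 V to 253.0 V), is the UK statutory LV limit and is an assumption of this example, since the text does not state the limits; the function name and the sample voltages are hypothetical.

```python
# Assumed band: UK statutory LV limits of 230 V +10% / -6% (not stated in the paper).
LOWER_V = 216.2
UPPER_V = 253.0


def flag_violations(predicted, lower=LOWER_V, upper=UPPER_V):
    """Return {ccp_id: 'low' | 'high'} for predicted voltages outside [lower, upper]."""
    flags = {}
    for ccp_id, volts in predicted.items():
        if volts < lower:
            flags[ccp_id] = "low"
        elif volts > upper:
            flags[ccp_id] = "high"
    return flags


# Hypothetical one-step-ahead predictions for three CCPs.
predicted = {"ccp_01": 241.3, "ccp_07": 215.9, "ccp_12": 254.1}
print(flag_violations(predicted))
```

Only CCPs outside the band are returned, so an empty dictionary means no predicted excursion for that time step.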

Machine learning methodology
In this paper, we propose a Deep Learning Neural Network (DLNN) to predict the voltage distribution in an LV circuit. Only the voltage magnitude is predicted, as this value is of interest to the DNO, specifically for use for predicting the risk of constraints violations and/or to control the voltage set point at the secondary substation level, either to step up or step down the transformer.
Due to the real-life limitations in the SM roll-out discussed in earlier sections, for the DLNN to be a practically useful tool, our predictive model must have the following features:
1. Ability to predict the voltage distribution across a circuit one time step ahead $(t + \Delta t)$ despite the partial SM coverage in the LV circuit.
2. Ability to predict the voltage for all customers, including for locations without any SMs.
3. Ability to use, but not require, high-granularity power demand data from all customers on the circuit, e.g. the potential use of aggregated power demand data or no power data.
4. No firm requirement for customers' phase connection data when making predictions.
Many PSSE methodologies fall short as they are unable to meet the above features.

Simulating the voltage distribution
Principled simulations of different SM scenarios for domestic LV circuits are constructed to validate that our model meets the above features. OpenDSS [37] is used for simulation, using actual LV circuit topologies randomly selected from the Central Belt of Scotland and per-household power demand data generated from the Loughborough University Centre for Renewable Energy Systems Technology (CREST) model [28]. The CREST model provides 1-min power demand data per customer, which is used by OpenDSS to calculate the voltage distribution across the LV circuits. As the majority of smart meters in the UK provide 30-min averaged voltage RMS, data of similar granularity is used for the DLNN: the 1-min simulated data are averaged over every 30 mins before they are used as inputs to the DLNN.
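The down-sampling step described above can be sketched as follows. This is a minimal illustration, assuming the 1-min samples arrive as a plain list aligned to half-hour boundaries; the function name and the handling of an incomplete final window (dropped here) are assumptions, not details given in the text.

```python
from statistics import mean


def half_hourly_means(samples_1min, window=30):
    """Average consecutive blocks of `window` 1-min samples.

    Any incomplete block at the end of the series is dropped (an assumption).
    """
    n_full = len(samples_1min) // window
    return [
        mean(samples_1min[i * window:(i + 1) * window])
        for i in range(n_full)
    ]


# One hour of hypothetical 1-min voltage samples: two 30-min windows.
one_hour = [240.0] * 30 + [238.0] * 30
print(half_hourly_means(one_hour))
```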
For OpenDSS to generate the voltage distribution, all residential properties are connected to the 3-phase main cable of the LV circuit via service cables. We define the point of connection between a property and its service cable as the Customer Connection Point, or CCP. A CCP can connect to a single-household or a multi-household property. Single-household properties will typically be connected to a single-phase service cable from one of the 3 phases of the 3-phase mains cable, and will therefore have one CCP per property. For a single-phase CCP connected to a single-household property, the values provided by the simulated SMs will be close to reality. However, the values from simulated SMs and real SM readings may differ significantly for multi-household properties. This is because no lateral or internal cable information is typically available for multi-household properties.
Multi-household properties, for example flats and apartment blocks, are typically connected to 3-phase service cables and will therefore have a maximum of 3 CCPs, one for each phase. Assuming balanced loading, the households in a multi-household property are equally distributed across the 3 phases. For example, if there are 6 households in a property (typical of urban housing in Glasgow and Edinburgh), each 230 V single phase will be connected to 2 households in the property. Because no lateral cable information is available, we simulate a SM that reports the aggregated power demand data (lump load) of all the households that are connected to the same phase in the multi-household property. This value is used by OpenDSS to calculate the voltage value for the respective phase. When using actual SMs, the aim is to use, per phase, the voltage data from the SM at the farthest distance from the CCP.
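The lump-load assumption can be illustrated with a short sketch. The round-robin assignment of households to phases is an assumption made for this example (the text only states that households are equally distributed), and the phase labels, function name and demand values are hypothetical.

```python
def lump_loads_per_phase(household_kw, phases=("L1", "L2", "L3")):
    """Assign households to phases round-robin and sum their demand per phase.

    Returns (loads, counts): aggregated demand in kW and household count per phase.
    """
    loads = {p: 0.0 for p in phases}
    counts = {p: 0 for p in phases}
    for i, kw in enumerate(household_kw):
        phase = phases[i % len(phases)]
        loads[phase] += kw
        counts[phase] += 1
    return loads, counts


# Six households in one property: two per phase, one lump load per phase.
loads, counts = lump_loads_per_phase([0.5, 1.0, 0.25, 0.75, 0.5, 1.5])
print(counts)
print(loads)
```

Each per-phase total plays the role of the single simulated SM reading for that phase's CCP.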

Predicting the voltage distribution
Eq. (1) indicates the input-to-output mapping $f(\cdot)$ of the predictive model.
$\hat{v}_q(t + \Delta t)$ is the predicted voltage at time $t + 30$ mins for the queried CCP $q$, with the distance $d_q$ from source and the aggregated number of households $h_q$ between the source and $q$ on a given circuit path. We predict the voltage 30 mins ahead, every 30 mins, because, as indicated in the previous section, the majority of smart meters in the UK are configured to provide the average voltage RMS every 30 mins.
$M$ is part of the inputs to the DLNN and consists of the $|S|$ measurement data from the $|S|$ CCPs with SMs, where $m_s \in M$, $s \in S$, is the measurement data from SM $s$ (Eqs. (1)-(3)).
$Z$ is the total line impedance of the circuit, an input value to the DLNN that is used to categorise the LV circuit topology, providing an indication of the circuit capacity and risk. A high $Z$ can be indicative of a long circuit (in distance) and/or a low circuit capacity. Cables with smaller cross-sectional areas have higher resistance $R$ and reactance $X$ values and lower ratings and capacities, in turn resulting in higher risks in comparison to those with lower $R$ and $X$ values. A high $Z$, therefore, indicates a higher risk of voltage and thermal constraint violations.
Assuming similar customer demand (power), customers at the same distance from source but on two different topologies will experience different voltage drops. This is because an LV circuit with more branching will have a smaller impedance value than one with no branches. $Z$ aims to provide such differentiation and, along with $d_q$ and $h_q$, provides the reference point to indicate how large the voltage drop should be at any given point.

Total line impedance, $Z$
$Z$ is calculated by first transforming the LV circuit into its equivalent schematic representation, with each cable segment in the circuit appearing as a resistor with the impedance magnitude $z = \sqrt{R^2 + X^2}$. $z$ is calculated using the cable's resistance $R$ ($\Omega\,\mathrm{km}^{-1}$) and reactance $X$ ($\Omega\,\mathrm{km}^{-1}$) values provided by the cable manufacturer and the cable segment length $l$. Because each cable has an impedance value $z$, $Z$ is then calculated using Thévenin's equivalent circuit theorem.
An LV circuit is typically a 3-phase circuit with the customers assumed to be equally distributed across the 3 phases. In theory, there should be 3 impedance values, one for each phase. However, customers' phase data is often unavailable. Therefore, when calculating $Z$, all cables are assumed to be single-phase cables and all customers are assumed to be connected to the same phase, providing one $Z$ value per circuit instead of 3, one for each of the phases. While this is an approximation, the single-phase $Z$ value is useful to indicate a worst-case bound on the LV circuit capacity, representing the worst-case imbalance situation in which all customers are connected to a single common phase.
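The reduction described above can be sketched as a recursive series/parallel combination over a tree of cable segments. This is an illustration under stated assumptions rather than the authors' implementation: per-km resistance and reactance are scaled by segment length, branches leaving the same segment are combined in parallel, and the dictionary layout (`r`, `x`, `km`, `children`) and cable values are hypothetical.

```python
import math


def segment_z(r_ohm_per_km, x_ohm_per_km, length_km):
    """Impedance magnitude of one cable segment, scaled by its length (assumed)."""
    return length_km * math.hypot(r_ohm_per_km, x_ohm_per_km)


def total_z(segment):
    """Reduce a tree of segments to a single equivalent impedance magnitude.

    Downstream branches are combined in parallel, and that equivalent is placed
    in series with the segment feeding them. Segments must have non-zero length.
    """
    z = segment_z(segment["r"], segment["x"], segment["km"])
    branches = [total_z(child) for child in segment.get("children", [])]
    if branches:
        z += 1.0 / sum(1.0 / b for b in branches)
    return z


cable = {"r": 0.3, "x": 0.4, "km": 1.0}  # hypothetical cable: 0.5 ohm per km
unbranched = {**cable, "children": [dict(cable)]}
branched = {**cable, "children": [dict(cable), dict(cable)]}
print(total_z(unbranched), total_z(branched))
```

With identical cables, the branched circuit yields the smaller equivalent value, which matches the statement that more branching lowers the impedance.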

Electricity measurements and their respective loading
In our analysis, a SM at a CCP $s$, $s \in S$, with the distance $d_s$ from source provides the measurement data $m_s$ (3). $m_s$ consists of the average voltage RMS magnitude $v_s$ and the aggregated average active power $p_s$ at times $(t)$, $(t - 30\,\text{mins})$, $(t - 1\,\text{day})$, $(t - 30\,\text{mins} - 1\,\text{day})$ and $(t + 30\,\text{mins} - 1\,\text{day})$ for all the households that are connected to a specific phase at the CCP with SM $s$. $\Delta t = 30$ mins is chosen as this provides a sufficient time frame for any mitigating actions to be put in place.
$m_s$ also includes the distance $d_s$ from source and the aggregated number of households $h_s$ between the source and $s$. These two values indicate the loading that results in the voltage drop at location $s$.
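As a sketch of how one SM's measurement record could be flattened into DLNN inputs, the function below collects the five time-lagged voltages, optionally the five time-lagged powers, and then the distance and household count. The lag labels, the ordering of the features and the sample values are assumptions made for illustration.

```python
# Labels for the five time points named in the text (naming is hypothetical).
LAGS = ("t", "t-30min", "t-1day", "t-30min-1day", "t+30min-1day")


def sm_features(voltage, distance_m, households, power=None):
    """Flatten one SM record: 5 voltages, optional 5 powers, distance, households."""
    feats = [voltage[lag] for lag in LAGS]
    if power is not None:
        feats += [power[lag] for lag in LAGS]
    feats += [distance_m, households]
    return feats


volts = dict(zip(LAGS, (240.1, 240.4, 239.8, 240.0, 239.5)))
kw = dict(zip(LAGS, (1.2, 1.1, 1.4, 1.3, 1.5)))
print(len(sm_features(volts, 85.0, 12, power=kw)))  # 2 x 5 + 2 values
print(len(sm_features(volts, 85.0, 12)))            # 1 x 5 + 2 values
```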

Deep learning neural network (DLNN)
The predictive model $\hat{v}_q(t + \Delta t) = f(\cdot)$ (1) is a 6-layer DLNN developed using the TensorFlow library [38]. The input layer of the DLNN consists of $n = 4 + (|S| \times 2 \times 5) + (|S| \times 2)$ or $n = 4 + (|S| \times 1 \times 5) + (|S| \times 2)$ neurons, depending on whether the (aggregated) power demand data is available to be included as part of the input. The input is divided into four categories. The first hidden layer consists of $n/2$ neurons, followed by $n/4$ neurons in each of the 2nd to 4th hidden layers. The output layer is a single-neuron layer for the $\hat{v}_q(t + \Delta t)$ value. The activation function used for all neurons is the Scaled Exponential Linear Unit (SELU) [39]. The DLNN is trained using the Adam optimiser [40] with early stopping.
DLNNs have been shown to be competitive for feature extraction and time-series analysis. For our analysis, the DLNN will perform:
• Feature extraction: to identify the correlation between the voltages provided by the SMs, their distances to source, and their approximated loading indicated by the power value and/or the aggregated number of households at the location of the SMs (Eqs. (2)-(3)). By identifying the correlation, the voltage for a CCP without a SM can be approximated.
• Time-series analysis: to identify how the voltage distribution changes over time.
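The layer sizing and activation described in this section can be worked through in a few lines of plain Python. This is a sketch only: the rounding of $n/2$ and $n/4$ by floor division is an assumption, the function names are hypothetical, and the SELU constants are the standard published values rather than anything specific to this paper.

```python
import math

# Standard SELU constants (self-normalising networks).
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def selu(x):
    """Scaled Exponential Linear Unit applied to a single value."""
    return SELU_SCALE * (x if x > 0 else SELU_ALPHA * (math.exp(x) - 1.0))


def layer_widths(num_sms, with_power=True):
    """Neuron counts for the input layer, 4 hidden layers and the output layer."""
    channels = 2 if with_power else 1  # voltage (+ power) at 5 time points
    n = 4 + num_sms * channels * 5 + num_sms * 2
    return [n, n // 2, n // 4, n // 4, n // 4, 1]


print(layer_widths(10, with_power=True))
print(layer_widths(10, with_power=False))
```

For 10 CCPs with SMs, the input layer has 124 neurons when power data is included and 74 when it is not, so omitting power data also shrinks every hidden layer.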

Training the predictive model $f(\cdot)$
One month of demand profile data was simulated. Only $|S|$ CCPs are used to train and validate the DLNN. This is to simulate the lack of SM coverage. The data from the first week for the $|S|$ CCPs are used to train the DLNN. The data from the same $|S|$ CCPs from the following week are used for validation. The $|S|$ CCPs are also randomly selected, to simulate the lack of control over SM installation in the UK, where, as indicated in Section 1, SM installation is voluntary. As indicated in Section 4.1, the simulated SM values for single-household CCPs are close to reality. However, for multi-household properties, this will vary, whereby any one of the 3 single-phase connections to the property is randomly selected to be the one with a SM $s \in S$ (Eqs. (2)-(3)).
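The selection and split described above can be sketched as follows. The record layout `(ccp_id, day_index, payload)`, the function names and the fixed random seed are assumptions made for illustration; the text only specifies a random choice of CCPs, week 1 for training and week 2 for validation.

```python
import random


def select_ccps(all_ccps, k, seed=0):
    """Randomly pick k CCPs to act as the ones with SMs (seed fixed for repeatability)."""
    return sorted(random.Random(seed).sample(all_ccps, k))


def split_by_week(records, selected):
    """Split records of the selected CCPs into week 1 (train) and week 2 (validation).

    records: iterable of (ccp_id, day_index, payload), with day_index starting at 0.
    """
    chosen = set(selected)
    train = [r for r in records if r[0] in chosen and 0 <= r[1] < 7]
    val = [r for r in records if r[0] in chosen and 7 <= r[1] < 14]
    return train, val


ccps = ["ccp_a", "ccp_b", "ccp_c", "ccp_d", "ccp_e"]
records = [(c, day, None) for c in ccps for day in range(28)]
selected = select_ccps(ccps, 3)
train, val = split_by_week(records, selected)
print(selected, len(train), len(val))
```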

Identification of the key locations
As indicated in Section 1, DNOs can be faced with a big data challenge when all customers on their network transmit their SM data to them. We hypothesise that not all SMs are required for the prediction of the voltage distribution: data from only the key locations, or key CCPs, on the LV circuit are sufficient to provide effective prediction. Manually identifying the key locations for all LV networks, however, would be a laborious task. Therefore, we propose the use of the asset path tree presented in [41], which represents the LV circuit, to indicate the circuit's key locations.

Asset path tree
Any electricity network can be represented as a graph $G(N, E)$, such that a node $n \in N$ is either a substation, a transformer, a link box, a branch point or a unit that either consumes or generates electricity or both. The edge $e_{i,j} \in E$ is a physical cable that connects the two nodes $n_i$ and $n_j$. However, the level of detail provided by such a graph representation of an electrical network is unnecessary for identifying the key locations in an LV circuit. The asset path tree presented in [41] is used to compress the graph connectivity of the network down to its key components. Fig. 1a shows an example LV circuit, in which a 95 mm main cable branching to the right is connected to six customers; four of these are connected to the main cable via a 25 mm service cable each and the rest via a 35 mm cable. The asset path tree differs from a standard graph $G$: in the graph representation (Fig. 1c), the 95 mm main cable branching to the right of the circuit is represented by seven nodes. Each node is a branch point and is indicated by a blue filled circle.
For the asset path tree (Fig. 1b), only one node is required to represent the 95 mm main cable. The two cable types that connect the customers to this 95 mm main cable are each represented by a node, indicated by a filled green circle. As a result, the asset path tree compresses the graph representation of the circuit down to its key elements, i.e. which cable types are connected to each other, and whether the cables are further connected to other types of cable, to a branch or to a property. Fig. 2 shows a portion of the asset path tree for the lower left-hand side of the LV circuit encircled in Fig. 3. In Fig. 2, a node is indicated by an arrowhead, and the cable types that connect the nodes (the edges) are indicated within the bubbles. The orange squares in Fig. 3 represent the properties connected to the LV circuit. The integer values next to the squares correspond to the number of households in the multi-household properties. The orange squares without an integer are single-household properties.
We propose that the key locations on an LV circuit are:
1. The first customer on the circuit, indicated by the first branch on the asset path tree that leads to a customer. In Fig. 2, this is the first pink bubble from the left, corresponding to the first 13-household property in the circuit. The property is connected to the 300 mm main cable via a 35 mm service cable.¹
2. At each branch point, the first and last customers for each (service) cable type connected to the mains.
These are shown by the pink bubbles in Fig. 2. We hypothesise that these key locations provide the reference voltages needed to approximate the voltage drop at other queried locations between them.
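The two rules above can be sketched over a simplified stand-in for the asset path tree. The data layout used here (a dictionary of branches, each mapping a service cable type to a list of `(customer_id, distance_m)` pairs) is hypothetical and much simpler than the structure in [41]; it is only meant to show how the rules select customers.

```python
def key_locations(branches):
    """Return key customer ids: the first customer on the circuit, plus the first
    and last customer (by distance from source) per cable type on each branch.

    branches: {branch_id: {cable_type: [(customer_id, distance_m), ...]}}
    """
    everyone = [
        cust
        for groups in branches.values()
        for custs in groups.values()
        for cust in custs
    ]
    keys = []
    if everyone:
        keys.append(min(everyone, key=lambda c: c[1])[0])  # rule 1
    for groups in branches.values():                       # rule 2
        for custs in groups.values():
            if not custs:
                continue
            ordered = sorted(custs, key=lambda c: c[1])
            for cust_id in (ordered[0][0], ordered[-1][0]):
                if cust_id not in keys:
                    keys.append(cust_id)
    return keys


# Hypothetical circuit loosely following Fig. 1a: six customers on the right branch.
branches = {
    "main": {"35mm": [("A", 20.0)]},
    "right": {
        "25mm": [("B", 60.0), ("C", 75.0), ("D", 90.0), ("E", 105.0)],
        "35mm": [("F", 70.0), ("G", 110.0)],
    },
}
print(key_locations(branches))
```

In this example customers C and D, which sit between two key customers on the same cable type, are not selected.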

Experimental evaluation and results
Two sets of experiments were performed to evaluate the impact of $|S|$, the number of customer connection points (CCPs) for which data is available to create the predictive model. We consider two main scenarios:

Varying the number of CCPs with SM
Twenty different combinations of 10, 15 and 20 CCPs are selected within a circuit. These CCPs may or may not be the identified key CCPs. Two circuits are chosen for the analysis and they are shown in Fig. 3 (Circuit 1) and Fig. 4 (Circuit 2). The attributes of these two circuits are listed in Table 1. The majority of the properties connected to the circuits are multi-household properties, which are connected to the circuit via a 3-phase service cable. As indicated in Section 4.1, there will be 3 CCPs for the multi-household properties. Properties connected via a single-phase service cable each have only 1 CCP. Therefore, the chosen circuits have a number of key CCPs greater than the number of key locations identified.
¹ The information related to the cables within a property was not provided.
Table 1: The attributes of the selected circuits.
Figs. 5-6 show the median predictive errors and the median confidence interval for the three sets of combinations for $|S| = \{10, 15, 20\}$. The results indicate that not all CCPs are required. The figures show that there is a maximum number of CCPs required, and that any increase beyond this value brings no additional benefit to the prediction results. Fig. 7 shows the mean predictive errors from the 2 indicated circuits plus 6 others for the indicated combinations of $|S|$ CCPs with SM; each circuit is indicated by its respective colour. The figure shows that the maximum value of CCPs with SM depends on the number of CCPs on the circuit (the last value for each plot).
The results from this scenario also indicate that, if a significant number of CCPs have SMs and these are located at key locations, the median predictive errors are similar with or without the use of power demand data as part of the input. As indicated in Section 4.2, the variables used to approximate the demand are: (i) the aggregated number of households between the source and the location of the smart meter $s$ or the location of interest $q$, and (ii) if available, the power demand data $p_s$. These two sets of data provide similar information when approximating the amount of voltage drop at a given distance from source in the circuit ($d_s$ and $d_q$). The power demand data, however, changes with time, and the power demand data at one location in a circuit will have no correlation with the power demand data at another location, in contrast to the voltage data: the voltage value at a specific distance from source in a given circuit is a function of the voltage value at another location. Larger median errors were observed when power demand data is used because the DLNN must 'learn' the correlation between how both the voltage and the power change over time, and that the power demand data are uncorrelated with each other, in order to make its prediction. Such computational effort is not required when demand is not provided.
The power demand data is, however, useful when the number of CCPs with smart meter data is low, as any additional information is beneficial to the model. Fig. 8 shows the results of the analysis when only the key CCPs are selected with SM and the percentage of the key CCPs selected is varied. Low and consistent median percentage errors are shown in the figure despite the variability in the number of key CCPs selected with SM, where the number of selected key CCPs is ≥ 18. The median predictive errors when the input to the DLNN does not include the personal power demand data are lower than when power demand data is included, especially when all the key CCPs have SMs. This is as discussed in Section 5.1.

Varying the number of key locations with SM
The figure does, however, indicate two cases in which large median predictive errors were found when the input to the DLNN did not include the power demand data. The example with the highest median predictive error has no smart meters on multiple branches of the circuit: the blind spots are after the link box and on the bottom-left branch of the circuit. As a result, a large, key portion of the circuit has no reference point from which to approximate its voltage values, resulting in a higher median predictive error. It is therefore unsurprising that, in this case, the DLNN is not able to learn the appropriate correlation between the data to enable effective prediction.
The case with the second-highest error, whose inputs also exclude the power demand data, has multiple key CCPs at the start of a branch without any SM. In particular, the first CCP in the circuit is also without a smart meter. As a result, the voltage drop from these key reference points at the start of the circuit and at the start of a branch is difficult to approximate. This has resulted in the larger errors.
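Blind spots of this kind can be checked for before training. The sketch below is illustrative only and is not part of the method evaluated here: it assumes the circuit is available as a hypothetical child-to-parent map of a radial (tree) topology, and it reports every branch leaving a branching point that contains no smart-metered node.

```python
from typing import Dict, List, Optional, Set


def unmetered_branches(
    parent: Dict[str, Optional[str]], metered: Set[str]
) -> List[str]:
    """Return the first node of each branch that has no metered node.

    parent maps each node to its upstream node (None for the source).
    A branch starts at any child of a node with more than one child.
    """
    children: Dict[str, List[str]] = {}
    for node, par in parent.items():
        if par is not None:
            children.setdefault(par, []).append(node)

    def has_meter(root: str) -> bool:
        # Depth-first search over the subtree rooted at `root`.
        stack = [root]
        while stack:
            node = stack.pop()
            if node in metered:
                return True
            stack.extend(children.get(node, []))
        return False

    blind = []
    for kids in children.values():
        if len(kids) > 1:  # branching point in the circuit
            blind.extend(kid for kid in kids if not has_meter(kid))
    return sorted(blind)


if __name__ == "__main__":
    # Hypothetical circuit: node "a" splits into branches starting at "b" and "c".
    circuit = {"sub": None, "a": "sub", "b": "a", "c": "a", "d": "c", "e": "b"}
    print(unmetered_branches(circuit, metered={"e"}))
```

For the hypothetical circuit in the usage example, the branch starting at "c" is reported because neither "c" nor "d" carries a meter.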
Five DLNN models were generated for the case in which all 42 key CCPs in the circuit are selected with SM. Fig. 8 shows consistently low and similar predictive errors for these cases. The median predictive errors are lowest when the power demand data are not included as part of the inputs.
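For readers who wish to reproduce this type of comparison, the sketch below shows one common way of computing a median percentage error over all CCPs in a circuit. The exact error definition used in this work is given in the earlier sections; the formula here (median of the absolute percentage errors) and the function name are assumptions made for illustration.

```python
from statistics import median
from typing import Sequence


def median_percentage_error(
    actual_v: Sequence[float], predicted_v: Sequence[float]
) -> float:
    """Median of the absolute percentage errors across all CCPs."""
    if len(actual_v) != len(predicted_v):
        raise ValueError("actual and predicted must have the same length")
    if not actual_v:
        raise ValueError("at least one CCP is required")
    errors = [
        abs(pred - act) / abs(act) * 100.0
        for act, pred in zip(actual_v, predicted_v)
    ]
    return median(errors)


if __name__ == "__main__":
    # Illustrative values only, not measurements from this study.
    print(median_percentage_error([100.0, 200.0, 400.0], [101.0, 196.0, 400.0]))
```

The median is less sensitive than the mean to a small number of poorly predicted CCPs, so a single blind spot affects it less than it would affect an average error.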

Summary of experimental results
The results show the benefits of using a DLNN to predict the voltage distribution across a circuit using measurement data from a minimal number of CCPs with SMs. This addresses the following key concerns raised in the introduction and motivation of the work.

Customer privacy concerns
High granularity power demand data are not required, as there are no significant differences between the predictive errors, calculated from all CCPs in the circuit, of the DLNNs with and without high granularity personal power demand data as their input. As indicated in Section 5.1, the lack of correlation between power demand data from different SMs can reduce the accuracy of prediction. If power demand data are provided, the DLNN must 'learn' the correlation between the voltage data from different SMs and the correlation between the voltage and power data, but not a correlation between the power data from different SMs.

Current UK voluntary nature of smart meter installation
Not all data are required to perform effective prediction of the voltage distribution. The accuracy increases with the number of customer connection points (CCPs) with SMs, up to a maximum number that is less than the number of CCPs in the circuit. Interestingly, no significant change in the predictive errors was observed beyond this value, with or without the inclusion of power demand data as part of the input. This shows that, in fact, not every customer connection point needs to be smart metered to address this prediction problem effectively, even if this were possible in reality.
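The point at which additional smart meters stop changing the error can be located with a simple check such as the one sketched below. This is an illustrative procedure rather than the analysis performed in this study; the function name, the tolerance and the example values are all assumptions.

```python
from typing import List, Tuple


def saturation_point(results: List[Tuple[int, float]], tol: float) -> int:
    """Smallest meter count after which the error stays within tol.

    results holds (number of CCPs with SM, median error) pairs. The
    returned count is the first one for which every error at a higher
    count differs from the error at that count by at most tol.
    """
    if not results:
        raise ValueError("results must not be empty")
    ordered = sorted(results)
    for i, (count, err) in enumerate(ordered):
        if all(abs(later - err) <= tol for _, later in ordered[i + 1:]):
            return count
    return ordered[-1][0]  # not reached: the last entry always qualifies


if __name__ == "__main__":
    # Invented values for illustration, not results from this study.
    sweep = [(4, 3.0), (8, 1.5), (12, 0.6), (16, 0.5), (20, 0.55)]
    print(saturation_point(sweep, tol=0.2))
```

With the invented sweep in the usage example, the errors at 16 and 20 meters stay within the tolerance of the error at 12 meters, so 12 is returned as the saturation point.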

Future big data concerns
In summary, in this study we found that only the values at the identified key locations are required for effective prediction of the voltage distribution. No significant improvement in the predictive errors is shown if other CCPs with SMs are included as part of the input, unless the number of identified key CCPs with SMs is low. If all the key CCPs have SMs and these are available to the DLNN, we found that the predictive errors are lower when the DLNN does not consider power demand data as part of the input than in settings in which power demand data are included.

Discussion and further work
Low Voltage (LV) networks are a central element in the energy transition, and will need to accommodate significant increases in Distributed Generation, Storage Technologies and, increasingly, new demand profiles, such as those from decarbonised transport systems (e.g. EV charging infrastructure). UK policy is setting aggressive timescales for the decarbonisation of energy and transport services, creating an urgent need for advanced operational and planning capabilities for LV networks. The previously passive 'fit-and-forget' approach to network management will be insufficient to ensure their effective operation. An adaptive approach is required that includes the prediction of risk to the circuits. This has motivated the mass smart meter (SM) roll-out and advanced measurement infrastructure (AMI) installation in order to provide observability of how energy is distributed across the LV network, specifically for the LV circuits beyond the secondary substation. Yet the majority of the Power System State Estimation (PSSE) tools developed require full observability of the networks. Moreover, the majority of the PSSE analysis methods described in the literature also assume that 100% of the customers on the network have SMs. This premise is unrealistic in real-life operations. The current voluntary nature of SM installation has resulted in a low likelihood of full SM coverage for all LV networks. This, together with privacy requirements, which restrict access to high granularity power demand data, has resulted in the low uptake of many of the PSSE tools for LV network analysis. In this landscape, big data is a key concern for the DNOs; however, big data comes with a high data management cost.
To address these concerns, in our research we designed and evaluated the use of a novel Deep Learning Neural Network (DLNN) to predict how voltage is distributed across a low voltage (LV) circuit, despite partial SM coverage of the circuit. The results show the applicability of the DLNN to predicting the voltage distribution, even at locations without a smart meter, and that SM data at key locations within the circuit are sufficient for effective prediction without requiring high granularity power demand data.
Taking a longer-term view, such approaches will be increasingly important for automating the data gathering and analysis activities of distribution network operators, and this research work (and the broader NCEWS project it is part of) has been highlighted as a key innovation project supporting the SP Energy Networks digitalisation agenda (cf. [3], p. 57). Overall, we conclude that state-of-the-art machine learning techniques, such as deep learning, can provide significant benefits for power system operators in providing voltage distribution predictions, while at the same time using only partial data and respecting the privacy constraints of their customers.
In future work, we plan to explore several directions. First, we will consider applying our techniques to other challenging problems for power networks, such as phase identification for individual customers. Second, we plan to explore a variety of other, more complex network topologies, such as the dense, "meshed" network topologies present in many industrial and urban environments. Finally, we plan to investigate the use of AI techniques that combine ML and data-analytic methods for network visibility/monitoring (such as those presented in this paper) with those supporting planning decisions, for example, how to design the sizing and placement of charging stations to enable faster EV rollout.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.