Enhance Unobservable Solar Generation Estimation via Constructive Generative Adversarial Networks

Power distribution grids are experiencing a proliferation of solar photovoltaics (PV) at the system edge. However, the sparse deployment of meters provides insufficient monitoring of these PVs, and the resulting potential violations challenge operators in energy management and stable operation. Some previous works use satellite imagery to detect distributed PVs because the data are easy to access. However, their PV localization methods rely on label-rich areas with uniform backgrounds to perform well; moreover, they do not provide the precise metered-PV detection and quantification needed to estimate PV generation outputs in unobservable areas, which is essential to protect the edge from excessive two-way power flow and other violations. Thus, we combine the two steps of detecting PV existence and quantifying PV amount into one classification task. To boost the classification performance in unobservable edge areas, we construct a generative adversarial network that simultaneously augments the diversity of labeled PV satellite images and embeds distinct PV characteristics for training the classifier. Furthermore, the PV localization and quantification results are combined with geographic information, historical weather conditions, and neighboring generation patterns to estimate power output at the system edge. We validate the proposed approaches on PV systems in the southwest of the U.S. Experimental results show high accuracy and robustness in predicting distributed solar power without sufficient prior information.


Jingyi Yuan and Yang Weng
Index Terms-Distributed PV forecast, diversified data sources, structured learning, weakly supervised learning, variational GANs.

Jingyi Yuan is with the Electrical, Computer and Energy Engineering, Arizona State University, Engineering Research Center (ERC), Tempe, AZ 85281 USA (e-mail: jyuan46@asu.edu).
Yang Weng is with the Electrical, Computer and Energy Engineering, Arizona State University, Engineering Research Center (ERC), Tempe, AZ 85281 USA (e-mail: yang.weng@asu.edu).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TPWRS.2023.3262773.
Digital Object Identifier 10.1109/TPWRS.2023.3262773

I. INTRODUCTION
Sustainable and inexhaustible solar energy is one of the fastest-growing renewable resources in the smart power grid. For example, research estimates that 150-530 GW of cumulative solar-based power will potentially be available in the U.S. by 2040 [1]. Unlike conventional energy sources with scheduled power output, PV generation depends on various temporal and spatial factors, e.g., weather, atmospheric conditions, and installation position and quantity [2]. These factors naturally bring variability and uncertainty, leading to bidirectional power flow and frequent fluctuations of voltages and currents in distribution grids [3], [4]. To maintain the safety and reliability of the grid, distribution system operators (DSOs) require accurate information on solar panel locations and PV generation forecasts for system remodeling and predictive energy management [5], [6], [7]. Moreover, the foreseeable future of PV energy sharing and its economic prospects in urban areas raise a high demand for precise and easily accessible information on distributed PV systems [8], [9].
For PV generation forecasting, existing approaches fall into two categories: physical model-based methods and data-driven approaches [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20]. Physical model-based methods rely on numerical weather prediction (NWP) or satellite imagery to analyze the atmospheric conditions for solar irradiance, from which the power output is computed using physical characteristics [10], [13]. Some of the data-driven approaches directly estimate PV outputs from historical data, primarily extracting the statistical properties of PV measurements for new predictions [11], [12]. Machine learning models have been developed to consider highly correlated factors to characterize solar irradiance and predict PV generation. Such methods require complete information on PV locations, panel numbers, solar meter measurements, etc. However, in power distribution grids, the scattered PV generation data may come from different sources, including solar panel/inverter manufacturers, PV system development companies, utilities, and residential/commercial consumers. The methods mentioned above rely on timely data aggregation from these different sources, which requires constant and intensive manual effort [21], [22]. Thus, the knowledge is often incomplete or unavailable, especially for residential-level consumers. For example, the National Renewable Energy Laboratory (NREL) focuses on the Open PV Project to track distributed PV installations. The project relies on voluntary surveys and self-reports to provide a general understanding of PV distribution but is still unreliable in precision. Moreover, the database is easily outdated due to the rapid growth and widespread PV installations [23].
Instead of multiple PV data sources, the other group of data-driven methods uses smart meter data to detect unauthorized PV installation and estimate behind-the-meter PV generation [14], [15], [16], [17], [18], [19], [20]. These methods are summarized in Fig. 1. Specifically, smart meters record the net load data of customers, which is the conventional load minus the PV generation. While the PV is invisible, different model-free methods are adopted to uncover its existence and generation output. For example, the change points in historical measurements are detected, verified, and estimated, but this unsupervised method requires predefined hyper-parameters to work properly [14]. The supervised learning methods in [17], [20] need sufficient labeled data to fit an accurate estimator, especially when deep neural networks are used. Other methods use only net load data to disaggregate unknown PV generation based on the inherent temporal and spatial correlations [16], [19]. However, the distribution grids may have unobservable areas on the edge, for which we have no access to complete smart meter data [9].
Fig. 1. A summary of model-free PV detection and generation estimation works using smart meter data.
While the primitive information in distribution grids is limited, it can be inferred from extra public data sources that are easy to access, e.g., satellite images of PVs. Previous efforts have led to several approaches for identifying PV locations from geographic information systems (GIS), which is an image classification task [24], [25], [26], [27]. However, the informative image data has complex structures that hinder precise inference, and the insufficient labeled data in practice makes the classification methods difficult to implement. For example, [24] and [25] train and test on similar PV system images without considering significant variations in solar panel positions, textures, numbers, and backgrounds. Though rooftop solar detection tools based on convolutional neural networks (CNNs) claim better feature extraction for classification [26], [27], they rely on abundant hand-annotated data sets and are unclear about how to obtain the PV coordinates effectively [26], [27], [28]. Hence, these approaches on their own are limited in locating and quantifying distributed PVs in raw satellite images. Furthermore, a precise PV generation estimate helps maximize the information gain for DSOs. In the literature, [29] considers the solar irradiation over an extended time period in a particular region to estimate generation. However, it is limited in addressing the concern of feature limitation. Also, [30] introduces a support vector machine (SVM) as a base learner together with a meta learner based on the K-means algorithm, which clusters the training set for predicting short-term solar power generation. Such a methodology requires labeled historical PV data of the same solar power system, which is the same assumption made by deep learning models such as probabilistic neural networks (PNNs) [31]. Therefore, previous methods find it difficult to extrapolate information and provide a solution when some PVs do not come with complete historical recordings.
To address these problems, we propose to utilize different sources of publicly available information together with utility-owned measurements to enhance PV localization and generation forecasting. Specifically, we use available GIS and satellite image data for PV detection and quantification. While existing works have also utilized such data sources, they have not considered the practical problem of labeled data deficiency as the first step, not to mention generation estimation. Therefore, we enhance PV localization and quantification by designing weakly supervised generative adversarial networks (GANs).
The proposed model not only generates diversified labeled data to address data deficiency but also embeds PV characteristics during generation to enable distinct PV image augmentation. On the input side, we provide backbone structures as informative inputs to the generative model. The PV images are augmented for more accurate PV detection in the downstream task. Instead of separating the two tasks, we integrate them into one model as a feedback loop, which improves both data augmentation and PV detection simultaneously during training. Based on the detected PV information, we adopt the K-nearest neighbors method to estimate the best possible generation considering the relevant factors of solar irradiance. Fig. 2 shows an overview of the proposed method for generation predictability enhancement.
Our contribution lies in designing a constructive GAN to expand both the volume and diversity of labeled datasets for PV detection, and in enhancing the generation estimation by utilizing effective features and neighboring generation patterns. Specifically, the learning performance of detection and quantification is boosted via specific knowledge embedding: 1) using diversified backbone structures (grey-scale images with different PV quantities and locations) as additional informative inputs; 2) integrating the evaluation of solar panel classification into the data augmentation process as guidance; 3) adopting both content and style losses to train the GAN model specifically for rooftop solar panel datasets.

II. FORMULATE LEARNING PROBLEM TO PREDICT DISTRIBUTED SOLAR POWER FROM (DIVERSIFIED) DATA
To predict widely distributed solar power, we need to accomplish three steps: 1) detect the PV existence with respect to locations, 2) quantify the amounts, and 3) estimate the output generation. While we have no access to this information directly, it is inherently contained in diversified data sources. In this paper, we fully utilize these data sources to infer the power output of distributed PVs. This task of latent information discovery amounts to solving a structured machine learning problem. Specifically, we start with publicly available data sources (i.e., Google Earth) to sample raw satellite images (resolution of 4800 × 2987) with coordinates in the geographic information system (GIS). Since each raw image covers a large geographical area, we segment it into $M = 400$ pieces to zoom into rooftops. Fig. 3 shows one sample image, where only a few locations in the raw image have solar panels. With $M = 400$, less than 2% of the segments are labeled as 1.
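The segmentation step can be sketched as follows. The text only states the image resolution and $M = 400$; the 20 × 20 grid layout and the function name `tile_bounds` are assumptions made for illustration, and other layouts with 400 pieces are equally possible.

```python
def tile_bounds(width=4800, height=2987, cols=20, rows=20):
    """Return (left, top, right, bottom) pixel boxes that partition the image.

    The 20 x 20 grid is an assumed layout giving M = 400 tiles; edges are
    rounded so that the tiles cover the image without gaps or overlaps.
    """
    x_edges = [round(c * width / cols) for c in range(cols + 1)]
    y_edges = [round(r * height / rows) for r in range(rows + 1)]
    return [
        (x_edges[c], y_edges[r], x_edges[c + 1], y_edges[r + 1])
        for r in range(rows)
        for c in range(cols)
    ]


tiles = tile_bounds()
```

Because 2987 is not divisible by 20, tile heights differ by one pixel in this sketch; a real pipeline might instead pad or crop the raw image first.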
For segment $i \in \{1, 2, \ldots, M\}$, the label $q_i \in \{0, 1, 2, \ldots, k\}$ indicates the amount of solar panels, which is used in a multi-class classification for PV quantification. For a simplified PV detection task, the label is $q^e_i \in \{0, 1\}$, indicating the non-existence or existence of solar panels in an image. Moreover, $y_t$, where $t = 1, \ldots, T$ indexes the time points, is the generation of the corresponding solar power system based on the generation $g^K_t$ of the $K$ nearest neighbor PVs. For generation estimation, other available data include the temperature, humidity, and cloud cover with respect to the coordinates, which are the same location information used to sample satellite images. Therefore, the problem setup is as below.
• Problem: location detection, quantity assessment, and power generation estimation of rooftop PV systems;
• Given: 1) raw image $\mathbf{x}_{\text{raw}}$ from Google Earth and available labels $q_i$, 2) the known PV generation time series $g^K_t$ of solar panel systems covered in the $\mathbf{x}_{\text{raw}}$ image, and 3) time-series weather data $f^K_t$ of locations covered in the $\mathbf{x}_{\text{raw}}$ image;
• Find: for a new single segmented image $x_i$, 1) existence $q^e_i$ of solar panels, 2) quantity $q_i$ of solar modules, and 3) power generation time series $y_{i,t}$.
This paper aims to estimate distributed PV generation based on accurate PV location detection and quantity assessment. A robust image detector requires a large quantity of diversified data to train, while the known installation locations from the utility are limited to less than 2% of the total segments. To enlarge the labeled image data, an intuitive way is to use augmentation techniques (e.g., flipping, rotating, extracting patches, and transforming color spaces) that operate on obvious data invariants [32]. For the solar panel case, we can rotate slices of images to contain different orientation angles of PV installations. This data augmentation is realistic, but it creates no new instances that add data diversity. The detection model may then easily overfit and perform poorly on unseen data. For diversified image augmentation, deep learning-based generative models, such as generative adversarial networks (GANs), are popular for generating samples that are similar to, yet varied from, the existing instances. GANs benefit from the rich feature extraction of neural networks and the adversarial training scheme; however, the lack of learning guidance causes the model to collapse easily, which creates bad samples such as clustered solar panel pixels and object mixtures. Moreover, the labeled data is augmented with the goal of improving the training of a classification model (PV detection and quantification). Separating the two steps can lead to propagation errors. To address these concerns, we aim to integrate both tasks into one compact learning model and train the entire model with mutual benefits to achieve high accuracy of PV classification against unbalanced data.
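To make the limitation of invariant-based augmentation concrete, a minimal sketch is given below. It treats an image tile as a list of pixel rows; the helper names are hypothetical. Every variant it returns is a rearrangement of the same pixels, which is why such augmentation adds orientations but no new instances.

```python
def rotate90(img):
    """Rotate a 2-D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def flip_horizontal(img):
    """Mirror a 2-D image left to right."""
    return [row[::-1] for row in img]


def invariant_augmentations(img):
    """Return the 8 flip/rotation variants of one tile; the label is unchanged."""
    variants, current = [], img
    for _ in range(4):
        variants.append(current)
        variants.append(flip_horizontal(current))
        current = rotate90(current)
    return variants


tile = [[1, 2], [3, 4]]
variants = invariant_augmentations(tile)
```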

III. PROPOSED VARIATIONAL GANS WITH WEAK SUPERVISION FOR PV DETECTION AND QUANTIFICATION
To address the concerns of insufficient labeled data, we propose in this section an enhanced solar image classification based on weak supervision over GANs, as shown in the middle of Fig. 2.

A. Recall Basic GANs to Augment Image Samples
As mentioned in Section II, the goal of augmentation is to generate various solar images that mimic the original data in feature patterns as well as provide diversity.
To achieve this, we train a generator $G$ as a feed-forward neural network parametrized by $\theta$ to produce new data $x_{\text{aug}}$, e.g., a rooftop solar panel image. $x_{\text{aug}}$ is a random variable whose complex distribution is expected to be learned from the distribution of real data samples $x_{\text{real}}$. In basic generative adversarial networks (GANs), the generator captures the mapping from a random variable $z$ to the image $x_{\text{aug}}$, where $z$ is usually sampled from a Gaussian noise distribution with the same size as $x_{\text{real}}$. To enforce similar feature patterns in augmented images, the generative model $G$ is trained against another neural network, a discriminator $D$. The discriminator is parameterized by $\phi$ to score the outputs (comparing $x_{\text{aug}}$ with $x_{\text{real}}$), computing the probability that an image comes from the real dataset. It aims to assign a high score to a real image $x_{\text{real}}$ while assigning low scores to the generated image $x_{\text{aug}}$. Mathematically, the discriminator is realized as a classifier that maximizes an objective of binary cross-entropy form, increasing the distinguishability between real and fake data. In contrast, the goal of the generator $G$ is to produce outputs that achieve high scores from the discriminator $D$, satisfying the constraints imposed by $D$ in the process. Therefore, $G$ minimizes the same objective. Thus, for training the GANs, the objective (1), denoted $L_{\text{GAN}}$, takes the expectation over the random variables, and the training optimizes $\min_\theta \max_\phi L_{\text{GAN}}$. Such an adversarial training scheme can augment images on a larger invariant space, which is implicit in the data space. GANs have become popular in solving power system problems because they adopt an adversarial strategy in generative learning to augment the diversity of labeled data. The authors of [33] first leverage this characteristic to generate scenarios of renewable outputs, which mimic diverse conditions and uncertainties to produce more renewable data. Similarly, a GAN model is used to estimate the unknown power injection at unobservable loads based on the available historical measurements [34], [35]. Because generation from Gaussian noise alone is insufficient to cover the needs of specific data patterns, conditional inputs are used to restrict the generated data to a particular class, such as weather conditions of high wind, real-time system configurations of topology/admittances, and electricity market data [33], [34], [36], [37].
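For reference, a standard form of the adversarial objective that matches the description in this subsection is sketched below. The notation is an assumption: $p_{\text{data}}$ and $p_z$ are illustrative names for the real-image and noise distributions, and $D_\phi(\cdot)$ is read as the probability that an image is real.

```latex
L_{\text{GAN}}(\theta, \phi)
  = \mathbb{E}_{x_{\text{real}} \sim p_{\text{data}}}\big[\log D_\phi(x_{\text{real}})\big]
  + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D_\phi(G_\theta(z))\big)\big],
\qquad
\min_{\theta}\ \max_{\phi}\ L_{\text{GAN}}(\theta, \phi)
```

Under this form, the first term rewards the discriminator for scoring real images highly, and the second term rewards it for scoring generated images low, while the generator pushes the second term in the opposite direction.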

B. Embed PV Characteristics in Image Augmentation to Enhance Detection
The basic GANs generate new instances from noise inputs $z$ to provide flexibility in generation. However, this easily causes insufficient learning of the complex image distribution, especially when the labeled PV image data is limited for discriminative learning. The direct application of GANs creates bad samples such as clustered solar panel pixels and object mixtures in the background. The generation process needs to be guided by more explicit information. Moreover, unlike the assumption of sufficient training data in other GAN implementations, our task has the precondition of limited labeled solar panel images. In addition, the expected outputs should have distinct solar panels located on the roof, with trees, pools, and other objects in the background.
Therefore, we adopt the conditional setting of GANs to consider extra information together with Gaussian noise. Specifically, $y$ is an embedding variable that conditions the generative model on the external information. In our case, solar panels naturally have a characteristic shape that distinguishes them from other objects, so we use grey-scale outlines for $y$. The joint inputs of $y$ and noise $z$ contribute to the replication of real PV images with both a "backbone" representation and a flexible background, as presented on the left-hand side of Fig. 4. Similar to the basic GANs, the generator improves during training to fool the discriminator into assigning a relatively high score compared to the real images. Mathematically, the generator in conditional GANs learns the mapping $\{y, z\} \to x_{\text{aug}}$, where $y$ is the given information, $z$ is the random noise, and $x_{\text{aug}}$ is the output image. Meanwhile, the discriminator $D$ is trained to distinguish between the real and generated images, for which the middle part of Fig. 4 illustrates the adversarial training scheme. $D$ is also conditional in this setting: it is fed with the image concatenated with the corresponding "backbone" information $y$. In implementation, we have compared the conditional and unconditional settings of $D$, and the conditional $D$ better leads the augmented image $x_{\text{aug}}$ to follow the PV characteristics. Thus, we modify the objective for training the GAN in (1) to account for the extra information $y$, which gives the conditional objective $L_{\text{Aug}}$ in (2).
Despite this benefit, the grey-scale outline input carries reduced information on PV characteristics, mainly embedding the shape and position. To compensate for the remaining information that is essential for PV classification, we regularize the generation process with the limited but available labeled data. Although GANs do not use an explicit loss function, adding a traditional loss can benefit the image generation. Motivated by [38], we consider a content loss and a style loss to remain consistent with PV characteristics at the pixel level and in the feature space. In the PV image generation task, the content loss is the mean square error between the generated image and the real image at the pixel level. Minimizing the content reconstruction loss recovers the detailed pixel information when the pair of grey-scale outline and ground-truth image is available. Notably, using the content loss alone yields only blurry and deterministic output images, which makes the conditional inputs of GANs in (2) necessary for our task. Moreover, since the labeled PV image data is limited, we minimize the style loss to guarantee style consistency in the feature space among different images. Thus, the loss for image consistency of PV characteristics, $L_{\text{Im}}$, is a weighted sum of the content loss and the style loss. Similar to the content loss, the style loss is the error between the feature correlations expressed by the Gram matrices $A_{ij} = \sum_{k=1}^{K} F_{ik} F_{jk}$, where $F$ is the feature map of the image. The hyperparameters $\alpha$ and $\beta$ are weights that determine the emphasis on the two losses.
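The content and style terms can be illustrated with a small, framework-free sketch. It assumes that $\alpha$ weights the content loss and $\beta$ the style loss, that both are plain mean square errors, and that feature maps are given as lists of channel rows; the function names and default weights are hypothetical. In practice the feature maps would come from a pretrained network rather than being passed in by hand.

```python
def gram_matrix(features):
    """A[i][j] = sum_k F[i][k] * F[j][k] for a feature map F (channels x positions)."""
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in features] for fi in features]


def mean_square_error(a, b):
    """Mean square error between two equally shaped 2-D lists."""
    squared = [(x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(squared) / len(squared)


def image_consistency_loss(gen_pixels, real_pixels, gen_feats, real_feats,
                           alpha=1.0, beta=1.0):
    """Weighted sum of a pixel-level content loss and a Gram-matrix style loss.

    The pairing of alpha with content and beta with style is an assumption.
    """
    content = mean_square_error(gen_pixels, real_pixels)
    style = mean_square_error(gram_matrix(gen_feats), gram_matrix(real_feats))
    return alpha * content + beta * style


example_gram = gram_matrix([[1, 2], [3, 4]])
```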

C. Integrate Classification Feedback Into PV Image Augmentation for Mutual Benefits
The generator is enhanced by embedding prior knowledge of PV characteristics in both the inputs and the loss function. In addition, GANs use the discriminator to judge the quality of generated data by assigning scores. Although the discriminative learning scheme is beneficial, the DNN-based discriminator can be too general to control the generation process as expected. In our case, the specific goal of image generation/augmentation is to improve the accuracy of PV detection and quantification: the GAN enlarges the data used to train the classifier for rooftop solar panels. Usually, data augmentation and classification are separate steps, where the former cannot receive performance feedback from the latter. It is then difficult to make improvements based on causal reasoning, and propagation errors between steps may exist. To solve these problems, we propose to integrate the two steps in a loop, which guides the augmentation towards improving the classifier $f_1$. As the bottom half of Fig. 4 shows, we feed the augmented images from the generator into $f_1$ and include the classification loss $L_{\text{Class}}$ during training, where $\eta$ parameterizes the classifier network and $P$ is the probability that $x$ belongs to class 1, meaning that a solar panel exists in the augmented image. In this way, the evaluation metric constrains the generation process. Since the classifier $f_1$ is integrated into the GAN structure, we regard it as a second discriminator that distinguishes augmented images that can improve PV classification.
In short, the final optimization is $\min_{\theta,\eta} \max_{\phi} L_{\text{Aug}} + L_{\text{Im}} + \lambda L_{\text{Class}}$, where $\lambda$ is the hyperparameter (set to 10 by default in this paper) that specifies the weight of the classification loss relative to the generator loss $L_{\text{Aug}}$. In this way, the labeled satellite images are effectively augmented to improve the detection of PV systems.
Moreover, since we utilize task feedback as weak supervision, the PV classifier is trained by minimizing $L_{\text{Class}}$. During the training process, the generator and the classifier are updated in turn for multiple iterations until convergence. After learning, dataset augmentation and classification are finished simultaneously, so there is no extra step to train an additional model for the solar panel detection task.
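One plausible way to write the losses combined in this subsection is sketched below. The conditional adversarial form of $L_{\text{Aug}}$ and the cross-entropy form of $L_{\text{Class}}$ are assumptions based on standard conditional GAN and classification practice; only the combined objective and the default $\lambda = 10$ are taken from the text.

```latex
\begin{align}
L_{\text{Aug}}(\theta, \phi)
  &= \mathbb{E}_{x_{\text{real}},\, y}\big[\log D_\phi(x_{\text{real}}, y)\big]
   + \mathbb{E}_{y,\, z}\big[\log\big(1 - D_\phi(G_\theta(y, z),\, y)\big)\big] \\
L_{\text{Class}}(\theta, \eta)
  &= -\,\mathbb{E}_{y,\, z}\big[\log P_\eta\big(\text{class } 1 \mid G_\theta(y, z)\big)\big] \\
  &\min_{\theta, \eta}\ \max_{\phi}\ \; L_{\text{Aug}} + L_{\text{Im}} + \lambda L_{\text{Class}},
   \qquad \lambda = 10
\end{align}
```

Read this way, the generator receives gradients from three sources: the conditional discriminator, the image-consistency regularizer, and the classifier acting as a second, task-specific critic.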

D. Specify Conditional Inputs to Represent Distinct PV Features
The self-enhancement and feedback control discussed above provide weak supervision for generating images of rooftop solar panels. Increasing the volume of labeled data is beneficial, but diversity is also essential. For example, effective classifier training expects images with different numbers of solar panels and various background objects. Although the available training dataset does not contain such instances, we introduce a randomness function to form the backbone, i.e., the grey-scale outline shown in Fig. 4. To enable it, we first apply different abstractions to the real labeled images, such as a Sobel filter, semantic segmentor, landmark extractor, and color-specific filter [39], [40], after which we crop the basic backbone of the solar panel target. We then define $t$ and $s$ as the center and size (amount) of a single target, and $r$ as the rotation degree if needed. The backbones of the diversified samples are constructed with different values of $t$, $s$, and $r$. The reconstructed backbone images are fed into the proposed variational GANs.
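The role of $t$, $s$, and $r$ can be sketched with a toy backbone generator. This is only an illustration under simplifying assumptions: the cropped backbone is replaced by a filled rectangle, the canvas size and sampling ranges are arbitrary, and the function names are hypothetical.

```python
import math
import random


def make_backbone(height, width, t, s, r_deg):
    """Draw one filled, rotated rectangle (a stand-in panel footprint) on a blank canvas.

    t: (row, col) center; s: (panel_height, panel_width) in pixels;
    r_deg: rotation in degrees.
    """
    canvas = [[0] * width for _ in range(height)]
    cos_r = math.cos(math.radians(r_deg))
    sin_r = math.sin(math.radians(r_deg))
    for row in range(height):
        for col in range(width):
            # Rotate the pixel offset back into the rectangle's own frame.
            dy, dx = row - t[0], col - t[1]
            u = dx * cos_r + dy * sin_r
            v = -dx * sin_r + dy * cos_r
            if abs(u) <= s[1] / 2 and abs(v) <= s[0] / 2:
                canvas[row][col] = 255
    return canvas


def random_backbone(height=64, width=64, rng=random):
    """Sample t, s, r at random (assumed ranges) and return the backbone with its parameters."""
    s = (rng.randint(6, 16), rng.randint(10, 24))
    t = (rng.randint(16, height - 17), rng.randint(16, width - 17))
    r = rng.uniform(0, 180)
    return make_backbone(height, width, t, s, r), (t, s, r)


backbone = make_backbone(32, 32, t=(16, 16), s=(4, 8), r_deg=0)
```

Calling `random_backbone` repeatedly yields outlines with different positions, sizes, and orientations, which is the kind of diversity the conditional input is meant to inject.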

E. Proposed GAN Architectures and Optimization for PV Classification
The previous designs involve different deep learning functions to enhance PV detection and quantification. In this section, we specify the model architecture of each function and describe the training setup.
1) Generator $G$ for Image Generation: To embed PV position and shape, we input conditional "backbones" to the generator. The desired output should not only retain the localized information from the grey-scale image but also reach high precision in context for classification. To enable such a transformation between high-dimensional inputs and outputs, we adopt the U-shaped architecture [41]. Its upsampling layers balance accurate object localization and context usage, making it capable of generating distinct features of solar panels.
2) Discriminator $D$ With Markovian Consistency: As consistency at both the pixel level and in the feature space is important for solar panel images, the discriminator follows the same rule. Unlike a regular classifier, the discriminator scores the whole image (real or generated) patch by patch, identically and independently, where each patch is an $N \times N$ square of the image with $N \in \{1, 16, 70\}$ [42]. A smaller $N$ focuses on the pixel level to benefit color representation, and a larger value sharpens spatial statistics across features. For example, $N = 1$ is the special case of pixel-level assessment, but it does not bring greater color diversity to the generated images in our task. We find that $N = 70$ works best for generating distinct objects in solar panel images.
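The patch size $N$ corresponds to the receptive field of one discriminator output unit, which can be computed from the convolution stack. The sketch below assumes commonly used patch-discriminator layouts (4 × 4 kernels with strides of 2 followed by strides of 1, or 1 × 1 kernels for the pixel-level case); the exact layer configuration used here is not stated in the text, so `PATCH_LAYOUTS` is illustrative only.

```python
def receptive_field(layers):
    """Receptive field of one output unit for a stack of (kernel_size, stride) layers.

    Layers are listed from input to output; the recursion walks backwards
    using rf_in = (rf_out - 1) * stride + kernel.
    """
    rf = 1
    for kernel, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf


# Assumed layer layouts that yield the three patch sizes N mentioned in the text.
PATCH_LAYOUTS = {
    1: [(1, 1), (1, 1), (1, 1)],
    16: [(4, 2), (4, 1), (4, 1)],
    70: [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)],
}

patch_sizes = {n: receptive_field(layout) for n, layout in PATCH_LAYOUTS.items()}
```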
3) Classifier $f_1$ for PV Quantification: The classifier is the last and most important step to detect and quantify solar panels in images. Meanwhile, it is expected to weakly supervise the generation process through feedback evaluation. The Inception-v3 model achieves state-of-the-art performance in image classification, but its feature extraction layers require a large and diversified dataset to train. Therefore, we pre-train the Inception-v3 model with 1.28 million images covering 1,000 different classes in ImageNet, achieving 93.3% accuracy [43], [44]. We reuse the feature extraction layers and retrain the last decision-making layer with our generated solar panel dataset for evaluation.

4) Configuration: During training, we use the Adam optimizer with a learning rate of 0.0002 and momentum parameters $\beta_1 = 0.5$, $\beta_2 = 0.999$, training for 200 epochs in each experiment. The batch size is set to 5 due to device limits. All experiments are run on a computer equipped with an Intel(R) Core(TM) i7-9700K CPU and an Nvidia GeForce RTX 3080 Ti GPU.
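The training setup can be collected in a small configuration object, sketched here with hypothetical field names. The values are the ones stated in the text (with $\lambda = 10$ from Section III-C); the helper `steps_per_epoch` is an added convenience, not part of the described method.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    learning_rate: float = 2e-4   # Adam learning rate
    adam_beta1: float = 0.5       # Adam momentum parameter beta_1
    adam_beta2: float = 0.999     # Adam momentum parameter beta_2
    epochs: int = 200
    batch_size: int = 5           # limited by device memory
    lambda_class: float = 10.0    # weight of the classification loss

    def steps_per_epoch(self, num_samples: int) -> int:
        """Number of optimizer steps needed to see every sample once."""
        return math.ceil(num_samples / self.batch_size)


config = TrainConfig()
```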

IV. ESTIMATE GENERATION OUTPUT VIA CLOSE PROXIMITY OF PV SYSTEMS
Knowing the quantity of solar panels at a specific location helps gain insight into generation capability, but its effectiveness needs to be further enhanced by historical PV generation profiles. Thus, to estimate solar generation output, we integrate the proposed PV detection and quantification with data-driven solar irradiance forecasting, as shown on the right-hand side of Fig. 2. Specifically, historical observations show that the PV generation patterns of neighboring installations are similar because of their similar numerical weather conditions. In residential areas, neighboring houses are usually covered by one basic spatial unit and thus share the same feature values in numerical weather predictions. Therefore, this section describes the feature selection and generation estimation based on neighbor information.

A. Select Relevant Features Based on Weather Conditions
Many different features are available for generation estimation, for example, the geographical features of longitude, latitude, and altitude; the weather conditions of temperature, pressure, humidity, and cloudiness; and the quantity of modules. To increase the information gain and select the most relevant features, we follow a filter method for feature selection based on the information gain of minimum description length (MDL) [45]. Let $a_{\cdot,\cdot}$ denote the number of training samples, $a_{i,\cdot}$ the number of training samples from class $P_i$, and $a_{\cdot,j}$ the number of samples with the $j$-th value of the given feature. Accordingly, $a_{i,j}$ is the number of instances that are from class $P_i$ and have the $j$-th value of the given feature. If we have $P$ classes, the information gain MDL is defined through the logarithms of the numbers of possible combinations of class labels. A high score of information gain in the geographical space shows that neighboring regions have similar PV generation patterns. The quantity of modules is already available from the previous learning results and serves as one of the most important variables, in that more solar panel units generate more power.
Moreover, to further boost the information gain for generation estimation, we consider the thermal space. The temperature is correlated with the efficiency of solar cells [46], and hence with the efficiency of power generation. Based on the feature set, we learn to predict the power generation time series of the unknown solar power system.
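An MDL-style feature score built from the counts $a_{i,j}$ can be sketched as follows. The expression used here (prior code length of the class labels minus the code length after splitting by feature value, normalized by the sample count) is a commonly used MDL attribute criterion; whether it matches the exact definition in [45] is an assumption, and the function names are hypothetical. Features are assumed to be discrete or already discretized.

```python
import math
from collections import Counter


def log2_multinomial(counts):
    """log2 of the multinomial coefficient (sum(counts))! / prod(count!)."""
    total = sum(counts)
    ln_value = math.lgamma(total + 1) - sum(math.lgamma(c + 1) for c in counts)
    return ln_value / math.log(2)


def log2_binomial(n, k):
    """log2 of the binomial coefficient C(n, k)."""
    return log2_multinomial([k, n - k])


def mdl_gain(feature_values, class_labels):
    """MDL-style score of one discrete feature; higher means more informative."""
    n = len(class_labels)
    classes = sorted(set(class_labels))
    num_classes = len(classes)
    class_counts = Counter(class_labels)                 # a_{i,.}
    by_value = {}
    for value, label in zip(feature_values, class_labels):
        by_value.setdefault(value, Counter())[label] += 1  # a_{i,j}

    prior = log2_multinomial([class_counts[c] for c in classes])
    prior += log2_binomial(n + num_classes - 1, num_classes - 1)

    post = 0.0
    for counts in by_value.values():
        n_j = sum(counts.values())                        # a_{.,j}
        post += log2_multinomial([counts[c] for c in classes])
        post += log2_binomial(n_j + num_classes - 1, num_classes - 1)
    return (prior - post) / n


labels = [0, 0, 0, 0, 1, 1, 1, 1]
informative_score = mdl_gain([0, 0, 0, 0, 1, 1, 1, 1], labels)
uninformative_score = mdl_gain([0, 1, 0, 1, 0, 1, 0, 1], labels)
```

In this toy example the feature that perfectly separates the classes receives a positive score, while the feature that is independent of the class receives a slightly negative one, so ranking features by the score keeps the former.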

B. Apply Nearest Neighbor Approach
With a set of carefully selected relevant features, we aim to fully utilize the latent correlation to make a precise estimation of PV generation. Fig. 5 shows a clear correlation among the generation of "neighboring" PV systems located in different ZIP (Zone Improvement Plan) code areas. The Euclidean distance is a promising measure of the distance between any two samples in the feature space, not limited to geographical distance [47]. Since the PVs are identified in the previous learning process, their coordinates are known. Let $q_i$, $x^{\text{latitude}}_i$ and $x^{\text{longitude}}_i$, and $x^{\text{weather}}_i$ denote the module quantity, the coordinates, and the weather features of sample $i$, which together form its feature vector.
In the feature space, the closest $K$ points to $S^{F}_{\text{test}}$ are used for training purposes. Their respective distances are used as the weights $d_K$, and the PV generation time series of the $K$ nearest neighbors are represented as $g_t^K$. For $t = 1, \ldots, T$ time points, the PV generation of the unknown solar power system is given by $y_t$. Thus, the estimate $y_t$ is formed from the neighbor series $g_t^K$ weighted by $d_K$. This procedure gives an estimate of power generation for the unknown rooftop solar power system. We select the hyperparameter $K$ through two different error metrics, namely the mean square error (MSE) and the mean absolute percentage error (MAPE). MSE is calculated using the predicted power generation time series $y_t$ and the real power generation $y_{t,\text{real}}$, where $t = 1, \ldots, T$ are the sequential time slots:
$$\mathrm{MSE} = \frac{1}{T}\sum_{t=1}^{T}\left(y_{t,\text{real}} - y_t\right)^2.$$
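A minimal sketch of the nearest-neighbor estimate described above follows. It assumes normalized inverse-distance weights, whereas the text only states that the distances serve as weights. The function name, the feature layout, and the toy values are hypothetical.

```python
import math

def knn_generation_estimate(test_point, train_points, train_series, k=3):
    """Estimate a generation time series from the k nearest systems in feature space.

    Assumption: neighbors are combined with normalized inverse-distance weights.
    """
    dists = [math.dist(test_point, p) for p in train_points]  # Euclidean distance
    nearest = sorted(range(len(train_points)), key=dists.__getitem__)[:k]
    eps = 1e-9  # avoids division by zero for a coincident neighbor
    weights = [1.0 / (dists[i] + eps) for i in nearest]
    total = sum(weights)
    horizon = len(train_series[0])
    return [
        sum(w * train_series[i][t] for w, i in zip(weights, nearest)) / total
        for t in range(horizon)
    ]

# Hypothetical, already-normalized features: (latitude, longitude, temperature, modules)
train_points = [(0.0, 0.0, 0.5, 0.2), (0.1, 0.0, 0.5, 0.2), (0.9, 0.8, 0.1, 0.9)]
train_series = [[0.0, 2.0, 4.0], [0.0, 4.0, 8.0], [0.0, 20.0, 40.0]]
estimate = knn_generation_estimate((0.05, 0.0, 0.5, 0.2), train_points, train_series, k=2)
print(estimate)
```

Because the Euclidean distance mixes coordinates, weather, and module counts, the features are assumed to be scaled to comparable ranges beforehand; otherwise the feature with the largest numeric range would dominate the neighbor search.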

V. NUMERICAL TESTS
To evaluate the proposed methods on PV detection and generation estimation, we conduct experiments on realistic test cases in the following.

A. Data Pre-Processing and Tool Selection for Validation
First, we aim to enlarge the labeled solar panel image data so that we can train a classifier to detect rooftop solar panel locations from the geographic information system (GIS). We collect the original data from SunPower Inc. and a local utility; the data contain the known installation locations of rooftop solar panels in GIS. With the longitudes and latitudes, the satellite images are sampled at a fixed resolution of 4800 × 2987 on Google Earth (fixed zoom level). Google Earth is free and has several benefits compared to commonly used rivals such as SPOT5 or DigitalGlobe satellite imagery. The study in [48] illustrates that the high-resolution imagery of the Google Earth archive has robust positional accuracy and plays a vital role in solar panel image recognition and quantification. The original positive sample number is only 1017 + 2813 = 3830, while related detection work used tens or hundreds of times as many positive samples for training. The classification performance is greatly affected since the limited image data cannot cover the scenarios that exist in reality. Because the original image dataset is highly unbalanced, we select a subset of the images without solar panels as negative samples and obtain 15000 in total. Moreover, we use the geographic coordinates to integrate time series of power generation data (612 + 239 data sequences). Typically, the recorded PV time series has a time resolution of 15 minutes over a whole year. Therefore, to align the data format, we set the time resolution of the weather conditions to 15 minutes as well. We aim to show robustness by validating the method in two different states in the U.S.: 1) Tempe, Arizona and 2) Santa Ana, California.
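As an illustration of the time alignment mentioned above, the following sketch resamples weather readings onto the 15-minute grid of the PV data. The original weather resolution and the resampling rule are not stated in the text, so hourly readings and linear interpolation are assumptions here, and the function name is hypothetical.

```python
from datetime import datetime, timedelta

def resample_to_15min(times, values):
    """Linearly interpolate readings onto a 15-minute grid (needs at least two readings)."""
    step = timedelta(minutes=15)
    out, i, t = [], 0, times[0]
    while t <= times[-1]:
        while times[i + 1] < t:  # advance to the interval that contains t
            i += 1
        t0, t1 = times[i], times[i + 1]
        frac = (t - t0) / (t1 - t0)  # timedelta / timedelta gives a float
        out.append((t, values[i] + frac * (values[i + 1] - values[i])))
        t += step
    return out

# Hypothetical hourly temperature readings (degrees Celsius).
hours = [datetime(2015, 6, 1, h) for h in (8, 9, 10)]
temps = [20.0, 24.0, 22.0]
aligned = resample_to_15min(hours, temps)
print(len(aligned), aligned[1], aligned[-1])
```

Slowly varying quantities such as temperature tolerate this kind of interpolation reasonably well; fast-changing quantities such as cloudiness may call for a different rule.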

B. Enriched Labeled Data With Physics Embedding Improve Detection Accuracy
In the following, we validate the accuracy of PV detection and quantification of the proposed weakly-supervised GANs.
1) Effective Augmentation: To solve the data deficiency problem, we first consider standard augmentation techniques that reflect real conditions: residential houses can face different directions, and the house styles can be quite similar within each community. Therefore, we increase the labeled dataset by adequately rotating images via the rotation matrix $R = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$, $\theta \in [0, 2\pi]$. Since direct rotation changes the horizontal structure of the image, we pre-process the data by cropping them circularly, using the circular segmentation-based approach in [49], so that the shape stays consistent no matter how the image is rotated (Fig. 6). In such augmentation, the information around the edges and vertices is compromised, and the performance suffers. Meanwhile, improving the classifier requires sufficient and varied positive samples, e.g., images with solar panels in different scenarios. Therefore, we expand the labeled dataset with the GANs.

We validate the proposed GAN with two types of metrics on the generated data quality. On the one hand, the Fréchet Inception Distance (FID) [50] measures the closeness of extracted features between generated data and real data for image generation tasks. Namely, a lower FID score reveals more similarity and thus a better quality of augmented images. On the other hand, the image dataset is enlarged to boost classification performance, so we use the test performance of the solar panel classification to assess the augmentation quality. The output labels of the classifier $f_1$ serve the quantification task, and we compute the accuracy by comparing the $f_1$ outputs $\hat{q}_i$ with the ground truth labels $q_i$:
$$\text{Quantification Accuracy} = \frac{\text{Amount of Samples}\{\hat{q}_i = q_i\}}{\text{Amount of Total Samples}}.$$
We further compute the commonly used metrics to evaluate the detection performance, which are
$$\text{Detection Precision} = \frac{TP}{TP + FP}, \tag{7}$$
$$\text{Detection Recall} = \frac{TP}{TP + FN}, \tag{8}$$
$$\text{Overall Accuracy} = \frac{TP + TN}{\text{Amount of Total Samples}}. \tag{9}$$
The "true" and "false" represent whether or not the classification results are the same as the labels, while "positive" and "negative" mean with and without solar panels in the image as the two classes. Namely, "true positive (TP)" represents that solar panels exist and are correctly detected. Similarly, "false positive (FP)" means a positive (wrong) estimation for a negative sample. For our task, we aim to evaluate the model's capability of correct solar panel detection. First, we want to know what proportion of the detected solar panels actually exist. Thus, the detection precision in (7) measures the ratio of correct solar panel detections among all the positive-output (with solar panel) samples. It is used to evaluate the quality of augmented positive examples, as shown in Table I. Meanwhile, we are interested in what proportion of the images with actually existing solar panels are detected. This is computed by the detection recall in (8). Moreover, we include 1) the overall classification accuracy to represent the correct classification of both positive and negative samples, and 2) the quantification accuracy to represent the correct identification of the number of existing solar panels.

First, we tried popular architectures, such as DCGAN, LS-GAN, and WGAN-GP [51], [52], [53]. However, the learned mapping from simple noise to target images poorly follows our expected direction of distinct solar panels in images, as shown in Fig. 7. Thus, the proposed model improves upon two aspects: providing informative inputs and embedding feedback evaluation.
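The detection and quantification metrics used in this subsection can be sketched as follows. The sketch assumes that each image carries a panel count and that a count of zero is the "negative" class, in line with merging detection and quantification into one classification task. The function name and the toy counts are hypothetical.

```python
def detection_metrics(predicted_counts, true_counts):
    """Compute detection and quantification metrics from per-image panel counts.

    Assumption: an image is "positive" when its panel count is greater than zero.
    """
    pairs = list(zip(predicted_counts, true_counts))
    tp = sum(1 for p, t in pairs if p > 0 and t > 0)
    fp = sum(1 for p, t in pairs if p > 0 and t == 0)
    fn = sum(1 for p, t in pairs if p == 0 and t > 0)
    tn = sum(1 for p, t in pairs if p == 0 and t == 0)
    total = len(pairs)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "overall_accuracy": (tp + tn) / total,
        "quantification_accuracy": sum(1 for p, t in pairs if p == t) / total,
    }

# Hypothetical classifier outputs for six images (number of panels per image).
predicted = [0, 1, 2, 0, 3, 1]
actual = [0, 1, 3, 2, 3, 0]
metrics = detection_metrics(predicted, actual)
print(metrics)
```

The third image shows why the two accuracies differ: predicting two panels where three exist still counts as a correct detection, but not as a correct quantification.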
2) Informative Inputs for Self-Enhancement: In this task, the dependent information $y$ is the grey-scale image that covers the number of solar panels and their positions in an image. We adopt the Pix2Pix model [43] as a basis for its superior image translation. The abstraction of grey-scale images is an edge detection task, for which we select the Sobel filter among different filters [39]. It convolves the image with a small, separable, and integer-valued filter in the horizontal and vertical directions and is relatively efficient in computation. The Sobel filter enhances the edges of objects in grey-scale images by performing differentiation (which gives the edge response) and smoothing (which reduces noise) concurrently.
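To make the edge abstraction concrete, here is a minimal pure-Python sketch of the Sobel gradient magnitude on a grey-scale image stored as a list of rows. The kernels are the standard 3 × 3 Sobel kernels; the zero border handling and the toy image are simplifying assumptions.

```python
import math

# Standard 3x3 Sobel kernels for the horizontal and vertical derivatives.
KX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
KY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def sobel_magnitude(image):
    """Return the gradient magnitude of a 2-D grey-scale image (list of rows).

    The kernels are applied as a correlation; compared with a true convolution
    only the sign of each response flips, so the magnitude is unchanged.
    Border pixels are left at zero for simplicity.
    """
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = sum(KX[i][j] * image[r - 1 + i][c - 1 + j] for i in range(3) for j in range(3))
            gy = sum(KY[i][j] * image[r - 1 + i][c - 1 + j] for i in range(3) for j in range(3))
            out[r][c] = math.hypot(gx, gy)
    return out

# Toy 5x5 image: dark background on the left, bright "panel" on the right.
image = [[0, 0, 255, 255, 255] for _ in range(5)]
edges = sobel_magnitude(image)
print(edges[2])
```

The response is nonzero only around the vertical boundary between the dark and bright regions, which is the kind of outline a "backbone" input is meant to carry.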
During training, we found that using only the base model (Pix2Pix) makes it hard for the generator to converge. The discriminator converges fast and cannot further improve the generation, because our solar panel data is complex with multiple objects. Such data complexity can also be observed from a high base FID value (131.07 ± 3.47), for which we randomly separate the real solar panel dataset into two groups and measure the similarity between them. Therefore, we increase the depth of the discriminator $D$ to balance it with the generator capability and use more task-specific losses as proposed, leading to the complete model in Table I. The solar images generated by the proposed complete model achieve a much lower FID score than the base data generation model (Pix2Pix). The visual comparison in Fig. 8 shows an intuitive improvement. The images "translated" by Pix2Pix in the middle resemble the real ones from Google Earth and are more comparable than those from standard GANs (Fig. 7). Nevertheless, the generated solar panels often mix with the background objects, and the Pix2Pix model sometimes collapses around 100 epochs or even earlier. In contrast, our model with specific guidance can better recover the target solar panels in the image with clear edges and color, as shown at the bottom of Fig. 8.

To demonstrate the contribution of each component, we conduct an ablation study. The first row of Table I shows that the content loss plays an essential role in synthesizing images. We select $L_{\text{Content}}$ and $L_{\text{Style}}$, which both make a large difference in the FID score, as shown in Fig. 9. While the pixel-level loss can encourage color similarity of the image, it can also produce blurry results. The style loss better corrects the color of the target objects, leading to distinct solar panels. Moreover, the contribution to the subsequent classification task is more essential for PV detection; it is revealed in the next rows of Table I and analyzed in the next section.
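For reference, FID is commonly computed by fitting a Gaussian to the Inception features of each image set and comparing the two Gaussians. The symbols $\mu_r, \Sigma_r$ (mean and covariance of real-image features) and $\mu_g, \Sigma_g$ (generated-image features) are notation introduced here for illustration; the text itself relies on [50] for the definition.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Two sets with identical feature statistics give an FID of zero, so the base value measured between two random halves of the real data serves as the reference level against which the generated images are judged.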
3) Feedback Evaluation to Constrain the Augmentation: As the goal is to improve the classifier for solar panel detection, we channel the feedback from the state-of-the-art pretrained Inception-v3. The images generated under such weak supervision are then fed into the classifier as positive samples, and the cross-entropy loss is fed back to the generator for control. The GAN and the Inception-v3 classifier are trained simultaneously to augment positive samples. Specifically, the last four columns of Table I show the results of an ablation study of the proposed variational GAN model, where each column reveals the contribution of each design to the PV image augmentation and PV classification.
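The feedback mechanism described above can be sketched as an extra term in the generator objective. This is a simplified illustration under stated assumptions: the classifier is reduced to a single "solar panel present" probability per generated image, the losses are combined as a weighted sum, and the weights `w_content`, `w_style`, and `w_feedback` are hypothetical placeholders rather than values from the paper.

```python
import math

def cross_entropy_positive(probabilities):
    """Mean cross-entropy of generated images labelled as positive (panel present).

    `probabilities` holds the classifier's predicted probability of the positive class.
    """
    eps = 1e-12  # guards against log(0)
    return -sum(math.log(max(p, eps)) for p in probabilities) / len(probabilities)

def generator_objective(adversarial, content, style, classifier_probs,
                        w_content=1.0, w_style=1.0, w_feedback=1.0):
    """Combine the generator losses; the weighted-sum form and weights are assumptions."""
    feedback = cross_entropy_positive(classifier_probs)
    return adversarial + w_content * content + w_style * style + w_feedback * feedback

# Hypothetical loss values for one batch of generated images.
confident = generator_objective(0.5, 0.2, 0.1, [0.9, 0.8])
doubtful = generator_objective(0.5, 0.2, 0.1, [0.3, 0.2])
print(confident, doubtful)
```

When the classifier fails to recognize panels in the generated images, the feedback term grows, so the generator is pushed toward images with distinct, recognizable solar panels.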
Without any design, the performance deteriorates. Comparing the third and the last columns of Table I, we observe that although integrating the classifier does not significantly improve the FID on image similarity, it benefits the post-classification task with higher accuracy.
Beyond the comparison in the ablation study, we explore the improvement of the classifier with different data availability to show the benefit of GAN-based augmentation. The following datasets are considered to train the classifier for PV detection and quantification: 1/2 of the original dataset with limited labels (1/2), the entire original dataset (original), the labeled dataset enlarged by basic functional image augmentation such as rotation and flip (basic Aug.), and different sizes of expanded labeled datasets from the proposed variational GAN (GAN Aug. 1×, 2×, and 3×). As the image generation improves, the loss of the classifier keeps decreasing until convergence (Fig. 10). We observe that the losses close to convergence are lower and more stable. Moreover, we test the trained model on randomly sampled images of residential areas in California and Arizona and show the results in Fig. 11. Previous data-driven methods assume sufficient labeled data for training, so their accuracy is high when using only sufficient original datasets or basic augmented data. The adopted classifier $f_1$, the Inception-v3 model, is a benchmark model with state-of-the-art performance in image classification. In our case, using only the original labeled dataset or the dataset with basic augmentation, the accuracy is low for PV detection and quantification. We observe a noticeable increase in accuracy in the bar plot when using the proposed GAN method to augment the labeled data. With an increasing amount and diversity of augmented data, the testing accuracy keeps going up and reaches comparable or even better performance than previous data-driven methods that assume sufficient labeled data.

C. Robust Generation Estimation Via Flexible Features and K-NN
The validation of PV detection and quantification brings us to the next stage of validating generation estimation. There are many different features available for estimating PV generation. To maximize the information gain, we select the most relevant features using a filter method-based approach. The result of that approach is a set of features that includes geographical coordinates, the weather conditions of temperature and cloudiness, and the quantity of modules. Furthermore, the regions differ from one another in solar irradiance, climate, soiling profile, and terrain [54]. The generation of such solar panels is provided by SunPower Inc. and is used as the training set for learning. We use the K-nearest neighbor (K-NN) method for learning by assigning weights to the time-series data of the geographically nearest points. The only hyperparameter to be tuned here is $K$. We present the performance changes with respect to $K$ later.
Then, we apply the weighted K-NN regression to predict the short-term solar power generation. Fig. 12 compares the predicted generation with the ground truth for an entire year. The previous data-driven method refers to the support vector regression (SVR) model, which has shown good performance in PV generation estimation [11], [12], [15]. Since the experiment setups are different, we use the SVR model but change its inputs to be the same as in our case. Specifically, the SVR implementation does not have a GAN-based data augmentation to train a classifier for PV quantification as a preliminary step. To better visualize the comparison, both real and predicted generation data are downsampled. We plot the long-term accumulated generation by month in Fig. 13. The results are expressed in a bar plot over the 12 months of 2015, where the black error bars indicate the variance in generation estimation. From a general energy production point of view, the model performance is stable in the months when the weather is stable. For example, the estimations for June, July, November, and December have higher accuracy than those for March, April, September, and October.
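The monthly accumulation described above can be sketched as follows. The sketch assumes a fixed 15-minute step and power values in kW, so that each sample contributes power × 0.25 h of energy; the units, the function name, and the toy series are assumptions not stated in the text.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def monthly_energy_kwh(start, power_kw, step_minutes=15):
    """Accumulate a fixed-step power series (kW) into energy per calendar month (kWh)."""
    hours = step_minutes / 60.0
    totals = defaultdict(float)
    t = start
    for p in power_kw:
        totals[(t.year, t.month)] += p * hours  # rectangle rule: power x interval length
        t += timedelta(minutes=step_minutes)
    return dict(totals)

# Hypothetical series: 2 kW for the last hour of January 2015, then 4 kW for the first hour of February.
series = [2.0] * 4 + [4.0] * 4
energy = monthly_energy_kwh(datetime(2015, 1, 31, 23, 0), series)
print(energy)
```

Applying the same accumulation to the real and the predicted series gives the pairs of monthly totals that a bar plot such as Fig. 13 compares.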
To select the best number $K$ of neighboring data samples for PV generation estimation, we compare the performances in Table II. $K = 3$ is the optimal choice for both error metrics. If the data samples of more or fewer nearest neighbors are considered, the estimation errors increase. In particular, if we select $K > 4$, the error is much higher, because an excessive feature distance between the considered PV systems could increase the variance of the estimation.
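The selection of $K$ by the two error metrics can be sketched as a simple sweep. The predictor is passed in as a callable so that any estimator can be plugged in; `toy_predict` below is a hypothetical stand-in whose optimum (K = 2) is arbitrary and unrelated to Table II. Skipping time slots with zero real generation in the MAPE is also an assumption.

```python
def mse(y_real, y_pred):
    """Mean square error between real and predicted series."""
    return sum((r - p) ** 2 for r, p in zip(y_real, y_pred)) / len(y_real)

def mape(y_real, y_pred):
    """Mean absolute percentage error, in percent.

    Assumption: slots with zero real generation (e.g., at night) are skipped
    to avoid division by zero; the text does not describe this handling.
    """
    terms = [abs(r - p) / r for r, p in zip(y_real, y_pred) if r != 0]
    return 100.0 * sum(terms) / len(terms)

def select_k(candidates, predict, y_real):
    """Return per-K (MSE, MAPE) scores and the K that minimizes each metric."""
    scores = {}
    for k in candidates:
        y_pred = predict(k)
        scores[k] = (mse(y_real, y_pred), mape(y_real, y_pred))
    best_mse = min(scores, key=lambda k: scores[k][0])
    best_mape = min(scores, key=lambda k: scores[k][1])
    return scores, best_mse, best_mape

# Hypothetical held-out series and a stand-in predictor whose bias grows away from K = 2.
y_real = [1.0, 2.0, 4.0, 2.0]

def toy_predict(k):
    return [y + 0.1 * abs(k - 2) for y in y_real]

scores, best_mse, best_mape = select_k(range(1, 7), toy_predict, y_real)
print(best_mse, best_mape)
```

Reporting both metrics guards against a choice of $K$ that looks good only because of a few high-generation slots (which dominate MSE) or a few low-generation slots (which dominate MAPE).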

VI. CONCLUSION
To accommodate the limited availability and timeliness of PV data in distributed power systems, we propose to systematically enhance PV localization and generation forecasting using multiple data sources such as satellite imagery and numerical weather conditions. Specifically, we first design weakly supervised GANs for solar panel image augmentation. Multiple aspects of GAN enhancement are designed to augment the images in ways that improve PV classification, including 1) preparing backbone images as conditional inputs to embed PV characteristics, and 2) restricting the inexplicit learning process of the GAN model with specific losses. Moreover, we leverage the discriminative training mode of the GAN to integrate PV detection and quantification into the augmentation loop. In this way, the performance of the targeted downstream classification task guides the image generation process. Thus, we obtain the detection results without further effort and combine them with historical neighboring measurements to estimate the PV generation. We validate the proposed approaches on areas of distribution grids that have wide PV coverage but limited prior information. The results show that the proposed approaches can efficiently avoid model collapse in image generation, reach classification performance comparable to methods trained with sufficient data, and obtain accurate generation estimation.

Manuscript received 19 October 2022; revised 10 February 2023; accepted 14 March 2023. Date of publication 29 March 2023; date of current version 26 December 2023. This work was supported in part by the Department of Energy under Grants DE-AR00001858-1631 and DE-EE0009355, in part by the National Science Foundation (NSF) under Grants ECCS-1810537 and ECCS-2048288, and in part by the BIRD Foundation. Paper no. TPWRS-01582-2022. (Corresponding author: Yang Weng.)

Fig. 2. An overview of the proposed approaches.

Fig. 3. A real image acquired from Google Earth and sliced.

Fig. 4. Block diagram of the proposed variational GANs with weak supervision for PV detection and quantification.

Fig. 5. Plot of raw data from the industrial partner SunPower Inc. shows homogeneous curves of power generation for three different ZIP code areas on a single day.
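With $\mathrm{MSE} = \frac{1}{T}\sum_{t=1}^{T}\left(y_{t,\text{real}} - y_t\right)^2$, the variance of the distribution can additionally be captured in the mean absolute percentage error (MAPE):
$$\mathrm{MAPE} = \frac{1}{T}\sum_{t=1}^{T}\frac{\left|y_{t,\text{real}} - y_t\right|}{y_{t,\text{real}}} \times 100\%.$$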

Fig. 6. The classic augmentation of a solar panel image via the basic operation of circular rotation.

Fig. 7. Generated results after training the popular GAN architectures. The repeated noise textures across multiple samples visually indicate under-fitting and a missing training direction.

Fig. 8. Given conditional inputs of "backbones" (top), comparison of the generated rooftop solar panel images from Pix2Pix (middle) and our proposed model (bottom). Without specific guidance, the solar panels easily mix with background objects.

Fig. 9. Visual results of the ablation study based on the obvious differences in FID values (Table I). While the content loss incentivizes the color similarity of the whole image (pixel level), the logical constraint focuses more on the target objects. Such weak supervision improves the generation quality, leading to distinct solar panels in the image.

Fig. 10 demonstrates the trends of the loss during training. There are significant decreases in the converged losses when training the classifier with the data augmented by the proposed GAN. Specifically, Fig. 10(a)-(c) present fast loss decreases before the 50th epoch, but the classification losses cannot go lower after that. In contrast, the training losses in Fig. 10(d)-(f) first increase and then go down. This is due to the simultaneous training of the classifier and the generator for image augmentation. In the first few epochs, the proposed GAN keeps updating its parameters from random initialization, so the generated labeled images are not perfect yet. Therefore, the classifier trained on these generated images has high losses. With the image generation improving, the classifier loss then decreases.

Fig. 10. Classification loss during training (epochs) using different training datasets with respect to augmentation methods: (a) 1/2 of the original dataset with limited labels, (b) the entire original dataset, (c) the enlarged labeled dataset by basic functional image augmentation, and (d)-(f) different sizes of expanded labeled datasets from the proposed variational GAN (GAN Aug. 1×, 2×, and 3×).

Fig. 11. The testing accuracy of PV detection and quantification when training the classifier $f_1$ using different training datasets with respect to augmentation methods.

Fig. 12. Comparison of the real and the predicted downsampled PV generation of a rooftop solar power system.

Fig. 13. Comparison of accumulated real and predicted monthly PV energy generations for the year 2015.

TABLE I
THE QUALITY OF GENERATED DATA AFTER 200 EPOCHS. A LOWER FID INDICATES BETTER SIMILARITY, AND A HIGHER ACCURACY MEANS THAT THE DATA AUGMENTATION BETTER BENEFITS THE POST CLASSIFICATION TASK. TO SERVE AS A BASIS, THE FID OF TWO RANDOMLY SEPARATED REAL DATASETS IS 131.07 ± 3.47 AND THE CLASSIFIER TRAINED WITH REAL DATA HAS A TEST ACCURACY OF 0.59 ± 0.030

TABLE II
COMPARISON OF THE ERRORS OF GENERATION ESTIMATION WITH RESPECT TO DIFFERENT K VALUES