A Compound Poisson Generator approach to Point-Source Inference in Astrophysics

The identification and description of point sources is one of the oldest problems in astronomy; yet, even today, the correct statistical treatment of point sources remains one of the field's hardest problems. For dim or crowded sources, likelihood-based inference methods are required to estimate the uncertainty on the characteristics of the source population. In this work, a new parametric likelihood is constructed for this problem using Compound Poisson Generator (CPG) functionals, which incorporate instrumental effects from first principles. We demonstrate that the CPG approach exhibits a number of advantages over Non-Poissonian Template Fitting (NPTF) - an existing method - in a series of test scenarios in the context of X-ray astronomy. These demonstrations show that the effects of the point-spread function, effective area, and choice of point-source spatial distribution cannot, in general, be factorised as they are in the NPTF, while the new CPG construction is validated in these scenarios. Separately, an examination of the diffuse-flux emission limit is used to show that most simple choices of priors on the standard parameterisation of the population model can result in unexpected biases: when a model comprising both a point-source population and a diffuse component is applied in this limit, nearly all observed flux will be assigned to either the population or to the diffuse component. A new parameterisation is presented for these priors which properly estimates the uncertainties in this limit. With this choice of priors, the CPG correctly identifies that the fraction of flux assigned to the population model cannot be constrained by the data.

A practical and unbiased method of point-source inference is central to the task of recovering the physical characteristics of point-source populations. When point sources are bright and well separated on the sky, their identification and cataloguing is straightforward and requires little statistical machinery beyond the quantification of uncertainties on location and brightness. The primary difficulty arises when point sources are dim, where one must distinguish a putative source from a background fluctuation and account for the possibility of multiple closely overlapping sources - called a crowded field. The calculation of the uncertainty on the number of sources is particularly demanding, as the number of sources is a discrete parameter. Parametric point-source inference methods, such as the Non-Poissonian Template Fitting (NPTF) method [1], side-step this issue by specifying a likelihood conditioned on the characteristics of a population. In particular, the likelihood can be expressed in terms of the mean number of sources - a continuous parameter for which uncertainty calculation is considerably easier.
NPTF is currently the most widely applied parametric point-source inference method in gamma-ray and neutrino astronomy, especially in the analysis of gamma-rays from the Galactic Center. However, there is growing concern that results obtained with NPTF must be interpreted cautiously to avoid potential biases; see the discussion in Refs. [2][3][4][5][6]. In the present work, we will identify several clear biases of the NPTF framework which become most obvious in the following three specific test scenarios motivated by X-ray astronomy: (i) a spatially non-uniform point-source population model; (ii) a non-isotropic instrumental point-spread function (PSF); and (iii) variation in the instrumental detector response, namely the effective area, on scales smaller than the choice of binning - which is typically, in turn, on the scale of the PSF. In all cases, we observe a bias in the recovered number of point sources, as summarised in Fig. 1.
To resolve these biases, we develop a new parametric point-source inference likelihood called the Compound Poisson Generator (CPG) likelihood. This new likelihood addresses the biases in the NPTF apparent in the X-ray domain, and further provides a rigorous treatment of several approximations made in the conventional NPTF approach. In addition, we demonstrate the importance of a carefully chosen prior parameterisation in Bayesian analyses. Priors chosen directly on the standard parameterisation of the differential source-count function can, in the limit where these models are formally indistinguishable, lead to posteriors where the observed flux is assigned to either the point-source population model or an associated diffuse emission model. We define a new coordinate system for the specification of priors that removes this unwanted effect as summarised in Fig. 2.
This work does not address biases that may result from incorrect modelling of the spatial distributions of populations (see in particular Refs. [5,6]), effective area, exposure area, or the PSF. Such biases are not limitations of the point-source inference method; instead, they are concerns for individual applications of point-source inference to specific analyses -although they are no less important for obtaining correct results. In this work, we assume that these effects are modelled correctly, and investigate the statistical methodology of point-source inference.
We organise the remainder of the discussion as follows. Section I reviews the current state of point-source inference, with a more detailed discussion of the concerns with NPTF, where the CPG enters, and how it resolves some of these issues. Section II outlines the point-source population model and the various instrumental effects introduced by standard X-ray instruments such as the NuSTAR satellite telescope, which we use as an example. Having outlined the background and challenges, in Sec. III the CPG likelihood is constructed from first principles. Section IV introduces the priors on the population model parameters and demonstrates how the new coordinate system correctly handles the combination of point-source and diffuse models. The NPTF method is introduced in more detail in Sec. V, where we consider several scenarios motivated by instrumental effects, directly compare results from the CPG and NPTF on these cases, and demonstrate the general efficacy of the new likelihood. An implementation of the CPG is made publicly available here.

I. REVIEW OF POINT-SOURCE INFERENCE
FIG. 2. The median recovered posterior for the novel natural coordinate system (above) and standard coordinate system (below), for multiple choices of priors on the number of sources (N, natural only), flux (F), and differential source-count function normalisation (A, standard only). The posterior is shown for the fraction of flux assigned to the population model, as compared to a diffuse background model. In this scenario, only diffuse flux is present, and so these posteriors should be uniform, as diffuse emission is formally indistinguishable from a population of dim point sources. This is observed only for the natural coordinate system, with one exception discussed further in Sec. IV.

To begin with, we will review the history and current state of point-source inference as it applies to parametric inference in the regime of dim sources and crowded fields. In this regime, automated source extraction is most commonly employed. This involves such methods as thresholding, peak searches, and wavelet decomposition, among others [7]. Source extraction algorithms can be broadly categorised as point estimation methods - they produce a single best-fit or most-significant solution, and this solution can be represented as a point in the parameter space of point sources. For example, in thresholding and peak searches, a pixel is considered part of a source if the measured pixel intensity exceeds some defined threshold, which may be manually selected or defined by a statistical significance estimated from the image. The set of pixels that are classified as belonging to a source is a single point in the parameter space of all such sets of pixels, and may be further transformed into point-source locations by taking the pixels in each set and, for example, computing the centroid of those pixels. The resultant point in parameter space is the algorithm's best guess as to the true parameters.
However, this guess is just that: an estimate of the true locations of the point-sources in an image. As such, the calculation of uncertainties is required in order to quantify the quality of this single best guess.
In this problem, there are two classes of parameters: the individual point-source parameters, ϑ, such as location and brightness, which are continuous quantities; and the number of sources in the image, Ñ, which is a discrete quantity. The measured data, such as the observed image, will be distributed according to the likelihood L(d|Ñ, {ϑ_i}), where {ϑ_i} is a set of size Ñ, one for each of the Ñ sources. The pair (Ñ, {ϑ_i}) forms a point in the parameter space of the likelihood, and is often referred to as a catalogue. When it comes to the calculation of uncertainties, discrete and continuous parameters must be handled in different ways.
Deriving uncertainties on the continuous individual source parameters is the easier of the two. Assuming the likelihood distribution is available, uncertainties can be derived using variational methods, or confidence regions can be constructed should Wilks' theorem apply. Deriving uncertainties on the discrete number of sources, Ñ, is more fraught. The calculation of uncertainty on a discrete parameter is equivalent to the problem of model selection: each choice for the number of sources should be considered as a separate model. Two choices for the number of sources may then be compared using a likelihood ratio between the likelihoods for each of the associated models. This likelihood ratio is a test statistic, and the probability distribution for the ratio must be known if it is to be converted to a p-value. The p-values may then be used to construct a confidence interval on the number of sources. Most often, Wilks' theorem is used to assume that the likelihood ratio is χ² distributed; however, this is only guaranteed in the asymptotic limit, and any deviation of the true distribution from this assumption will result in incorrect confidence intervals.
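The Wilks-based procedure described above can be sketched in a few lines. This is only an illustrative example: the log-likelihood values and the choice of three degrees of freedom (one extra source's position and flux) are our assumptions, not numbers from the paper, and the χ² assumption itself is, as cautioned, only valid asymptotically.

```python
from scipy.stats import chi2

# Illustrative log-likelihoods for two nested source-count hypotheses
# (Ntil sources vs Ntil + 1 sources); these numbers are made up.
logL_n_sources = -104.2
logL_n_plus_1 = -101.7

# Likelihood-ratio test statistic: twice the log-likelihood difference.
test_stat = 2.0 * (logL_n_plus_1 - logL_n_sources)

# Under the Wilks assumption, the statistic is chi^2 distributed with
# d.o.f. equal to the number of extra parameters (here x, y, flux = 3).
p_value = chi2.sf(test_stat, df=3)
```

A small p-value would favour the model with the additional source; outside the asymptotic regime the χ² assumption must be validated, for example with simulations.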
An alternative approach to model selection is the calculation of marginal likelihoods:

p(d|Ñ) = ∫ [∏_{i=1}^{Ñ} dϑ_i p(ϑ_i)] L(d|Ñ, {ϑ_i}),    (1)

where p(ϑ_i) is a prior on the individual source parameters. From this expression, the posterior p(Ñ|d) can be found through Bayes' theorem, and so a credible region can be constructed on the number of sources - avoiding the need for a test statistic as in the likelihood-ratio case.
Variational methods provide only a lower bound on the marginal likelihood of the model, and thus provide no more than an estimate for the posterior distribution on the number of sources. A more common approach is to sample the posterior p({ϑ_i}|d, Ñ) directly using nested sampling [8] or a Markov Chain Monte-Carlo (MCMC). Nested sampling provides an estimate of the marginal likelihood directly, while various techniques exist for deriving the marginal likelihood from MCMC - for details, consult the review by Gelman and Meng [9]. Although this provides the most principled approach to the problem for small numbers of sources, in practice it exhibits a critical flaw: when the number of sources is potentially large, a marginal likelihood must be computed for each possible choice of Ñ - a flaw also shared by the variational and likelihood-ratio methods. For hundreds of sources, this can rapidly become computationally infeasible.
Ultimately, this flaw can be understood as a manifestation of a one-dimensional grid search over Ñ. In fact, MCMC is designed to avoid grid searches by concentrating sampling on regions of parameter space that are most likely; however, the MCMC algorithm is designed exclusively for continuous parameters, while Ñ is a discrete parameter which, when varied, also modifies the number of ϑ parameters. The Reversible Jump Markov Chain Monte-Carlo (RJMCMC) algorithm is an extension of MCMC to discrete parameters that control the number of parameters in the distribution [10]. Probabilistic cataloguing [11,12] is the application of RJMCMC to the problem of point-source inference. While the point estimation methods produce a single best-guess catalogue, probabilistic cataloguing produces posterior samples in the space of all possible catalogues. From this catalogue posterior, a posterior on the number of sources can be calculated. A catalogue posterior provides - essentially by definition - the most general solution to the problem of point-source inference. This generality comes at a high cost: the trans-dimensional parameter transformations required by RJMCMC have to be carefully selected, as they must be tuned to the specific application to ensure that trans-dimensional moves are efficient. In addition, the dimensionality of the likelihood is at least two or three times larger than the number of sources, depending on the number of parameters per source. For thousands of sources, it can be unrealistic to assume that the RJMCMC will properly equilibrate for even a given number of sources, let alone across the number of sources; the application of RJMCMC will require careful and extensive diagnostics, none of which are foolproof.
The solution to this problem requires revisiting the stated inference goal. If the goal truly necessitates the location and intensity of each individual source, then probabilistic cataloguing is necessary. But in this problem area, where the number of sources is large, the sources also likely form a crowded field. In this case, the locations of individual sources will be poorly defined, as they form a near-uniform density within which sources are easily interchangeable. Instead of attempting to track each individual source, a model can be defined for the entire population of sources. The most common choice of model is a differential source-count function: dN/dϑ(θ). This function then has its own parameters, θ, which are characteristics of the population as a whole, such as the spatial distribution of sources, or the average flux of the population. The differential source-count function gives the density of sources as a function of ϑ given θ, and thus implicitly defines a probability distribution:

p(ϑ|θ) = (1/N) dN/dϑ(θ),    (2)

where N is the mean number of sources in the population. This allows marginalisation of the likelihood over both the set of ϑ and further the possible number of sources,

L(d|θ) = Σ_{Ñ=0}^{∞} p(Ñ|N) ∫ [∏_{i=1}^{Ñ} dϑ_i p(ϑ_i|θ)] L(d|Ñ, {ϑ_i}),    (3)

where p(Ñ|N) is a prior on the number of sources given the mean number of sources - usually assumed to be a Poisson distribution. Thus, sampling directly from the posterior of L(d|θ) reduces the high-dimensional non-parametric sampling problem in the space of all possible catalogues to a low-dimensional parametric sampling problem in θ. However, to achieve this we have marginalised out the ϑ parameters, and so identification of the location or brightness of individual sources is no longer possible - the inference problem is now on the population itself.
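The structure of the marginalisation over the number of sources can be made concrete with a toy one-bin model in which every source contributes Poisson-distributed counts with the same mean. The ϑ integrals then collapse, and the marginalised likelihood in Eq. 3 reduces to a single truncated sum over Ñ. This is only a sketch under our own simplifying assumptions; the function name, `s_per_source`, and the truncation `n_max` are illustrative, not from the paper.

```python
import numpy as np
from scipy.stats import poisson

def population_likelihood(d, n_mean, s_per_source, n_max=400):
    """Toy one-bin marginalised likelihood L(d|theta):
    sum over Ntil of p(Ntil|N) * L(d|Ntil), with p(Ntil|N) a Poisson
    prior as assumed in the text. Since the sum of Ntil independent
    Poisson(s) counts is Poisson(Ntil * s), we have
    d | Ntil ~ Poisson(Ntil * s_per_source)."""
    n_til = np.arange(n_max)
    prior = poisson.pmf(n_til, n_mean)            # p(Ntil | N)
    like = poisson.pmf(d, n_til * s_per_source)   # L(d | Ntil)
    return float(np.sum(prior * like))
```

Summing the result over all possible counts d recovers unity, and the mean counts recover N times the per-source mean, which provides a simple sanity check of the construction.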
To achieve this vast simplification, the likelihood L(d|θ) must be constructed directly so that it may be evaluated without computing the multi-dimensional integrals in Eq. 3. An early attempt at this construction, called the P (D) method [13][14][15], used instrument simulations to numerically estimate the likelihood as a histogram over the number of detected photons. This procedure estimates L(d|θ) by sampling d from the right-hand side of Eq. 3. As this is effectively equivalent to an importance-weighted integration of those multi-dimensional integrals, this method ultimately carries the same computational burden as probabilistic cataloguing.
In Malyshev and Hogg [16], the authors noted that the number of detected photons is a sum of the number of detected photons produced by each source -itself distributed by a Poisson distribution. This allows the likelihood to be constructed using probability generating functions, and provides an analytic formula for L(d|θ) assuming a perfect instrument. To incorporate instrumental effects, a heuristic argument is used to justify a semi-analytic expression for the likelihood in terms of a detector effect correction function, ρ(f ), that is generated through Monte-Carlo simulation.
This method was further extended by Lee et al. [1] to images of multiple pixels by parameterising the source-count function in terms of a spatial template. Known as Non-Poissonian Template Fitting (NPTF), this is currently a leading method for parametric point-source inference in gamma-ray astronomy, and the method has also been applied to the search for astrophysical neutrino point-sources in data from the IceCube telescope [17]. The NPTF was primarily developed to analyse the Galactic Center Excess (GCE) of gamma-rays observed by the Fermi telescope [18][19][20][21][22][23][24][25][26][27][28][29][30][31], and concluded that the observed excess was better described by a population of point sources, in comparison to a dark-matter annihilation origin [1,32] (see also [33]). A similar approach to the NPTF that has also been widely used is the one-point fluctuation analysis or one-point PDF method; see, for example, [35][36][37][38][39][40][41][42]. In the present work we will focus on comparisons between the CPG and the NPTF, but many of our conclusions would be similar if we compared to these alternative approaches.
Recent investigations have suggested that the NPTF results must be interpreted carefully, raising the possibility that the nature of the GCE has not yet been conclusively resolved. Firstly, Leane and Slatyer [2] demonstrated that the NPTF could attribute an injected dark-matter signal to the point-source model, although it appears this concern can be addressed. In particular, an improved treatment of the background models largely resolves the issue [4], and further, as was emphasised in Chang et al. [3], a degree of confusion is unavoidable given the inherent degeneracy between purely Poisson emission and a population of point sources that produce at most one photon each in the data set. Nevertheless, when performing a Bayesian analysis, confusion between point-source and pure Poisson emission can be exacerbated by a poor choice of priors, a point we will return to in Sec. IV. An additional concern regarding the NPTF that has not yet been addressed is that the presence of an unmodelled asymmetry in the data can significantly bias the method, in that the asymmetry leads the NPTF to return strong evidence for a point-source population even when there is none [5,6]. Taken together, the above results emphasise that the output of the NPTF must be interpreted cautiously - indeed, whether the GCE contains the first hints of the particle nature of dark matter remains an open problem that even more recent wavelet [43] and machine-learning [44] approaches have not conclusively resolved.
Nevertheless, as yet, the modelling of the detector effect correction function of NPTF, ρ(f ), has not been questioned, despite the heuristic justification for its original inclusion in the method. In this work we will show that there are instances where the NPTF construction explicitly breaks down as a result of the mathematical construction of ρ(f ), and that this represents an obstacle to extending the method to X-ray data sets. In detail, we present a first principles construction of the parametric likelihood in Eq. 3, which incorporates the correction of detector effects in a statistically justified manner. The result, which we call the Compound Poisson Generator, includes a new detector effect correction function that is an analytic expression of the basic components of instrumental effects: the PSF, effective area, and photon detection probability. The new construction demonstrates that the spatial distribution of the point-source population cannot be disentangled from the detector effect correction function, nor can the PSF be disentangled from the effective area. This shows that the NPTF -which factorises the spatial distribution, PSF, and effective area -cannot describe point-source statistics in general. To show this lack of generality, the NuSTAR X-ray telescope is used as a test case for both the CPG and NPTF. The substantial differences between the detector response in NuSTAR and Fermi will reveal the stated deficiencies in NPTF.
As for the effect of unmodelled asymmetries, these occur when the spatial distribution for an emitter is specified incorrectly. As NPTF and CPG have additional explanatory power above that of a simple Poisson model, they will produce a better fit to the data even if the emission is truly diffuse. This same effect would be observed when using probabilistic cataloguing, as the location of the sources will migrate to explain the deviation from the specified spatial distribution. The CPG likelihood does not address this issue; as it is, in fact, not an issue with the point-source model at all. To see this, consider the following analogy with statistical mechanics. We would be surprised to observe a box where all gas molecules are located in one corner, as this situation is a very unlikely macrostate for a gas. On the other hand, from the perspective of the microstate description of the gas, the gas is in as likely a configuration as any other configuration including those where the system is more thoroughly mixed. Probabilistic cataloguing, NPTF and CPG all specify the likelihood of the data, which is a microstate description of the population. Thus, a configuration where all sources are on one side of the image is as likely as any other, and these methods make no distinction.
The problem here lies in the diffuse model that these population models are compared to. The introduction of an unmodelled asymmetry will have only a small effect on the population likelihood; in comparison, the likelihood for the diffuse model will suffer greatly due to this mismodelling. When the two models are compared, it appears that the population model is erroneously describing sources, but it is the diffuse model that is erroneously rejecting the diffuse hypothesis. To resolve the issue, the spatial distribution must be given additional degrees of freedom so that the diffuse model can account for deviations from the expected spatial distribution. Much like in statistical mechanics, care can be taken by examining the macrostate of the fitted population. The analogous quantities to macrostates in point-source inference are the population parameters. The differential source-count function can be adjusted to include, for example, an N_top and N_bottom for the top and bottom of the image. The ratio of these means can then be computed, and an unlikely value for this ratio will form a diagnostic signal for a mismodelling issue.

II. POINT-SOURCE POPULATION AND INSTRUMENT MODELS
This section describes the point-source population model that will be used in this investigation, as well as the instrument model that is necessary to generate simulated observations. The instrument model is also needed in Sec. III to incorporate a detector correction into the likelihood.

A. Population Model
In the parametric approach, an assumption must be made to select a model that describes the population of sources. Here, we describe the population using a differential source-count function: dN/dF . This function describes the number of sources, N , as a differential over the individual point-source flux, F .
The point-source flux F is defined as a normalisation factor in a power-law flux energy spectrum:

dΦ/dE = F (E/E_0)^{−γ},    (4)

where Φ is the number of photons per unit area and time, E is the photon energy, E_0 is the power-law scale, and γ is the power-law index (also known as the photon index). As a result, the dimensions of F are photons per unit area, time, and energy. Power-law spectra are common for the high-energy tails of X-ray sources [45][46][47] - due to various processes such as synchrotron emission and inverse Compton scattering - and thus make a natural choice for this investigation. A power-law index of γ = 1.5 is used for all scenarios, as spectra with this index are common near the Galactic Center [47], and the power-law scale is set to E_0 = 1 keV. The contribution of each source to the image is determined by converting the flux of the individual source to an expected number of photons based on the response of the detector. For a telescope, this response is commonly specified in terms of the exposure time, t_exp, which is the time the instrument spent collecting the flux of interest, and further, the effective area, A_eff, which is the collecting area of an equivalent idealised telescope that detects all incident photons, so that the effective area is strictly less than the actual size of the real instrument. The mean number of detected photons (also called the mean number of counts), S, is then

S = t_exp A_eff ∫_{E_i}^{E_f} dE dΦ/dE,    (5)

where the number of counts is defined within an energy band of interest, E ∈ [E_i, E_f]. As the only model parameter in this equation is F - through dΦ/dE - this expression can be simplified to S = F κ, where κ is called the detector response. Generally speaking, the exposure time and effective area are position dependent, so the detector response is a function of position in the image, x:

S_x = κ(x) F,  with  κ(x) = t_exp(x) A_eff(x) ∫_{E_i}^{E_f} dE (E/E_0)^{−γ}.    (6)

Equations 4 and 5 are not necessary choices for the methods employed in this investigation.
All that is required is that the number of counts, S_x, received at location x can be specified in terms of the individual source model parameter F and another quantity κ(x). The value of κ(x) need not even be based on an assumed energy spectrum; it must only be known and calculable for any location x and satisfy the relation S_x = κ(x)F.
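The flux-to-counts conversion S = κF can be sketched directly. For a power-law spectrum with γ ≠ 1 the band integral has a closed form; the band edges, exposure time, effective area, and flux below are illustrative values of our choosing, and the effective area is taken to be energy independent, which is a simplification.

```python
def kappa(t_exp, a_eff, e_lo, e_hi, gamma=1.5, e0=1.0):
    """Detector response kappa, so that the mean source counts are
    S = kappa * F. Uses the closed form of int_{e_lo}^{e_hi} (E/e0)^-gamma dE,
    valid for gamma != 1. A_eff is taken energy independent here
    (an illustrative simplification)."""
    integral = e0 ** gamma * (e_hi ** (1.0 - gamma) - e_lo ** (1.0 - gamma)) / (1.0 - gamma)
    return t_exp * a_eff * integral

# Mean counts for an illustrative exposure (s), effective area (cm^2),
# band (keV), and flux F (photons / cm^2 / s / keV).
S = kappa(t_exp=1.0e5, a_eff=400.0, e_lo=3.0, e_hi=10.0) * 5.0e-9
```

A position-dependent κ(x) follows by evaluating the same expression with t_exp(x) and A_eff(x) on a per-bin basis.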
As reviewed above, for a single source the key parameter dictating how many photons we expect to observe is the flux, F. When studying a population of sources, these fluxes can vary between the individual sources. The distribution of fluxes is described with a differential source-count function, dN/dF, which encodes the number of sources with flux between F and F + dF. For the investigations in this article, the differential source-count function is assumed to be a broken power law. A singly-broken power law has the following form,

dN/dF = A (F/F_b(2))^{−n_1} for F ≥ F_b(2),  and  A (F/F_b(2))^{−n_2} for F < F_b(2),    (7)

where A is a normalisation factor, F_b(2) is the location of the break in flux, n_1 is the power index above the break, and n_2 is the index below the break. The generalisation of this distribution to multiple breaks is discussed in Sec. IV. The ultimate goal is then to infer the dN/dF model parameters, denoted in aggregate by the vector θ, from the statistics of the number of counted photons received by the telescope or detector. To be explicit, for the parameterisation given in Eq. 7, θ = {A, F_b(2), n_1, n_2}. Generally, the differential source-count function can be posed in terms of a distribution, p(F), which gives the probability density of an individual source having flux F, through the relation

dN/dF = N p(F),    (8)

where N is the mean number of sources. This representation is more useful when describing the statistical process underlying point sources.
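Drawing individual source fluxes from the p(F) implied by a singly broken power law can be done with piecewise inverse-CDF sampling. The sketch below is ours: it requires n_1 > 1 for an integrable bright tail and introduces a low-flux cutoff `f_min`, which Eq. 7 itself does not specify; the parameter values in the example are illustrative.

```python
import numpy as np

def sample_broken_power_law(n, f_b, n1, n2, f_min, rng):
    """Draw n fluxes from the normalised p(F) implied by a singly
    broken power law (Eq. 7), via piecewise inverse-CDF sampling.
    Requires n1 > 1 (integrable tail), n2 != 1, and a low-flux
    cutoff f_min (our addition)."""
    assert n1 > 1.0 and n2 != 1.0 and 0.0 < f_min < f_b
    a = (f_min / f_b) ** (1.0 - n2)
    mass_lo = f_b / (1.0 - n2) * (1.0 - a)   # integral of (F/f_b)^-n2 below break
    mass_hi = f_b / (n1 - 1.0)               # integral of (F/f_b)^-n1 above break
    p_lo = mass_lo / (mass_lo + mass_hi)     # probability of F < f_b
    u = rng.random(n)
    lower = rng.random(n) < p_lo
    out = np.empty(n)
    out[lower] = f_b * (a + u[lower] * (1.0 - a)) ** (1.0 / (1.0 - n2))
    out[~lower] = f_b * (1.0 - u[~lower]) ** (-1.0 / (n1 - 1.0))
    return out, p_lo

rng = np.random.default_rng(0)
fluxes, p_lo = sample_broken_power_law(200_000, 1.0, 2.5, 0.5, 0.01, rng)
```

The fraction of draws below the break converges to the analytic segment mass, which provides a quick check that the inversion is correct.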
Another common representation is the cumulative source-count function (also called simply the source-count function):

N(>F) = ∫_F^∞ dF′ dN/dF′,    (9)

which specifies the number of sources in the population with a flux greater than F. If the population is large in spatial extent, this may be written as the areal source-density function, n(>F), often in numbers of sources per steradian or arc-minute squared. Power-law-type source-count functions are common for astrophysical populations generally, and for X-ray emitter populations specifically [46,47]. This is not unexpected, as the inverse-square reduction in apparent brightness with distance will give most populations a power-law-like distribution in observed brightness. More generally, power-law distributions are common in nature (see, for example, the Pareto distribution) and are the maximum entropy distributions for a logarithmic parameter with a specified mean.
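For the singly broken power law of Eq. 7, the cumulative source-count function has a closed form obtained by integrating each segment; the function below is our own derivation (valid for n_1 > 1 and n_2 ≠ 1), with illustrative parameter values in the checks.

```python
def n_greater(f, a, f_b, n1, n2):
    """Cumulative source-count function N(>F) for the singly broken
    power law of Eq. 7, in closed form. Requires n1 > 1 so the bright
    tail integral converges, and n2 != 1."""
    if f >= f_b:
        # Integral of a * (F/f_b)^-n1 from f to infinity.
        return a * f_b / (n1 - 1.0) * (f / f_b) ** (1.0 - n1)
    # Integral of a * (F/f_b)^-n2 from f up to the break,
    # plus the full contribution above the break.
    below = a * f_b / (1.0 - n2) * (1.0 - (f / f_b) ** (1.0 - n2))
    return below + a * f_b / (n1 - 1.0)
```

The function is continuous at the break and monotonically decreasing in F, as a cumulative count must be.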

B. X-ray Instrumentation
Although the P (D) method has been used extensively to study X-ray sources [14], the current leading parametric inference method, NPTF, has yet to be applied to this regime. X-ray astronomy poses a number of novel complications -not present in gamma-ray and neutrino astronomy -to the application of parametric point-source inference. Overcoming these challenges is one of the central aims of this work.
The NuSTAR telescope [48][49][50][51], in particular, possesses most of these complications; for this reason, NuSTAR is used as the detector model to explore the effect that these complications have on parametric point-source inference. As compared to gamma-ray and neutrino astronomy, the NuSTAR X-ray data set presents the following unique challenges:
• NuSTAR, like most X-ray telescopes, has a far narrower field of view (FOV) of 12 arcmin per side, with hard edges due to the use of a detector plane with focusing optics. Fermi and IceCube have a FOV of approximately 40 degrees and the full sky, respectively;
• Although NuSTAR has a much finer angular resolution, the NuSTAR PSF is a larger fraction of the FOV compared to Fermi and IceCube;
• The NuSTAR PSF varies significantly as a function of position on the detector plane within a given observation; and
• To compensate for the narrow FOV, multiple observations are often compiled together into a mosaic. This creates a complex and discontinuous detector response, with multiple overlapping PSFs for each observation. While the instrument responses of both Fermi and particularly IceCube do vary across the sky, the variation is significantly smoother than is common for X-ray data sets.
In order to study all of these effects, we have developed a detector simulation suite for NuSTAR. This simulation injects point sources into a binned map that forms an image. In this investigation, the bin sizes are much larger than the pixelisation of the NuSTAR X-ray detectors, and so the effect of this intrinsic pixelisation on the image binning is not considered here. The NuSTAR effective area - including vignetting effects - and PSF are incorporated into the simulation, and can be individually altered or simplified to assess the effect of each on the performance of the point-source methods investigated here. The simulation also requires a spatial distribution for the point-source population, as well as a dN/dF function as defined by Eq. 8. Further details on NuSTAR and the simulation procedure are given in App. A. An example image generated by the simulation is shown in Fig. 3. This image demonstrates many of the complications arising from the NuSTAR telescope. Multiple observations are stacked into a mosaic, and the boundaries between the observations, shown by grey dashed lines, cause discontinuities in the images of nearby point sources. In addition, the anisotropic PSF of NuSTAR causes complicated PSF shapes for sources near these boundaries, as shown in the red inset. The white dotted lines within the inset highlight one particular source that lies on the boundary of an observation. Two observations contribute to the image of this source, and so two PSF shapes (along the dotted lines) combine to create a highly irregular PSF shape.
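The core injection step of such a simulation can be sketched in a highly simplified form. This toy stand-in is not the NuSTAR suite: it uses an isotropic Gaussian PSF of fixed width, whereas the real NuSTAR PSF is anisotropic and position dependent, and the flat `kappa_map` and source parameters are illustrative.

```python
import numpy as np

def simulate_image(fluxes, positions, kappa_map, psf_sigma, rng):
    """Toy image simulator: each source's mean counts S = kappa(x) * F
    are spread over the map with an isotropic Gaussian PSF (normalised
    within the image), and the binned map is then Poisson fluctuated.
    psf_sigma is an illustrative simplification of the real PSF."""
    ny, nx = kappa_map.shape
    yy, xx = np.mgrid[0:ny, 0:nx]
    mu = np.zeros((ny, nx))
    for (x0, y0), f in zip(positions, fluxes):
        s = kappa_map[int(y0), int(x0)] * f          # mean counts for this source
        psf = np.exp(-((xx - x0) ** 2 + (yy - y0) ** 2) / (2.0 * psf_sigma ** 2))
        mu += s * psf / psf.sum()                    # normalised PSF template
    return rng.poisson(mu), mu

rng = np.random.default_rng(1)
kappa_map = np.full((32, 32), 100.0)                 # flat detector response (illustrative)
img, mu = simulate_image([1.0, 2.0], [(8.0, 8.0), (20.0, 20.0)], kappa_map, 1.5, rng)
```

Position-dependent PSFs, vignetting, and mosaicked observations would enter by making `psf` and `kappa_map` functions of position, as in the full suite described above.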
The X-ray data will contain emission from contributions other than point sources. In particular, we expect both detector backgrounds and smooth astrophysical emission - associated with, for instance, the cosmic X-ray background - to populate the collected data. We collectively refer to these non-point-source flux contributions as diffuse emission. The total diffuse emission will have an associated spatial map which specifies the mean expected flux at each location, which we can then combine with the detector response - as we did for the point-source flux F - to determine the mean predicted diffuse counts at each location. Simulation of these contributions is then performed by generating a map that is a Poisson draw about the mean value of the diffuse map at each location.
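The diffuse simulation step reduces to a per-bin Poisson draw about the mean counts map; a minimal sketch, with flat illustrative values for the diffuse flux and detector response:

```python
import numpy as np

rng = np.random.default_rng(2)

# Mean expected diffuse counts per bin: spatial flux map times the
# detector response kappa(x). Both maps are flat here for illustration.
diffuse_flux_map = np.full((64, 64), 5.0e-8)
kappa_map = np.full((64, 64), 2.0e7)
mean_counts = diffuse_flux_map * kappa_map

# One simulated realisation of the diffuse emission.
diffuse_image = rng.poisson(mean_counts)
```

In a realistic run the two maps would vary across the image, and the diffuse realisation would simply be added to the point-source counts map before analysis.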

III. DERIVATION OF THE CPG LIKELIHOOD
We now turn to the central goal of this paper, the construction of the CPG likelihood. The construction of the likelihood will heavily involve the use of probability generating functions and functionals, as we will exploit their useful properties for constructing compound distributions. Generating functions are an alternative representation of a discrete probability distribution over the non-negative integers. Suppose we have a distribution P(k), which determines the probability of observing an integer k; the generating function is then defined by the Z-transform of the distribution:

$$G(z) \equiv \sum_{k=0}^{\infty} P(k)\, z^k.$$

For example, the generating function for the Poisson distribution with mean λ is

$$G_P(z) = e^{\lambda(z-1)}.$$

The generating function approach will be convenient for a number of reasons; however, a central property that we will exploit is as follows. Consider forming a sum of M independent and identically distributed random variates, each of which has a generating function G_1(z), but with M itself an independent random variate with its own generating function G_2. Then the generating function for the sum is given by G_2(G_1(z)), which follows directly from the expectation-value definition of the generating function and the law of total expectation. For the problem at hand, this will arise in the context of counting the total number of counts detected from a population of point sources, for which the number of sources is itself a random variable.
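The composition property described above can be checked numerically. The sketch below, with assumed toy values for the means, draws a Poisson number M of sources, gives each a Poisson number of counts, and compares the empirical expectation E[z^S] against G_2(G_1(z)):

```python
import numpy as np

rng = np.random.default_rng(0)
nu, lam = 3.0, 1.5  # assumed toy values: mean number of sources, mean counts per source

# Compound draw: M ~ Poisson(nu) sources, each contributing Poisson(lam) counts.
# Conditioned on M, the sum of M iid Poisson(lam) variates is Poisson(lam * M).
M = rng.poisson(nu, size=100_000)
S = rng.poisson(lam * M)

# Generating functions: G1 for one source's counts, G2 for the number of sources
G1 = lambda z: np.exp(lam * (z - 1.0))
G2 = lambda z: np.exp(nu * (z - 1.0))

z = 0.7
empirical = np.mean(z ** S)  # Monte-Carlo estimate of E[z^S]
analytic = G2(G1(z))         # composition property G2(G1(z))
print(empirical, analytic)   # the two agree to Monte-Carlo precision
```

The same structure, with the single-source generating function made position dependent, is what the remainder of this section builds up.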
We will build up the full CPG likelihood over the course of several steps, where each part describes the distributions and generating functions for a major concept in the construction. Specifically, we divide the discussion as follows. The goal of this section is to build an intuition for the moving parts of this construction. A more direct, but abstract, construction is provided in App. B, along with further discussion of the properties of generating functions and functionals. In that appendix we also show the unbinned likelihood and the likelihood that accounts for the correlations point sources induce between neighbouring bins, together with a discussion of why these correlations must be ignored for computational reasons in these demonstrations.

A. Single Source Generating Function
To begin with, consider the emission from a single source located at a position x. The number of detected photons from a time-integrated observation of that source follows a Poisson distribution. This is a result of the exponential nature of the inter-detection time distribution, and the statistical independence of multiple detections conditioned on the mean number of detected photons, S_B. Here the index B denotes the spatial bin in which the counts are detected. Continuing, let P_P(s_B|S_B) be the probability mass function for the Poisson distribution describing s_B detected photons with mean S_B. Then given the true spatial location of the source, x, the probability mass function for s_B at this location is

$$P(s_B|x) = \int dS_B\, P_P(s_B|S_B)\, p(S_B|x). \quad (12)$$

Here we introduced p(S_B|x), which is the probability distribution for the mean, S_B, for a given source location of x, i.e. given a point source at x, it is the probability that it produces a mean number of counts S_B in the bin B. As the exact value of S_B is unknown, we marginalise over it in the above expression. The exact configuration is shown in Fig. 4a, where a source at location x contributes s_B detected photons via the distribution P(s_B|x) to bin B, which has a spatial size Ω_B. We can move from the expected number of counts from a source, S_B, to the physical flux F, by combining a conditional distribution p(S_B|F, x) and a distribution over source flux, p(F), as follows,

$$p(S_B|x) = \int dF\, p(S_B|F, x)\, p(F). \quad (13)$$

This construction explicitly assumes that the flux distribution is isotropic, such that p(F|x) = p(F), an assumption which holds when the sources are themselves identically distributed, a key property that will be needed in the next section and that we will assume throughout. The flux distribution itself has already been defined: it is given by the differential source-count function, dN/dF, in Eq. 8.
The conditional distribution p(S_B|F, x), however, will depend centrally on the detector effect correction (as discussed already, the conversion from flux to counts intimately depends on the detector response), and it will determine the amount of flux F that contributes to the bin B. For the moment we will leave it unspecified, postponing a definition until Sec. III C. Combining Eq. 12 and 13, we have

$$P(s_B|x) = \int dS_B \int dF\, P_P(s_B|S_B)\, p(S_B|F, x)\, p(F), \quad (14)$$

which fully specifies the probability of observing s_B counts from a single point source in terms of the differential source-count function, in p(F), the instrumental response, in p(S_B|F, x), and of course the Poisson distribution, P_P(s_B|S_B). From this we can immediately determine the generating function for a single source located at a given position x, G_{s_B|x}(z), which follows as the generating function for the Poisson distribution marginalised over p(S_B|x):

$$G_{s_B|x}(z) = \int dS_B\, p(S_B|x)\, e^{S_B(z-1)}.$$

Figure 4: (a) A single source located at x contributes s_B counts to bin B as defined by the bin extents Ω_B (solid square). The number of counts is determined by the Poisson distribution P(s_B|x) as defined in the text. (b) Sources in surrounding bins i contribute {s^{(j)}_{i→B}} counts to bin B, creating a total of k_B counts in the bin. (c) A Poisson process describes sources spatially distributed according to T(x) (sketched by shaded region). Each source may contribute multiple photons to the bin, which are recorded as counts at locations {y_j}. (d) A single photon created by a source at x_1 is converted to an expected number of counts using the following detector effects: the detector response (effective area and exposure time), κ(x_1), that is evaluated at the source location; the PSF distribution φ(y_2|x_1), evaluated at the location where the photon lands, y_2, and conditioned on the source location; and the detector efficiency, η(y_2), that is evaluated where the photon lands.

B. Multiple Source Generating Function
We next extend the discussion to account for the emission detected from a population of sources. Statistically, the locations of these point sources follow a spatial Poisson point process. This follows from the observations that:

• The number of sources can be any non-negative integer.

• The occurrence of each source is independent of the other sources. The presence of a point source does not have an effect on the probability of a new source forming.

• Two sources never occupy the same spatial location: an infinitesimal area on the sky has only zero or one sources in it.
(In reality, there will be a physical upper bound to the number of sources one can find in most populations. Nevertheless, the number is typically sufficiently large that an unbounded process is an excellent approximation.)

Spatial clustering of sources may violate the last two assumptions. If the clustering is known in advance (e.g., it is known that the sources cluster around the Galactic Center), then the construction presented here will render the source locations independent by conditioning on the spatial distribution of the sources. Some clustering processes cannot be made conditionally independent in this way; e.g., if the presence of one source increases the probability of further sources forming nearby. In this case, the construction presented here does not apply for counting sources. Instead, the method can count entire clusters as single sources, and the flux distribution, p(F), would then describe the flux of clusters. The size of this effect will depend on the degree to which the spatial distribution assumption is violated by the clustering. As an explicit example, the presence of binary emitters would violate these assumptions, and the CPG will count binary systems, rather than sources. (For this caveat to be relevant, both objects must be emitters that can be detected by the instrument; if only one object in a binary is an emitter, the presence of the other is irrelevant to these assumptions.)

To proceed, let T(x) denote the spatial distribution for the location of a source (or system) in the population. In terms of T(x), often referred to as the spatial template of the point sources, the source intensity function is N T(x), so that the integral of the source intensity function is equal to the mean number of sources, N. In more detail,

$$\int dx\, N\, T(x) = N, \qquad \text{i.e.} \quad \int dx\, T(x) = 1.$$

(We leave the region over which T(x) is normalised unspecified. In general it need not correspond to the region within which the analysis is being performed. For example, when studying an extragalactic source-class, it may be convenient to normalise T(x) over the full sky, even if this is larger than the region over which the data is collected.) Now, let the template

$$T_B = \int_{\Omega_B} dx\, T(x)$$

be the fraction of the spatial distribution in bin B with bin extents Ω_B. We will use this to construct a generating function that accounts for sources located in surrounding bins which contribute flux to the current bin. We define k_{i→B} as the number of counts that all sources located in bin i contribute to bin B. As there can be more than one source in bin i, k_{i→B} is a sum of M_i random variates, s^{(j)}_{i→B}:

$$k_{i\to B} = \sum_{j=1}^{M_i} s^{(j)}_{i\to B},$$

as shown in Fig. 4b. Now, note that the spatial distribution for a source, when conditioned on it being located in bin i, is

$$T(x|i) = T(x)/T_i \quad \text{for } x \in \Omega_i,$$

and zero otherwise. Then, each s^{(j)}_{i→B} is identically distributed according to the generating function G_{s_{i→B}}. This generating function is formed by taking the expectation of the generating function G_{s_{i→B}|x} over T(x|i), the spatial distribution conditioned on the source being located in bin i:

$$G_{s_{i\to B}}(z) = \int_{\Omega_i} dx\, T(x|i)\, G_{s_{i\to B}|x}(z) = \int dF\, p(F) \int dS_B\, p_{i\to B}(S_B|F)\, e^{S_B(z-1)}, \quad (21)$$

where

$$p_{i\to B}(S_B|F) = \int_{\Omega_i} dx\, T(x|i)\, p(S_B|F, x)$$

is now the instrument response for bin i contributing to bin B. This object can be intuitively thought of as the average response in bin B, from sources located in bin i, marginalised over the probability of a source appearing at any specific location within bin i.
As emphasised already, the number of sources, M_i, is itself a random variate distributed according to a Poisson distribution with mean N T_i, recalling that N is the mean number of sources for the total population. Thus, the generating function for M_i is

$$G_{M_i}(z) = e^{N T_i (z-1)}.$$

Now, the generating function for the sum k_{i→B} is the composition of the generating functions for M_i and s^{(j)}_{i→B}:

$$G_{k_{i\to B}}(z) = G_{M_i}\big(G_{s_{i\to B}}(z)\big) = \exp\big[N T_i \big(G_{s_{i\to B}}(z) - 1\big)\big].$$

The total number of counts in bin B, denoted by k_B, is then the sum of all the k_{i→B} across all bins:

$$k_B = \sum_i k_{i\to B}.$$

Note that this sum must range over the full domain of the template, T_i. Primarily, this ensures that the recovered value of N is correctly normalised to the template, but it can also be understood to ensure that the contribution from any source, no matter where it may be in the spatial distribution of the population, is counted correctly, as the PSF generally allows contributions to k_B from anywhere in the spatial distribution.
Thus, the generating function for k_B is the product of the generating functions for k_{i→B}:

$$G_{k_B}(z) = \prod_i G_{k_{i\to B}}(z) = \exp\Big[\sum_i N T_i \big(G_{s_{i\to B}}(z) - 1\big)\Big]. \quad (26)$$

At this stage, we could substitute into this expression G_{s_{i→B}}(z) as given in Eq. 21. Before doing so, however, let us capture into a single function the detector response as it would appear in Eq. 26,

$$p(S_B|F) = \sum_i T_i\, p_{i\to B}(S_B|F) = \int dx\, T(x)\, p(S_B|F, x).$$

Using this, we can write the full generating function for k_B as

$$G_{k_B}(z) = \exp\Big[N \int dF\, p(F) \Big(\int dS_B\, p(S_B|F)\, e^{S_B(z-1)} - 1\Big)\Big]. \quad (29)$$

The distribution for k_B is known as a compound Poisson distribution [52], and so G_{k_B} is a compound Poisson generator. As shown in App. B, this compound Poisson distribution actually describes a compound Poisson point process. The process gives the probability density of finding a detected count at location y as shown in Fig. 4c. Eq. 29 generates the probability of finding k_B such detected counts, {y_1, . . . , y_{k_B}}, anywhere in Ω_B.
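The compound Poisson structure of Eq. 29 can be verified in a one-dimensional toy model. In the sketch below (all values assumed for illustration), sources are distributed uniformly, a Gaussian PSF spreads their counts into a single bin, and the Monte-Carlo estimate of E[z^{k_B}] is compared against the generating-function prediction exp[N ∫ dx T(x) (G_{s_B|x}(z) − 1)]:

```python
import numpy as np
from math import erf

rng = np.random.default_rng(1)

# Assumed toy setup: sources uniform on [0, 1] (T(x) = 1), Gaussian PSF of
# width sigma, detector bin B = [0.4, 0.6], perfect efficiency.
N, kappaF, sigma = 5.0, 2.0, 0.05  # mean sources, mean counts per source, PSF width
lo, hi = 0.4, 0.6

Phi = lambda t: 0.5 * (1.0 + erf(t / np.sqrt(2.0)))  # standard normal CDF

def S(x):
    """Mean counts a source at x deposits in bin B (Gaussian PSF mass in the bin)."""
    return kappaF * (Phi((hi - x) / sigma) - Phi((lo - x) / sigma))

# Monte-Carlo draws of k_B: Poisson number of sources, Poisson counts per source
def draw_kB(n_images):
    out = np.empty(n_images)
    for i in range(n_images):
        M = rng.poisson(N)
        xs = rng.uniform(0.0, 1.0, size=M)
        out[i] = rng.poisson([S(x) for x in xs]).sum() if M else 0.0
    return out

k = draw_kB(50_000)

# Generating-function prediction: exp[N * int_0^1 dx (exp[S(x)(z-1)] - 1)]
z = 0.8
grid = np.linspace(0.0, 1.0, 2001)
exponent = N * np.mean(np.exp(np.array([S(x) for x in grid]) * (z - 1.0)) - 1.0)
empirical, analytic = np.mean(z ** k), np.exp(exponent)
print(empirical, analytic)
```

Sources outside the bin still contribute through the PSF tails, which is exactly why the sum in Eq. 26 must range over the full template domain.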

C. Detector Effect Correction
In the discussion thus far, we have determined the generating function associated with a population of sources, specified by a differential source-count function. The final aspect of Eq. 29 we have avoided confronting is the instrument response, which we turn to now. Recall, we codified the instrumental effects as follows,

$$p(S_B|F) = \int dx\, T(x)\, p(S_B|F, x), \quad (30)$$

where, as defined above, T(x) is the spatial point-source template, and p(S_B|F, x) accounts for how the instrument converts a flux F from a source at location x into an expected number of counts in bin B. In general, the instrument response and the spatial template cannot be fully factorised, as they are in the NPTF approach to the problem. We will see this explicitly in the discussion that follows.
Our treatment will consider four separate detector effects: the exposure time, which converts flux to time-integrated flux; the effective area, which converts time-integrated flux to expected photons incident on the detector; the PSF, which gives the probability density for the deviation of a photon's recorded direction of arrival from its true incident direction; and the detector efficiency, which gives the probability of a single incident photon being detected. Both the exposure time and effective area are merged into a single detector response value, κ(x), which converts flux to mean counts for a point source at location x, and was discussed in Sec. II. The PSF is a probability density, φ(y|x), for a count detected at location y conditioned on its parent point source at location x. The detector efficiency, η(y), is the probability of a count being detected, conditioned on the location of detection y. The geometry of each function is shown in Fig. 4d, and examples of how these quantities combine for mosaicked images are shown in Fig. 5.
To give a concrete example of these terms in the context of NuSTAR (Fermi, IceCube): a source is located in the direction of x and emits a primary X-ray (γ-ray, neutrino) from the direction of x. This primary interacts with the optics (detector, ice) and produces a secondary X-ray (electron, lepton) that is scattered in the direction of y according to the PSF φ(y|x). This secondary interacts with the active detector, causing a count to be recorded with probability η(y).

Figure 5: Two separate observations with dashed boundaries form a mosaic. a) In areas where the observations overlap, the PSF will be an appropriate mixture of the PSF for each observation. Near the edge of the observation, we show an example of how the PSF can be highly distorted in the radial direction, as expected for NuSTAR observations. The mixture forms a cross with the PSF from the left observation lying diagonally NW-SE, and the PSF from the right observation lying diagonally SW-NE. b) The detector response, κ(x), is a sum of the response for each observation, and varies smoothly across the mosaic as the response extends outside the observation boundaries. c) In contrast, the detector efficiency, η(y), has a sharp transition from unity to zero at the observation boundary, and will generally be non-smooth across any bin that straddles these boundaries.
In terms of these individual detector responses, the expected number of incident photons produced by a point source at location x with flux F is S_{x,F} = κ(x)F, and then the mean detected photon density at location y will be S_{x,F}(y) = η(y) φ(y|x) S_{x,F}. As a result, the mean number of detected photons in bin B is

$$S_B = \kappa(x)\, F \int_{\Omega_B} dy\, \eta(y)\, \phi(y|x),$$

and the associated distribution, p(S_B|F, x), must be

$$p(S_B|F, x) = \delta\Big(S_B - \kappa(x)\, F \int_{\Omega_B} dy\, \eta(y)\, \phi(y|x)\Big),$$

which then determines the marginal form as given in Eq. 30. Direct substitution of this result into Eq. 29 is not immediately helpful, as it results in a double integral over spatial coordinates that must be evaluated during the computation of the likelihood. Instead, all of the detector effects can be encoded into a single measure, which we denote µ_B(ε), over an effective detector response variable, ε, such that S_B = εF. In particular, we define

$$\mu_B(\varepsilon) = \int dx\, T(x)\, \delta\Big(\varepsilon - \kappa(x) \int_{\Omega_B} dy\, \eta(y)\, \phi(y|x)\Big). \quad (33)$$

In terms of the new measure, we can restate Eq. 30 as

$$p(S_B|F) = \int d\varepsilon\, \mu_B(\varepsilon)\, \delta(S_B - \varepsilon F).$$

Then, substitution of this form of p(S_B|F) into Eq. 29 reframes the CPG as

$$G_{k_B}(z) = \exp\Big[N \int d\varepsilon\, \mu_B(\varepsilon) \int dF\, p(F)\, \big(e^{\varepsilon F (z-1)} - 1\big)\Big], \quad (35)$$

which only involves integrals over the effective detector response ε and source flux F. Importantly, the detector correction function, µ_B(ε), can be pre-calculated, thus saving considerable computation time during likelihood evaluation. As it is rare to have closed-form expressions for κ(x) and φ(y|x) in most experiments, µ_B(ε) will almost always need to be numerically estimated. This estimation can proceed by effectively evaluating the integrals over x and y in Eq. 33 through Monte-Carlo integration. Samples are drawn from T(x), and then, conditioned on these values of x, samples are drawn from φ(y|x). The values of y are accumulated to create a value of ε, and the resulting ε values are histogrammed to create a density estimate over ε, which is µ_B(ε). This process, and an explicit algorithm for constructing µ_B(ε), is detailed in App. E. As emphasised at the outset, the detector response involves the source template intimately.
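A minimal sketch of this Monte-Carlo construction of µ_B(ε) is given below, in one dimension with assumed stand-ins for κ, η, φ, and T, and with the PSF mass in the bin evaluated analytically rather than accumulated from individual photon draws (this is not the App. E algorithm, only an illustration of the histogram estimate):

```python
import numpy as np
from math import erf

rng = np.random.default_rng(2)

# Hypothetical 1-D stand-ins for the detector functions (all assumed):
kappa = lambda x: 1.0 + 0.5 * x  # effective area x exposure, varying across the image
lo, hi, sigma = 0.4, 0.6, 0.05   # bin B extents and Gaussian PSF width (eta = 1)

Phi = lambda t: 0.5 * (1.0 + erf(t / np.sqrt(2.0)))  # standard normal CDF

def epsilon(x):
    """Effective response: kappa(x) times the PSF mass landing in bin B."""
    return kappa(x) * (Phi((hi - x) / sigma) - Phi((lo - x) / sigma))

# Monte-Carlo construction of mu_B(eps): sample x ~ T(x) (uniform template here),
# evaluate eps for each draw, and histogram the results into a density estimate
xs = rng.uniform(0.0, 1.0, size=100_000)
eps = np.array([epsilon(x) for x in xs])
density, edges = np.histogram(eps, bins=100, range=(0.0, eps.max()), density=True)

# The first moment of mu_B(eps) gives the mean counts in B per unit source flux
print(eps.mean())
```

Because the template T(x) is sampled directly, the resulting µ_B(ε) inherits the template dependence of the response, which is the point made in the text.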
This is distinct from the handling of the detector response in NPTF, which occurs through the function ρ(f), and we will explore the differences between these two approaches in Sec. V.

D. Calculation of the Single-Pixel Likelihood
Once µ_B(ε) has been constructed, evaluation of the CPG in Eq. 35 requires the integrals over ε and F to be performed. As µ_B(ε) is numerically constructed, the integral over ε will also be performed numerically. The remaining integral over F may then be performed numerically, or analytically if the assumed form of p(F) is amenable to such treatment. For the examples in this investigation, p(F) is assumed to follow a broken power-law distribution, in which case the evaluation can be performed analytically, as detailed in App. C. In this section, the only assumption required is that this evaluation produces a power series in z:

$$G_{k_B}(z) = \exp\Big[\sum_{m=0}^{\infty} a_B^{(m)}\, z^m\Big]. \quad (36)$$

This is a fairly mild assumption, as it only requires that the expression within the square brackets of Eq. 35 be an analytic function of z. This is easily satisfied when the moments of p(F) and µ_B(ε) are finite. For now, this power series will be assumed to be infinite in order; later, it will be shown that only a finite order is required in practice.
The goal is to find P(k_B), the probability of measuring k_B detected photons, from this generating function. Recall that a generating function is defined as

$$G(z) = \sum_{k=0}^{\infty} P(k)\, z^k. \quad (37)$$

The relationship between the power series in Eq. 36 and the power series in Eq. 37 is given by the complete Bell polynomials, B_k, which satisfy

$$\exp\Big[\sum_{m=1}^{\infty} x_m \frac{t^m}{m!}\Big] = \sum_{k=0}^{\infty} B_k(x_1, \ldots, x_k)\, \frac{t^k}{k!}. \quad (38)$$

Thus, by inspection we have in this case

$$P(k_B) = e^{a_B^{(0)}}\, B_{k_B}\big(1!\, a_B^{(1)}, 2!\, a_B^{(2)}, \ldots, k_B!\, a_B^{(k_B)}\big)\big/k_B!,$$

and note that, from Eq. 37, for a bin with k counts, only the first k terms of the power series in Eq. 36 need to be calculated. The evaluation of the Bell polynomials can be performed using recurrence relations, as we demonstrate in App. D.
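In practice, the Bell-polynomial evaluation reduces to a simple recurrence on the series coefficients. The sketch below is not the App. D implementation, but the standard recurrence for exponentiating a power series, P(k) = (1/k) Σ_{m=1}^{k} m a^{(m)} P(k − m); it is checked here against a plain Poisson distribution, for which a^{(0)} = −λ and a^{(1)} = λ:

```python
import numpy as np
from math import exp, factorial

def probs_from_log_series(a, kmax):
    """Given G(z) = exp(sum_m a[m] z^m), return P(0..kmax) via the standard
    recurrence P(k) = (1/k) * sum_{m=1}^{k} m * a[m] * P(k - m)."""
    P = np.zeros(kmax + 1)
    P[0] = exp(a[0])
    for k in range(1, kmax + 1):
        P[k] = sum(m * a[m] * P[k - m] for m in range(1, k + 1)) / k
    return P

# Check against a plain Poisson distribution, where a0 = -lam and a1 = lam
lam = 2.3
a = np.zeros(11)
a[0], a[1] = -lam, lam
P = probs_from_log_series(a, 10)
expected = [exp(-lam) * lam**k / factorial(k) for k in range(11)]
print(np.max(np.abs(P - expected)))
```

As the text notes, a bin with k counts only needs the coefficients up to order k, which the recurrence respects by construction.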

E. Whole Image Likelihood
Through Eq. 38 and the results preceding it, we have achieved our aim of writing the single-bin likelihood, P(k_B|θ), where θ are the model parameters for the population, as encoded in dN/dF. A likelihood for the whole image, I = {k_B : ∀B}, can be constructed as a simple product over the bins,

$$P(I|\theta) = \prod_B P(k_B|\theta). \quad (39)$$

Importantly though, this construction does not take into account the correlations between the bins, which may be induced by the PSF. Indeed, it assumes that the k_B are statistically independent, which will not be true unless the PSF is a delta-function distribution. Such a delta-distribution ensures that sources at the edge of a pixel only deposit flux in a single bin. The effect of this broken assumption is that the resulting posterior distribution on θ will be narrower than the true posterior obtained if these correlations were accounted for: by treating every pixel as independent, we have assumed there is more available information than is actually present in the image. This will underestimate the uncertainty on the model parameters, as sources which overlap bins are effectively counted as multiple independent observations of the same source, instead of the true single observation, a similar effect to double counting data. The degree to which the posterior is narrowed will depend on the chosen bin size, as smaller bins, relative to the PSF, will be more strongly correlated. The test cases in Sec. V show that this is not a significant effect for the bin sizes chosen in this investigation, which are several times larger than the PSF.
In principle there are other ways correlations between pixels can be induced that would invalidate the factorization in Eq. 39 -for instance an incorrect template for one of the Poisson models, or perhaps other instrumental effects. The reason we single out the PSF is that it is never truly a delta-function distribution in any real instrument, and this represents the most significant correlation that can only be mitigated by appropriate choice of bin sizes.
While this is an unambiguous deficiency of the above derivation, so far no computationally feasible method has been proposed to account for the correlations in a binned analysis. Indeed, the most obvious extensions of the above construction require the computation of every possible combination of how a source can distribute its counts to all bins in the image. As stated, in the present work we will work with a binning such that this shortcoming is suppressed, but we caution that for a general binning the biases associated with this effect must be considered.

F. Multiple Emission Models
It is a common occurrence for an image to have contributions both from populations of point sources and from purely Poisson emission or background components. As such, it is important to accommodate this reality in the likelihood, as we do in this subsection.
Let k_{B,j} be the number of counts that population j contributes to bin B, and ℓ_{B,l} be the number of counts that Poisson component l contributes to the same bin. Then the total number of counts in this bin is simply a sum over the contribution from each component,

$$K_B = \sum_j k_{B,j} + \sum_l \ell_{B,l}.$$

In turn, the generating function for the combined emission, K_B, is

$$G_{K_B}(z) = \prod_j G_{k_{B,j}}(z) \prod_l G_{\ell_{B,l}}(z).$$

For point-source populations, the generating function is as derived earlier in this section, whereas G_{ℓ_{B,l}}(z) = exp[λ_{B,l}(z − 1)] is the generating function for ℓ_{B,l}, and

$$\lambda_{B,l} = \int_{\Omega_B} dy\, I_l(y)$$

is the integral of the intensity function I_l(y) for Poisson component l in bin B, which parameterises the mean of ℓ_{B,l}. Recall that a single point source in the population is specified by a flux F, which carries dimensions of [photons cm^{-2} s^{-1}]. In comparison, the Poisson diffuse-emission component is an extended source and has a differential flux of F_P T_P(x), with dimensions [photons cm^{-2} s^{-1} sr^{-1}]. Here T_P(x) is the template for the Poisson emission, which may or may not be the same as the spatial distribution of the source population, T(x). It does, however, have the same units of [sr^{-1}], so that F and F_P will also carry the same dimensions. In terms of these quantities, the Poisson intensity is given by

$$I_l(y) = F_{P,l}\, \eta(y) \int dx\, \kappa(x)\, T_{P,l}(x)\, \phi(y|x),$$

as photons from the diffuse component are also scattered by the PSF. The mean can then be written more compactly as

$$\lambda_{B,l} = F_{P,l} \int d\varepsilon\, \varepsilon\, \mu_{B,l}(\varepsilon),$$

where F_{P,l} is the flux for Poisson component l, and µ_{B,l} is the detector correction function for template T_{P,l}. As for the point-source population, we envision the spatial template as fixed, which leaves a single model parameter for the emission, F_P. Combining these results, the generating function for K_B can be written in the same form as Eq. 36:

$$G_{K_B}(z) = \exp\Big[\sum_{m=0}^{\infty} a_B^{(m)}\, z^m\Big], \quad (45)$$

where each Poisson component simply shifts the coefficients, a_B^{(0)} → a_B^{(0)} − λ_{B,l} and a_B^{(1)} → a_B^{(1)} + λ_{B,l}, and the same likelihood evaluation method of Sec. III D can be employed, thereby completely specifying the CPG likelihood.
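Multiplying by a Poisson factor exp[λ(z − 1)] shifts the zeroth series coefficient by −λ and the first by +λ, and nothing else. This can be checked directly with the series recurrence; the sketch below uses toy coefficient values (assumed, with a^{(0)} fixed so that G(1) = 1) and verifies that the combined distribution is normalised with the expected mean:

```python
import numpy as np
from math import exp

def probs(a, kmax):
    """P(k) for G(z) = exp(sum_m a[m] z^m) via the standard exp-series recurrence."""
    P = [exp(a[0])]
    for k in range(1, kmax + 1):
        P.append(sum(m * a[m] * P[k - m]
                     for m in range(1, min(k, len(a) - 1) + 1)) / k)
    return np.array(P)

# Toy point-source series (assumed values; a0 = -(a1 + a2) ensures G(1) = 1)
a = np.zeros(3)
a[1], a[2] = 0.5, 0.3
a[0] = -(a[1] + a[2])

# Add a Poisson component with mean lam: a0 -> a0 - lam, a1 -> a1 + lam
lam = 0.7
a[0] -= lam
a[1] += lam

P = probs(a, 60)
mean = np.dot(np.arange(61), P)
print(P.sum(), mean)  # ~1.0 and a1 + 2*a2 = 1.8
```

The mean follows from G'(1) = Σ_m m a^{(m)}, so the Poisson component adds exactly λ counts on average, as it should.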

IV. BIASES INDUCED BY COMMON PRIOR PARAMETERISATIONS
In this section, the priors on the population model parameters -necessary for a Bayesian analysis -are discussed. Our focus will be to demonstrate that poorly chosen priors on a common combination of the population flux and background flux can lead to misleading posterior distributions. We further introduce a set of priors where these issues can be reduced, and advocate for their use more generally in population studies.
Let us make a general point at the outset. In Bayesian analyses, one is free to choose any set of priors. Priors can be adopted which reflect a preference towards either hypothesis. The Poisson hypothesis may be preferred for its simplicity, or alternatively one may wish the results to reflect an underlying bias towards the point-source model, given that in many situations it is known that unresolved point sources must be present. Whatever set of priors is adopted, however, the preference they reflect should be considered. As we will show in this section, taking simple priors on the parameters that describe the point-source model can induce a complex bias in the question of which model is generating the flux. The priors we will introduce instead place the Poisson and point-source models on equal footing at the outset, and if not adopted directly, at the very least represent a starting point for adopting a principled set of priors for Bayesian point-source inference.
For the remainder of this investigation, we will restrict the model under question to at most one population of point sources and one Poisson component, both sharing a common template, i.e. T(x) = T_P(x). Realistic analyses are more complicated than this restriction; for instance, existing NPTF Fermi analyses involve two (or three) point-source population models and multiple Poisson components, while the NPTF IceCube analysis used one population with multiple Poisson components. However, common to both analyses was the use of a point-source population model and a Poisson component with identical spatial distributions. This is the situation where the particular bias we will discuss can emerge, justifying our restriction.
Fundamentally, such a scenario arises when the underlying nature of flux distributed according to T (x) is unknown, and the question of interest is whether it is due to a measurable population of sources, purely diffuse emission, or instead a mixture of the two. By using a common template for the two possible emission sources, the flux could be assigned to either, or some fraction to both. For example, in the case of Fermi, the fundamental question was determining whether the anomalous flux at the Galactic Center was attributable to a population of sources, such as millisecond pulsars, or instead to diffuse dark-matter emission that would be Poisson distributed.
For the example of IceCube, the goal has been to determine whether a measurable fraction of the astrophysical neutrino flux can be assigned to a population of sources.
To begin the investigation, the differential source-count function must be parameterised in order to evaluate the power series terms, a_B^{(m)}, as given in Eq. 45, and accordingly the CPG likelihood. Note that while we will use the CPG likelihood in order to demonstrate the potential prior bias, the effects we reveal are equally applicable to other point-source likelihoods such as the NPTF. The use of the CPG also furnishes us with an example to demonstrate the application of the likelihood. In the Fermi and IceCube analyses, the following standard form was used for the source-count distribution:

$$\frac{dN}{dF} = A \begin{cases} \left(\dfrac{F}{F_{b(2)}}\right)^{-n_1}, & F \geq F_{b(2)} \\[4pt] \left(\dfrac{F}{F_{b(2)}}\right)^{-n_2}, & F_{b(3)} \leq F < F_{b(2)} \\[4pt] \quad\vdots \\[4pt] \left[\prod_{i=3}^{m}\left(\dfrac{F_{b(i)}}{F_{b(i-1)}}\right)^{-n_{i-1}}\right]\left(\dfrac{F}{F_{b(m)}}\right)^{-n_m}, & F < F_{b(m)}. \end{cases}$$
This defines a broken power-law with m − 1 breaks in flux, specified by F_{b(m)}, . . . , F_{b(2)}; m power-law indices, parameterised by n_m, . . . , n_1; and a scale factor A, that is related to the expected number of sources in the population. The break parameters have the same units as F, while A has units of sources per unit of F. Note this is a generalisation of the singly broken power-law given in Eq. 7. With this parameterisation, Fermi and IceCube analyses (see, for example, Refs. [1] and [17], respectively) have used a Bayesian inference framework. In both cases, uniform priors on the indices, n_i, and log-uniform priors on A were chosen. For the Fermi analysis, the flux break, F_{b(2)}, and Poisson component flux, F_P, were given uniform and log-uniform priors, respectively, whereas in the IceCube analysis, F_{b(2)} was given a log-uniform prior, while F_P was given a uniform prior.

A. A Sketch of Poor Prior Parameterisation
Let us first outline where the issues associated with the poor prior parameterisation originate. For simplicity, we will concentrate on a single-break differential source-count function; however, the arguments here generalise to multiple breaks. It is important to note that the total flux of the point-source population, F_PS, is not proportional to F_{b(2)}, and also depends on A,

$$F_{PS} = \int_0^{\infty} dF\, F\, \frac{dN}{dF} = A\, F_{b(2)}^2 \left(\frac{1}{n_1 - 2} + \frac{1}{2 - n_2}\right),$$

assuming n_1 > 2 and n_2 < 2 so that the integral converges. Already from this expression we can make the following observation: a uniform prior chosen for F_{b(2)} will not, in general, uniformly weight values of F_PS after the change in coordinates. Consider a situation in which this model is used to analyse data that has no distinct population of sources, such that the data is entirely consistent with a Poisson distribution. Clearly, the model can explain this data by assigning the entirety of the flux to the Poisson component; however, this is not the only solution for this inference problem. In particular, if we have a population of dim sources, there is a limit in which the sources are so dim that this distribution becomes indistinguishable from Poisson emission. To provide an explicit example, if we had a population with N expected sources, each of which produces µ counts on average, then the mean number of counts produced by the population is λ = Nµ. In the limit where µ ≪ 1, such that all sources produce either 0 or 1 counts only, the point-source likelihood exactly reduces to the Poisson distribution with mean λ, and the two hypotheses are formally indistinguishable in the data. Note that for λ to stay finite, we require a large N when µ ≪ 1. Returning to our single-break scenario, note that

$$N = \int_0^{\infty} dF\, \frac{dN}{dF} = A\, F_{b(2)} \left(\frac{1}{n_1 - 1} + \frac{1}{1 - n_2}\right),$$

and so for this point-source-Poisson degeneracy regime to be achieved here, the prior on A must be sufficiently large. But so long as it is, then in principle data associated with Poisson emission can be equally well described by the diffuse or point-source hypothesis.
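The change-of-coordinates effect can be made concrete with a quick prior-predictive check. Assuming the single-break relation F_PS = A F_b(2)^2 [1/(n_1 − 2) + 1/(2 − n_2)], with fixed indices and illustrative (assumed) log-uniform prior ranges on A and F_b(2), the induced prior on log F_PS is far from flat:

```python
import numpy as np

rng = np.random.default_rng(3)

# Fixed indices for illustration (assumed), giving the single-break shape factor
n1, n2 = 3.0, 1.0
shape = 1.0 / (n1 - 2.0) + 1.0 / (2.0 - n2)

# Log-uniform priors on A and F_b(2) over assumed illustrative ranges
logA = rng.uniform(np.log(1e0), np.log(1e4), size=100_000)
logFb = rng.uniform(np.log(1e-3), np.log(1e0), size=100_000)
F_PS = np.exp(logA) * np.exp(logFb) ** 2 * shape

# log F_PS = logA + 2*logFb + const is a sum of uniforms: a trapezoidal density
# that concentrates prior weight at central values rather than spreading evenly
hist, _ = np.histogram(np.log(F_PS), bins=30)
print(hist.min(), hist.max())
```

Even with log-uniform priors on both underlying parameters, the induced prior on the physically meaningful quantity F_PS piles up in the middle of its range, and in general will not match whatever prior was placed on F_P.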
Ideally, in this situation, the posterior should show complete uncertainty on F_P and F_PS, with a perfect anticorrelation that corresponds to the total flux in the image. However, if the same kind of prior is chosen for F_{b(2)} and F_P (for example, both log-uniform), then the corresponding prior on F_PS may have a different form to the prior on F_P. In this case, the posterior will show a preference for assigning flux to either the population or the Poisson component. If the difference between the priors is substantial, essentially all of the flux will be assigned to one of the two components, despite the data having no power to distinguish them.
Unless this subtlety in choice of prior is accounted for, the result may be unexpected, potentially leading an experimenter to erroneously conclude that the data supports a population of sources where there are none, or vice-versa. This is the bias we will explore in more detail below, and then outline priors that can be chosen such that the two hypotheses remain indistinguishable given uninformative data.

B. Prior Effect Demonstration
In order to investigate in detail the potential bias induced by the choice of prior parameterisation discussed in the previous subsection, we consider simulated NuSTAR data sets (for details of the NuSTAR simulation see App. A). For this scenario, the vignetting is disabled so that the detector response is uniform across the image. In addition, the PSF is locked to the on-axis PSF of NuSTAR, so that the PSF does not change as a function of source location. These simplifications are chosen to ensure that the effect of prior parameterisation is not obscured.
The spatial distribution for both the population model and the Poisson component is specified as a uniform distribution concentric with the image and with a width and height twice that of the field of view. As this demonstration requires data that is indistinguishable from a Poisson distribution, no point-sources are injected and only a uniform Poisson background is used to generate the image. Further details are given in Tab. I (all tables are located in the appendices).
Given this scenario, we perform a Bayesian analysis of the resulting images using the CPG, but taking four variations on the choice of prior, considering variations for the prior of the amplitude A and the flux parameters F_{b(2)} and F_P. The same form of prior is used for F_{b(2)} and F_P. We consider the four combinations resulting from a uniform or log-uniform prior on the amplitude and on the flux parameters. The detailed prior ranges are given in Tab. II. We take the same lower limit for both flux priors. The upper limit of the F_P prior is extended, as this parameter captures the total flux of the Poisson component, while F_{b(2)} is more closely related to the average flux of a single source in the population, so we allow for a lower value. The upper and lower limits for the priors on each parameter are identical across variations. Uniform priors are chosen for the power-law indices, n_i. Of course, the results from running point-source inference on one simulated image may not be representative, as the simulation is a Monte-Carlo procedure that may produce an outlier image. To capture the variation in the simulation, for each choice of priors, six images are generated and a posterior is sampled for each of these trials. The posteriors were sampled using the emcee affine-invariant MCMC sampler [54], discarding the first 90% of samples as burn-in.
From the resulting posteriors, we focus our attention on a single parameter of interest: the proportion of the flux that is assigned to the population model. In particular, we define the fraction of flux assigned to the population as ω = F_PS/(F_P + F_PS). For each of the trial images, a coordinate transformation is applied to the posterior samples to yield samples in ω. These samples are then histogrammed, and from the set of trials the median, 16% and 84% quantiles over the histogram bins are computed and shown in Fig. 6. The clear trend observed here is a highly bimodal posterior in ω: the model assigns essentially all of the flux to either the population or the Poisson component, in a situation where, from the perspective of the data, the two are indistinguishable. Unless this behaviour was anticipated, it could generate misleading conclusions. Since the data supports both the population model and the Poisson component equally, one would instead expect the posterior on ω to be close to uniform.
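The ω transformation and the quantile bands of Fig. 6 can be computed along these lines (a sketch assuming the posterior samples for each trial are available as flux arrays):

```python
import numpy as np

def flux_fraction(F_PS, F_P):
    """Fraction of the total flux assigned to the point-source population."""
    return F_PS / (F_PS + F_P)

def omega_band(trials, bins=20):
    """Histogram omega for each trial image, then take the 16%, 50%, and 84%
    quantiles over trials in each histogram bin, as in Fig. 6.

    `trials` is a list of (F_PS, F_P) posterior-sample array pairs.
    """
    edges = np.linspace(0.0, 1.0, bins + 1)
    hists = []
    for F_PS, F_P in trials:
        omega = flux_fraction(F_PS, F_P)
        h, _ = np.histogram(omega, bins=edges, density=True)
        hists.append(h)
    # Rows: 16% quantile, median, 84% quantile across trials, per bin.
    return np.quantile(np.asarray(hists), [0.16, 0.5, 0.84], axis=0)
```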
Recently, in the context of the application of NPTF to the Fermi GCE, concerns have been raised about the possibility that flux from a diffuse dark-matter component could be misattributed to the point-source population model. Leane and Slatyer [2] raised these concerns in relation to mismodelling of the spatial distribution of sources, and the authors show that a strong flux-misattribution effect can result from spatial mismodelling. The authors also examine the attribution of flux when the spatial distribution is correctly modelled. In particular, their Fig. S7 shows that even with correct modelling, some diffuse dark-matter flux is attributed to the point-source model, although it should be clarified that the effect is considerably weaker than the spatial-mismodelling effect observed in the rest of the study. Figure S7 also shows that when the true flux of the diffuse dark-matter component exceeds the flux of the point-source population, the flux is instead attributed to the diffuse component of the model. This behaviour is very likely impacted by the choice of priors in that work, given their importance as demonstrated above. In particular, Leane and Slatyer [2] placed a log-uniform prior on the diffuse emission and on A, but a uniform prior on F_b(2). The use of a uniform prior for the point-source flux break places greater weight on this model to describe the diffuse flux, provided the dN/dF for this model can accommodate both the diffuse flux and the brighter point-source emission favoured by the data. Accordingly, when small amounts of diffuse flux are injected, we would expect it to be absorbed by the point-source model for this choice of priors. However, once the diffuse flux is comparable to or larger than the point-source flux, it becomes difficult to explain both the population and the diffuse flux using the same power-law dN/dF. The model then reverts to attributing the diffuse flux to the Poissonian component.
The prior effect can be observed more clearly in Chang et al. [3]. More generally, if we denote the total flux by F_T, so that F_T = F_PS + F_P (a diagonal line in the plane of F_PS and F_P), it is clear that the only choices of flux priors that will result in the expected behaviour are ones that assign equal probability to all values of F_PS and F_P lying on this diagonal. Two choices of priors satisfy this requirement: either uniform priors or exponential priors on both F_PS and F_P.^10 This may appear to suggest that a flat posterior on ω should be observed in Fig. 6 for the "Uniform F" cases. The non-uniform posterior in these cases comes from a second effect: although the prior on the flux break, F_b(2), is specified to be uniform, this does not mean the prior on the total point-source flux is uniform, as emphasised below Eq. 47 above. Recall that the total flux is proportional to A F_b(2)^2, so a uniform prior on A and F_b(2) is effectively an F_T^{-1} prior on the total flux, resulting in the observed concentration toward ω = 0.
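A quick Monte-Carlo check of this induced prior (on an arbitrary unit scale, not the ranges of Tab. II) makes the concentration at small total flux explicit:

```python
import numpy as np

rng = np.random.default_rng(0)

# Uniform priors on the amplitude A and the flux break F_b(2) induce a
# prior on the total flux through F_T proportional to A * F_b(2)^2.
A = rng.uniform(0.0, 1.0, 1_000_000)
F_b2 = rng.uniform(0.0, 1.0, 1_000_000)
F_T = A * F_b2**2

# Over half the prior mass sits below a tenth of the maximum total flux:
# this is the concentration toward omega = 0 seen in Fig. 6.
frac_small = np.mean(F_T < 0.1)
```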
A log-uniform prior on A with a uniform prior on F effectively generates an N^{-2} prior on the mean number of sources. Although this appears to be a uniform prior on the total flux, one must consider how the boundary of the prior transforms. The effect of the prior boundaries can be incorporated by marginalising out N from the joint prior, which results again in an F_T^{-1} prior on the total flux. Accordingly, none of the commonly used prior choices result in the desired flat ω distribution. Of course, the particular priors that existing choices induce may be desirable in certain circumstances. Nevertheless, when considering the question of whether emission is fundamentally Poisson or point source, it is undoubtedly useful to have a system of priors that generates a posterior where both are treated equally when the data is uninformative. With this in mind, in the next section we introduce a new approach which reparameterises the priors in a new coordinate system.

C. Reparameterising Priors in a Natural Coordinate System

Much of the previous discussion already suggests an alternative approach. A natural way to describe the combination of both population and Poisson component is through the coordinates ω and F_T, the relative and total flux, respectively, so that the population model is specified in terms of F_PS. From ω and F_T, the flux of either component is straightforward to calculate, although we caution that, given the inherent degeneracy, the individual fractions must be interpreted carefully. Nevertheless, as it is considerably easier to define and compute the likelihood using the A, n_i, and F_b(i) coordinates, a coordinate transform is needed from F_PS. First, the complete coordinate system for the population model must be defined. A natural complement to F_PS is N, the expected number of sources; from these two, the average flux per source can be readily determined. Next, the positions of the breaks must be defined. For m − 1 breaks, m − 1 coordinates are required. The F_PS coordinate is necessarily involved, leaving m − 2 remaining coordinates to specify. These are defined as a series of fractions, β_i, that give the location of each break relative to the previous break. These coordinates define a system of equations from which the break locations can be solved; the full transformation is detailed in App. G. Finally, the power-law indices must be specified. These could be left as the {n_i} in Eq. 46; however, we take the opportunity to correct another subtle issue with the priors. The most common choice of prior on n_i is a uniform prior. Suppose that such a uniform prior is chosen for n_1 in the range 2 to 100. The uniform prior assigns nearly ten times as much prior probability to 10 < n_1 < 100 as it does to 2 < n_1 < 10. This is despite the fact that most observed power-law indices are less than 10; thus, a uniform prior is contrary to our prior knowledge of these physical systems.
Instead, the index can be specified as an angle, ψ_i, with the index defined as n_i = tan ψ_i. A uniform prior on ψ_i now places as much probability on 2 < n_i < 4 as it does on 4 < n_i < 38. The intuition is also clear: on a log-log plot of the source-count function, the prior is uniform in the angle of the line formed by the power law. This should not be taken as an objectively better choice, and there may be scenarios where a uniform prior on the power-law index is preferable; yet, the uniform prior is all too often used without much consideration, and the intention here is to provide a principled alternative.
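The equal-probability claim is easy to verify numerically: under a uniform prior on ψ, the prior probability of an index interval is proportional to the width of the corresponding angular interval.

```python
import numpy as np

def index_to_angle(n):
    """Angle psi corresponding to a power-law index n, with n = tan(psi)."""
    return np.arctan(n)

# Widths of the angular intervals corresponding to three index ranges.
w_2_4 = index_to_angle(4.0) - index_to_angle(2.0)
w_4_38 = index_to_angle(38.0) - index_to_angle(4.0)
w_10_100 = index_to_angle(100.0) - index_to_angle(10.0)
# w_2_4 and w_4_38 are (almost exactly) equal, while the whole range
# 10 < n < 100 receives far less prior probability than 2 < n < 4.
```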
In summary, we propose defining the priors on the broken power law with m − 1 flux breaks, as defined in Eq. 46, using the coordinate system {N, F_PS, β_2, . . . , β_{m−1}, ψ_1, . . . , ψ_m}, rather than {A, F_b(2), . . . , F_b(m), n_1, . . . , n_m}. A Poisson component may be added to these coordinates through the ω flux-fraction parameter. The F_PS parameter is then replaced by an F_T parameter, which represents the combined flux of the point-source population and the Poisson component. During the likelihood evaluation, a coordinate transform is applied using F_PS = ωF_T and F_Poiss = (1 − ω)F_T.
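The transform applied during the likelihood evaluation is simply:

```python
def natural_to_flux(omega, F_T):
    """Transform from the natural coordinates (omega, F_T) back to the
    component fluxes used in the likelihood evaluation."""
    F_PS = omega * F_T             # point-source population flux
    F_Poiss = (1.0 - omega) * F_T  # Poisson-component flux
    return F_PS, F_Poiss
```

By construction, the two component fluxes always sum to F_T, so a uniform prior on ω assigns equal probability to every split of the total flux.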
In the next subsection we will demonstrate that, within this arguably more natural coordinate system for point-source distributions, priors can be chosen where the degeneracy inherent in the physics is faithfully represented in the posteriors.

[Figure 8: The total flux is tightly constrained for N > 1; however, it becomes completely unconstrained once the mean number of sources in the whole population is less than one (N < 1).]

D. Demonstration of Bias Removal
We consider an identical simulation scenario to that defined in Sec. IV B, except we now approach it using priors defined in the natural coordinate system. A unit uniform prior is chosen for ω, and uniform priors are chosen for the ψ_i coordinates. As before, we consider four prior variations, which result from considering combinations of linear-uniform or log-uniform priors on N and F_T. Specifics are provided in Tab. III. The results are shown in Fig. 7, in the same format as those in Fig. 6. As the prior on ω is uniform, we observe that the posterior on ω is now generally uniform when the data has no preference for the population or Poisson component.
There is, however, one exception to this behaviour, which occurs for the combination of a log-uniform prior on N and a uniform prior on F. In that case, the results demonstrate a clear bias in the posterior toward assigning all of the flux to the point-source template. The cause is a combination of effects from both priors. The log-uniform prior on N allows for small values of N. In particular, when N < 1 the probability that there will be no sources in the image becomes significant. If there are no sources, the flux of the source population cannot be constrained, and any value of F is allowed. In such a case, if the prior on F is also log-uniform, then large values of F are weighted relatively less than they would be with a uniform prior, so this lack of constraint has little effect on the posterior. However, when the prior on F is uniform, large values of F are encouraged, and the posterior assigns significant probability to N < 1 and F_PS ≫ F_P, as shown by Fig. 8. This issue may be avoided in two ways: choose either a uniform prior on N or a log-uniform prior on F, or alternatively set the lower bound of the prior on N to be larger than one.
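The significance of N < 1 follows directly from the Poisson probability of an empty image:

```python
import numpy as np

def prob_empty(N):
    """Probability that a population with mean source count N produces no
    sources at all, assuming the source count is Poisson distributed."""
    return np.exp(-N)
```

For example, prob_empty(0.1) exceeds 0.9, so for small N the "no sources, unconstrained flux" branch of the posterior carries substantial probability, while for N well above one it is negligible.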
Regardless, beyond this specific case, the desired degeneracy between the diffuse and point-source components can readily be achieved in this coordinate system, and as such we advocate for its use generally over the existing choices.

E. Degeneracies between Multiple Components
This section has concentrated on the effect of the prior parameterisation on a Poisson and point-source population component with identical spatial distributions. Certainly one can expect similar problems if two Poisson components have identical spatial distributions, or if the spatial distributions for two point-source population components are identical, and we briefly comment on these scenarios here.
Caution should be taken even for components that do not share a spatial distribution. If the spatial distributions for two components, whether Poisson or point source, are so similar that they cannot be distinguished from each other given the available data, then we can also expect a prior effect to manifest. Even if the spatial distributions for each component in the model are highly distinct, a degeneracy between the distributions can arise if the set of distributions is not linearly independent. This degeneracy allows multiple solutions for the given data, and so the prior effect may again manifest. If the spatial distributions are nearly linearly dependent, as measured by the statistical power to distinguish them, then they should also be considered potentially problematic.
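One simple diagnostic for near linear dependence is the residual of a least-squares fit of one flattened template against the others (a sketch; the statistically principled measure is the power to distinguish the templates given the data):

```python
import numpy as np

def dependence_residual(templates, target_index):
    """Relative least-squares residual when expressing one flattened spatial
    template as a linear combination of the others. Values near zero signal
    (near) linear dependence, and hence a potential prior degeneracy."""
    T = np.asarray(templates, dtype=float)
    y = T[target_index]
    X = np.delete(T, target_index, axis=0).T  # remaining templates as columns
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.linalg.norm(X @ coef - y) / np.linalg.norm(y)
```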
Therefore, when constructing a model, care should be taken to avoid near linear-dependence between the spatial distributions. If the hypothesis in question requires such linear dependence, then the prior parameterisation we have introduced provides a solution that ensures that any physical degeneracy is faithfully represented in the posterior for the flux assigned to each source.

V. CPG PERFORMANCE AND COMPARISON WITH EXISTING METHODS
In this section, a mathematical connection is drawn between the CPG construction and previous approaches to the problem of parametric point-source inference. In particular we will consider a number of scenarios that highlight expected problems with the existing methods, and a performance comparison with the CPG is made using simulations. To highlight that these limitations arise from the likelihood construction, the natural coordinate system described in Sec. IV is employed for all simulations shown here.
As the NPTF method is the current leading parametric point-source inference method in high-energy astrophysics, this comparison will focus on the essentials of how the NPTF likelihood relates to the CPG construction. For a complete explanation of the NPTF method, we refer the reader to Mishra-Sharma et al. [55], where the NPTF likelihood is specified as a generating function. The NPTF generator, Ĝ, is written as an exponential of a power series, but can be equivalently written in terms of κ̄_B and T_B (Eq. 50), the average detector response and template in the bin of interest, respectively. The NPTF generator in Eq. 50 also depends on ρ(f), a function introduced to correct for the PSF. This PSF correction encodes the fraction, f, of point-source flux that migrates out of or into a bin due to the finite angular resolution of the instrument. To build intuition for the effect this correction function aims to address, consider a single point source that is located in a bin, B. Not all of the flux generated by this point source will fall within B: one can find the captured flux by integrating the PSF over the extent of the bin. The expected observed counts from the point source within B is then the product of this flux fraction and the total expected counts from the source. If another point source is in an adjacent bin, then the PSF can cause a fraction of its flux to leak into B. This leakage would cause the flux of the original point source to be overestimated unless accounted for, so the PSF needs to be integrated again, now centred somewhere in the adjacent bin, to find the fraction of flux that leaks into B. This fraction is again multiplied by the appropriate counts of this second source, as determined by the dN/dF specified for the entire population, so that this extra flux is properly accounted for. The value of ρ(f) is proportional to the frequency with which a point source contributes the fraction, f, of its flux to a bin.
The correction function is not a probability, however. The normalisation of ρ(f) is fixed so that ∫₀¹ f ρ(f) df = 1, which ensures both that the flux of the population is conserved and that the number of sources in the population is not overestimated.
The statistical justification for ρ(f) is not given in Malyshev and Hogg [16], where it was introduced, nor in any subsequent works. Instead, ρ(f) is defined as an infinitesimal limit of a numerical estimate of the fraction of flux that a point source will contribute to a pixel after accounting for the PSF. This procedure is constructed as a simulation, and the complete details of the algorithm are given in App. F. Briefly, the algorithm places a point source somewhere on the sky, and counts associated with this source are drawn from the PSF and placed in the image bins. All bins are then divided by the total number of simulated counts, so that each bin contains the fraction of flux that the source contributes to that bin. These flux fractions are then histogrammed, and the final estimate is an average over multiple repetitions of this procedure, so that the histogrammed flux fractions capture multiple possible source locations. The final result is a numerical estimate of ρ(f) that is normalised so that ∫₀¹ f ρ(f) df = 1. A direct comparison of the CPG and NPTF generating functions, as given in Eqs. 35 and 50, reveals that a clear difference lies in the quantities κ̄_B, T_B, and ρ(f), which appear only for the NPTF. In the NPTF construction, the detector effects are captured in κ̄ and ρ(f). One might imagine that a connection to μ_B(ε) of Eq. 35 could be made through dε = κ̄_B df; however, ρ(f) is not specified per bin; instead it is an average over all bins. Indeed, the CPG construction reveals that the detector effects cannot, in general, be averaged over bins in this way.
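A minimal one-dimensional sketch of this estimation procedure (with a Gaussian stand-in for the PSF; the real algorithm of App. F operates on sky maps) is:

```python
import numpy as np

def estimate_rho(psf_sampler, bin_edges, n_sources=500,
                 counts_per_source=2000, hist_bins=50, seed=None):
    """Monte-Carlo estimate of the NPTF PSF correction rho(f): place a
    source at a random position, draw counts from the PSF, bin them,
    convert bin contents to flux fractions, and histogram the non-zero
    fractions over many repetitions."""
    rng = np.random.default_rng(seed)
    lo, hi = bin_edges[0], bin_edges[-1]
    fractions = []
    for _ in range(n_sources):
        x0 = rng.uniform(lo, hi)                     # random source position
        counts = psf_sampler(x0, counts_per_source, rng)
        hist, _ = np.histogram(counts, bins=bin_edges)
        f = hist / counts_per_source                 # flux fraction per bin
        fractions.extend(f[f > 0])
    rho, edges = np.histogram(fractions, bins=hist_bins,
                              range=(0.0, 1.0), density=True)
    centers = 0.5 * (edges[1:] + edges[:-1])
    # Fix the normalisation so that the first moment is one:
    # int_0^1 f rho(f) df = 1.
    norm = np.sum(centers * rho) * (edges[1] - edges[0])
    return rho / norm, centers

def gaussian_psf(x0, n, rng):
    """A 1D Gaussian PSF as an illustrative stand-in detector model."""
    return rng.normal(x0, 0.3, size=n)
```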
As an example, consider data that is binned irregularly (for instance, bins of significantly differing size): the construction of ρ(f) does not produce a correction function that represents any particular bin under consideration, and as a result the likelihood does not describe the statistics of the data. In the case of NuSTAR, the field of view of the instrument is narrow enough that a Euclidean coordinate system can be used as a good approximation to the angular sky coordinates, allowing a regular binning. However, instruments such as Fermi and IceCube measure the entire sky. As a regular tiling of a sphere is impossible, data from these experiments must be irregularly binned, usually with the use of a HEALPix map [56]. As NuSTAR does not suffer from this issue, an in-depth examination of it is outside the scope of this investigation. However, one should not expect this to have a particularly large effect on the Fermi GCE analysis or most of the results from the recent IceCube analysis. Both analyses generally prioritise the region of the sky around the Galactic Center, using templates that place the most weight in this region. The HEALPix maps employed were centred on the Galactic coordinate system, placing the Galactic Center at the centre of the maps, where the tiling is most regular.^11 Thus, for both analyses, the tiling is close to regular where the templates under consideration have the greatest weight. (One IceCube analysis considered a uniform all-sky template, for which the irregular HEALPix binning could produce issues.) In the case of the IceCube analysis, the construction of ρ(f) was further weighted by the spatial template under consideration, ensuring it more closely reflected the binning in this region.
It should be noted that the irregularity of the HEALPix binning is reduced, both in terms of average bin shape and average number of neighbours, for large numbers of bins; however, the potential for the irregularity to be an issue must still be considered, especially for analyses where important information is present at the poles, or where bin sizes are large.
Extending NPTF to use a per-bin PSF correction, ρ_B(f), would address this problem, but limitations in such an NPTF-like method remain. The primary limitation of the NPTF construction is the use of an integrated value for the spatial distribution: the template. The CPG construction shows that the spatial distribution and detector response cannot, in general, be factorised into two separate terms. We will demonstrate this explicitly by considering a number of examples in the following subsections.

A. Non-uniform Spatial Distribution
To begin with, we consider an edge case which illustrates the problems that arise from the factorisation of the template from the detector effects. Consider a spatial population that occupies only a single bin, l = 0, such that T_l = δ_{l0}. In addition, suppose the PSF is broad enough that a non-negligible amount of flux migrates out of bin 0. As the population does not exist outside of bin 0, out-migration is the only effect of the PSF for this bin. This implies that flux is not conserved for this bin: the flux captured by this bin is always less than the flux of the population. The other bins, in turn, are affected only by in-migration. According to the spatial template, T_l = 0 for these bins, and so the population flux for these bins is also zero. Thus, flux is not conserved for these bins either, as they receive flux from adjacent bins. Both of these effects are illustrated in Fig. 9.
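The bin-by-bin flux non-conservation can be made concrete with a one-dimensional Gaussian stand-in for the PSF (an illustrative assumption, not the NuSTAR PSF):

```python
import math

def captured_fraction(x0, lo, hi, sigma):
    """Fraction of a source's counts landing in the bin [lo, hi], for a 1D
    Gaussian PSF of width sigma centred on the source at x0."""
    z = lambda x: (x - x0) / (sigma * math.sqrt(2.0))
    return 0.5 * (math.erf(z(hi)) - math.erf(z(lo)))

# A single source at the centre of bin 0 (spanning [0, 1]) with a PSF broad
# relative to the bin: bin 0 captures well under the full flux, while the
# neighbouring bin, whose template value is zero, still receives flux.
f_bin0 = captured_fraction(0.5, 0.0, 1.0, sigma=1.0)       # out-migration
f_neighbour = captured_fraction(0.5, 1.0, 2.0, sigma=1.0)  # in-migration
```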
Conservation of flux is only a property of the entire image, and cannot be enforced on a bin-by-bin basis through ρ(f). The distribution of flux fractions that these bins receive is clearly a function of the bin in question, and so the average over all bins, ρ(f), will not describe any bin in the image. In addition, the distribution of flux fractions for each bin clearly depends on the distance of that bin from bin 0, demonstrating manifestly why the spatial distribution of the population cannot be factorised out of the detector response.

[Figure 9: Left: A template has all of its probability concentrated in a single bin, shown as the 2D histogram on the left. As such, all three sources, {x_1, x_2, x_3}, are located within this bin; each has an associated PSF, shown as the solid distribution centred on the source location. Above right: The probability distribution for the number of counts in the central bin. The distribution for conserved flux, as in the NPTF construction, is shown in dotted red; the actual distribution, with a smaller mean, is shown in black. Below right: The distribution for an adjacent bin. The red dotted line shows the distribution for the NPTF construction, with zero mean; the black line shows the actual distribution, which has a non-zero mean.]
Although this edge case is artificially extreme -in order to bring out the effect -it is not uncommon for scenarios to have much of the spatial distribution concentrated in a single bin. Any sharply peaked spatial distribution will suffer this problem to an extent; for example, the GCE spatial profile has a sharp peak at the Galactic Center.
A scenario was constructed in the NuSTAR simulation (described in detail in App. A 1) in order to investigate the potential bias this may induce. In this scenario, five bins have an equal, non-zero share of the population. Multiple bins are necessary for point-source population reconstruction, and the bins are spaced far enough apart that the demonstrated effect is largely similar to the single-bin scenario. The edge case as described above can only be meaningfully analysed with the CPG likelihood: the NPTF method will fail entirely and return zero probability for all parameter values. The reason is that, from Eq. 50, the NPTF predicts zero flux in any bin with T_B = 0, for any value of the model parameters. If any of these bins record counts due to the finite PSF, then these are events the NPTF simply cannot reconstruct. In order to allow for a comparison with the NPTF, we rescale the template so that no bins are exactly zero, but instead take small non-zero values of T_B which cumulatively add up to no more than 1% of the template. Further details on the scenario setup are given in Tab. I.
The NPTF method was implemented by estimating ρ(f) according to the algorithm in App. F, and then transforming it to an equivalent CPG representation through ρ_B(ε) = T_B ρ(ε/κ̄_B)/κ̄_B. In this way, the exact same computational implementation of the likelihood and Bayesian analysis procedure can be used to evaluate both the NPTF and CPG methods. Thus, any differences observed can be entirely attributed to the NPTF ρ(f) construction, and are not due to any subtle differences in the software implementation of the likelihood and MCMC. The generation of μ(ε) required approximately 100 CPU hours;^12 however, in practice, excellent results can be achieved with significantly less time, as little as 5 CPU hours. As ρ(f) is shared between bins, it requires substantially less time to generate; in this case, less than one CPU hour in total. These times are highly application dependent, as they are almost entirely a function of the computational complexity of the detector simulation and, in the case of μ(ε), the chosen binning scheme. The natural coordinate system employed in the prior-effect demonstrations of Sec. IV D was used, with the exception of the Poisson component. Thus, the priors used here are also described by Tab. III, except that ω is not needed and the flux parameter is referred to as F_PS instead of F_T. The posteriors were sampled using the emcee Affine Invariant MCMC [54]. A total of 200,000 steps were taken with 64 walkers, and the first 90% of samples were discarded as burn-in.
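The transformation to the equivalent CPG representation, evaluated on a grid, amounts to a change of variables (a sketch, with `nptf_to_cpg` as a hypothetical helper name):

```python
import numpy as np

def nptf_to_cpg(rho, f_grid, T_B, kappa_bar_B):
    """Map the shared NPTF correction rho(f) onto the per-bin CPG
    representation rho_B(eps) = T_B * rho(eps / kappa_bar_B) / kappa_bar_B.
    Evaluating on the grid eps = kappa_bar_B * f makes the change of
    variables, and its Jacobian factor 1/kappa_bar_B, explicit."""
    eps_grid = kappa_bar_B * f_grid
    rho_B = T_B * rho / kappa_bar_B
    return eps_grid, rho_B
```

Because the Jacobian is included, the integral of ρ_B over ε equals T_B times the integral of ρ over f, so the template weight is the only per-bin difference.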
The results are shown in Fig. 10a. The true source-count function parameters are shown by a solid red line. The dotted and dashed lines describe the recovered posterior, where each colour represents one trial image: the dashed lines are the 1-σ Highest Posterior Density (HPD) regions, while the dotted lines define the 2-σ HPD regions. The NPTF posterior is biased towards a high number of sources with an approximately correct total flux. In comparison, CPG does not exhibit the same bias. Here bias is defined as an overall shift in the recovered parameters across multiple trials: the yellow CPG trial is well outside the true parameters, while the purple NPTF trial is inside the true parameters, as is to be expected from the random variations between trials. However, as a group, the NPTF trials clearly recover a larger number of sources than the true N. In Fig. 10b we further show the difference between the CPG μ_B(ε) and the equivalent NPTF ρ_B(ε). Each colour represents the μ_B(ε) measure for a bin in the scenario; as NPTF averages the PSF correction over bins and the effective area is isotropic, the equivalent ρ_B(ε) is the same for all bins, up to the two unique template values of T_B.
The averaging approximation used in the NPTF construction is clearly deficient in this edge case. The inference is driven by the five bins that contain the population. The μ_B(ε) for these bins is heavily weighted toward high ε. The NPTF ρ(ε) is heavily weighted toward low ε, due to averaging over the larger number of bins that do not contain the population. The normalisation condition, ∫₀¹ df f ρ(f) = 1, imposed on ρ(f) modifies the overall normalisation of the function so that flux is conserved in all bins. However, as described above, the flux is manifestly not conserved on a bin-by-bin basis in this scenario. In Eq. 50, a change in the normalisation of ρ(f) is equivalent to a change in N, as the normalisation can be absorbed into a redefinition of N when the generator is rewritten (Eq. 51). Thus, the normalisation effectively drives ρ_B(ε) at high ε down, causing the posterior on N to be driven up to compensate. This leads to the observed overestimation bias in the number of sources. The posteriors in Fig. 10a show a between-trial variation that is on the same scale as the posteriors themselves. We can rule out statistical fluctuations as the dominant cause, as the variations are an order of magnitude in N for the NPTF posteriors. Instead, these variations are due to the small amount of information present in only five bins, which leads to poor recovery of the total number of sources. We should expect the posteriors to overlap (and they mostly do for CPG), but as the NPTF likelihood does not describe the data, we cannot expect the posteriors to be well-behaved.

[Figure 10b: The detector-effect correction function for both the CPG μ_B(ε) construction and the NPTF ρ(f) construction.]

[Figure 11: As in Fig. 10, but for the anisotropic PSF scenario. The injected population is well recovered by CPG, while the NPTF posteriors are clearly biased towards low N. The cusp structure visible in each μ_B(ε) is smeared out in ρ_B(ε).]

B. Anisotropic PSF
In the standard NPTF construction, ρ(f ) is common to all bins and thus it can, at best, capture the average PSF throughout the image. The following scenario examines the bias that this approximation may introduce to the inference of the source-count function model parameters when an anisotropic PSF is present, as arises, for instance, in NuSTAR.
We consider a scenario largely identical to that in Sec. IV B, with details given in Tab. I. In this scenario, multiple exposures are overlaid into a mosaic, forming a single binned image. This increases the amount of anisotropy in the PSF, as sources can bleed across the boundaries of the exposures. When a source contributes to two or more exposures, an appropriate mix of PSFs is required, leading to more complicated PSF distributions. For this scenario, we wish to only test the effect of an anisotropic PSF. Thus, the exposures are perfectly aligned edge to edge in a grid with no gaps or overlaps. This, along with the lack of vignetting, ensures that the effective area is uniform across the entire image. We note that this is not a natural arrangement -most mosaics involve a degree of overlap between exposures.
The results of analysing the resulting data sets are shown in Fig. 11a. A high degree of bias toward a low number of sources is observed in the NPTF posterior, while CPG is largely consistent with the true population parameters. Figure 11b further shows that the equivalent NPTF ρ_B(ε) is much closer to the CPG μ_B(ε) in this scenario than in the previous non-uniform spatial template scenario. However, the shape of ρ_B(ε) differs significantly from μ_B(ε) near the highest ε. Unlike before, the normalisation of ρ_B(ε) is correct here: the uniform spatial distribution ensures that flux is conserved on a bin-by-bin basis. While this normalisation ensures the average ε is correct, the distribution of ε is not. Observe that μ_B(ε) generally has a cusp at high ε, while ρ_B(ε) shows no cusp. This lack of a cusp causes the NPTF likelihood to be driven by lower ε in comparison, and so the posterior drives up the average flux per source and drives down the number of sources in order to maintain the conservation of flux. Note that the quantity shown in Fig. 11a is the total flux of the entire population, so when the number of sources, N, reduces, it is the average flux per source, roughly proportional to F_PS/N, that increases.

C. Sub-bin Effective Area
The prescribed construction of ρ(f) in the NPTF method does not include any accounting for the effective area, exposure time, or detector efficiency of the instrument and observation. This may lead to an incorrect construction of the likelihood distribution, as the following scenario illustrates.
Consider an instrument whose effective area varies sharply within a bin, and a population that is spatially uniform. The distribution of fluxes for a source is then not a function of position within this bin. However, the distribution of expected counts will be a function of position within the bin, as the flux must be converted to a number of photons using the effective area. In particular, take a scenario where there is no PSF, φ(y|x) = δ(y − x), and the injected source population has no variation in flux, p(F) = δ(F − F_PS). In order to generate the effect of interest, we then let the detector response change discontinuously at some point within the bin, between the values of κ_0 and κ_1. If a source is located where κ(x) = κ_i, then the distribution of the detected number of counts at this location is p(S_B|x) = δ(S_B − κ_i F_PS), with i = 0 or 1. The distribution for the whole bin is found by integrating the position-dependent distribution over T(x). Let the spatial template be uniform; then the whole-bin distribution is a mixture of the previous two distributions,

p(S_B) = C δ(S_B − κ_0 F_PS) + (1 − C) δ(S_B − κ_1 F_PS),

where C is a mixture fraction that depends on what area of the bin has a detector response of κ_0.

Figure 12: Below: the detector response, κ(x), varies sharply over bin b. The average detector response for this bin, κ̄_b, is shown by the dashed line. Above: if the average detector response is used to convert flux to counts, the probability distribution follows the red dotted comb; however, the within-bin variation of the detector response must create two modes in the distribution, shown by the black comb.

For this scenario, the NPTF calculates the average detector response across the bin, κ̄, and uses it to find p(S_B) = δ(S_B − κ̄ F_PS). This does not result in the correct distribution of photon number; for example, if κ_0 = 0, then the NPTF construction preserves the total flux of the bin, but will overestimate the flux of sources within that bin, as κ̄ < κ_1. A comparison between the exact and mean distributions is shown in Fig. 12. It should be noted here that, due to computational constraints, the effective area is rarely considered on a bin-by-bin basis in the NPTF construction, and is instead averaged over larger "exposure regions", as the computational complexity of the NPTF power series calculation depends on the number of bins. Although these non-contiguous regions are chosen by the similarity of the effective area of each bin within the regions, this necessarily reduces the accuracy of the likelihood evaluation further.

Figure 13: As in Fig. 10, but for the sub-bin effective area variation scenario. The CPG posteriors are clustered around the injected population, while the NPTF posteriors exhibit a small bias towards low N. The detector effect correction functions are shown for both the CPG µ(ε) construction and the NPTF ρ(f) construction: the bimodal effective area is shown clearly by µ_B(ε), while for ρ_B(ε) this structure is removed by averaging the effective area, resulting in a single mode.

In contrast, CPG requires no exposure regions, as the computational complexity of the CPG power series calculation does not depend on the number of bins, and the series is computed only once for each likelihood evaluation. As exposure regions only enter into NPTF due to computational constraints, they are not, by themselves, an intrinsic limitation of NPTF; therefore, we do not consider the effect of taking a number of exposure regions smaller than the number of pixels in this investigation.
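The two-mode counts distribution in this scenario is easy to reproduce numerically. The following sketch (with arbitrary values for the step position, the responses κ_0 and κ_1, and the source flux — none taken from Tab. I) draws source positions uniformly in a bin and compares the exact treatment against the bin-averaged response:

```python
import numpy as np

rng = np.random.default_rng(0)

F_ps = 10.0                  # flux of every source (delta-function p(F)); arbitrary value
kappa0, kappa1 = 2.0, 6.0    # detector response on either side of the step; arbitrary
split = 0.3                  # fraction of the bin with response kappa0

# Sources land uniformly in the bin; with no PSF, a source's mean counts
# are kappa(x) * F_ps, depending on which side of the step it falls.
x = rng.uniform(0.0, 1.0, size=100_000)
kappa = np.where(x < split, kappa0, kappa1)
counts = rng.poisson(kappa * F_ps)

# Averaged treatment: a single response kappa_bar for the whole bin.
kappa_bar = split * kappa0 + (1 - split) * kappa1
counts_avg = rng.poisson(kappa_bar * F_ps, size=x.size)

# The means agree (flux is conserved bin-by-bin), but the exact
# distribution is bimodal and therefore much broader.
print(counts.mean(), counts_avg.mean())
print(counts.std(), counts_avg.std())
```

The equal means and unequal widths mirror the discussion above: averaging the response preserves the total flux of the bin but not the distribution of counts per source.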
To test this scenario, the simulated vignetting was modified into a checkerboard pattern, taking values of either κ_0 or κ_1. Each square of the checkerboard was defined to be equal to the bin size, but with an offset so that the checkerboard boundaries lie within the bins of the image. This causes the effective area to change discontinuously across most bins of the image. The remaining scenario parameters are largely identical to those in Sec. IV B, with details given in Tab. I.
The results are shown in Fig. 13a. Here, NPTF fares significantly better than in the previous scenarios. However, a recognisable bias is still present in the recovered posteriors. CPG, on the other hand, is considerably closer to the true population parameters. The detector effect correction functions are shown in Fig. 13b. Clearly, the CPG µ_B(ε) has the two expected modes corresponding to the sub-bin effective area variation. In contrast, the equivalent NPTF ρ_B(ε) has a single mode corresponding to the average effective area. Thus the likelihood is driven by a lower ε, and so the average flux per source is driven up while the number of sources is driven down to compensate.
The CPG posteriors in Fig. 13a display a long tail toward high N . This same behaviour can be observed to a lesser extent in Fig. 11a (anisotropic PSF scenario). The summary Fig. 1 shows this effect most clearly. As discussed in Sec. IV, the discrimination ability of the CPG likelihood is reduced at high N as the population model approaches the diffuse limit. In contrast, population models with N less than the true value will be highly disfavoured, as they produce sources with high flux. In these scenarios, the prior on N is log-uniform, and does not discourage models with high N . Therefore, an asymmetry in the posterior toward high N is expected. This effect is also present in Fig. 10a (non-uniform template scenario), however the small number of bins with useful data causes a wide variation in the posteriors that largely hides the effect.

D. A Scenario where ρ(f ) is Valid
The three previous scenarios demonstrate the biases introduced into NPTF by the failure to correctly model the effective area, PSF, and the spatial distribution of sources, as well as how the comprehensive approach of CPG resolves these issues. When these complications are not present, however, it is expected that the NPTF construction will be equivalent to CPG. To confirm this, we consider a scenario where the population is spatially uniform, the PSF is isotropic, and the effective area is constant throughout the image. The details are given in Tab. I, and the results are shown in Fig. 14. Both NPTF and CPG display no bias, and each trial is equivalent - up to slight variations - between the two methods. This shows that, under these conditions, the more general CPG reduces to NPTF, and NPTF does indeed work correctly.

E. Realistic Scenario
Finally, we turn to a realistic scenario, where all NuSTAR detector effects are enabled, and a Poisson background is injected. This results in a realistic simulation of a NuSTAR observation. Eight observations are combined into a mosaic that exhibits the overlapping diamond pattern common to composite NuSTAR images - see Hong et al. [47] for one such instance.
For this scenario, there are two Poisson background components: the intrinsic detector background, which is uniform within each observation, and the Cosmic X-ray Background (CXB), which is a flux that is multiplied by the effective area of the instrument. Both components scale with the exposure time, and their respective rate and flux are detailed in Tab. I. The background rate was chosen to be representative of an observation with NuSTAR. The CXB flux was derived from Krivonos et al. [57] by integrating over the 3-10 keV energy window.
The CXB is incorporated into the model using a shared flux parameter in the natural coordinate system discussed in Sec. IV. The intrinsic background, in contrast, is not specified as part of the natural coordinate system. The natural coordinate system divides a total amount of flux between multiple models, but this background component is not described in terms of a flux, as it is not astrophysical in origin. Thus, it was given a separate uniform prior around the known background rate, and is parameterised by λ bkg , a count rate for the entire detector, which is multiplied by the exposure time and divided by the relative bin size to get a mean number of counts per bin. The injected source-count function was selected to be similar to the observed population of X-ray sources near the Galactic Center [47]. Further details are given in Tab. I, and the prior ranges are shown in Tab. IV.
The results - derived with both the CPG µ_B(ε) and NPTF ρ(f) constructions - are shown in the upper half of Fig. 15 as a source-count density function. Each colour represents one of six trial images, and the dashed lines show the 16% and 84% percentiles for the source-count function recovered from the posterior. The solid red line shows the true population source-count function. The top of the grey fill is the mean of the 84% percentiles for each trial, while the bottom of the grey fill is the mean of the 16% percentiles.
For the CPG construction, there is a large uncertainty on the location of the flux break itself; and, below the break, there is an underestimation in the number of sources. This is unsurprising, as the uniform prior on the flux fraction parameter between the point-source and CXB component, ω, gives weight to models where a significant amount of flux is carried by the CXB, which corresponds to the below threshold point-source flux being assigned to the CXB component.
This effect can be reduced by placing a more informative prior on ω, but this is not as easy as it may first appear. Even if the CXB flux is well known, placing a prior on ω requires prior knowledge of the relative total flux between the population and the CXB. If the prior is informed only by an estimate of this relative flux, then an incorrect value could exacerbate the effect. Alternatively, the prior on ω could be conditioned on the total flux F_T, so that the inferred total flux informs the prior on the estimated flux of the point-source population. We can also imagine a similar problem occurring for the intrinsic detector background, in that below-threshold sources could be confused with the intrinsic background model. The solution here, which has been employed in this scenario, is to use a tight prior on the intrinsic detector background based on previous measurements. This is simple to implement, as the intrinsic detector background is not specified in the natural coordinate system and thus has an independent prior.

Figure 15: The recovered source-count function (as defined in Eq. 9) for the CPG µ_B(ε) construction (top left) and the NPTF ρ(f) construction (top right) in the realistic NuSTAR scenario. Each colour represents a different random trial, with the 16% and 84% percentiles shown as dashed lines derived from the posterior. The true source-count function is shown in solid red. The grey fill shows the average percentile band, as a guide. The ρ(f) construction clearly recovers the incorrect source-count function, as the slope of the power-law is well outside all percentile bands. In comparison, the CPG construction is generally more accurate. The underestimation of sources in both methods is further discussed in the text. The same results, with the posterior conditioned to ω > 0.9, are shown in the bottom left and bottom right for CPG and NPTF respectively. With this conditioning, the recovery of the total number of sources is drastically improved for the CPG.
In the end, this is not an issue with the CPG construction; it is simply the nature of Bayesian analyses. We can see this more clearly by conditioning the posterior on ω > 0.9, which, in effect, largely removes the CXB from the model. The result is shown in the lower half of Fig. 15. The uncertainty on the flux break is greatly reduced, and the total number of sources is now accurately recovered to within the percentiles.
However, the power index, n_1, sits just outside the percentiles. This turns out to be a generally unavoidable effect. The large bin sizes cause the brightest sources to be washed out by the large number of dimmer sources. For this reason, the model cannot distinguish between a few bright sources and many dim ones, and so the posterior is extended to larger n_1, thereby pushing the true n_1 outside of the percentile interval. This effect can be reduced by reducing the bin size, so that bright sources make up a greater fraction of the total flux in each bin. Unfortunately, as discussed in Sec. III E, smaller bins will cause the statistics to be over-counted, making the posteriors artificially small.
Thus, when applying CPG to a realistic scenario with a Poisson background that has a poorly informed prior, care must be taken in interpreting the number of very low flux sources. In addition, the prior on the detector background should be chosen to incorporate as much information as possible about the background rate. Finally, one must take care in the selection of the bin size. When the goal is to recover the population parameters, smaller bins should be chosen to retain more information on the high flux sources. Any reported results should then note the between-bin correlation effect elaborated on in Sec. III E. If model selection is the aim, and marginal likelihoods are in use, it may be better to use larger bins, as a high degree of over-counting of statistics may result in an erroneous marginal likelihood.
In comparison, the NPTF ρ(f) construction has poorer performance. The same underestimation of the number of sources below the break is observed, but to a larger degree. In addition, the location of the break and the index of the power-law above the break are incorrectly estimated, and the true source-count function lies well outside all of the recovered source-count functions in the region above the break. Conditioning the posterior on ω > 0.9 does not improve the situation appreciably: the total number of sources now lies within the percentiles (although the uncertainty is much larger than when using CPG), but the region above the break is still incorrectly recovered. In this instance, while the total number of sources is approximately correct, the total flux of the population is overestimated.

VI. CONCLUSION
In this paper we have introduced a new approach to the age-old problem of point-source population inference. In particular, we constructed a new likelihood using Compound Poisson Generator (CPG) functionals. To complement this likelihood, a new natural coordinate system has been developed for parametrising Bayesian priors on point-source population models where a Poisson background component is also present. Notably, we revealed that existing prior choices often result in posteriors that assign all flux to either the point-source population or the Poisson background model. The new natural coordinate system produces posterior distributions that correctly capture the uncertainty on the population model in this circumstance. Combined with this new prior formulation, the CPG likelihood has been shown to correctly handle a series of test scenarios that each exhibit a particular edge case that is common in astronomical instrumentation. For these scenarios, a comparison was made to Non-Poissonian Template Fitting (NPTF) - a current leading parametric point-source inference method. In contrast to CPG, the NPTF method has been shown to produce biases in the recovered posterior distributions, specifically in the mean number of sources. These biases are attributed to limitations in the construction of the NPTF likelihood, which factorises the spatial distribution, PSF, and effective area. The CPG derivation shows how these contributions to the likelihood cannot, in general, be factorised. An implementation of the CPG is made publicly available here.
Our focus in the present work has been on the construction of the CPG and on demonstrating the improvements it brings. However, given the general importance of point-source studies, applying this new method to actual data is an important open direction, and may even be able to finally shed light on open questions such as the nature of the GCE.

Appendix A: The NuSTAR Observatory

NuSTAR carries two co-aligned X-ray telescopes; only one of these telescopes will be considered in order to simplify the simulation.
The optics focus X-rays via grazing reflections from two sets of concentric shells arranged in an approximate Wolter-I geometry. When an X-ray is reflected by both shells, it is correctly focused onto the detector plane. An X-ray may, however, undergo only a single reflection in the optics; these X-rays are poorly focused and are called ghost-rays. The observatory does not have an enclosed barrel, and the majority of the length of the telescope is open to space. A series of aperture stops collimate incoming X-rays, preventing most X-rays from entering the FPM through the sides of the telescope. However, a small window remains, allowing X-rays that pass close to the optics modules to strike the detector without passing through the optics. Known as stray-light, X-rays that enter the FPM this way are unfocused.
The four detectors in each FPM are arranged in a 2 × 2 grid. Each detector is composed of 32 × 32 pixels, and sub-pixel positional information is available. The gap between the detectors is small and is ignored in this study. Further details on the observatory can be found in Refs. [48][49][50][51].

Simulation
Use of NuSTAR as a test case requires a simulation of the detector in order to create observations where the parameters of the injected point-source population are known exactly. The simulation draws a number of point sources from a given differential source-count function using a Poisson distribution with mean N. The flux of each source is drawn from p(F), and the location of each source is drawn from a given spatial distribution specified by the template. The simulation constructs an image according to the NuSTAR field of view, and with an adjustable bin size. The detector response is extracted from the NuSTAR CALDB [48] FPM data using the vignetting-corrected effective area. The energy spectrum is assumed to take the power-law form given in Eq. 4 with E_0 = 1 keV and γ = 1.5, whereas the energy range and exposure time are scenario dependent.
Once the mean number of counts for a given source, S, has been determined, a number of photons is drawn from a Poisson distribution with mean S, and these photons are placed into the image according to locations drawn from the CALDB PSF. This PSF is generated by a physics-based simulation of the X-ray optics. If a photon lies outside the field of view, the photon is discarded. The NuSTAR PSF is non-isotropic, and varies according to the distance of the source from the optical axis of the telescope. At the optical axis, the PSF is radially symmetric. As the source moves towards the edge of the image, it is distorted radially, creating a fish-eye effect.
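The core of this procedure can be sketched in a few lines. The toy version below substitutes an isotropic Gaussian PSF and a user-supplied response function for the CALDB products; all names and parameter values here are illustrative stand-ins, not part of the actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_image(n_mean, draw_flux, kappa, psf_sigma, nbins=64):
    """Toy version of the simulation loop. The real pipeline uses the NuSTAR
    CALDB PSF and vignetted effective area; here a Gaussian PSF and a
    user-supplied kappa(x, y) stand in for both."""
    n_src = rng.poisson(n_mean)                    # number of sources
    image = np.zeros((nbins, nbins))
    for _ in range(n_src):
        x, y = rng.uniform(0, 1, size=2)           # uniform spatial template T(x)
        flux = draw_flux(rng)                      # flux drawn from p(F)
        mean_counts = kappa(x, y) * flux           # expected counts S
        n_ph = rng.poisson(mean_counts)
        # scatter the photons about the source according to the (toy) PSF
        px = rng.normal(x, psf_sigma, n_ph)
        py = rng.normal(y, psf_sigma, n_ph)
        # discard photons that land outside the field of view
        keep = (px >= 0) & (px < 1) & (py >= 0) & (py < 1)
        hist, _, _ = np.histogram2d(px[keep], py[keep], bins=nbins,
                                    range=[[0, 1], [0, 1]])
        image += hist
    return n_src, image

n_src, image = simulate_image(
    n_mean=50,
    draw_flux=lambda r: r.exponential(5.0),
    kappa=lambda x, y: 4.0,
    psf_sigma=0.01,
)
```

A multiple-observation mosaic would repeat this loop per observation and bin the combined photon list, as described below.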
Apart from the point-source population, multiple other effects can cause the real or apparent detection of photons. Astrophysical backgrounds and foregrounds may include extended sources such as gas clouds, or very bright point sources that are not part of the population, such as magnetars. These can be modelled as Poisson distributions with appropriate flux maps; as they are highly scenario dependent, these effects are not modelled in this investigation. Ghost-rays and stray-light are usually only visible when bright sources are in or around the FOV; assuming the sources can be located, their contribution can be modelled by a Poisson distribution with an appropriate flux map. The effect of ghost-rays from population sources is incorporated into the PSF by construction. The CXB is a significant source of stray-light induced background. This stray-light contribution can be modelled much like other astrophysical backgrounds - with a Poisson distribution using an appropriate CXB flux. As with the other sources of astrophysical background, the effect of stray-light CXB is not considered in this investigation. NuSTAR also suffers from a detector background caused by space-borne radiation. This background enters through the FPM shielding, and is not associated with the X-ray optics. Thus, the background is uniformly distributed across each detector, while varying between the detectors in the FPM. It can be simulated by drawing counts from a Poisson distribution according to an appropriate background count rate.
Position information of photons beyond the detector pixelisation is not available; however, in this simulation, the effect of pixelisation is not considered and the photons are directly processed into bins. The size of these bins is scenario dependent, but they are generally much larger than the detector pixels; thus the effect of pixelisation is considered small, and the additional complications involved in the pixel subdivisions are avoided.
If the scenario calls for multiple observations, then the above procedure is repeated for each observation, except for the final binning step. This creates a list of photons from all observations, which are then transformed into a shared coordinate system and binned into a shared binning scheme. An example of this is shown in Fig. 3, with a binning scheme chosen to be approximately the same as the NuSTAR sub-pixelisation, in order to show the PSF.

Appendix B: Generating Functions and Functionals
In this appendix, a full derivation of the CPG generating function without approximations will be presented. From this generating function, an unbinned likelihood and a binned likelihood can be derived, both taking into account correlations between pixels. Unfortunately these likelihoods are intractable, and so the derivation in Sec. III is shown to be a tractable mean-field approximation for the likelihoods here.
Generating functions are a useful representation for a discrete probability distribution when complex constructions are required. In this appendix we provide a more expansive discussion of these objects than appeared in the main body. The generating function, G(z), is defined as the z-transform of a discrete distribution, P(n), over its support N:

G(z) = Σ_{n ∈ N} P(n) z^n.   (B1)

Typically, N ⊂ N_0, with inf N = 0. The probability distribution can be recovered from the generating function via higher-order derivatives evaluated at z = 0:

P(n) = (1/n!) (d^n G/dz^n)|_{z=0}.

Chief among the generating function's useful properties is the sum-of-random-variates rule. Let each X_i be one of M random variates, drawn from the distribution P_i(x) with generating function G_i(z). The generating function for the sum of these variates is given by

G_{X_1 + ⋯ + X_M}(z) = Π_{i=1}^{M} G_i(z).

This property has an analogy in the characteristic function of continuous distributions: where the distribution for a sum of two random variates is the convolution of the two distributions, the characteristic function is defined as the Fourier transform of the distribution function, so that the characteristic function for the sum of the variates is the product of the characteristic functions via the convolution theorem.
If the X_i are identically distributed - that is, drawn from the same distribution P_X(x) with generating function G_X(z) - and the number of variates to sum, M, is itself a random variate drawn from a distribution P_M(m) with generating function G_M(z), then the generating function for the sum is provided by the nested expression

G_K(z) = G_M(G_X(z)).

When P_M(m) is a Poisson distribution, G_K(z) is known as a compounded generator. As a simple example, for the Poisson distribution, with λ the mean of the distribution, the generating function is

G(z) = exp(λ(z − 1)).

Poisson distributions count the number of events that occur in a defined interval - most often in time, but we are also interested in space intervals. The above properties work well when the underlying Poisson process - the mechanism that generates these events - is homogeneous in this interval. When the Poisson process is inhomogeneous, a generating functional can prove more useful. An inhomogeneous Poisson process is defined by an intensity function Λ(x) with support Ξ. The integral of this intensity function over the support gives the expected number of events, λ = ∫_Ξ dx Λ(x). Thus, the distribution for the number of events is a Poisson distribution with mean λ. The generating functional for the process in this case is defined as

G[f] = exp( ∫_Ξ dx Λ(x) (f(x) − 1) ),

where f(x) is a functional argument defined over the same support Ξ. The probability density function for finding an event at y is given by the variational derivative of this functional:

p(y) = δG[f]/δf(y) |_{f=0},

where 0(x) = 0. The probability density for a set of events {y_i} is the higher-order variational derivative

p({y_1, …, y_n}) = δ^n G[f] / (δf(y_1) ⋯ δf(y_n)) |_{f=0}.

The Poisson process is one kind of process that generates events - known generally as a point process. The concept of a generating functional can be extended to all point processes.
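These composition rules are easy to verify numerically by representing generating functions as truncated power series. In the sketch below, the choice of a Poisson G_M and a Bernoulli G_X is arbitrary; the coefficients of G_M(G_X(z)) are compared against a Monte-Carlo estimate of the compound distribution:

```python
import math
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(2)
nmax = 40

# G_M: Poisson(lam) generating function, truncated to nmax series terms.
lam = 3.0
gM = Polynomial([math.exp(-lam) * lam**n / math.factorial(n) for n in range(nmax)])

# G_X: Bernoulli(p) generating function, G_X(z) = (1 - p) + p z.
p = 0.4
gX = Polynomial([1 - p, p])

# Compound distribution: coefficients of the composition G_M(G_X(z)).
compound = gM(gX).coef

# Monte-Carlo check: sum of M Bernoulli variates with M ~ Poisson(lam).
m = rng.poisson(lam, size=200_000)
k = rng.binomial(m, p)
mc = np.bincount(k, minlength=len(compound)) / len(k)

print(np.max(np.abs(compound[:10] - mc[:10])))  # should be of Monte-Carlo error size
```

A Poisson number of Bernoulli trials is itself Poisson with mean λp (Poisson thinning), so the zeroth coefficient can also be checked against exp(−λp).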
The generating functional has a similar compounding property to the generating functions. Let G_X be the generating functional for a point process, while G_M is the generating functional for another point process. If G_M, rather than directly generating events, instead generates point processes according to G_X, then draws from G_M will generate multiple events according to G_X. The generating functional for this compounded process is

G[f] = G_M[G_X[f]].

For a point-source population, the intensity function is such that the intensity is distributed according to the spatial distribution, T(x), and the integral of the intensity function over the support of the spatial distribution is exactly the mean number of sources:

Λ(x) = N T(x),   ∫ dx Λ(x) = N.

Now, let G_{x|N,T} be the generating functional for sources from the population,

G_{x|N,T}[f] = exp( N ∫ dx T(x) (f(x) − 1) ),

so that events drawn from this process define individual sources located at location x. Each source, accordingly, defines its own generating functional, G_{y|x,F}[g], from which events define counts detected at location y, given a source of flux F located at x. The population-averaged source functional is thus

G_{y|x}[g] = ∫ dF p(F) G_{y|x,F}[g].

The total count generating functional is now a compound of these two generators [52]:

G_{y|N,T,p(F)}[g] = G_{x|N,T}[ G_{y|x}[g] ].

The unbinned likelihood can be calculated using G_{y|N,T,p(F)}; here, however, we are interested in the binned likelihood. Consider the trivial generating function G(z) = z - a distribution with p(0) = 0 and p(1) = 1. Notice that

G_{x|N,T}[z] = exp( N (z − 1) );

that is, the compound of the generating functional G_{x|N,T} with the generating function G(z) gives the generating function for the number of events generated by the Poisson process. We can interpret this as a compounding of a point process with a counting process which records 100% of events, as p(1) = 1. Direct substitution of this trivial generating function into G_{y|N,T,p(F)} would give the total number of counts in the entire image, as the integral ∫ dy runs over the entire support of η(y).
To get the number of counts, k_B, in bin B, we could perform this substitution and then alter the support of η(y) to Ω_B. An alternative is to consider the slightly less trivial generating function

G(z|y) = 1 + 1_{Ω_B}(y) (z − 1),

where 1 is the indicator function, equal to one when y ∈ Ω_B and zero otherwise. Thus the probability of recording an event conditioned on the photon location y is p(1|y) = 1_{Ω_B}(y), which is equal to one when the photon is inside the bin and zero otherwise. The conditional probability of discarding an event is p(0|y) = 1 − 1_{Ω_B}(y), as expected. Substituting this into G_{y|N,T,p(F)} gives the generating function for the number of counts in bin B and, after rearranging, the µ_B(ε) measure defined in Eq. 33 can be used to bring this into the expected form. For multiple bins, the indicator generating function is

G({z_i}|y) = 1 + Σ_i 1_{Ω_{B_i}}(y) (z_i − 1).

Substitution of this into G_{y|N,T,p(F)} gives the multiple-bin generating function. However, evaluation of the probabilities for such a generating function would not only require a multidimensional µ({ε_i}) with an effective detector response for each bin, but would also require computing many cross terms between the z_i variables. For more than a few bins, this quickly becomes computationally untenable. For this reason, the whole-image likelihood is constructed by taking the product of the one-bin likelihoods as in Eq. 39 - essentially a mean-field approximation, where each bin is treated as statistically independent, removing the correlations.
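The indicator substitution is equivalent to Bernoulli thinning of the point process: events inside Ω_B are kept with probability one and events outside are discarded, leaving a Poisson count whose mean is the intensity integrated over the bin. A numerical sketch with a toy linear intensity (all values arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)

# Inhomogeneous Poisson process on [0, 1] with intensity Lambda(x) = 200 x,
# simulated by thinning a homogeneous process with rate Lambda_max = 200.
def draw_events():
    n = rng.poisson(200.0)
    x = rng.uniform(0, 1, n)
    return x[rng.uniform(0, 1, n) < x]   # keep with prob Lambda(x)/Lambda_max = x

bin_lo, bin_hi = 0.2, 0.5
counts = [np.sum((e >= bin_lo) & (e < bin_hi))
          for e in (draw_events() for _ in range(20_000))]

# The indicator-compounded generating function predicts Poisson counts with
# mean equal to the integrated intensity over the bin.
expected = 200.0 * (bin_hi**2 - bin_lo**2) / 2   # integral of 200 x over the bin
print(np.mean(counts), expected)
```

The sample mean converges to the integrated intensity, as the single-bin generating function predicts.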

Appendix C: Evaluation of the Power Series
Here we demonstrate how to analytically evaluate the CPG generating function when the source-count function takes the form of a multiply broken power-law. Firstly, we write the generating function for a point-source population as

G_B(z) = exp( ∫ dε µ_B(ε) g(ε, z) − N ),

where

g(ε, z) = ∫ dF e^{εF(z−1)} N p(F).
As stated, we assume a broken power-law differential source-count function, equivalent to the broken power-law definition in Eq. 46, with the c_i introduced to account for the normalisation factors.
We postpone a discussion of the case of j = 0 for now. Note that g_j(ε, 0) is not a function of B, the bin number. This allows the z-series of g_j to be computed once for the entire image; all of the bin-specific detector effects are contained in µ_B, which requires only a simple numerical integration against this series for each bin. This can be many orders of magnitude more computationally efficient than NPTF, which requires the z-series to be computed for every bin. The evaluation of upper and lower incomplete gamma functions is computationally expensive, and must be performed using numerical approximations. Thus, it is desirable to avoid evaluating these special functions for every term in the z-series. There exists a recurrence relation for the incomplete gamma functions, which is best exploited in terms of the scaled σ functions, σ̃:

σ̃_i^{j+1}(ε) = σ_i^{j+1}(ε, 0)/(j + 1)! = (1 − n_i/(j + 1)) σ̃_i^j(ε) + exp( j ln J_i(ε, 0) − J_i(ε, 0) − ln Γ(j + 2) ).

For σ̃_1 and σ̃_i, the base case of j = 0 is first computed from

σ_1^0(ε, z) = J_2(ε, z)^{n_1 − 1} Γ(1 − n_1, J_2(ε, z)) = E_{n_1}(J_2(ε, z)),   (C17)

where the exponential integral, E_n(x), is used to stabilise the calculation. Then, the higher-order σ̃_1 and σ̃_i terms are found using the above recurrence relations. For σ̃_m, the initial case starts from the highest-order term required, which for the whole image is j = max_B k_B. This is computed using Eq. C8, then the lower-order terms are found using the recurrence relations. The power series coefficient for j = 0 also includes the constant factor of N which appeared in Eq. C1. If g_0 is calculated using the above procedure, this factor of N will cause a catastrophic cancellation for very dim (i.e. below threshold) populations. This cancellation arises when a term on the right-hand side takes a value very close to N, such that it is equal to N up to the machine precision of the floating-point data type used in the computation.
A resolution to this cancellation is achieved by incorporating the constant factor into the calculation of the exponential integral. The expected number of sources, N, is first evaluated in closed form in terms of the source-count function parameters. Each term in this expression is then incorporated into the corresponding σ̃^0 function, to give the modified σ terms. In the very dim, below-threshold regime, J_i(ε, 0) ≪ 1, and the exponential integral becomes dominated by a leading constant factor; thus, the cause of the numerical instability is revealed. Define a modified exponential integral function, Ẽ_n(x, y) = y + E_n(x), in which the constant offset is supplied as an explicit argument; the modified σ terms can then be written in terms of Ẽ_n. The numerical algorithm for approximating the exponential integral is based on Navas-Palencia [58]. This algorithm was modified to calculate Ẽ_n(x, y). For x ≪ 1, series calculations are used by Navas-Palencia [58]; during evaluation of the first series term, y is added, so that the modification occurs before subsequent terms are added. The result is that the first term of the series is cancelled, allowing subsequent terms to accumulate without being lost in the machine precision. For other scales of x, non-series calculations are used, and so y is simply added to the final result - an acceptable solution, as catastrophic cancellation does not occur in these regimes.
When J_i(ε, 0) ≫ 1, the exponential integral algorithm may overflow. This corresponds to populations with unrealistically high flux per source. If this happens, a NaN is propagated through the likelihood evaluation algorithm and is returned as the final probability. This allows this failure mode to be caught and reported. In these investigations, the correct course of action was to simply reduce the range of the flux prior, as the problem was always caused by the MCMC initialisation drawing these unrealistically large flux values from the uniform prior.
With these modifications in place, the scaled power series coefficients can be defined in terms of the modified σ̃ functions; we will use these expressions in App. D.

Appendix D: Evaluation of Bell Polynomials
The probability for k_B counts from the CPG likelihood was shown to be given by

P(k_B) = (P(0)/k_B!) B_{k_B}(1! ν_1, 2! ν_2, …, k_B! ν_{k_B}).

This expression is written in terms of the Bell polynomials, which obey the recurrence relation

B_{n+1}(x_1, …, x_{n+1}) = Σ_{i=0}^{n} C(n, i) B_{n−i}(x_1, …, x_{n−i}) x_{i+1},

with B_0 = 1, where C(n, i) is the binomial coefficient. Given this, we can write

P(k_B + 1) = (1/(k_B + 1)) Σ_{i=0}^{k_B} (i + 1) ν_{i+1} P(k_B − i),

where ν are the scaled power series coefficients from App. C. This recurrence relation is equivalent to that found in Mishra-Sharma et al. [55].
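A minimal implementation of such a recurrence is straightforward. The sketch below uses generic compound Poisson coefficients ν_j = λ p_X(j) as illustrative stand-ins for the scaled coefficients of App. C; as a sanity check, a population in which every draw contributes exactly one count must reduce to a plain Poisson distribution:

```python
import math

def compound_poisson_pmf(lam, px, kmax):
    """Evaluate P(0..kmax) for a compound Poisson distribution via the
    Bell-polynomial recurrence: K is a sum of M iid draws from px, with
    M ~ Poisson(lam). The nu[j] = lam * px[j] play the role of the scaled
    power series coefficients."""
    nu = [lam * px[j] if j < len(px) else 0.0 for j in range(kmax + 1)]
    p = [0.0] * (kmax + 1)
    p[0] = math.exp(nu[0] - lam)          # P(0) = G(0)
    for k in range(1, kmax + 1):
        p[k] = sum(j * nu[j] * p[k - j] for j in range(1, k + 1)) / k
    return p

# Sanity check: if every draw contributes exactly one count, the compound
# distribution reduces to a plain Poisson distribution.
pmf = compound_poisson_pmf(2.5, [0.0, 1.0], 10)
poisson = [math.exp(-2.5) * 2.5**k / math.factorial(k) for k in range(11)]
print(max(abs(a - b) for a, b in zip(pmf, poisson)))
```

Each P(k) costs O(k) operations given the earlier terms, so the full set up to k_max costs O(k_max²), with no special functions evaluated inside the loop.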
The numerical calculation of P(k_B) is performed logarithmically, in terms of ln P(k_B), so as to ensure that numerical underflow does not occur if the scale of the likelihood is below that of the machine precision of the floating-point data type used. A standard exponential-sum algorithm is employed, which determines the scale of the sum terms in advance; the terms are then rescaled to avoid underflow before the summation is computed; lastly, the final result is scale corrected.
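A minimal version of such an exponential-sum algorithm in log space (the standard log-sum-exp rescaling, shown as an illustration rather than the paper's exact implementation):

```python
import math

def log_sum_exp(log_terms):
    """Sum terms given by their logarithms without underflow: rescale by
    the largest term, sum the rescaled terms, then correct the scale."""
    m = max(log_terms)
    if m == float("-inf"):
        return float("-inf")
    return m + math.log(sum(math.exp(t - m) for t in log_terms))

# Terms far below machine precision still combine correctly in log space.
print(log_sum_exp([-1000.0, -1000.0]))   # -1000 + ln 2
```

Summing the same two terms directly would give exp(-1000) + exp(-1000) = 0 in double precision, and the logarithm of the result would be -inf.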
Appendix G: The Natural Coordinate System

In this appendix we provide additional details regarding the natural coordinate system introduced in Sec. IV of the main text. The mean number of sources is taken to be N, while, here, F_PS will be the total amount of flux emitted by the population of sources. There will be m − 1 breaks, the locations of which will be defined by m − 2 numbers β_i ∈ (0, 1), where

β_i = F_b(i+1)/F_b(i)   (G2)

is the ratio of the flux location of the (i + 1)th break to that of the ith break. The conversion between this coordinate system and the standard coordinate system of Eq. 46 requires finding A and all F_b(i) in terms of N, F_PS, and all β_i. To do so, let

β̃_i = F_b(i+1)/F_b(2)

be a similar ratio to β_i, but taken with respect to the flux location of the first break, F_b(2), instead. The mean number of sources, N, can then be written in terms of A, F_b(2), and these ratios (Eq. G4), and there is a similar expression for the total flux of the population, F_PS (Eq. G6). From the ratio of Eqs. G6 and G4 we can find F_b(2); then, from either of these equations, we can find A. Finally, all remaining flux break locations, F_b(i), can be found through Eq. G2.

Table I: Settings for each scenario - Prior demo, Non-uniform dist., Anisotropic PSF, Sub-bin eff. area, ρ(f) scenario, and Realistic - including the spatial distribution used in each.

Table II: Prior ranges for the demonstration using the standard differential source-count function parameterisation (columns: Parameter, Lower limit, Upper limit).

Table III: Prior ranges for the demonstration using the natural coordinate system for specifying the priors.