maxsmooth: Rapid Maximally Smooth Function Fitting With Applications in Global 21-cm Cosmology

Maximally Smooth Functions (MSFs) are a form of constrained functions in which there are no inflection points or zero crossings in high order derivatives. Consequently, they have applications to signal recovery in experiments where signals of interest are expected to be non-smooth features masked by larger smooth signals or foregrounds. They can also act as a powerful tool for diagnosing the presence of systematics. The constrained nature of MSFs makes fitting these functions a non-trivial task. We introduce maxsmooth, an open source package that uses quadratic programming to rapidly fit MSFs. We demonstrate the efficiency and reliability of maxsmooth by comparison to commonly used fitting routines. We show that by using quadratic programming we can reduce the fitting time by approximately two orders of magnitude. maxsmooth features a built-in library of MSF models and allows the user to define their own. We also introduce and implement with maxsmooth Partially Smooth Functions, which are useful for describing elements of non-smooth structure in foregrounds. This work has been motivated by the problem of foreground modelling in 21-cm cosmology for which MSFs have been shown to be a viable alternative to polynomials. We discuss applications of maxsmooth to 21-cm cosmology and highlight this with examples using data from the Experiment to Detect the Global Epoch of Reionization Signature (EDGES) and the Large-aperture Experiment to Detect the Dark Ages (LEDA) experiments. MSFs are applied to data from LEDA for the first time in this paper. maxsmooth is pip installable and available for download at:https://github.com/htjb/maxsmooth


INTRODUCTION
Maximally Smooth Functions (MSFs), functions with no inflection points or zero crossings in higher order derivatives, were first proposed by Sathyanarayana Rao et al. (2015) for modelling foregrounds in experiments to detect spectral signatures from the Epoch of Recombination. They are designed for modelling smooth structures in experimental data that are several orders of magnitude larger than non-smooth signals of interest and to leave behind signals and, where present, systematics in residuals (see also, Sathyanarayana Rao et al. 2017). MSFs can be considered part of a family of functions, which we refer to as Derivative Constrained Functions (DCFs) and includes functions with no turning points, Completely Smooth Functions and functions with a select number of non-zero crossing E-mail: htjb2@cam.ac.uk high order derivatives, Partially Smooth Functions (PSFs). We refer to the high magnitude smooth components of the data as foregrounds throughout this paper.
Our primary focus here is the application of DCFs to the field of Global 21-cm cosmology. Shaver et al. (1999) suggested that reionization of neutral hydrogen in the early universe would result in a sharp step in the global spectrum of the sky, and that this signal should be separable from the smooth spectrum emission that dominates the sky temperature at radio wavelengths, 70 − 240 MHz. Pritchard & Loeb (2010) showed that the foreground emission, if modelled using a low-order polynomial, could be subtracted from the global sky spectrum to retrieve signals from the Epoch of Reionization (EoR) of order 100 mK; Harker et al. (2012) and Bernardi et al. (2015) expounded on this work, including further instrumental effects.
In comparison to unconstrained polynomials, DCFs are better able to separate the smooth foreground spectra from the anticipated EoR signals and instrumental systematics (Sathyanarayana Rao et al. 2017). This motivates their use in Global 21-cm cosmology experiments such as; REACH (Radio Experiment for the Analysis of Cosmic Hydrogen, de Lera Acedo 2019), SARAS (Shaped Antenna measurement of the background RAdio Spectrum, Singh et al. 2018a), EDGES (Experiment to Detect the Global Epoch of Reionization Signature, Bowman et al. 2018), LEDA (Largeaperture Experiment to Detect the Dark Ages, Price et al. 2018), PRIZM (Probing Radio Intensity at High-Z from Marion, Philip et al. 2019), BIGHORNS (Broadband Instrument for Global HydrOgen ReioNisation Signal Sokolowski et al. 2015), SCI-HI (Sonda Cosmológica de las Islas para la Detección de Hidrógeno Neutro Voytek et al. 2014) and MIST (Mapper of the IGM Spin Temperature, http://www.physics.mcgill.ca/mist/).
Specifically, the Global 21-cm signal is the sky averaged temperature deviation between the cosmic microwave background (CMB) and the spin temperature of hydrogen gas during the EoR and the period of cosmic history known as the Cosmic Dawn (CD). The physics of the Global 21-cm signal have been extensively reviewed and what follows is a brief summary of the structure-defining processes. For further details see Furlanetto et al. (2006), Pritchard & Loeb (2012) and Barkana (2016).
During the CD the first stars begin to form in halos that have accumulated mass under gravity. In the following EoR the neutral hydrogen gas becomes completely ionised by the ultraviolet emission from the first luminous sources. The structure of the 21-cm signal is defined by various astrophysical processes including the adiabatic cooling and collisional coupling of the neutral hydrogen and the gas, the Wouthuysen-Field (WF) effect, X-ray heating and ionization.
At high redshifts, during the dark ages of the universe, collisions between neutral hydrogen, other hydrogen atoms, electrons and protons couple the spin temperature to the gas temperature. The gas cools adiabatically and at a faster rate than the CMB producing an absorption against the CMB before the first stars begin to form. As the universe expands the density of baryons reduces and collisional coupling becomes inefficient. The spin temperature is driven consequently back to the CMB temperature by radiative coupling.
Once the first luminous sources begin to form in the CD, the WF effect begins to become important. The absorption and re-emission of Ly-α photons from the first luminous sources by neutral hydrogen causes spin flip transitions and drives the distribution of hydrogen atoms in the excited and the ground 21-cm states (Wouthuysen 1952;Field 1959). As a result, the spin temperature couples to the gas temperature again, which has continued to cool adiabatically producing another absorption trough against the CMB.
X-ray sources heat the gas at later times and, if sufficient heating occurs, this causes the gas temperature to exceed the CMB temperature producing an emission above the CMB. The primary sources of X-ray emission during this epoch are thought to be X-ray binaries (e.g. Fragos et al. 2013). Finally, at lower redshifts the neutral hydrogen gas is ionized by UV emission. As the gas becomes completely ionized the signal disappears against the background of the CMB. This period and the subsequent reheating and reionization of the gas are the focus of the experiments listed above.
The above processes do not occur at independent epochs and do not start and stop instantaneously. Consequently, the structure of the signal is determined by the interplay between these mechanisms and by the change in the dominant processes with time. The exact timing and intensity of the signal is only broadly understood within a theoretical parameter space (Cohen et al. 2017; Singh et al. The MSF fits the data to a higher degree of accuracy and reveals a systematic that has been partially removed by the polynomial as part of the foreground. The polynomial is given by equation (2) of Bowman et al. (2018) and is taken to be 5 th order. We use the best fitting 11 th order MSF from the built-in library in maxsmooth to illustrate the quality of fit recovered.
2018b; Monsalve et al. 2019;Cohen et al. 2020). Experiments that search for the Global 21-cm signal are attempting to detect a signal, according to standard ΛCDM cosmology, approximately 250 mK in foregrounds of up to 10 4 − 10 5 orders of magnitude brighter. These high-magnitude foregrounds are dominated by synchrotron and free-free emission in the Galaxy and extragalactic radio sources which have smooth power law structures. Modelling of these foregrounds without signal loss is essential for an accurate detection and not always possible with unconstrained polynomials. However, unconstrained polynomials and linear combinations of unconstrained polynomials remain the traditionally used foreground model in 21-cm experiments (Bowman et al. 2018;Singh et al. 2018b;Monsalve et al. 2019).
The bandwidth is determined by the intrinsic frequency of the 21-cm transition, 1420.4 MHz, which is redshifted by the expansion of the universe. Studies of Gunn-Peterson troughs in quasar spectra and of the CMB anisotropies put the end of the EoR at z ≈ 6 (Becker et al. 2001;Spergel et al. 2007;Planck Collaboration et al. 2018). It is predicted that the onset of star formation occurred at z ∼ 30 (Abel et al. 2002) and consequently the bandwidth of interest for 21-cm cosmology is approximately 50 − 200 MHz. Fig. 1 shows an example of the application of MSFs to 21-cm cosmology. Here we have fitted publicly available data from the EDGES low band experiment with an MSF and a 5 th order polynomial of the form given by equation (2) in Bowman et al. (2018). The MSF is shown to fit the foreground to a higher degree of precision and potentially reveal a sinusoidal systematic which has been previously identified in the data (Hills et al. 2018;Singh & Subrahmanyan 2019;Sims & Pober 2020). A more detailed discussion of the EDGES data can be found in section 5.2.
The constrained nature of DCFs, namely that specific derivatives do not cross zero in the domain of interest, makes fitting these functions a non-trivial task. While this has been historically performed with optimization routines such as Basin-hopping (Wales & Doye 1997) and Nelder-Mead (Nelder & Mead 1965) we find that the use of quadratic programming (Nocedal & Wright 2006) is considerably more computationally efficient and reliable. Our DCF code, maxsmooth is therefore based on quadratic programming and uses the Python based convex optimisation code, CVXOPT (Anderson et al. 2018). A discussion of quadratic programming can be found in App. A.
The constraints on a DCF are not explicitly linear but are piecewise linear with various combinations of positive and negative signs on the high order derivatives. For low order, N, DCFs testing every combination of positive and negative signs is a computationally inexpensive task. However, this becomes increasingly time consuming with increasing N and maxsmooth uses a cascading routine in combination with a directional exploration to quickly search the discrete sign spaces.
DCFs can be formed from a variety of different basis functions and maxsmooth has a built-in library. The library is not intended to be complete, and the user can implement their own basis functions. For basis functions in which the number of high order derivatives is not finite, maxsmooth constrains derivatives up to order m = N − 2. MSFs form the basis of the analysis performed in this paper and we focus on their uses and applications. However, the description of MSFs can be more broadly applied to DCFs.
In section 2 we describe MSFs in more detail and give examples using the built-in DCFs in maxsmooth. In section 3 we discuss the application of quadratic programming to DCF fitting with reference to CVXOPT and the piecewise linear constraints on the derivatives. Section 4 discusses the fitting algorithm implemented by maxsmooth and compares its efficiency to alternative optimization routines. In section 5 we discuss the use of PSFs in 21-cm cosmology. This discussion is then followed by the application of the fitting routine to the EDGES low band data (Bowman et al. 2018) and data from LEDA (Price et al. 2018). We conclude in section 6 highlighting the particular applications of maxsmooth.

MAXIMALLY SMOOTH FUNCTIONS
MSFs are functions which feature no inflection points or zero crossings in higher order derivatives (see Sathyanarayana Rao et al. 2015Rao et al. , 2017. The coefficients of the basis functions are constrained such that the m th order derivative satisfies where x and y define the independent and dependent variables and for MSFs m ≥ 2. More generally for DCFs m can be greater or equal to any value or equal to a select set of derivative orders. maxsmooth features seven built-in DCFs which we use for fitting. Their functional forms and derivatives are shown in Tab. 1. Generally, the DCF functions can be decomposed in terms of basis functions, φ and parameters, a k as For the first DCF shown in Tab. 1, the Normalised Polynomial model, the basis functions are given by where y 0 and x 0 correspond to a pivot point, defaulted to the midpoint, in the data sets. The normalised nature of this polynomial model ensures that the fit parameters, a k , are of order unity. Here N is the order of the DCF and can take on any value. However, for a given model and data set there is a limiting value beyond which a further increase in N does not increase the quality of the fit and this is illustrated in section 5.2 and Fig. 10. The DCF model will have powers from 0 to N − 1. Two more basis functions built-in to maxsmooth are given by the Polynomial and Difference Polynomial models where the latter is based on the basis function used in Sathyanarayana Rao et al. (2017). The built-in set of models is not meant to be complete with the intention for it to be extended in the future.
The fourth basis function built-in to maxsmooth, Log Polynomial, produces an MSF in y − log 10 (x) space. maxsmooth is also capable of fitting a DCF in log 10 (y) − log 10 (x) space given in Tab. 1 as the Log Log Polynomial model. In this instance the function is constrained by derivatives in log 10 (y) − log 10 (x) space. This can be advantageous in situations where the foregrounds are expected to take on a power law structure.
The penultimate basis function in the maxsmooth library of models is built from the orthogonal Legendre Polynomials, P k (z), where z is a variable of length y over the range [-1, 1]. The m th order derivatives of this model are determined by the Associated Legendre Polynomials, P m k (z). By definition the Legendre polynomials are a linear combination of the basis functions of the Normalised Polynomial model. This is true also for the Polynomial model and less trivially for the Difference Polynomial model.
Typically the basis functions are designed so that equation (9) has a finite number of high order derivatives and is consequently polynomial in nature. However, more elaborate models with an infinite number of derivatives are plausible if we consider these functions to be maximally smooth when all derivatives with 2 ≤ m ≤ N − 2 are constrained. The final DCF model built-in to maxsmooth, the Exponential DCF, is an example of this with exponential basis functions. This model fails at high N where the exponential cannot be computationally calculated. However, it is a useful example of a DCF with infinite derivatives and performs well with low values of N. The exact value of N at which this basis begins to fail is determined by the magnitude of the x data. An alternative example of a basis with infinite derivatives that maxsmooth is capable of fitting would be a polynomial function with non-integer powers.
Generally, the form of the basis function is important in determining the quality of the residuals and careful exploration of the basis functions are needed in order to draw sensible conclusions about the data set. Again, this is illustrated with an example in section 5.2. We also note that DCFs fitted in y − log 10 (x) space, log 10 (y)−log 10 (x) space, y−x space or y−z space are not equivalent since the form of the constraints and the function that we minimise are different in each case. This is discussed further in section 3.
With appropriate normalisation maxsmooth will be able to transform any basis function into a 'standard' form, which can be solved easily and transformed back into the initial basis function choice. Designing and automating such a normalisation is the subject of ongoing work. Provided this 'standard' form is chosen well such that it will always return the best quality fits and is computationally solvable with quadratic programming, the initial choice of basis function will largely be negated. Its form will only be determined by the need of the user to model their foreground using a specific model. For example, in 21-cm cosmology this specific model may be a linearised physical model of the data fitted as an MSF. While normalisation remains absent in maxsmooth, the user has the ability to input normalised x and y data.
For quadratic programming, the method used here to fit MSFs, it is useful to reformulate equation (9) as a matrix equation. Explic-Name Function Derivatives itly we have discrete data points y i and x i which means that φ k (x) forms a two dimensional matrix, Φ. The matrix of basis functions has dimensions (D × N) where D is the length of y and N, as before, is the order of the function. We write this as where φ ik = φ k (x i ). We can summarise this as where a is a column vector of length N representing the parameters. For the polynomial basis function in equation (3), the element φ D0 , or φ 0 (x D ), has the form x 0 D and φ 0(N −1) has the form x (N −1) 0 . Reformulating equation (1) in terms of matrices for quadratic programming is more complicated. If we take the definition of the condition with the derivative ≤ 0 and write this in the form of a matrix for a given derivative order m we find where both matrices are columns of length D. Each row in the derivative matrix corresponds to an evaluation of the m th order derivative for a given y i and x i . We can expand the elements of the derivative matrix out into a matrix of derivative prefactors, G, and the matrix of parameters a, as in equation (12), which is useful for implementing quadratic programming. This is best illustrated with an example, we will look at the simple case of N = 3 with one constrained derivative m = 2. We will say that our data sets have a length D = 4 and choose the simplest functional form for our MSF given by equation (3). In this case G is given by for the range k = 0 up to but not including N − m. For this problem k has only one value 0 which would produce a column matrix of elements of length D. However, since we need to multiply this by the column matrix a of length N then the matrix G should have dimensions of D × N. The additional elements in this instance are 0 so that the product of these elements with the corresponding elements of a equals 0. For example here the evaluation of the second order derivative Ga for the first data element will be Generally if the row elements of G have a position from 0 − (N − 1) then the elements with position ≤ m − 1 will be 0. The matrix G for our specific problem then becomes which when multiplied by a gives us the evaluation of the derivatives as a column matrix with length D = 4. For quadratic programming we need one matrix expression for all of the constraints on our function. Our definition of G scales with the order of the DCF so that for N = 4 we have which includes the prefactors on both the m = 2 and m = 3 derivatives. We can, therefore, re-write equation (1) for ≤ 0 and equation (13) as where generally G will have shape (CD)× N and 0 will have a length CD. Here C is the total number of constrained derivatives and in the two examples above C = 1 and C = 2 respectively. We can write the first case in equation (1) in one of two ways as Ga ≥ 0 or as −(Ga) ≤ 0. For the implementation of quadratic programming used in maxsmooth the second is the most useful and a full discussion of this can be found in section 3.
In some cases we find that a DCF with one or more high order derivatives free to cross zero is needed to better fit the data. It is to this effect that the potential to allow zero crossings to the fit is built-in to maxsmooth. However, maxsmooth will not force zero crossings and produce a PSF if it can find a better solution without the need.

FITTING DERIVATIVE CONSTRAINED FUNCTIONS USING QUADRATIC PROGRAMMING
We provide a brief overview of quadratic programming in App. A and what follows is a discussion of the specific problem of fitting DCFs with quadratic programming. When fitting a curve using the least-squares method we minimize where y fit, k denotes the elements of the fitted model. We can substitute equation (12) for the fitted model and re-write this in terms of matrices as where we are looking for solutions of the parameters a that minimise χ 2 (a). When expanded out this becomes Since y T y is a constant it is irrelevant for the minimisation problem and we can ignore it. We can also divide through by the factor of 2 and this leaves where Q = Φ T Φ and q T = − y T Φ.
As previously discussed in section 2, the constraint in equation (1) is not explicitly linear but is two separately testable linear constraints. The quadratic program solver CVXOPT minimizes equation (22) subject to equation (18). It requires G and 0 as inputs which explains the motivation behind defining the stacked matrix of derivatives as Ga. In CVXOPT the identity is fixed and we cannot directly constrain the problem via a greater than or equals inequality.
We can initially force all the derivatives to be positive, the first of the two conditions in equation (1), by multiplying each element in G by a negative sign as discussed in the previous section. However, it will not necessarily be the case that the optimal DCF fit will have an entirely positive or entirely negative set of derivatives. Rather than forcing the entire matrices to produce positive derivatives we can multiply the elements of Ga corresponding to given derivatives by a negative sign. Consequently, we have to analyse different discrete sets of sign combinations in order to find the best fit.
We refer to the combination of signs on G as the maxsmooth signs, s, and we can incorporate this into our definition of G so that it becomes G(s). s is a vector of length C and each element is either given by 1 for a positive sign or −1 for a negative sign. For example in equation (17) since both derivatives are negative the maxsmooth signs are s = [1, 1]. For an N th order MSF there are N − 2 derivatives with m ≥ 2, consequently, there are 2 (N − 2) sign combinations. For low order N we can explore this space exhaustively at reasonable computational cost with CVXOPT. However, as N becomes larger, the total number of sign combinations rapidly increases. While N = 4 has 4 potential sign combinations, we find that N = 13 has 2048. This would mean performing an exponentially increasing number of CVXOPT fits which will become increasingly time consuming. An alternative approach navigating through the discrete sign spaces is detailed in section 4.
We can visualise equation (1) by varying the parameters of an optimal MSF fit over a given range to get a better understanding of the constraints. In order to perform this analysis we use a simulated noiseless Global 21-cm foreground following x −2.5 . We perform a 5 th order MSF fit with maxsmooth on this data using equation (3) to find the optimal foreground parameters. While this fit will not return the best χ 2 , as shown in Fig. 2, it is sufficient to allow us to investigate how variation in the parameters affects the constraints. We vary each parameter's value 200% either side of the optimum found and sample these ranges using 400 points. Fig. 2, left panel, shows the parameter space for the fit described above. Black regions in the figure are combinations of parameters for which the condition in equation (1) is violated. The coloured regions are regions in which the condition is upheld where their colour is related to the maxsmooth sign combinations. Each panel in the figure shows the parameter space for two of the five parameters and the contour lines show the values of χ 2 across the parameter ranges. While varying the parameters relevant to each panel we maintain all others at their optimal values found with maxsmooth. The contour lines help us to determine correlations between the parameters and this is particularly useful when fitting a physically motivated DCF.
Transitions through a region of violation between viable regions correspond to changes in the maxsmooth sign of one or more of the derivatives. This is illustrated by the use of different colour maps across the different viable regions. For example, in the panel corresponding to variation in a 0 and a 2 for a 2 ≤ 15 s = 1 and for a 2 ≥ 25 s = −1 for the m = 2 derivative. The transitions become more complex when varying parameters a 2 , a 3 and a 4 because these parameters affect the magnitude and signs of multiple high order derivatives. We also see transitions between regions of different sign combinations when a 4 switches sign. This causes the final constrained derivative to switch sign because it is a constant mul-   The parameter space of a 5 th order MSF of the form given by equation (3), fit to data generated with y = x −2.5 . The tested parameter ranges are taken to be 200% on either side of the optimal results found by maxsmooth. We maintain the optimal parameter values for three of the parameters when varying the two corresponding to each panel. Black regions of the graph show parameter combinations that violate equation (1) and coloured regions correspond to viable regions. The central region, marked with a circle, in each panel corresponds to the optimum parameters and the optimum maxsmooth sign combination on the derivatives. Transitions through regions of violation correspond to changes in the sign of one or more high order derivatives. A change in sign of parameter a 4 corresponds to a change in the sign of the final constrained derivative. Since this derivative is a constant there is no violated region between these two possible sign combinations. Top Right: The data and the fitted MSF, where T represents the measured sky temperature and ν is the frequency. Bottom Right: The residuals after subtracting the fitted MSF from the data.
tiplied by a 4 . There is no region of violation between these viable regions because a constant value of a 4 = 2.5 × 10 −4 meets the MSF constraint as will a constant value of 0 or a 4 = −2.5 × 10 −4 . The equivalent graph in 5 dimensions, varying all parameters around their optimal values, would feature 5 dimensional convex faceted regions in which equation (1) is met with a unique set of maxsmooth signs. This concept scales up and down to higher and lower dimensions of parameter space.
The parameters a 0 and a 1 do not affect the constrained derivatives of the MSF or the validity of the conditions and the associated colour map gives the optimum maxsmooth signs. Since the central region of each panel corresponds to the optimum parameters for the fit, this will always be a viable region and will have the same maxsmooth sign combination as the panel corresponding to a 0 and a 1 . Fig. 2 illustrates this point and the central viable region always corresponds to derivatives of order m = 2, 3 and 4 having s = [−1, 1, − 1].
We performed the equivalent fit with the logarithmic basis that has derivatives constrained in log 10 (y) − log 10 (x) space. The associated parameter graph can be found in App. B, and a comparison with the graph presented in this section shows that the constraints are much less severe in logarithmic space. The weak constraints mean that all of the discrete sign combinations on the derivatives have similar minimum χ 2 values. This becomes a problem when attempting to quickly search the discrete maxsmooth sign spaces and is discussed further in section 4.2.
Generally, the above conclusion will be specific to the data being fitted here and this analysis is not a complete exploration of the basis functions available. However, since the basis functions in y − x space are all related by linear combinations of each other we find similar parameter distributions for all. Importantly, the analysis highlights the effect that the choice of basis function has on the quality of fit and ease of fitting as well as demonstrating the constrained nature of MSFs. These plots can be produced using maxsmooth for any DCF fitting problem.

NAVIGATING DISCRETE SIGN SPACES
This section discusses in more detail the fitting problem, defines the maxsmooth algorithm and compares its efficiency with an alternative fitting algorithm. To restate concisely the problem being fitted  Table 2. The table illustrates the relationship between the binary representation of the sign combination number and the maxsmooth signs, s. A 1 in the (N − 2) bit binary representation for an MSF corresponds to a negative maxsmooth sign (positive derivative). The signs and their respective combination numbers are used in the fitting routine and for the visualisation of the χ 2 distribution as shown in Fig. 3.
we have A 'problem' in this context is the combination of the data, order, basis function and constraints on the DCF.
With maxsmooth we can test all possible sign combinations. This is a reliable method and, provided the problem can be solved with quadratic programming, will always give the correct global minimum. When the problem we are interested in is 'well defined', we can develop a quicker algorithm that searches or navigates through the discrete maxsmooth sign spaces to find the global minimum. Each sign space is a discrete parameter space with its own global minimum as discussed in section 3. Using quadratic programming on a fit with a specific sign combination will find this global minimum, and we are interested in finding the minimum of these global minima.
A 'well defined' problem is one in which the discrete sign spaces have large variance in their minimum χ 2 values and the sign space for the global minimum is easily identifiable. In contrast we can have an 'ill defined' problem in which the variance in minimum χ 2 across all sign combinations is small. This concept of 'well defined' and 'ill defined' problems is explored further in the following two subsections.

The χ 2 Distribution
We investigate the distribution of χ 2 values, shown in Fig. 3, for a 10 th order MSF fit of the form given by equation (5) to a simulated 21-cm foreground, like that shown in Fig. 2. We add Gaussian noise with a standard deviation of 0.5 to the foreground. For a typical 21-cm experiment this noise is unrealistic and would mask any potential signal in the data, however, it illustrates the behaviour of the maxsmooth algorithm when fitting a difficult problem.
In Fig. 3, a combination of all positive derivatives (negative signs) and all negative derivatives (positive signs) corresponds to sign combination numbers 255 and 0 respectively. Specifically, the maxsmooth signs, s, are related to the sign combination number by its C bit binary representation, here C = (N − 2). In binary the sign combination numbers run from 00000000 to 11111111. Each bit represents the sign on the m th order derivative with a 1 representing a negative maxsmooth sign. For example, the sign combinations surrounding number 25 are shown in Tab. 2.
Although we note that Fig. 2 corresponds to a different prob- lem, we would expect a similar parameter space for the fit performed here. Each region in the equivalent figure would correspond to a single sign combination, and the associated minimum χ 2 value in the regions would give us the data that informs Fig. 3. The distribution appears to be composed of smooth steps or shelves; however, when each shelf is studied closer, we find a series of peaks and troughs. This can be seen in the subplot of Fig. 3 which shows the distribution in the neighbourhood of the global minimum found in the large or 'global' well. This type of distribution with a large variance in χ 2 is characteristic of a 'well defined' problem. We use this example χ 2 distribution to motivate the maxsmooth algorithm outlined in the following subsection.

The maxsmooth Sign Navigating Algorithm
Exploration of the discrete sign spaces for high N can be achieved by exploring the spaces around an iteratively updated optimum sign combination. The maxsmooth algorithm begins with a randomly generated set of signs for which the objective function is evaluated and the optimum parameters are found. We flip each individual sign one at a time beginning with the lowest order constrained derivative first. When the objective function is evaluated to be lower than that for the optimum sign combination, we replace it with the new set and repeat the process in a 'cascading' routine until the objective function stops decreasing in value.
The local minima shown in Fig. 3 mean that the cascading algorithm is not sufficient to consistently find the global minimum. We can demonstrate this by performing 100 separate runs of the cascading algorithm on the simulated 21-cm foreground, and we use equation (5) with N = 10 to model the MSF as before. We find the true global minimum 79 times and a second local minimum 21 times. For an MSF fit to this simulated date the difference in these local minima is insignificant, ∆ χ 2 = 0.12. However, we see the same behaviour with real data sets from EDGES and LEDA, and when performing joint fits of foregrounds and signals of interest ∆ χ 2 can greatly increase.
The abundance of local minima is determined by the magnitude  (1a) shows the initial random starting point and point (1b) shows a rejected sign combination in the cascade routine from (1a) to (2). Point (2) is an accepted step through the cascade with a χ 2 value smaller than the previous minimum. Point (3) marks the end of the cascade and the start of the left directional exploration. Finally, point (4) illustrates the end of the right directional exploration when the χ 2 value exceeds the limit on the directional exploration. The black dots mark the entirety of the searched sign combinations. and presence of signals, systematics and noise in the data. When jointly fitting a signal/systematic model with a DCF foreground, the signal/systematic parameters are estimated by another fitting routine in which maxsmooth is wrapped. The initial parameter guess will not be a perfect representation of any real systematic or signal. This, along with a large noise, can produce a large difference between the true foreground and the 'foreground' being fitted causing the presence of local minima to become more severe.
To prevent the routine terminating in a local minimum we perform a complete search of the sign spaces surrounding the minimum found after the cascading routine. We refer to this search as a directional exploration and impose limits on its extent. In each direction we limit the number of sign combinations to explore and we limit the maximum allowed increase in χ 2 value. We prevent repeated calculations of the minimum for given signs and treat the minimum of all tested signs as the global minimum.
We run the consistency test again, with the full maxsmooth algorithm, and find that for all 100 trial fits we find the same χ 2 found when testing all sign combinations. In Fig. 4 the red arrows show the approximate path taken through the discrete sign spaces against the complete distribution of χ 2 . Point (1a) shows the random starting point in the algorithm, and point (1b) shows a rejected sign combination evaluated during the cascade from point (1a) to (2). Point (2), therefore, corresponds to a step through the cascade. Point (3) marks the end of the cascade and the start of the left directional exploration. Finally, point (4) shows the end of the right directional exploration where the calculated χ 2 value exceeds the limit on the directional exploration.
The global well tends to be associated with signs that are all positive, all negative or alternating. We see this in Fig. 3 where the minimum falls at sign combination number 169 and number 170, characteristic of the derivatives for the simulated 21-cm data, corresponds to alternating positive and negative derivatives from order m = 2. Standard patterns of derivative signs can be seen for all data following approximate power laws. All positive derivatives, all negative and alternating signs correspond to data following the approximate power laws y ≈ x k , y ≈ −x k , y ≈ x −k and y ≈ −x −k (see App. C).
The maxsmooth algorithm assumes that the global well is present in the χ 2 distribution and this is often the case. The use of DCFs is primarily driven by a desire to constrain previously proposed polynomial models to foregrounds. As a result we would expect that the data being fitted could be described by one of the four approximate power laws highlighted above and that the global minimum will fall around an associated sign combination. In rare cases the global well is not clearly defined and this is described in the following subsection.

Ill Defined Problems and their Identification
We can illustrate an 'ill defined' problem, with a small variation in χ 2 across the maxsmooth sign spaces, by adding a 21-cm signal into the foreground model and fitting this with a 10 th order logarithmic MSF defined by equation (6). We take an example signal model from Cohen et al. (2017) and add an additional noise of 20 mK more typical of a 21-cm experiment. The resultant χ 2 distribution with its global minimum is shown in the top panel of Fig. 5.
The global minimum, shown as a black data point, cannot be found using the maxsmooth algorithm. The cascading algorithm may terminate in any of the approximately equal minima and the directional exploration will then quickly terminate because of the limits imposed.
If we repeat the above fit and perform it with equation (2) we find that the problem is well defined with a larger χ 2 variation across sign combinations. This is shown in the bottom panel of Fig. 5. The results, when using equation (6), are significantly better than when using equation (2) meaning it is important to be able to solve 'ill defined' problems. This can be done by testing all maxsmooth signs but knowing when this is necessary is important if you are expecting to run multiple DCF fits to the same data set. We can focus on diagnosing whether a DCF fit to the data is 'ill defined' because a joint fit to the same data set of a DCF and signal of interest will also feature an 'ill defined' χ 2 distribution.
We can identify an 'ill defined' problem by producing the equivalent of Fig. 5 using maxsmooth and visually assessing the χ 2 distribution for a DCF fit. Alternatively, we can use the parameter space plots to identify whether the constraints are weak or not, and if a local minima is returned from the sign navigating routine then the minimum in these plots will appear off centre.
Assessment of the first derivative of the data can also help to identify an 'ill defined' problem. For the example problem this is shown in Fig. 6 where the derivatives have been approximated using ∆y/∆x. Higher order derivatives of the data will have similarly complex or simplistic structures in the respective spaces. There are many combinations of parameters that will provide smooth fits with similar χ 2 values in logarithmic space leading to the presence of local minima. This issue will also be present in any data set where the noise or signal of interest are of a similar magnitude to the foreground in y − x space.

Comparison With Basin-hopping and Nelder-Mead Methods
For comparison of the two methods, testing all sign combinations and navigating through sign spaces, we generate a signal y with The χ 2 distribution found when fitting simulated 21-cm experiment data with the logarithmic basis function, equation (6). The distribution has a noise like structure and is difficult to solve with the maxsmooth sign navigating algorithm. However, the global minimum can be found by testing all sign combinations with maxsmooth. The symmetric nature of the distribution stems from the symmetric nature about 0 of the high order derivatives in logarithmic space. Bottom Panel: The same as above using a normalised polynomial given by equation (2). The distribution is clearly defined and easily searchable with the sign navigating routine. The difference between this result and that shown above can be used to understand what makes a problem 'ill defined'.  (6) and equation (2) respectively. Also shown as orange lines are the approximate first derivatives of the respective best fits for comparison. The large variance of the first derivative of the data in logarithmic space highlights that there are many different ways to fit this data set 'well' producing the χ 2 distribution we see in Fig. 5.  (25) using the two built-in quadratic programming methods. For comparison the time taken by a Basin-hopping/Nelder-Mead routine is also shown up to N = 7 after which the routine fails to find the optimum solutions without adjustments to the routine parameters. All of the fits were performed with equation (2) and on the same machine.
polynomial dependence on the coordinate x and a Gaussian random noise with a standard deviation of 0.5 We fit this data with a 10 th order MSF of the form described by equation (2) and assess the χ 2 distribution to find that the problem is well defined. This is as expected since the data follows an approximate x k power law and we are fitting in linear space. The algorithm run time becomes a significant issue when performing joint fits of foregrounds, signals of interest and/or systematics in which multiple DCF fits have to be performed. The time taken to perform both in-built maxsmooth routines is shown in Fig. 7. It is quicker to partially sample the available spaces for high N than testing all sign combinations and as discussed for 'well defined' problems this will return the minimum χ 2 .
The runtime of the sign navigating routine is dependent on the starting sign combination, the limits imposed on the directional exploration, dominating factor, and the width of the global well. There is no consistent measure of the difference in time taken to fit the data between the two maxsmooth methods. However, for the sign navigating routine we are inevitably fitting for a smaller number of the sign combinations than when testing all. Fig. 7 also shows the time taken to fit the data with equation (2) using a Basin-hopping routine followed by a Nelder-Mead algorithm, hereafter referred to as BHNM. These two algorithms have been previously used either separately or in conjunction for fitting MSFs (Sathyanarayana Rao et al. 2015Rao et al. , 2017Singh & Subrahmanyan 2019). We find that the BHNM method is approximately 2 orders of magnitude slower than maxsmooth. Between N = 3 and 7 we find a maximum percentage difference in χ 2 of ≈ 0.04% when comparing the BHNM method with the results from maxsmooth.
The primary difference in the approaches comes from the division of the parameter space into discrete sign spaces. The BHNM method attempts to search the entire parameter space and penalises parameter combinations that violate equation (1). However, assessment of Fig. 2 highlights that this is not a convenient method because across the whole parameter space there are multiple local minima with different sign combinations and transitioning from one 'basin' to another is not trivial for a heavily constrained parameter space. By dividing the space up into discrete sign spaces with maxsmooth, we can test the entirety of the parameter space, unlike when using the BHNM method, guaranteeing we find the global minimum. We could perform the same division of the space and in each discrete sign space perform a Nelder-Mead or equivalent routine however we use quadratic programming because it is designed specifically for fast and robust constrained optimisation.
For the BHNM method, we show here only fits up to N = 7 after which it begins to fail without further adjustment of the routine parameters. The freedom to adjust these parameters can be considered a disadvantage that leaves the user to determine whether the routine parameters they have chosen produce a true global minimum. In contrast maxsmooth is designed to reliably give the optimum result, with the only adjustable routine parameters being the total number of CVXOPT iterations and the directional exploration limits.

The Recovery of Model 21-cm Signals
A discussion of MSFs and a comparison to unconstrained polynomials for Global 21-cm cosmology can be found in Sathyanarayana Rao et al. (2017). Foreground modelling with high order MSFs is shown to accurately recover Global 21-cm signals in simulated data and unconstrained polynomials are shown to introduce additional turning points, when compared to those in a mock signal. The number of additional turning points is shown to increase with polynomial order. The addition of extra turning points can obscure the signal of interest and lead to the false identification of systematics. In the following subsections we look at fitting foregrounds for 21-cm experiments with DCFs and compare these to unconstrained polynomial fits.
Deviations from a smooth structure can be induced in data by experimental systematics or they can be intrinsic to the foreground. In the case of a smooth foreground, by using PSFs we can correct for non-smooth structure directly with our foreground model rather than separately fitting out systematics. PSFs allow for zero crossings in the high order derivatives but remain more constrained than traditionally used polynomial fits. However, lifting constrains on a DCF model has the potential to also result in signal loss where the level of signal loss is dependent on the presence of non-smooth structure in the foreground and the number of lifted constraints.
In the following analysis the quality of the fit in terms of RMS is of secondary importance. An unconstrained high order polynomial will generally produce residuals with a lower RMS than a DCF. However, a correctly constrained DCF will leave behind the structure of a signal in the residuals meaning it can be identified easily. As a measure of signal structure in the residuals we take the number of turning points, p as used by Sathyanarayana Rao et al. (2017). The Global 21-cm signal is expected to have a distinct number of turning points, 2 − 3 determined by various astrophysical processes (see section 1). A comparison of the number of turning points in a mock signal to the residuals after removing a DCF fit from simulated data including the same mock signal will help to identify the degree to which DCFs preserve signals.

DCFs and 21-cm cosmology
To compare the performance of DCFs and unconstrained polynomials, we use the sample of 264 signal models, S(ν) over the bandwidth ν = 40 − 200 MHz, presented in Cohen et al. (2017) Table 3. The table shows the total number of fits, using one of four foreground models, to the smooth foreground plus signal simulations from Cohen et al. (2017) that have residuals with the same number of turning points as the signal, p Signal . The data corresponds to that shown in the left panel of Fig. 8. We also show instances where the recovered residuals have one or two additional turning points.
//people.ast.cam.ac.uk/~afialkov/Collab.html. We add to these models a foreground given by a ν −2.5 to produce simplistic mock data sets. The data sets are noiseless and while this is unrealistic, we would not expect the addition of noise to obscure any larger signal structure present in the data from a 21-cm experiment with a low radiometer noise. We fit each simulated data set with an MSF, low order unconstrained polynomials and a set of PSFs. All of the fitted DCF foreground models are 13 th order and of the form given by equation (6). We test all sign combinations for the DCF fits in this section for reasons that were explained in section 4.2 and find that the chosen DCF model and order provides the best fits after testing the built-in maxsmooth models. Fig. 8, left panel, shows the difference in the number of turning points, ∆p for the residuals, p Fit Residuals , and for the signal, p Signal , using four different foreground models as a function of the maximum brightness temperature, T max , and minimum temperature, T min , of the simulated signal. Each data point corresponds to one of the 264 mock data sets and ∆p = 0 signifies that the residuals have the same number of turning points as the signal. The unconstrained polynomial fits have the same functional form as the DCFs.
We can quantify the probability of returning residuals with the same number of turning points as the model signals. Tab. 3 shows the total number of residuals for each fit type that returned the same p as the simulated signals and 1 or 2 additional turning points. The MSF fits return p Signal for 44% of the mock data sets and 99% of the time it returns at most p Signal plus 2 additional turning points. They are the most likely, of the tested foreground models to return the correct number of turning points. The equivalent figures for the 5 th order logarithmic unconstrained polynomial, one of the most frequently used foreground fits in 21-cm cosmology, are 0% and 64% and for the PSF with m ≥ 3 they are 0.004% and 85%. The statistics suggest that modelling foregrounds with MSFs and well constrained PSFs can frequently result in residuals that closely follow the structure of the signal. We include in the statistics the cases with 1 or 2 additional turning points because the signals should still be identifiable in the residuals. A joint fit of foreground model plus a signal model in these instances should return an approximately correct parameterisation of the signal model.

Example Residuals
Also shown in Fig. 8, right panel, is an example of the fits for a given model and this is akin to Fig.7 in Sathyanarayana Rao et al. (2017). We see that an MSF, top panel, while not identically recovering the signal but rather a smooth baseline subtracted version does preserves the three turning points of the model signal as expected. The example residuals from unconstrained polynomial fits shown in the bottom right panel of Fig. 8 show a larger disparity with the The difference in the number of turning points between the fit residuals and the simulated signal, ∆p = p Fit Residuals − p Signal , as a function of the maximum and minimum temperatures of the signal for a smooth foreground model. This is shown for four different foreground models and in each panel the data points correspond to one of the 264 mock data sets. The graph shows that the MSF fit is the most likely model to return the structure of the signal and that a correctly constrained PSF can more frequently preserve the signal than an unconstrained low order polynomial. It also shows that the fit quality is dependent on the maximum and minimum temperatures of the signal. All of the DCF fits used to produce this graph were logarithmic and 13 th order. Right: Shown are examples of how the addition of allowed zero crossings in the high order derivatives of a DCF can affect the residuals and how accurately they preserve the turning points of any signal present. We use a signal model, blue line in all panels, from the theoretically motivated set presented in Cohen et al. (2017) along with a model of a 21-cm experiment foreground. Fits with an MSF and polynomials are also shown for comparison with the number of turning points, p, displayed for each of the residuals (see legend for details). While the residuals after fitting and removing an MSF do not identically match the signal it is the best representation of the tested foreground models. In this case the residuals represent a smooth baseline subtracted version of the signal as discussed in (Sathyanarayana Rao et al. 2017). structure of the model 21-cm signal than the MSF fit. The PSF with derivatives of order m ≥ 3 constrained, middle panel, produces residuals with one additional turning point. The behaviour at low frequency of the DCF fits is consistent across all of the 264 tested models. It is a byproduct of the basis choice, the frequency of data sampling and the steep nature of the foreground at low frequencies.
We can alleviate some of these issues by increasing the sampling rate of our mock experiment and by reducing the bandwidth. However, the dominant cause is the basis function choice. In logarithmic space any non-smooth variations in the data and derivatives are amplified as shown in section 4.2. Since the mock data set here is noiseless, the only non-smooth structure comes from the signal. However, the data is predominately smooth at high frequency and so the optimum fit tends to be an accurate representation of the foreground in this region and poorer at lower frequencies. The additional turning points when comparing the MSF and PSFs are also seen at low frequencies for the same reason. Despite the above we maintain the full bandwidth, the same sampling rate and the logarithmic basis function in this analysis. We do this because this basis function gives us the best fitting DCF models and in a real experiment our knowledge of any present signal of interest will be too poor to constrain the bandwidth.
We have not analysed the full family of possible PSFs or unconstrained polynomials. Removing constraints on the lower order derivatives first will have the largest affect on the structure of the residuals and consequently we have extensively analysed the effects of lifting the restrictions on the 2 nd and 3 rd order derivatives. We have, however, found evidence to suggest that MSFs and well constrained PSFs can recover signal structure to a higher degree of accuracy than unconstrained polynomials in the case of a smooth foreground. If the foreground features additional non-smooth structure we may expect that an appropriately constrained PSF will act as an MSF. Determination of the appropriate constraints on a PSF will depend on the structure of the data, the expected structure of the signal in the data and the quality of the fit in terms of χ 2 .

Identifiable 21-cm Signals and Limitations of DCFs
The left panel of Fig. 8 is useful for 21-cm cosmology, in characterising the signals most likely to be detectable using DCFs. We find that MSFs and m ≥ 3 PSFs will best recover signals with approximately T min ≥ −225 mK and T max ≥ 0 mK. We would expect this to be true generally because these signal models have complex structures and feature the strongest deviations from the smooth foreground. For the coldest models with T min ≤ −225 mK and T max ≤ 0 mK, X-ray heating is negligible and the spin temperature is always seen in absorption against the CMB. Consequently, they have the simplest structure, a weak deviation from the smooth foreground and are likely to be fitted out as part of the foreground modelling.
Comparison to restrictions placed on the most probable structure of the Global 21-cm signal from experimental data will help identify whether DCFs can recover these signals. For example . The top row shows the resultant residuals when using an MSF to fit just a foreground and to jointly fit a foreground and Gaussian signal model to three mock data sets with no signal, a wide Gaussian signal and a narrow Gaussian signal from left to right. The second row shows the equivalent for an unconstrained polynomial fit. The third row shows the change in foreground, δT fg , between the pure foreground fit and joint fits for both foreground models and all three signal models. The bottom row shows the recovered signals for both foreground models alongside the true signal models in the data sets. Singh et al. (2018b) ruled out models with low X-ray heating and T max < 0 mK, this is supported by results presented in Monsalve et al. (2019). This suggests that DCFs are well suited to identify the most probable 21-cm signals. However, X-ray heating is only one of the structure defining processes. A more thorough exploration of this in terms of signal model parameters, such as star-formation efficiency and X-ray luminosity, is needed to fully understand the types of theoretical 21-cm signals that DCFs are sensitive to. This is out of the scope of this paper and will be the subject of future work.
Smooth systematics, like smooth 21-cm signals, in the data will be removed or fitted out as part of the foreground. However, unless independent modelling of systematics is required, this can be considered an advantage. Modelling foregrounds with DCFs will help to identify non-smooth systematics in data sets where unconstrained polynomials have the potential to fit these out. This is particularly important in 21-cm cosmology where these types of systematics need to be identified and instrumentation needs to be iteratively improved to increase the chances of a detection.

Smooth Signal Models
For 21-cm cosmology it is also important to consider how the Global 21-cm signal is modelled when performing joint fits. Typically the signal is modelled as a Gaussian, flattened Gaussian or using physically motivated models. If a Gaussian model is jointly fit with a DCF foreground, then the fit is biased towards returning a 'smooth' Gaussian signal with a large variance, σ, or full width at half max, FWHM = 2 2 log(2) σ, even if such a signal is not real. This can cause uncertainty in the presence of such signals.
We can illustrate this by generating three different data sets all with foregrounds following ν −2.5 and Gaussian noise of 20 mK. Into two of the three data sets we add mock Gaussian signals with central frequencies ν c = 100 MHz and generated using where A is the amplitude. The first mock signal has an amplitude of 80 mK and a variance of 80 MHz representing a smooth Gaussian 21-cm signal. The second represents a narrow Gaussian signal with an amplitude of 230 mK and a variance of 10 MHz. The final data set has no additional signal in the band of 60 − 200 MHz. We fit each data set, normalised to the maximum temperature, with a best fitting 11 th order MSF of the form given by equation (5). We compare the resultant residuals for each data set to those from a joint fit of an MSF, of the same functional form, and a Gaussian 21-cm signal model. We perform the joint fits by using maxsmooth with the Python implementation of the nested sampling software MultiNest (Buchner et al. 2014;Feroz & Hobson 2008;Feroz et al. 2019). Here, MultiNest estimates the Gaussian signal parameters, maxsmooth fits the foreground model to the data minus the estimated Gaussian signal at each iteration and  Table 4. The table shows the maximum difference between the recovered signals and the change in foreground for the polynomial and MSF pure foreground and joint fits, (δT fg −T sig ) max , to the three simulated 21-cm experiment data sets in Fig. 9. Also shown are the changes in RMS, ∆RM S, between the pure foreground and joint fits for both foreground models and all three Gaussian signal types. We also provide log-evidences and values for the change in log-evidence, ∆ log(Z), between the joint and pure foreground fits with errors calculated from the error in each log(Z). The large change in RMS and log-evidence as well as the increased discrepancy between the change in foreground and the recovered signal for the narrow Gaussian signal data set provide confidence that the recovered signal is truly present in the data.
MultiNest minimises the data minus the foreground model plus the signal model. We also perform the equivalent fits with a 5 th order unconstrained polynomial given by equation (6) using a Lavenberg-Marquardt (Levenberg 1944;Marquardt 1963) algorithm in place of maxsmooth. The results of these fits are shown in Fig. 9 where we have provided the same theoretically motivated priors on all of the Gaussian 21-cm models. Respectively from top to bottom the rows in Fig. 9 show the residuals after just an MSF fit and after a joint fit for comparison, the equivalent for the unconstrained polynomial fits, the change in foreground between the foreground fit and the joint fit and the recovered signal in comparison to the actual signal. The columns correspond to the case of no signal, a wide Gaussian signal and a narrow Gaussian signal from left to right.
We can see that in the absence of a signal jointly fitting with a Gaussian model and MSF returns a deep absorption trough. This recovered absorption trough is approximately the same as the change in the foreground model when fitting with just an MSF and the change in RMS is small. If this were a real experiment the fitted model could easily be misinterpreted as a real signal that had been fitted out as part of the foreground model. Without prior knowledge of the signal, however, we cannot confidently draw any conclusions about the signals existence.
When jointly fitting in the presence of a wide Gaussian signal with an MSF we see a similar result and we could not confidently say that the signal is present in the data if we had no prior knowledge. In fact, the joint fit has recovered a poor representation of the Gaussian signal because the smooth signal in both instances, pure foreground fit and joint, has been absorbed in the foreground modelling.
Finally, we see that in the case of a narrow Gaussian signal in the data set with an MSF we get an almost exact recovery after a joint fit. There is a larger discrepancy between the difference in foreground models and the recovered signal, as shown in Tab. 4, and the reduction in RMS is significant giving us confidence that the signal is truly present. Importantly, we could have predicted the presence of the signal by assessing the residuals after an MSF fit to this data set which clearly shows the remnant of a non-smooth absorption trough.
We see similar results when using the unconstrained polynomial indicating that this problem is not unique to DCFs. However, as illustrated in the previous subsections, correctly constrained DCFs are more likely to return an accurate representation of a 21-cm signal in the residuals. For the case of a wide Gaussian signal model we find that the MSF recovers approximately the correct central frequency where as the polynomial does not. Similarly, for the narrow Gaussian we also find that the log-evidence is higher for the MSF fit at ≈ 242 compared to ≈ 238 with the unconstrained polynomial indicating that the MSF fits the foreground better.
Tab. 4 also shows values for the log-evidence for each fit and the change in log-evidence, ∆ log(Z), between the foreground and joint fits for each data set and each foreground model. While MultiNest returns a log-evidence for the joint fits, maxsmooth is not a Bayesian algorithm. However, assuming a Gaussian likelihood we can use MultiNest to calculate the evidence for our pure foreground fits by having it estimate the noise and calculate the likelihood. As with the RMS and difference between the change in foreground and recovered signal, we see a larger change in ∆ log(Z) for the data set with a narrow Gaussian signal giving us confidence that are recovered signal is real. ∆ log(Z) for the cases of no signal and a wide Gaussian signal are ≈ 0, excluding the MSF fit to the no signal data set, indicating that the joint fits are no more probable than the pure foreground fits. For the MSF fit to the no signal data set the evidence for the joint fit is less than that for the unconstrained polynomial. This implies a lower level of confidence in the recovered signal, which is in this case a false detection, when fitting with an MSF.

MSFs and the EDGES Data
In 2018 the EDGES team reported the detection of an absorption trough at 78 MHz which could be interpreted as a Global 21-cm signal (Bowman et al. 2018). The reported signal is ≈ 2 times the maximum magnitude predicted by current cosmological models (Cohen et al. 2017), and, in order to explain the signal as a 21-cm signal, interactions between dark matter and baryons Berlin et al. 2018;Kovetz et al. 2018 The occasional increase in χ 2 with N for the logarithmic model is because this basis is increasingly unstable with higher N and requires all sign combinations to be tested. confirmed by measurements from LWA (Dowell & Taylor 2018) there are concerns about the analysis of the EDGES data (Hills et al. 2018;Sims & Pober 2020;Singh & Subrahmanyan 2019). These studies and the following work presented here use the publicly available integrated spectrum from the EDGES Low Band experiment which can be found at: https://loco.lab.asu.edu/ edges/edges-data-release/. Hills et al. (2018) found that recovering the absorption profile using the 'physically motivated' foreground model in Bowman et al. (2018) produces unphysical negative values for the ionospheric electron temperature and optical depth. This suggests that the treatment of the foreground in Bowman et al. absorbs part of an unknown systematic. It was also found that a large change in foreground was needed when just fitting a 'physical' foreground to the data and when jointly fitting with a flattened Gaussian 21-cm signal profile. In Hills et al. the authors identify the potential presence of a sinusoidal function in the EDGES data with an amplitude of ≈ 60 mK and a period of ≈ 12.5 MHz.
Sims & Pober (2020) fit a range of models to the EDGES data varying the 21-cm models between a Gaussian model, a flattened Gaussian model as used by Bowman et al. (2018) and physical simulations from the ARES code (Mirocha 2014). They vary the unconstrained polynomial order for the foreground model and examine likelihoods with and without an additional noise term and a damped sinusoidal function. They use Bayesian Evidence to quantify the most likely scenarios of an atlas of 128 models. The 21 highest evidence models all feature damped sinusoidal functions all with a consistent amplitude of ≈ 60 mK and a period of ≈ 12.5 MHz.
An MSF fit to the foreground should leave a periodic sinusoidal function behind in the residuals if it is present in the data because it is non-smooth in nature. This has previously been shown to be the case by Singh & Subrahmanyan (2019), hereafter S19, who identified a sinusoidal feature with an amplitude of 60 ± 10 mK and a period of 12.3 ± 0.1 MHz. We attempt here to re-create this analysis to illustrate the abilities of maxsmooth. The fitting routine used and choice of basis function are the only differences between the results presented here and in S19. The use of maxsmooth means that our joint fit of the data and a systematic model will be computationally quicker and more reliable than the Nelder-Mead based approach to fitting taken in S19, as demonstrated in section 4.
We begin first by assessing the quality of fits using the various basis functions built-in to maxsmooth. S19 used a basis function constrained in T − log 10 (ν) space and although the functional form is not explicitly stated in S19 it is derived from the models in Sathyanarayana Rao et al. (2017) and so is similar to, if not identical to, equation (5). Fig. 10 shows the resultant χ 2 as a function of MSF order for fits with varying basis functions using maxsmooth. This figure again shows how the choice of basis function can affect the quality of the fit. Of note is that our T − log 10 (ν) space model cannot achieve the same RMS as that found by S19 with a similar model. With an N = 7 MSF constrained in logarithmic frequency space S19 return an RMS of 44 mK, whereas with equation (5) maxsmooth returns an RMS of 87 mK. We believe this is due to the lack of normalisation in maxsmooth and as previously discussed this is an ongoing area of development.
We use Fig. 10 to inform our choice of basis function and MSF order and proceed using an 11 th order MSF of the form given by equation (4). We find residuals with an RMS of ≈ 40.4 mK as shown in Fig. 1. We note that this is in approximate agreement to results shown in S19. We also find troughs at ≈ 70 MHz and ≈ 85 MHz which correspond to those found in all of the reported sinusoidal functions.
We jointly fit the data with an 11 th order MSF and a sinusoidal function of the form and the resultant residuals are shown in the left panel of Fig. 11. Note we have not included a model 21-cm signal in this fit. We use maxsmooth along with a Lavenberg-Marquardt algorithm implemented with Scipy to perform this joint fit and with initial parameters of p 0 = 60 mK, p 1 ≈ (2π)/12.5 MHz −1 and p 2 = 0 rad. We find that the results change with the initial parameters when using the Lavenberg-Marquardt algorithm, however, the chosen initial parameters are well informed by the previous work outlined above. We return parameters of p 0 ≈ 56.6 mK, p 1 ≈ 0.52 MHz −1 or a period of ≈ 12.1 MHz and p 2 ≈ 1.1 rad in close agreement with previous analysis. We find an RMS value of 20.1 mK, in close agreement with the result of 22.9 mK found in S19 when jointly fitting an MSF and the sinusoidal function. However, we note that the RMS of the joint fit in the left panel of Fig. 11 is equivalent to the RMS found by S19 when jointly fitting an MSF foreground, a sinusoidal systematic and Gaussian 21-cm signal model. Their proposed Gaussian 21-cm signal model fits with standard predictions. However, the RMS of our joint fit without a Gaussian signal model may highlight some of the difficulties in detecting 'smooth' Gaussian signals discussed in section 5.1.3.
We perform a joint fit of a Gaussian 21-cm signal, a sinusoidal systematic and MSF. Due to the increased complexity of the fit and uncertainty in the fit parameters for the Gaussian model we perform our fit using maxsmooth and MultiNest. We provide prior ranges of 30−80 mK for the amplitude, 0−25 MHz for the period and 0−2π for the phase shift of the sinusoidal systematic. For our Gaussian we set realistic priors on the amplitude of 0 − 250 mK, on the central frequency of 50 − 100 MHz and on σ of 0 − 20 MHz. For the sinusoid we find an amplitude of 56.5 mK, a period of approximately 12.1 MHz and a phase of 1.1 rad. These results, using a more extensive search of the parameter space, are consistent with our previously found sinusoid further indicating that when performing the fit shown in the left panel of Fig. 11 our initial parameters were MultiNest returns a noise parameter of 20 mK and our resultant fit, shown in the right panel of Fig. 11, has an RMS of approximately 20.0 mK. This is much wider and deeper than the signal reported in Singh & Subrahmanyan (2019) which has a depth of 133 ± 60 mK and FWHM of 9 ± 3 MHz however we return the same central frequency of 73 MHz. Noting the discussion of plausibly detectable 21-cm signals when using DCFs and the bias towards 'detection' of 'smooth' Gaussian signals in section 5.1.3 we assess the feasibility of the returned model signal. Here the notion of 'smoothness' is relative to the bandwidth. In section 5.1.3 a Gaussian with FWHM ≈ 24 MHz is confidently identifiable as a real signal but the bandwidth is much larger than that for the EDGES data. We can see by comparison of the bottom panels of Fig. 11, showing the data minus the foreground from the joint fits, that there is a large change in foreground when we include the Gaussian model as part of our joint fit. This may be because in our initial fit, without the Gaussian model, the foreground model was fitting out the smooth signal. Alternatively, the signal may not be present and the fitting routine has returned a smooth signal by extracting it from the foreground component of the fit. The reduction in RMS when the joint fit includes a Gaussian is 0.054 mK and it is consequently challenging to determine whether or not the signal is present in the data.

MSFs and the LEDA Data
LEDA, like EDGES, is a radiometer based Global 21-cm experiment analysing the band 30 − 88 MHz and aiming to detect the anticipated absorption feature Greenhill & Bernardi (2012). The design and calibration approach of LEDA is detailed in Price et al. (2018). In contrast to EDGES, the LEDA experiment is comprised of 5 dual-polarization radiometer antennas that are part of a larger 256antenna interferometric array. This approach is intended to allow inter-antenna comparison and in-situ measurement of the antenna gain response. Similar to other radiometry experiments with absolute calibration, LEDA uses two noise diode references to calibrate the measured antenna temperature into units of Kelvin. Corrections are then applied to account for the impedance of the antenna and receiver, derived from vector network analyser (VNA) measurements.
As shown in Price et al. (2018), data are seen to vary between antennas, which are not perfectly identical, and this is attributed to minor differences in terrain, analog component response, and physical construction. While calibrated spectra are presented, it is suggested that there are unidentified systematics in the data; work has been undertaken to better characterize and update the LEDA system with iterative improvements. Further measurements were taken in 2017 and 2018, which are under analysis (Gardsen et al., in prep.).
Here, we fit MSFs to data from the LEDA 2016 campaign (Price et al. 2018, Spinelli et al., in prep.). This is the first time MSFs have been applied to LEDA data. Specifically, we fit data from antenna 252A, taken on January 26 th 2016 in the LST range 11:00-12:00. In this LST range, the Galactic contribution to the antenna temperature is at a minimum. The data are binned into 1.008 MHz channels, spanning 40-85 MHz.
We fit an MSF of the form given in equation (4) to the data as we find that this basis function returns the best fit consistently for N ≥ 8. Shown in Fig. 12 are the resultant residuals from an N = 9 MSF fit with an RMS of ≈ 15 K. The resultant residuals are large and would obscure a cosmological 21-cm signal. We note that as per equation (4) in Price et al. (2018) the radiometer noise is expected to be ≈ 0.5 K.
The residuals from the MSF fit clearly feature a damped sinusoidal systematic. We proceed to fit a systematic model given by  (4). We see evidence of a damped sinusoidal systematic in the data set.
along with a 9 th order MSF by using maxsmooth and MultiNest. ν 0 is chosen to be the central frequency of the band. We provide a prior on the power of 0 − 3 for weak damping. Prior ranges of 25 − 75 K for the amplitude of the sinusoidal function, 0 − 1 MHz −1 for the period, P, which is fitted as p 1 = (2π)/P, 0 − 2π for the phase shift and a log uniform prior on the noise of 10 −2 − 10 1 K are also provided. The results of this fit are shown in the left panel of Fig. 13. We return optimal parameters of a power of ≈ 2.7, an amplitude of ≈ 27.9 K, a period of 21.7 MHz and a phase shift of 3.7 rad. The residuals have an RMS of ≈ 8.8 K and MultiNest returns a noise parameter of ≈ 7.7 K. Price et al. (2018) suggest, from analysis of the 2016 LEDA data, that the systematic in the data is caused by the directiondependent gain of the antenna. A frequency dependent group delay may also be caused by a bandpass filter that could contribute unaccounted for reflections. The pattern of oscillations that form the systematic have also been found to change after rainfall. This systematic may then be caused by moisture in the surrounding soil or by changes in the electric length of the dipoles caused by moisture on the dipole itself. We also highlight the similarities in structure of the systematic with that in the EDGES data. Both have sinusoidal structures and so similarities between the experimental setups and calibration processes may hint at larger causes of systematics across 21-cm cosmology experiments. The systematic is not likely to be associated with the sky because of the difference in periodicity and amplitude found by both experiments. EDGES does not have a bandpass filter, it could still be affected by moisture in the surrounding environment however we note that this experiment is in a typically dry location (MRAO, Australia).
The residuals shown in the top left panel of Fig. 13 show a further sinusoidal structure after removal of the, leading order damped sinusoidal systematic. We therefore attempt a joint fit to the data using an MSF foreground, a damped sinusoid and an additional sinusoid described by equation (27). We maintain the same priors on the original damped sinusoidal function and provide a prior of 10 − 30 K on the amplitude, 0 − 1 MHz −1 on the period and 0 − 2π on the phase shift of the additional sinusoidal systematic.
We find best fit parameters for the leading damped sinusoid of ≈ 1.8 for the power, ≈ 30 K for the amplitude, a period of ≈ 19 MHz and a phase shift of ≈ 6.2 rad. For the additional sinusoidal systematic we find an amplitude of ≈ 17 K, a period of ≈ 16 MHz and a phase shift of ≈ 1.5 rad. MultiNest returns a noise of ≈ 7.2 K and the fit shown in the right panel of Fig. 13 has an RMS of ≈ 7.2 K.
Distinctions between the two systematics have been made in the middle right panel of Fig. 13 for clarity. The RMS of the residuals after removal of these two systematics is still significantly larger than the radiometer noise for this experiment, ≈ 0.5 K. However, the decrease in the RMS when these systematics are included in the fit would strongly suggest that both are present in the data. A further addition of sinusoidal systematics will inevitably reduce the RMS of the residuals in the same way that the residuals after foreground removal could accurately be described by a Fourier series. Higher order terms in the series would feature smaller periods until the periodicity of the terms matched that of the noise. However, the systematics present in the data may have a form described by the leading order terms in a damped Fourier series as found here. We leave more rigourous investigation of the additional oscillatory structure in the residuals, top right panel of Fig. 13, to future work.

CONCLUSIONS
Derivative Constrained Functions (DCFs) generally are advantageous for experiments in which the desired signal is masked by higher magnitude smooth signals or foregrounds. A 'smooth' foreground is one that follows a power law structure and DCFs are designed to accurately replicate this by constraining individual high order derivatives to be entirely negative or positive across the band of interest. They are particularly useful when the signal of interest is expected to be several orders of magnitude smaller than the foregrounds, similar in magnitude to the experimental noise and non-smooth in structure (i.e., having high order derivatives that cross zero in the band of interest).
We have introduced maxsmooth as a fast and robust tool for fitting DCFs and demonstrated its abilities with examples from 21-cm cosmology. maxsmooth features a library of example DCF models which is designed to be extended. Further work into the normalisation of DCF models for maxsmooth is required with the aim to improve the quality of fitting and efficiency of the software.
In Sathyanarayana Rao et al. (2017) the authors fit Maximally Smooth Functions (MSFs) using a Nelder-Mead routine to simulated sky data with Global 21-cm signals. They demonstrate that their fitting routine recovers the same residuals for 7 th , 10 th and 20 th order MSFs. maxsmooth is shown, however, to be capable of producing good fits ≈ 2 orders of magnitude faster than a Basinhopping/Nelder-Mead based algorithm. This is an important improvement when jointly fitting signals, systematics and foregrounds using a Bayesian likelihood loop as in nested sampling (Anstey et al., in prep.).
maxsmooth is also designed to be able to cover the entire available parameter space, unlike a Basin-hopping/Nelder-Mead based routine, by dividing it into discrete parameter spaces based on the different allowed combinations of signs, positive and negative, on the constrained derivatives. The extensive exploration of the parameter space provides confidence in the results and the employment of quadratic programming, a robust method for solving constrained optimisation problems, allows maxsmooth to remain an efficient algorithm.
We have reproduced analysis of the EDGES data using maxsmooth and analysed data from the LEDA experiment with MSFs for the first time. We have highlighted limitations of DCFs when jointly fitting for 21-cm signals and illustrated this using the The resultant residuals, black, found when fitting the LEDA data with an MSF foreground, damped sinusoidal, blue, and additional sinusoidal systematic, purple. Again the bottom panel shows the residuals, red, after just subtracting the fitted foreground model from the data. The further reduction in RMS suggests that both these systematics are present in the data and indicates that the larger systematics in the LEDA data may be represented by the leading order terms in a damped Fourier series. EDGES data. We have shown that in the presence of a smooth signal or no signal that DCFs can incorrectly recover signals that are smooth across the band when jointly fitted with signal models. However, this is not a problem that is unique to DCFs and we have illustrated that it is of equal prevalence when using unconstrained polynomials. We show, also, that MSFs preserve turning points of 21-cm signals more consistently than commonly used low order logarithmic unconstrained polynomial models. This is particularly true of 21cm signals with maximum brightness temperatures, T max ≥ 0 mK and minimum temperatures, T min ≥ −225 mK which feature the strongest deviations, a distinct absorption trough and emission above the background CMB, from the smooth foreground approximated by a ν −2.5 power law. A more detailed exploration of the signal parameter space is needed to fully understand the types of 'detectable' or reproducible 21-cm signals when using DCFs with varying constraints to model the foreground.
Through the EDGES data and LEDA data we have illustrated that MSFs are useful in identifying non-smooth and periodic experimental systematics. This is advantageous for two reasons: it allows for better identification of any Global 21-cm signal present in the data and it allows the causes of the systematics to be better identified leading to iterative improvements in experimental setups. Where systematics with a smooth structure across the bandwidth of interest are also present we expect that these will be fitted out by DCF foreground models.
In the LEDA data, we have identified the presence of a damped sinusoidal systematic and additional sinusoid. We suggest that the similarities between the structure of systematics in the EDGES data and the LEDA data could highlight a larger issue in 21-cm experimentation. Further work is needed to identify a probable cause for such systematics and exploration of similarities between the approaches of the two experiments could help identify these causes with DCFs being the primary tool for foreground modelling.
We suggest here that DCFs may also be used as a tool for identifying low level RFI, weak spectral lines and as illustrated for MSFs by Sathyanarayana Rao et al. (2015), signals from the Epoch of Recombination. In all cases the signals of interest are nonsmooth features masked by higher magnitude smooth signals that can be modelled and removed with DCFs. Applications of DCFs in these fields is left for future work.
with ∇ χ 2 (a * ) = 0 and consequently we have i µ * i ∇g i (a * ) = 0 by equation (A9) leading to the required saddle point.
The algorithm consequently looks for solutions a * for which a non-negative µ * can be found and the KKT conditions can be satisfied. To summaries the conditions are as follows, (i) Stationary Condition: The optimal solution of the prime and dual problems will produce a saddle point in the Lagrangian.
(iv) Dual Feasibility: The Lagrangian multiplier satisfies the inequality µ * ≥ 0. Fig. B1 shows the resultant parameter spaces when fitting a 5 th order MSF to data of the form y = x −2.5 using the logarithmic basis function given by equation (6). As discussed in section 3 the parameter space presented here is unique to the data set and DCF model used. However, it highlights the importance behind the choice of basis and illustrates the differences in the constraints produced when defining the DCF in a different data spaces.

APPENDIX C: STANDARD DERIVATIVE SIGN PATTERNS
In section 4 we introduce the concept of standard derivative sign patterns for particular polynomial structures. To reiterate and enforce this point Fig. C1 illustrates that the derivatives of a polynomial of the form y ≈ x k are all positive, y ≈ −x k are all negative, y ≈ x −k are alternating negative to positive from m = 1 and y ≈ −x −k are alternating positive to negative from m = 1. Since, as discussed in section 3, CVXOPT constrains the derivatives, Ga subject to equation (18) we would expect the optimum maxsmooth signs for an MSF fit to y ≈ x k to be approximately all negative. Similarly for an MSF fit to a polynomial of the form y ≈ −x −k we would expect the optimum signs to be alternating positive to negative for m ≥ 2.
Note that these standard derivative sign patterns are defined in y − x space. The patterns in y − z space will have similar structures and in logarithmic space they are expected to be different however they will still subscribe to a regular structure. This paper has been typeset from a T E X/L A T E X file prepared by the author.  Fig. 2 using a 5 th order MSF of the form given by equation (6) and constrained in log 10 (y) − log 10 (x) space. As with Fig. 2 black regions show regions in which the MSF condition is violated and the coloured regions illustrate sign combinations for which the constraints are upheld. The ranges on the parameters are determined to be 200% on either side of the optimal values from the MSF fit. In each panel two of the parameters are varied while the others are maintained at their optimal values. Here, the regions for which the conditions are violated are narrow and consequently multiple discrete sign spaces are found to produce similar χ 2 values. This strongly suggests that the problem is ill defined and hard to solve using the sign space navigation described in section 4. Top Right: The mock 21-cm experiment data and the MSF fit for which the parameter space is analysed. T refers to the averaged sky temperature and ν to the frequency. Bottom Right: The residuals after subtracting the MSF fit from the data set. . Standard derivative sign patterns associated with four possible standard polynomial data structures. The first row shows example power laws following y ≈ x −k , y ≈ x k , y ≈ −x −k and y ≈ −x k . The second row shows the derivatives of those power laws up to m = 3 and the associated patterns in derivative sign. Note these are not the maxsmooth signs and this is discussed in the associated text.