First Results from HERA Phase I: Upper Limits on the Epoch of Reionization 21 cm Power Spectrum

We report upper-limits on the Epoch of Reionization (EoR) 21 cm power spectrum at redshifts 7.9 and 10.4 with 18 nights of data ($\sim36$ hours of integration) from Phase I of the Hydrogen Epoch of Reionization Array (HERA). The Phase I data show evidence for systematics that can be largely suppressed with systematic models down to a dynamic range of $\sim10^9$ with respect to the peak foreground power. This yields a 95% confidence upper limit on the 21 cm power spectrum of $\Delta^2_{21} \le (30.76)^2\ {\rm mK}^2$ at $k=0.192\ h\ {\rm Mpc}^{-1}$ at $z=7.9$, and also $\Delta^2_{21} \le (95.74)^2\ {\rm mK}^2$ at $k=0.256\ h\ {\rm Mpc}^{-1}$ at $z=10.4$. At $z=7.9$, these limits are the most sensitive to-date by over an order of magnitude. While we find evidence for residual systematics at low line-of-sight Fourier $k_\parallel$ modes, at high $k_\parallel$ modes we find our data to be largely consistent with thermal noise, an indicator that the system could benefit from deeper integrations. The observed systematics could be due to radio frequency interference, cable sub-reflections, or residual instrumental cross-coupling, and warrant further study. This analysis emphasizes algorithms that have minimal inherent signal loss, although we do perform a careful accounting in a companion paper of the small forms of loss or bias associated with the pipeline. Overall, these results are a promising first step in the development of a tuned, instrument-specific analysis pipeline for HERA, particularly as Phase II construction is completed en route to reaching the full sensitivity of the experiment.


INTRODUCTION
The 21 cm line of neutral hydrogen has emerged in the past few decades as a theoretically powerful probe of cosmology and astrophysics by tracing the growth of structure across cosmic time (Hogan & Rees 1979;Madau et al. 1997). At high redshift (z > 6), the 21 cm line is a tomographic probe of the intergalactic medium (IGM) and allows us to directly trace its ionization, density, and temperature state. This is particularly important for understanding the Cosmic Dawn and the subsequent Epoch of Reionization (EoR), when radiative feedback from the first generation of stars and galaxies heated and ionized the IGM over the course of baryonic structure growth from 40 < z < 6. Understanding these two milestones will give us a much better understanding of the formation of the first luminous sources, their environments, and the growth of large-scale structure.
Our current understanding of the timing of the EoR is largely based on measuring the absorption and scattering of the IGM against background sources. These can be either (i) astrophysical sources, such as galaxies and quasars whose Lyman series transmission is sensitive to the ionization state of the IGM; or (ii) the Cosmic Microwave Background (CMB), whose fluctuations are sensitive to the integrated column density of free electrons. Recent analyses of both (i) and (ii) point towards a EoR that is evolving rapidly between 5.5 < z < 7. In particular, IGM damping wing absorption in QSO spectra (e.g. Mesinger & Haiman 2004;Bolton et al. 2011;Davies et al. 2018) and the rapid decline of Lyman alpha emitting galaxies (e.g. Stark et al. 2010;Schenker et al. 2012;Jensen et al. 2013;Caruana et al. 2014;Pentericci et al. 2014;Mesinger et al. 2015;Mason et al. 2018Mason et al. , 2019, suggest that the EoR is ongoing at z ∼ 7, albeit with significant systematic uncertainties. The Lyman-α forest at z ∼ 6 suggests that reionization * NSF Astronomy and Astrophysics Postdoctoral Fellow † Jansky Fellow of the National Radio Astronomy Observatory has mostly completed by then (McGreer et al. 2015). However, the sizable sightline-to-sightline scatter in the forest transmission (Becker et al. 2015;Bosman et al. 2018) seems to require the final, overlap stages of the EoR to extend down to z ∼ 5.5 (Kulkarni et al. 2019;Keating et al. 2020;Choudhury et al. 2021;Qin et al. 2021). This is broadly consistent with the CMB constraints on the electron scattering optical depth (Planck Collaboration 2018) that supports a relatively late EoR with a midpoint around z ∼ 7 (e.g. Mitra et al. 2015;Douspis et al. 2015;Millea & Bouchet 2018;Qin et al. 2020b), and it is also consistent with the high-z galaxy luminosity functions, given reasonable assumptions about their properties (Robertson et al. 2015;Price et al. 2016;Gorce et al. 2018;Hazra et al. 2019;Park et al. 2019;Qin et al. 2020a). However, much remains to be understood about the EoR, such as when it began, the properties of the astrophysical sources that drove it, and what impact it had on future generations of star-forming galaxies. Meanwhile, considerably less is understood about Cosmic Dawn, including when the first Population III stars formed, what their physical and spectral properties were, what their impact was in heating and enriching the IGM, and how this ultimately enabled the emergence of Population II stars and modern galaxies as we understand them. Low-frequency radio experiments are aiming to directly measure the high redshift 21 cm signal and in doing so place constraints on the timing and duration of the EoR to understand the underlying physical processes driving the heating and reionization of the IGM (for reviews, see Ciardi & Ferrara 2005;Furlanetto et al. 2006;Morales & Wyithe 2010;Pritchard & Loeb 2012;Mesinger 2016;Liu & Shaw 2020). Prior and ongoing interferometric experiments include the Donald C. Backer Precision Array for Probing the Epoch of Reionization (PAPER; Parsons et al. 2010;Cheng et al. 2018;Kolopanis et al. 2019), the Murchison Widefield Array (MWA; Tingay et al. 2013;Dillon et al. 2014Dillon et al. , 2015Ewall-Wice et al. 2016b;Beardsley et al. 2016; Barry et al. 2019b;Li et al. 2019;Trott et al. 2020), the Low Frequency Array (LOFAR; van Haarlem et al. 2013;Patil et al. 2017;Gehlot et al. 2019;Mertens et al. 2020), the Giant Metre Wave Radio Telescope (GMRT; Paciga et al. 2013), and the Long Wavelength Array (LWA; Eastwood et al. 2019), which have put increasingly stringent constraints on the power spectrum over the past decade. At the same time, total-power measurements of the skyaveraged signal (or global signal) have similarly put upper limits on the monopole component of the 21 cm signal (Bernardi et al. 2016;Singh et al. 2017;Monsalve et al. 2017), with a tentative first-detection of the 21 cm global signal at z ≈ 17 from the Experiment to Detect the Global Eor Signature (EDGES; Bowman et al. 2018), with a robust discussion of whether this signature may be due to instrumental systematics (Bradley et al. 2019;Hills et al. 2018;Singh & Subrahmanyan 2019;Sims & Pober 2020;Mahesh et al. 2021).
The 21 cm power spectrum at the EoR and Cosmic Dawn contains a wealth of statistical information that can be used as both a cosmological and astrophysical probe (Mao et al. 2008;Patil et al. 2014;Pober et al. 2014;Greig et al. 2016;Ewall-Wice et al. 2016a;Kern et al. 2017). While radio interferometric experiments have indeed made substantial progress over the past decade, a first power spectrum detection has yet to be made. As second-generation 21 cm experiments are designed and built, such as the Hydrogen Epoch of Reionization Array (HERA; DeBoer et al. 2017) and the Square Kilometre Array (SKA; Koopmans et al. 2015), the understanding and control of instrumental systematics will be the crucial factor in enabling their ultimate success in making a first robust detection of the 21 cm power spectrum.
HERA is an interferometric array of fixed, zenith ponting dishes located in the Karoo desert, South Africa. The dishes are 14 meters in diameter and packed hexagonally into a nearly continuously covered core 300m across. The array is being built in a series of phases with simultaneous construction and observing. Phase I used the feeds and correlator from the PAPER experiment (Thyagarajan et al. 2016;Ewall-Wice et al. 2016;Patra et al. 2018;Fagnoni et al. 2021), while Phase II will use a new feed as well as analog and digital systems (Fagnoni et al. 2020). The data reported here come from the second internal data release (IDR2) taken from the first Phase I observing season, which commenced with roughly 50 antennas.
A series of recent papers describe the Phase I analysis pipeline in detail, which includes redundant calibration (Dillon et al. 2020), absolute calibration (Kern et al. 2020a), systematic modeling (Kern et al. 2019(Kern et al. , 2020b, power spectrum estimation and error propagation (Tan et al. in prep.), and pipeline validation (Aguirre et al. in prep.). Complementary studies on the data set discussed in this work also include foreground modeling (Ghosh et al. 2020), imaging (Carilli et al. 2018), power spectrum analysis of the bispectrum phase (Thyagarajan et al. 2020), antenna primary beam characterization (Nunhokee et al. 2017), and electromagnetic modeling of the front-end signal chain (Fagnoni et al. 2021). Here, we give an overview of the full analysis pipeline, discuss the criteria used for data selection, present the power spectrum limits, and describe statistical tests used to characterize the performance of the system and our analysis techniques.
In this paper, we report the first upper limits on the 21 cm power spectrum from HERA Phase I at redshifts 7.9 and 10.4 with 18 nights of observations. These observations were made with only a fraction of the array built (∼ 50 out of 350 antennas), as it was under active construction at the time. Our analysis shows that the data are largely consistent with the expected thermal noise level at large line-of-sight Fourier k modes, acheiving a dynamic range with respect to the peak foreground emission of 10 9 in power. However, at low Fourier k modes we see evidence for residual, low-level systematics that begin to exceed the thermal noise. Nonetheless, the limits presented here are to-date the most sensitive at z ∼ 8 by over an order of magnitude. A companion astrophysical interpretation paper shows that, under standard galaxy astrophysics, these limits disfavour cold reionization scenarios, where the IGM temperature is not substantially heated above its adiabatically cooled limit (HERA Collaboration in prep.).
The paper is organized as follows. In §2 we give a summary of the instrument and the observations analyzed in this work. In §3 we discuss the data reduction pipeline. In §4 we discuss our power spectrum estimation pipeline and present our integrated power spectrum limits. In §5 we discuss a suite of statistical tests used to assess the quality and stability of the final data products. Finally, in §6 we summarize our results.

OBSERVATIONS
Here we discuss the observational parameters used for this analysis, as well as the state of the HERA instrument when these observations were made. As noted, the data discussed in this work was taken with the Phase I instrument, which was a hybrid HERA/PAPER system. Phase I re-purposed the radio frequency (RF) signal chains and correlator from PAPER and attached them to new HERA dishes. The antenna consists of a 14-meter dish with a cross-dipole feed at its focal point measuring two linear polarizations (DeBoer et al. 2017). At 150 MHz the beam has a full width at half maximum (FWHM) of ∼10 • Fagnoni et al. (2021). Not all PAPER front ends could be salvaged and as a stopgap new signal chain components-feed baluns and post-amplifierswere manufactured to be backward compatible with the 75 Ω cables carried over from PAPER. Kern et al. (2020b) characterized the performance of the new and old signal chains for the Phase I system, finding spectral structure across a range of delays that cover the EoR window, which they attribute partially to cable reflections. They also present methods for modeling and suppressing these systematics in the data. Fagnoni et al. (2021) also performed electromagnetic simulations of the Phase I signal chains, finding a broad range of structure induced by both instrument cross-coupling, cable reflections, as well as dish reflections.
The HERA Phase I correlator, re-used from the PA-PER experiment, employs an FX architecture, which was housed in a radio frequency interference (RFI)shielded container in the field. In the F-engine, analogto-digital (ADC) units on a ROACH2 board (Parsons & Backer 2009) digitize each antenna's linear polarization voltage stream, which are then fed to a field programmable gate array (FGPA) that Fourier transforms the signal into voltage spectra with a 100 MHz bandwidth from 100-200 MHz. This is done across 1024 channels, leading to a spectral resolution of 97.66 kHz. The spectra are then sent to a Graphics Processing Unit (GPU)-based X engine that correlates all N (N − 1)/2 cross antenna-polarization pairs and all N auto antennapolarization pairs, where N is the number of feeds. The correlator then integrates the data for 10.7 seconds before writing them to disk. Finally, the data are chunked into 10-minute files and sent to the on-site Karoo Array Processing Center for storage, and eventually transported to the National Radio Astronomy Observatory (NRAO) Array Operations Center in New Mexico, USA via internet connection. Other observational parameters are summarized in Table 1.
The HERA Phase I observing season ran from October 2017 to April 2018 while the array was under active construction. Observations were taken throughout the South African summer at night when the galactic center and Sun are below the horizon, while active construction and commissioning proceeded during the day. A major focus of development was building an online data quality assessment pipeline that reported on radio interference, calibration quality, and other metrics. These metrics, some of which we discuss in the next section, were used to select a subset of data taken during a rel- atively stable epoch in the observing season when construction activities were at a minimum and data quality was consistently high. This resulted in an 18-day dataset that is the basis of this work, referred to as the second internal data release (IDR2). These observations spanned December 10th -28th, 2017 (Julian Dates 2458098-2458116). For each of these days, observations were made for 12 hours per night, of which roughly 10 hours are used for science data when the Sun was below the horizon. Together, these observations span a local sidereal time (LST) of 0 to 12 hours (Figure 1), which overlaps with a radio cold patch at 4 hours LST, as well as the galactic plane at 8 hours LST for HERA's latitude of -30.7 • South. In total, this work considers an 18-night data set (for a total of 180 hours), however, as discussed in Section 3, after additional rounds of data cuts we are left with three fields, each with ∼ 36 hours of integration for power spectrum analysis. The area of the sky observed in the IDR2 dataset is highlighted in Figure 1, which shows the diffuse foreground emission from the galaxy and also identifies the three main fields used for power spectrum analysis (colored boxes). The edges of the data set are not included due to reduced sensitivity of the nightly-averaged data products caused by LST drift of the observations from night-to-night.
At the time of observations, the array had 52 fully deployed antennas arranged in a close-packed fashion, which constitues only a fraction of the full 350 that will eventually be built, and corresponds to a corner of the full array ( Figure 2). However, as we will discuss next, only 39 of these antennas were deemed science quality and were used for data analysis.   -Costa et al. 2008) showing the dominant diffuse foregrounds from the galaxy. Being a non-tracking, zenith pointed array, HERA's field of view is centered at δ = −30.7 • and in theory spans a full 360 • range in right ascension. With a primary beam full width at half maximum of ∼10 • (at 150 MHz), HERA's sensitivity is primarily focused on a narrow strip of sky. The data reported here spans a range of right ascensions from 0 to 12 hours (IDR2), although only three specific fields (colored boxes) largely devoid of bright foregrounds are used for power spectrum estimation.

DATA REDUCTION
Our analysis pipeline is summarized in Figure 3, which highlights the data reduction pipeline (green), the power spectrum pipeline (red), the data products that are input to and output by the pipeline (blue), and the steps that are tested in the companion validation analysis (dashed; Aguirre et al. in prep.). Here we will focus only detailing the data reduction pipeline, which we do in order of their appearance in the pipeline. The primary role of the reduction pipeline is to identify and flag faulty data, solve for the complex gain solutions imparted by the instrument and invert them, average the visibilities coherently across nights, and treat the data for known systematics. Note that, unlike other works, we do not perform any explicit foreground subtraction or filtering in this analysis, although future work may benefit from such techniques (e.g. Ewall-Wice et al. 2021).
Note that all of the important analysis packages can be found in the publicly accessible HERA-Team 1 software repository, including the hera cal, hera pspec, hera qm, and hera opm packages used extensively in this analysis.

Antenna Metrics
Raw data from the correlator are first sent through an antenna metrics stage where faulty antennas are identified and flagged. This is an important pre-calibration step to prevent obviously malfunctioning antennas from adversely affecting the calibration solutions. Metrics are calculated on every 10-minute file for each of the dual-polarization feeds of every antenna, which gives us a sense for how antennas and their linear polarization feeds are behaving throughout any given night.
While a few different metrics were trialed, one metric was found to be the most useful and was the primary metric used to flag faulty antennas. This metric was the mean visibility amplitude for each antenna, which we define for a given antenna i as where V ij (ν, t) is the visibility between antennas i and j at frequency ν and time t, N is the number of antennas in the array, and the sum is taken over all antennas j = i, times and frequencies in the observation. For notational brevity, we will drop the explicit frequency and time dependence of V ij . This metric measures when an approximate estimate of the antenna gain is low compared to most other antennas, which can occur, for example, when the signal chain is not connected to the feed or a gain stage has lost power. It was also used to help determine when an antenna was cross-polarized, or when its East and West polarization signal chains were accidently interchanged in the cable connections downstream. Because the metric is computed before calibration, and thus uses raw data, we cannot make an absolute cut based on the expected noise level of the visibilities in Jansky. Instead, we estimate the standard deviation of the metric across antennas and compute a Z-score, which is defined as the deviation of any given point from its sample mean in units of the sample standard deviation. To account for outliers, we use the more robust, modified Z-score, defined as where x is the mean visibility amplitude metric defined above. 2 The median absolute deviation (σ mad ) is a robust estimator of the standard deviation but requires a correction factor of 1.482 to reproduce the standard deviation in the case of white Gaussian noise (Guthrie 2020). It is generally clear from these metrics which antennas are poorly behaving, nonetheless, we flag antennas that deviate from the sample mean beyond a 5σ level. Antennas that were repeateadly flagged by the antenna metrics stage for nights throughout the data set are flagged outright for all nights. If either of an antenna's polarizations is thus flagged, we flag the entire antenna (i.e. we flag both linear polarizations).
The faulty antennas caught the antenna metrics stage are really only the very worst offenders, shown in blue in Figure 2. Other sub-nominal antennas are flagged subsequently in the process of calibration, which we describe next.

Redundant-Baseline Calibration
One of the foremost challenges for 21 cm cosmology is the task of precision instrumental gain calibration. In general, the measured visibility V obs ij between antennas i and j is related to the true, uncorrupted visibility V true ij via a product of each antennas gain, g i , where n ij is the thermal noise in the measurement (Hamaker et al. 1996;Smirnov 2011). Note that we solve Equation 4 for both linear polarizations independently (EE and NN), which are also implicitely a function of time and frequency. This representation ignores intra-feed D-terms as well as direction dependent gain calibration, which we do not solve for in this analysis (a simulated primary beam model is used for sky-based calibration discussed in Section 3.3). HERA's redundant array configuration opens the possibility of calibrating the relative gains of antennas using internal degrees of freedom among measurements of groups of redundant baselines (Dillon & Parsons 2016 Figure 3. A diagram of the data reduction and power spectrum estimation pipelines, starting with raw HERA data and ending with the averaged power spectrum and their associated null tests. The blue boxes represent data products while the green and red boxes represent steps in the data reduction and power spectrum pipelines, respectively. Boxes with dashed borders represent elements tested by the validation pipeline (Aguirre et al. in prep.). Noise simulations are generated after the data reduction pipeline and fed through the power spectrum pipeline for diagnostic purposes when evaluating null tests.
Redundant-baseline calibration uses the principle that every redundant baseline should measure the same visibility, they in practice do not due to antenna-based gain terms, which allows one to constrain the relative gain of each antenna (Wieringa 1992;Liu et al. 2010). Specifically, redundant-baseline calibration seeks to minimize a χ 2 written as where V sol i−j is the visibility solution for the redundant baseline type with the same vector separation as V ij .
The key distinction in redundant-baseline calibration highlighted here is that the visibility solutions are left as free parameters to be solved for in the process of calibration along with g i and g j . For highly redundant configurations like HERA, this leads to an overconstrained system of equations that can be solved to yield estimates of the gains in addition to the model visibilities.
Due to inherent degeneracies in the system of equations, redundant-baseline calibration cannot solve for the overall amplitude and directional phasing (the "Tip-Tilt" phasor) of the antenna-based gains (Liu et al. 2010;Zheng et al. 2014;Dillon et al. 2018;Li et al. 2018;Byrne et al. 2019). Consequently, these must be solved for independently and added to the redundant calibration gains, commonly referred to as absolute calibration.
A complete description of the redundant-baseline calibration of HERA appears in Dillon et al. (2020), in-cluding a quantitative assessment of the redundancy of HERA via, e.g. the deviation of χ 2 for the expected value in a perfectly redundant array. For more detail on the series of algorithms with which we minimize χ 2 (Equation 5), see Dillon et al. (2020). Furthermore, the subsequent process of solving for the degenerate modes (i.e. absolute calibration) is described in Kern et al. (2020a).
In Figure 4 we show the results of redundant-baseline χ 2 -minimization for different nights in our analysis. Our results show clear evidence for non-redundancy in excess of the thermal noise at the ∼20% level in the visibility. Dillon et al. (2020) attributes much of this excess to small deviations in the placement of dishes and to minor antenna-to-antenna variation of the primary beam. Some of it is also likely attributatble to baselinedependent systematics (Kern et al. 2020b) which break the assumption in Equation 4. As discussed in Dillon et al. (2020), the statistic is not consistent with one, meaning there are extra sources of variance in the data beyond noise that cannot be constrained by antenna-based calibration. We assess the effect of nonredundancy on our final power spectrum limits in Section 3.11.

Absolute Calibration
To finish calibration, we need to solve for the degeneracies of redundant-baseline calibration by referencing to the sky. One way to do this is to compare a model of the true visibilities to redundantly-calibrated visibili-  . Redundant calibration attempts to minimize χ 2 , defined in Equation 5, using a model that assumes that redundant baselines observe the same true visibilities up to their antenna-dependent gains and the thermal noise. Were this true, we exect that χ 2 normalized by the number of degrees of freedom (DoF) in the model (approximately 520 in this analysis, see Dillon et al. (2020) for a precise quantification) would have a mean of one and follow a χ 2 -distribution.
Here we show the distribution of χ 2 /DoF over polarizations and unflagged (see Section 3.4) times and channels for different nights in our analysis. While consistant from night-tonight, the overall mean of 1.389 is clearly inconsistent with one, indicating persistant non-redundancy at levels roughly 20% in excess of the thermal noise. ties and varying the degenerate parameters to make the two match. In this work, we use a series of previously calibrated visibilities over the full LST range as our set of reference (or model) visibilities that we use to extract the degenerate components of the calibration gains.
We build our reference visibilities following the skybased calibration methodology described in Kern et al. (2020a), which uses the Common Astronomy Software Applications (CASA; McMullin et al. 2007) software to construct visibility models at specific fields where we expect to be able to model the dominant contributions of the flux on intermediate and long baselines. Specifically, we select three fields transiting our FoV at 2.0, 5.2, and 14.4 hours LST. These fields are chosen to have minimal diffuse foregrounds within the FoV, and have a bright point source from the MWA GLEAM catalog (Hurley-Walker et al. 2017) near the field's zenith pointing, which is used to confirm the absolute flux density scale of the calibrated data.
For each of the three fields we pick a different night in the Phase I observing season, selected such that the field transits roughly halfway through the nighttime observations. We then select-out five minutes of observations centered at the field transit, average the data and perform calibration using standard CASA routines (e.g. gaincal and bandpass). These gains are then transferred to every integration for that observing night, leaving us with three nights of calibrated night, each of which span a slightly different range of LSTs. Drift in the gain amplitude throughout the nighttime observation is not corrected for, but is expected to drift at the ∼ 3% level (Kern et al. 2020a).
Finally, these three sets of visibilities are averaged onto a common LST grid spanning nearly 20 hours of LST, which yields our full set of reference (or model) visibilities that we use for absolute calibration. After averaing, the visibilities are passed through a Fourier low-pass filter across frequency, which filters out all signals with delays beyond the baseline horizon delay plus an additional 50 nanoseconds. This suppresses noise in high-delay structure that is not associated with foregrounds, and also fills-in flagged frequency channels with a best-guess of the foreground structure using a delaydomain deconvolution (similar to the inpainting algorithm described later; Parsons & Backer 2009;Kern et al. 2020a). Figure 5 shows CLEANed, multi-frequency synthesis images (across 135−−165 MHz) of the three fields we use for constructing the reference visibilities, demonstrating the point source distribution in each field. It also shows the Julian Date (JD) of the night used to calibrated each field as well as range in LSTs of each of the nights (green, orange, and blue bars). Stacking these visibilities onto a common LST grid gives us our final "Full Model" (purple bar) that we use in performing absolute calibration of the nightly data in the Phase I data set considered here. It should be noted that the gains solved for are inherently direction-independent gains, with no cross-feed D terms.

Radio Frequency Interference Flagging
The Karoo Astronomy Reserve is an radio frequency interference (RFI) quiet zone, and as such much of the frequency band between 120 to 180 MHz is relatively clean of interference. Nonetheless, we still see narrowband RFI due to satellite and terrestrial transmitters, as well as rare occurances of wideband RFI above 180 MHz. Our RFI detection algorithm relies on the detection of local outliers, using the distribution of surrounding data points in time and frequency to distiguish RFI from thermal noise fluctuations. Our pipeline uses a number of data products to identify not just interference but also general problems with the array, such as correlator malfunctions. As its input it takes raw HERA data, the gain solutions from redundant-baseline calibration and the redundant visibility solutions, its χ 2 Figure 5. Construction of the absolute calibration reference visibilities (purple) from a set of three nights throughout the observing season. Each night is calibrated at a single field that transits near midway through the night. The calibration is transferred to all integrations from that observing night, and the visibilities are then LST binned onto a common grid to form the full model spanning nearly 22 hours in LST. Drift in the gain amplitude through each night is not corrected for, but is expected to drift at the ∼3% level (Kern et al. 2020a).
distributions, the absolute calibration gains, and their χ 2 distributions.
For each data product (except the raw visibilities) a corresponding modified Z-score metric is computed, Z mod , which is formed by median-subtracting the data product with a median filter and then normalizing it by an empirical estimate of its noise. Performing the median filter across frequency and time helps to identify occasional sources of wideband RFI from broadcast televsion that can be as wide as 8 MHz (Wilensky 2020). These metrics are averaged in quadrature across all baselines, antennas, and polarizations to increase the sensitivity to low-level contamination, resulting in a single metric as a function of time and frequency for each data product.
Next we flag these metric waterfalls with an initial threshold of Z mod ≥ 5, followed by a watershed algorithm where any time and frequency pixel adjacent to a flagged pixel is itself flagged if it exceeds a lower threshold of Z mod ≥ 2 (Kerrigan et al. 2018). We found that different types of contaminants will manifest more brightly in certain metrics, so each metric is flagged individually.
Having removed the brightest sources of RFI, we now apply the resultant flags and repeat the flagging procedure; however, this time we compute the standard Z-score metric using a mean filter rather than a median filter. The mean filter is considerably faster, so at this stage we are also able to include the raw visibility data as an additional metric. Just as before, we compute these metrics, average them in quadrature, and apply the same watershed flagging routine on the averaged metrics. The result is a single waterfall of flags for all baselines. This entire routine is performed on every 10-minute file over the course of a nightly observation.
Our last step is to examine each metric waterfall over the entire night and use use medians to collapse them to a single time series or spectum. These are again flagged by computing a modified Z-score for each channel or integration, flagging any above 7 or any above 3 that are adjacent to a previously flagged channel or integration. This step helps us find global outliers in time that may not be obvious from the analysis of a signal file and to find channels that are so frequently occupied by RFI that it safer to assume they are always RFIcontaminated.
The strongest forms of indentified RFI are persistent, narrowband emitters, which can be clearly seen in the calibration gain solutions (e.g. Figure 6). However, the final flagging mask applied to the nightly data excises both narrow and wideband features (e.g. the left panel of Figure 8). The contaminating features in the data include FM radio (ν < 111 MHz), ORBCOMM satellite communication (ν = 137 MHz), broadcast television (ν > 174 MHz), as well as intermittent correlator integration failures (wide band).
In addition to flagging the data for RFI and correlator failures, we also flag 50 channels on either edge of the frequency band where the bandpass filter falls off steeply, as well as all integrations where the Sun is at or above the horizon due to issues it creates in our calibration. Lastly, after LST binning (described below), the averaged visibilities are manually inspected for low-level narrowband RFI that was missed by the nightly RFI excision. In this process we find and flag a handful of frequency channels (for all baselines and times), which amounts to < 1% of the total data.
On average, we flag roughly 15% of the band in our RFI flagging step. This estimate excludes our routine flagging of the band edges due to the rolloff of the bandpass filter (∼ 10 MHz on either side of the band), as well as the complete flagging of certain times due to problems with correlator failures. Through visual inspection, we see areas where our pipeline is likely overflagging the data. This suggests that our flagging strategy is somewhat more aggresive and could be improved to reduce the false positive rate. Future work will investigate how more precise RFI excision techniques, like the matchedfilter SSINS package (Wilensky et al. 2019), can improve the overall RFI identification performance.

Gain Smoothing
While the goal of gain calibration is to remove spectral structure introduced by the instrument, it is a doubleedged sword and can also impart spurious spectral structure. Many sources of uncertainty come into play when calibrating a real instrument, including baselineto-baseline non-redundancies (Orosz et al. 2019), incomplete sky flux density models (Byrne et al. 2019), and various baseline-dependent instrumental systematics. A number of tehcniques have been proposed to mitigate this effect, including the consensus optimization technique (Yatawatta 2015(Yatawatta , 2016 used in LOFAR power spectra (e.g. Patil et al. 2017 andMertens et al. 2020) and fitting low-order polynomials to averaged gains, a technique employed with the MWA (e.g. Barry et al. 2019a,b). Kern et al. (2020a) discussed some of these effects for HERA, and found their impact to be most severe for |τ | > 100 ns. They introduce a Fourier filtering procedure for low-pass filtering the gains after calibrating out a per-antenna delay, which mitigates the impact these spectrally-dependent gain errors have on the data. This filter is a two-dimensional frequency and time iterative deconvolution algorithm that both acts as a low-pass filter and also fills-in missing calibration solutions due to RFI flagging. It is conceptually similar to the Hogbom CLEAN algorithm (Högbom 1974). Dillon et al. (2020) showed that per-integration redundant-baseline calibration exhibits time dependent structure that is LST-locked and argued that this variation was due to bright sources moving through direction dependent non-redundancy. Examining the temporal power spectra of gain solutions after dividing out an LST-locked nighly average revealed no evidence for intrinsic gain fluctuations on timescales shorter than 6 hours. Likewise, Kern et al. (2020a) show that the gains drift in amplitude with changes in the ambient temperature, which varies slowly over the course of the 10-hour nightly observation. Therefore, we also use the 2D lowpass filter to remove temporal structure in the gain solutions on all timescales shorter than 6 hours. This substantially mitigates sky-dependent effects, since the filtering timescale is much longer than the beam-crossing time-approximately 40 minutes at 150 MHz, given a 10 • FWHM (Neben et al. 2016). Figure 6 shows the progression of the antenna gains from redundant-baseline calibration (left), absolute calibration (center), and gain smoothing (right). Redundant-baseline calibration is prone to jumps in the degenerate subspace of calibration solutions, particularly in phase. These are solved by absolute calibration, which fills-in the degenerate components, thus mitigating the phase discontinuities in time introduced by redundant calibration. Next the fast and localized spectral and time-dependent features seen in the redundant and absolute gains are filtered out by the smoothing step. Gain filtering occurs after RFI identification, meaning we can input our flag masks into the Fourier filtering algorithm to deconvolve them out and fill in the missing pixels with a model of the smoothly varying components. 3 This is reflected in the sharp spikes in the gains in RFI-dominated pixels that are smoothed over after   Figure 6. The progression of a single antenna based gain through the calibration process of a single night, showing the gain amplitude (top panels) and phase (bottom panels). Redundant calibration (left) solves for the relative gain but is susceptible to phase jumps at certain times and requires certain degeneracies, like the overall gain amplitude, to be filled in. Absolute calibration (center) solves both of these problems by pinning the degenerate components and removing the phase jumps. Lastly, the final gains are smoothed across frequency and time (right) to limit excess structure (Kern et al. 2020a;Dillon et al. 2020)-for example, amplitude fluctutations caused by the passage of Fornax A around 3.4 hours seen in the top left and middle panels. Note that the process of gain smoothing is applied to the center column, not to the first column where phase jumps are apparent.
gain filtering ( Figure 6). Gain smoothing is performed independently for each antenna and linear polarization.
We are left with a set of gains spanning all frequencies and times of our data on a nightly basis that has been filtered from the initial solution at each freuqency and time sample to a restricted number of degrees of freedom for each antenna and linear polarization. The total bandwidth of our data leads to a delay resolution of ∆τ = 10 ns meaning we keep 21 spectral degrees of freedom, and the nightly observation duration is 10 hours, leading to 3 temporal degrees of freedom, for a total of ∼ 63 degrees of freedom in each antenna-polarization gain. 4

LST Binning
Having calibrated each night independently, we next coherently average them at fixed LST. Due to sidereal drift night to night, each observing night has a slightly offset LST grid, which must be accounted before LST binning. To do this, we establish a single LST grid span-4 The gains are complex quantities and our filter extends from −100 < τ < 100 ns, giving us 21 independent Fourier modes: 10 on each side of the τ = 0 ns mode. Likewise, we keep the temporal modes correspoding to the −1st, 0th, and 1st Fourier modes.
ning 0 to 24 hours at a cadence of 21.4 seconds, which is double the native correlator integration time and helps to increase the sample count in the subsequent outlier rejection routine. For each of these LST bins, we take data from all nights that fall within the bounds of this bin and rephase their pointing centers to the center of the LST bin. We then perform another round of outlier rejection to help flag missed RFI and other problems with the data that do not repeat night-to-night at the same LST (as the cosmological signal should). We do this by looking at the distribution of visibilities to be binned together for every particular LST, frequency, and baseline combination and rejecting individual samples across the nights that have a modified Z-score (as defined in Equation 2) greater than 4σ. We set the threshold at 4σ such that the clipping has an effectively negligible impact on the Gaussian noise distribution, yet still rejects strong outliers that were for whatever reason not caught in our initially flagging round. Clipping is performed on the real and imaginary component of the visibility separately, but if either are clipped in the process we flag the pixel entirely. Indeed, in practice we find that the sigma clipping routine tends to catch clear broadband artifacts due to correlator failures that were not initially caught. For the ∼18 nights of nightly data sampled at a cadence of 10.7 seconds, each LST bin contains on average 36 samples as input to this sigma clipping routine. All data in the bin that were not originally flagged or flagged during the sigma clipping routine are then uniformly averaged onto a single LST grid.
A key requirement for LST binning is that each night is calibrated accurately, otherwise the averaged data will decohere. We assess this by looking at the variability of the calibrated data across all nights in the data set. Figure 7 shows the delay-transformed visibilities across each night from a single baseline and at a fixed LST. The dashed black line shows the root-mean-square (RMS) of the visibilties across nights. At high delays we see the RMS is consistent with the apparent noise level of the nightly data; however, at low delays the RMS rises to a level that is ∼3% of the peak foreground power. The most likely origin of this is the night-to-night variability of the gain calibration, which would put the gain errors on the ∼1.5% level. This is consistent with the fidelity of the Phase I calibration pipeline assessed in Aguirre et al. (in prep.).

Data Inpainting
Flagging masks due to RFI and other instrumental problems are generally fairly complex, with a non-trivial frequency, time, and antenna dependence. Flagging masks presents a challenge for power spectral analyses that make measurements in the Fourier domain as they induce strong sidelobes in non-periodic signals. This is especially troublesome for 21 cm measurements, which require high dynamic range signal separation against bright foregrounds in the Fourier domain. A related effect is produced when time averaging masked visibility data with differing frequency flags as a function to time. The resultant spectrum contains a complex weight structure as a function of frequency which can manifest as sidelobe structure of bright foregrounds (Offringa et al. 2019).
A common solution to the problem of masked or missing data is to inpaint the masked data with a best-guess of the signal that should have been measured in that bin. A wide variety of algorithms have been developed to solve this kind of problem; the CLEAN algorithm (Högbom 1974), for example, is one way to do this, and is in fact how we addressed the problem of masked data in our gain smoothing procedure of Section 3.5. In this work, we use a similar delay-based iterative deconvolution algorithm to fill-in the missing data in the LSTbinned dataset (Parsons & Backer 2009;Ali et al. 2015;Kerrigan et al. 2018). The deconvolution is performed across nearly the full bandwidth, spanning 115 -185 MHz, and is performed on a per-integration basis down to the thermal noise of each visibility. The algorithm finds model components in delay space across a window spanning −2000 < τ < 2000, which is chosen to encapsulate all of the strong features seen in the data across all baselines. Figure 8 shows an example of the LST-binned data products and the results of delay deconvolution-based inpainting. Inpainting only affects the pixels where the data are masked, and because it is done only across frequency, integrations that are fully flagged remain fully flagged. Aguirre et al. (in prep.) study the accuracy of the inpainting algorithm described here, showing that its performance is degraded for widechannel gaps, like the 138 MHz ORBCOMM features (this is furthermore seen in our spectral window null test Section 5.4). Kern et al. (2020b) and Fagnoni et al. (2021) outline the major systematics seen in the HERA Phase I system, which are attributed to cable reflections in the 150-meter coaxial cable between the feed and the node and in the 20-meter cable between the node and the digitization stage, in addition to signal chain cross-coupling systematics (e.g. mutual coupling). Fagnoni et al. (2021) use electromagnetic simulations of the HERA front end to show that mutual coupling (i.e. dish-to-dish communication) can appear as a decaying shoulder in the autocorrelation delay spectrum from 100 − 500 ns. They also predict the presence of a cable reflection at the termination of the coaxial cables, as well as a series of com-  plicated sub-reflections in the 150-meter cable spanning 100 − 1200 ns that can be unresolved by the native delay resolution of the data. Inspection of the data by Kern et al. (2020b) affirm the presence of a shoulder in the auto-correlation delay spectrum, however, its exact origin is not quite clear. Kern et al. (2020b) speculate that the shoulder could in fact be due to the cable sub reflections and not mutual coupling, in part because it can be effectively modeled and suppressed via a reflection calibration procedure, which is not to be expected from a highly direction-dependent systematic like mutual coupling. Future work will seek to more clearly understand these systematics via improved electromagnetic beam modeling and visibility simulations, which explicitly include mutual coupling effects.

Systematic Modeling
To deal with the sub-reflections, we employ the same reflection fitting algorithm on all antenna autocorrelations from the LST binned data, iteratively solving for 25 individual reflection terms within a delay range of 150 -1500 ns. These parameters were chosen manually after visual inspection of the residual visibilities. All reflection systematics are modeled at a 21.4 second cadence and then smoothed in time with a similar gain smoothing procedure employed in the first round of calibration.
The most prominent source of systematics observed in Phase I data is the presence of high delay features in the visibilities that are nearly constant in time, thus occupying the fringe-rate 0 Hz mode, and roughly symmetric at negative and positive delays (Kern et al. 2020b). For baselines with a projected East-West separation ≥ 14 meters, this systematic can be effectively filtered out from the data without significantly reducing the mea-sured EoR power (Kern et al. 2019). However, this does lead to a small amount of signal loss on the cosmological signal that we correct for in the power spectra (Section 3.11). Overall, the combination of reflection calibration and cross-coupling filtering leads to roughly two orders of magnitude of suppression in instrumental systematics spanning 0.1 < k < 0.8 h Mpc −1 . If not enacted properly, this step can lead to non-negligible amounts of cosmological signal loss, and is therefore vetted in our validation pipeline to ensure it performs as we expect it to (Aguirre et al. in prep.).

Faraday Rotated Foreground Emission
After removing instrumental systematics from the data, we found low-level excesses in the power spectrum at high delays (beyond the foreground horizon) that transited on a beam-crossing timescale. These excesses decreased their characteristic delay at increasing frequency, suggestive of Faraday rotation. We investigated these systematics both in instrumental and pseudo-Stokes polarization visibilities, the latter of which are simply defined as a linear combination of the instrumental polarization visibilities as where the East-East (or EE) instrumental polarization is an East-facing feed correlated with another East-facing feed, and likewise for the North-North (or NN) polarization. We are able to identify these systematics as Faradayrotated foreground emission through a few pieces of evidence. First, the characteristic period of the frequency oscillations (i.e. the delay or k mode of the systematic) decreases at higher frequencies. This is shown in Figure 9, where the four-panel plot shows the delay transform of the instrumental and pseudo-Stokes visibilities over four different spectral windows with increasing central frequency (but fixed bandwidth of 10 MHz), showing a notable decrease in the k mode of the systematic at higher frequencies. In probing the relationship between the systematic's peak k and the central frequency of the spectral window, we find a tight relationship that follows the theoretical expectation of Faraday-rotated foregrounds found in Equation 3 of Moore et al. (2017) (see also Jelić et al. 2010;Asad et al. 2015). Finally, we can perform a direct rotation measure synthesis of the data to look for known sources of foreground emission with high rotation measures in the field-of-view. Note that we make the assumption here that the inferred rotation measure is equal to the Faraday depth, which amounts to assuming that the rotation measure is locally compact along the line of sight. We do this by forming a spectral cube of dirty Stokes Q and U images and then take their rotation measure synthesis for each pixel on the sky. This yields the Faraday depth distribution, written as where Q and U are the Stokes Q & U maps, λ is the observing wavelength, and φ is the Faraday depth in radians meter −2 (Brentjens & de Bruyn 2005;Kim et al. 2016). Note we do not actually form the true Stokes Q and U maps, but simply image the pseudo-Stokes Q and U visibilities, which is a good approximation of the Stokes parameters near the center of the FoV. Doing this and scanning through φ reveals a bright point source at the center of the field (right panel of Figure 9), which aligns exactly in location and in the inferred rotation measure with a known pulsar (PSR J0742-2822; Lenc et al. 2017). There are few other cases in the data where we see a high delay excess and can connect it to a known pulsar in the Faraday depth maps; however, they are seen at fairly low SNR. While these systematics appear above the noise in the EE, NN, pseudo-Stokes Q and U visibilities, they do not appear appreciably in the pseudo-Stokes I visibility (Figure 9). While this is expected, it is also true that intrinsic Q→I leakage from primary beam asymmetry between feeds can cause these systematics to appear at suppressed levels in the Stokes I power spectrum (Moore et al. 2013;Asad et al. 2016Asad et al. , 2018. Feed and LST jacknife tests could help to determine if future, deeper data sets are beginning to detect these systematics in Stokes I.

Coherent Time Averaging
While HERA observations are taken in drift-scan mode, integrations taken at nearby LSTs can be coherently averaged if they are phased to a common pointing center . However, summing integrations that are separated by too much in time will begin to decohere the averaged signal due to drift in the primary beam weighting, in addition to the inability to properly re-align the fringes across the entire sky with a single fringe-stopping procedure. We evaluated the amount of loss induced by time averaging by generating an ensemble set of mock EoR visibilities and averaging them with increasingly large time windows. We settled on a maximum time window of 428 seconds, which leads to an average of ∼1% decoherence of EoR power that is scale independent. In Aguirre et al. (in prep.), this test is repeated with a different kind of EoR model and report a similar finding of ∼1% decoherence in power. Recall that our power spectrum estimator cross multiplies adjacent time bins to avoid a noise bias. In order to ensure that data separated by more than 428 seconds are not cross-correlated, we actually halve the coherent averaging time window to 214 seconds.

Decoherence due to Non-Redundancies
HERA's array layout has a large number of instantaneously redundant baselines. Averaging the visibilities directly rather than their power spectra provides an extra boost in sensitivity by a factor of √ N baselines , and is a key factor in HERA's overall sensitivity. The instrument is not perfectly redundant however -variations in baseline length and orientation, antenna primary beams, and even the calibration itself are expected at some level (Orosz et al. 2019). Deviations from perfect redundancy will cause the signal to decohere to some extent under redundant averaging, and if these deviations are large enough, some degree of signal loss will be sustained.
Here we develop an empirical metric to quantify the amount of signal loss incurred due to decoherence when averaging slightly non-redundant visibilities. The impact of non-redundancy-based decoherence can be estimated by comparing the foreground power spectrum amplitude derived using coherent redundant averaging compared to incoherent redundant averaging. The incoherently averaged power spectrum, P inc , is constructed by first estimating power spectra for each baseline-pair in the redundant group and then averaging them together such that where V 1 , V 2 , . . . indexes individual baselines in a group, V is the Fourier transformed visibility, and N is the number of baselines in a group. Since phase information is discarded by the initial power spectrum estimation step, phase errors cannot combine destructively when the individual spectra are subsequently averaged. Thus, P inc should be less susceptible to phase nonredundancies (e.g. caused by miscalibration or primary beam variations). A cross coherent power spectrum is constructed by cross multiplying all redundant baseline pairs and then taking their average, which does allow phase errors to combine destructively. Note that this is very similar to averaging the redundant visibilities and then cross multiplying them, except for the explicit exclusion of baseline terms paired with themselves. In other words, we form the cross coherent power spectrum as where M is the number of baseline-pairs in a group. In the ideal limit of perfect redundancy there should be no decoherence effect as each baseline is an exact replication of the signal. Thus the coherent and incoherent spectra should measure the same value, with the exception of the different integrated thermal noise level which, as we explain later, is a negligible difference for the delay modes of interest. Note that this is not the estimator used to estimate the 21 cm power spectrum is later sections, this is only used as an approximate estimator for the purposes of assessign signal loss.
To measure the amount of decoherence induced by the coherent average we compute the fractional loss in power, where t denotes LST, τ is delay, and . . . denotes a time (LST) average. The time average is necessary to provide a stable baseline that limits spikes in the ratio when P inc goes through a null. Note that the delay of the visibility, τ , is simply the direct Fourier dual to frequency in the Fourier transform, with units of 1/seconds. What we are interested in is determing how the EoR signal is suppressed in the coherently averaged power spectrum due to non-redundant decoherence. Being an isotropic cosmological signal, the EoR field is statistically invariant across the sky, meaning our sensitivity to it comes predominately from the main lobe of the primary beam. In order to use the measured foreground emission, which is not statistically isotropic, as a proxy for the amount of EoR signal loss we will sharpen the metric in two ways. First, we only inspect the τ = 0 ns mode, which isolates smooth spectrum foreground emission to a strip intersecting the main field of view of the primary beam. And second, we choose to evalute the metric at times where diffuse foregrounds fill the main lobe of the primary beam, thus upweighting the parts of the beam where we expect the EoR signal to be most sensitively measured. For HERA, this occurs most prominently at the transit of the galactic anti-center. The behavior of this metric has been explored in an accompanying paper (Choudhuri et al. 2021). In this study, numerical simulations of a small HERA-like array were performed with different types of primary beam non-redundancies and calibration nonredundancies. The ∆χ foreground metric was found to accurately reflect the degree of non-redundancy, signal loss and modulation effects in the recovered EoR power spectrum. Its behavior in the presence of baseline nonredundancies was not studied. Figure 10 shows the ∆χ metric at τ = 0 ns for several redundant baseline groups. We see an average of ∼1% power loss, suggesting that the real non-redundancies between baselines within a redundant group do not significantly decohere the sky signal. The worst case is the b = 50.6 m baseline, which sees an average decoherence of about 3%. Inspecting the visibilities of this baseline we see that this is due in part because this baseline experiences a particularly strong null in power compared to the other baselines, which systematically brings down P inc in the denominator of the metric, causing the overall ratio to slightly inflate.

Other forms of Signal Loss in the Pipeline
Here we tabulate all of the identified forms of systematic signal loss that arise from the analysis pipeline. Recall that the philosphy of the analysis employed here is one that is generally meant to minimize the inherent loss of the analysis pipeline, therefore, corrections to the loss that have been identified are generally a small fraction of the total measured power.
Some forms of loss are entirely expected, like from the time averaging of drift-scan visibilities, and other forms were not expected but found in the process of pipeline validation (Aguirre et al. in prep.), the most notable being the absolute calibration bias. This bias was due to the logarithmic linearization of the gain amplitude parameter in the process of absolute gain calibration, which when employed on visibilities with low signal-to-noise can result in a bias in the recovered amplitude (Boonstra & van der Veen 2003). Because we calibrate HERA's drift-scan observations every 10.7 seconds, there are some fields where this bias can be nonnegligible, and reaches into the percent level on the gains. After gain smoothing, this bias is shown to be Note-Percentage loss in power for Band 1 (Band 2), which are corrected for after forming the power spectrum. These figures are partially derived from the validation analysis (Aguirre et al. in prep.), and from this analysis ( Figure 10).
largely time independent but does have a slight frequency dependence (Aguirre et al. in prep.). We estimate this bias as a single number for each of the two power spectrum bands (Section 4.3) by making multifrequency synthesis images of the calibrated and LST binned visibilities and comparing them to images of the initial absolute flux density model. The ratio of a zenithlocated point source in the data and model is used as the correction factor. All forms of loss measured here are EoR model independnet and scale independent (i.e. flat across k). In some cases we have physically motivated reasons to expect this (e.g. an overall bias in absolute calibration is scale and EoR model independent) and in other cases we have explicitely tested this (as is the case for coherent time averaging and the non-redundancy decoherence discussed above). These are tabulated in Table 2 and are each corrected for at the end stage by correcting the power spectra.

HERA PHASE I POWER SPECTRUM LIMITS
In this section we outline our power spectrum estimator, giving a brief overview of quadratic estimators (QE), our power spectrum pipeline, and present our derived power spectrum limits from the analysis discussed in this work. Our power spectrum estimator codebase, hera pspec, 5 is publicly available software. Throughout this work, we adopt a ΛCDM cosmology (Planck Collaboration et al. 2016) with Ω Λ = 0.6844, Ω b = 0.04911, Ω c = 0.26442, and H 0 = 67.27 km/s/Mpc.

The Quadratic Estimator Formalism
The 21 cm power spectrum is the square of the three-dimensional spatial Fourier transform of the 21 cm brightness temperature field. Given that in the flat-sky  Figure 10. Redundancy decoherence test for 9 redundant groups, marked by the baseline length and angle in local array XYZ coordinates (blue dashed is their average). This plots the difference of the coherently and incoherently averaged power spectrum normalized by the time-average of the latter. We show the metric for the τ = 0 Fourier mode at LSTs when bright, diffuse foregrounds fill the primary beam. On average, we see roughly 1% power loss, suggesting that while our final set of redundant visibilities are not perfect, they are redundant enough to retain the vast majority of sky power in the main lobe of the primary beam when forming baseline-to-baseline cross power spectra.
limit the interferometric visibility is the 2D transverse spatial Fourier transform of the temperature field, we can estimate of 3D 21 cm power spectrum by taking the Fourier transform across frequency, where the delay domain (τ ) is the Fourier dual of frequency. Note that delay modes are not a direct mapping to the line-of-sight k wavevector; however, for short baselines they are nearly the same and approximating τ modes as k modes is known as the delay approximation (Parsons et al. 2012;Liu et al. 2014a). This is written asP whereP is our estimate of the power spectrum, X and Y are scalars mapping sky angles and frequency to cosmological distances, respectively, Ω pp is the sky integral of the squared primary beam response and B is the Fourier transform bandwidth. Furthermore, the subscript on our visibilities, V 1 and V 2 , indicates that we are crossmultiplying visibilities with independent noise realizations, meaning that the estimated power spectrum has no noise bias. As written, the power spectrum P has units mK 2 h −3 Mpc 3 . The cosmological wavevectors are related to the natural telescope units of delay and baseline length (Parsons et al. 2014 where X = c(1 + z) 2 ν −1 21 H(z) −1 , Y = D(z), ν 21 = 1.420 GHz, H(z) is the Hubble parameter, and D(z) is the transverse comoving distance (Hogg 1999;Condon & Matthews 2018). We can further define the cosmologically dimensionless power spectrum, which has units of mK 2 . The "Fourier transform and square" delay spectrum estimator can also be cast into the more general form of a quadratic estimator, whereq α is the unnormalized estimate of the α bandpower, x 1 and x 2 are the visibility vectors we will crosscorrelate, R is any weighting matrix we may want to apply to the data (e.g. the inverse covariance in the optimal quadratic estimator case), and Q α is the mapping from data space to power spectrum space. In general, this mapping is given by the derivative of the covariance with respect to bandpower α, i.e. Q α = C ,α = ∂C ∂pα . We use C to mean the true covariance matrix of the visibilities, which in general is never known exactly. Estimating the covariance empirically is a subtle process, and if not done accurately risks biasing the power spectrum estimate (Ali et al. 2018;Cheng et al. 2018). The lack of access to the true covariance also makes accurate error estimation tricky, as we will discuss in more detail later in this section. For our purposes, we simply define Q α = c † α c α , where c α is a discrete Fourier transform operator that takes the (weighted) data to delay space. We denote our data vectors as x 1 and x 2 because in this work we opt to form cross power spectra between data vectors with independent noise realizations, which eliminates the noise bias term that can be difficult to estimate precisely Ali et al. 2015;Pober et al. 2016).
To normalize our power spectrum estimate we gather our unnormalized bandpowers into a vector and form the quantity,p = Mq, where M is the normalization matrix andp is the normalized bandpower estimates. Our estimated power spectra are related to the true bandpowers by the window function matrix,p = Wp, where p is a vector holding the true bandpowers. While p is never known a priori, we can use the window function matrix to appropriately map a theoretical EoR power spectrum to the basis of the measured power spectrum. Relating this to our normalization matrix we have where H is the response matrix of the bandpowers, defined as In the case where R = C −1 , our estimator becomes the optimal estimator, and H becomes F, the bandpower Fisher matrix (Tegmark 1997). The last component of our power spectrum formalism is the bandpower covariance matrix, 1 E α x 2 and substituting this in, we get that the power spectrum covariance is where the total covariance, C = xx † = S + N, can be written as a sum of the sky signal and noise covariance . While the noise covariance of our data is straightforward to quantify, the signal terms are considerably harder. We will come back to how we estimate the bandpower errors in practice. An unbiased estimate of the power spectrum is made by building a window function matrix whose rows sum to unity, giving the analyst some freedom on exactly how to construct M, with statistical implications forp and Σ (Tegmark 1997; Liu & Tegmark 2011;Ali et al. 2015). Choosing M = H −1 , for example, yields W = I, meaning the bandpowers are independent; however, we also see that the bandpower errors become quite large and correlated. In this work we choose arguably the simplest approach, which is to set M to a diagonal matrix. This minimizes the resultant error bars at the expense of measurements that overlap in Fourier space and also have slightly correlated errors. This returns us to the simple delay spectrum estimator, but framed in the machinery of a quadratic estimator, with the diagonal of M given by the coefficients of Equation 12.  Figure 11. A frequency and LST dependent model for the system temperature of a particular antenna that is used to compute power spectrum error bars, and also to generate realistic thermal noise realizations of the data.
Our quadratic estimator is sub-optimal in the sense that we do not weight our data by the inverse covariance matrix; however, this also safeguards our estimator against concerns about signal loss from empirical inverse covariance weighting (Cheng et al. 2018), and the distortion of the window functions at low k modes due to complicated data weighting (Liu et al. 2014b;Kern & Liu 2021). In the meantime, we defer optimal inverse covariance weighted power spectrum estimation to future work. Nonetheless, we still need to account for the fact that, without foreground removal, there exists a large dynamic range between the foreground signal at low k modes and the EoR and noise signal at higher k modes. To limit foreground spectral leakage in the discrete Fourier transform, we apply a tapering (or apodization) function along the diagonal of R. We use a Blackman-Harris function (Blackman & Tukey 1958), which achieves 50 decibels of sidelobe suppression in Fourier space.

Error Bar Methodology
Tan et al. (in prep.) discuss different kinds of error bars one can use to quantify uncertainty on the measured bandpowers. In summary, they find that all of the error bar methodologies explored are in general agreement with each other and with the true sampling distribution of the bandpowers in the ensemble limit; however, they suggest two specific error bars that: 1. encapsulate the full thermal noise contribution to the bandpower uncertainty, and 2. can be computed relatively straightforwardly without requiring a model of the signal covariance. We will briefly summarize these here.
At minimum, one needs to encapsulate the fundamental thermal noise uncertainty of the fully averaged power spectrum, which is the root-mean-square (RMS) of the bandpowers in the limit that they are noise dominated. However, as Tan et al. (in prep.) point out, there is an additional source of noise variance on the bandpowers in the limit of a non-neglible coherent signal in the data. 6 For weak, marginally detectable signals the additional variance is a small correction, however, when there is a significant detection of the signal this correction dominates the thermal noise uncertainty on the bandpowers, and thus should be accounted for.
An analytic expression for the root-mean-square (RMS) of noise-limited power spectra after a given amount of averaging can be written as where t int is the integration time of the data, N coherent is the number of coherent averages of the data (i.e. visibility averages), N incoherent is the number of incoherent averages (i.e. averaging after forming the power spectra), and T sys is the system temperature of the signal chain, which is the sum of the sky and receiver temperatures. Lastly, Ω eff is the effective beam area, defined as Ω eff = Ω 2 p /Ω pp , where Ω p is the sky integral of the primary beam and Ω pp is the sky integral of the squared beam (Pober et al. 2013a;Parsons et al. 2014;Cheng et al. 2018). A frequency and LST dependent model for T sys is derived from the auto-correlation visibilities of each antenna to produce a baseline, time, and frequency dependent noise model (e.g. Kern et al. 2020a;Dillon et al. 2020;Tan et al. in prep.). Figure 11 shows that HERA measures a system temperature in the range of 200 -400 Kelvin at 160 MHz depending on the LST. Note that the T sys models are used not only to compute the error bars on the power spectrum, but also to generate realistic noise simulations of the data that are used for null testing.
The second error bar that Tan et al. (in prep.) investigate accounts for the additional sources of noise in the power spectrum that arise in the presence of a signal, also derived in Kolopanis et al. (2019). These extra terms come simply from the signal-noise cross terms when squaring the visibilities to form the power spectrum. This error bar is written as where P S is the power spectrum of the signal in the data. Note that, like P N , P SN can be thought of as the 1σ un-certainty on the measured bandpower. What is readily apparent is that to compute this quantity we need the power spectrum of the signal, which we do not have any more than we have the covariance of the signal. In that case, we can estimate it directly from the data, usinĝ P in its place, which Tan et al. (in prep.) show is a good approximation. However, for noise-limited bandpowers where P S << P N we see that by substituting P S forP we are actually overestimating the error bar (in a sense we are double counting the noise-noise contribution). Nevertheless, Tan et al. (in prep.) show that this can be corrected by constructing a modified error bar estimator One downside to Equation 23, Equation 24, and Equation 25 is that they make certain simplifying assumptions. Mainly, they assume that the bandpowers are uncorrelated. In other words, they only account for the diagonal of the bandpower covariance. However, we expect the off-diagonal components to possibly have nonnegligible covariance (specifically due to the apodization function) and it should be accounted for. Tan et al. (in prep.) show how the error bar estimators presented above can also be derived by propagating a combination of the data and the frequency-frequency noise covariance through the quadratic estimator, which conveniently yields the full bandpower covariance matrix, not just its diagonal. Therefore, to quantify the offdiagonal components in the bandpower covariance matrix, we use the frequency-frequency noise covariance of the data propagated through the QE formalism to Σ.
When quoting upper limits on the power spectrum or presenting power spectrum measurements, we will exclusively use the modifiedP SN error bar as the 1σ uncertainty, which is also the 68% confidence interval given that we expect the bandpowers to be Gaussian distributed (c.f. Tan et al. in prep.). We will also generally plot the P N limit as a reference for whether the data are consistent with the thermal noise floor. Lastly, we also use P N and the bandpower noise covariance matrix when evaluating statistical null tests in Section 5.

Forming Power Spectra
We form power spectra from the pseudo-Stokes I visibilities (Equation 6) after LST binning, systematics treatment, and coherent time averaging. We choose two spectral windows over which to form power spectra spanning 117.1-132.6 MHz and 150.3-167.8 MHz, corresponding to redshifts of 10.4 and 7.9, respectively. We refer to these as Band 1 and Band 2, shown in Figure 12, which were selected based on their relatively low flag occupancy of a few percent. The shaded curves shows the extent of the Blackman-Harris tapering function applied along the spectral window of each band. Evolution of the cosmological signal will begin to affect the power spectrum over large line-of-sight bandwidths, also known as lightcone effects. Although this effect is technically EoR model dependent, it has been found that for ∆ν < 10 MHz, lightcone effects on the power spectrum are kept below 10% for k < 0.5 h Mpc −1 and 7 < z < 10 ( Datta et al. 2014). Note that because we apply a Blackman-Harris tapering function across each of our spectral windows, their effective bandwidths are 7.75 MHz and 8.75 MHz for Band 1 and Band 2, respectively.
To form power spectra following Equation 16, we cross multiply adjacent time integrations (Pober et al. 2013b) separated by 214 seconds after time averaging between all baselines in a redundant set. A redundant set is defined by all physical baselines that share a common length and orientation. Note that we compute the power spectrum of all baseline permutations (i.e. we compute V 1 × V 2 and V 2 × V 1 , where V 1 and V 2 are visibilities in the same redundant set); however, we do not compute the power spectrum of a baseline paired with itself (i.e. V 1 ×V 1 ), which reduces the impact of baseline-dependent systematics. This achieves nearly the full sensitivity of a coherent visibility average across redundant baselines, thus the need to account for signal loss from a coherent redundant average (Section 3.11).
Next we motivate the choice of the three unique "fields" in LST over which we form power spectra. Tracking arrays like the GMRT, the MWA and LO-FAR can pick out specific parts of the sky where foregrounds emission is minimal (Ghosh et al. 2012;Trott et al. 2020;Mertens et al. 2020). However, HERA is not afforded this luxury because it is a static-array that observes a uniform stripe across the sky. Nonetheless, we can pick out certain LSTs where foreground emission is relatively less bright. This is important because we know that systematics like RFI and residual instrumental gains act to leak foregrounds modes in the power spectrum to higher k , thus partially contaminated the EoR window. Furthermore, as noted in Section 3, we do not perform any kind of foreground subtraction in this work. Therefore, picking LSTs that avoid the brightest foregrounds can help to minimize the impact of residual systematics. Specifically, we pick three fields, each spanning ∼ 2 hours in LST that avoid the extended radio galaxy Fornax A at an RA of 3.36 hours and the galactic anticenter at ∼ 7.5 hours (Figure 1). They also give a 0.5 hour buffer at the ends of the total LST range to limit edge effects in our fringe-rate filtering. The chosen fields span an LST range of 1.25-2.7 hours, 4.5-6.5 hours, and 8.5-10.75 hours for fields 1, 2, and 3, respectively. Figure 13 shows a redundantly-averaged power spectrum waterfall for a single redundant set with and without systematics treatment (Section 3.8) over the full LST range, with the three fields marked. Based on the amplitude of the k = 0 h Mpc −1 mode, we can see that Field 1 has the least amount of overall foreground power, making it an ideal candidate for power spectrum analysis. Figure 13 also demonstrates the ∼ 2 orders of magnitude of systematic suppression achieved by our pipeline, which would otherwise contaminate a large portion of the EoR window modes.
Next we incoherently average the power spectra over the redundant baseline axis and across the remaining time bins in each field. This is equivalent to a cylindrical binning in k space onto a k and k ⊥ plane. All averaging is weighted by the squared inverse of the computed thermal noise floor, P N , which is both a function of baseline-pair and time but is independent of k . Simultaneously, we also propagate the bandpower covariance and theP SN error bar through the averaging. This leaves us with a 2D power spectrum for each band and field, shown in Figure 14 for Band 1 and Figure 15  10 5 10 7 10 9 10 11 10 13 P(k ) [mK 2 h −3 Mpc 3 ] Figure 13. A waterfall of a redundantly averaged power spectrum for a single baseline group without systematic treatment (left) and with systematic treatment (right). The LST cuts for each of the three fields are shown in blue, orange and blue, which are chosen to avoid the transiting of bright foreground sources like Fornax A at 3.36 hours, the galactic anticenter at ∼7.5 hours, and to give a buffer of ∼30 minutes from both the beginning and end of the full LST range.
mal noise floor. Even though the power spectrum of a coherent signal is non-negative by construction, the measured power spectrum can fluctuate negative due to the fact that we have cross-correlated data with independent noise realizations, which is a mean-zero process with non-zero variance. The root-mean-square of the 2D power spectra divided by their thermal noise floors for k > 0.2 h Mpc −1 is consistent with one to within ∼3%, which means that (1) we are able to estimate the noise variance in the real data at a precision of a few percent, and (2), as we will also demonstrate later, the data at high k modes are largely noise dominated. The dashed line plots the foreground horizon limit, which is the theoretical maximum k for smooth spectrum foregrounds, and the solid line plots the horizon limit plus a buffer of 200 nanoseconds (∆k = 0.11 h Mpc −1 for z = 7.9 and ∆k = 0.10 h Mpc −1 for z = 10.4). As we will discuss below, this cut at the horizon limit plus a buffer helps to reduce foreground contamination in the final spherically averaged power spectrum. The exact buffer was chosen to mitigate foreground leakage beyond the hori-zon limit (particularly at large k ⊥ modes) while keeping some non-zero weight at the lower k modes, even though there is some observed leakage beyond this buffer at low k ⊥ (Figure 14 & Figure 15).
The fact that the data are in rough agreement with the expected noise at high k , especially given that we have performed no foreground removal, is a testament to both the instrument design as well as the data reduction algorithms, which have been able to effectively contain foreground emission to the lowest k modes within the foreground wedge. Figure 16 shows this more clearly, showing the spherically averaged power spectrum (discussed below) without enacting a low k cut and normalizing by the peak foreground power at k = 0 h Mpc −1 . This shows that the full process of measuring the sky brightness, including the instrument itself and our analysis pipeline, achieves a dynamic range of 10 9 in power between the peak foreground emission and the thermal noise floor at high k, which is a necessary but not quite sufficient criterion for an eventual 21 cm signal detection. Note that a complementary analysis of Phase I data using the bispectrum phase also achieve a dynamic range of 10 8 between peak foreground power and the recovered noise floor (Thyagarajan et al. 2020), but used less data than what is presented here.
However, while Figure 14 and Figure 15 show decent agreement with the noise floor at high k (a point we come back to in Section 5), we also clearly see evidence of foreground emission at k beyond that determined by the horizon delay (termed supra-horizon emission or foreground leakage), particularly at low k ⊥ . Some of this is simply due to the frequency tapering function's footprint in k space, which pushes k ≈ 0 foreground emission to higher k ; however, that cannot explain the full extent of the supra-horizon emission. Kern et al. (2020b) speculate that this could be due to residual instrumental cross-coupling, which is both generally most prominent and hardest to filter off at the low k ⊥ modes.
Looking at Band 2, Field 2, we also see evidence for a slight excess beyond the foreground horizon and buffer. While it is unclear exactly what this is due to, one possibility is that it is low-level spectral artifact (possibly unflagged RFI) that exhibits a baseline dependence, which has been seen before at a low-level in Phase I data. Future, more comprehensive work detailing low-level spectral artifacts in the fully averaged data, for example with the SSINS algorithm (Wilensky et al. 2019), may help us better understand these features.

Integrated Limits on the Power Spectrum
Next we spherically average the data by binning the 2D power spectra in bins of constant |k|, assum-  Figure 14. The 2D power spectra P (k , k ⊥ ) for Band 1 (z=10.4) at each field (top panels), and their ratio with the 1σ thermal noise floor (bottom panels). We also plot the horizon wedge (dashed) and the horizon buffer used when spherically binning (solid).  Figure 15. The same 2D power spectra as Figure 14 but for Band 2 (z = 7.9). Band 2 Figure 16. The amplitude of the spherically binned power spectrum without enacting a minimum k cut in the binning, normalized to the foreground power at k = 0 h Mpc −1 . This demonstrates that our analysis pipeline achieves a dynamic range of ∼ 10 9 with respect to the peak foreground power.
ing statistical isotropy of the cosmologial signal, and again weighting the average by the squared inverse of P N (k , k ⊥ ) in each pixel. We compute the spherically averaged power spectrum and propagate our measures of uncertainty using Equations 33 -38 of Dillon et al. (2014). Recall that we give zero weight to all pixels with k < k horizon +∆k , where ∆k is the foreground horizon buffer. This leaves us with three spherically averaged power spectra at each field for Band 1 and Band 2, which we show in Figure 17. The vertical error bars are the 2σ P SN error bar discussed in Section 4.2 and Tan et al. (in prep.), and the horizontal error bars are the 16 and 84th percentile of the window functions. We also plot the theoretical 1σ thermal noise floor (dashed), which is simply the P N error bar discussed in Section 4.2. Bandpowers that are measured as negative are set to zero. While coherent sky signal manifests as purely positive signal in the power spectrum, thermal noise and other uncorrelated systematics are mean-zero with non-zero variance, which can manifest as negative values in the power spectrum. Note that in the limit that the data are nearly noise-dominated, which is the most of the data shown in Figure 17, the 1σ errorbar is the same as the 1σ P N dashed line.
Broadly, we can evaluate whether the data are consistent with thermal noise fluctuations if (1) the vertical error bars reach ∆ 2 = 0 mK 2 for the majority of the data points, (2) the measured bandpowers are roughly consistent with the P N curve, and (3) we measure a roughly equal number of positive and negative band powers. However, all of this is complicated by the fact that the fully averaged products reduce us to a regime of low number statistics, in addition to the fact there will be some amount of correlation between the bandpowers in k, in part due to the frequency tapering function (the bandpowers are independent across bands and fields). With just a qualitative assessment for the time being, however, we can see general agreement of the data with the expected thermal noise level at high k modes, and increasing discrepancy at low k modes.
The discrepancy at low k is easily explained as residual foreground leakage that can be directly observed in Figure 14 and Figure 15. The fact that this leakage is stronger for Band 1 is largely due to the fact that the foregrounds are brighter at low frequencies. Furthermore, we see that Field 1 exhibits the least amount of foreground leakage at the low k modes, which is also largely due to the fact that it sees relatively dimmer foregrounds than the other fields (Figure 1 & Figure 13). Furthermore, we also see that Band 2, Field 2 sees a systematic excess in power compared to thermal noise expectation, which is likely due to a low-level time and frequency dependent systematic. We speculate that this could be the effect of low-level radio frequency interference, due in part to its spectral and temporal transience; however, more comprehensive work identifying low-level RFI is required to understand this at a deeper level.
The band powers in Figure 17 are sampled at a cadence of ∆k = 0.64 h Mpc −1 , which is twice the native spacing of the Fourier modes given our choice of spectral windows. This helps to reduce the correlations between neighboring bandpowers, especially given that we applied a tapering function across frequency before estimating the power spectra. This tapering impacts both the resultant bandpower window functions as well as the bandpower covariance matrix. Figure 18 shows the Band 2, Field 1 power spectrum with its associated window functions (bottom panel), which shows that they are well-behaved even without decorrelation; however, they do show low-level overlap between neighboring modes, which in principle should be taken into account when comparing the data to astrophysical models. The window functions are nearly identical for all The theoretical thermal noise floor (dashed) is also shown, which represents the 1σ noise floor. Bandpowers that are meausured to be negative are set to zero. The most sensitive limits come from Field 1 (for both Band 1 and Band 2), which is largely due to the fact that Field 1 has the least amount of foreground power in the field of view. W (k) Figure 18. The spherically averaged, dimensionless power spectrum from Band 2 and Field 1, showing 2σ vertical error bars, their associated window functions (lower panel), and the theoretical thermal noise floor (dashed). The horizontal error bar is the distance of the 16 and 84th percentile of the window function from its 50th percentile, equal to ∆k = 0.031 h Mpc −1 for all the bandpowers except the lowest k mode, which is slightly asymmetric due to the horizon buffer's effect on spherical binning. measured modes, 7 and are also nearly identical between Band 1 and Band 2. The 16 and 84th percentile of the window functions are ∆k = 0.031 h Mpc −1 away from their 50th percentile, which is a rough approximation of the 1σ horizontal error bar of the bandpowers. Upper limits on the 21 cm power spectrum (at a 95% confidence level) are constructed by taking the measured dimensionless power spectrum, ∆ 2 , and adding the 2σ error bar. In the case where the measured bandpower is negative, we set it to zero, which is justified by our assertion that the cosmological signal (or any coherent signal in the visibilities for that matter) is a purely real and positive quantity in the power spectrum by definition. In this case, the upper limit simply becomes the 2σ error bar.
The measured bandpowers, their uncertainties, and the derived upper limits are tabulated in Table 3 and  Table 4. The most stringent upper limits at both Band 1 and 2 are achieved from Field 1, yielding an upper limit of (95.74) 2 mK 2 at z = 10.4 and k = 0.256 h Mpc −1 and (30.76) 2 mK 2 at z = 7.9 k = 0.192 h Mpc −1 . This is the most sensitive upper limit on the EoR 21 cm power spectum at z ∼ 8 by over an order of magnitude, achieved with only a fraction of the full sensitivity of the HERA array. HERA's nominal sensitivty over a year-long observing campaign could be pushed two orders of magnitude deeper, thus reaching the fiducial signal amplitudes of standard EoR models; however, low-level systematics will need to be modeled and appropriately resolved in order to reach those sensitivities.

VALIDATION OF UPPER LIMITS
21 cm analysis pipelines are becoming increasingly sophisticated in order to beat down a host of systematic effects and reach the level of precision necessary to detect the cosmological 21 cm signal. A key concern is that these analysis choices could cause non-negligible amounts of signal loss, whether from calibration (Mouri Sardarabadi & Koopmans 2019), power spectrum estimation (Cheng et al. 2018), or other steps. The extent of any possible signal loss needs to be quantified in order to build confidence that our results are not prone to signal loss. In this analysis we use a combination of three complementary approaches to build confidence in our analysis and the robustness of upper limits on the power spectrum: simulation-based checks of the pipeline, a comparative study of uncertainty quantification, and statistical null tests on data subsets In a companion paper, Aguirre et al. (in prep.) present a validation of the Phase I analysis and power spectrum pipeline, using data simulations with realistic sky, instrument, and systematic models. In addition to tests on individual components of the pipelines, such as calibration, systematic modeling, and power spectrum estimation, they also present a near end-to-end pipeline test, demonstrating the pipeline's ability to accurately recover a realistic EoR model for k ≥ 0.192 h Mpc −1 at both of the spectral windows considered in this work. This end-to-end study is particularly important, as different steps of the pipeline could potentially interact in non-linear ways that introduce signal loss. While the instrument model used in that work are complex enough to model the most prominent systematics, there are some areas where the models can be made more sophisticated in future work to test more subtle features of the instrument, which may be necessary for validating an analysis with a putative EoR detection. In addition to testing the Phase I pipeline with simulated datsets, Aguirre et al. (in prep.) also show a semi-blinded analysis of the simulated data with an independent power spectrum pipeline as a consistency test with the Phase I power spectrum pipeline. In this exercise, a small group of people were excluded from discussions on the construction of the simulated data and systematics, and applied an independent, simplified power spectrum pipeline to the data products to test for experimenter bias in the derived limits. Their resultant power spectra are in good agreement with the output of the Phase I pipeline, in that they show unbiased detection of the simulated EoR signal for k 0.2 h Mpc −1 .
In addition to the simulated pipeline validation from Aguirre et al. (in prep.), Tan et al. (in prep.) present a comparative study of uncertainty quantification in the power spectrum pipeline. They highlight a number of techniques used in the literature for estimating uncertainty on the 21 cm power spectrum, both analytic and semi-empirical, and demonstrate that they all show relatively good agreement with each other when applied to both real and simulated HERA Phase I power spectra. Specifically, as discussed in Section 4.1, we use the P N and P SN error bars that quantify the root-mean-square (RMS) of the power spectra due soley to thermal noise (P N ), and the RMS due to thermal noise and its coupling to residual signal terms in the data (P SN ). The final error bars quoted in Table 3 and Table 4 come from this latter metric.
In the rest of this section, we discuss a series of statistical consistency tests to better understand residual systematics observed in the data. Some of these tests are formulated as true null tests, where the null hypothesis is that the spherically averaged power spectrum is consistent with noise-only fluctuations, while others are better described as consistency checks that seek to verify a particular scaling law. Such tests can help to quantify the parts of the data that are systematic or noise limited (e.g. Kolopanis et al. 2019).

Real and imaginary parts of the power spectrum
This test seeks to assess whether the final spherically averaged power spectra are consistent with thermal noise. In the absence of residual systematics, we expect the power spectra to consist of foreground emission, EoR emission, and thermal noise. Given that we have masked the foreground horizon upon spherical binning, any excess power above the predicted thermal noise variance is likely due to residual systematics, which can take may forms. Generally, these systematics leak foreground emission to k modes beyond the foreground horizon modes, thereby contaminating the lowest k modes in the EoR window. We assess whether the data are consistent with noise by evaluating the p-value of the power spectrum χ 2 statistic for each band and field combination. The statistic is defined as  Figure 17. Ther upper limit is the measurement plus the 2σ error bar. In the process, ∆ 2 is set to zero where it is measured to be negative.  Table 4. Band 2 power spectra, errors, and upper limits on ∆ 2 21 (z = 7.9) from Figure 17. The same procedure as Table 3 is used to construct the upper limits.

Field 1
Field 2 Field 3  Figure 19. The real and imaginary components of the spherically averaged power spectrum P (k). The six plots on the left show the real component (pink dots), while the six plots on the right show the imaginary component (blue dots). The error bars are 2σ and the shaded region is the PN theoretical noise expectation. Significant power in the imaginary component could indicate phase calibration errors resulting from unmodeled systematics or baseline non-redundancies. We quantify the consistency of the data with noise via p-value tests, tabulated in Table 5.
where d is the power spectrum data vector and Σ is its noise covariance. The noise covariance is nearly diagonal, with small but non-zero neighbor-to-neighbor covariance ( Figure 20). The p-value for a particular data vector is the area under the χ 2 sampling distribution that exceeds the χ 2 of the data. The χ 2 sampling distribution is derived in a Monte Carlo fashion by drawing a large number (N = 10 6 ) of random noise realizations and evaluating their χ 2 . For p-values below 0.001 we can only quote an upper limit on the p-value given the discrete number of random draws, however, this is a sufficiently small value to enable an unambiguous rejection of the null hypothesis. Note that we do not enact a hard cut on the p-value to determine whether the data reject the null hypothesis, but rather use the collection of p-values for a given band and field in both the real and imaginary components to ascertain whether the data seem consistent with thermal noise. Note-The null hypothesis is that the power spectrum is consistent with mean-zero fluctuations drawn from the propagated bandpower noise covariance ( Figure 20). Upper limits on the p-value are quoted where it is sufficiently small.
The k cuts are in h Mpc −1 .
Based on the estimator presented in Section 4.1, the power spectrum of a noise-only dataset will be meanzero process in both the real and imaginary components with equal variance. Coherent signals in the data, whether they are sky signals or instrumental systematics, are confined to the real component of the power spectrum. However, systematics with phase dif-ferences between cross-multiplied visibilties can cause excess variance in the imaginary component of the power spectrum (Kolopanis et al. 2019). Figure 19 shows the real component (pink panels) and the imaginary component (blue panels) of the power spectra in P (k), with their p-values tabulated in Table 5 showing the response of the p-value with an increasingly large k cut. We see that for both bands, most fields are inconsistent with the noise distribution at the low k modes, evidenced by their consistently low p values in both the real and imaginary components. Reassuringly, the p-values for the imaginary component are generally more consistent with noise than the p-values for the real component, even when there is strong evidence for systematics, like Band 2, Field 2. The interesting exception to the rule is Band 1, Field 2, whose imaginary component is less consistent with the noise than its real component, possibly indicative of a low level, baseline dependent systematic. Overall, the quantification provided by the p-value tests on the real and imaginary components supports the notion that the data are largely consistent with thermal noise at high k modes, while confirming our intuition that residual systematics are beginning to be detected at the lowest k modes in the EoR window.  Table 5. This shows there is a small but non-zero covariance between neighboring modes, and effectively no covariance at larger separations.

Night-to-night binning split
A check on long-term temporal stability can be made by taking the nightly data from the dataset and sep- arating them into two groups, differencing them, and then forming the power spectra of the differenced visibilities. To do this we take every other night in the 18 night dataset and form a set of "even" nights and a set of "odd" nights in terms of their Julian date. We then perform the standard night-to-night LST binning described in Section 3.6 on both the even and odd sets independently. Afterwards, we pass both data sets through the standard systematics treatment and then difference the resultant visibilities before estimating the power spectrum. This test is sensitive to systematics that exhibit non-negligible variation from night-to-night. Figure 21 shows the real component of the differenced power spectrum for Field 1. Computing the p-values of these spectra in the same manner as Section 5.1, we find that the spectra for both Band 1 and Band 2 are consistent with thermal noise for k ≥ 0.192 h Mpc −1 , which was not observed for Field 1 in the undifferenced power spectrum (Table 5). This suggests that the lowk residual systematics seen in the data are fairly stable from night-to-night.

Scaling of bandpowers under cumulative averaging
If delay modes outside the foreground horizon are dominated by thermal noise, as we would expect in the absence of residual systematics or an EoR signal detec-tion, the measured power in those modes should integrate down in a characteristic way as more time samples and baseline-pairs are averaged over. In this section, we study the scaling of the measured delay spectrum bandpowers when cumulatively averaging individual baseline-pairs in a redundant baseline group, which should in principle measure the same sky signal but have independent noise contributions. Note that this test is performed higher up in the power spectrum pipeline before cylindrical or spherical binning of the power spectra, which is necessary because we need a large number of independent samples for computing the cumulative average. We specifically target the East-West 14.6 meter baseline group for this test as it has the largest number of independent baselines.
The left and center panel of Figure 22 show the scaling of the delay spectrum band powers for Band 2, fields 1 and 2, having cumulatively averaged over the redundant baseline pairs. Each line shows the a different band power, ranging from negative delays to positive delays, while the black dashed line shows a representative 1/ √ N scaling, where N is the number of averaged baselinepairs. To reduce the scatter in this statistic, we take the absolute value of the real part of the band power and average across the remaining LST samples associated with Field 1.
While low delay modes (|τ | 200 ns) are not plotted, we see that the high delay modes are generally consistent with the expected 1/ √ N scaling of noise. As a demonstration, the right panel of Figure 22 shows the same procedure applied to a pure noise simulation for Band 2, Field 2.
The thick orange line shows the average of the band powers across delays for |τ | > 500 ns (or |k | 0.25 h Mpc −1 ). This more clearly shows the marginal detection of a systematic at deep integration levels, which is more pronounced for Band 2, Field 2, as we would expect given our conclusions from Figure 19. Nevertheless, this test shows that the low-level systematics do partially integrate when averaging across different baselines, as the mean of the band powers (i.e. the orange line) still shows a downward trend. Integrating more data will help to understand at what dynamic range these systematics hit a plateau, if at all.

Stability with respect to frequency selection
Here we seek to assess the overall containment of foreground structure within the foreground horizon as a function of frequency. To do this, we compute the delay spectrum of the data over a subband with fixed bandwidth and then iteratively shift the center of the subband through the full frequency range of the data, thereby probing for spectral discontinuities in the data at particular locations in frequency. In order to test the consistency of the measurements with noise, we simulate mock visibilities consisting of Gaussian random noise with the same sampling and flagging patterns as the real data. As a metric, we look at the mean band power within a delay range of 2000 < |τ | < 4000 nanoseconds for both the real data and simulated noise data sets. By picking out the high delay regime, we are constructing a test that is sensitive to spurious spectral structure that would cause foreground power to leak significantly beyond the foreground horizon. Similar to Section 5.3, we perform this test with the East-West 14.6 meter redundant baseline group to enable a large sample size, and focus on the LST ranges associated with Field 1. Note that this test is performed after delay-based inpainting, where the data have been reconstructed in previously flagged channels.
We use two-sample Kolmogorov-Smirnov (K-S) test (Hodges 1958) to examine if the underlying probability distribution of the measured band powers from the data and noise-only simulations are different. Again, this process is repeated as we shift the spectral window up in frequency, where the window has a width of 100 channels, and is moved up 20 channels every iteration. We compute the allowed range of the metric by taking the 5th and 95th percentiles of the two-sample KS test, having paired two independent noise simulations, and then repeating for all pair combinations of the noise simulations and averaging the result. This interval is shown as the green shaded region in Figure 23, which also shows the two-sample KS test between the data and the noise simulation (blue points). This shows that for most of the total bandwidth, the measured band powers at high delay are consistent with that of pure noise fluctuations. The Band 1 and Band 2 spectral windows are shown as grey dashed lines.
The notable exception are the band edges, which are known to be sub-optimal to due tapering effects in the process of calibration, as well as the ORBCOMM band at ∼ 138 MHz, where a large swath of channels are routinely flagged, which is known to degrade the ability of the delay-based inpainting algorithm.

Impact of systematic treatment
This null test is aimed at examining the consistency of the band powers with thermal noise before and after systematic treatment, discussed in more detail in Dibblee- Barkman & Singh (2021). We compute the delay power spectra for a single 14.6 meter, East-West oriented baseline this time spanning an LST range of 5 -5.4 hours overlapping with Field 2. We estimate the cumulative probability distribution (CDF) of the band powers at different delays, constructing the CDF across the 30 time bins in the LST range studied here. We then perform a one-sample KS test between the band power at each delay and the expected analytic distribution of the band powers under the assumption that the visibilities are drawn from random Gaussian noise. Additionally, we also construct band power CDFs having resampled the time bins with replacement, which gives us a sense for the inherent variation of the CDF throughout the LST range. We compute the one-sample KS test for each of the resampled CDFs, and then take their average.
The result is shown in Figure 24, where the solid line shows the KS statistic from the un-resampled CDF as a function of delay mode and the dashed line shows the average KS statistic from the resampled CDF, showing good agreement between the two, suggesting that variation of the KS statistic as a function of time is minimal. We repeat this test on the data before systematics treatment (to panel), after systematics treatment (middle panel), and also compute it on a purenoise simulation for validation purposes (bottom panel). The horizontal black line with a KS value of ∼ 0.25 is the analytically-computed 95th percentile of the KS values assuming the CDF is drawn from pure noise, which is validated by the results of the bottom panel showing the vast majority of the KS values falling below the solid black line. What we see from the top and middle panels of Figure 24 shows a clear discrepancy between the data and the assumed noise distribution at intermediate delay modes before performign systematics treatment (top panel), which is not surprising given that we know those modes to be systematics dominated. After systematics treatment (middle panel), these delay modes show good agreement with the noise model, except for at very low delays where we expect intrinsic foreground emission to dominate.

SUMMARY
In this work we have presented upper limits on the 21 cm power spectrum from Phase I observations of the 50-element Hydrogen Epoch of Reionization Array (HERA) at redshifts 7.9 and 10.4. The most sensitive 2σ limits achieved are (30.76) 2 mK 2 at z = 7.9 and k = 0.192 h Mpc −1 and (95.74) 2 mK 2 at z = 10.4 and k = 0.256 h Mpc −1 . At z = 7.9, the limits presented here are the most sensitive to-date within the literature by roughly an order of magnitude. Along with a series of paper describing the Phase I analysis pipeline, including redundant calibration (Dillon et al. 2020), absolute calibration (Kern et al. 2020a), systematic modeling (Kern et al. 2019(Kern et al. , 2020b, uncertainty quantification (Tan et al. in prep.), and pipeline validation (Aguirre et al. in prep.), this work details the Phase I analysis and power spectrum pipelines, and discusses a series of statistical tests that show that the data are largely thermal noise limited for k ≥ 0.5 h Mpc −1 , whereas at lower k modes the data show evidence for low-level systematics. We speculate that these residual systematics could be due to radio frequency intereference, as well as possibly residual gain errors or residual instrumental coupling effects. Future work exploring better RFI identification, as well as more comprehensive systematic models, will help to better understand the origin of these systematics.
Note that no explicit foreground subtraction or filtering was performed in the anlaysis. Instead, our analysis emphasized strong control of spectral systematics in order to keep foreground structure largely contained within the foreground horizon. Ultimately, this enabled the Phase I analysis pipeline achieved a dynamic range between the thermal noise floor and the peak foreground power of 10 9 in power. Future work will focus both on a more comprehensive analysis of low-level systematics, and the modeling and subtraction of foreground emission. Hardware upgrades in the HERA Phase II system will potentially mitigate some of the observed Phase I systematics in the field, such as cross-coupling and reflection contamination.
The overall systematic uncertainties in this analysis come predominately from the uncertainty on the ab- . One-sample K-S test at different delays, comparing CDFs of the bandpowers from the data, and the mean bootstrapped CDF with that expected from Gaussian random noise. The data without systematic subtraction (top panel) shows two clear regions, foreground and systematic dominated, where K-S values are clearly higher than the threshold, marked with the black horizontal line. Systematic subtraction (middle panel) continues to have high K-S values for foreground dominated region, while all other delay ranges seem consistent with thermal noise. As a reference, bottom panel shows the K-S values from the simulated dataset that consists of Gaussian random noise with the same variance as that of the data. Naturally, in this case, the K-S values are consistent with the reference distribution across all delays. Solid and dashed lines represent the results from K-S test on original band powers and mean of bootstrapped band powers respectively.
solute flux density scale that is known to an accuracy of ∼10%, which is pinned to the GLEAM catalogue (Hurley-Walker et al. 2017). Furthermore, nightly drift in the gain amplitude is not corrected for in this work, which Kern et al. (2020a) estimate to be a roughly 3% effect in the gains. Signal loss arising from the analysis pipeline are explicitely derived and corrected for, but in total this amounts to less than a 15% correction of the power spectrum amplitude. Cosmic variance uncertainty on the 21 cm power spectrum limits is expected to be 5.5% (1σ) given the uv sampling of the instrument for a 2 hour integration across LST (Aguirre et al. in prep.). Going forward, future Phase I analyses may include a ∼100-night dataset that theoretically stands to improve the sensitivity of our limits here by over a factor of 5, assuming the low k systematics observed in this work can be further mitigated. Additionally, Phase II commissioning and observations have already commenced, with a new front-end receiver spanning an increased bandwidth from 50 -250 MHz. Overall, the analysis pipeline presented in this work has laid a framework for future analyses of HERA data as construction and commissioning is completed.