The Dark Side of Using Dark Sirens to Constrain the Hubble-Lema\^itre Constant

Dark sirens, i.e., gravitational-wave (GW) sources without electromagnetic counterparts, are new probes of the expansion of the universe. The efficacy of this method relies on correctly localizing the host galaxies. However, recent theoretical studies have shown that astrophysical environments could mislead the spatial localization by distorting the GW signals. It is unclear whether and to what degree the incorrect spatial localizations of dark sirens would impair the accuracy of the measurement of the cosmological parameters. To address this issue, we consider the future observations of dark sirens using the Cosmic Explorer and the Einstein Telescope, and we design a Bayesian framework to access the precision of measuring the Hubble-Lema\^itre constant $H_0$. Interestingly, we find that the precision is not compromised when the number of well-localized dark sirens is significantly below $300$, even in the extreme scenario that all the dark sirens are localized incorrectly. As the number exceeds $300$, the incorrect spatial localizations start to produce statistically noticeable effects, such as a slow convergence of the posterior distribution of $H_0$. We propose several tests that can be used in future observations to verify the spatial localizations of dark sirens. Simulations of these tests suggest that incorrect spatial localizations will dominate a systematic error of $H_0$ if as much as $10\%$ of a sample of $300$ well-localized dark sirens are affected by their environments. Our results have important implications for the long-term goal of measuring $H_0$ to a precision of $<1\%$ using dark sirens.


INTRODUCTION
The detection of gravitational waves (GWs) by the Laser Interferometer Gravitational-wave Observatory (LIGO) and the Virgo detectors (Abbott et al. 2019(Abbott et al. , 2021b opened the possibility of using GWs as a new messenger to measure the geometry of the Universe. The idea was proposed in Schutz (1986) where the author suggested combining the luminosity distance of a GW source and the redshift of its electromagnetic (EM) counterpart to measure the expansion of the Universe. Such an idea, more widely known as the "standard siren" (Holz & Hughes 2005), was put to the test immediately after the detection of a binary neutron star merger (Abbott et al. 2017a,b;Hjorth et al. 2017;Abbott et al. 2017c;Fishbach et al. 2019;Abbott et al. 2021a,d;Soares-Santos et al. 2019;Palmese et al. 2020; * Boya fellow Vasylyev & Filippenko 2020;Finke et al. 2021), which by now has achieved a precision of about 10% (Abbott et al. 2021d) for the measurement of the Hubble-Lemaître constant, or better if provided additional information such as observations of jets (Hotokezaka et al. 2019;Mukherjee et al. 2021). Given the current discrepancy at 4σ confidence level between the Hubble-Lemaître constants measured from the observations of type-Ia supernovae and the cosmic microwave background (CMB) (Freedman 2017;Riess 2019;Di Valentino et al. 2021;Perivolaropoulos & Skara 2022), it is expected that standard sirens, as an independent method, could help resolve the tension (Abbott et al. 2018(Abbott et al. , 2017dReitze et al. 2019a;Punturo et al. 2010;Chen et al. 2021).
To qualify as a standard siren, the source should provide observables from which both the luminosity distance and the redshift can be obtained. While the luminosity distance is a direct observable from GW signal, the redshift is not always acquirable. In particular, mergers of binary black holes (BBHs), although predominant among the LIGO/Virgo event, are generally not expected to emit detectable EM radiation (Schutz 1986) and hence are normally referred to as "dark sirens" (or "dark standard sirens", Soares-Santos et al. 2019).
The redshifts of dark sirens need to be obtained by alternative means. One standard method is marginalizing the redshifts of the galaxies in the error volume pinned down by GW detectors (Schutz 1986;MacLeod & Hogan 2008;Petiteau et al. 2011;Del Pozzo 2012;Fishbach et al. 2019;Soares-Santos et al. 2019;Palmese et al. 2020;Abbott et al. 2021a,d). Such a marginalization could introduce large statistical errors in the resulting cosmological parameters, because there are normally 10 2 to a few×10 5 galaxies in the error volume localized by each dark siren. However, the overwhelmingly large number of dark sirens (Maggiore et al. 2020;Evans et al. 2021) could compensate for the weakness of each one and thus altogether provide a precise measurement of the cosmological parameters. As the number of dark sirens increases, the precision could be comparable with, if not better than, the precision using a more limited sample of (bright) standard sirens, as recent studies suggest (Del Pozzo 2012;Chen et al. 2018;Fishbach et al. 2019;Gray et al. 2020;Borhanian et al. 2020;Yu et al. 2020;Muttoni et al. 2022;Zhu et al. 2022b).
To better estimate the redshifts of dark sirens, more sophisticated methods have been devised (Fishbach et al. 2019;Gray et al. 2020;Finke et al. 2021;Zhu et al. 2022b;Gupta 2022). These methods are based on a key assumption, which is that the error volume localized by GW detectors truly contains the host galaxy of a dark siren. This assumption may not always be valid in light of the recent theoretical discovery that GW signals could have been distorted by the astrophysical environments in which the waves are produced and propagating (see Chen 2021, for a summary). For example, strong gravitational lensing could make a GW source appear more distant (Smith et al. 2018;Broadhurst et al. 2018). Gravitational redshift Chen 2021) or acceleration of the source around a massive body (Robson et al. 2018;Tamanini et al. 2020;Chen 2021), as well as the gas surrounding the source (Chen & Shen 2019;Chen et al. 2020;Caputo et al. 2020), could bias the measurement of the mass of the source and, in turn, induce an error in the inference of the luminosity distance. In addition, the eccentricity of the orbit of a binary (Gayathri et al. 2021(Gayathri et al. , 2022 and the nonstationary noise of detectors (Edy et al. 2021;Kumar et al. 2022), if not appropriately account for, could also result in biased localization of the source. Therefore, it is possible that unmodeled astrophysical factors could lead us to an error volume that does not contain the true spatial position of a dark siren. In this case, a redshift can still be compiled using the galaxies in the incorrect volume. But the redshift is physically unrelated to the luminosity distance (incorrectly) inferred from the GW signal. It is unclear how common such a mismatch is and how seriously it could hamper the effort of accurately measuring the cosmological parameters using dark sirens.
Here, we try to address the above questions. The paper is organized as follows. In Section 2, we introduce the Bayesian framework that we use to infer cosmological parameters from the observations of dark sirens. In Section 3, we conduct mock observations assuming nextgeneration GW detectors. The results are shown in Section 4, where we also test the methods of verifying the reliability of the candidate host galaxies of dark sirens. In Section 5, we discuss the dependence on the fraction of false candidate host galaxies for unbiased cosmological parameter estimation. Finally, in Section 6, we summarize the main findings of this work and conclude with the implications of our results.

Cosmological model
We assume that the expansion of the late Universe can be described by a flat Λ cold dark matter (flat-ΛCDM) cosmological model. The evolution of the expansion rate with redshift can be expressed as where H 0 ≡ H(z = 0) is the Hubble-Lemaître constant describing the current expansion rate of the Universe, Ω M is the matter density normalized by the critical density, and Ω Λ ≡ 1 − Ω M is the fractional density of dark energy. Because we are interested in the relationship between the luminosity distance D L of a GW source and the redshift z of its host galaxy, we use to link the two quantities, where c is the speed of light. In this cosmological model, the set of independent parameters are H ≡ {H 0 , Ω M }.

Bayesian framework
To extract the cosmic expansion information from GW and EM data, we adopt the Bayesian framework presented in Chen et al. (2018) (also see Gray et al. 2020;Finke et al. 2021;Zhu et al. 2022b where I indicates all the relevant additional information. The term p(d GW i , d EM i | H, I) denotes the likelihood of the data for the ith event. Because the data contain noise, different sources with different luminosity distances D L , redshifts z, sky longitudes α and latitudes δ may result in the same data. Therefore, the likelihood is a marginalization of all possible (D L , z, α, δ), i.e., Here, β( H|I) is a normalization factor that accounts for the selection effects in GW (Fishbach & Holz 2017;Abbott et al. 2019;Mandel et al. 2019) and EM observations (Chen et al. 2018 i |D L , z, α, δ, H, I)p 0 (D L , z, α, δ| H, I) = p(d GW i |D L , α, δ, I)p(d EM i |z, α, δ, I) × p 0 (D L |z, H, I)p 0 (z, α, δ| H, I). (5) The term p(d GW i |D L , α, δ, I) is the marginalized likelihood of the GW data (Finn 1992), and p(d EM i |z, α, δ, I) is the likelihood of the observational data of the EM counterpart. For dark sirens, there is no information for their EM counterparts, so we can set d EM i to null, as well as set the likelihood p(d EM i |z, α, δ, I) to constant (Chen et al. 2018). The term p 0 (D L |z, H, I) is a Dirac delta function, i.e., p 0 (D L |z, H, I) = δ D D L −D L (z, H) , since D L is uniquely determined by z given a cosmological model.
In Equation (5), p 0 (z, α, δ| H, I) is a prior. To calculate it, we adopt the standard assumption that it is proportional to the number density of galaxies (Schutz 1986;MacLeod & Hogan 2008;Petiteau et al. 2011;Del Pozzo 2012;Chen et al. 2018;Fishbach et al. 2019;Gray et al. 2020;Finke et al. 2021). Taking into account the fact that many galaxies are too dim to be detected, we write the prior as p 0 (z, α, δ| H, I) = p obs (z, α, δ| H, I) where p obs (z, α, δ|I) is a distribution function compiled from galaxy surveys, p miss (z, α, δ| H, I) represents the distribution of the undetected galaxies, and f compl is a fraction denoting the completeness of the galaxy catalog (also see Chen et al. 2018;Fishbach et al. 2019). The value of p 0 (z, α, δ| H, I) is determined as follows. (i) For the observed galaxy distribution function, we calculate it with where N gal is the total number of the observed galaxies and N (z|z, σ z ) is a Gaussian distribution of z with an expectationz and a standard deviation σ z . Here, we introduced two delta functions because the sky-localization errors in galaxy surveys are much smaller than those in GW observations. (ii) For the undetected galaxies, we adopt the expression from Chen et al. (2018), assuming an isotropic and homogeneous universe, where p compl (z, α, δ) is the probability of a galaxy at (z, α, δ) being in the survey galaxy catalog, V c is the comoving volume, and dΩ ≡ cos δ dαdδ represents the differential solid angle. It should be noted here that δ is defined as the angle with respect to the celestial equatorial plane. (iii) The completeness fraction is calculated with where z min (z max ) is the minimum(maximum) redshift of all the candidates of the host galaxy of a GW source, and V tot c is the total comoving volume enclosed within [z min , z max ].
To calculate the denominator β( H|I) in Equation (4), we need to consider the selection effects in observations because we can only detect those GW or EM sources that exceed certain detection thresholds (Chen et al. 2018). Since the EM data d EM i can be set to null for dark sirens, the denominator is determined solely by the GW sources. Following Chen et al. (2018), we calculate it with where p c 0 (z, α, δ| H, I) is a cosmological prior distribution of all possible galaxies.
For an isotropic and homogeneous universe, we have p c 0 (z, α, δ| H, I) = p c 0 (z| H, I)p c 0 (α, δ|I). Moreover, we define and Equation (10) then becomes The term p GW det (D L |I) describes the probability that a GW event at D L will be detected. The other term p c 0 (z|I) describes the probability that a galaxy is at redshift z and can be calculated with (Zhu et al. 2022a,b), where ∆z is a smoothing interval that needs to be much larger than the typical redshift scale of a galaxy cluster, and the expression of the integrand p 0 (z , α, δ| H, I) is shown in Equation (6).

True and false candidate host galaxies
The effectiveness of using dark sirens to constrain cosmological parameters relies on deriving a statistically correct distribution function for the redshift of a GW source. Two factors could potentially impair the reliability of the derived redshift. (i) The galaxy catalog is likely incomplete, such that it may miss the true host galaxy of a GW source. (ii) If the models of the GW signal or the detector noise are inaccurate, the spatial localization of a GW source will be biased.
The first factor is not catastrophic. Given that most galaxies form in clusters, the brightest galaxies in a galaxy cluster still provide a statistically correct redshift even if the true host galaxy of the dark siren is too dim to be observed. Although the redshifts of the brightest cluster members and the redshift of the true host galaxy are not exactly the same, the difference has been accounted for in our Bayesian framework by the completeness fraction f compl in Equation (6). For simplicity, in the following, we refer to the galaxies that are physically correlated with the true host galaxy of a dark siren as the "true candidate host galaxies".
The second factor is more detrimental. In this case, the galaxies found at the luminosity distance and sky location inferred from the GW data may be physically uncorrelated with the GW source. Therefore, the redshift distribution provided by the galaxies in the incorrect error volume is inconsistent with the true redshift of the dark siren. This mismatch could result in a large bias in the inference of the cosmological parameters. These galaxies are referred to as the "false candidate host galaxies" in this work.

MOCK OBSERVATION
To understand the impact of the false candidate host galaxies on the estimation of the cosmological parameters, we will conduct mock observations of the host galaxies of the dark sirens, and simulate future observational constraint on the Hubble-Lemaître constant using such galaxies. Our simulation is divided into three steps. (i) Generate a mock catalog of merging stellar-mass binary black holes (SBBHs) and distribute them in the galaxies selected from a cosmological numerical simulation. (ii) Conduct mock observations of the candidate host galaxies of the SBBHs to derive the statistical probability distribution of the redshift for each binary. (iii) Calculate the total posterior probability distributions of the cosmological parameters using the luminosity distances of the dark sirens and the statistical redshifts provided by the candidate host galaxies. The details are as follows.
First, we generate a population of SBBHs using the "Power Law + Peak" model presented in Abbott et al. (2021e). The direction of the orbital angular momentum is assumed to be random. The spin parameter is uniformly distributed in the range of [−1, 1]. The merger time is chosen randomly between 0 and 5 years, and the merger phase is also random. Given these parameters, the GW signal is generated using the IMRPhenomPv2 waveform model (Hannam et al. 2014;Schmidt et al. 2015).
For each binary in our mock catalog, we assign a host galaxy that is randomly selected from the galaxy catalog of the MultiDark cosmological simulations (Klypin et al. 2016;Croton et al. 2016;Conroy et al. 2009). We note that the cosmological parameters used in the simulations are H 0 = 67.8 km/s/Mpc and Ω M = 0.307. Because we are mainly interested in studying the expansion of the late Universe, we restrict the selected galaxies to z < 1.
To evaluate the detectability of a SBBH GW event, we consider a network composed of the Einstein Telescope (ET) (Punturo et al. 2010) and the Cosmic Explorer (CE) (Abbott et al. 2017d). The models for the sensitivity curves are ET-D (Hild et al. 2011) and CE2 (Reitze et al. 2019a,b). The response of the network to a GW signal is computed using the low-frequency approximation (Thorne 1987), where we have taken the rotation of the Earth into account (Zhao & Wen 2018).
Second, for each detectable SBBH, we select the candidate host galaxies from the same MultiDark galaxy catalog according to the spatial localization inferred from the GW data. These galaxies are used to compile the probability distribution of the redshift of the dark siren. There are multiple candidates because there are large errors in the inferred luminosity distance and sky location. In our simulation, these errors are derived according to the Fisher information matrix (Cutler & Flanagan 1994;Poisson & Will 1995;Vallisneri 2008) with a 3σ confidence interval (CI). In particular, we have also included additional errors in luminosity distance caused by the peculiar velocities of the galaxies and the weak-lensing effect (Kocsis et al. 2006;Hirata et al. 2010). Figure 1 shows the spatial localization errors, including the error in luminosity distance, ∆D L /D L , and the error in sky localization ∆Ω. We can see that, on average, ∆D L /D L 4% and ∆Ω 1 deg 2 . In this study, we only use those SBBHs with relatively small localization errors, namely ∆D L /D L ≤ 0.6 (equivalent to ∆D L /D L 1 at the confidence level of 90%) and ∆Ω ≤ 1 deg 2 , because the sources with larger localization errors have less constraining power on the cosmological parameters.
It is important to mention that luminosity distances are not directly observable in galaxy surveys. The real observables are the redshifts of galaxies. Therefore, we must convert the luminosity distance inferred from GW data into the information on redshift before we can select the candidate host galaxies for a dark siren. The conversion is performed using the relationship where H − (z) and H + (z) are the minimum and maximum values of H(z) given by the prior of the cosmological parameters, [D L;min , D L;max ] is the range of luminosity distance allowed by the GW data, and z min and z max are the resulting minimum and maximum values We do not consider the observational errors in the spectroscopic redshifts of the host galaxies, because they are small in the redshift range of our interest (Aghamousa et al. 2016;Gardner et al. 2006;Gong et al. 2019). Given such redshift information, the candidate host galaxies of a dark siren are selected from an "error volume" of a size of 9∆Ω × [z min , z max ] in the MultiDark galaxy catalog. Moreover, we must include the selection effect that exists in real galaxy surveys. In our simulation, the probability that a galaxy with an apparent magnitude ofm * is observed is where σ m * is the uncertainty in measuring the magnitude and m * limit is the limiting magnitude of the galaxy survey. We assume a conservative value of σ m * = 0.1 mag and use m * limit = +26 mag to mimic future observations by the Chinese Space Station Telescope (CSST) (Gong et al. 2019).
By now, we have selected the true candidate host galaxies. However, we also need to simulate the observations of the false candidate host galaxies in order to understand their potential impact. Previous studies have shown that the dark sirens at different cosmological distances will lead to systematically different constraints on the cosmological parameters (Zhu et al. 2022a). To eliminate this statistical difference correlated with distance and single out the effects caused by the incorrect identification of the candidate host galaxies, we simplify our problem by keeping the luminosity distance and its error constant, while changing only the sky localization. Therefore, we introduce a displacement to the true sky localization. Using this displaced sky localization and the same luminosity distance information, we select the false candidate host galaxies from the incorrect error volume. In this way, the selected galaxies are physically uncorrelated with the true host galaxy of the dark sirens, but in the redshift space the true and false candidate host galaxies share the same distribution. In this work, the values of the displacement, (dα, dδ), are randomly generated within the boundary of the sky area of the galaxy survey. To ensure that the displaced sky position does not overlap with the true position of the GW source, we also impose a lower limit to the displacement, which is determined by (dα, dδ)Σ −1 αδ (dα, dδ) T /2 > 11.83 where Σ αδ is the covariance matrix of the sky localization of the dark siren. This way, the inconsistency between the displaced and the true sky localization is not less than 3σ.
Third, we use the statistical redshift distributions of true and false candidate host galaxies to compute the posterior probability distributions of the cosmological parameters. The calculation is based on Equations (3). In the calculation, we use the Markov Chain Monte Carlo (MCMC) method to accelerate the calculation. Moreover, we use the emcee library, which is a Python-based affine invariant MCMC ensemble

Constraints on the cosmological parameters
One merit of using dark sirens to constrain cosmological parameters is that the precision improves with the number of dark sirens. This may no longer be true when the dark sirens with false candidate host galaxies are present. Therefore, we first study the precisions of the cosmological parameters as a function of the number of dark sirens, N .
We consider a number of values for N , namely N = 10, 30, 100, 300, and 1000, which are almost equally spanned in the logarithmic space. Our maximum number N = 10 3 is reasonable because (i) the "Power Law+Peak" population model (Abbott et al. 2021e) for SBBHs predicts that about 16, 000 − 35, 000 SBBHs at z ≤ 1 would merge within a period of 5 years, and (ii) about one-third of them could be covered by the current and future galaxy surveys (Aghamousa et al. 2016;Gong et al. 2019;Scaramella et al. 2022). boxes, while the boxes show the 25% − 75% CI based on 48 trials. When N ≤ 100, the error corresponding to true candidate host galaxies is statistically comparable with the error derived from false candidate host galaxies. However, as soon as N reaches 300, the difference between the two errors becomes statistically significant: the latter becomes substantially larger than the former. In particular, when N = 10 3 the median precisions of H 0 and Ω M become better than 1% and 10%, respectively, if true candidate host galaxies are used in the simulations. On the other hand, the median precisions corresponding to false candidate host galaxies are worse by a factor of about five. Figure 2 also shows that, as soon as N reaches 300, the spread of the 25% − 75% CI (the length of a box) becomes statistically different in the two kinds of simulations. The spread is systematically wider in the simulations with false candidate host galaxies. The reason can be found in Figure 3, where we show the posterior distributions of H 0 from the first 24 trial simulations when N = 300. We find that, when we use true candidate host galaxies (blue), the posterior distribution in each trial simulation is relatively concentrated around the true value of H 0 . Moreover, only in three trials (1, 5, and 16) do we see significant secondary peaks offset from the true value. On the other hand, the simulations using false candidate host galaxies (red) more often produce multiple peaks, and the locations of the peaks are widespread. Such randomness is a direct consequence of the incorrect redshift distribution compiled from the incorrect candidate host galaxies.
We notice that, when N = 100, the CIs given by the simulations with true candidate host galaxies are much broader than those given by false candidate host galaxies. This is because, when the number of dark sirens is small, e.g. N = 100, the GW catalog may or may not contain very precisely localized dark sirens (e.g., ∆Ω ∼ 0.01deg 2 ). Without such well-localized dark sirens, the constraint on cosmological parameters would be relatively weak. The above results suggest that, in the future, with about 300 dark sirens, we can start to assess the reliability of our modeling of the GW signals. A small dispersion in the posterior distribution of the cosmological parameters may suggest that the selected galaxies are indeed physically correlated with the true host galaxies of the dark sirens. Therefore, the model which we use to infer the spatial localization of a GW source is accurate. If the posterior distribution has a large dispersion, the model is unlikely to be accurate, and it may miss the effects caused by the astrophysical environments, as we have discussed in Section 1.

Testing the reliability of the candidate host galaxies
Besides evaluating the posterior distribution functions of the cosmological parameters, we also design two other methods that can be used to infer the reliability of the selection of the candidate host galaxies for dark sirens.
The first method is a translation of the sky position. In this method, we search for the candidate host galaxies of a dark siren in two different error volumes: an original error volume derived directly from the GW data, and the other is obtained by artificially shifting the sky position of the original error volume by a large angle. We then compute the posterior distribution function of H 0 using the galaxies selected from the above two error volumes and derive the corresponding error ∆H 0 . The reliability of the host galaxies can be inferred as follows. (i) If the original error volume contains the true host galaxy, the error ∆H 0 should increase significantly after the translation of the sky position, because the shifted error volume contains only physically unrelated false candidate host galaxies. (ii) If, on the other hand, the original error volume contains only false candidate host galaxies, the ∆H 0 before and after the translation of sky position should be statistically comparable.  ) derived using the galaxies in the original error volume inferred from GW signal and the other (∆H case 1 0 ) using the galaxies in an error volume shifted from the original one. Each error is compiled from 300 dark sirens. The red solid histogram corresponds to the scenario in which the original error volume contains true candidate host galaxies, and the black dashed histogram corresponds to the scenario in which the original error volume only contains false candidate host galaxies. Figure 4 shows the probability distribution of the ratio between the two errors, one ∆H 0 derived before the sky translation and the other after the sky translation. If the original error volumes (before the sky translation) contain the true host galaxy (red solid histogram), we find that, in all our simulations, the ratio will be greater than unity, i.e., the error ∆H 0 worsens after the sky translation. In particular, the ratio may exceed 10 in about 43% of the simulations. If the original error volume contains only false candidate host galaxies (black dashed histogram), the probability that the ratio of ∆H 0 is greater than unity is comparable to the probability of being smaller than unity. Moreover, the ratio hardly exceeds 10 in this latter case. These results are generally consistent with our earlier prediction that a sky translation worsens the constraint on cosmological parameters if the original sky localization is accurate. On the contrary, the sky translation does not significantly affect the constraint if the original sky localization is inaccurate.
Based on the above results, we suggest a value of 10 as the threshold of verifying the reliability of the sky localization of dark sirens. If the error of H 0 increases by a factor of 10 after a sky translation, the original error volumes where we search for the host galaxies of dark sirens are likely to be correct. If the error of H 0 increases by a factor less than 10, the original error volumes should be taken with caution because they may miss the host galaxies in about 57% of the cases.
We note that our method is based on ideal assumptions about the completeness of the galaxy catalog. In practice, it is important to ensure that the survey depth and dust extinction are comparable in the error volumes before and after the sky translation. Otherwise, the completeness of the galaxy catalog will also affect the errors of the cosmological parameters.
The second method we use to test the reliability of the candidate host galaxies is the consistency check (Petiteau et al. 2011). It divides dark sirens into smaller groups and performs Bayesian inference independently on each group. If the spatial localization is correct and the candidate host galaxies are reliable, the posterior probabilities for different groups should be consistent. Alternatively, if the spatial localization is mostly incorrect and the selected host galaxies are false, the posteriors will be more random. Figure 5 shows the result of such a test using 1000 dark sirens. Without dividing into groups and analyzing as a whole population, the 1000 dark sirens will result in a similar posterior distribution (see the vertical lines in the upper panel) regardless of the kind of host galaxies we use (as see Appendix A). However, if we divide them into 10 groups and perform Bayesian inference separately, the results for true and false candidate host galaxies become different. We can see that the posteriors of the 10 groups are consistent when the true candidate host galaxies are used in the simulation (blue), but the posteriors largely differ when we use false candidate host galaxies (red).

DISCUSSION
So far, we have studied the cases in which the fraction of false candidate host galaxies for dark sirens is either f = 0 or f = 100%. In reality, both true and false candidate host galaxies may exist in the sample. A higher f normally results in a larger bias in the estimation of the cosmological parameters. Here, we use a percentilepercentile (P-P) plot (Cook et al. 2006) to demonstrate the dependence of the bias on the fraction of the false candidate host galaxies.
An example is shown in Figure 6. When all the spatial localizations are accurate, the estimation of H 0 is unbiased, the fraction of trial simulations whose true values of H 0 fall within specific CIs as a function of the corresponding CIs (this is referred to as a P-P plot), should be along the diagonal line, as shown by the black dot-dashed line. Statistical fluctuation will blur the distribution, as is illustrated by the dark and light shaded areas in the plot, but the result should be still consistent with the diagonal. Fraction of injected H 0 recovered in CI 0 percent 1 percent 10 percent 50 percent 100 percent 1 , 2 CI Figure 6. Fraction of trial simulations with injected H0 within a CI as a function of CIs. The histograms of different colors show the results for different fractions of false candidate host galaxies. Each histogram is compiled from 48 independent simulations, and each simulation contains 300 dark sirens. The dark and light shaded areas show, respectively, the 1σ and 2σ CIs allowed for the statistical errors. We perform the Kolmogorov-Smirnov (KS) test, with a null hypothesis that the histogram is consistent with the diagonal. The p-values are 0.61, 0.79, 0.28, 2.4 × 10 −4 and 1.8 × 10 −28 , respectively, when f = 0%, 1%, 10%, 50%, and 100%.
When the spatial localizations for a fraction of dark sirens are false, the P-P plot will deviate from the diagonal, as is shown by the colored histograms in Figure 6. Here, we have used 48 trial simulations to compile each histogram, and in each simulation we consider 300 dark sirens. We find that when the fraction of false spatial localization is no more than about 10%, the corresponding histograms are consistent with the diagonal line within the 2σ CI. In this case, the estimation of H 0 is more or less unbiased. However, when f ≥ 10%, the histograms start to exceed the 2σ CI, indicating a significant bias in the estimation of H 0 . In particular, when f = 50% or 100%, the histogram is mostly outside the 2σ CI. This result is not surprising, because now false candidate host galaxies predominate. Figure 7 shows more clearly the bias of the estimated H 0 when half of the spatial localizations are false (f = 50%). We find that the posterior distributions no longer center around the injected value of H 0 . In particular, 12 out of the 24 simulations show a > 1σ bias between the posterior and the injected H 0 . In seven cases (18)(19)(20)(21)(22)(23)(24), we find that the posterior distribution func- tions for H 0 are more consistent with the observations of type-Ia supernovae even though the value we injected is clearly from the measurement of CMB. This result highlights the risk of using dark sirens to constrain cosmological parameters when the environmental effects on GW signals are not well understood.

CONCLUSION
Recent studies have shown that the astrophysical environments of GW sources could significantly perturb the GW signals. Neglecting such an effect would lead to systematic biases in the estimation of the distance and sky position of a GW source. In this work, we have studied the consequences of such a bias on the measurement of the cosmological parameters such as the Hubble-Lemaître constant.
Our main findings are as follows. (i) The biased spatial localization will result in an incorrect statistical probability distribution for the redshift of a dark siren, which will in turn weaken and bias the constraint on the cosmological parameters. (ii) The reliability of the statistical redshifts of dark sirens can be tested by two methods, namely a sky translation and a consistency check. (iii) The constraints on cosmological parameters by dark sirens will be significantly biased if more than 10% of the dark sirens are localized incorrectly.
The above findings are based on the assumption that more than 300 dark sirens can be detected and localized well. It is worth noting that, in our analysis, we only use the GW sources with a localization error of ∆Ω < 1 deg 2 . The total number of detected dark sirens is approximately twice the number of the above well-localized ones. If the number of well-localized dark sirens is significantly below 300, we find that the presence of false candidate host galaxies does not significantly affect the distribution of H 0 error. The same is true even in the extreme scenario that all the candidate host galaxies are inferred incorrectly. This result indicates that, when the number of dark sirens is low (e.g. 100), the distribution of precisions of H 0 is limited mainly by the intrinsic uncertainty of the method, i.e., the statistical uncertainty in compiling the redshift distribution of the host galaxy of a dark siren.
Several recent studies have predicted that the third generation of ground-based GW detectors can constrain the Hubble-Lemaître constant to an precision as high as ∆H 0 /H 0 10 −4 using dark sirens (Yu et al. 2020;Song et al. 2022). However, the sky localizations of a few to a dozen percent of the dark sirens may be biased by nearby SMBHs (Peng & Chen 2021, e.g.), strong gravitational lensing (Broadhurst et al. 2018(Broadhurst et al. , 2022Yang et al. 2021;Smith et al. 2018), or non-stationary detec-tor noise (Edy et al. 2021;Kumar et al. 2022). Without properly modeling these effects, the constraint on the Hubble-Lemaître constant is likely to be much weaker.