Do coefficients of variation of response propensities approximate non‐response biases during survey data collection?

We evaluate the utility of coefficients of variation of response propensities (CVs) as measures of risks of survey variable non‐response biases when monitoring survey data collection. CVs quantify variation in sample response propensities estimated given a set of auxiliary attribute covariates observed for all subjects. If auxiliary covariates and survey variables are correlated, low levels of propensity variation imply low bias risk. CVs can also be decomposed to measure associations between auxiliary covariates and propensity variation, informing collection method modifications and post‐collection adjustments to improve dataset quality. Practitioners are interested in such approaches to managing bias risks, but risk indicator performance has received little attention. We describe relationships between CVs and expected biases and how they inform quality improvements during and post‐data collection, expanding on previous work. Next, given auxiliary information from the concurrent 2011 UK census and details of interview attempts, we use CVs to quantify the representativeness of the UK Labour Force Survey dataset during data collection. Following this, we use survey data to evaluate inference based on CVs concerning survey variables with analogues measuring the same quantities among the auxiliary covariate set. Given our findings, we then offer advice on using CVs to monitor survey data collection.


MOORE et al.
The overall CV is

$$\mathrm{CV} = \frac{S(p)}{\bar{p}}, \tag{1}$$

where $p_i$ is the (estimated) response propensity of subject $i$, $\bar{p}$ is the average response propensity and the numerator $S(p)$ its standard deviation (SD). The less propensities differ, the smaller the CV and the greater the dataset representativeness. Moore et al. (2018a) advise using CVs to monitor data collection instead of R-indicators ($R = 1 - 2\,S(p)$) because dividing $S(p)$ by $\bar{p}$ means the resulting indicators are less likely to suggest high representativeness at early calls due to low propensity variation at low response rates (see also Schouten et al., 2009). Partial unconditional and conditional CVs (CV u s and CV c s) are derived from, respectively, the between and within variance decomposition components, and are bounded by the overall CV.
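The contrast with the R-indicator can be illustrated concretely. A minimal Python sketch (with hypothetical propensity values, and omitting the bias adjustment and standard errors of the RISQ R code):

```python
import numpy as np

def overall_cv(p):
    """Overall CV: standard deviation of the response propensities over their mean."""
    p = np.asarray(p, dtype=float)
    return p.std() / p.mean()

def r_indicator(p):
    """R-indicator of Schouten et al. (2009): R = 1 - 2 * SD(p)."""
    p = np.asarray(p, dtype=float)
    return 1.0 - 2.0 * p.std()

# At low response rates propensities vary little in absolute terms, so the
# R-indicator looks favourable; dividing by the (small) mean propensity makes
# the CV less prone to suggesting high early-call representativeness.
early = [0.04, 0.06]  # hypothetical early-call propensities
late = [0.40, 0.60]   # hypothetical later propensities, same relative spread
# r_indicator(early) is 0.98 (apparently near-representative),
# yet overall_cv(early) equals overall_cv(late): relative variation is unchanged.
```

Here `early` and `late` are illustrative values, not survey estimates; the point is only that the SD shrinks with the response rate while the CV does not.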
CV u s quantify univariate associations between auxiliary covariates and propensity variation. The CV u for covariate Z with K categories is

$$\mathrm{CV_u}(Z) = \frac{1}{\bar{p}}\sqrt{\sum_{k=1}^{K}\frac{n_k}{n}\,(\bar{p}_k-\bar{p})^2}, \tag{2}$$

where $n_k$ is the number of observations in category $k$ and $\bar{p}_k$ the mean response propensity in category $k$. Large values suggest substantial between-category variability and non-representativeness associated with Z. Category CVs decompose, and are bounded by, covariate CVs:

$$\mathrm{CV_u}(Z)^2 = \sum_{k=1}^{K}\mathrm{CV_u}(Z,k)^2. \tag{3}$$

The CV u for category k of Z is

$$\mathrm{CV_u}(Z,k) = \frac{1}{\bar{p}}\sqrt{\frac{n_k}{n}}\,(\bar{p}_k-\bar{p}). \tag{4}$$

Values can be positive or negative, implying, respectively, over- or under-representation.
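These definitions can be checked numerically. The Python sketch below (with made-up propensities and a hypothetical two-category covariate) computes covariate- and category-level CV u s and verifies the decomposition property:

```python
import numpy as np

def cv_u_covariate(p, z):
    """Covariate-level unconditional CV: between-category SD of mean propensities / overall mean."""
    p, z = np.asarray(p, float), np.asarray(z)
    pbar, n = p.mean(), len(p)
    between = sum((z == k).sum() / n * (p[z == k].mean() - pbar) ** 2
                  for k in np.unique(z))
    return np.sqrt(between) / pbar

def cv_u_category(p, z, k):
    """Category-level unconditional CV; the sign marks over- (+) or under- (-) representation."""
    p, z = np.asarray(p, float), np.asarray(z)
    mask = z == k
    return np.sqrt(mask.sum() / len(p)) * (p[mask].mean() - p.mean()) / p.mean()

p = np.array([0.2, 0.2, 0.4, 0.4])  # hypothetical propensities
z = np.array([0, 0, 1, 1])          # a two-category covariate
# The squared category CVs sum to the squared covariate CV (the decomposition),
# and the category values are signed: category 0 is under-represented here.
```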
CV c s quantify associations between auxiliary covariates and propensity variation conditional on the other auxiliary covariates. The CV c for covariate Z is

$$\mathrm{CV_c}(Z) = \frac{1}{\bar{p}}\sqrt{\frac{1}{n}\sum_{l=1}^{L}\sum_{i\in l}(p_i-\bar{p}_l)^2}, \tag{5}$$

where $\bar{p}_l$ is the mean propensity of the $l$th of $L$ cells in a cross-classification of x excluding Z, and x is the covariate subset used for the propensity modelling. The CV c for category k of Z is

$$\mathrm{CV_c}(Z,k) = \frac{1}{\bar{p}}\sqrt{\frac{1}{n}\sum_{i=1}^{n} h_i\,(p_i-\bar{p}_{l(i)})^2},$$

where $h_i$ indicates whether subject $i$ is in category $k$ and $l(i)$ is the cell containing subject $i$. Large CV c s imply substantial non-representativeness solely attributable to Z. In addition, adjustments to correct for biases caused by estimating propensities exist, as do approximate standard errors which, when converted into 95% confidence intervals (CV ± 1.96 × standard error), enable inference regarding (comparative) representativeness (de Heij et al., 2015). Population-level analysis is also possible by applying weights.
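The conditional partial is a within-cell quantity, which a short Python sketch makes concrete (cell labels and propensities are illustrative):

```python
import numpy as np

def cv_c_covariate(p, cells):
    """Covariate-level conditional CV: within-cell SD of propensities / overall mean.

    `cells` labels the cross-classification of the modelling covariates excluding Z.
    """
    p, cells = np.asarray(p, float), np.asarray(cells)
    within = sum(((p[cells == l] - p[cells == l].mean()) ** 2).sum()
                 for l in np.unique(cells))
    return np.sqrt(within / len(p)) / p.mean()

# If propensities are constant within cells, no variation is solely attributable
# to Z: the conditional CV is zero even when the unconditional CV is not.
p = np.array([0.2, 0.2, 0.4, 0.4])
cells = np.array([0, 0, 1, 1])
```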

| CV inferences about survey variable non-response biases
Overall CVs predict the maximum absolute standardised bias of survey variable means when non-response correlates maximally with the auxiliary covariates. Given an unknown auxiliary covariate set explaining response behaviour (ℵ), the Horvitz-Thompson estimate of the bias of a survey variable is approximated by the covariance between sample response propensities and the survey variable divided by the mean response propensity (Bethlehem, 1988). This value is standardised by dividing by the survey variable sample standard deviation ($S(y)$, for variable $y$ with response mean $\hat{y}_r$). By replacing the numerator covariance with its absolute maximum, which by the Cauchy-Schwarz inequality is the product of the two variables' standard deviations, the maximum absolute standardised bias is estimated. The overall CV approximates this if the auxiliary covariate set ℵ can be replaced by the utilised set x (de Heij et al., 2015), for example,

$$\frac{\left|\mathrm{Bias}(\hat{y}_r)\right|_{\max}}{S(y)} = \frac{S(p)\,S(y)\,/\,\bar{p}}{S(y)} = \frac{S(p)}{\bar{p}} = \mathrm{CV}. \tag{6}$$

Whether auxiliary covariate set ℵ can be replaced by set x is untestable. In practice, including correlates of both response propensities and survey variables is essential, or biases may be under-estimated (see Section 1). We note that another indicator studied by Nishimura et al. (2016), the survey variable absolute maximum bias ($= S(p)\,S(y)\,/\,\bar{p}$), is derived similarly given $S(y)$ (Schouten et al., 2009).
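The bound in this derivation can be demonstrated numerically: for any survey variable, the standardised Bethlehem bias approximation never exceeds the overall CV. A Python sketch with simulated propensities and a hypothetical survey variable:

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(0.1, 0.9, 1_000)  # simulated response propensities
y = rng.normal(size=1_000)        # a hypothetical survey variable

cv = p.std() / p.mean()
# Bethlehem (1988) approximation, standardised by S(y):
std_bias = abs(np.cov(p, y, bias=True)[0, 1] / p.mean()) / y.std()
# Cauchy-Schwarz: |Cov(p, y)| <= S(p) S(y), hence std_bias <= CV.
```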
In contrast, partial CV predictions about auxiliary covariate (analogue) 'non-response biases' are not described in the literature. In terms of Equation (6), response propensity-auxiliary covariate covariance is maximal. Hence, as they are derived from the between component of the variance decomposition (see Section 2.1.1), CV u s provide inferences about covariate category (focal vs. others combined) standardised mean biases.
For two-category covariates, the covariate CV u (Equation (2)) should approximate the absolute value of this bias (which in this case is independent of the focal category). For multi-category covariates, K combinations of the focal category vs. the others exist. For these, the covariate CV u should approximate the maximum absolute value of the different biases that can be computed, a value that would be obtained if the observed degree of propensity variation were due to all category deviations from expected being identical except for the focal category.
Category CV u s (Equation 4) concern only the deviation of the focal category from expected. Hence, they should approximate the minimum category bias, with under-estimation less when, due to category size or the deviation, its contribution to the covariate inequality is large.
To summarise,

$$\left|\mathrm{CV_u}(Z,k)\right| \le \frac{\left|\mathrm{Bias}(\hat{Z}_{kr})\right|}{S_{Z_k}} \le \mathrm{CV_u}(Z), \tag{7}$$

where $S_{Z_k}$ is the standard deviation of the dummy variable indicating membership of category $k$, and the upper bound is attained if K = 2. CV c s are derived from the within component of the variance decomposition (see Section 2.1.1). Hence, they make similar predictions about auxiliary covariate category (focal vs. others) absolute standardised conditional mean biases (i.e. those remaining after conditioning on the other covariates). Therefore,

$$\mathrm{CV_c}(Z,k) \le \frac{\left|\mathrm{Bias}_c(\hat{Z}_{kr})\right|}{S_{Z_k}} \le \mathrm{CV_c}(Z), \tag{8}$$

where $\mathrm{Bias}_c(\hat{Z}_{kr})$ is the focal category conditional bias, and again the upper bound is attained if K = 2. Inequalities (7) and (8) represent a further functionality of CVs of use when assessing dataset quality: if auxiliary covariate analogues measure the same quantities, partial CVs provide inferences about survey variable non-response biases (see also Sections 2.2.3 and 2.2.4).
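For K = 2 the upper bound of inequality (7) is attained exactly, which can be verified numerically. A Python sketch with simulated data (the standardised bias is computed directly from the Bethlehem approximation):

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.integers(0, 2, 500)                        # two-category covariate dummy
p = 0.3 + 0.2 * z + rng.uniform(-0.05, 0.05, 500)  # propensities depending on z

pbar = p.mean()
# Covariate-level unconditional CV (between-category component).
cv_u = np.sqrt(sum((z == k).mean() * (p[z == k].mean() - pbar) ** 2
                   for k in (0, 1))) / pbar
# Standardised category bias: |Cov(p, z)| / (p-bar * S(z)).
std_bias = abs(np.cov(p, z.astype(float), bias=True)[0, 1]) / (pbar * z.std())
# With K = 2 the two quantities coincide algebraically.
```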

| Using CVs to inform dataset quality improvements
Regarding modifications of collection method, CVs can be couched in terms of missing data mechanisms (Schouten et al., 2012). Overall CVs quantify deviations in response from missing completely at random (MCAR) given the auxiliary covariate set. CV u s quantify deviations from MCAR with respect to a given auxiliary covariate (category), and CV c s similar deviations from missing at random (MAR), that is, the extent to which response is not missing at random (NMAR) given the other auxiliary covariates. Hence, CV u s identify under-represented groups to target. CV c s ensure efficient targeting: non-significance implies an impact also associated with other auxiliary covariates. The suggested strategy is to target categories with significant CV c s and some with significant CV u s only if (non-significant) CV c s indicate correlations exist with other categories (Schouten & Shlomo, 2017).
Similar arguments underlie why CVs can also inform post-collection non-response bias adjustments. Non-survey-variable-specific methods, including inverse response propensity weighting (e.g. Roberts et al., 1987), often assume responses are MAR given an auxiliary covariate set explaining response behaviour. To identify such sets, Särndal and Lundström (2010; see also Särndal, 2011) use the coefficient of variation of the weights as a quality measure (Lundquist & Särndal, 2013 and Särndal & Lundquist, 2014 also similarly derive 'balance' indicators for assessing dataset quality). If the weights and propensities are similarly estimated (weighting often uses an identity link, in contrast to the logistic link generally used for propensities), or the sample size is large, this measure is approximately equivalent to the overall CV: dividing the standard deviation of a set of inverse values by their mean approximates the same calculation using the raw values (see also Schouten et al., 2016). Given this, partial CVs can be used to identify auxiliary covariates to include in the covariate sets used in weighting adjustments. CV c s quantify the inequalities remaining after adjustment assuming MAR given the other auxiliary covariates, so if covariates with large values are excluded from such sets, their impacts will not be addressed (we note here that the included covariates should also be correlated with the survey variables, or adjusted variable variances will be inflated; Little & Vartivarian, 2005). In fact, CV c s can be used to statistically select covariate set members, as when identifying modification targets. In contrast, Särndal and Lundström's methods, which compare all possible sets during covariate selection, use arbitrary thresholds for accepting more complex sets. This is the first time this functionality of CVs has been described.
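The approximate equivalence between the coefficient of variation of inverse propensity weights and the overall CV of the propensities can be illustrated numerically (a Python sketch; exact equality does not hold in general, but the two agree to first order when propensity variation is modest):

```python
import numpy as np

rng = np.random.default_rng(2)
p = rng.uniform(0.55, 0.65, 10_000)  # propensities with modest variation
w = 1.0 / p                          # inverse response propensity weights

cv_p = p.std() / p.mean()            # overall CV of the propensities
cv_w = w.std() / w.mean()            # Sarndal & Lundstrom style weight CV
# Delta-method (first-order) agreement: cv_w ~ cv_p when S(p)/p-bar is small;
# the discrepancy is of order CV^3.
```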

| Using CVs to identify phase capacity (PC) points
Design phase capacity (PC) points are points in the data collection process after which further quality increases are limited and methods should be modified or collection ended (Groves & Heeringa, 2006). Moore et al. (2018a) use CVs to identify PC points in household call records in three UK social surveys (including the LFS, whose individual level dataset is studied in this paper). They identify overall CV points and CV u points for auxiliary covariates and under-represented categories: the former can be used to identify when to end collection, while impacts measured by CV u s are modification targets, potentially separately. They use numeric methods, specifically two rules that reflect whether identification is during collection (informing current efforts) or after (informing future sampling): (a) if the CVs imply quality decreases or are within threshold a of the previous call CV ('during') and (b) if CVs imply quality decreases or are within a of the best call record CV ('after').
No information existed on call costs or other methods, precluding optimisation of data collection given such alternatives using, for instance, the methods of Schouten et al. (2013); this is also the case for the dataset in this paper. Different thresholds a give comparable results. It should be noted that category (covariate) CV u s are decompositions, so PC points should be earlier than or similar to those given covariate (overall) CVs, although this may not always hold as the latter combine multiple (different) inequalities. An alternative to numeric methods for identifying PC points is inferential methods. Most research seeks to identify points given changes in non-response adjusted survey variables over calls (Lewis, 2017; Rao et al., 2008; Wagner & Raghunathan, 2010). Tests assess whether variable differences differ from zero, accounting in adjustment-method-specific ways for the dataset dependencies caused by early call responses also being in later call datasets, so are not usable with CVs. CVs are the focus of Schouten et al. (2016), who develop a rank test, based on partial CVs given different auxiliary covariate sets and auxiliary covariate biases, to assess representativeness-survey variable bias relationships. This can be used to identify PC points, but only from multiple covariate CVs. Concerning identification from single CV series, we suggest one approach that ignores dataset dependencies: using the CV 95% CIs. As with numeric methods, different rules can be constructed to reflect whether points are identified during or after data collection. A PC point is identified during collection if the CVs are non-significant (i.e. the 95% CIs include zero), imply quality decreases or the 95% CIs overlap the previous call CIs. A PC point is identified after collection if the CVs are non-significant, imply quality decreases or the 95% CIs overlap that for the call with the best CV. We use these rules for the first time below.
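The numeric 'during' and 'after' rules can be stated compactly in code. A Python sketch over a hypothetical per-call CV series (the threshold a and the series are illustrative, not LFS values):

```python
def pc_point_during(cvs, a=0.02):
    """PC at call t if call t+1's CV rises, or improves on call t's by no more than a."""
    for t in range(1, len(cvs)):
        if cvs[t] >= cvs[t - 1] - a:
            return t  # 1-based call number of the PC point
    return len(cvs)

def pc_point_after(cvs, a=0.02):
    """PC at the earliest call whose CV is within a of the best (lowest) CV in the record."""
    best = min(cvs)
    return next(t for t, cv in enumerate(cvs, start=1) if cv - best <= a)

cvs = [0.30, 0.20, 0.13, 0.125, 0.12, 0.124]  # hypothetical overall CVs, calls 1-6
```

The 'during' rule can only look backwards, so it may fire later than the 'after' rule, which sees the whole record; on this series both identify call three.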
We note that when using inferential methods in empirical scenarios, significance levels need consideration. The CV CI widths decline as response rates increase (Moore et al., 2018a), so unless such levels are adjusted, the statistical power to identify CV differences will vary over calls (see Lewis, 2017 for discussion of similar issues with non-CV-based tests). This is perhaps a reason to use numeric methods; another is when decisions are optimal before CV parity due to, for example, the costs of the alternative data collection methods. In the work in this paper, though, a single significance level is not an issue: we evaluate CV performance by comparing CV PC points with those based on estimated bias (whose CIs change similarly: see Section 3).

| The Labour Force Survey (LFS) dataset
The Office for National Statistics (ONS) 2011 CNRLS links January to July 2011 UK social survey samples and their survey responses to their 27th March 2011 census records, providing attribute information whether subjects are interviewed or not (Parry-Langdon, 2011). Linkage is via matching of subject addresses and personal details (name, gender and date of birth). Survey call records are also appended. Our focus here, the LFS, samples English and Welsh individuals aged over 15 on labour market topics. Simple random sampling of households (HHs) is used. ONS operatives seek to interview all HH occupants. Most interviews are face to face, but a telephone interview can be chosen (see also below). The LFS is longitudinal, but we consider wave one subjects only to avoid sample attrition effects. For this wave, 96.9% of HHs and 93.3% of subjects are linked to census records (Table 1). Hence, we can study the majority of the sample using (self-reported) census responses (see ONS, 2014), which reflect their attributes at the time (though we cannot rule out biases without non-linked subject data: Moore et al., 2018a). The call record data detail outcomes of calls to HHs (up to 20), and do not exist for telephone contacted HHs and some others (29.8% of the sample; see also below). Most HH members are interviewed at the same call. However, in around 1% of HHs, two members are interviewed at different calls. For these, we use the interview order to assign members to calls.
In our analyses, we consider eight survey variable-census auxiliary covariate analogue pairs (Table 2). All impact on LFS response propensities (Durrant & Steele, 2009; Durrant et al., 2010, 2011, 2013) and are likely to be associated with other survey variables. 'Tenure' is a HH response, 'HH structure' a derived response, 'Located in London/SE' a geographical identifier and the others individual responses. For a number of subjects, some responses are missing. This can reflect item non-response, but often is due to statistical disclosure control or, as with LFS subjects aged over 64, not being asked some items. Often, two or more responses are missing, so to minimise correlations we exclude these subjects from the analysed dataset. However, the remaining 'Age', 'Gender', 'Tenure' and 'HH structure' item non-responders are so few that disclosure issues arise. Hence, we also exclude them, so that these variables/covariates lack No response (NR) categories. In total, we exclude 24% of the sample due to missing responses. Use of the methods below shows that excluding these subjects and those without call records (see previously) from the dataset causes under-representation of those in owned HHs or aged '16 to 27' compared to the sample (results not shown): we outline likely impacts on findings in Section 3.

TABLE 1 Dataset construction and content. 'Linked to census', 'Face to face interview', 'With call records', 'Under 65' and 'Without item NRs' are the numbers of (remaining) individuals and HHs with such characteristics, the last being the size of the analytical dataset. 'Interviewed', 'Refusal' and 'Non-contact' are numbers of outcomes in the call 20 dataset.

| Quantifying LFS dataset representativeness and identifying PC points
We quantify LFS dataset representativeness by computing CVs from response propensities estimated using a logistic regression model with as main effects the census auxiliary covariates listed in Table 2.
At each call in the record, we compute bias-adjusted overall CVs, auxiliary covariate partial CVs and CV 95% CIs. We do not conduct population-level inference as some survey subjects are not studied, so the supplied weights are not useful. We compute the CVs using the R code of de Heij et al. (2015; see www.risq-project.eu). We then identify CV PC points. We identify overall CV points, and CV u points for auxiliary covariates and selected under-represented categories, using the numeric and inferential 'during' and 'after' collection identification rules described in Section 2.1.4. For the numeric method points, we use a threshold a of ±0.02: others give comparable results (not shown).
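The estimation pipeline is, in outline: fit a logistic regression of the call-specific response indicator on the auxiliary covariates, obtain fitted propensities and compute the CV. A self-contained Python sketch with simulated stand-in covariates (the paper itself uses the RISQ R code, whose bias adjustment we omit here):

```python
import numpy as np

def fit_propensities(X, r, iters=25):
    """Logistic regression of response indicator r on covariates X via Newton-Raphson."""
    Xd = np.column_stack([np.ones(len(X)), X])  # add intercept
    beta = np.zeros(Xd.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Xd @ beta))
        W = p * (1.0 - p)                       # IRLS weights
        beta += np.linalg.solve((Xd.T * W) @ Xd, Xd.T @ (r - p))
    return 1.0 / (1.0 + np.exp(-Xd @ beta))     # fitted propensities

rng = np.random.default_rng(3)
X = rng.integers(0, 2, size=(5_000, 3)).astype(float)  # stand-ins for census covariates
true_logit = -0.5 + 0.8 * X[:, 0] - 0.4 * X[:, 1]
r = (rng.random(5_000) < 1.0 / (1.0 + np.exp(-true_logit))).astype(float)

p_hat = fit_propensities(X, r)  # main-effects model, as in our specification
cv = p_hat.std() / p_hat.mean()
```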

| Comparisons with census auxiliary covariate category 'non-response biases'
We evaluate CV-based inference about survey variables with auxiliary covariate analogues by first computing logistic-regression-based estimates of census auxiliary covariate standardised 'non-response biases' for comparison. We describe CV predictions about category 'non-response biases' in Section 2.1.2. To evaluate them, for the three two-category covariates and the selected under-represented multi-category covariate categories, we code new binary covariates such that

$$y_i = \begin{cases} 1 & \text{if subject } i \text{ is in the focal category,} \\ 0 & \text{otherwise,} \end{cases}$$

where i = 1, …, n. We let $r_i$ be the response indicator for subject $i$ at a given call, with $r_i = 0$ indicating that they have not responded to the survey and $r_i = 1$ that they have. Next, at each call in the record, we estimate non-respondent-respondent differences in the log-odds of category membership. We fit two statistical models. Model A estimates overall differences:

$$\mathrm{logit}(\pi_i) = \beta_0 + \beta_1 r_i,$$

where $\pi_i = \Pr(y_i = 1 \mid r_i)$ is the category membership probability, $\beta_0$ is the non-respondent log-odds of membership and $\beta_1$ is the respondent log-odds difference from $\beta_0$. Model B estimates differences conditional on the auxiliary covariate set $\mathbf{d}_i$ (set x minus the covariate underlying $y_i$):

$$\mathrm{logit}(\pi_i) = \beta_0 + \beta_1 r_i + \boldsymbol{\beta}^{\top}\mathbf{d}_i,$$

where $\boldsymbol{\beta}$ is a vector of coefficients. In model B, as $y_i$ and $r_i$ are binary, a $\beta_1$ of zero implies that response with regard to a category is MAR given the auxiliary covariates. Non-zero values quantify the extent to which it is NMAR (Barbosa, 2014). This provides similar information to a CV c (see Section 2.1.3). In model A, $\beta_1$ quantifies the deviation from MCAR, providing similar information to a CV u . Then, from the parameter estimates, we compute standardised 'non-response biases' for the categories of $y_i$ as

$$\widehat{\mathrm{Bias}}_s = \frac{m}{n} \cdot \frac{\hat{\pi}_r - \hat{\pi}_{nr}}{S_s},$$

where $m$ is the number of non-respondents, $\hat{\pi}_r$ is the respondent category membership probability, $\hat{\pi}_{nr}$ is the non-respondent probability and $S_s$ is the sample probability standard deviation (see Groves & Couper, 1998). With model A, we back-transform parameter estimates to obtain $\hat{\pi}_r$ and $\hat{\pi}_{nr}$.
Model B estimates are conditional, so we compute marginal category membership probabilities to obtain r and nr , using Hastie's (1992) 'safe prediction' method in the R package 'effects 3.1.2' (Fox, 2003;Fox & Hong, 2009). To obtain S s we fit a null model and use the delta method (Oehlert, 1992) in the R package 'msm 1.6.4' (Jackson, 2011).
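Because model A is saturated in the response indicator, back-transforming its estimates reproduces the observed respondent and non-respondent category proportions, so the overall standardised bias can equivalently be computed directly. A Python sketch with illustrative data:

```python
import numpy as np

def standardised_bias(y, r):
    """Model A overall standardised 'non-response bias': (m/n) * (pi_r - pi_nr) / S_s."""
    y, r = np.asarray(y, float), np.asarray(r, bool)
    n, m = len(y), int((~r).sum())           # m = number of non-respondents
    pi_r, pi_nr = y[r].mean(), y[~r].mean()  # category membership proportions
    return (m / n) * (pi_r - pi_nr) / y.std()

# A category over-represented among respondents yields a positive bias.
y = np.array([1, 1, 1, 0, 0, 0])              # focal-category dummy
r = np.array([1, 1, 1, 0, 1, 0], dtype=bool)  # response indicator
```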
We also identify overall (model A) bias PC points to compare to CV points, using the same methods (see Section 2.1.4). We again utilise the delta method to estimate standard errors and 95% CIs for the bias. Concerning predictions, covariate-level CV u points for two-category CVs and bias points should correspond (see Section 2.1.2). Given contributions to covariate inequalities, similarities between multi-category covariate category CV u and bias points should also exist.

| Survey variable-census auxiliary covariate analogue similarity
To assess survey variable-census auxiliary covariate analogue similarity, for studied categories we compute survey respondent proportions given each data source at each call. We compare values graphically and using Z tests for independent sample proportions.
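The Z test for two independent proportions used here can be sketched as follows (Python; pooled-variance form, with the standard normal p-value obtained via the error function):

```python
import math

def z_test_proportions(x1, n1, x2, n2):
    """Two-sided Z test comparing proportions x1/n1 and x2/n2 (pooled standard error)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF, Phi(x) = (1 + erf(x / sqrt(2))) / 2.
    p_value = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    return z, p_value
```

Note that, with proportion differences fixed, the test's power grows with the respondent dataset over calls, which bears on the significance patterns reported below.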

| Response rate development and CVs
LFS responses accumulate at a decreasing rate over the call record, with minimal increases after call 9 and none after call 17 (Figure 1). The overall CVs decrease at a declining rate, suggesting increased representativeness. The corresponding 95% CIs (see Table 1 in the online Appendix), which decrease in width over the call record (as other CV intervals also tend to do), all exclude zero, implying respondents are always significantly non-representative of the sample. We report the partial CVs in Figures 2 and 3 and the corresponding 95% CIs in Tables 1-4 in the online Appendix. The 95% CIs of most auxiliary covariate unconditional CVs (CV u s) exclude zero, implying significant associated inequalities. The 'Located in London/SE' CV u s begin as the largest, decrease at a declining rate to call six, then increase slightly. The 'HH structure' CV u s increase to call four, then decrease slightly, and are largest in the final dataset. Five covariates exhibit smaller, similar final inequalities. The 'Age' and 'Activity last week' CV u s decrease at declining rates. The 'Ethnicity' CV u s decrease slightly. The 'Qualifications' CV u s decrease to call two, then increase slightly. The 'Tenure' CV u s begin non-significant, increase to call two, then decrease slightly. The 'Gender' CV u s are the smallest of all, increase slightly and are non-significant to call five.
The category CV u s suggest 'Located in London/SE' and 'Age' inequalities are due to under-representation of London/SE and subjects aged under 40, although many are eventually interviewed. The 'Activity last week' inequality reflects similar Employed under-representation and increasing Student under-representation. The 'HH structure' inequality is due to initial under-representation of Single adult and Single adult with children HHs, but the latter impact declines and Other HH becomes under-represented. The 'Ethnicity' inequality reflects under-representation of Asian, Other and NR, and increasing under-representation of Mixed and Chinese. The 'Qualifications' inequality is due to initial under-representation of NVQ4+, NVQ3 and NR, but the first two impacts decline and None becomes under-represented. The 'Tenure' and 'Gender' inequalities reflect under-representation of Not owned HHs and Males.
The conditional CVs (CV c s) suggest some of these impacts are independent. Only the 'Gender' and 'Tenure' CV c s are non-significant. Some 'Qualifications' and 'HH structure' CV c s are larger than the CV u s, implying greater inequalities. The 'Located in London/SE', 'Age' and 'Activity last week' CV c s are smaller, suggesting inequalities partly correlated with the other auxiliary covariates. Of the under-represented categories, the Student, NVQ3, Not owned HH, Male and most 'Ethnicity' impacts disappear: the category CV c s are non-significant. The London/SE, Employed, Mixed ethnicity, 'Qualifications' None and NR, 'Age' and 'HH structure' impacts do not. Many such impacts exist in the HH dataset, putatively due to groups being less contactable (Moore et al., 2018a). This likely also holds for (some of) those newly identified here. The Employed and Not owned HH impacts are, respectively, increased and reduced by including the excluded subjects (those missing multiple responses, etc.) in the dataset (see also Section 2.2.1). Regarding improving datasets, categories with significant CV c s are method modification targets (see also Section 2.1.3). Some with significant CV u s only should also be included if their impacts may be correlated with those of other categories: for instance, Students and Not owned HHs. Similarly, all covariates except 'Gender' and 'Tenure' should be included in auxiliary covariate sets used in post-collection bias adjustments.

FIGURE 1 LFS dataset cumulative response rate (RR) over the call record and similar dataset overall CVs. See Table 1 in the online Appendix for the CV 95% CIs.

| CV PC points
The numeric method overall CV PC points using the 'during' and 'after' rules are at calls four and five, respectively (Table 3). As expected, since CV u s are decompositions, most auxiliary covariate CV u points are at the same calls or earlier. The 'Gender', 'Tenure' and 'HH structure' 'during' and 'after' points are at calls two and one, respectively, and similarly the 'Qualifications' points are at calls three and two, because, although the CV u minima are at the earlier calls, the 'during' rule only detects later increases. The 'Ethnicity' points are at call two, the 'Located in London/SE' and 'Age' points at call four, and the 'Activity last week' points at call five. We identify (multi-category) auxiliary covariate category CV u points for the under-represented categories 'Age' 28 to 39, 'Activity last week' Employed, 'Ethnicity' Asian, 'HH structure' Single adult and 'Qualifications' None. These points are earlier than the covariate CV u and overall CV points, again as expected. The 28 to 39 and Employed 'during' and 'after' points are at calls three and four, respectively, due to later CV decreases detected by the 'after' rule. For the others, the 'during' points are one call later than the 'after' points (calls two and one, respectively), again due to the former rule not detecting CV minima. The inferential method PC points follow similar patterns. The overall CV points are at call five. The 'Gender' and 'Tenure' auxiliary covariate CV u 'during' points are at call one, due to CV u non-significance (the 'after' points are again at the same call). The 'Ethnicity', 'HH structure' and 'Qualifications' points are mostly earlier than the 'Located in London/SE', 'Age' and 'Activity last week' points. Some are earlier than the numeric points, due to the 95% CIs overlapping at CV u differences larger than 0.02.
The under-represented category Employed and 28 to 39 category CV u points are later than the No qualifications, Asian and Single adult points, with differences from the numeric points due to the non-significance of the CV u or the overlapping of CI at CV u differences less than 0.02.

| Census auxiliary covariate analogue category 'non-response biases'
We report estimated biases in Figure 4, and their 95% CIs in Tables 5 and 6 in the online Appendix. They are mostly consistent with the CVs. As expected, since the CVs predict (conditional) biases of the category means, overall (model A) and conditional (model B) (absolute) biases for the two-category auxiliary covariates are quantitatively similar to the covariate CV u s and CV c s, respectively. Correspondence is close for the 'Tenure', 'Located in London/SE' and 'Gender' overall biases: conditional biases can be slightly larger than the CV c s. The 95% CI widths for bias tend to decline over calls, as with the CVs. Some significance differences exist: the 95% CIs for the 'Gender' call one overall bias and the later call 'Gender' and 'Tenure' conditional biases exclude zero. Moreover, for the studied multi-category auxiliary covariate categories, qualitative similarities at least exist between the (absolute) bias estimates and the category CVs (the CVs predict bias minima, with under-estimation less when contributions to the covariate inequalities are large). The Asian and No qualifications overall biases correspond with the CV u s, with conditional biases slightly larger and smaller than the CV c s, respectively. The Single adult HH, Employed and 28 to 39 biases are larger than the CVs: the last two differences decline over the calls because contributions to the covariate inequalities increase. The widths of the 95% CIs for bias also tend to decline, with significance similar to the CVs. In addition, biases are smaller than the relevant covariate CVs, which in these cases predict category bias maxima, and all biases are smaller than the overall CVs, which predict (survey-wide) category bias maxima.

TABLE 3 Numeric threshold and statistical inference identified 'during' and 'after' rule design phase capacity (PC) points for selected census auxiliary covariate categories, based on partial CV u s and, where comparable, transformed bias model A estimates.

PC points
The estimated overall bias and CV u PC points also mostly correspond (Table 3). With the numeric identification methods, the two-category auxiliary covariate bias and covariate CV u points are at the same calls. The same occurs for the multi-category auxiliary covariate categories, except for Employed, for which the bias points are two calls later. With the inferential methods, the two-category auxiliary covariate bias and covariate CV u points are at the same calls except for the 'Tenure' 'during' point, which is at call two due to the significance of the call one estimate. For the multi-category auxiliary covariate categories, the points are at the same call, or the bias points are one to two calls earlier. Concerning the inferential points, though the statistical power issues associated with their identification are less problematic in our analyses (see Section 2.1.4), we note that while some of the bias points are earlier (the CIs are wider), correspondence between the CV u and bias points is similar to that reported here when subsets of the dataset with 5000 and 10,000 subjects are analysed (results not shown).

| Survey variable-census auxiliary covariate analogue similarity
We report the category proportions for the survey respondents in the two data sources in Figure 5. The values are as expected regarding the implied biases (the census sample values are mostly higher) and the changes over calls: they increase for categories becoming less under-represented and decrease for those becoming more so. They are also consistent with survey variable-census auxiliary covariate analogue similarity. The Male, London/SE, Not owned HH and 28 to 39 proportions in the two sources are indistinguishable in the plots. Minor differences exist (mainly at early calls) for Single adult HH, Employed, No qualifications and Asian. For the first five categories, the Z tests for differences are all non-significant at the 0.05 level (see Table 7 in the online Appendix). For the rest, the differences are significant after calls three to four: given the point estimates, as mentioned when identifying the PC points (see Section 2.1.4), this is due to the increasing size of the respondent dataset.

FIGURE 4 Partial CVs and model (A = overall, B = conditional) estimated standardised 'non-response biases' for the auxiliary covariate categories: (a) 'Located in London/SE' Yes; (b) 'Tenure' Not owned; (c) 'Gender' Male; (d) 'Age' 28 to 39; (e) 'Activity last week' Employed; (f) 'Ethnicity' Asian; (g) 'Qualifications' None; (h) 'HH structure' Single adult. The first three covariates have two categories, so the covariate CVs are comparators, with (as CVs are constrained to be positive) the absolute values of the model bias estimates reported. The other covariates are multi-category, so category CVs are comparators. With these, the CV c is constrained to be positive, so absolute values of the model B based bias estimates are reported. See Tables 5 and 6 in the online Appendix for the bias estimate 95% CIs.

DISCUSSION
We evaluate the performance of coefficients of variation of response propensities (CVs) when monitoring the risks of survey variable non-response biases during survey data collection. CVs quantify dataset representativeness in terms of variation in sample response propensities estimated given a fully observed auxiliary attribute covariate set correlated with the survey variables: high representativeness implies low bias risk. Practitioners are interested in using CVs to monitor survey data collection, but little research exists on how well they predict observed biases. We extend work on CV predictions concerning biases and how they inform dataset improvements. Next, we use CVs to quantify (changes in) UK Labour Force Survey (LFS) dataset representativeness over data collection, utilising linked survey sample census responses as auxiliary covariates. Then, we evaluate CV inferences about survey variables with analogues estimating the same quantities among the auxiliary covariates.
Regarding bias prediction, overall CVs approximate the maximal absolute standardised bias of survey variable means when non-response correlates maximally with the auxiliary covariates (de Heij et al., 2015). We show that partial unconditional and conditional covariate CVs (CV u s and CV c s, respectively), which decompose overall CVs to measure (conditional) deviations in response with respect to auxiliary covariates, also predict similar absolute standardised 'non-response biases' of category means for two-category auxiliary covariates. For similar multi-category covariates, category (focal vs. others) bias maxima are predicted. Category CVs predict category bias minima, with less under-estimation when contributions to covariate inequalities are large. These predictions have not previously been reported, and potentially increase the utility of CVs when assessing survey datasets (and others with missing data, for example linked datasets; e.g. Moore et al., 2018b). If the survey variables and auxiliary covariate analogues measure the same quantities, partial CVs can be used to make inferences about survey variable biases. Our empirical work, which we discuss below, tests this contention in the LFS.
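The quantities involved can be made concrete with a short numerical sketch. The code below computes, for simulated response propensities and a two-category auxiliary covariate, the overall CV and the unconditional partial CVs at covariate and category level (the latter taken here as each category's contribution to the between-category variance component). All names and values are illustrative, not taken from the LFS.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical estimated response propensities for n subjects, with a
# two-category auxiliary covariate z (0 = other, 1 = focal category).
n = 10_000
z = rng.integers(0, 2, size=n)
p = np.clip(0.55 + 0.10 * z + rng.normal(0.0, 0.05, size=n), 0.01, 0.99)

p_bar = p.mean()

# Overall CV: SD of the propensities divided by their mean.
cv_overall = p.std() / p_bar

# Partial unconditional covariate CV: square root of the between-category
# variance component, divided by the mean propensity.
cats, counts = np.unique(z, return_counts=True)
between = sum((counts[k] / n) * (p[z == cats[k]].mean() - p_bar) ** 2
              for k in range(len(cats)))
cv_u_covariate = np.sqrt(between) / p_bar

# Category-level CV_u: each category's contribution to the between component.
cv_u_category = {
    int(c): np.sqrt((counts[k] / n) * (p[z == c].mean() - p_bar) ** 2) / p_bar
    for k, c in enumerate(cats)
}

print(f"overall CV    : {cv_overall:.4f}")
print(f"covariate CV_u: {cv_u_covariate:.4f}")
```

Because the between-category variance is at most the total propensity variance, the covariate CV u is bounded by the overall CV, consistent with the bounds noted above.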
Concerning informing dataset improvements, CV u s and CV c s also measure deviations with regard to covariates from, respectively, MCAR and MAR given the other auxiliary covariates (Schouten et al., 2012). With statistical inference possible, they hence identify targets for collection method modifications: under-represented categories with significant CV u s and CV c s (i.e. independent impacts), although categories with impacts also correlated with other covariates, as indicated by significant CV u s but non-significant CV c s, should be considered as well. We show that for similar reasons CVs can help select auxiliary covariates to use in post-collection bias adjustments. Such adjustments generally assume response is MAR given a set of auxiliary covariates. To select the auxiliary covariate sets, Särndal and Lundström (2010) use the coefficient of variation of the weights (larger is better). This is equivalent to the overall CV when the weights and response propensities are similarly estimated (e.g. by logistic regression) or the sample size is large, so significant CV c s identify covariates with independent impacts. Recognising this functionality also increases the utility of CVs when assessing dataset quality.
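The near-equivalence between the coefficient of variation of the weights and the overall CV can be checked numerically. The sketch below uses inverse-propensity weights from simulated propensities as a simplifying stand-in for calibration weights; the distributional parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative propensities; weights taken as inverse propensities.
p = np.clip(rng.normal(0.6, 0.05, size=5000), 0.05, 0.95)
w = 1.0 / p

cv_p = p.std() / p.mean()   # overall CV of the propensities
cv_w = w.std() / w.mean()   # CV of the weights (Särndal-Lundström-style)

print(f"CV of propensities: {cv_p:.4f}")
print(f"CV of weights     : {cv_w:.4f}")
```

When propensity variation is modest, a first-order (delta method) argument implies the two coefficients of variation agree closely, which the simulation reflects.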
Our empirical work demonstrates the accuracy of CV-based inference during data collection. We quantify LFS dataset representativeness by computing the overall CVs and auxiliary covariate CV u s and CV c s after each attempt to interview non-respondents (the call record). We also identify phase capacity (PC) points after which further quality increases are limited and methods should be modified or data collection ended (e.g. Groves & Heeringa, 2006). We consider stability of the CVs compared to previous call values (of use during collection to inform current sampling), and best values over the call record (of use after collection to inform future efforts). We use both numeric methods (do the CVs fall within a threshold of relevant values), and novel inferential methods (are the CVs non-significant or do the 95% CIs overlap) that we describe in Section 2.1.4. Then, we evaluate CV-based inference about the survey variables with auxiliary covariate analogues measuring the same quantities. First, we compare auxiliary covariate partial CVs to logistic-regression-based estimates of covariate category standardised 'non-response biases'. Second, we assess the survey variable-auxiliary covariate similarity by comparing the survey respondent category proportions given each data source. Pertinent to the performance of the CVs (we discuss other findings below), inference matches that from estimates of bias. The two-category auxiliary covariate CVs and estimated biases (and the PC points) correspond. The multi-category auxiliary covariate category CVs are smaller than the estimated biases (the PC points are mostly similar), and the covariate and overall CVs are larger. Moreover, the differences in the category proportions for the survey respondents between the data sources are slight, implying generalisability of inferences.
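The numeric 'during' and 'after' rules described above amount to simple threshold checks over the call record: stability relative to the previous call, and closeness to the best value observed. A minimal sketch follows; the CV series and tolerance are hypothetical, not the thresholds used in our analyses.

```python
# Hypothetical overall CVs computed after each call; values illustrative.
cvs = [0.310, 0.255, 0.228, 0.219, 0.217, 0.216, 0.216]

def pc_point_during(series, tol=0.005):
    """'During' rule: first call whose CV is within tol of the previous call's value."""
    for t in range(1, len(series)):
        if abs(series[t] - series[t - 1]) < tol:
            return t + 1  # calls are 1-indexed
    return None

def pc_point_after(series, tol=0.005):
    """'After' rule: first call whose CV is within tol of the best (smallest) value."""
    best = min(series)
    for t, v in enumerate(series):
        if abs(v - best) < tol:
            return t + 1
    return None

print(pc_point_during(cvs), pc_point_after(cvs))  # here: 5 4
```

The 'during' rule can be applied as collection proceeds, whereas the 'after' rule requires the full call record, matching the distinction drawn above between informing current and future data collection.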
These findings indicate CVs are of utility as tools for monitoring survey data collection. Valid inference about the non-response biases of survey variables enables informed decisions about the methods to use to maximise final dataset quality. We hence recommend them to practitioners; Table 4 provides guidance on their use when monitoring data collection in empirical scenarios, in the form of a set of steps to include in the analyses (see Schouten et al., 2012 for similar advice on assessing final dataset quality). Depending on the aims of monitoring, not all steps will be relevant. We note though that such aims are likely to depend on analysis findings: for example, if no PC point exists, practitioners may not have the resources to modify collection methods. We also note that if the aim is to modify methods to improve the dataset, the CVs can be computed again after implementing modifications to quantify their impact.
We do though make several comments about our evaluations and their implications. First, one limitation is that we do not evaluate CV-based inferences about survey variables without auxiliary covariate analogues. Often, but not always (e.g. the LFS is used to estimate UK employment rates; see ONS, 2014), these are the main focus of a survey. We will undertake these evaluations in future research. Second, it should be noted that auxiliary covariate analogue partial CVs will perform best in predicting biases in survey variables when the data sources do measure the same quantities. Dissimilarities may occur due to non-contemporary sources, or if the information requested or reported differs: the latter, caused by the LFS interviewers eliciting more accurate responses than the self-reported census, explains the slight differences found in our work between survey and census 'Ethnicity' Asian and No 'Qualifications' survey respondent category proportions (see Moore et al., 2018c). Hence, where possible, survey variable-auxiliary covariate analogue similarity should be assessed before using CVs for this purpose.
Regarding our other findings, we study the LFS individual dataset, extending work on the household dataset (Moore et al., 2018a) to the sample unit. The overall CVs imply dataset non-representativeness decreases at a declining rate over calls, and is substantial when collection ends. The partial CVs suggest inequalities (biases) associated with six of the eight auxiliary covariates, with a range of under-represented categories (see Section 3.1 for details and causes). Some impacts decline, others do not, and they are often independent. Regarding improving datasets, such categories are targets for data collection method modification (similarly, the covariates should be used in post-collection adjustments). The identified PC points indicate when modifications should take place. The overall CV points are at calls four to five. The partial CV points, of use if separate targeting is possible, vary depending on category from calls one to six: similar variation in when estimates stabilise is found in other studies monitoring survey data collection (e.g. Peytchev et al., 2009). The 'during' and 'after' rule points exhibit some differences, as do the points identified by the numeric and inferential methods. The latter have received little attention in the context of CVs: as found in research using other estimators (Lewis, 2017), our work, utilising simple 95% CI based tests, suggests that selecting significance levels suitable over all respondent dataset sizes is an issue with their use in empirical scenarios (see also Section 2.1.4).
We lack information on alternative collection methods, so cannot advise further on improvements to the LFS dataset. What is useful though is to utilise the overall CV points to identify when to end current data collection, so resources can be otherwise invested to improve quality or make cost savings. The identified points are slightly earlier than the LFS household dataset points (see Moore et al., 2018a), in part due to us excluding subjects aged over 64 (who do not answer some survey items) from analyses, and represent reductions in calls made of 12%-19%. Substantial savings are likely from such reductions. Similar CV-based results are also found for other European and UK surveys (Correa et al., 2016; Lundquist & Särndal, 2013; Moore et al., 2018a). Hence, we end by recommending that more attention is paid to whether the number of calls currently made to social survey non-respondents is needed to maintain dataset quality.