Understanding women's wage growth using indirect inference with importance sampling

Summary: The goal of this work is to investigate the effects of time out of the labor market for childcare on women's lifecycle wage growth. We develop a dynamic lifecycle model of human capital, fertility, and labor supply for women, estimate it by indirect inference using importance sampling, and formalize the use of this procedure. The results indicate a modest effect of fertility-induced non-employment spells on human capital accumulation. Human capital among prime-age women would be approximately 2.4% higher at its peak if the relationship between fertility and working were eliminated, and 4.7% higher if the relationship between marriage and fertility were also eliminated.


| INTRODUCTION
The main goal of this work is to understand the relationship between female human capital accumulation and fertility. It is well known that women have a less steep wage profile than men. Presumably some of this difference is due to the fact that women take time out of the labor market during pregnancy and for childcare. We quantify the importance of this effect on human capital accumulation. Specifically, we estimate a Markov model and then simulate the difference in wage growth under the counterfactual that women no longer take time out of the labor market for children and marriage. We find that at its peak, human capital would be 2.4% higher if the fertility effect were eliminated and 4.7% higher if the marriage effect were also eliminated. Although these effects are substantial, they are small compared to the raw difference in wages between men and women. This leaves plenty of scope for other channels such as discrimination in the form of glass ceilings.
A second goal of the study is to formalize the use of importance sampling to estimate indirect inference models. We develop a general version of indirect inference and an estimator using importance sampling. We show that this method produces consistent estimates and derive the standard errors for this promising new technique.
The basic empirical motivation can be seen in Figure 1. We run a regression of log wages on dummy variables for years of potential experience and individual fixed effects for White men and White women using the Survey of Income and Program Participation (SIPP). The predicted profiles are plotted, normalizing log wages at entry to 0. Two things can be seen from the figure. First, as has been previously established (see, for example, Gladden and Taber, 2000, among a large literature), wages increase more quickly for men than for women at the beginning of the lifecycle; this difference is smaller than what Gladden and Taber (2000) find for the NLSY, though the samples are directly comparable and the SIPP covers a later period. Second, although wages diverge in the middle, they eventually converge towards the end of the lifecycle. One possible explanation for this pattern is fertility: when women have children they often leave the labor market and then re-enter as their children age. This could cause wage growth to slow during child rearing and then pick up again after re-entry.
The figure raises a fundamental question in labor economics: What leads to curvature in wage growth? Relatedly, why does wage growth slow more quickly for men than for women? If the curvature is driven by actual experience then one might expect this gender pattern. When women re-enter the labor force they have less actual experience than men and thus their wages will grow faster. This is consistent with the wage growth of women with potential experience over 20 years being faster than the wage growth of men. In contrast, if it is potential experience or age that is driving the curvature, then women who re-enter would not see faster wage growth. Our empirical specification below allows for both possibilities and measures their quantitative importance.
The model is estimated using indirect inference, which is an increasingly common way to estimate complex econometric models. Similar to simulated method of moments, it is a computationally simple technique because it relies on unconditional simulations of the model to obtain structural estimates. However, one of the main practical problems with indirect inference is the computational difficulty of optimizing the objective function when the structural model contains discrete choices (see, e.g., An and Liu, 2000; Magnac et al., 1995; Nagypál, 2007). In this case, a step function arises because a small change in structural parameters causes a jump in the metric of distance between the two sets of auxiliary model parameter estimates. A non-smooth objective function precludes the use of gradient-based numerical optimization methods, which can often lead to much faster convergence.
In this paper, we explain how the problem of non-smoothness can be solved using Monte Carlo importance sampling (see, e.g., Kloek and van Dijk, 1978) in a general class of indirect inference models. We smooth the objective function by making use of importance sampling weights in estimation of the auxiliary model on simulated data. The denominator of the weight is the likelihood contribution of each observation in the simulated sample, at an initial trial vector of structural parameters. The denominator remains fixed during minimum distance iterations. The numerator of the weight is the likelihood contribution at the updated trial vector of parameters. The importance sampling weights can be formed with either the exact likelihood of the structural model or a simulated likelihood in case the former is difficult to construct. We show that this alternative technique which is explained in the context of simulated method of moments by Gourieroux and Monfort (1996) and Ackerberg (2009) can be extended to indirect inference to yield structural parameter estimates that are consistent. Although this extension is straightforward, in our view it is a very useful approach which should be more widely adopted.
The rest of this paper is organized as follows. In the next section, we briefly discuss the two relevant literatures. In Section 3, the Markov lifecycle model is presented. In Section 4, we show formally how to incorporate importance sampling into indirect inference. Section 5 presents the data. Section 6 discusses how to implement the procedure in the current context. Section 7 presents the empirical results. Section 8 concludes.

| Female wage growth
There is a large literature on male-female wage differentials. This study differs from the vast majority of previous work in this area because we focus on wage growth rather than wage levels. Hill (1979) was one of the first to examine the effect of motherhood on wage levels. She initially finds a 7% motherhood wage penalty for White women, but after controlling for productivity characteristics it nearly disappears. She concludes that "the number of children is a good proxy variable for differential work history and labor force attachment for White women" (p. 591). We use this idea for identification in our model. Becker (1985) suggests that a part of the wage gap observed between single and married mothers arises from the choice by married mothers to work in less intensive and more convenient jobs (p. S54). Korenman and Neumark (1992) find no significant effect on wages of having a first child, but large effects from the second child (between a 10% and 20% penalty). Waldfogel's (1998a, 1998b) findings suggest a motherhood wage penalty of 4.6% for the first child and 12.6% for two or more children. She also finds that women who have access to family leave upon childbirth are more likely to return to their pre-childbirth employer and, consequently, receive a wage boost that partially offsets the motherhood wage penalty (75% of the wage penalty is eliminated). Anderson et al. (2002) find no evidence (in a panel framework) that reduced work effort is at the root of the wage gap. They estimate the wage gap to be 3% for mothers with one child and 6% for mothers of two or more children. They posit that the wage gap is largely caused by high costs of flexible work schedules for women holding medium office jobs with standard work hours. Adda et al. (2017) formulate and estimate a dynamic structural model of female labor supply, marriage, and fertility choices and use it to decompose the career costs of children into several different components.
Using data from Germany, they find that roughly three quarters of the 35% reduction in lifetime income derives from foregone earnings while out of the labor force. The remainder is due to lower wages while working, less work experience, and depreciation of skills. In addition, Adda et al. (2017) find that skill depreciation rates are higher in mid-career and differ across occupations. Loughran and Zissimopoulos (2007) concentrate on the effect of marriage and fertility on the wage growth of men and women. Fixed-effects regressions show that not only does marriage reduce female wage levels, but it also reduces female wage growth by four percentage points. A first birth lowers female wages by between two and three percentage points but does not affect wage growth for males or females. Daniel et al. (2013) estimate fixed-effects regressions on Spanish data to explore the effects of childbirth on female wages. The results indicate that, compared to childless women, "mothers to be" experience earnings increases of up to six percentage points prior to a first birth. The earnings advantage is then wiped out, and it takes another 9 years on average for a mother's earnings to return to pre-birth levels relative to childless women. Roughly half of the earnings loss upon becoming a mother is due to less accumulated work experience, as mothers switch to part-time work or take a leave of absence. Weiss and Gronau (1981) provide a human capital model showing why wage growth might be lower for women. Polachek (1981) presents a model and evidence that women choose occupations with lower depreciation of human capital. Like us, Light and Ureta (1995) use a more complicated model of experience, taking advantage of the long work histories in the NLSY79. Baum (2002) looks directly at the effect of work interruptions on wages for women. Wilde et al. (2010) emphasize differences between low- and high-skilled workers in the impact of childbearing.

| Indirect inference and importance sampling
Indirect inference has become a very important tool for estimation of complex econometric models. Key papers are Smith (1990, 1993) and Gourieroux et al. (1993). The econometrics is discussed in detail in Gourieroux and Monfort (1996). The basic idea is to estimate auxiliary parameters from the data and match them to the model, as we discuss in detail in the next section. The main issue that we address in this paper is one in which the mapping between the underlying parameters and simulated auxiliary parameters is not smooth, which complicates estimation and inference. We show how to use importance sampling weights to smooth the objective function. An alternative approach to importance sampling and smoothing is the method of generalized indirect inference (GII) proposed by Bruins et al. (2018). As it will be easier to discuss that paper after introducing our notation, we defer a detailed discussion of GII until Section 4.3.
Importance sampling has a long history. The use of importance sampling with Monte Carlo to simulate expectations goes back at least to Kloek and van Dijk (1978). It has been used for smoothing objective functions in simulated maximum likelihood (see Keane & Sauer, 2010, for discussion). It was discussed by McFadden (1989) as a way to smooth his simulated method of moments approach. A particularly relevant paper is Ackerberg (2009), which we discuss in Section 4.3 after introducing our notation. Our main methodological contribution is to extend the importance sampling methodology to a general class of indirect inference models as well as providing some particular examples. Lee (2012), Han (2016), and Fu and Gregory (2019) have applied this approach based on earlier versions of this paper.

| THE MARKOV MODEL FOR WOMEN'S WORK, MARRIAGE, AND FERTILITY PATTERNS OVER THE LIFECYCLE
The model is a continuous time Markov model in which women transition between several states. Individuals can move into and out of work and into and out of marriage. They also potentially give birth to children, which influences other variables. Human capital increases while individuals work and falls when they do not. The state variables are

$S_{it} = \left(t, L_{it}, M_{it}, H_{it}, K_{it}, A_{1it}, \ldots, A_{K_{it}it}, E_i, \nu_i\right),$  (1)

where $t$ is time since labor market entry (i.e., potential experience), $L_{it}$ is a dummy variable for having a job, $M_{it}$ is a dummy variable for being currently married, $H_{it}$ is human capital, $K_{it}$ is the number of children, and $A_{jit}$ is the age of each child. The last two variables in (1) do not change over time. The first is education $E_i$, which is observed in the data, and the second is unobserved heterogeneity $\nu_i$. The latter is a two-dimensional normal random variable, with one dimension loosely anchored to wages and the other to labor supply. Our model starts at the point a woman finishes school, and we assume she is unmarried with no children. She may have a job, which is determined by a logistic function of $(E_i, \nu_i)$. The transitions are governed by five different hazard rates: the hazard rate for job arrival among the non-employed, $\lambda_J(S_{it})$; the hazard rate for job destruction (leading to non-employment), $\lambda_N(S_{it})$; the rate of marriage formation, $\lambda_M(S_{it})$; the rate of divorce, $\lambda_D(S_{it})$; and finally the rate of births, $\lambda_K(S_{it})$. With some probability, women drop out of the labor market precisely at the time of having children, which is specified as a logistic function of $(E_i, \nu_i)$. The ages of both the woman and her children increase with time. Human capital evolves deterministically as a function of the state variables, as described below.
All five hazard rates take the basic form

$\lambda_R(S_{it}) = \exp\!\left(X_{R,it}(S_{it})'\gamma_R\right)$ for $R \in \{J, N, M, D, K\},$  (2)

where $X_{R,it}$ is a vector of covariates that are functions of the underlying state variables (observable and unobservable). Not all hazards depend on all of the state variables; rather, there is a different subset specified for each outcome. The exact specifications are shown in Table 3a. We discretize the continuous state variables in (2).
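To make the transition mechanics concrete, here is a minimal Python sketch of one step of a continuous-time competing-hazards model: draw an exponential waiting time for each active hazard and keep the earliest. The event labels and rates are invented for illustration, and the hazards are treated as constant between changes in the state vector.

```python
import random

def next_event(hazards, rng=random):
    """Simulate the next transition in a continuous-time Markov model with
    competing constant hazards (constant between changes in the state vector).

    `hazards` maps an event label (e.g. 'J', 'N', 'M', 'D', 'K') to its rate
    lambda_R(S); at least one rate must be positive. Returns (waiting_time,
    event): the minimum of independent exponential waiting times and the label
    of the event that fired first.
    """
    draws = {e: rng.expovariate(lam) for e, lam in hazards.items() if lam > 0}
    event = min(draws, key=draws.get)
    return draws[event], event
```

Because the minimum of independent exponentials is itself exponential with rate equal to the sum of the rates, and each event fires first with probability proportional to its own rate, simulated waiting times and event frequencies can be checked directly against the hazards.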
The human capital accumulation function allows for curvature of the profile either through age $t$ or through human capital $H_{it}$, using a log-linear functional form. Specifically, for workers, human capital accumulates according to

$\dot{H}_{it} = a(S_{it})\, e^{-\mu_i t}\, \left(\overline{H}_i - H_{it}\right),$  (3)

where $\overline{H}_i$ is the maximum level of human capital (and $\dot{H} \equiv \partial H / \partial t$). $\overline{H}_i$ is allowed to vary across people depending on their education. In this specification, as $H_{it}$ approaches $\overline{H}_i$, human capital accumulation slows. The other force that allows human capital accumulation to slow down is the potential experience term $e^{-\mu_i t}$ (where $\mu_i$ varies with education). As discussed above, the distinction between the two is very important for mothers who take time out of the labor force. With a long spell out of the labor force to care for children, at the time of re-entry these mothers will have relatively high $t$ but relatively low $H_{it}$. If the first effect is important, mothers should see large wage growth upon re-entering, but if the second is, they will not. This intuition is key for distinguishing between these explanations in the data. The key auxiliary parameter is the coefficient on women with children over the age of 18 in the wage growth regression. We put high weight on this parameter to make sure the model fits it very well. We also parameterize $\log(a(S_{it}))$ to be linear in the state variables. Note that, unlike the classic Mincer model, our specification does not allow human capital to fall for older women when they work. This is broadly consistent with the data (see Figure 1).
When women do not work, their human capital depreciates according to the formula

$\dot{H}_{it} = -\delta H_{it},$  (4)

where $\delta$ is a single parameter. Finally, wages depend on human capital as well as some of the other state variables:

$\log(W_{it}) = \log(H_{it}) + X_{it}(S_{it})'\gamma + \varepsilon_{it}.$  (5)

Because $H_{it}$ is an element of $S_{it}$, the notation is general enough that we could have incorporated it into $X_{it}(S_{it})'\gamma$. It is shown explicitly in (5) only to clarify that its scale is determined by the wage equation, because it is restricted to have a coefficient equal to 1. We also assume that $\varepsilon_{it}$ is i.i.d. normally distributed with mean 0 and variance $\sigma^2_\varepsilon$.
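As an illustration of how accumulation and depreciation interact, the sketch below Euler-integrates a career path with a non-employment spell in the middle. All parameter values, and the treatment of $a(\cdot)$ as a constant, are invented for illustration; the point is only the qualitative pattern of slow-down near the ceiling, depreciation during the break, and renewed growth after re-entry.

```python
import math

def simulate_human_capital(work, H0=1.0, Hbar=3.0, a=0.4, mu=0.05, delta=0.1, dt=0.1):
    """Euler-integrate a stylized human capital law of motion.

    While working:     dH/dt = a * exp(-mu * t) * (Hbar - H)
                       (accumulation slows both as H approaches the ceiling Hbar
                       and as potential experience t grows)
    While not working: dH/dt = -delta * H   (proportional depreciation)

    `work` is a sequence of booleans, one per period of unit length.
    Returns the path of H at the end of each period (plus the initial value).
    """
    H, t, path = H0, 0.0, [H0]
    for period in work:
        for _ in range(int(1 / dt)):
            dH = a * math.exp(-mu * t) * (Hbar - H) if period else -delta * H
            H += dH * dt
            t += dt
        path.append(H)
    return path
```

A career of five working periods, a three-period break, and five more working periods produces the hump-dip-recovery shape discussed in the text: high $t$ at re-entry dampens growth somewhat, but the gap to the ceiling left by depreciation pushes wages up again.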

| Basic setup for indirect inference
Our framework is very similar to Chapter 4 in Gourieroux and Monfort (1996), but we focus on a narrower (though still large) set of problems for which the importance sampling approach is natural. We also focus on the cross section/panel data versions rather than the time series version. We explicitly derive the asymptotic properties using importance weights. The basic properties are quite similar. We assume that the econometrician observes $(Y_i, X_i)$ for $i = 1, \ldots, N$. The observations are i.i.d., and both $X_i$ and $Y_i$ are potentially of large dimension ($K_x$ and $K_y$). $X_i$ is exogenous in the sense that it is determined outside the model and is i.i.d., coming from the underlying distribution $\Xi_0$.
The data generating process is

$Y_i = y(X_i, u_i; \theta),$  (6)

where $u_i$ is an i.i.d. vector error term with distribution

$u_i \sim \Psi(\cdot\,; \theta).$  (7)

Both $\Psi$ and $y$ are known up to the parameter $\theta \in \Theta \subseteq \mathbb{R}^{K_\theta}$, where the true value is $\theta_0$. This notation is general enough to represent a complicated system with lagged dependent variables and/or equilibrium conditions, but we assume it can be written in reduced form $y$. To apply indirect inference, assume the auxiliary model is

$\hat{\beta} = \arg\min_\beta F\!\left(\frac{1}{N}\sum_{i=1}^N g(X_i, Y_i, \beta),\, \beta\right).$  (8)

We define the population value of $\hat{\beta}$ to be $\beta_0 \equiv \arg\min_\beta F\!\left(E[g(X_i, Y_i, \beta)],\, \beta\right)$. These functions are general enough to incorporate many estimators.
Simulated method of moments is a special case in which the auxiliary parameters are moments: $\hat{\beta} = \frac{1}{N}\sum_{i=1}^N g(X_i, Y_i)$. It can also capture much richer auxiliary parameters. For example, it would be maximum likelihood when $g$ is the negative of the log-likelihood function and $F$ is the identity function, that is, $F(x) = x$. It can also incorporate a generalized method of moments type estimator in which $g$ gives the moments, such that $E[g(X_i, Y_i, \hat{\beta})] = 0$, and $F$ is the function $F(g, \beta) = g'Wg$, where $W$ is some weighting matrix. This can incorporate OLS, IV, difference-in-differences, and regression discontinuity. We could also use it to represent a quantile or a quantile regression.
To see the idea of indirect inference in this context, let $\Xi$ be a potential distribution of $X_i$; then the data generation process is known up to $(\Xi, \theta)$. Define the population function

$G(\theta, \beta) \equiv E\!\left[g\big(X_i, y(X_i, u_i; \theta), \beta\big)\right],$

and what Gourieroux, Monfort, and Renault (1993) refer to as the binding function is, in our case,

$B(\theta) \equiv \arg\min_\beta F\!\left(G(\theta, \beta), \beta\right).$

Note that $B(\theta_0) = \beta_0$. Identification and estimation of $\Xi_0$ is straightforward because $X_i$ is observable, so we mostly abstract from it. Essentially, what one needs for point identification of $\theta$ is that $B(\theta)$ is invertible, so that knowledge of $\beta_0$ is sufficient for knowledge of $\theta$. In that case, because the model is known up to the parameter $\theta$, the function $B(\theta)$ is identified. Thus, we could just invert $B(\theta_0)$ to identify $\theta_0$. In practice we typically do not have a closed form for $B$. Instead, we use simulation estimators to approximate $B(\theta)$. A typical approach is to generate $H$ different simulated samples, each of size $S$. For each observation we draw $u_{hs}(\theta)$ randomly from the distribution $\Psi$ and obtain $X_{hs}$ from the empirical distribution of $X_i$. The simulated auxiliary estimate is

$\hat{B}_B(\theta) \equiv \arg\min_\beta F\!\left(\frac{1}{HS}\sum_{h=1}^H \sum_{s=1}^S g\big(X_{hs}, y(X_{hs}, u_{hs}(\theta); \theta), \beta\big),\, \beta\right),$

and the structural estimator is

$\hat{\theta}_B \equiv \arg\min_\theta \big(\hat{\beta} - \hat{B}_B(\theta)\big)'\, \Omega\, \big(\hat{\beta} - \hat{B}_B(\theta)\big),$

where $\Omega$ is a weighting matrix. We refer to this as the base approach, which is why we use the subscript $B$.
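The base approach can be sketched in a few lines for a toy linear model where the binding function happens to be the identity: draw the simulation inputs once up front, recompute the auxiliary (OLS slope) estimate at each trial $\theta$, and minimize the quadratic distance. The DGP, auxiliary model, and grid-search minimizer below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Observed" data from the true DGP y = x*theta0 + u (theta0 unknown to the estimator).
theta0, N = 1.5, 5000
x = rng.normal(size=N)
y = theta0 * x + rng.normal(size=N)

def ols_slope(x, y):
    """Auxiliary model: OLS slope of y on x (g is the squared residual, F a quadratic)."""
    return float(np.dot(x, y) / np.dot(x, x))

beta_hat = ols_slope(x, y)

# Base indirect inference: draw X_hs from the empirical distribution of X_i and
# draw u_hs once, holding both fixed across all trial values of theta (H = 1 here).
S = 20000
x_s = rng.choice(x, size=S)
u_s = rng.normal(size=S)

def B_B(theta):
    """Auxiliary estimate on data simulated at theta (the simulated binding function)."""
    return ols_slope(x_s, theta * x_s + u_s)

# Minimize the distance (beta_hat - B_B(theta))^2; grid search for transparency.
grid = np.linspace(0.0, 3.0, 3001)
theta_hat = grid[np.argmin([(beta_hat - B_B(t)) ** 2 for t in grid])]
```

Here the outcome is continuous, so the objective is already smooth in $\theta$; the discontinuity problem discussed next arises when components of the simulated outcome are discrete.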

| Using importance sampling weights
A major problem with the base approach is that often some components of the dependent variable vector $Y_{hs}(\theta)$ are discrete, so that a small change in the parameters leads to jumps in $y(X_{hs}, u_{hs}; \theta)$. This causes the objective function to be discontinuous. In principle, with enough simulations we could make this as smooth as we would like, but in practice this can make minimizing the objective function very time consuming. For example, we attempted to estimate our model with 1,580,000 simulations. This was still not sufficiently smooth to use gradient methods or obtain reliable standard errors. Using importance sampling weights smooths the objective function and worked very well for us in practice. This problem has been found in other indirect inference cases as well. Before describing the method, we introduce notation for a key component of the analysis: $\Upsilon_{hs}$. This is essentially a superset of the data, including additional components such as unobserved heterogeneity and state variables. The empirical economist typically does not get to observe its sample analog, but because it can be simulated, $\Upsilon_{hs}$ is something on which the empiricist can condition. In a typical problem there are multiple ways to choose $\Upsilon_{hs}$, and finding the best one will be computationally very important. This is essentially what Ackerberg (2009) discusses as a "change in variables." The data generating process for $Y_i$ ($y(\cdot)$ in Equation (6)) is augmented to include the intermediate variable $\Upsilon_i$: $\Upsilon_i$ is generated from $(X_i, u_i)$, and then $Y_i = y_\Upsilon(\Upsilon_i, X_i; \theta)$, where both functions are known up to the parameter $\theta$. Although the distinction between $\Upsilon_i$ and $Y_i$ may seem arbitrary at this point, its usefulness will become clear in the examples below. Let $\ell(\cdot\,; X_i, \theta)$ be the likelihood function for $\Upsilon_i$. The key for this to work well is that $\ell$ and $y_\Upsilon$ should be differentiable in $\theta$ and that $\ell$ should be easy to compute. Our importance weighting estimator is the following: 1. Obtain the values of $X_{hs}$ from the empirical distribution of $X_i$.
2. Generate $\Upsilon_{hs}$ ex ante, without regard to $\theta$, using some data generating function leading to likelihood $\ell_0(\Upsilon_{hs}; X_{hs})$. This typically involves simulating at a pre-chosen parameter $\theta^*$, which results in $\ell_0(\Upsilon_{hs}; X_{hs}) = \ell(\Upsilon_{hs}; X_{hs}, \theta^*)$, but this is not necessary.
3. Fixing the values of $(\Upsilon_{hs}, X_{hs})$, we use

$\tilde{B}_I(\theta) \equiv \arg\min_\beta F\!\left(\frac{1}{HS}\sum_{h=1}^H \sum_{s=1}^S \frac{\ell(\Upsilon_{hs}; X_{hs}, \theta)}{\ell_0(\Upsilon_{hs}; X_{hs})}\, g\big(X_{hs}, y_\Upsilon(\Upsilon_{hs}, X_{hs}; \theta), \beta\big),\, \beta\right).$

To understand the basic intuition of the approach, suppose that $\Upsilon_{hs}$ has a continuous distribution and ignore the $X$'s. Let $E_s$ denote the expected value from the simulation. Because in the simulation $\Upsilon_{hs}$ was drawn from the density $\ell_0$,

$E_s\!\left[\frac{\ell(\Upsilon_{hs}; \theta)}{\ell_0(\Upsilon_{hs})}\, g\big(y_\Upsilon(\Upsilon_{hs}; \theta), \beta\big)\right] = \int \frac{\ell(\Upsilon; \theta)}{\ell_0(\Upsilon)}\, g\big(y_\Upsilon(\Upsilon; \theta), \beta\big)\, \ell_0(\Upsilon)\, d\Upsilon = \int g\big(y_\Upsilon(\Upsilon; \theta), \beta\big)\, \ell(\Upsilon; \theta)\, d\Upsilon = G(\theta, \beta).$

We approximate this integral using a Monte Carlo procedure in which $\Upsilon_{hs}$ is drawn from the distribution $\ell_0(\Upsilon_{hs})$, that is,

$\frac{1}{S}\sum_{s=1}^S \frac{\ell(\Upsilon_{hs}; \theta)}{\ell_0(\Upsilon_{hs})}\, g\big(y_\Upsilon(\Upsilon_{hs}; \theta), \beta\big) \;\xrightarrow{p}\; G(\theta, \beta).$

This gives us a consistent estimate of $G(\theta, \beta)$ as $S$ gets large. Critically, as long as $\ell(\Upsilon_{hs}; \theta)$ and $y_\Upsilon(\Upsilon_{hs}; \theta)$ are smooth functions of $\theta$, this approximation is a smooth function of $\theta$.
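The importance-sampling identity above can be checked numerically. In the toy sketch below, $\Upsilon$ is a Bernoulli draw generated once at a trial value $p^*$; re-weighting by the likelihood ratio recovers the mean under any other $p$, and the weight is smooth in $p$ even though the draws themselves are discrete. All values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Draw Upsilon once from the density ell_0 (Bernoulli with success probability
# p_star, playing the role of the trial parameter theta*), then never redraw.
p_star, S = 0.5, 200000
ups = rng.random(S) < p_star   # Upsilon_hs ~ ell_0, fixed for the rest of estimation

def weighted_mean(p):
    """E_s[ (ell(Upsilon; p) / ell_0(Upsilon)) * Upsilon ]: the importance-weighted
    simulated moment.  The weight p**Y (1-p)**(1-Y) / (p_star**Y (1-p_star)**(1-Y))
    is differentiable in p, so the simulated moment is smooth in p even though
    each Upsilon_hs is a 0/1 outcome that never changes during estimation."""
    w = np.where(ups, p / p_star, (1 - p) / (1 - p_star))
    return float(np.mean(w * ups))
```

Evaluating `weighted_mean` at values of `p` away from `p_star` reproduces the target moment without simulating any new draws, which is exactly what lets a gradient-based optimizer move through the parameter space.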
First, note that standard indirect inference is a special case. To avoid jumps in the objective function, researchers typically draw the random variables that determine outcomes first and then fix these values throughout the estimation process. For example, if the distribution of an underlying random variable $u_{hs}$ does not depend on $\theta$, one would draw the $u_{hs}$ once, at the beginning of the estimation procedure, and we would choose $\Upsilon_{hs} = u_{hs}$. In this case, $\ell(\Upsilon_{hs}; X_{hs}, \theta) = \ell_0(\Upsilon_{hs}; X_{hs})$, so the ratio of the likelihoods would just be 1, and this would be the standard estimator. When $u_{hs}$ does depend on parameters, typically one would draw underlying random variables that do not depend on $\theta$ and write $u_{hs}$ as a parametric function of those underlying variables. We discuss this point below.
The key improvement of this approach relative to the base model is that if we choose $\Upsilon_{hs}$ in the appropriate way, $y_\Upsilon(\Upsilon_{hs}, X_{hs}; \theta)$ and thus $\tilde{B}_I(\theta)$ will be continuous and differentiable functions of $\theta$. This makes both estimation and formation of standard errors much easier. To keep our results general enough to cover the base case, in our formal results we do not impose that $y_\Upsilon(\Upsilon_{hs}, X_{hs}; \theta)$ is continuous. In our supplementary Appendix A in Sauer and Taber (2021) we show consistency and asymptotic normality and derive the asymptotic variance.

| Relationship with other work

Gourieroux and Monfort (1996) present a very general indirect inference framework in Chapter 4. They consider a broader definition of the auxiliary estimator than Equation (8), in which the auxiliary estimator can depend on the full matrices $X$ and $Y$ of the exogenous and endogenous data rather than on observation-by-observation functions. They derive some general properties of the estimator. Our estimator is a special case (though it still includes a large set of auxiliary models) that makes clear where importance sampling enters. We derive the asymptotic properties for our case. Gourieroux and Monfort (1996) also discuss importance sampling for an array of estimators, but not explicitly for the indirect inference estimator.
The main difference between our estimator and Ackerberg (2009) is that we consider a more general estimator. The method of simulated moments is a special case of our model in which the auxiliary parameter is a moment of the data, that is, $\hat{\beta} = \frac{1}{N}\sum_{i=1}^N g(X_i, Y_i)$. The main point of Ackerberg (2009) is to emphasize an additional advantage of indirect inference (other than smoothness). Using our notation, evaluating $y(X_i, u_i; \theta)$ can be time consuming, as it might involve solving a dynamic programming problem or solving for an equilibrium. If one can find an appropriate change of variables, $\Upsilon_i$, the problem can be simplified. If we can write $y_\Upsilon(\Upsilon_i, X_i; \theta)$ in such a way that it does not depend on $\theta$, then one does not need to re-solve the model at each function evaluation. In our Markov model, calculating $y_\Upsilon$ is not difficult, so we have not taken advantage of this feature.
Generalized indirect inference (GII), proposed by Bruins et al. (2018), is an alternative method to smooth the objective function. We briefly, and loosely, introduce GII in our notation. They consider a case in which outcomes are discrete, so we can write the outcome for simulation $h$ and $s$ as $y(X_{hs}, u_{hs}; \theta) = (y_0(X_{hs}, u_{hs}; \theta), \ldots, y_T(X_{hs}, u_{hs}; \theta))$ with

$y_t(X_{hs}, u_{hs}; \theta) = \arg\max_{j \in \{0, \ldots, J-1\}} v_{jt}(X_{hs}, u_{hs}; \theta),$

where $j$ is a discrete option chosen from the set $\{0, \ldots, J-1\}$ and $v_j$ is a utility function. The discontinuity in the objective function comes from the jump in the value that maximizes it. They define a smooth function of latent utilities $g(\cdot\,; \lambda)$ such that

$\tilde{y}_t(X_{hs}, u_{hs}; \theta, \lambda) \equiv g\big(v_{0t}(X_{hs}, u_{hs}; \theta), \ldots, v_{J-1,t}(X_{hs}, u_{hs}; \theta);\, \lambda\big)$

converges to $y_t(X_{hs}, u_{hs}; \theta)$ as the smoothing parameter $\lambda$ goes to 0. GII substitutes $\tilde{y}_t(X_{hs}, u_{hs}; \theta, \lambda)$ for $y_t(X_{hs}, u_{hs}; \theta)$ in the computation of $\hat{B}_G(\theta)$ (which can then be used analogously to $\hat{B}_B(\theta)$ and $\tilde{B}_I(\theta)$). Thus, the objective function is smooth, and $\hat{B}_G(\theta)$ converges to $\beta_0$ as $\lambda$ goes to 0 and $N$ goes to infinity, with $H$ and $T$ fixed. The $g(\cdot)$ Bruins et al. (2018) use in their Monte Carlo experiments is the logistic kernel. The main difference between the two approaches is that importance weighting involves calculating and weighting by the likelihood function, whereas GII depends on the choice of the smoothing parameter.
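For a binary choice, logistic-kernel smoothing of the argmax can be sketched in a few lines; the function below is a generic illustration of the idea, not the exact implementation in Bruins et al. (2018).

```python
import math

def smoothed_choice(v0, v1, lam):
    """Logistic-kernel smoothing of a binary argmax, in the spirit of GII.

    The hard outcome is 1{v1 > v0}, which jumps as parameters move the latent
    utilities.  The smoothed version returns a value in (0, 1) that is
    differentiable in (v0, v1) and converges to the indicator as lam -> 0.
    """
    return 1.0 / (1.0 + math.exp(-(v1 - v0) / lam))
```

As `lam` shrinks, the smoothed outcome approaches the discrete choice; as it grows, the outcome flattens toward 1/2, which is the bias/smoothness trade-off governed by the smoothing parameter.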
GII is not suited to our Markov model for two reasons. The first is that the Markov model is in continuous time rather than discrete time. The second is that our model of human capital, which is crucial to the analysis, depends on the full labor market history up to that point. Jumps in previous labor supply lead to jumps in human capital, and we do not see a computationally feasible way to use GII to smooth human capital accumulation. Bruins et al. (2018) do discuss how to handle a lagged dependent variable within the context of GII. They employ the same smoothing function that applies to the contemporaneous choice in their Monte Carlo experiments and obtain good results. However, they do not discuss how one would go about smoothing a function of the history of lagged dependent variables, as with accumulated human capital. Altonji et al. (2013) use a version of GII that does use the history, but their approach will not work for a continuous state variable like human capital.

| Logit model
In order to demonstrate the method of indirect inference with importance sampling, we explain how to use it to estimate a logit model using a linear probability model as the auxiliary model. Of course, one would never need to estimate a simple logit model using this technique, but it provides a good illustration of the method in a simple case. The true model is

$\Pr(Y_i = 1 \mid X_i) = \Lambda(X_i'\theta),$

where $\Lambda$ denotes the logit c.d.f. The auxiliary model is a linear probability model. This can be put into our notation by taking $F$ to be the identity function and choosing

$g(X_i, Y_i, \beta) = (Y_i - X_i'\beta)^2.$

The simulated data are generated in the following way: 1. Choose $X_{hs}$ by drawing randomly from the empirical distribution of $X_i$. (As discussed above, sampling could be with or without replacement. Of course, if it is done without replacement and the simulation sample is larger than the original one, one would have to replenish it after running through the full sample.)
2. Choose some initial logit value $\theta^*$. 3. Simulate $Y_{hs}$ so that

$\Pr(Y_{hs} = 1 \mid X_{hs}) = \Lambda(X_{hs}'\theta^*).$

In this simple case choose $\Upsilon_{hs} = Y_{hs}$. For this model, note that

$\frac{\ell(\Upsilon_{hs}; X_{hs}, \theta)}{\ell_0(\Upsilon_{hs}; X_{hs})} = \frac{\Lambda(X_{hs}'\theta)^{Y_{hs}}\,\big(1 - \Lambda(X_{hs}'\theta)\big)^{1 - Y_{hs}}}{\Lambda(X_{hs}'\theta^*)^{Y_{hs}}\,\big(1 - \Lambda(X_{hs}'\theta^*)\big)^{1 - Y_{hs}}}.$

Clearly, the resulting auxiliary estimator is just $H$ weighted regressions with weights $\ell(\Upsilon_{hs}; X_{hs}, \theta) / \ell_0(\Upsilon_{hs}; X_{hs})$. Also, because the weight is differentiable in $\theta$, so is $\tilde{B}_I(\theta)$. To see why this gives a consistent estimate, note that

$\frac{1}{S}\sum_{s=1}^S \frac{\ell(\Upsilon_{hs}; X_{hs}, \theta)}{\ell_0(\Upsilon_{hs}; X_{hs})}\, X_{hs} X_{hs}' \;\xrightarrow[S \to \infty]{p}\; E\!\left[\frac{\ell(\Upsilon_{hs}; X_{hs}, \theta)}{\ell_0(\Upsilon_{hs}; X_{hs})}\, X_{hs} X_{hs}'\right],$

and at the true value $\theta = \theta_0$ the importance-weighted moments coincide with the corresponding population moments of the data. Thus, the simulator yields a consistent estimate as $S$ grows (i.e., $\mathrm{plim}\, \tilde{B}(\theta_0) = \beta_0$).
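Putting the steps together, the following Python sketch runs the full logit example: simulate once at $\theta^*$, hold $\Upsilon_{hs} = Y_{hs}$ fixed, and re-weight the linear probability regression by the likelihood ratio at each trial $\theta$. The slope-only specification, sample sizes, and grid-search minimizer are simplifications for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
logistic = lambda z: 1.0 / (1.0 + np.exp(-z))

# "Observed" data from the true logit model Pr(Y = 1 | X) = Lambda(X * theta0).
theta0, N = 1.0, 20000
x = rng.normal(size=N)
yobs = (rng.random(N) < logistic(theta0 * x)).astype(float)

def lpm_slope(x, y, w=None):
    """Auxiliary model: (weighted) linear probability model, slope only."""
    w = np.ones_like(x) if w is None else w
    return float(np.sum(w * x * y) / np.sum(w * x * x))

beta_hat = lpm_slope(x, yobs)

# Simulate ONCE at an initial trial value theta_star; Upsilon_hs = Y_hs stays fixed.
theta_star, S = 0.5, 100000
x_s = rng.choice(x, size=S)
y_s = (rng.random(S) < logistic(theta_star * x_s)).astype(float)
ell0 = np.where(y_s == 1, logistic(theta_star * x_s), 1 - logistic(theta_star * x_s))

def B_I(theta):
    """Importance-weighted LPM slope: smooth and differentiable in theta."""
    ell = np.where(y_s == 1, logistic(theta * x_s), 1 - logistic(theta * x_s))
    return lpm_slope(x_s, y_s, w=ell / ell0)

# Match the auxiliary estimates; grid search stands in for a gradient method.
grid = np.linspace(0.2, 2.0, 361)
theta_hat = grid[np.argmin([(beta_hat - B_I(t)) ** 2 for t in grid])]
```

At $\theta = \theta^*$ every weight equals one, so the weighted regression collapses to the unweighted one, exactly as the likelihood-ratio formula implies.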

| Discrete time Markov model and Monte Carlo results
In this second illustration of the technique, the model is a discrete time Markov model of $d_{it}$, which is binary (0 or 1). Everyone begins with $d_{i0} = 0$. The transition probabilities depend on $(X_i, d_{i,t-1})$ and on unobserved heterogeneity $u_i = (u_{i0}, u_{i1})$, whose distribution is $G(\cdot\,; \theta_3)$. We do not observe people throughout the lifecycle and define $F_i$ as the first period in which we observe data for individual $i$ and $L_i$ as the last period. Thus, for each individual we observe $(X_i, d_{iF_i}, \ldots, d_{iL_i})$. Note that because of the initial conditions problem and the unobserved heterogeneity, the likelihood of the data is computationally intensive when $F_i$ is large. To economize on notation, let $\varrho(d, d'; \theta, X_i, u_i)$ denote the probability of transitioning from state $d$ to state $d'$. The likelihood function of the data then requires integrating over both $u_i$ and the unobserved history $(d_{i1}, \ldots, d_{i,F_i - 1})$. Here we can simplify the likelihood used in the importance weights substantially by choosing $\Upsilon_{hs} = (X_{hs}, u_{hs}, d_{hs1}, \ldots, d_{hsL_{hs}})$ for simulation $hs$. Then we no longer need to integrate when computing the likelihood function:

$\varrho(\Upsilon_{hs}; X_{hs}, \theta) = \varrho(0, d_{hs1}; \theta, X_{hs}, u_{hs})\, \varrho(d_{hs1}, d_{hs2}; \theta, X_{hs}, u_{hs}) \times \cdots \times \varrho(d_{hs,L_{hs}-1}, d_{hsL_{hs}}; \theta, X_{hs}, u_{hs}).$  (33)

To get a sense of the performance difference between different methods, we used a simple version of this model as the basis of a Monte Carlo study. We specify a Markov model with no unobserved heterogeneity, $F_i = 2$, and $L_i = 3$. As an auxiliary model, we consider three regressions (linear probability models): $d_{i2}$ on $X_i$ (initial condition), and linear probability models of $d_{i3}$ on $X_i$ conditional on each of the two different values of $d_{i2}$. For each run, we consider the same auxiliary model but simulate it in three different ways: the base model, the GII procedure, and importance weighting (i.e., $\hat{B}_B(\theta)$, $\hat{B}_G(\theta)$, and $\tilde{B}_I(\theta)$). We also consider two different optimizers: a gradient-based method (L-BFGS) and Nelder-Mead, both from the Optim package in Julia.
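Conditioning on the simulated heterogeneity and the full simulated path, the likelihood used in the importance weights is just the product of one-step transition probabilities in (33). The Python sketch below illustrates this; the logit parameterization of the transition probability is hypothetical, chosen only so the example runs.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def transition_prob(d_next, d_prev, x_beta, u, theta):
    """rho(d_prev, d_next; theta, x, u): one-step transition probability.

    Hypothetical parameterization for illustration only: Pr(d_next = 1) is a
    logit in x (scalar index x_beta here) with a slope theta[d_prev] and a
    heterogeneity component u[d_prev], both depending on the lagged state.
    """
    p1 = logistic(theta[d_prev] * x_beta + u[d_prev])
    return p1 if d_next == 1 else 1.0 - p1

def path_likelihood(d_path, x_beta, u, theta):
    """Likelihood of a full simulated path, conditional on the simulated u.

    Because the path and u are part of Upsilon_hs, this is a plain product of
    one-step transition probabilities -- no integration over unobserved
    heterogeneity or the unobserved early history is needed.
    """
    lik = transition_prob(d_path[0], 0, x_beta, u, theta)  # everyone starts at d = 0
    for prev, nxt in zip(d_path, d_path[1:]):
        lik *= transition_prob(nxt, prev, x_beta, u, theta)
    return lik
```

Summing `path_likelihood` over all possible binary paths of a given length returns exactly 1, which is a convenient sanity check on the chain-rule factorization.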
We tried two different dimensions of $X_i$: 5 and 15 (which results in 12 and 32 parameters, with two sets of slope parameters and intercepts), where the $X_i$ are drawn from a joint normal distribution. The parameters $\theta$ are uniformly distributed. We also used 3 different simulation sizes: 20,000, 100,000, and 500,000, in all cases with $H = 1$. The results are presented in Table 1. For each case, we present 3 summary statistics: the mean squared error, the computation time (relative to using Nelder-Mead in the base model), and the mean of the minimized objective function. Despite the simplicity of the model, for a Monte Carlo study it takes a fair amount of time to estimate the base model, with 1 iteration of the six different estimators taking roughly 24 hours on an Amazon AMD processor when we use 500,000 simulations.
Comparing the base estimator with importance weighting, one can see the advantages. First, notice that even with 500,000 simulations the gradient-based estimator does not work on the base model, as it does not consistently find the minimum of the function. The importance weighting estimator, however, works very well. Comparing the importance-weighted gradient estimator to the standard estimator with the simplex method, both perform well in terms of mean squared error, though the gradient-based estimator performs slightly better in some cases. The most important result is that the importance-weighted gradient method is much faster than the base/simplex method: roughly 5 times faster with 12 parameters and 100 times faster with 32 parameters (in the 100,000 and 500,000 simulation cases).
For the Generalized Indirect Inference model, because calculating the optimal smoothing parameter is time consuming, we only do it once for each model (i.e., six times across the covariate/simulation-size specifications). We do not follow Bruins et al. (2018) precisely, but use something very close that fits our approach and seemed to work well. Although GII yields smaller time savings in the smaller model, it too is much faster than, and converges better than, the simplex method applied to the base model.
Given that this is only one specification, and a very simple model, we cannot draw general conclusions about the comparison between the importance weighting procedure and GII. It is clear, however, that for large problems there are substantial computational advantages to estimating with a smoothed model and a gradient-based method.

| DATA AND AUXILIARY MODEL
We use data from the Survey of Income and Program Participation (SIPP). Alternative data sets we could have used are the National Longitudinal Survey of Youth 1979 (NLSY79) as well as the older National Longitudinal Surveys of Young Women and Mature Women (NLSW). We would not argue that the SIPP clearly dominates the NLSY79, but rather that there are tradeoffs between the two data sets, and the vast majority of previous work discussed above has focused on the NLSY79 or NLSW. The advantage of the NLSY79 is that it is a much longer panel, but the disadvantage is that it contains a small number of individuals (at most around 6,000 women, a number that shrinks over time due to attrition from the survey).
The SIPP is a very large data set with short panels: we use observations from almost 100,000 different women. The challenge with the SIPP is that one does not observe the full lifecycle profile for each woman; one must piece together the panel data for different people at different ages. This requires an econometric model, and we use the Markov model presented in Section 3. Estimating such a model by maximum likelihood is extremely difficult given the severe initial conditions problem; indirect inference is a feasible alternative. We estimate the model using the last four panels of the Survey of Income and Program Participation: 1996, 2001, 2004, and 2008. The survey interviews individuals every four months, and we only use data pertaining to the survey month.^10 The sample includes White women who are 18 years or older and have at most 35 years of potential experience. Table 2 presents summary statistics of the main variables used in the analysis. Details of the data are discussed in Supporting Information Appendix B (Sauer & Taber, 2021).
10 We do not use earlier years because the nature of the survey changed around 1996; these later panels are substantially longer than the earlier ones.
The auxiliary model is constructed using the following auxiliary parameters (the full description, along with their empirical values, is in Supporting Information Appendix C, Sauer & Taber, 2021):
• Regression of log wages on potential experience dummies and state variables, with individual fixed effects
• Within and between variance of the error term from the previous regression
• Regression of the estimated wage fixed effect on education
• Linear probability regression of working on potential experience dummies and state variables, with individual fixed effects
• Between variance of the error term from the previous regression
• Regression of the wage fixed effect on the work fixed effect
• Linear probability regressions of whether a woman is currently married/currently unmarried and divorced, from the second wave of the survey, on potential experience dummies
• Linear probability regressions of whether an unmarried woman gets married/becomes unmarried between waves, on potential experience dummies and state variables
• Fraction of mothers who are married at childbirth
• Regression of having a child on 1-year-lagged work status
• Wages of mothers who work (with other covariates)
• Age difference between youngest and oldest child
• Linear probability regression of any children/two children/number of kids, on potential experience dummies and state variables
• Linear probability regression of working in one wave conditional on working in the previous wave, on potential experience dummies, state variables, and the work fixed effect
• Fraction of mothers who work in the interview before giving birth
• Regression of wage gains between periods for women who are employed in both periods
• Change in log wages for women with non-employment spells, divided by the difference in potential experience dummies.
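Several of the auxiliary parameters above come from regressions with individual fixed effects, estimated by the within transformation. A minimal sketch of that estimator (illustrative only; the variable names and the two-regressor design are our own assumptions):

```python
import numpy as np

def within_fe_regression(y, X, ids):
    """Within (fixed-effects) estimator: demean y and X by individual, then OLS.

    y   : (n,) outcome, e.g. log wage
    X   : (n, k) time-varying regressors, e.g. experience dummies, state variables
    ids : (n,) individual identifiers defining the panel groups
    """
    uids, inv = np.unique(ids, return_inverse=True)

    def demean(v):
        # subtract each individual's mean from every one of their observations
        sums = np.zeros((uids.size,) + v.shape[1:])
        np.add.at(sums, inv, v)
        counts = np.bincount(inv).reshape(-1, *([1] * (v.ndim - 1)))
        return v - (sums / counts)[inv]

    yd, Xd = demean(y), demean(X)
    beta, *_ = np.linalg.lstsq(Xd, yd, rcond=None)
    return beta
```

The individual fixed effect itself (used above as a regressor in the fixed-effect-on-education and wage-on-work-fixed-effect steps) can then be recovered as each individual's mean residual.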
The key parameters are the effects of the number of children on various outcomes, which can be seen in the "data" part of the tables and figures. Most surprisingly, in the fixed-effects wage regression (Supporting Information Appendix Table C1, Sauer & Taber, 2021), there is no evidence of a child penalty, in contrast to many of the papers cited previously. This is in large part because the panel is very short. We also find that having young children is an important determinant of working in the fixed-effects work regression shown in the Supporting Information Appendix, though the effect fades as the children age. We also include children aged 7 and above in this regression, but the coefficient is statistically and economically insignificant (−0.0044). For this reason, in the model we impose that only children younger than seven influence labor supply decisions. There are also substantial effects of marriage on labor supply in this fixed-effects regression.
Another key parameter is the coefficient on children over 18 in the wage growth regression. This variable is key to separately identifying experience and age effects. It serves as a proxy for actual experience because women who have had more children have accumulated less experience (the Markov model tells us how much less). The coefficient is interacted with education. The results are in Supporting Information Appendix Table C8 (Sauer & Taber, 2021). The effect at 12 years of school is positive (and even larger for college), indicating that women of the same age who have had more children experience faster wage growth. This is evidence that actual experience matters for the curvature, and the magnitude of this effect dictates the magnitude of the actual experience effect in the model.
We included other children variables in the log wage growth equation but did not find significant results so we do not include them here and do not incorporate motherhood directly into the human capital production function.

| IMPLEMENTATION OF APPROACH IN PRACTICE
To formally describe Υ_{hs} for our model, some new notation is introduced. In practice we take the number of bootstrap estimates H = 1, so we drop h from the subscripts. Let N^w_s be the number of work transitions and let τ^w_{s1}, …, τ^w_{sN^w_s} be the dates (in terms of actual experience) of these transitions. Let L_{s0} be labor force status upon labor market entry; given L_{s0}, we can keep track of the state, so we know the direction of each transition. Similarly for marriage, let N^m_s be the number of marriage transitions and τ^m_{s1}, …, τ^m_{sN^m_s} their dates. Analogously, let N^k_s be the number of children and τ^k_{s1}, …, τ^k_{sN^k_s} the dates when the children were born. Note that knowledge of the birth dates and the labor force transitions also tells us whether women stopped working right after childbirth. Finally, we write the measurement term on wages as ε_{st} = σ_ε ϵ_{st}, where ϵ_{st} is standard normal, and let ϵ_s be the vector of these terms for the periods in which the wage is observed by the econometrician. We then take Υ_s to collect these objects: L_{s0}, the transition counts and dates for work, marriage, and births, and ϵ_s. What is crucial for our approach is that the likelihood function ℓ(Υ_s | X_s; θ) is smooth as a function of θ, and that the rest of the variables used to produce the auxiliary model are smooth in θ after conditioning on Υ_s.
Note that the labor market, marriage, and birth transitions are perfectly pinned down by Υ_s, but human capital, and thus wages, is not. The reason is that, conditional on the work transitions, human capital is a smooth function of the parameters, so there is no need to include it in Υ_s. Note as well that we could have included ε_{st} in Υ_s rather than ϵ_{st}. In the former case the importance weights would change with σ_ε; parameterizing with ϵ_{st} instead, they do not. Our experience is that the procedure works better in the latter case, for reasons discussed below.
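The invariance point can be seen directly from the densities: conditioning on the standardized draw ϵ_{st} contributes a standard normal density to the likelihood regardless of σ_ε, whereas conditioning on ε_{st} = σ_ε ϵ_{st} contributes a N(0, σ_ε²) density whose ratio across parameter values varies with the draw. A small numerical check (the specific σ values are arbitrary):

```python
import numpy as np

def norm_logpdf(z, sigma):
    # log density of N(0, sigma^2) evaluated at z
    return -0.5 * (z / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)

rng = np.random.default_rng(1)
eps_std = rng.normal(size=1000)     # standardized draws, held fixed
sigma0, sigma1 = 0.5, 0.8           # initial and updated measurement s.d.

# conditioning on eps_std: density is standard normal under both parameter
# values, so its contribution to the likelihood ratio is identically 1
w_std = np.exp(norm_logpdf(eps_std, 1.0) - norm_logpdf(eps_std, 1.0))

# conditioning on eps = sigma0 * eps_std instead: the ratio varies draw by draw
eps = sigma0 * eps_std
w_raw = np.exp(norm_logpdf(eps, sigma1) - norm_logpdf(eps, sigma0))
```

Here `w_std` is constant at 1 while `w_raw` disperses as σ moves away from its initial value, which is the degradation of the weights the next paragraph describes.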
Writing the likelihood in terms of Υ_{hs} simplifies its computation relative to the likelihood of the underlying data for two reasons: integrating over unobserved heterogeneity and solving the initial conditions problem. In our case, the difficult integration problem arises from the initial conditions problem; unobserved heterogeneity by itself is not that computationally difficult because the measurement error ϵ_s is i.i.d. over periods and ν_s is only a two-dimensional, permanent object. It is important to point out that there are many other models in which the initial conditions problem might not exist but integrating over the distribution of unobservables would be much more complicated, making maximum likelihood computationally infeasible. For example, we could relax the i.i.d. assumption on ϵ_s, allow ν_s to evolve over time, or include an additional transitory error term that affects human capital accumulation each period, as in Fan et al. (2019). In our empirical problem the main difficulty with maximum likelihood is the initial conditions problem, but the general approach is useful for a much broader set of applications.
In practice, because the model is complicated, if the estimation procedure runs long enough that the parameters change substantially, the likelihood can get very small for many observations. As a result, the weight ℓ(Υ_{hs}; X_{hs}, θ)/ℓ_0(Υ_{hs}; X_{hs}) becomes approximately 0 for a large number of observations. In theory this is not a problem: if S is large enough, the law of large numbers still applies. As a practical matter, however, one is effectively using a much smaller sample to approximate the auxiliary moments. Note as well that if one simulates the model at parameter value θ_0, then ℓ_0(Υ_{hs}; X_{hs}) = ℓ(Υ_{hs}; X_{hs}, θ_0), so at that parameter value the weights are all equal to 1. Putting these together, it is useful to occasionally re-simulate the model at intervening estimates.
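One standard way to monitor this degeneracy (our suggestion, not a diagnostic taken from the paper) is the Kish effective sample size of the weights, which equals S when all weights are equal and collapses toward 1 as the weights concentrate on a few simulations:

```python
import numpy as np

def effective_sample_size(log_w):
    # Kish effective sample size of importance weights:
    # ESS = (sum w)^2 / sum(w^2); equals S when all weights are equal
    log_w = log_w - log_w.max()     # stabilize before exponentiating
    w = np.exp(log_w)
    return w.sum() ** 2 / (w ** 2).sum()

rng = np.random.default_rng(2)
S = 10_000
equal = np.zeros(S)                     # log weights all 0: simulated at theta_0
skewed = rng.normal(scale=3.0, size=S)  # parameters that have drifted far away

print(effective_sample_size(equal))     # 10000.0
print(effective_sample_size(skewed))    # far below S: time to re-simulate
```

Re-simulating at the current estimate resets all weights to 1 and thus restores the ESS to S, which is the rationale for the iterative procedure described next.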
After experimenting with different approaches, we settled on the following iterative procedure. Start with some initial value and then:
• Let θ̂_{j−1} be the last estimated value of the parameter (or the initial value when j = 1)
• Simulate the model at θ̂_{j−1} and let ℓ(Υ_s; X_s, θ̂_{j−1}) be the likelihood of this simulated data at this parameter
• Use a Newton method to minimize the weighted distance between the auxiliary and simulated parameters, taking at most 100 Newton steps, where Ω is a diagonal weighting matrix
• Let the parameter that minimizes this distance be θ̂_j
• Repeat this procedure until the fit stops improving
Once the fit stops improving, we switch to a simplex method without importance weighting. We use our current best guess of θ as a starting value and choose our estimate to minimize the unweighted objective function, where B̂_B(θ) is the base simulator defined in Equation (11). The reason for this final step is that if we stopped at iteration j, the parameter estimate would be θ̂_j, but the model that simulated the data for that estimate was θ̂_{j−1}. This means that for counterfactuals, if we simply simulated from θ̂_j, we would not get exactly the same simulation as our final estimates. Using the standard method in the final step guarantees that the fit of the data from our final estimates comes from θ̂_j alone. As a practical matter, this final stage did not lead to a substantial change in the parameter values, but it was time consuming.
The weights for the parameters of the auxiliary model were chosen in a somewhat ad hoc manner. We chose a diagonal weighting matrix Ω in which, for most auxiliary parameters, we divide by the variance of the estimated parameter. The problem with this default approach, or more generally with efficient weighting, is that it does not put sufficient weight on the moments we are most interested in fitting. For example, most of our regression models contain a full set of potential experience dummy variables, which gives 35 parameters, but only a few variables related to fertility (the fixed-effects wage regression has a single one). A purely statistical criterion would therefore put far more weight on fitting the experience profile (35 parameters) than on fertility (only 1). We adjust for this by overweighting the fertility parameters. Although ad hoc, we think this provides a better objective function than a purely statistical one. The precise scales are presented in Supporting Information Appendix C (Sauer & Taber, 2021).

| EMPIRICAL RESULTS
In the model, not all state variables affect all outcomes. The specification was chosen in order to fit the data in as parsimonious a way as possible and is motivated by patterns found in the data. In Tables 3a-3d we present the estimated parameters of the model. The parameters themselves are more easily understood through simulations rather than on an individual basis. For this reason, we do not offer a detailed discussion of them. Most of the parameters have the signs one would expect.
The fit of the model is presented in Supporting Information Appendix Tables C1–C8 and Figures C1–C5 (Sauer & Taber, 2021). One can see in the tables that, with only a few exceptions, the fit is excellent. We also attempt to fit the lifecycle profiles of working, wages, marriage, and children. Given the coarseness of the model, the relationship between hazard rates and potential experience does not fit perfectly, but in general the overall lifecycle patterns are fit very well.
The first issue of main interest is the determinants of the curvature of human capital, which is important for understanding the shape of female wage growth. Recall that in our baseline model curvature can come from two different sources. The first is the term (H̄ − H_it), which leads to human capital accumulation slowing as human capital approaches H̄. The second is the μ_i term, which leads to human capital accumulation slowing as workers age. The former is analogous to curvature due to "actual experience," whereas the latter is analogous to curvature due to "potential experience." As mentioned previously, we believe this distinction is identified by the coefficient on "kids greater than 18" and its interaction with education in the wage growth regression; one can see from Supporting Information Appendix Table C8 (Sauer & Taber, 2021) that these are matched well.^11 To better understand this distinction, we graph two alternative versions of the human capital production function for married women with the average amount of education (13.5 years). Because the point of this exercise is to explore curvature rather than the rate of wage growth, a_A and a_B are calibrated to values that keep human capital growth in the first 10 years the same as in the base case. The first model eliminates the age effect through μ, whereas the second eliminates the curvature in the human capital production function.
Figure 2 presents the results, in which Model A is labeled "No Age Effect" and Model B is labeled "No Direct Human Capital." Note that the distinction between the Baseline and Model A is barely visible, whereas Model B is distinctly different. Clearly μ_i is largely irrelevant, and the curvature derives from human capital accumulation.
11 The standard errors on these parameters are large, partly because the level and the interaction with education are collinear. The p value on the joint test that both are 0 is 0.002.
Next we turn to the main aim of this paper and simulate a model in which the relationship between fertility and work is relaxed to see how that affects human capital accumulation. That is, we compare our baseline model to a counterfactual in which children at home have no effect on working. Specifically, the effect of "Number of kids <7" on finding a job and leaving a job are set to 0 (see Table 3a) and the probability of leaving the work force immediately upon having a child is also set to 0 (Table 3b). We also find large effects of marriage on labor supply (Table 3a) so we also consider a counterfactual in which we eliminate this effect. Our third counterfactual eliminates both the marriage and fertility effects.
The direct effects on labor supply are presented in Figure 3. First, examining the fertility effects, one can see that eliminating them would lead to a considerable increase in labor force participation during the prime childbearing years, an increase that dissipates as women age. The difference peaks at around 10 years of potential experience at a level of roughly 10%. In our view this is a substantial effect, but it is not enormous, which is not that surprising given the raw data:^12 many women stop working while they have young children, but most do not. The marriage pattern is (not surprisingly) quite different over the lifecycle: it is substantially smaller early in the lifecycle but persists much longer.
We next examine the impact of these labor supply effects on human capital accumulation. These simulations are not analogous to those in Figure 1: to be in the actual wage regression, a woman needs to be working, so the shape of the profile in Figure 1 depends not just on human capital accumulation but also on selection into who is working. Because our counterfactual involves a change in working, there would be a selection effect influencing the profile. We can avoid this problem when we simulate the model because we can simulate a counterfactual wage and a level of human capital for everyone, those working and those not working; this is what we present. Figure 4a presents a simulation of the level of human capital (H_it in our model) at different ages. The labels are analogous to the counterfactuals presented in Figure 3. One can see that the loss in labor supply from fertility and marriage does suppress human capital, but the effect is relatively modest. To put this simulation in a more familiar context, we calculate the difference in human capital between each of the three counterfactuals and the baseline and plot it in Figure 4b. For the fertility counterfactual, the difference peaks at experience levels 10–20 at somewhat over 0.024 log points, suggesting that, on average, wages of women at these ages would be about 2% larger if there were no effect of fertility on labor supply. Again, this is a non-trivial effect, but compared to the difference in log wages between men and women in Figure 1 it is quite modest.^13 Adding the marriage effect as well yields more substantial effects of up to 4.7%. Although these effects go in the right direction, they are small relative to the wage growth differences between men and women, leaving room for other channels, such as discrimination (perhaps in the form of glass ceilings), to be important.
12 One can see in the fixed-effects work regression in Table C2 (Sauer & Taber, 2021) that the coefficients on young children are of a similar magnitude.
13 The 2% motherhood wage penalty we find is greater than the near-zero penalty found in Hill (1979) and Korenman and Neumark (1992), but less than the penalty among mothers with two or more children found in Waldfogel (1998a), Waldfogel (1998b), and Anderson et al. (2002). Our estimate is similar in magnitude to the wage penalties for the first child found in Loughran and Zissimopoulos (2007) and Miller (2011). Our other estimates are less directly comparable to previous findings but are consistent with results in Adda et al. (2017) and Braga (2013).
FIGURE 2 Curvature in human capital

| CONCLUSION
In this paper, a continuous time Markov model of female work, marriage, and fertility is estimated using data from the Survey of Income and Program Participation. The model provides a good fit to the data. Two different types of counterfactuals are simulated with the estimated parameters. The first seeks to understand whether the curvature in female wage growth is determined primarily by curvature in the human capital accumulation function as a function of previous human capital, or whether it is primarily driven by age. The results strongly suggest that curvature in the human capital production function is the driving force. The second counterfactual attempts to uncover the extent to which dropping out of the labor force, for fertility or marriage related reasons, suppresses women's human capital accumulation. Our finding is that it does so to a modest extent. Wages among prime-age women would be approximately 2.4% higher if the relationship between fertility and working were eliminated, and approximately 4.7% higher if the marriage effect were also eliminated.
The study also illustrates how to use importance sampling weights to smooth the objective function for indirect inference with discrete endogenous variables. Our procedure requires calculating the likelihood contribution for each observation in the sample at an initial trial vector of structural parameters. This constitutes the denominator of the weight, which remains fixed during the minimum distance iterations. The numerator of the weight is the likelihood contribution at the updated vector of trial parameters. At each iteration, the likelihood ratio is the importance sampling weight used in estimating the auxiliary model. The importance sampling weights can be formed with either the exact likelihood of the structural model or a simulated likelihood when the former is difficult to construct. Our Monte Carlo study suggests that potentially large gains in speed can be achieved by smoothing the objective function in this way.

OPEN RESEARCH BADGES
This article has been awarded Open Data Badge for making publicly available the digitally-shareable data necessary to reproduce the reported results. Data is available at [http://qed.econ.queensu.ca/jae/datasets/sauer001/].