The Molecular Genetic Architecture of Self-Employment

Economic variables such as income, education, and occupation are known to affect mortality and morbidity, such as cardiovascular disease, and have also been shown to be partly heritable. However, very little is known about which genes influence economic variables, although these genes may have both a direct and an indirect effect on health. We report results from the first large-scale collaboration that studies the molecular genetic architecture of an economic variable, entrepreneurship, that was operationalized using self-employment, a widely available proxy. Our results suggest that common SNPs when considered jointly explain about half of the narrow-sense heritability of self-employment estimated in twin data ($\sigma_g^2/\sigma_P^2 = 25\%$, $h^2 = 55\%$). However, a meta-analysis of genome-wide association studies across sixteen studies comprising 50,627 participants did not identify genome-wide significant SNPs. In total, 58 SNPs with $p < 10^{-5}$ were tested in a replication sample (n = 3,271), but none replicated. Furthermore, a gene-based test shows that none of the genes that were previously suggested in the literature to influence entrepreneurship reveal significant associations. Finally, SNP-based genetic scores that use results from the meta-analysis capture less than 0.2% of the variance in self-employment in an independent sample ($p \geq 0.039$). Our results are consistent with a highly polygenic molecular genetic architecture of self-employment, with many genetic variants of small effect. Although self-employment is a multi-faceted, heavily environmentally influenced, and biologically distal trait, our results are similar to those for other genetically complex and biologically more proximate outcomes, such as height, intelligence, personality, and several diseases.


Introduction
Economic variables such as income, education, and occupation are well-known to be related to health outcomes and longevity [1][2][3][4][5][6][7][8][9][10]. Specifically, there is a consistent inverse relation between indicators of socioeconomic status and cardiovascular disease [11]. For example, occupational choice is associated with the incidence of coronary heart disease among women [12]. Intriguingly, health outcomes, longevity, income, educational attainment, and occupational choice have all been shown to be partly heritable (see ref. [13] for complex diseases, refs. [14][15][16][17] for longevity, refs. [18][19][20][21][22] for education, refs. [23][24][25] for income, and refs. [26][27][28] for occupational choice). This suggests that the same genetic factors could be linked to socioeconomic status and health outcomes, or that indirect causal pathways from genetic variants to health outcomes exist that are mediated by individual behavior and the environment. For example, a potential mismatch between personal disposition and occupational choice may result in stress and decreased happiness, which have been shown to negatively affect (cardiovascular) disease incidence and longevity [29][30][31][32]. Therefore, knowledge about the specific molecular genetic architecture of socioeconomic variables and about the effects of mismatches between genetic predispositions and realized choices could yield important insights for epidemiology and public health policy. Unfortunately, most efforts to investigate the influence of genes on economic variables were until now limited to candidate gene studies that often failed to replicate later [33,34].
This study reports results from the first large-scale collaboration that studies the molecular genetic architecture of a specific economic behavior, entrepreneurship, using data from high-density SNP arrays. Entrepreneurship has been associated with poor health [35], increased stress [36], relatively low average incomes [37], but also with greater job and life satisfaction [38][39][40]. The analysis of entrepreneurship is complicated by the fact that it is a multi-faceted phenomenon [41]. Individuals may engage in entrepreneurial activity for a variety of reasons. For example, certain individuals may be motivated to pursue a business opportunity or to gain independence, whereas others may do so because of unemployment and a lack of viable alternatives in paid employment. Despite this complexity, empirical evidence suggests that entrepreneurship tends to run in families [42][43][44][45][46][47], and recent twin studies consistently estimate the heritability of this behavior to be on the order of 50% [26][27][28]. As these results suggest that entrepreneurship is partly influenced by genetic variation, specific markers that are associated with entrepreneurship should, in principle, exist. Research that is aimed at discovering these specific markers has thus far been limited to one candidate gene study. This study [48] found evidence for an association between a specific genetic variant in the DRD3 gene and entrepreneurship in a sample of n = 1,335. However, a more recent study [49] failed to replicate this association in three larger samples of n = 5,374, n = 2,066, and n = 1,925.
The molecular genetic architecture of entrepreneurship therefore remains largely unknown. A variety of alternative architectures could account for heritable variation. For example, there may be a small number of rare variants with strong effects, multiple common variants with small or modest effects, or some combination of these possibilities [50,51]. Therefore, we aimed to identify the molecular genetic architecture of entrepreneurship to facilitate a more sophisticated understanding of the nature of the associated heritable variation.
In this study, we use self-employment, the most widely available proxy, as a measure of entrepreneurship. Self-employment is defined as having started, owned, and managed a business. Initially, we used a classical twin design to estimate the heritability of the tendency to engage in self-employment. We performed this analysis to determine the comparability of our results with (1) estimates of previous twin studies, and (2) estimates from a novel method from molecular genetics. This recently described method [52] is used here to quantify the proportion of variance that is explained by common SNPs (and unknown causal variants that are in linkage disequilibrium with these SNPs) in the tendency to engage in self-employment.
Furthermore, we performed a meta-analysis of genome-wide association studies (GWASs) of self-employment from sixteen studies to identify genetic variants that are robustly associated with self-employment. Together, these studies comprised 50,627 participants of European ancestry who are part of the Gentrepreneur Consortium [53,54]. This study is the first large-scale effort to identify common genetic variants that are associated with an economic variable. We also tested whether self-employment could be predicted out-of-sample solely using genotype data and the results of our meta-analysis.
Theoretical and empirical evidence from entrepreneurship research suggests that there may be differences between males and females with respect to the type of businesses they start. These differences also extend to individuals' motivations, goals, and resources [55][56][57][58][59] and exist because women face different, and typically more, barriers to entrepreneurship than men [60][61][62]. Therefore, we performed both pooled and sex-stratified analyses for all of our investigations.

Participating studies and self-employment measures
The analyses were performed within the Gentrepreneur Consortium [53,54], which included two out of the five studies that participate in the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium [63] (THI-SEAS), the UK Adult Twin Registry (TwinsUK), and the Cardiovascular Risk in Young Finns Study (YFS). The Swedish Twin Registry (STR) served as an in silico replication study, as genome-wide data were only available following the completion of the discovery stage.
The studies collected data regarding occupational status using questionnaires or interviews, from which self-employment status was distilled. Self-employment measures were defined in collaboration with the consortium leaders to minimize heterogeneity across participating studies. The cases were defined as individuals who were self-employed at least once, and the controls were defined as individuals who were never self-employed during their working life. However, for a number of studies, reliable data regarding work-life history were unavailable, possibly resulting in the inclusion of previously self-employed individuals in the control group. The details regarding the background and self-employment measures of each of the discovery studies and of the replication study are given in Table S1.

Ethics statement
All participating studies were approved by the relevant institutional review boards or the local research ethics committees, including the Icelandic National Bioethics Committee (VSN: 00-063), the Icelandic Data Protection Authority, and the Institutional Review Board for the National Institute on Aging (AGES); the Ethics Committee of the Medical Faculty of the University of Graz (ASPS); the Medical Ethics Committee at Erasmus University which approved the protocols for the ascertainment and examination of human subjects (ERF); the local ethics committee and data safety commissioner, the sampling design was approved by the federal data safety commissioner (GHS); the Ethics Committee for Epidemiology and Public

Genotyping, imputation, and quality control
The seventeen participating studies used a variety of commercially available SNP genotyping platforms to genotype their participants. Each study performed quality control of their genotypic data and imputed the genotypes of each participant to a common set of approximately 2.5 million SNPs from the HapMap CEU population. The exceptions to this were THI-SEAS, which only supplied results for directly genotyped SNPs, and HRS, which imputed to the 1000 Genomes Project Phase I v3 panel. Prior to the meta-analysis, we performed parallel quality control of the association results for each study. SNPs were excluded on the basis of minor allele frequency (MAF < 0.01, or MAF < 0.05 if deemed necessary) and if the imputation quality (a measure of the observed variance divided by the expected variance of the imputed allele dosage from the imputation software output) was less than 0.4. Following these exclusions, approximately 2.4 million SNPs remained. Study-specific details regarding the genotyping, imputation, and quality control are given in Table S2.
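The exclusion rules described above can be sketched as a simple filter over per-SNP association results. This is only an illustrative sketch: the `SnpResult` record, its field names, and the example SNP identifiers are hypothetical and do not come from the consortium's pipeline; only the thresholds (MAF of 0.01 or 0.05, imputation quality of 0.4) are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class SnpResult:
    """Hypothetical per-SNP record from one study's association output."""
    rsid: str
    maf: float   # minor allele frequency in the study
    info: float  # imputation quality: observed / expected dosage variance


def passes_qc(snp, maf_min=0.01, info_min=0.4):
    """Keep a SNP only if it is neither too rare nor too poorly imputed."""
    return snp.maf >= maf_min and snp.info >= info_min


def filter_snps(snps, maf_min=0.01, info_min=0.4):
    """Return the SNPs that survive both exclusion rules."""
    return [s for s in snps if passes_qc(s, maf_min, info_min)]
```

Passing `maf_min=0.05` corresponds to the stricter frequency threshold that was applied where deemed necessary.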

Statistical analysis
Tetrachoric correlations were used to calculate self-employment correlations for MZ and DZ twin pairs. This analysis assumes a latent normally distributed tendency to engage in self-employment. We estimated the heritability of the tendency to engage in self-employment in the replication study using standard twin study methods, which were implemented in the program Mx [64]. Only complete twin pairs with data regarding self-employment status were included in the analysis and opposite-sex DZ twin pairs were excluded, resulting in a final sample size of 4,464 individuals. Specifically, for pooled males and females, males only, and females only, we fitted the three following nested models using the maximum likelihood approach on the raw data: (1) a model including an additive genetic effect, a shared common environment effect, and an individual-specific environment effect (the ACE model); (2) a model that included only an additive genetic and an individual-specific environment effect (the AE model); and (3) a model including only a common environment effect and an individual-specific environment effect (the CE model). For all of the samples, we controlled for a z-score of age by estimating age-specific thresholds. For the pooled sample, we additionally controlled for sex in a similar way.
We used the method that was recently developed by Yang et al. [52] to estimate the proportion of variance in the tendency to engage in self-employment that is explained by all of the common genotyped SNPs. The method is implemented in the GCTA software [65] and hinges on the assumption that in a sample of unrelated individuals, environmental factors segregate independently in the pedigree from the degree of genetic relatedness. In contrast to the twin study design, genetic relatedness is not inferred from the pedigree but is estimated directly from genome-wide SNP data. Under the assumption of no confounding by environmental variables, we can then estimate the accounted-for variance by relating the estimated genetic relatedness between pairs of individuals to their phenotypic correlation. The resulting estimate is actually a lower bound of the heritability that is estimated from classic twin and family studies. The reason for this is that twin and family studies capture the variation that is due to all of the additive causal variants, whereas the more recently developed method only captures the variants that are either directly genotyped or in linkage disequilibrium.
We used a combined sample of individuals from one of the discovery studies (RS-I) and the replication study (STR) to estimate the accounted-for variance. We restricted the sample from each study to individuals for whom data regarding self-employment were available. Additionally, we included only one randomly selected individual from each family in the STR sample. A second round of quality control of the genotypic data was then performed for both studies. In the RS-I sample, we excluded 3,748 SNPs because they failed a test of Hardy-Weinberg equilibrium at $p < 1 \times 10^{-6}$. We removed 24,993 SNPs with minor allele frequencies that were lower than 0.01 and another 6,665 due to data missingness greater than 5%. In total, 5,374 individuals and 561,466 autosomal SNPs were included in the analysis. In the STR sample, we removed two SNPs because they failed a test of Hardy-Weinberg equilibrium at $p < 1 \times 10^{-6}$. Another 628 SNPs with a minor allele frequency lower than 0.01 were removed, as were two SNPs with data missingness greater than 5%. Therefore, 643,924 autosomal SNPs and 2,589 individuals were included in the analysis.
We then estimated the genetic relationships among 7,963 individuals in the combined sample from the 301,115 common autosomal SNPs. We dropped one of any pair of individuals with an estimated genetic relationship that was greater than 0.025 while maximizing the remaining sample size to exclude the possibility of ascribing shared environmental effects to genetic effects and/or including the effects of causal variants not correlated with the genotyped SNPs but captured by the pedigree. The maximum relatedness in the remaining sample of 6,223 individuals therefore approximately corresponds to cousins two to three times removed [52].
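To make the relatedness cutoff concrete, the sketch below computes a pairwise genetic relationship as the average, over SNPs, of the product of the two individuals' allele counts after centering by twice the allele frequency and scaling by the expected variance. This allele-frequency-standardized estimator is an assumption about the form of the estimate; the text only states that relationships were estimated from SNP data. The function names and the toy genotypes are hypothetical, and the sketch merely flags pairs above the cutoff rather than choosing which individual to drop.

```python
def pairwise_relatedness(geno_j, geno_k, freqs):
    """Mean over SNPs of (x_j - 2p)(x_k - 2p) / (2p(1 - p)).

    geno_j, geno_k: allele counts (0, 1, or 2) for two individuals.
    freqs: frequency p of the counted allele at each SNP (0 < p < 1).
    """
    terms = [
        (xj - 2 * p) * (xk - 2 * p) / (2 * p * (1 - p))
        for xj, xk, p in zip(geno_j, geno_k, freqs)
    ]
    return sum(terms) / len(terms)


def related_pairs(genotypes, freqs, cutoff=0.025):
    """List pairs of individual IDs whose estimated relatedness exceeds cutoff."""
    ids = sorted(genotypes)
    return [
        (a, b)
        for i, a in enumerate(ids)
        for b in ids[i + 1:]
        if pairwise_relatedness(genotypes[a], genotypes[b], freqs) > cutoff
    ]
```

In a real analysis the number of SNPs is in the hundreds of thousands, so the estimate for unrelated pairs is concentrated near zero and a cutoff of 0.025 separates distant from close relatives.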
Next, the linear mixed model $y = \mu + g + e$ was fitted, where $y$ is the binary phenotype, $\mu$ is the mean term, $g$ is the total additive genetic effect of the SNPs, and $e$ is a residual effect. Restricted maximum likelihood (REML) was used to estimate the variance of the total additive genetic effect, $\sigma_g^2$, of the SNPs by fitting the genetic relationships as the covariance structure. Because the analyzed phenotype is binary, $\sigma_g^2$ is the variance of the total additive genetic effects on the observed 0-1 scale. A latent normally distributed tendency to engage in self-employment was assumed when transforming the explained variance from the observed 0-1 scale to the latent scale using the transformation that is derived in the appendix of Dempster and Lerner [66]. For all of the analyses, we controlled for a z-score of age, study, and the first ten principal components of the genetic relationships of the combined sample. In the pooled sample, we also controlled for sex.
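The observed-to-latent scale transformation can be illustrated with a minimal sketch. It assumes the commonly quoted form of the Dempster and Lerner result, in which the observed-scale estimate is multiplied by K(1 - K)/z^2, where K is the prevalence of the trait and z is the height of the standard normal density at the threshold that cuts off a proportion K. It also assumes that the sample prevalence equals the population prevalence, so no ascertainment correction is applied. The function and parameter names are illustrative.

```python
from statistics import NormalDist


def observed_to_liability(h2_observed, prevalence):
    """Rescale variance explained on the 0-1 scale to the latent (liability) scale.

    Assumes a standard normal liability and an unascertained sample.
    """
    normal = NormalDist()
    threshold = normal.inv_cdf(1 - prevalence)  # liability threshold for cases
    z = normal.pdf(threshold)                   # density height at the threshold
    return h2_observed * prevalence * (1 - prevalence) / z ** 2
```

Because z^2 shrinks faster than K(1 - K) as the trait becomes rarer, the scaling factor grows for low-prevalence traits, which is why estimates on the latent scale exceed those on the observed scale.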
In addition to the Yang et al. [52] method, we employed a novel method developed by So et al. [67] that serves the same purpose, i.e., estimating the proportion of variance in the tendency to engage in self-employment that is explained by all of the common SNPs. However, in contrast to the Yang et al. [52] method, So et al.'s method does not require raw genotype data but attempts to recover the accounted-for variance from the meta-analysis results. Using PLINK [68], we restricted the meta-analysis results to SNPs that were present in the HapMap Phase II CEU panel (release 23a) and pruned those in strong linkage disequilibrium with other SNPs using a pairwise $r^2$ threshold of 0.25 in a window of 100 SNPs that slides in 25 SNP increments. After this procedure, 172,742, 175,970, and 172,989 SNPs remained in the pooled males and females, males only, and females only samples, respectively. We used the Gaussian kernel function, considered under the null hypothesis of no association, and ran the simulation 500 times in each sample.
The genome-wide association analysis of self-employment was independently performed by each study according to a predefined analysis plan. The analyses were performed for pooled males and females, males only, and females only using an additive genetic model, controlling for age (≤29 [reference]; 30-39; 40-49; ≥50) and sex in the pooled sample. To control for population stratification, the first four principal components of the genotypic data were also included if available. We provide details regarding the statistical analysis within each study in Table S2.
Following the association analyses, the genomic inflation factor $\lambda$ was calculated for each sample to quantify any remaining population stratification or cryptic relatedness. The lowest inflation factor was 0.989, and the highest was 1.156, although this latter value was for a study that did not include the first four principal components of the genotypic data in the analysis (Table S3). Genomic control [69] was applied in samples with inflation factors that were greater than one by adjusting the test statistics.
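As an illustration of the genomic control step, the sketch below assumes the usual definition of the inflation factor as the median of the observed 1-degree-of-freedom chi-square statistics divided by the median expected under the null (about 0.455), and divides the statistics by that factor only when it exceeds one, as described above. Function names are illustrative and not taken from any of the software packages cited.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom (about 0.455),
# obtained as the squared 75th percentile of the standard normal.
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def genomic_inflation(chi2_stats):
    """Inflation factor: observed median statistic over the null median."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN


def apply_genomic_control(chi2_stats):
    """Return (adjusted statistics, inflation factor).

    Statistics are deflated only when the inflation factor exceeds one.
    """
    lam = genomic_inflation(chi2_stats)
    if lam <= 1.0:
        return list(chi2_stats), lam
    return [c / lam for c in chi2_stats], lam
```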
We next performed fixed-effect meta-analyses of the association results from the discovery studies for pooled males and females, males only, and females only using METAL software [70]. Although the phenotype was defined as self-employment in each participating study, we could not harmonize the exact wording of the question on which the self-employment measure was based. In addition, the connotations of self-employment may depend to some extent on the level of economic development and culture. This may lead to unobserved gene-environment interactions that could introduce additional noise in the GWAS results pooled across studies. We combined the association results using weighted z-scores that were based on the p-values and the direction of the effects. This method first computes a per-study signed z-score for each SNP based on its p-value and the effect direction. The z-scores are then summed with weights that are proportional to the square root of the sample size of each study. Following the meta-analyses, only autosomal SNPs that were present in the HapMap Phase II CEU panel (release 22, NCBI build 36) and in at least half of the contributing samples in each meta-analysis were retained prior to both reporting p-values and the creation of the Q-Q and Manhattan plots. We a priori set the genome-wide significance threshold to $p < 5 \times 10^{-8}$. SNPs with $p < 1 \times 10^{-5}$ were considered suggestive and also carried forward to the replication stage. The heterogeneity of the test statistics between the studies was assessed using the $I^2$ metric [71,72] and Cochran's Q statistic [73].
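The weighted z-score scheme described above can be sketched as follows. The sketch assumes two-sided p-values and normalizes the weighted sum by the square root of the sum of squared weights so that the combined statistic is again standard normal under the null; this normalization is the standard one for sample-size weighting but is stated here as an assumption. The function names are illustrative and are not part of METAL.

```python
from math import sqrt
from statistics import NormalDist

_NORMAL = NormalDist()


def signed_z(p_value, effect_sign):
    """Convert a two-sided p-value and effect direction (+1 or -1) to a z-score."""
    # Using the lower tail avoids precision loss for very small p-values.
    return -effect_sign * _NORMAL.inv_cdf(p_value / 2)


def meta_z(studies):
    """Combine studies given as (p_value, effect_sign, sample_size) tuples.

    Returns the combined z-score and its two-sided p-value.
    """
    numerator = 0.0
    sum_sq_weights = 0.0
    for p_value, effect_sign, n in studies:
        weight = sqrt(n)  # weight proportional to sqrt of sample size
        numerator += weight * signed_z(p_value, effect_sign)
        sum_sq_weights += weight * weight
    z = numerator / sqrt(sum_sq_weights)
    p_meta = 2 * _NORMAL.cdf(-abs(z))
    return z, p_meta
```

Two equally sized studies with the same p-value and direction yield a combined z-score that is the square root of two times the per-study z-score, whereas opposite directions cancel to zero.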
Replication was attempted for significant and suggestive SNPs from each meta-analysis using an in silico replication study comprising 3,271 individuals. The association results for these SNPs were looked up in the replication study and meta-analyzed together with the discovery samples for pooled males and females, males only, and females only. To adjust for family relationships in the replication study, we performed family-based association tests implemented in the MERLIN software [74].
We used the discovery meta-analyses results to calculate gene-based p-values using the VEGAS program [75]. The positions of the UCSC Genome Browser hg18 assembly were employed to assign SNPs to genes, which included regions that were ±50 kb from the 5′ and 3′ UTRs.
For the prediction analyses, we followed the approach that was pioneered by The International Schizophrenia Consortium [76] and used the association results from the discovery meta-analyses to predict self-employment in the STR. Specifically, twelve overlapping sets of SNPs that were nominally associated in the discovery meta-analyses were created for different significance thresholds ($p_T < 0.01$, $p_T < 0.05$, $p_T < 0.1$, $p_T < 0.2$, $p_T < 0.3$, $p_T < 0.4$, $p_T < 0.5$, $p_T < 0.6$, $p_T < 0.7$, $p_T < 0.8$, $p_T < 0.9$, and $p_T \leq 1$). These sets were used as inputs for score calculation in the STR. We restricted the STR sample to individuals for whom data regarding self-employment were available and included only one randomly selected individual from each family, resulting in a final sample size of 2,589 individuals for the prediction analyses.
Prior to calculating the scores for each individual in the STR, we followed [76] and selected all of the autosomal SNPs, pruning those in strong linkage disequilibrium with other SNPs. This process was performed using a pairwise $r^2$ threshold of 0.25 in a window of 200 SNPs that slides in five SNP increments. Following this exclusion process, 135,823 SNPs remained. The PLINK [68] 'score' function was then used to calculate the total score for each individual in the STR. The score is defined as the sum of the number of score alleles, weighted by the estimated coefficients from the discovery meta-analyses, divided by the number of non-missing genotypes. If an individual was missing a genotype, it was imputed as the mean genotype based on the score allele frequency in the STR. On average, the score was calculated from approximately 120,000 SNPs given that (1) the coefficients were only estimated for SNPs in the HapMap CEU population in the discovery meta-analyses, and (2) the overlap with the genotyped SNPs was not perfect. Lastly, we regressed self-employment onto the score using a logistic regression model. The variance that was explained by the score was estimated using the Nagelkerke pseudo-$R^2$ of the fitted model. We also calculated the area under the receiver operating characteristic curve (AUC) to evaluate the prediction accuracy.
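The score definition given above can be written out as a short sketch. It follows the verbal description in the text (weighted sum of score-allele counts, mean imputation of missing genotypes from the score-allele frequency, division by the number of non-missing genotypes) and is not PLINK's actual implementation; the function name and the use of `None` to mark a missing genotype are assumptions made for illustration.

```python
def polygenic_score(genotypes, weights, allele_freqs):
    """Per-individual genetic score following the description in the text.

    genotypes: score-allele counts (0, 1, 2) or None when missing.
    weights: estimated coefficients from the discovery meta-analysis.
    allele_freqs: score-allele frequencies in the target sample.
    """
    total = 0.0
    n_nonmissing = 0
    for count, weight, freq in zip(genotypes, weights, allele_freqs):
        if count is None:
            # Mean imputation: expected allele count is twice the frequency.
            total += weight * 2 * freq
        else:
            total += weight * count
            n_nonmissing += 1
    if n_nonmissing == 0:
        raise ValueError("no non-missing genotypes for this individual")
    return total / n_nonmissing
```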

Results
Heritability of self-employment and the degree of variance that is accounted for by common SNPs
We used data from the Swedish Twin Registry (STR) and the classical twin design to estimate the heritability of the tendency to engage in self-employment. We computed the tetrachoric correlations between the tendencies to engage in self-employment within monozygotic (MZ) and dizygotic (DZ) twin pairs. Table 1 indicates that the correlations within the MZ twin pairs were consistently higher than within the DZ twin pairs for males only, for females only, and for pooled males and females. We note that the correlation within DZ twin pairs in the pooled sample was higher than the DZ correlations in males and females when the two sexes are considered separately. This effect most likely results from imprecise estimation of the tetrachoric correlations due to the small number of cases. When we computed Pearson correlations, the pooled DZ twin pairs correlation was in between the male and female DZ twin pairs correlations. Applying Falconer's formula [77] to the correlations in Table 1 yields $h^2$ estimates of 0.39 for pooled males and females, 0.69 for males only, and 0.34 for females only.
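For reference, Falconer's formula in its usual form estimates heritability as twice the difference between the MZ and DZ twin correlations. The symbols $r_{MZ}$ and $r_{DZ}$ are introduced here for illustration and denote the tetrachoric correlations within MZ and DZ pairs, respectively.

```latex
h^2 = 2\,\left(r_{MZ} - r_{DZ}\right)
```

Under this formula, an unusually high DZ correlation mechanically lowers the heritability estimate, which is consistent with the pooled estimate falling below the male-only estimate.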
A maximum likelihood approach was employed to estimate the relative contributions of the additive genetic (A), shared common environment (C), and individual-specific environment (E) components. This approach was performed using an ACE model and two nested submodels for pooled males and females, males only, and females only. Table 2 gives the estimates of the A component as 0.54 for pooled males and females, 0.67 for males only, and 0.38 for females only. The estimates of the C component were 0.01 for pooled males and females, 0.00 for males only, and 0.02 for females only. The A component was significant at the 95% confidence level for pooled males and females, and for males only, although the confidence intervals were very wide. This component was not significant for the females only analysis. However, the $\chi^2$ test for goodness-of-fit and Akaike information criterion indicated that the AE model was the best-fitting model in all samples. In this submodel, the estimate for the A component for females only did not change markedly compared to the ACE model but was significant at the 95% confidence level. The estimates of the A component for pooled males and females, and males only were 0.55 and 0.67, respectively; these results were significant.
The recently developed method by Yang et al. [52] was employed to estimate the degree of variance in the tendency to engage in self-employment that is explained by all of the genotyped autosomal SNPs in the GWAS datasets. The proportion of the explained variance was estimated for pooled males and females, males only, and females only. To maximize the power of the analysis, we used a combined sample of one of the discovery studies (Rotterdam Study Baseline [RS-I]) and the STR. We estimated that 25% (p = 0.032) of the variance in the tendency to engage in self-employment could be explained by the common genotyped autosomal SNPs for pooled males and females (Table 3). The variance that could be explained for males only and for females only was 25% (p = 0.152) and 0% (p = 0.499), respectively. The estimates for males and females separately were not significantly different from one another. The fact that the variance that is explained was zero for females is most likely due to the very low number of female cases (n = 353) compared to the number of controls (n = 3,482). The estimation of the explained variance is therefore very imprecise. We also estimated the variance that was explained for pooled males and females, males only, and females only in the RS-I and the STR separately. The estimates were not significant because the standard errors of these estimates depend heavily on the sample size. However, considered in their entirety, the results were consistent with the estimates that we present for the combined RS-I and STR samples. Overall, the results for pooled males and females and for males indicated that the degree of variance in the tendency to engage in self-employment that is explained by all of the common autosomal SNPs simultaneously is only approximately half of the narrow-sense heritability that is estimated using the STR and the classical twin design. Furthermore, estimates using the method developed by So et al. [67] also provide non-zero estimates for heritability. Specifically, the accounted-for variance was 7% for pooled males and females, 21% for males only, and 15% for females only. However, confidence intervals and standard errors could not be calculated for these estimates because not all raw genotype data were available, prohibiting further interpretation of these results.

Meta-analyses of genome-wide association studies
We performed genome-wide association analyses of self-employment using the data from sixteen discovery studies. These studies comprised 7,734 participants who had been self-employed at least once and 42,893 participants who did not report being self-employed. Table 4 includes the descriptive statistics for the studies. The mean ages in the pooled samples of males and females ranged from 31 to 68.8 years, and the average age across all of the studies was 53.4 years. Following independent association analyses for each study, we performed a fixed-effect meta-analysis of the study-level results for approximately 2.4 million SNPs using a pooled z-score approach.
The discovery meta-analysis Q-Q plot (Figure 1A) did not indicate a strong deviation for the lowest p-values. Moreover, no confounding issues related to population stratification, cryptic relatedness, or genotyping errors were detected, as no systematic deviation from the expectation under the null hypothesis of no association was observed [78]. As illustrated in the Manhattan plot (Figure 2A), we observed twenty SNPs with $4.1 \times 10^{-6} \leq p < 1 \times 10^{-5}$ (Tables 5 and S4). The SNP with the lowest p-value, rs6906622 ($p = 4.10 \times 10^{-6}$), was located near the RNF144B gene, with most studies indicating that the minor allele increased the probability of being self-employed (Table 5).
We next attempted to replicate in silico the twenty suggestive SNPs in the STR (n = 3,271). Two of the twenty SNPs associated with self-employment were statistically significant at the 5% level in the replication study. However, the SNP effects were not in the same direction as in the majority of the discovery studies (Table  S4), indicating that these SNPs were potential false positives. We then performed a combined meta-analysis of the discovery and replication studies. For all SNPs, the p-values were larger in the combined sample than in the discovery sample and did not reach genome-wide significance (Table S4).
The Q-Q plot for the male only meta-analysis (Figure 1B) gave a certain degree of suggestive evidence of association; however, no evidence of population stratification, cryptic relatedness, or genotyping errors was observed, as only certain SNPs, those with particularly low p-values, deviated from their expectation under the null hypothesis of no association. The female only meta-analysis Q-Q plot (Figure 1C) did not indicate a strong deviation for the lowest p-values and no evidence of population stratification, cryptic relatedness, or genotyping errors was observed. No SNPs reached genome-wide significance in the sex-stratified meta-analyses (Table 5), as can be observed in the Manhattan plots (Figures 2B and C). The male meta-analysis resulted in 22 suggestive SNPs with $p < 1 \times 10^{-5}$, and the female meta-analysis resulted in sixteen suggestive SNPs (Tables 5, S5, and S6). The top SNP in males, rs6738407 ($p = 1.52 \times 10^{-7}$), was located in the HECW2 gene, and most studies reported that carrying the minor allele decreased the probability of being self-employed. The top SNP in females, rs2331548 ($p = 1.93 \times 10^{-6}$), was located near the CBR4 gene, and most studies estimated that carrying the minor allele decreased the probability of being self-employed.
The replication strategy for the 38 suggestive SNPs from the sex-stratified meta-analysis that were carried forward into the replication stage was similar to that used for the meta-analysis replication of the pooled data. We performed an in silico replication study using the data from the STR. None of the SNPs reached nominal significance ($p < 0.05$) in the replication study for males only (n = 1,409, Table S5) and females only (n = 1,862, Table S6). In addition, for the majority of the suggestive SNPs, the effect was not consistently in the same direction as was reported in the majority of the discovery studies, again indicating that these SNPs were potential false positives. We meta-analyzed the results from the sex-stratified discovery meta-analysis and the replication study in a combined meta-analysis. For males, five SNPs had lower p-values compared to the male discovery meta-analysis, although none reached genome-wide significance (Table S5). In the combined meta-analysis for females, we observed that one SNP, rs562487, had a smaller p-value in this combined meta-analysis; however, this SNP did not reach genome-wide significance ($p = 4.01 \times 10^{-6}$; Table S6).
Table note: The genetic relationships were estimated from 301,115 directly genotyped autosomal SNPs that were available in both studies. All analyses controlled for age, study, and the first 10 principal components of the genetic similarity matrix of the combined sample of RS-I and STR. In the pooled sample we also controlled for sex. The results did not change markedly when 4 or 20 principal components were included.
To identify novel genes that may be associated with self-employment, we tested 17,697 genes for pooled males and females, 17,698 genes for males only, and 17,699 genes for females only, implying a significance level of p < 2.8×10⁻⁶. None of the analyzed genes reached this predetermined significance level (Tables S8, S9, and S10). The gene with the lowest p-value was SLC15A3 for the pooled male and female analysis (p = 1.63×10⁻⁴). For males only, the lowest p-value was for TMEM156 (p = 1.61×10⁻⁴), and for females only, the lowest p-value was for PCP4 (p = 4.70×10⁻⁵).
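The stated significance level is consistent with a Bonferroni correction of a 5% family-wise error rate for the number of genes tested. The short sketch below reproduces that arithmetic; the choice of 0.05 as the family-wise level and the function name `bonferroni_threshold` are assumptions made for illustration and are not taken from the study's analysis code.

```python
# Hypothetical helper: per-test significance level under a Bonferroni correction.
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Return the per-test level that keeps the family-wise error rate at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# Numbers of genes tested, as reported in the text.
gene_counts = {"pooled": 17_697, "males only": 17_698, "females only": 17_699}

# Assumed family-wise error rate of 5%.
thresholds = {label: bonferroni_threshold(0.05, n) for label, n in gene_counts.items()}

for label, threshold in thresholds.items():
    print(f"{label}: p < {threshold:.2e}")
```

All three analyses give a threshold of roughly 2.8×10⁻⁶, because the gene counts differ by only one or two.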
We also sought to replicate the association that was reported by Nicolaou et al. [48] to exist between a common variant, rs1486011, which is located in the DRD3 gene, and the tendency to be an entrepreneur. The SNP was nominally significant in the discovery meta-analysis (p = 0.011; Table S11); however, most studies reported a positive effect of the C allele, opposite to that reported by Nicolaou et al. [48], corroborating the results from an earlier replication study [49]. We also sought to replicate this SNP in the sex-stratified discovery meta-analyses. In this analysis, we observed some evidence for a positive effect of the C allele in males (p = 0.046; Table S11) but not in females (p = 0.112; Table S11).

Predicting self-employment from genotype data
We examined whether the results from the discovery meta-analyses could be used to predict self-employment in the replication study [76]. We pruned the set of autosomal SNPs to a subset of approximately 120,000 SNPs that are in approximate linkage equilibrium. In an initial prediction analysis, we included only the subset of these 120,000 SNPs that reached a 1% significance level. We calculated a predictive score for each individual in the replication study by determining, for each SNP, the product of the individual's number of effect alleles and the estimated regression coefficient from the discovery meta-analysis. This product was then summed across the included SNPs and divided by the number of included SNPs. We evaluated the predictive power of the SNPs by calculating the proportion of variance in the tendency to engage in self-employment that was explained by the score and the area under the receiver operating characteristic curve (AUC). We repeated this prediction analysis eleven additional times, each time with a less stringent significance threshold required for a SNP to be included in the score. Hence, each time this analysis was performed, a larger subset of the 120,000 SNPs was analyzed.
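The score construction described above can be sketched in a few lines. This is a minimal illustration on made-up toy data, not the study's pipeline: the function names `genetic_score` and `auc`, the strict inequality used for the p-value threshold, and all numbers are assumptions. The variance-explained step is omitted because the text does not specify which measure was used.

```python
import numpy as np


def genetic_score(dosages, betas, pvalues, p_threshold):
    """Average of (effect-allele count x discovery coefficient) over included SNPs.

    dosages: (n_individuals, n_snps) effect-allele counts in the target sample
    betas:   (n_snps,) regression coefficients from the discovery analysis
    pvalues: (n_snps,) discovery p-values used to select SNPs
    """
    dosages = np.asarray(dosages, dtype=float)
    betas = np.asarray(betas, dtype=float)
    keep = np.asarray(pvalues, dtype=float) < p_threshold  # assumed: strict inequality
    if not keep.any():
        raise ValueError("no SNP passes the significance threshold")
    # Sum the per-SNP products, then divide by the number of included SNPs.
    return dosages[:, keep] @ betas[keep] / keep.sum()


def auc(scores, labels):
    """Probability that a random case outscores a random control (ties count 1/2)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    diff = scores[labels][:, None] - scores[~labels][None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size


# Toy data: 4 individuals, 3 pruned SNPs (all values hypothetical).
dosages = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [0, 2, 2]]
betas = [0.10, -0.05, 0.20]
pvalues = [0.005, 0.20, 0.008]
self_employed = [1, 0, 1, 0]

# Repeat the analysis with increasingly liberal thresholds.
for p_threshold in (0.01, 0.5, 1.01):
    scores = genetic_score(dosages, betas, pvalues, p_threshold)
    print(p_threshold, scores, auc(scores, self_employed))
```

Loosening the threshold adds SNPs to the score, which mirrors the repeated analyses with progressively larger SNP subsets.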
For the pooled analysis of males and females (n = 2,589), the variance that was explained by the score reached a maximum of 0.184% when all SNPs were included (p = 0.039; Table S12). The scores for males only (n = 1,110) and for females only (n = 1,479) showed no evidence for association with self-employment (all p ≥ 0.144, Table S12). Furthermore, we did not observe a consistent positive relationship between the variance in the tendency to engage in self-employment that was explained by the score and the significance threshold p_T (Figure 3).

Discussion
We present results from four methods of analysis, three of which are based on genome-wide molecular genetic data, to investigate the molecular genetic architecture of self-employment.
First, using a classical twin design, we report that 55% of the variance in the tendency to engage in self-employment is due to additive genetic effects, with higher heritability for males (67%) than for females (40%). Our estimates are in agreement with those of previous twin studies. These earlier studies suggested heritabilities of 48% in a sample of primarily female British twins [26] and of 38% in a sample of US twins [28]. In addition, Zhang et al. [27] estimated the heritability of current business ownership and self-employment in a sample of Swedish twins and observed evidence of a significant additive genetic effect for females but not for males. Our results suggest significant heritability among males as well; however, the confidence intervals of the estimates are very wide for both our study and for that of Zhang et al. [27]. At least a portion of the differences between these two studies may be explained by imprecision and/or by the different samples and definitions of entrepreneurship that were used.
Second, by applying a method that was recently developed by Yang et al. [52] to entrepreneurship, we estimate that approximately 25% of the variance in the tendency to engage in self-employment (about half of the h² estimated in twin studies) could in principle be explained by the additive effects of common SNPs that are in linkage disequilibrium with the unknown causal variants. These results are in line with previous studies, which have estimated that common SNPs account for one-quarter to half of the narrow-sense heritability for height [52], intelligence [80,81], personality [51,82], several common diseases [83], schizophrenia [84], and recently for several economic and political preferences [22].
Several explanations may account for the finding that the heritability estimate for self-employment using common SNPs is approximately half of the estimate that was obtained using the classical twin design. First, the causal variants may be in regions of the genome that are currently not covered by the available SNP arrays. Second, it is possible that the genotyped SNPs and the causal variants are not in complete linkage disequilibrium because, for example, the true causal variants have on average lower minor allele frequencies than the genotyped SNPs. Yang et al. [52] provide evidence for this in the case of human height. They estimated that 45% of the variance in height is accounted for by common SNPs, while the heritability of height is consistently estimated to be approximately 80%. The authors then developed a method that estimated the variance that was accounted for by common SNPs, assuming imperfect linkage disequilibrium between the genotyped SNPs and the unobserved causal variants. This method revealed that 84% of the variance in height, the complete heritability, could be explained by the causal variants. Twin and family studies do not suffer from this issue, as genetic relatedness is inferred from the expected relationships within the pedigree and therefore includes all of the additive genetic variation. Both of these explanations imply that the estimates that we obtained for self-employment using the more novel method are lower bounds of the heritability that is commonly estimated in twin and family studies. A third, alternative explanation for the different results that were obtained using these techniques is that the twin-based heritability estimates are biased upwards because of, for example, genetic interactions [85] or a violation of the identical common environment assumption in twin studies [86].
Third, we perform the first meta-analysis of GWASs of an economic behavior (i.e., self-employment) using data from sixteen studies that together comprise approximately 50,000 participants. The discovery stage had 80% power to detect a variant at genome-wide significance with a minor allele frequency of 0.25 and odds ratios of approximately 1.11 for pooled males and females, 1.15 for males only, and 1.17 for females only [87], assuming we had a non-noisy, harmonized measure of self-employment across studies. Yet, we do not identify genome-wide significant associations. This result suggests that there are no common SNPs for self-employment with moderate to large effect sizes, thus placing an upper bound on the effect sizes of common SNPs that we can expect to exist. Gene-based tests for approximately 17,700 genes, including several candidate genes for entrepreneurship that have been previously suggested in the literature [48,79], do not reveal significant associations. In addition, we are unable to replicate a previously reported association for rs1486011, a SNP that is located in the DRD3 gene. This common variant was identified by Nicolaou et al. [48], who reported its association with the tendency to be an entrepreneur. The non-replication of associations is common in candidate gene studies of human traits and behaviors. This failure to identify replicable associations is likely due to a combination of underpowered sample sizes (due to optimistic assumptions regarding plausible effect sizes) and publication bias [88]. Examples of non-replication of candidate gene studies on complex human traits include general intelligence [81], personality [89][90][91][92][93][94], and trust [95,96]. We therefore stress that caution is warranted when interpreting claims from candidate gene studies of SNPs or genes with strong effects on complex behavioral traits like self-employment.
Finally, we report that a genetic score that was estimated in our meta-analysis sample has only limited predictive power in our replication study. The variance that was explained by the score was always lower than 0.26%. However, this result does not contradict our finding that approximately half of the narrow-sense heritability can be explained by common SNPs. This latter heritability analysis uses the measured SNPs to estimate realized relatedness between individuals, and given the large number of SNPs in a dense SNP array, realized relatedness can be estimated fairly accurately. In contrast, estimating a strongly predictive score from a sample requires good estimates of the effects of individual SNPs. If our discovery sample were infinitely large, it would have been possible to precisely estimate all of the SNP effects and to obtain a score with the theoretically highest possible predictive power, as estimated using the Yang et al. [52] method. The smaller the discovery sample, the noisier the estimates of the individual SNP effects; therefore, the predictive power of the score will be lower [97,98]. Our estimates of the effects of the individual SNPs are still too imprecise to allow out-of-sample prediction with SNP data that would have practical utility.
Together, our results demonstrate that common SNPs jointly account for a substantial share of the variance in the tendency to engage in self-employment (σg²/σP² = 25%). However, because we do not find specific SNPs in our large-scale meta-analyses of GWASs that examined self-employment, this heritability is not due to SNPs with moderate to large effects. A plausible interpretation of these results therefore appears to be that the molecular genetic architecture of self-employment is highly polygenic, implying that there are hundreds or thousands of variants that individually have a small effect and which together explain a substantial proportion of the heritability.
We cannot rule out the possibility that rare genetic variants, or other, currently unmeasured, variants that are insufficiently correlated with the SNPs on the genotyping platforms, have large effects on an individual's tendency to be self-employed. However, if these genetic variants are rare, they would still not contribute a great deal to the population-based variance in self-employment, and large samples would still be required to identify these variants [51,83,99].
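The argument about rare variants can be made concrete with the standard expression for the variance contributed by a single additive biallelic variant under Hardy-Weinberg equilibrium. In the sketch below, p is the allele frequency and β the per-allele effect in standard-deviation units of an assumed underlying liability scale; both sets of numbers are hypothetical and chosen only for illustration.

```latex
\[
\sigma^2_{\mathrm{SNP}} = 2p(1-p)\,\beta^2
\]
\[
\text{rare, large effect: } p = 0.001,\ \beta = 0.5
\;\Rightarrow\; 2(0.001)(0.999)(0.25) \approx 0.0005
\]
\[
\text{common, small effect: } p = 0.25,\ \beta = 0.05
\;\Rightarrow\; 2(0.25)(0.75)(0.0025) \approx 0.0009
\]
```

Under these assumed values, a rare variant with a tenfold larger per-allele effect still accounts for less population variance than the common variant, which is why large samples remain necessary to detect it.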
Our results are similar to those that have been reported for biologically more proximate human traits [51,52,[80][81][82] and diseases [76,83,84] for which a polygenic molecular genetic architecture has also been suggested. One implication of this similarity is that, with sufficiently large sample sizes, SNPs that are associated with self-employment (and possibly also other economic variables) can in principle be discovered, as has been the case for, e.g., height [100] and BMI [101]. However, a discovery sample of approximately 50,000 individuals is apparently still too small for a meta-analysis of GWASs on a biologically distal, complex, and relatively rare human behavior such as self-employment. A potential opportunity for future research is to conduct GWASs of endophenotypes such as risk preferences, confidence, and independence. The effect sizes of individual SNPs on these endophenotypes may be larger because of their greater biological proximity. However, these variables are difficult to measure reliably and not (yet) available in many genotyped samples.
Given the need for very large samples in meta-analyses of GWASs on complex traits, an important challenge of the present study was to identify a measure of entrepreneurship that is available in a sufficiently large sample. We opted to maximize the available sample size in this study and operationalized entrepreneurship as self-employment, which is also the most frequently used measure of entrepreneurship in the economics literature [102].
We included in the analysis every study we were aware of that had a measure of self-employment and was willing to contribute data, although this approach necessitated that data from diverse populations (e.g., Eastern German self-employed individuals and US business owners) were pooled. The available measures of self-employment varied across studies, including different single- and multiple-item measures, data from standalone surveys, and data from repeated measures or retrospective employment histories of the participants. For a number of studies, this approach resulted in a lack of detailed and reliable data regarding work-life history. Substantial measurement error, especially with respect to the definition of the control group, was therefore unavoidable. Ideally, the control group would encompass only participants who had never been self-employed and who will never be self-employed. Such an analysis would have required data regarding the complete work-life history of participants and participants who had reached an appropriate age. However, only data regarding current employment status were available in the majority of the contributing studies. It is therefore possible that there was a certain degree of misclassification in the studies that included only single-item, single-response measures of self-employment, thereby adding noise to the phenotype definition and potentially reducing the statistical power with respect to association detection.
Statistical power may have also been reduced by heterogeneity within the case group, as this group comprised individuals who became self-employed for very different reasons. For example, certain individuals may have chosen self-employment because they had no viable alternatives in paid employment, whereas others may have done so because of their desire to pursue a business opportunity. The motivations, goals, and resources of these two groups of individuals are obviously very different, and the genetics underlying these various characteristics may likewise differ greatly. Unfortunately, more detailed information regarding the motivations, activities, and success of entrepreneurs was unavailable for most of the genotyped samples.
In general, GWASs face a practical trade-off between phenotype quality and sample size. Surprisingly, statistical power calculations suggest that studying a noisier phenotype in a larger sample is often more likely to be successful than studying a perfect phenotype in a small sample. For example, assume that a common SNP exists with a minor allele frequency of 0.5 that increases the odds for all types of entrepreneurship by a factor of 1.13 on average (assuming 15% of the population are entrepreneurs and the data are population samples). The required sample size to detect this SNP with 80% power for a perfectly-measured outcome is approximately 30,000. Measuring entrepreneurship perfectly would require a lengthier survey that is administered more than once. Such a large genotyped sample with perfect measures of entrepreneurship does not currently exist. Smaller samples with perfect measures would be underpowered to detect the SNP. In contrast, if the available measures for entrepreneurship are noisy and have a test-retest reliability of only 0.6 (which is typical for behavioral traits measured by brief surveys [103][104][105]), 80% power to detect this SNP requires a discovery sample of approximately 50,000 individuals. Thus, our study was well-powered to detect effects of this magnitude even if there was substantial measurement error and noise in the data.
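The two sample sizes quoted above are consistent with a simple attenuation argument, sketched here as an assumption rather than as the calculation the power analysis actually used. If classical measurement error scales the variance explained by a SNP by the reliability ρ of the measure, then the non-centrality parameter of the association test is scaled by ρ as well, and the sample size needed for a given power grows by a factor of 1/ρ. The symbols N_perfect, N_noisy, and ρ are introduced here for illustration.

```latex
\[
N_{\text{noisy}} \approx \frac{N_{\text{perfect}}}{\rho}
= \frac{30{,}000}{0.6} = 50{,}000
\]
```

Under this approximation, moving from a perfect measure to one with reliability 0.6 raises the required sample by about two-thirds, a far smaller cost than the loss of sample size that perfect phenotyping would typically entail.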
The results of our study have three implications for this future research agenda. First, the high share of variance in self-employment that can be attributed to interpersonal differences in common SNPs suggests that this research agenda is in principle feasible. Second, to investigate if and how genes that are related to economic variables influence medical outcomes, it will be necessary in the future either to identify the specific genetic variants that underlie the heritability of economic variables (i.e., to investigate causal pathways from genes to medical outcomes), or to calculate genetic scores that have at least moderate out-of-sample predictive power (i.e., to investigate the medical consequences of a mismatch between genetic predisposition and economic outcomes). Even larger samples than those available in our present study will be needed to identify genome-wide significant SNPs and to estimate more accurate genetic scores for economic variables. Third, our results suggest that the effects of single SNPs on self-employment are likely to be very small. Given these effect sizes, statistical power calculations suggest that a research strategy that aims to maximize sample size by pooling data with slightly inaccurate measures of self-employment is more likely to be successful than a research strategy that aims to collect perfect phenotype measures in a much smaller sample. If successful, this research could shed new light on the complex interaction of genes, environment, and personal choices on health and longevity.
Table S8 Gene-based p-values for the top 25 genes associated with self-employment in the discovery meta-analysis for pooled males and females.

(DOC)
Table S9 Gene-based p-values for the top 25 genes associated with self-employment in the discovery meta-analysis for males only.

(DOC)
Table S10 Gene-based p-values for the top 25 genes associated with self-employment in the discovery meta-analysis for females only.

(DOC)
Table S11 Meta-analysis association results for SNP rs1486011 for pooled males and females, males only, and females only.