CRUMBLER: A tool for the prediction of ancestry in cattle

Tamar E. Crum; Robert D. Schnabel; Jared E. Decker; Luciana C. A. Regitano; Jeremy F. Taylor

doi:10.1371/journal.pone.0221471

Abstract

In many beef and some dairy production systems, crossbreeding is used to take advantage of breed complementarity and heterosis. Admixed animals are frequently identified by their coat color and body conformation phenotypes; however, without pedigree information it is not possible to identify the expected breed composition of an admixed animal and, in the presence of selection, the actual composition may differ from expectation. As the roles of DNA and genotype data become more pervasive in animal agriculture, a systematic method for estimating the breed composition (the proportions of an animal’s genome originating from ancestral pure breeds) has utility for a variety of downstream analyses including the estimation of genomic breeding values for crossbred animals, the estimation of quantitative trait locus effects, and heterosis and heterosis retention in advanced generation composite animals. Currently, there is no automated or semi-automated ancestry estimation platform for cattle. The objective of this study was to evaluate the utility of extant public software for ancestry estimation and to determine the effects of reference population size and composition and of the number of utilized single nucleotide polymorphism loci on ancestry estimation. We also sought to develop an analysis pipeline that would simplify this process for members of the livestock genomics research community. We developed and tested a tool, “CRUMBLER”, to estimate the global ancestry of cattle using ADMIXTURE and SNPweights based on a defined reference panel. CRUMBLER was developed and evaluated in cattle, but is a species-agnostic pipeline that facilitates the streamlined estimation of breed composition for individuals with potentially complex ancestries using publicly available global ancestry software and a specified reference population SNP dataset.
We developed the reference panel from a large cattle genotype data set and breed association pedigree information using iterative analyses to identify purebred individuals that were representative of each breed. We also evaluated the numbers of markers necessary for breed composition estimation and simulated genotypes for advanced generation composite animals to evaluate the precision of the developed tool. The developed CRUMBLER pipeline extracts a specified subset of genotypes that is common to all current commercially available genotyping platforms, processes these into the file formats required for the analysis software, and predicts admixture proportions using the specified reference population allele frequencies.

Citation: Crum TE, Schnabel RD, Decker JE, Regitano LCA, Taylor JF (2019) CRUMBLER: A tool for the prediction of ancestry in cattle. PLoS ONE 14(8): e0221471. https://doi.org/10.1371/journal.pone.0221471

Editor: Raluca Mateescu, University of Florida, UNITED STATES

Received: March 18, 2019; Accepted: August 7, 2019; Published: August 26, 2019

Copyright: © 2019 Crum et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: Allele frequency data are provided with CRUMBLER. Project Name: CRUMBLER; Project Home Page: https://github.com/tamarcrum/CRUMBLER; Programming Language: Python; Other Requirements: PLINK, EIGENSOFT, and SNPweights; License: GNU GPL.

Funding: JT and RS appreciate the support of NIH-USDA Dual Purpose with Dual Benefit Program grant number NIH 1R01HD084353. JT and RS are supported by USDA-NIFA grants 2013-68004-20364, 2015-67015-23183, 2016-67015-24923 and 2017-67015-26760. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Estimation of the breed composition of individuals with complex ancestries has utility for estimating breed direct and heterosis effects as well as for the estimation of the additive genetic merit of these individuals. It also has value for identifying the breed composition of training populations used for genomic selection and hence the identification of target breeds in which the developed prediction equations may have some relevance. Visual classification of cattle based on breed characteristics suffers from similar problems as the self-identification of ethnicity in humans [1], as most visible breed characteristics are determined by alleles at relatively few loci. For example, recent extensive crossing with Angus cattle in the U.S. produces a black hided animal which masks all other solid coat colors found in other breeds and requires only a single dominant allele at the MC1R locus. As a consequence, black-hided cattle have a “cryptic” population structure [1,2] and the visual classification of black-hided animals for branded beef programs can result in the marketing of animals with vastly different Angus genome content.

In the U.S. and many other countries, the breed of an animal is associated with its being registered with a breed association which requires that both parents of the animal be identified and also registered with the association. For the previous 50 years, parentage has been validated by each breed association using blood or, more recently, DNA typing. Many breed associations have closed herdbooks which means, in theory, that the pedigrees of all animals can be traced back to the animals that founded the breed’s herdbook. Other breed associations have open herdbooks, which means that crossbred animals can be registered with the breed if they have been graded up by crossbreeding to purebred status with the expectation that a certain percentage of their genome (e.g., 15/16ths) originates from the respective breed based upon pedigree records and parentage validation. Pedigree errors that occurred prior to, or that were not identified following the implementation of blood typing and DNA testing, lead to admixed animals being incorrectly classified as fullblood and incorrectly identified admixture proportions in purebred animals. The effects of recombination, random assortment of chromosomes into gametes and selection can also lead to considerable variation in the extent of identity by descent between relatives separated by more than a single meiosis and can also lead to admixture proportions that differ substantially from expectation based on pedigree.

Crossbreeding is extensively used in commercial beef production and in other livestock production systems to capitalize on the effects of breed complementarity and heterosis. The resulting herds of females may have very complex ancestries, and these herds frequently use fullblood or purebred bulls sourced from registered breeders. Changes in the decision as to which breed of bull to use can result in large changes in the admixture proportions of replacement cows and marketed steers between years, and large differences can occur between herds for the same reason. When commercially sourced animals are used to generate resource populations to study the genomics of economically important traits such as feed conversion efficiency [3,4] or bovine respiratory disease [5], the presence of extensive admixture in the phenotyped and genotyped animals may impact the genome-wide association analysis (GWAA) [3,4] and leads to the training of genomic prediction models in populations for which the breed composition is not understood. As a consequence, the utility of these models in other industry populations, including the registered breeds in which the majority of genetic improvement is generated, is also not understood.

As the number of genotyped beef animals has increased, the need to classify the breed composition of these animals has necessitated the development of a precise and accurate method for estimating breed composition in cattle based on single nucleotide polymorphism (SNP) data. Iterative ancestry estimation analyses performed using different software input parameters may identify those that cause output sensitivity and can lead to an interpretation of population structure that is close to the truth [6]. We developed the CRUMBLER analysis pipeline to streamline the genomic estimation of breed composition of crossbred cattle using high-density SNP genotype data, publicly available software, and a reference panel containing genotypes for members of cattle breeds that are numerically important in North America. The CRUMBLER pipeline is species agnostic and could be adapted for breed composition estimation in other species. CRUMBLER and the reference panel data are available on GitHub (https://github.com/tamarcrum/CRUMBLER). This pipeline tool is released under the GNU General Public License.

Materials and methods

Genotype data

From among the numerically most important cattle breeds in North America, in terms of their annual numbers of animal registrations, a list was compiled to define the target breeds for reference panel development. Composite breeds, such as Brangus and Braford, were not included in this list due to lack of available genotype data, but the progenitor Angus, Hereford and Brahman breeds were included. Breeds such as N’Dama, representing African taurine, and Nelore and Brahman, representing Bos taurus indicus cattle, were included. We also initially included breeds that were likely to be involved in early crossbreeding of cattle in the U.S. (Texas Longhorn).

From the 170,544 cattle with high-density SNP genotypes stored within the University of Missouri Animal Genomics genotype database, we extracted genotypes for 48,776 animals identified as being registered with one of the numerically important U.S. Breed Associations or belonging to other world breeds. Pedigree data were also obtained for these animals from each of the Breed Associations, where available (Table 1). These individuals had been genotyped using at least one of 9 different genotyping platforms currently used internationally to genotype cattle including the GeneSeek (Lincoln, NE) GGP-90KT, GGP-F250, GGP-HDV3, GGP-LDV1, GGP-LDV3, and GGP-LDV4 assays, the Illumina (San Diego, CA) BovineHD and BovineSNP50 assays, and the Zoetis (Kalamazoo, MI) i50K assay. The numbers of variants queried by each assay and the number of individuals genotyped using each platform are shown in Table 2.

Table 1. Genotype data for 48,776 registered individuals from 20 breeds were used to establish the reference population.

https://doi.org/10.1371/journal.pone.0221471.t001

Table 2. The number of variants queried by each assay and the number of individuals from the 20 reference breeds genotyped using each assay.

https://doi.org/10.1371/journal.pone.0221471.t002

Marker set determination

To maximize the utility of the developed breed assignment tool, we identified the intersection set of SNP markers located on the bovine assays for which we had available genotype data (Table 2). To retain as many SNP markers as possible for subsequent analysis, we first identified the intersection of markers present on the GGP-90KT, GGP-F250, GGP-HDV3, GGP-LDV3, GGP-LDV4, BovineHD, BovineSNP50 and i50K assays. This intersection set included 6,799 SNP markers (BC7K). However, during the process of identifying the animals that would define the breed reference panel, only 16 individuals had been genotyped using the GGP-LDV4 (n = 2) and GGP-LDV3 (n = 14) assays and no animals had been genotyped using the GGP-LDV1 assay. The intersection of the markers present on 5 assays (GGP-90KT, GGP-F250, GGP-HDV3, BovineHD, and BovineSNP50) comprised 13,291 markers (BC13K). By removing only the 16 individuals that had been genotyped on the GGP-LDV3 and GGP-LDV4 assays from the breed reference panel, we were able to compare ancestry predictions using two marker set densities (BC13K and BC7K).
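
The intersection step can be illustrated with a minimal Python sketch. It is not taken from the CRUMBLER source: the function name and the SNP identifiers are hypothetical, and in practice each list would be read from the marker-name column of an assay's PLINK MAP file.

```python
from functools import reduce


def common_markers(assay_markers):
    """Return the sorted SNP names that are present on every assay."""
    marker_sets = [set(names) for names in assay_markers.values()]
    if not marker_sets:
        return []
    return sorted(reduce(set.intersection, marker_sets))


# Toy marker lists (made-up SNP names, not real assay content).
assays = {
    "BovineHD": ["snp1", "snp2", "snp3", "snp4"],
    "BovineSNP50": ["snp2", "snp3", "snp4"],
    "GGP-LDV4": ["snp3", "snp4", "snp9"],
}
shared = common_markers(assays)  # markers usable for animals from any of the three assays
```

Dropping a low-density assay from the mapping can only enlarge the shared set, which is the trade-off between the BC7K and BC13K marker sets described above.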

Pipeline

The developed CRUMBLER pipeline integrates the tools and the computational efficiency of publicly available software, PLINK [7,8], EIGENSOFT [9,10] and SNPweights [11] to generate ancestry estimates (Fig 1). The pipeline integrates the often cumbersome processes of data reformatting and sequentially processing the data using analytical tools to generate ancestry proportions for targeted individuals based on a curated breed reference panel.

Fig 1. Flow diagram of the breed composition pipeline.

https://doi.org/10.1371/journal.pone.0221471.g001

PLINK

PLINK PED formatted genotypes are required as input to the pipeline. PLINK (v1.90b3.31) was used for data filtering and formatting. Genotypes can arise from any of the common bovine genotyping platforms (Table 2), provided that a PLINK compatible MAP file is provided for each assay and that each PED file includes data produced using only a single genotyping assay. The pipeline utilizes the PLINK marker filtering tool (--extract) to extract the user-specified marker subset for ancestry analysis. For analyses of animals genotyped on different genotyping platforms, the marker list representing the intersection of the platforms can be provided to extract the markers that are common to all assays. The pipeline allows multiple input genotype files and uses the PLINK merge genotype files tool (--merge) to combine genotypes into a single file for downstream analysis.
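
As an illustration of these two steps, the sketch below assembles PLINK 1.9 command lines in Python. It is a hedged example rather than the commands issued by CRUMBLER itself: the helper names and file prefixes are hypothetical, and the use of `--cow` (the PLINK 1.9 chromosome set for cattle) and `--recode` (to write PED/MAP output) are assumptions about a reasonable setup.

```python
def extract_command(prefix, marker_list, out_prefix):
    """Build a PLINK call that keeps only the markers named in marker_list."""
    return ["plink", "--cow", "--file", prefix,
            "--extract", marker_list, "--recode", "--out", out_prefix]


def merge_command(first_prefix, second_prefix, out_prefix):
    """Build a PLINK call that merges two PED/MAP filesets into one."""
    return ["plink", "--cow", "--file", first_prefix,
            "--merge", second_prefix + ".ped", second_prefix + ".map",
            "--recode", "--out", out_prefix]


# Hypothetical file names: one PED/MAP pair per genotyping assay.
commands = [
    extract_command("bovinehd", "bc7k_markers.txt", "bovinehd_bc7k"),
    extract_command("snp50", "bc7k_markers.txt", "snp50_bc7k"),
    merge_command("bovinehd_bc7k", "snp50_bc7k", "merged_bc7k"),
]
```

Each list can be passed to `subprocess.run(cmd, check=True)` once PLINK is on the search path; the commands are only constructed here, not executed.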

EIGENSOFT

The EIGENSOFT convertf package is used to convert all genotypes from PLINK PED format into EIGENSTRAT format which is required by the SNPweights software. To process the reference panel data, principal component analysis using EIGENSOFT smartpca is used to generate the eigenvalues and eigenvectors that are required to calculate SNP weights using SNPweights. However, the smartpca package included in EIGENSOFT versions beyond 5.0.2 is not compatible with SNPweights. SNPweights requires an input variable, “trace”, to be located in the log file output from the smartpca analysis. For versions of EIGENSOFT beyond 5.0.2, the source code can be edited to ensure that the log file output is compatible with the SNPweights software (S1 File).
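
The conversion step is driven by a small parameter file. The following sketch writes such a file from Python, using the parameter keys documented for convertf; the function name, file prefixes, and output extensions are illustrative assumptions and are not taken from the CRUMBLER source.

```python
def convertf_parameters(ped_prefix, out_prefix):
    """Return convertf parameter-file text for a PED to EIGENSTRAT conversion."""
    lines = [
        f"genotypename:    {ped_prefix}.ped",
        f"snpname:         {ped_prefix}.map",   # convertf also accepts a .pedsnp file here
        f"indivname:       {ped_prefix}.ped",   # convertf also accepts a .pedind file here
        "outputformat:    EIGENSTRAT",
        f"genotypeoutname: {out_prefix}.geno",
        f"snpoutname:      {out_prefix}.snp",
        f"indivoutname:    {out_prefix}.ind",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical prefixes; write the text to a file and run: convertf -p <file>
par_text = convertf_parameters("merged_bc7k", "merged_bc7k_eig")
```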

SNPweights

SNPweights implements an ancestry inference model based on genome-wide SNP weights computed using genotype data for an external panel of reference individuals. The SNP weights file only needs to be recalculated if the reference panel is changed. EIGENSTRAT formatted target animal genotypes are input into SNPweights, along with the precomputed reference panel SNP weights. The SNP weights are then applied to the target individuals to estimate their ancestry proportions [11].

Reference panel development

The definition of a set of reference individuals that define the genotype frequencies at each SNP variant for each reference breed is technically demanding, but vitally important to the process of defining ancestry. This process assumes that selection has not operated to change gene frequencies between target and reference population animals, and that each population is sufficiently large that drift has not impacted allele frequencies. It also assumes that migration between different countries does not influence population allele frequencies when registered animals are imported or exported. FastSTRUCTURE [12] analysis and iterations of animal filtering using SNPweights were performed using the genotypes of candidate reference panel individuals to remove individuals with significant evidence of admixture from the reference breed panel. An overview of the processes and iterations of filtering conducted in the development of this reference panel set is shown in S1 Fig and Table 1.

FastSTRUCTURE analysis to identify candidate reference panel individuals

Genotype data for 48,776 registered individuals produced by one of 8 different genotyping assays were available for fastSTRUCTURE analysis (Table 1) [12]. We initially performed focused fastSTRUCTURE analyses using small numbers of reference breeds including Angus and Simmental; Angus and Gelbvieh; Angus and Limousin; Angus and Red Angus; Red Angus, Hereford, Shorthorn and Salers; Red Angus, Hereford and Shorthorn; and N’Dama, Nelore and Brahman (S2–S8 Figs). Individuals possessing an ancestry assignment of at least 97% to their designated breed were retained for subsequent analysis (S2 File and Table 1). Following filtering based on fastSTRUCTURE breed assignment, 17,852 individuals representing 19 of the original breeds remained for further analysis (S2 File and S2–S8 Figs). All of the Salers animals were removed in this filtering analysis which is consistent with previous work that found that Salers and Limousin were very similar [4]. Variation in reference population sample sizes has been shown to substantially influence the estimation of the number of ancestral populations (K) in ancestry analyses [6,13,14]. To minimize this effect and produce similar sample sizes for each of the reference breeds, we randomly sampled 200 individuals from each reference breed for which at least 200 individuals remained after filtering on an ancestry assignment of at least 97%, otherwise all remaining individuals were included for the breed (Table 1). Following fastSTRUCTURE analysis using K = 19 after removal of Salers and using the BC7K marker set, Texas Longhorn was also removed from the reference panel breed list due to the inability to distinguish Texas Longhorn as a distinct population (Fig 2). Further, due to the known common ancestry [15] and similarity between Nelore and Brahman (Fig 2), the breeds were combined to represent Bos taurus indicus.
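
The 97% retention rule can be sketched as follows. This is a simplified illustration, assuming that the ancestry proportions for each animal have already been parsed from the fastSTRUCTURE output and labelled by breed; the record layout, identifiers, and function name are hypothetical.

```python
def retain_by_assignment(animals, threshold=0.97):
    """Keep IDs of animals whose ancestry assigned to their designated breed meets threshold."""
    kept = []
    for animal in animals:
        # Proportion of ancestry assigned to the breed the animal is registered with.
        proportion = animal["ancestry"].get(animal["breed"], 0.0)
        if proportion >= threshold:
            kept.append(animal["id"])
    return kept


# Toy records with made-up proportions.
candidates = [
    {"id": "A1", "breed": "Angus", "ancestry": {"Angus": 0.99, "Simmental": 0.01}},
    {"id": "A2", "breed": "Angus", "ancestry": {"Angus": 0.90, "Simmental": 0.10}},
    {"id": "S1", "breed": "Simmental", "ancestry": {"Angus": 0.03, "Simmental": 0.97}},
]
retained = retain_by_assignment(candidates)
```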

Fig 2. FastSTRUCTURE results for a random sample of ≤200 individuals per breed from the pool of 17,852 potential reference individuals at K = 19.

Breed identification is shown below each colored block and each animal is represented as a vertical line within the block (JBlack = Japanese Black, Rom = Romagnola, TXL = Texas Longhorn). Blocks are segregated by thick black vertical line to indicate the end of one breed and start of another.

https://doi.org/10.1371/journal.pone.0221471.g002

SNPweights analyses to refine and validate reference panel members

Random sampling of reference breed individuals was performed to create sample sets containing ≤n individuals per breed, for n = 50, 100, 150 and 200 individuals (Fig 3A and 3B and S9 and S10 Figs). Sampling was performed such that if a reference breed had ≥n candidates then n individuals were randomly sampled; otherwise, all available individuals were sampled. For each of the four samples, a self-assignment analysis was performed using the BC7K marker set, in which SNPweights was used to assign reference breed ancestries to the same individuals that were used to produce the SNP weights (Fig 3A and 3B and S9 and S10 Figs). In the self-assignment analyses conducted using the reference breed sample sets of ≤100 individuals per breed and ≤50 individuals per breed, 7 individuals were removed due to their estimated breed ancestry being ≤60% to their registry breed (Holstein n = 3, Jersey n = 1, Japanese Black n = 3) (Fig 3A and 3B).
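
A minimal sketch of this capped sampling scheme is given below. The function name, animal identifiers, and the fixed random seed are assumptions made for illustration and do not come from the CRUMBLER source.

```python
import random
from collections import defaultdict


def sample_per_breed(breed_of, n, seed=0):
    """Sample n animals per breed, or every animal when a breed has n or fewer."""
    rng = random.Random(seed)
    by_breed = defaultdict(list)
    for animal, breed in sorted(breed_of.items()):
        by_breed[breed].append(animal)
    sample = {}
    for breed, members in by_breed.items():
        if len(members) <= n:
            sample[breed] = sorted(members)
        else:
            sample[breed] = sorted(rng.sample(members, n))
    return sample


# Toy candidate set: five Angus and two Romagnola animals (made-up IDs).
breed_of = {f"ang{i}": "Angus" for i in range(5)}
breed_of.update({"rom0": "Romagnola", "rom1": "Romagnola"})
panel = sample_per_breed(breed_of, n=3)
```

With n = 3 the Angus are down-sampled while both Romagnola are kept, which mirrors how breeds with few genotyped candidates enter the reference panel in full.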

Fig 3. SNPweights self-assignment analysis results for reference panel sample sets.

Reference panel sets consisting of: (A) ≤100 individuals per breed, or (B) ≤50 individuals per breed. Seven individuals were filtered for ≤60% ancestry to their breed of registry (Holstein n = 3, Jersey n = 1, Japanese Black n = 3).

https://doi.org/10.1371/journal.pone.0221471.g003

Breeds with open herdbooks

For the Gelbvieh, Limousin, Shorthorn, Simmental, and Braunvieh breeds that have open U.S. herdbook registries, fullblood or 100% ancestry individuals were identified based on pedigree data obtained from the respective breed associations (Table 1). The term “fullblood” is used to identify cattle for which every ancestor is registered in the herdbook and can be traced back to the breed founders. The term “purebred” refers to animals that have been graded up via crossbreeding to purebred status. Charolais also has an open herdbook registry in the U.S.; however, access to Full French imported Charolais breed members was limited. As a result, all Charolais individuals identified as purebred in the association registry were retained for downstream analysis, although these individuals could contain up to 1/32 introgression from another breed. A random sample of 200 individuals was taken for each breed with more than 200 identified fullblood individuals; otherwise, all animals were sampled. For the open herdbook breeds, individuals previously included in the candidate reference panel following preliminary fastSTRUCTURE filtering were removed and replaced with the fullblood individuals.

Additional reference panel filtering using SNPweights

After filtering animals identified to not be fullblood based on their pedigree information, we randomly sampled ≤50 individuals per reference breed and utilized SNPweights to estimate weights for each sample and also to estimate breed ancestries for members of the same sample that was used to generate the SNP weights. Based on these analyses, we created 5 overlapping reference breed sets, each containing individuals with ≥90%, ≥85%, ≥80%, ≥75%, or ≥70% ancestry assignment to their registry breeds (Table 3).
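
Because the five sets are nested, the per-breed counts at each threshold can be tabulated in a single pass over the self-assignment results. The sketch below assumes the results are available as (breed, proportion) pairs; the function name and the toy values are hypothetical.

```python
from collections import Counter

THRESHOLDS = (0.90, 0.85, 0.80, 0.75, 0.70)


def counts_by_threshold(self_assignment, thresholds=THRESHOLDS):
    """Count animals per breed whose self-assigned ancestry meets each threshold.

    self_assignment: iterable of (registry breed, proportion assigned to that breed).
    """
    table = {t: Counter() for t in thresholds}
    for breed, proportion in self_assignment:
        for t in thresholds:
            if proportion >= t:
                table[t][breed] += 1
    return table


# Made-up self-assignment proportions.
results = [("Angus", 0.95), ("Angus", 0.86), ("Angus", 0.72), ("Hereford", 0.91)]
table = counts_by_threshold(results)
```

An animal counted at a stricter threshold is also counted at every more lenient one, so the counts for a breed can only grow as the threshold is relaxed.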

Table 3. Number of individuals for each reference breed assigned to their breed of registration by minimum ancestry threshold.

https://doi.org/10.1371/journal.pone.0221471.t003

Simulated genotypes

Using the phased BC7K genotypes for the final reference population of 803 individuals (3 Nelore genotyped with the BovineHD assay were removed because they were determined to cause problems for the phasing software), we simulated genotypes for 803 individuals in each generation (N = 1, 3, 5 and 10) by randomly sampling two individuals as parents from generation N-1 and sampling at random a single recombinant chromosome from each parent. The number of recombination events for each sampled chromosome was drawn from a Poisson distribution with mean equal to the chromosome length in Mb divided by 100 (e.g., 1.58 Morgans for chromosome 1). Simulated genotypes were produced for individuals 1 generation removed from the fullblood/purebred reference population animals (i.e., 50% breed A and 50% breed B) and for individuals 3, 5, and 10 generations removed, to evaluate the ability of CRUMBLER to detect large through to small admixture proportions in animals with increasing numbers of breeds represented in their ancestry. Breed composition estimates for these animals were obtained by tracing the breed of origin of every allele present in each generation N animal. For each marker, we attributed the genomic fragment from the center points of the intervals on each side of each marker to the breed of origin of the two alleles at each marker and summed these across all loci. Finally, we normalized these sums by dividing by the autosomal genome size using UMD3.1 coordinates.
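
The two core operations of the simulation, drawing a recombinant chromosome and converting per-allele breed of origin into a breed proportion, can be sketched for a single chromosome as follows. This is an illustrative simplification, not the simulation code used in the study: crossover positions are assumed to be uniformly distributed along the chromosome, the starting strand is chosen at random, at least two markers sorted by position are required, and proportions are normalized by the span covered by the markers rather than by the UMD3.1 autosomal genome size. All function names are hypothetical.

```python
import math
import random
from collections import defaultdict


def poisson(lam, rng):
    """Draw from Poisson(lam) using Knuth's multiplication method (fine for small lam)."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def recombinant_gamete(hap_a, hap_b, positions_mb, chrom_length_mb, rng):
    """Build one recombinant chromosome from a parent's two haplotypes.

    hap_a and hap_b hold one label per marker (here, the breed of origin).
    """
    n_cross = poisson(chrom_length_mb / 100.0, rng)  # mean = length in Morgans
    breaks = sorted(rng.uniform(0.0, chrom_length_mb) for _ in range(n_cross))
    haps = (hap_a, hap_b)
    current = rng.randrange(2)  # strand on which the gamete starts
    gamete, b = [], 0
    for i, pos in enumerate(positions_mb):
        while b < len(breaks) and breaks[b] < pos:
            current = 1 - current  # switch strands at each crossover passed
            b += 1
        gamete.append(haps[current][i])
    return gamete


def breed_proportions(positions_mb, origin_1, origin_2):
    """Share of a chromosome attributed to each breed using interval midpoints."""
    n = len(positions_mb)
    totals = defaultdict(float)
    for i in range(n):
        left = positions_mb[0] if i == 0 else (positions_mb[i - 1] + positions_mb[i]) / 2
        right = positions_mb[-1] if i == n - 1 else (positions_mb[i] + positions_mb[i + 1]) / 2
        for breed in (origin_1[i], origin_2[i]):
            totals[breed] += (right - left) / 2  # each allele carries half the fragment
    covered = positions_mb[-1] - positions_mb[0]
    return {breed: length / covered for breed, length in totals.items()}


positions = [0.0, 10.0, 20.0, 30.0]
props = breed_proportions(positions, list("AABB"), list("AAAA"))  # A: 0.75, B: 0.25
```

In the toy example one chromosome copy is entirely breed A and the other switches from A to B halfway along, so three quarters of the diploid segment is attributed to A.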

Results and discussion

The concept of breed and breed membership is man-made and does not inherently exist in nature. Moreover, the formation of breeds of cattle is very recent, as cattle domestication began about 10,000 years ago but the formation of herdbooks has occurred only during the last 200–250 years [16]. Nevertheless, the effects of drift and human selection over the last 200 years have caused sufficient divergence among breeds that breed differences are identifiable at the molecular level. Such signals are essential for breed ancestry analyses to be effective in modern admixed animals. Previous work on assigning breed composition in admixed cattle utilized 50K genotype data and a reference panel of 16 breeds, with the basis for reference panel inclusion being breed association registration [17]. However, the continual evolution of genotyping assays has led to content changes resulting in only a relatively small proportion of markers in common among assays. Consequently, there is a need to evaluate whether these markers are sufficient for breed content estimation, leading to their conservation in the design of future assays. Furthermore the development of an analytical pipeline based on these markers would simplify analysis for end-users and the use of a single reference panel would allow the direct comparison of results between applications.

Reference panel development

Previously developed cattle reference panels have relied on pedigree accuracy and breed association registration for their definition [17]. Conversely, we used an iterative approach for reference population curation that was able to validate the accuracy of the pedigree information used to identify candidates. FastSTRUCTURE analyses performed using the candidate individuals for each of the initial 19 reference breeds suggested population subdivision in both the Hereford and Simmental (Fig 2). Pedigree analysis for the Herefords within each subpopulation indicated that the subpopulations comprised animals from the highly inbred USDA Miles City Line 1 Hereford population (L1) and other individuals representing broader U.S. Hereford pedigrees. Since the founding of the L1 Herefords, the migration of germplasm has been unidirectional from L1 into the broader U.S. industry, as the L1 population has been closed since its founding [18]. L1 Herefords do not segregate for recessive dwarfism, which has been a threat to Hereford breeders since the 1950s, and this has led to L1 cattle becoming popular in the process of purging herds of the defect [19]. In 2008, the average proportion of U.S. registered Herefords influenced by L1 genetics was 81% [18].

The detected subpopulation division within the Simmental breed (Fig 2) represents the differentiation between purebred and fullblood animals. For example, progeny of a popular fullblood Simmental sire are present in both subpopulations; however, in one subpopulation the family members are all fullblood and in the other they are all purebred or percentage Simmental animals. This result supports the need to identify fullblood animals as reference panel breed representatives for breeds with open herdbooks.

Reference population sample size

By randomly sampling individuals from the candidate reference breed set and using SNPweights to assign these individuals to reference populations, we found that reference panel breed sample sizes of ≤50 or ≤100 individuals appeared to capture the diversity within each breed and appropriately determined the ancestry of the tested individuals (Fig 3A and 3B). For each breed, the percent ancestry predicted for the tested reference samples was, on average, 3.86% higher when the SNP weights were estimated using ≤50 individuals per breed than when ≤100 individuals per breed were used (Table 4). This reflects the increased homogeneity of individuals within each breed and a greater genetic distance between individuals from different breeds as smaller samples of individuals from each breed are used to define the reference panel. Further, due to limitations in the number of genotyped individuals for some breeds (Table 1), as the sample size was increased globally, imbalances were created between the reference panel breed sample sizes which impacted breed composition estimation (S9 and S10 Figs). It has previously been shown that the power to detect population structure improves as the reference population sample sizes become more similar [6,14].

Table 4. Ancestry proportion statistics for the self-assignment of reference panel members from samples of ≤50 or ≤100 individuals from the candidate reference breed individuals.

https://doi.org/10.1371/journal.pone.0221471.t004

Marker density

After the replacement of reference breed individuals with those identified to be fullblood based on pedigree analysis for the open herdbook Gelbvieh, Simmental, Limousin, Braunvieh, Shorthorn, and Charolais breeds, additional self-assignment analyses were conducted to evaluate the effects of marker set size on ancestry prediction. Breed reference panels were again constructed by randomly sampling ≤50 individuals per breed and SNP weights were calculated using both the BC13K markers and BC7K markers. The estimated SNP weights were then used to self-assign ancestry to the reference panel animals representing the reference breed set. The ancestry predictions for the reference breed individuals using either the BC7K (Fig 4A and S11 Fig) or BC13K (Fig 4B and S12 Fig) marker sets indicate that use of the BC13K marker set did not significantly impact the ancestry predictions. Consequently, the use of the 6,799 markers common to the 8 commercially available genotyping platforms appears to be sufficient to assign breed ancestry for the majority of animals produced in the U.S. The CRUMBLER pipeline can accommodate samples genotyped using alternative assays; however, the breed composition estimates produced will be based on the intersection of the markers on the assay and the BC7K marker set.

Fig 4. SNPweights self-assignment of ancestry for candidate reference breed individuals following evaluation of open herdbook breeds using: (A) the BC7K, or (B) the BC13K marker panels.

Reference breed panels were constructed by random sampling ≤50 individuals per breed and SNP weights were estimated using the BC7K and BC13K marker sets.

https://doi.org/10.1371/journal.pone.0221471.g004

Assignment thresholds

We next examined the effects of reference breed homogeneity on ancestry assignment by identifying reference panel members that had been assigned to their breed of registry using SNPweights with probabilities of ancestry of ≥90%, ≥85%, ≥80%, ≥75%, and ≥70%, respectively (Table 3). From these individuals, reference breed panels were obtained by randomly sampling ≤50 individuals per breed, until each individual was represented in at least one sample set. SNP weights were then estimated using the BC7K marker set and ancestry was assigned for these individuals using SNPweights (Figs 5 and 6 and S13–S15 Figs). Limiting the reference breed panel members to those individuals with ≥90% ancestry assigned to their breed of registry produced a reference panel that did not represent the extent of diversity within each of the breeds (Fig 5). On the other hand, using an ancestry assignment of ≥85% clearly captured greater diversity within each breed (Fig 6) and maximized the self-assignment of ancestry to the breed of registration (Table 5).
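
The repeated sampling described above, in which sample sets are drawn until every eligible individual has appeared in at least one set, might be sketched as follows. The stopping rule follows the text, but the function name, data layout, and seed are assumptions made for illustration.

```python
import random


def sample_until_covered(members_by_breed, n=50, seed=0):
    """Draw sample sets of at most n animals per breed until every animal has been sampled."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    unseen = {a for members in members_by_breed.values() for a in members}
    sample_sets = []
    while unseen:
        sample = {}
        for breed, members in members_by_breed.items():
            members = list(members)
            sample[breed] = members if len(members) <= n else rng.sample(members, n)
            unseen.difference_update(sample[breed])
        sample_sets.append(sample)
    return sample_sets


# Toy input: seven Angus and two Jersey animals passing a threshold (made-up IDs).
eligible = {"Angus": [f"a{i}" for i in range(7)], "Jersey": ["j1", "j2"]}
sets_drawn = sample_until_covered(eligible, n=3, seed=1)
```

Breeds with n or fewer eligible animals contribute all of them to every sample set, while larger breeds are resampled each time, so the number of sets needed is governed by the largest breed.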

Table 5. Average predicted ancestry and variance in predicted ancestry for candidate reference breed individuals when filtered on minimum predicted ancestry.

https://doi.org/10.1371/journal.pone.0221471.t005

Fig 5. Reference breed panel constructed by the random sampling of ≤50 individuals per breed from individuals with ≥90% ancestry was self-assigned to reference breed ancestry using the BC7K marker set.

https://doi.org/10.1371/journal.pone.0221471.g005

Fig 6. Reference breed panel constructed by the random sampling of ≤50 individuals per breed from individuals with ≥85% ancestry was self-assigned to reference breed ancestry using the BC7K marker set.

https://doi.org/10.1371/journal.pone.0221471.g006

Reference panel definition

To examine whether the specific individuals represented in the reference panel sample influenced the self-assignment of ancestry to the sampled individuals, a second sample of ≤50 distinct individuals per breed was obtained from the individuals with ≥85% assignment to their breed of registration and analyzed with SNPweights (Fig 7). Fig 7 indicates that the ability to predict ancestry was not influenced by the specific individuals sampled from the set of animals with ≥85% ancestry to their breed of registration.

Fig 7. Reference breed panel constructed by the independent random sampling of a second sample of ≤50 individuals per breed from individuals with ≥85% ancestry after eliminating individuals represented in the first sample was self-assigned to reference breed ancestry using the BC7K marker set.

https://doi.org/10.1371/journal.pone.0221471.g007

Additionally, Figs 6 and 7 suggest that the use of a reference breed panel constructed by the random sampling of ≤50 individuals per breed from individuals with ≥85% self-assigned ancestry to their breed of registration maintained sufficient within-breed diversity to accurately estimate the ancestry of target individuals. However, these figures also reveal small amounts of apparent introgression from other reference panel breeds within each of the breeds. This does not appear to be an issue of marker resolution, since the analyses performed with the BC7K and BC13K marker sets generated similar results (Fig 4). We conclude that these apparent introgressions either are due to a lack of power to discriminate among breeds using the common markers designed onto commercial genotyping platforms, or represent common ancestry among the breeds prior to the formation of breed herdbooks ~200 years ago. Molecular evidence for this shared ancestry exists. For example, Hereford and Angus cattle share the Celtic polled allele [20], and the segmental duplication responsible for the white anterior, ventral and dorsal coat color pattern occurs only in Hereford and Simmental cattle and their crosses [21]. These data clearly indicate that crossbreeding was widespread prior to the formal conceptualization of breeds.

Reference panel validation

To evaluate the ability of the selected reference breed panel to identify breed composition, an analysis was conducted for all 170,544 samples in the database (Figs 6 and 7). We extracted animals with pedigree information including fullblood and purebred animals registered with open herdbook breed associations and 2,243 crossbred animals with varying degrees of admixture. Considering the amount of available data, the number of pedigreed admixed animals was very limited and the purebred animals all had similar expected admixture proportions. Consequently, we next simulated genotypes for animals by assuming the random mating of members of the reference breed panel for 1, 3, 5 and 10 generations assuming non-overlapping generations to generate generations of animals with different numbers of breeds and breed proportions represented in their genomes.

Registered fullblood animals

For the Gelbvieh, Limousin, Shorthorn, Simmental, and Braunvieh breeds that have open herdbook registries, fullblood or 100% ancestry individuals were identified based on pedigree data obtained from the respective breed associations (Table 1). CRUMBLER estimates were obtained for these fullblood individuals and the distribution of estimates by breed is shown in Fig 8. For all breeds except Charolais, >50% of the individuals had CRUMBLER estimated percentages of ≥80% to their respective breeds. Average percentage estimates for fullblood Gelbvieh, Limousin, Shorthorn, Simmental, and Braunvieh individuals were 76%, 78%, 83%, 79%, and 85%, respectively (Fig 8B). However, the number of genotyped imported Full French Charolais animals was limited and so we also analyzed all purebred Charolais individuals, which could contain up to 1/32 of their genome introgressed from another breed. The average Charolais breed assignment was 72% and the distribution of estimates was more variable than for the fullblood animals from the other breeds (Fig 8B).

Fig 8. SNPweights ancestry assignments for 2,408 registered fullblood animals from open herd book breeds.

(A) Distribution by breed of SNPweights ancestry assignment results for 2,408 registered fullblood animals from open herd book breeds. (B) Pictorial representation of CRUMBLER estimates for 2,408 registered fullblood animals from open herd book breeds.

https://doi.org/10.1371/journal.pone.0221471.g008

Pedigreed crossbred animals

Based on pedigree, 2,005 individuals were identified as being primarily Hereford but with varying degrees of Red Angus, Salers, Angus or unknown other breed influence. The analysis results agreed with the pedigree data (Fig 9A and 9B). To investigate the correlations between pedigree and CRUMBLER estimated breed proportions, we removed proportions for breeds that were less than 3% and normalized the remaining values. CRUMBLER estimates were then correlated with the pedigree predicted estimates of the proportion of Hereford in these individuals (Fig 9C). CRUMBLER tended to underestimate the Hereford proportion as the pedigree estimated Hereford proportion approached 100%.
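The thresholding, renormalization and correlation steps can be sketched as follows. The function names and the dictionary layout of an animal's estimates are assumptions made for illustration, and Pearson correlation is used here as one plausible choice because the text does not state which correlation measure was applied.

```python
from math import sqrt


def filter_and_normalize(proportions, min_proportion=0.03):
    """Drop breeds below min_proportion and rescale the rest to sum to 1.

    proportions: dict mapping breed name -> estimated proportion (0 to 1).
    """
    kept = {breed: p for breed, p in proportions.items() if p >= min_proportion}
    total = sum(kept.values())
    if total == 0:
        return {}
    return {breed: p / total for breed, p in kept.items()}


def pearson(x, y):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)
```

For a set of animals, the normalized Hereford proportions returned by `filter_and_normalize` would be collected into one list, the pedigree-based Hereford proportions into another, and the two lists passed to `pearson`.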

Fig 9. SNPweights results for 2,005 crossbred Hereford individuals.

(A) SNPweights ancestry results for 2,005 crossbred Hereford individuals with a-priori breed composition estimates determined by pedigree. (B) Breed assignment reference breed key. (C) Hereford SNPweights estimated proportions using CRUMBLER are plotted against the pedigree estimates. Data point color indicates the breed for which SNPweights assigned the highest proportion for each individual.

https://doi.org/10.1371/journal.pone.0221471.g009

The remaining 238 crossbred individuals were commercial, advanced generation animals with an expected 50% Angus and 50% Simmental ancestry based on pedigree data. Results of the CRUMBLER analysis again support the pedigree data (Fig 10). The presence of Red Angus ancestry in these animals reveals the inability of the analysis to fully differentiate between Angus and Red Angus, which only diverged in the U.S. in 1954, and also the influence of Red Angus in the U.S. Simmental breed (S16 Fig).

Fig 10. SNPweights results for 238 crossbred individuals.

(A) SNPweights ancestry results for 238 crossbred individuals with a-priori breed composition estimates of 50% Angus and 50% Simmental based on a reference panel with ≤50 individuals per breed sampled from individuals with ≥85% assignment to their breed of registry. (B) Breed assignment for the crossbred individuals can be determined using this reference breed key.

https://doi.org/10.1371/journal.pone.0221471.g010

Simulated genotypes

Genomes were simulated using the phased genotypes for 803 individuals from the reference breed panel to contain varying breed numbers and admixture proportions after 1, 3, 5, and 10 generations of random mating with nonoverlapping generations. In generation 1, the admixed individuals were F₁ individuals with a 50:50 autosomal genome composition unless both parents were randomly sampled from the same breed. CRUMBLER estimates of breed composition using the simulated genotypes were strongly correlated with the simulated compositions, especially for generations 1 and 3 (Fig 11). As the number of generations increased, the number of breeds represented in the simulated genomes tended to increase, the proportion of the genome originating from any one breed tended to decrease, and the correlation between the simulated proportions and CRUMBLER estimates also decreased. Nevertheless, by generation 10, 44% of animals had their genome proportions estimated with a correlation of at least 70%. In the U.S., commercial crossbreeding does not usually involve the use of more than 3–4 breeds of cattle, and while the number of generations of crossbreeding may very well be 10 or perhaps more, many generations will involve the mating of animals with similar genome ancestries and the proportions for each breed will be much greater than present in the generation 10 animals in Fig 11. Consequently, the achieved accuracies are likely to be closer to the generation 3 or 5 results where 99% and 68% of animals, respectively, had their genome proportions estimated with a correlation of greater than 80%.
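The trend described above, with more breeds per genome and smaller per-breed proportions as generations accumulate, can be illustrated with a simplified sketch. This sketch is not the genotype simulation used in the study: it tracks only the pedigree-expected breed proportions (each offspring receives half of each parent's composition) and ignores phased haplotypes and recombination. The function name and parameters are hypothetical.

```python
import random


def simulate_expected_composition(founder_breeds, generations,
                                  offspring_per_generation, seed=0):
    """Random mating with non-overlapping generations.

    founder_breeds: list with one breed label per founder animal.
    Returns a list with one entry per generation; each entry is a list of
    dicts mapping breed -> expected genome proportion for one animal.
    """
    rng = random.Random(seed)
    # Founders are treated as 100% of their breed of registration.
    population = [{breed: 1.0} for breed in founder_breeds]
    history = []
    for _ in range(generations):
        next_generation = []
        for _ in range(offspring_per_generation):
            sire, dam = rng.sample(population, 2)  # two distinct parents
            child = {}
            for parent in (sire, dam):
                for breed, proportion in parent.items():
                    child[breed] = child.get(breed, 0.0) + 0.5 * proportion
            next_generation.append(child)
        population = next_generation  # parents are discarded
        history.append(population)
    return history
```

In generation 1 every simulated animal is either 50:50 for two breeds or 100% of one breed when both parents come from the same breed, which matches the description of the F₁ animals above.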

Fig 11. SNPweights results using simulated genotypes.

Genotypes were simulated for the indicated number of generations of random mating, with generation 1 (G1) animals being 50:50 proportion except when two parents from the same breed were mated. SNPweights results were obtained using CRUMBLER pipeline parameters. Correlations between these estimates and the known simulated breed compositions were produced, and the proportion of individuals within each correlation class is indicated.

https://doi.org/10.1371/journal.pone.0221471.g011

Advanced generation composite animals

The ancestry model assumes that neither drift nor selection has acted to alter the allele frequencies from those created by the initial admixture proportions. We examined CRUMBLER estimates of breed composition for advanced generation members of the Brangus (n = 11,362), Beefmaster (n = 3,832) and Santa Gertrudis (n = 2,010) composite breeds where selection has had the opportunity to change breed composition from expectations at breed formation. Brangus individuals are expected to be ⅝ Angus and ⅜ Brahman, Beefmaster individuals ¼ Hereford, ¼ Shorthorn and ½ Brahman, and Santa Gertrudis ⅝ Shorthorn and ⅜ Brahman, respectively. These breeds use mating strategies that produce individuals that are expected to possess these proportions for registration within each of the respective breed’s herdbook. However, registerable animals are ultimately advanced generation composites and so drift, meiotic sampling of parental chromosomes and selection are all expected to create individual variation in these ancestry proportions. CRUMBLER results for these advanced generation composites, also known as the American breeds, are shown in Fig 12. Table 6 contains the average breed proportion estimates assigned to each of these breeds by CRUMBLER and their standard deviations across the animals analyzed for each breed. In every instance, CRUMBLER underestimates the expected proportions for each of the American breed populations; however, the ancestral breeds clearly dominate the assignments (Table 6). Interestingly, on average, CRUMBLER estimated proportions of Holstein ancestry for advanced generation Beefmaster and Brangus animals (Fig 12 and Table 6). These American breeds do not contain any Holstein introgression and they do not contain ancestry from a “Ghost Population”, a population that is not present in the reference set, which would lead to a breed assignment to the reference breed that it most closely resembled [6].
We speculate that this effect is caused by selection creating a deviation in allele frequencies from those found in the founder breeds which the model explains by an introgression from a distantly related breed, in this case, Holstein. Stratifying these genotyped animals according to the number of generations from foundation fullblood animals and examining the extent of estimated Holstein introgression, which would be expected to increase with generation number, would enable this to be tested, but we did not have access to the necessary data. However, this hypothesis is supported by the fact that the Santa Gertrudis had the least estimated Holstein introgression and the breed has published estimates of additive genetic merit for many fewer years than the Beefmaster or Brangus.

Table 6. Average breed ancestry percentages assigned to American Breed individuals.

https://doi.org/10.1371/journal.pone.0221471.t006

Fig 12. SNPweights results for American Breed populations, Brangus, Beefmaster, and Santa Gertrudis.

(A) SNPweights ancestry results using CRUMBLER pipeline for 11,362 Brangus, 3,832 Beefmaster, and 2,010 Santa Gertrudis individuals. (B) Breed assignment for these advanced generation composite animals can be determined using this reference breed key.

https://doi.org/10.1371/journal.pone.0221471.g012

Admixture

We also tested the ADMIXTURE software [22] for ancestry estimation and integration into the CRUMBLER pipeline using the same reference breed panel that was developed for use with SNPweights. ADMIXTURE uses maximum likelihood estimation to fit the same statistical model as STRUCTURE; however, STRUCTURE does not allow the specification of individuals of known descent to be used as a reference panel [22]. ADMIXTURE allows a supervised analysis, in which the user can specify a reference set of individuals, by specifying the “--supervised” flag and requires an additional file with a “.pop” suffix to specify the population membership of the reference population individuals [22]. Unlike SNPweights, the reference population individuals’ genotypes must be provided in a genotype file for each analysis.
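The supervised set-up can be sketched as below. As we understand the ADMIXTURE manual, the “.pop” file has one line per individual in the same order as the PLINK “.fam” file, holding a population label for reference individuals and “-” for individuals whose ancestry is to be estimated. The helper names, the dictionary of reference breeds and the file names are hypothetical, and the command is only assembled here, not executed.

```python
def make_pop_lines(fam_lines, reference_breed):
    """Build the lines of a .pop file from the lines of a PLINK .fam file.

    fam_lines: iterable of .fam lines (family ID, individual ID, ...).
    reference_breed: dict mapping individual ID -> breed label for
    reference animals only (hypothetical input).
    """
    pop_lines = []
    for line in fam_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        individual_id = fields[1]
        # "-" marks a target animal with unknown ancestry.
        pop_lines.append(reference_breed.get(individual_id, "-"))
    return pop_lines


def supervised_command(bed_path, n_populations):
    """Assemble (but do not run) a supervised ADMIXTURE command line."""
    return ["admixture", "--supervised", bed_path, str(n_populations)]
```

A typical use would write `"\n".join(make_pop_lines(...))` to a file sharing the base name of the “.bed” file and pass the number of distinct reference breeds as `n_populations`. Both conventions should be checked against the manual for the installed ADMIXTURE version.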

We first conducted an ADMIXTURE analysis in which we self-assigned ancestry for the animals in the reference breed set formed with ≤50 individuals per breed from the individuals that had ≥85% assignment to the breed of registration (Fig 13). The results shown in Fig 13 are similar to those in Fig 6 for the same reference panel, albeit with perhaps less evidence of background introgression. We next conducted an analysis using the reference panel used in Fig 13 merged with data for the 2,005 high percentage crossbred Hereford animals. The results shown in Fig 14 reveal a significant change in the ancestry proportions estimated for the reference panel Guernsey, Gelbvieh and Romagnola individuals between the two analyses, which used exactly the same reference panel and differed only in the number of individuals for which ancestry was to be estimated. This suggests that ADMIXTURE may use the target individuals to update information provided by the reference panel individuals specified in the “.pop” file. Consequently, the ADMIXTURE estimated ancestry proportions appear to be context dependent and may vary based on the other individuals included in the analysis.

Fig 13. Self-assignment of ancestry for the animals in the reference breed set formed with ≤50 individuals per breed from the individuals that had ≥85% assignment to their breed of registration using ADMIXTURE.

https://doi.org/10.1371/journal.pone.0221471.g013

Fig 14. ADMIXTURE analysis conducted using the same data as shown in Fig 13 (first four rows), merged with an additional 2,005 high percentage crossbred Hereford target individuals (last row).

Here, the 2,005 Hereford crossbred individuals appear after the reference individuals in the input genotype file.

https://doi.org/10.1371/journal.pone.0221471.g014

Moreover, the order in which the target individuals appear in the genotype input file also appears to affect ADMIXTURE estimates of ancestry proportions for the target individuals. Fig 15 shows the results of an ADMIXTURE analysis in which the target individuals were identical to those shown in Fig 14, but for which the order of the reference individuals and the 2,005 Hereford crossbred individuals was reversed in the input files. In Fig 14, the reference individuals appear before the 2,005 Hereford crossbred individuals in the input file, whereas in Fig 15, the 2,005 Hereford crossbred individuals appeared before the reference individuals in the input file. The results reveal a significant change in ancestry proportions for Guernsey and Gelbvieh, but the Romagnola now appear to be non-admixed. Finally, we performed an ADMIXTURE analysis for these animals in which the order of animals in the input genotype file was completely randomized (Fig 16). Following analysis, the individuals were sorted to generate Fig 16. Again, the ancestry proportions for the Guernsey, Gelbvieh and Romagnola individuals suggest these breeds to be admixed.
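A check of this kind can be sketched by matching individuals across two runs by sample identifier and reporting the largest change in any ancestry proportion. The sketch assumes that each run yields one row of proportions per individual in the order of its input file, and that the columns refer to the same populations in both runs. The function name is hypothetical.

```python
def max_abs_difference(ids_a, q_a, ids_b, q_b):
    """Largest absolute change in any ancestry proportion between two runs.

    ids_a, ids_b: sample IDs in the input order used for each run.
    q_a, q_b: matching lists of per-individual proportion rows.
    """
    run_a = dict(zip(ids_a, q_a))
    run_b = dict(zip(ids_b, q_b))
    if set(run_a) != set(run_b):
        raise ValueError("the two runs do not contain the same individuals")
    # Compare each individual's row in run A with its own row in run B.
    return max(
        abs(p_a - p_b)
        for sample in run_a
        for p_a, p_b in zip(run_a[sample], run_b[sample])
    )
```

If the estimates were insensitive to input order, the returned value would be close to zero apart from numerical noise. Larger values flag individuals or breeds whose assignments shift when the file is reordered.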

Fig 15. ADMIXTURE analysis conducted using the same data as shown in Fig 14.

Here, the 2,005 Hereford crossbred individuals appear before the reference individuals in the input genotype file. The first row represents the 2,005 Hereford crossbred samples. Rows 2 to 5 show the reference panel individuals.

https://doi.org/10.1371/journal.pone.0221471.g015

Fig 16. ADMIXTURE analysis conducted using the same data as shown in Figs 14 and 15, but with the order of the individuals in the input genotype file randomized.

The animals were sorted following analysis to generate this figure, where the first four rows represent the reference panel individuals and the fifth row shows the 2,005 Hereford crossbred animals.

https://doi.org/10.1371/journal.pone.0221471.g016

STRUCTURE and ADMIXTURE are widely used for characterizing admixed populations [6], but we have not found any reports in the literature indicating that the software is sensitive to the input order of individuals. We suspect that the majority of users would have no need or motivation to run the software with permuted data input files. Nevertheless, because of these inconsistencies between results, we chose not to use ADMIXTURE for ancestry estimation within the CRUMBLER pipeline.

Broader application using additional commercially available assays

To broaden the spectrum of data from different commercially available assays that can be evaluated, an additional intersection of markers was obtained using 11 commercially available bovine assays including the GGP-90KT, GGP-F250, GGP-HDV3, GGP-LDV3, GGP-LDV4, BovineHD, BovineSNP50, i50K, Irish Cattle Breeding Federation (Cork, Ireland) IDBv3, and GeneSeek (Lincoln, NE) BOVG50v1 assays. The intersection SNP set included 6,363 SNPs (BC6K). A SNPweights self-assignment analysis using the reference set of individuals with ≥85% assignment to their breed of registration was conducted to assess the effects of the reduction in number of markers used for ancestry assignment. The ancestry proportions assigned based on the BC6K marker set (Fig 17) did not differ appreciably from those obtained using the BC7K marker set (Fig 6). This result indicates the utility of CRUMBLER and the reference panel breed set across the spectrum of commercially available genotyping platforms.
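Obtaining the marker intersection amounts to a set intersection over the assay manifests, as in the sketch below. The function name and input layout are assumptions. The sketch also assumes that the same SNP carries the same identifier on every assay, which may not hold for real manifests, where matching on chromosome and position may be needed instead.

```python
def common_markers(assay_manifests):
    """Return the set of SNP identifiers present on every assay.

    assay_manifests: dict mapping assay name -> iterable of SNP identifiers
    (hypothetical layout; identifiers must be comparable across assays).
    """
    marker_sets = [set(markers) for markers in assay_manifests.values()]
    if not marker_sets:
        return set()
    return set.intersection(*marker_sets)
```

Adding an assay can only keep the intersection the same size or shrink it, which is consistent with the smaller BC6K set obtained here relative to BC7K.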

Fig 17. Reference breed panel constructed by the random sampling of ≤50 individuals per breed from individuals with ≥85% ancestry was self-assigned to reference breed ancestry using the BC6K marker set.

https://doi.org/10.1371/journal.pone.0221471.g017

Conclusions

The determination of a set of reference population breeds and individuals that define allele and genotype frequencies at each variant for each of the breeds is arguably the most important, yet technically difficult, step in the process of ancestry estimation. We employed several iterations of filtering to remove recently admixed individuals and identify a relatively homogeneous set of individuals that nevertheless represented the variation that might be expected among individuals within a breed. Once determined, the reference panel genotype data need only be processed once to obtain SNP weights, removing the need to share genotype data for reference individuals in subsequent studies [11]. The upfront development of an external reference breed panel capitalizes on the rich ancestry information available in large datasets, and relatedness, variation in sample sizes and diversity among the target individuals do not affect the inference of ancestry [11].

In cattle, the visual evaluation of breed characteristics is a poor method for evaluating the ancestry of individuals. Breed association pedigrees can be used to estimate expected breed compositions, however, the random assortment of chromosomes into gametes and selection can lead to ancestry proportions that differ from those expected based upon pedigree. Moreover, the vast majority of commercial beef cattle in the U.S. have no or very limited pedigree information and since these animals are frequently used for genomic research [3–5], there is a need for a tool that can routinely provide ancestry estimates for downstream use in GWAA or other genetic studies.

We tested ADMIXTURE and SNPweights and found that results from ADMIXTURE appear to depend on the ancestry and order of appearance of individuals within the genotype input file. We therefore developed an analysis pipeline, CRUMBLER, based upon PLINK, EIGENSOFT and SNPweights to automate the process of ancestry estimation. The developed bovine pipeline utilizes the 6,799 SNPs present on 8 commercially utilized bovine SNP genotyping assays and results using these SNPs are consistent with results obtained when 13,291 SNPs were used. From an available 48,776 genotyped individuals, we also developed a reference panel of 806 individuals sampled from 17 breeds to have ≤50 individuals per breed that had ≥85% assignment to their breed of registration. This panel appears to allow the robust estimation of the ancestry of advanced generation admixed animals; however, all breeds share some common ancestry which predates the recent development of breed association herdbooks [16,23]. The greatest constraint that we faced in the development of the reference panel was the unequal sample sizes for genotyped registered animals and the sensitivity of ancestry software to unequal sample sizes. If access could be gained to the very large samples of genotyped animals from each of the U.S. breed associations, we suspect that simply taking a random sample from each population, such that the sample size within each breed was 4–5× the effective population size (N_e ≈ 100 for each breed), would effectively create representative samples for the reference panel. Furthermore, our panel of 6,799 SNPs represents the intersection of markers on the commonly used genotyping assays and these SNPs are probably represented on all assays because they have high call rates and minor allele frequencies in the majority of breeds. Consequently, these SNPs are probably far from being desirable as ancestry informative markers.
A future research direction would be to identify a relatively small set of ancestry informative markers available either on currently available genotyping assays or, preferably, from whole genome sequencing projects such as the 1000 Bull Genomes project and include these ancestry informative markers on future design iterations of bovine genotyping assays.

CRUMBLER is not limited to application in cattle and with the provision of suitable reference breed allele frequencies can be applied to other species for ancestry estimation. CRUMBLER pipeline scripts and reference panel breed SNP weights are available on GitHub (https://github.com/tamarcrum/CRUMBLER).

Supporting information

S1 File. Supplementary information (PDF).

This file contains the source code changes in SMARTPCA within versions of EIGENSOFT beyond 5.0.2 to enable compatibility with SNPweights.

https://doi.org/10.1371/journal.pone.0221471.s001

(PDF)

S2 File. Supplementary methods (PDF).

This file describes the preliminary fastSTRUCTURE analyses conducted on subsamples of breeds in the development of the reference breed panel.

https://doi.org/10.1371/journal.pone.0221471.s002

(PDF)

S1 Fig. An overview of the processes and iterations of filtering conducted in the development of the reference panel.

https://doi.org/10.1371/journal.pone.0221471.s003

(PDF)

S2 Fig. Preliminary fastSTRUCTURE analysis of candidate Angus and Simmental reference population animals.

https://doi.org/10.1371/journal.pone.0221471.s004

(PDF)

S3 Fig. Preliminary fastSTRUCTURE analysis of candidate Angus and Gelbvieh reference population animals.

https://doi.org/10.1371/journal.pone.0221471.s005

(PDF)

S4 Fig. Preliminary fastSTRUCTURE analysis of candidate Angus and Limousin reference population animals.

https://doi.org/10.1371/journal.pone.0221471.s006

(PDF)

S5 Fig. Preliminary fastSTRUCTURE analysis of candidate Angus and Red Angus reference population animals.

https://doi.org/10.1371/journal.pone.0221471.s007

(PDF)

S6 Fig. Preliminary fastSTRUCTURE analysis of candidate Red Angus, Hereford, Shorthorn and Salers reference population animals.

https://doi.org/10.1371/journal.pone.0221471.s008

(PDF)

S7 Fig. Preliminary fastSTRUCTURE analysis of candidate Red Angus, Hereford and Shorthorn reference population animals.

https://doi.org/10.1371/journal.pone.0221471.s009

(PDF)

S8 Fig. Preliminary fastSTRUCTURE analysis of candidate N’Dama, Nelore and Brahman reference population animals.

https://doi.org/10.1371/journal.pone.0221471.s010

(PDF)

S9 Fig. SNPweights self-assignment analysis for the reference sample set containing ≤200 individuals per breed analyzed using the BC7K marker set.

https://doi.org/10.1371/journal.pone.0221471.s011

(PDF)

S10 Fig. SNPweights self-assignment analysis for the reference sample set containing ≤150 individuals per breed analyzed using the BC7K marker set.

https://doi.org/10.1371/journal.pone.0221471.s012

(PDF)

S11 Fig. SNPweights self-assignment analysis for the reference sample set containing ≤50 individuals per breed analyzed using the BC7K marker set.

https://doi.org/10.1371/journal.pone.0221471.s013

(PDF)

S12 Fig. SNPweights self-assignment analysis for the reference sample sets containing ≤50 individuals per breed analyzed using the BC13K marker set.

https://doi.org/10.1371/journal.pone.0221471.s014

(PDF)

S13 Fig. SNPweights self-assignment analysis for the reference sample set with ≥80% ancestry to breed of registry and ≤50 individuals per breed using the BC7K marker set.

https://doi.org/10.1371/journal.pone.0221471.s015

(PDF)

S14 Fig. SNPweights self-assignment analysis for reference sample set with ≥75% ancestry to breed of registry and ≤50 individuals per breed using the BC7K marker set.

https://doi.org/10.1371/journal.pone.0221471.s016

(PDF)

S15 Fig. SNPweights self-assignment analysis for the reference sample set with ≥70% ancestry to breed of registry and ≤50 individuals per breed using the BC7K marker set.

https://doi.org/10.1371/journal.pone.0221471.s017

(PDF)

S16 Fig. SNPweights self-assignment analyses using a reference panel with ≤50 individuals per breed and sampling from the individuals with ≥85% assignment to their breed of registry but with (a) Red Angus or (b) Angus excluded from the reference panel.

https://doi.org/10.1371/journal.pone.0221471.s018

(PDF)

References

1. Burnett MS, Strain KJ, Lesnick TG, de Andrade M, Rocca WA, Maraganore DM. Reliability of self-reported ancestry among siblings: implications for genetic association studies. Am J Epidemiol. 2006 Mar 1;163(5):486–92. pmid:16421243
2. Pritchard JK, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data. Genetics. 2000 Jun;155(2):945–59. pmid:10835412
3. Saatchi M, Beever JE, Decker JE, Faulkner DB, Freetly HC, Hansen SL, et al. QTLs associated with dry matter intake, metabolic mid-test weight, growth and feed efficiency have little overlap across 4 beef cattle studies. BMC Genomics. 2014 Nov 20;15:1004. pmid:25410110
4. Seabury CM, Oldeschulte DL, Saatchi M, Beever JE, Decker JE, Halley YA, et al. Genome-wide association study for feed efficiency and growth traits in U.S. beef cattle. BMC Genomics. 2017 May 18;18(1):386. pmid:28521758
5. Neibergs HL, Seabury CM, Wojtowicz AJ, Wang Z, Scraggs E, Kiser JN, et al. Susceptibility loci revealed for bovine respiratory disease complex in pre-weaned holstein calves. BMC Genomics. 2014 Dec 22;15:1164. pmid:25534905
6. Lawson DJ, van Dorp L, Falush D. A tutorial on how not to over-interpret STRUCTURE and ADMIXTURE bar plots. Nat Commun. 2018 Aug 14;9(1):3258. pmid:30108219
7. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007 Sep;81(3):559–75. pmid:17701901
8. Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience. 2015 Feb 25;4:7. pmid:25722852
9. Patterson N, Price AL, Reich D. Population structure and eigenanalysis. PLoS Genet. 2006 Dec;2(12):e190. pmid:17194218
10. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006 Aug;38(8):904–9. pmid:16862161
11. Chen C-Y, Pollack S, Hunter DJ, Hirschhorn JN, Kraft P, Price AL. Improved ancestry inference using weights from external reference panels. Bioinformatics. 2013 Jun 1;29(11):1399–406. pmid:23539302
12. Raj A, Stephens M, Pritchard JK. fastSTRUCTURE: Variational Inference of Population Structure in Large SNP Data Sets. Genetics. 2014 Jun 1;197(2):573–89. pmid:24700103
13. Wang J. The computer program structure for assigning individuals to populations: easy to use but easier to misuse. Mol Ecol Resour. 2017 Sep;17(5):981–90. pmid:28028941
14. Puechmaille SJ. The program structure does not reliably recover the correct population structure when sampling is uneven: subsampling and new estimators alleviate the problem. Mol Ecol Resour. 2016 May;16(3):608–27. pmid:26856252
15. Sanders JO. History and Development of Zebu Cattle in the United States. J Anim Sci. 1980 Jun 1;50(6):1188–200.
16. Decker JE, McKay SD, Rolf MM, Kim J, Molina Alcalá A, Sonstegard TS, et al. Worldwide patterns of ancestry, divergence, and admixture in domesticated cattle. PLoS Genet. 2014 Mar;10(3):e1004254. pmid:24675901
17. Kuehn LA, Keele JW, Bennett GL, McDaneld TG, Smith TPL, Snelling WM, et al. Predicting breed composition using breed frequencies of 50,000 markers from the US Meat Animal Research Center 2,000 Bull Project. J Anim Sci. 2011 Jun;89(6):1742–50. pmid:21278116
18. Leesburg VLR, MacNeil MD, Neser FWC. Influence of Miles City Line 1 on the United States Hereford population. J Anim Sci. 2014 Jun;92(6):2387–94. pmid:24867928
19. McCann LP. battle of bull runts. 1974; http://agris.fao.org/agris-search/search.do?recordID=US201300539215
20. Wiedemar N, Tetens J, Jagannathan V, Menoud A, Neuenschwander S, Bruggmann R, et al. Independent polled mutations leading to complex gene expression differences in cattle. PLoS One. 2014 Mar 26;9(3):e93435. pmid:24671182
21. Whitacre L. Structural variation at the KIT locus is responsible for the piebald phenotype in Hereford and Simmental cattle. 2014; http://search.proquest.com/openview/45eba5fa3c5757a2c4c2ab18af1a8a98/1?pq-origsite=gscholar&cbl=18750&diss=y
22. Alexander DH, Novembre J, Lange K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 2009 Sep;19(9):1655–64. pmid:19648217
23. Decker JE, Pires JC, Conant GC, McKay SD, Heaton MP, Chen K, et al. Resolving the evolution of extant and extinct ruminants with high-throughput phylogenomics. Proc Natl Acad Sci U S A. 2009 Nov 3;106(44):18644–9. pmid:19846765



Supporting information

S1 File. Supplementary information (PDF).

S2 File. Supplementary methods (PDF).

S1 Fig. An overview of the processes and iterations of filtering conducted in the development of the reference panel.

S2 Fig. Preliminary fastSTRUCTURE analysis of candidate Angus and Simmental reference population animals.

S3 Fig. Preliminary fastSTRUCTURE analysis of candidate Angus and Gelbvieh reference population animals.

S4 Fig. Preliminary fastSTRUCTURE analysis of candidate Angus and Limousin reference population animals.

S5 Fig. Preliminary fastSTRUCTURE analysis of candidate Angus and Red Angus reference population animals.

S6 Fig. Preliminary fastSTRUCTURE analysis of candidate Red Angus, Hereford, Shorthorn and Salers reference population animals.

S7 Fig. Preliminary fastSTRUCTURE analysis of candidate Red Angus, Hereford and Shorthorn reference population animals.

S8 Fig. Preliminary fastSTRUCTURE analysis of candidate N’Dama, Nelore and Brahman reference population animals.

S9 Fig. SNPweights self-assignment analysis for the reference sample set containing ≤200 individuals per breed analyzed using the BC7K marker set.

S10 Fig. SNPweights self-assignment analysis for the reference sample set containing ≤150 individuals per breed analyzed using the BC7K marker set.

S11 Fig. SNPweights self-assignment analysis for the reference sample set containing ≤50 individuals per breed analyzed using the BC7K marker set.

S12 Fig. SNPweights self-assignment analysis for the reference sample sets containing ≤50 individuals per breed analyzed using the BC13K marker set.

S13 Fig. SNPweights self-assignment analysis for the reference sample set with ≥80% ancestry to breed of registry and ≤50 individuals per breed using the BC7K marker set.

S14 Fig. SNPweights self-assignment analysis for the reference sample set with ≥75% ancestry to breed of registry and ≤50 individuals per breed using the BC7K marker set.

S15 Fig. SNPweights self-assignment analysis for the reference sample set with ≥70% ancestry to breed of registry and ≤50 individuals per breed using the BC7K marker set.

S16 Fig. SNPweights self-assignment analyses using a reference panel with ≤50 individuals per breed and sampling from the individuals with ≥85% assignment to their breed of registry but with (a) Red Angus or (b) Angus excluded from the reference panel.
