Corresponding author at: Department of Genetics, Yale University School of Medicine, 333 Cedar Street, PO Box 208005, New Haven, CT 06520-8005, USA. Tel.: +1 203 785 2654; fax: +1 203 785 6568.
1 Current address: Biology for Global Good, 420 Ventura Place, San Ramon, CA, USA. 2 Current address: Department of Health Sciences Research, Mayo Clinic College of Medicine, Rochester, MN 55905, USA.
Many panels of ancestry informative single nucleotide polymorphisms have been proposed in recent years for various purposes including detecting stratification in biomedical studies and determining an individual's ancestry in a forensic context. All of the panels have limitations in their generality and efficiency for routine forensic work. Some panels have used only a few populations to validate them. Some panels are based on very large numbers of SNPs thereby limiting the ability of others to test different populations. We have been working toward an efficient and globally useful panel of ancestry informative markers that is comprised of a small number of highly informative SNPs. We have developed a panel of 55 SNPs analyzed on 73 populations from around the world. We present the details of the panel and discuss its strengths and limitations.
] attested to the importance of AIMs. These studies have mostly used SNPs or di-allelic insertion–deletion markers (InDels or DIPs) because the forensic STR markers are not especially powerful for ancestry inference [
]. SNP sets have been developed for various reasons: estimating admixture in individuals from populations known to be admixed, usually involving specific ancestral populations; distinguishing an individual's ancestral origins assuming no significant admixture involving distant populations; controlling for heterogeneous ancestry in clinical association studies. Forensic identification of ethnicity has been yet another reason for developing such sets of markers. The variety of population resources used to identify the ancestry informative SNPs has ranged from a few widely separated population samples in the HapMap to the HGDP-CEPH panel of 52 small population samples.
Very large numbers of markers will nearly always provide accurate discrimination for at least 6 or 7 geographic regions. However, most useful for forensics would be a small but efficient and robust set of markers that would provide excellent information on ancestry. We have previously identified a panel of SNPs that have both high heterozygosity globally and very low allele frequency variation around the world [
]. This panel is of great forensic value for individual identification but gives no information on ancestry. In contrast, an optimized panel of ancestry informative SNPs (AISNPs, a subset of AIMs in general) will need SNPs with large allele frequency differences among a very broad set of populations. A limitation of AIMs in general is that they cannot distinguish among populations not previously studied. Thus, individual ancestry estimation is problematic if a relevant ancestral population has not been included in the defining studies.
Our interest in AISNPs is forensics: we wish to identify a small number of SNPs that will be good for identifying the geographic/ethnic origin of an unknown sample. The origin estimated must have a high enough probability of being correct that the SNPs will provide a useful investigative tool. In a forensic context a small number of SNPs can mean lower costs and possibly faster turnaround. A small number of highly selected SNPs can be sufficient for accurate estimation of ancestry [
]. The search for optimal SNPs must use population samples that are representative of diverse geographical regions and have large enough sample sizes so that sampling errors are minimized. One must then identify those polymorphisms most able to distinguish among those populations. We have used enough different population samples that we have several samples from each major geographic region we are investigating and individual population sample sizes averaging 50 individuals. We have selected candidate SNPs using a wide variety of methods and sources. In this report we present our current set of 55 AISNPs that constitute an efficient panel for a global distinction of seven to eight biogeographic regions.
2. Methods
2.1 Strategy
We used many sources of data to identify potential AISNPs. We initially used the Applied Biosystems database of allele frequencies of four populations (Japanese, Chinese, Europeans, African Americans) for the TaqMan probes they sell. SNPs with a frequency range near 1.0 became candidates. Next we used the ∼650,000 SNPs tested on the HGDP-CEPH panel of over 1000 individuals from 51 populations [
]. We also used data we collected for the same SNPs tested on 1300 additional individuals not present in the HGDP. These additional individuals increased the sample sizes for the populations we contributed to the HGDP and added additional populations. We used our own laboratory database of about 4000 polymorphic markers, each typed on 44 to 56 populations consisting of a total of nearly 3000 individuals. Our laboratory database resulted from many different studies of allele frequency variation done for a variety of reasons, e.g., pharmacogenetics [
]. As they became available we screened other large datasets for promising candidate AISNPs.
We explored several approaches to selecting candidate SNPs, comparing them, and balancing the information a selection provided. Ultimately, the combination of approaches would have to be considered empiric. Many candidate SNPs initially had data on a small number of populations; we selected those sites that had the largest absolute frequency differences or the largest Fst values for further evaluation. They were tested on our initially available set of 44 populations. Combined analyses of published datasets are often impossible because different studies used different markers on different populations [
]. These have no SNPs in common but can be analyzed together since the individuals studied are the same. To help overcome the general dearth of SNPs studied in common we analyzed the 128 SNPs from Seldin's group on our populations [
] and included data on our populations in the Nievergelt study. In both cases some SNPs had already been identified by us as good candidates; both studies also included other SNPs we had not previously identified as excellent candidates. All of the markers from those two studies were included in the set of several hundred candidate AISNPs that were typed on the remaining samples in our lab to complete a comprehensive dataset with no missing population-SNP data points. The global coverage of our several hundred candidate AISNPs consisted of 63 populations with a total of 3071 individuals (see list in Supplemental Table S1).
2.2 Balancing information
It is important to balance the selection of SNPs such that the information from different SNPs assures that different geographical regions of the world are robustly distinguishable [
]. For example, a random selection of SNPs with high global Fst will have a large excess of SNPs with allele frequencies distinguishing African populations from populations in the rest of the world, a dichotomy that can outweigh most other distinctions among populations. We used several methods to balance the SNP selection. Our approach to identifying highly informative AIMs is analogous to other approaches [
] but differed from them in that we used all (63 × 62)/2 pairwise comparisons of our 63 populations to identify SNPs with the largest pairwise allele frequency differences. This allowed us to identify markers especially useful for discriminating among populations from many different biogeographic regions. In contrast, other studies often focused on comparing more restricted predefined regions appropriate for each specific research question. Heatmaps of the candidate gene allele frequencies helped by graphically portraying redundancy in SNP information. Pairwise Fst calculations for each SNP across populations from different regions helped identify those SNPs best at certain distinctions, such as Europe vs. East Asia, so that the SNPs best at pairwise distinctions were used in the balancing. We also employed STRUCTURE [
] as one first-pass method of identifying the SNPs that differentiated most between the clusters identified. After a considerable amount of testing alternative sets of SNPs and switching individual SNPs in and out, we present a more efficient provisional panel of 55 AIMs. Once we had identified our set of 55 AISNPs on our 63 populations, we extracted the data for 813 individuals from the 1000 Genomes populations. The resulting data include 73 populations and 3884 individuals.
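The exhaustive pairwise screen described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the toy population labels and frequencies are hypothetical, and the only facts taken from the text are that every pair of populations is compared and that the largest absolute allele frequency difference is the quantity of interest.

```python
from itertools import combinations

def n_pairs(n_pops):
    """Number of unordered population pairs, n(n - 1)/2."""
    return n_pops * (n_pops - 1) // 2

def max_pairwise_difference(freqs):
    """For one SNP, return (largest |p_i - p_j|, (pop_i, pop_j)).

    freqs maps a population label to the frequency of the same allele.
    """
    best_diff, best_pair = 0.0, None
    for a, b in combinations(sorted(freqs), 2):
        diff = abs(freqs[a] - freqs[b])
        if diff > best_diff:
            best_diff, best_pair = diff, (a, b)
    return best_diff, best_pair

# Hypothetical frequencies for a single SNP in three made-up populations.
toy = {"PopA": 0.05, "PopB": 0.90, "PopC": 0.50}

print(n_pairs(63))                   # 1953 comparisons for 63 populations
print(max_pairwise_difference(toy))  # largest gap is between PopA and PopB
```

Ranking candidate SNPs by this value for a particular pair of regions is one simple way to find markers suited to a specific distinction, which is the role the pairwise comparisons play in balancing the panel.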
2.3 Laboratory
The 63 population samples from our laboratory were typed for all SNPs by TaqMan SNP Genotyping Assays® (Applied Biosystems, Foster City, California, USA) in three microliter reactions following the manufacturer's instructions. The genotypes of the samples in the 1000 Genomes Project were downloaded from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/. Overall, missing genotypes account for 1.5% of the total, with no SNP exceeding 4% missing genotypes in the 3884 individuals.
2.4 Statistics
Fst was calculated for the allele frequencies using the formula of Wright with no modification for sample size variation among the population samples [
]. We used both the overall Fst in selecting candidate SNPs and the pairwise Fst in balancing the panel to include SNPs informative for different distinctions among populations. Heatmaps were calculated using the public program in R. Principal components analysis (PCA) of population sample allele frequencies used XLSTAT (version 2009.4.07; Addinsoft SARL, http://www.xlstat.com/en/company/). MDS, using XLSTAT on the dataset of 63 populations, was used to illustrate the diversity of SNP information.
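As an illustration of the Fst calculation, the sketch below assumes the common form of Wright's statistic for a di-allelic site, the variance of the allele frequency among populations divided by p̄(1 − p̄), with every population weighted equally (that is, no sample-size correction). The exact expression used by the authors is not spelled out in the text, so this form and the function name are assumptions.

```python
def wright_fst(freqs):
    """Unweighted Fst for one di-allelic SNP.

    freqs is a list of allele frequencies, one per population.
    Fst = var(p) / (p_bar * (1 - p_bar)), with the variance taken
    over populations (divisor n) and no sample-size adjustment.
    """
    n = len(freqs)
    p_bar = sum(freqs) / n
    if p_bar <= 0.0 or p_bar >= 1.0:
        return 0.0  # monomorphic everywhere: no differentiation to measure
    variance = sum((p - p_bar) ** 2 for p in freqs) / n
    return variance / (p_bar * (1.0 - p_bar))

# Global Fst uses all populations; pairwise Fst passes just two frequencies.
print(wright_fst([0.0, 1.0]))        # fixed difference between two populations
print(wright_fst([0.1, 0.9]))        # strong but incomplete differentiation
print(wright_fst([0.3, 0.3, 0.3]))   # identical frequencies
```

Calling the same function with two frequencies gives a pairwise value, so one routine can serve both the candidate screen (overall Fst) and the balancing step (pairwise Fst).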
] was also used to evaluate and visualize the degree to which sets of sites distinguish among the populations. The various analyses used a burn-in of 20,000 followed by 10,000 iterations with a model of correlated allele frequencies specified. Specific solutions were plotted using DISTRUCT 1.1 (free software downloaded from http://rosenberglab.bioinformatics.med.umich.edu/distruct.html) [
]. The matrix of pairwise similarities among replicate runs was employed to identify different overall patterns based on high G values among runs with the “same” pattern and lower values for runs with different patterns.
Calculation of likelihoods of ancestry for selected individuals used the function in FROG-kb (http://frog.med.yale.edu) for the Kidd Lab 55 AISNP panel described in this paper. For each population the calculation is simply the product of the frequencies of the genotypes of the input individual across all 55 loci. In the output the populations are ranked from highest to lowest likelihood.
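The likelihood ranking can be sketched as below. This is not the FROG-kb implementation: it assumes genotype frequencies are derived from population allele frequencies under Hardy-Weinberg proportions, it sums log10 values rather than multiplying raw frequencies to avoid underflow, and it clamps frequencies away from 0 and 1 with an arbitrary floor so that one unobserved allele does not force a likelihood of zero. The SNP names, population labels, and frequencies are hypothetical.

```python
import math

def genotype_freq(p, n_copies):
    """Hardy-Weinberg genotype frequency for 0, 1, or 2 copies of the
    allele whose population frequency is p."""
    q = 1.0 - p
    return (q * q, 2.0 * p * q, p * p)[n_copies]

def log_likelihood(genotypes, pop_freqs, floor=1e-6):
    """log10 of the product of genotype frequencies across loci."""
    total = 0.0
    for snp, n_copies in genotypes.items():
        p = min(max(pop_freqs[snp], floor), 1.0 - floor)  # assumed clamp
        total += math.log10(genotype_freq(p, n_copies))
    return total

def rank_populations(genotypes, reference):
    """Return (population, log10 likelihood) pairs, highest first."""
    scores = {pop: log_likelihood(genotypes, f) for pop, f in reference.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical two-SNP reference table and one individual's genotypes
# (number of copies of the reference allele at each SNP).
reference = {
    "PopA": {"snp1": 0.9, "snp2": 0.1},
    "PopB": {"snp1": 0.2, "snp2": 0.8},
}
individual = {"snp1": 2, "snp2": 0}

ranked = rank_populations(individual, reference)
for pop, ll in ranked:
    # Likelihood ratio relative to the top-ranked population.
    print(pop, round(ll, 4), 10 ** (ll - ranked[0][1]))
```

The last column is the ratio of each population's likelihood to that of the top-ranked population; it equals 1 for the best match and shrinks as the fit worsens.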
3. Results and discussion
The final list of 55 AISNPs is given in Table 1. The allele frequencies are available in ALFRED for these 73 populations and any other populations that have data available in ALFRED. The data can be retrieved under the individual rs-numbers or through the “SNP Sets” menu as “KiddLab Set of 55 AISNPs”. There were no significant deviations beyond chance levels for Hardy-Weinberg ratios given the 55 × 73 = 4015 tests. Fig. 1 compares the distributions of Fst for these 55 AISNPs with those for two other sets of markers: an essentially random set of SNPs [
]. Though the three distributions are based on different numbers of populations, many population samples occur in all three data sets and the geographic ranges of populations are the same. On average, we are dealing with a set of SNPs with greater global variation than the 128 AISNPs. The Nievergelt et al. [
) are compared to the distribution for the set of 55 AISNPs. The two previous distributions are based on a reference set of SNPs typed on the Kidd Lab populations and on the Seldin group's set of 128 Ancestry Informative SNPs typed on a larger set of populations including the Kidd Lab populations. Because all three sets include the basic 47 Kidd Lab populations, the additional and different populations in the two larger studies are not sufficient to invalidate the marked differences in the distributions.
The heatmap in Fig. 2 is based on the population allele frequencies for the 55 AISNPs. It allows a very quick visualization of (1) the relationship of each SNP in the data set to the others, and (2) how each SNP contributes to distinguishing among populations. The heatmap shows the relationships of the SNPs and of the populations graphically in the marginal dendrograms. The heatmap also allows a determination of how these individual markers contribute to the differentiation of the specific populations analyzed. The several higher branchings of the SNP dendrogram indicate that diverse patterns of allele frequency variation occur among these 55 AIMs.
Fig. 2. The heatmap of the clustering of the 73 populations and the 55 AISNPs. The upper left block represents Europe through South Central Asia. The large middle block represents East Asia and below that the Native Americans. The bottom right block represents Africa. Clearly, different SNPs contribute differently to population distinctions and one view of the relationships is given by the lengths of the branches in the dendrograms.
STRUCTURE is useful for displaying how individual genotypes for a set of AIMs segregate individuals into approximately Mendelian populations. In the most likely STRUCTURE run at K = 8 the 3884 individuals in this study are assigned to seven distinct clusters in which most individuals in most populations fall into a single cluster (Fig. 4). At K = 8 the results for most individuals in most populations are essentially unaltered from the pattern at K = 7 (not shown) but a complex “admixture” pattern is introduced for the European populations. PCA on the allele frequencies in the populations shows four distinct groupings of populations based on the first 3 components (Fig. 3): a highly distributed African group, a more tightly clustered East Asian group, a modestly clustered Native American group, and a European-Southwest Asian group. This pattern reflects the geographic clustering of the majority of the populations being studied: the geographically intermediate populations tend to be placed in more intermediate positions. The African populations show a West to East cline toward the non-African populations. Taken together, the heatmap and the STRUCTURE analyses show that clusters exist in which several populations are essentially indistinguishable. These analyses demonstrate that information exists on ancestral origins of individuals, but do not obviously indicate how strongly the clusters differ in a statistical sense.
Fig. 3. Principal Component Analysis of the 73 populations using the 55 AISNPs. (a) The first PC accounts for 38.9% of the variance and primarily separates African populations from the rest of the world. The second PC accounts for 31.9% of the variance and primarily separates Europe from East Asia and the Americas. The two components account for 70.8% of the variance. (b) The third PC accounts for 12.5% of the variance and completely separates the American Indians from the East Asians.
Fig. 4. The most likely of the 20 STRUCTURE analyses at K = 8 for the full dataset. The results are plotted as the average assignments for each population and as the individual assignments. A cline is evident for the Mediterranean populations between the populations in Southwest Asia and those in central and northern Europe. We note also that all of the European populations have been estimated to be admixed between two clusters (illustrated in gray and blue) not otherwise present. This likely relates to the inherent Mendelian segregation for most of the “European specific” markers.
Although STRUCTURE allows evaluation of potential AISNPs, it is cumbersome to use and not particularly useful in our effort to identify as small a set of SNPs as possible while still defining multiple geographic regions of origin. The empiric approach using multiple methods as described above produced surprisingly good results. The value of an ancestry panel depends on how accurately a likelihood function determines ancestry of an individual. That accuracy will depend on the specific ancestry of the individual, the reference populations available for comparison, and the particular set of SNPs. We illustrate this by estimating the population assignments of six individuals not otherwise in the study: two Hungarians, two Druze, and two Mongolians. The Hungarian and Druze individuals were not included in the reference data or used to select the panel of SNPs but are related to individuals in those datasets. The two unrelated Mongolian individuals are recruits from among the students of the Health Sciences University of Mongolia in Ulaan-Baatar; no reference population data for Mongolia are available for calculations. For all six individuals we have used the functions in FROG-kb [
] to calculate the likelihoods of the individual originating from each of our 63 populations. In Tables 2, 3, and 4 we list the likelihoods and likelihood ratios for the top 20 populations for each of the six individuals. The likelihoods are graphed in supplemental figures S3 through S5 in numeric order for all 63 populations already incorporated in FROG-kb.
] showed that likelihood of ancestral assignment to the two populations differed among individuals and that a few individuals were misclassified or not classified with statistical significance. With 63 reference populations many more options for “mis-assignment” are possible. Because of Mendelian segregation some individuals in a population may have genotypes that are more likely to occur in a population other than the population of origin. However, the other populations that have higher or similar likelihoods of origin are generally from the same or a nearby region. For the two Druze individuals the other high-ranking populations are generally Mediterranean. The two Hungarians show very different sets of high-ranking populations of origin and the results could be interpreted as Hungarian A having significant Jewish ancestry, an entirely plausible result given known European history. Finally, the two Mongolian individuals have neither a “correct” ancestral population nor any geographically close populations among the reference populations available for assignment. They show quite different rankings of Asian populations and illustrate the high inherent uncertainty in estimating the ancestry of an individual originating from a poorly represented region of the world. Thus, using a likelihood function such as implemented for this panel in FROG-kb [
] cannot be expected to identify routinely the specific population from which an individual originates. Rather, the best resolution one can be reasonably confident of is that the cluster of populations (as seen in Fig. 4) an individual belongs to will be identified but not necessarily with high statistical significance.
To distinguish among populations from many different regions of the world requires SNPs that have a variety of patterns of allele frequencies around the world. We have used MDS of the SNPs to evaluate the diversity of the 55 SNPs (Fig. 5). The variety of patterns of allele frequency variation is reflected in the SNPs’ dispersion on the MDS plot. The only very tight cluster occurs at the bottom of the figure and represents several SNPs that provide a primarily Africa vs. the rest of the world picture. Several SNPs are highlighted in Fig. 5. Their frequency patterns are illustrated in other figures. Fig. 6 shows four SNPs with relatively simple patterns; each differentiates a single geographic region. In combination, however, the set of four clearly distinguishes the Pacific populations and the East African populations. Figures S1 and S2 in supplementary material illustrate the allele frequency patterns of the other SNPs highlighted in Fig. 5.
Fig. 5. An MDS plot of SNPs based on the pairwise correlations of SNP allele frequencies across all 73 populations. The dispersion of the SNPs in the plot reflects the variety of patterns of allele frequency variation shown by the different SNPs. Some specific examples are shown in Fig. 6 and in Supplementary Figures S1 and S2.
Fig. 6. The allele frequency distributions for four SNPs that are highly differentiated in a single region of the world. These four SNPs are among those at the compass points in the MDS plot (Fig. 5). Each highly differentiates one of four biogeographic regions: rs2814778 in the South distinguishes Africa; rs1426654 in the East distinguishes Europe; rs12498138 in the North distinguishes Native Americans; and rs1800414 in the West distinguishes East Asians. Three of these SNPs have ranges essentially spanning zero to one. Geographically intermediate populations often have intermediate frequencies; we note especially the populations in Southwest Asia.
Can this panel of AISNPs be improved? Absolutely. Resolution of ancestry, especially for individuals from populations not represented in these 73, will likely be improved if more populations are typed for these SNPs. However, the greatest improvement will come from using “better” SNPs. The problem is finding SNPs that provide a clearer differentiation of certain populations or groups of populations without detracting from differentiation among some other populations. As noted in Kersbergen et al. [
], some SNPs simply add noise. We note that several of the SNPs that help differentiate European individuals from the rest of the world are not fixed for the Europe-specific allele. With genotype differences among individuals some individuals will tend to have the non-European alleles at more of the loci than other individuals. At higher K values the STRUCTURE analyses apparently use this Mendelian segregation to classify individuals in all European populations “randomly” into two or three different clusters, as seen in Fig. 4. In general, even if a SNP has extreme frequency variation between, say, East Asians and Native Americans, but the frequencies in Europe and Southwest and South Asia are all intermediate with no population distinguishing pattern, that SNP is adding noise to the differentiation of those populations. The SNP with the lowest Fst in these 73 populations, rs4411548, illustrates exactly that situation (Supplemental Figure S1). The frequency of one allele is near zero in East Asian and Pacific populations and ranges from 19% to 86% in Native Americans. In contrast, that allele ranges from 2% to 45%, with most other populations between 12% and 30%, in Africans, Europeans, and Southwest and South Central Asians. We have found that it is difficult to find additional SNPs that differentiate populations both globally and within regions while, at the same time, minimizing the total number of SNPs. An alternative approach that we are considering is a second tier of SNPs that are good within a region but not necessarily good, or as good as existing AISNPs, for global differentiation. We are currently working on one such second tier of AISNPs for the eastern half of Asia. Phillips et al. [
] have proposed such a regional panel focused on distinguishing European from South Asian populations. Another approach we are pursuing is the use of haplotypes comprised of molecularly close SNPs [
The variety of approaches we have used to optimize a set of ancestry informative SNPs all have value but none seems sufficient. The final test is how well the panel will rank the potential populations of ancestry in a likelihood context. While the current likelihood calculations in FROG-kb do not explicitly allow admixed ancestry involving different biogeographic regions, the possibility of admixed ancestry raises a caveat in use of any statistic with any panel of AIMs. Admixed ancestry cannot be estimated accurately unless the ancestral populations are represented among the reference populations.
While we note that improvements will likely be possible for this panel, our analyses show it is a very good first tier panel for identifying major geographic regions for the ancestry of an individual. Future tests of the robustness of this panel will require that additional populations be tested for these SNPs to determine how well the panel resolves ancestries for individuals from populations that are in poorly represented biogeographic regions and populations intermediate to the existing 73 population samples. Future improvement in resolution of ancestry among populations poorly differentiated by these 55 AISNPs will require searching for appropriate additional SNPs.
Conflict of interest
R. Fang works for Applied Biosystems/Life Technologies. The authors declare no other conflicts of interest.
Ethical approval
All samples were obtained with informed consent for studies of gene frequency variation under a protocol approved by the Yale IRB and by additional approved institutional protocols and government approvals as relevant in the various countries of origin.
Acknowledgments
This work was funded primarily by NIJ Grants 2010-DN-BX-K225 and 2010-DN-BX-K226 to KKK awarded by the National Institute of Justice, Office of Justice Programs, US Department of Justice. Points of view in this presentation are those of the authors and do not necessarily represent the official position or policies of the U.S. Department of Justice. We thank Eva Haigh for excellent technical help. We thank Drs. Jane Brissenden, Baigalmaa Evsanaa, Ariunaa Togtokh and Janet Roscoe for making the two Mongolian samples available. Special thanks are due to the many hundreds of individuals who volunteered to give blood samples for studies of gene frequency variation and to the many colleagues who helped us collect the samples. In addition, some of the cell lines were obtained from the National Laboratory for the Genetics of Israeli Populations at Tel Aviv University, and the African American samples were obtained from the Coriell Institute for Medical Research, Camden, New Jersey.
Appendix A. Supplementary data
The following are the supplementary data to this article:
Supplementary Fig. I. The population frequency profiles of two of the 55 AISNPs. The SNP with the highest Fst (rs2814778) is in the upstream region of DARC, the classic Duffy blood group locus. The SNP with the lowest Fst (rs4411548) is intronic in ATP6V0A1 on chromosome 17. Note that rs2814778 is in the tight cluster at the “South Pole” of Fig. 5 and the nearby SNP illustrated in Fig. 6 has a very similar frequency distribution. The less resolved pattern of rs4411548 is in the Northeast quadrant of the MDS plot in Fig. 5. It has been included in the set of 55 AISNPs because it contributes to the distinction between East Asians and Native Americans even though other candidate SNPs had higher global Fst; balancing the panel was the primary basis for including this SNP.
Supplementary Fig. II. The frequency distributions of three SNPs that have high global variation but provide information toward differentiating more than one region of the world. These three SNPs are not at the four compass points in the MDS plot (Fig. 5) and show more complicated patterns of allele frequency variation. These three examples provide support for distinction of African populations from Europe but not all other biogeographic regions. rs2238151 in the Northeast of the MDS plot has European and Native American populations with similar allele frequencies. rs4918664 in the Northwest has East Asians and Native Americans with similar allele frequencies. rs7354080 in the Southwest has a more complex pattern with intermediate frequencies in East Asian and South American populations. All three show individual populations with specific allele frequency variation that may exceed sampling error; this is noticeable for populations in South Asia and the Pacific.
Supplementary Fig. III. Ancestry likelihoods for two Hungarian individuals. The log likelihoods of all 73 populations are plotted for the two individuals in Table 2. The populations are ordered by the log likelihood with the highest likelihoods at the left. Very shallow slopes exist for groups of similar populations; steeper slopes occur for large differences among those groups.
Supplementary Fig. IV. Ancestry likelihoods for two Druze individuals. The log likelihoods of all 73 populations are plotted for the two individuals in Table 3. The populations are ordered by the log likelihood with the highest likelihoods at the left. Very shallow slopes exist for groups of similar populations; steeper slopes occur for large differences among those groups.
Supplementary Fig. V. Ancestry likelihoods for two Mongolian individuals. The log likelihoods of all 73 populations are plotted for the two individuals in Table 4. The populations are ordered by the log likelihood with the highest likelihoods at the left. Very shallow slopes exist for groups of similar populations; steeper slopes occur for large differences among those groups.
☆This is an open-access article distributed under the terms of the Creative Commons Attribution-NonCommercial-No Derivative Works License, which permits non-commercial use, distribution, and reproduction in any medium, provided the original author and source are credited.