Highlights
- Commercial sequencing kits now available for the Kidd Lab set of 55 AISNPs.
- ALFRED and FROG-kb now have 125 reference population samples for these SNPs.
- Improved global coverage for reference populations.
Abstract
Ancestry inference for a person using a panel of SNPs depends on the variation of frequencies of those SNPs around the world and the amount of reference data available for calculation/comparison. The Kidd Lab panel of 55 AISNPs has been incorporated in commercial kits by both Life Technologies and Illumina for massively parallel sequencing. Therefore, a larger set of reference populations will be useful for researchers using those kits. We have added reference population allele frequencies for 52 population samples to the 73 previously entered so that there are now allele frequencies publicly available in ALFRED and FROG-kb for a total of 125 population samples.
1. Introduction
In 2014 we published a panel of 55 ancestry informative single nucleotide polymorphisms (AISNPs) and showed that seven to eight biogeographic regions could be distinguished using these markers on 3884 individuals from 73 populations [1]. The data on those populations for these 55 AISNPs are available in the ALlele FREquency Database (ALFRED) <http://alfred.med.yale.edu> [2] and for estimating ancestry using the Forensic Reference Resource on Genetics Knowledge Base (FROG-kb) <http://frog.med.yale.edu> [3]. How these SNPs help reveal ancestry was demonstrated [1] in principal components analysis (PCA) and STRUCTURE [4] analyses. We note that this panel of 55 AISNPs is now implemented for massively parallel sequencing (MPS) in the sequencing products offered by Illumina and by Life Technologies. Since there are now commercial kits using these 55 SNPs for ancestry inference, we have added the allele frequencies for 52 more population samples for these 55 SNPs to ALFRED and FROG-kb, making a more comprehensive reference database available for forensic inferences. The total dataset now includes data for 11 more of the 1000 Genomes populations (Phase 3) for a total of 22.
2. Materials and methods
For most new population samples studied, collaborators sent DNA samples to Yale and the genotyping was done at Yale using the standard TaqMan assay system used for the original study [1]. In two cases the SNP genotyping was carried out in labs in China. Dr. Cai-Xia Li's group in China employed the custom Golden Gate genotyping assay procedure from Illumina, Inc.; Dr. Hui Li used the same TaqMan assays and protocols used at Yale. Supplementary Table S1 lists the 125 different population samples (6853 individuals) representing the diverse ethnic groups and biogeographic regions that have now been analyzed for the 55 AISNPs. The populations in the table are organized by geographic region. The table also includes the number of individuals from each group, the ALFRED unique identifier (UID) for looking up the description of each sample, and the names of the collaborating co-authors who provided SNP genotypes and/or collected the samples.
Supplementary Table S1. Population samples studied for 55 ancestry informative SNPs.
Genotypes were examined to ensure that the alleles called are on the positive strand and corrected if they were not. Allele frequencies were examined for possible misidentification of a locus, and every locus was tested for deviation from Hardy-Weinberg ratios in every population sample; no significant deviations were found.
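A per-locus, per-sample Hardy-Weinberg check of the kind described above can be sketched as follows. This is only an illustration: the text does not state which test statistic was used, so the one-degree-of-freedom chi-square test and the function name `hwe_chi_square` are assumptions.

```python
from math import erfc, sqrt


def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int) -> tuple[float, float]:
    """Chi-square test of Hardy-Weinberg ratios for one biallelic SNP.

    Takes the observed counts of the three genotypes (AA, AB, BB) in one
    population sample and returns (chi-square statistic, p-value), 1 d.f.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # estimated frequency of allele A
    q = 1.0 - p
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    # Skip classes with zero expectation (monomorphic SNP in this sample).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square distribution with 1 d.f.
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts, not taken from the study data.
print(hwe_chi_square(25, 50, 25))  # genotype counts exactly at HW ratios
print(hwe_chi_square(50, 0, 50))   # no heterozygotes at all
```

With 55 SNPs tested in 125 samples, several thousand tests are involved, so any significance threshold would in practice need to account for multiple testing.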
3. Results
Data from other laboratories are consistent with the TaqMan genotyping done at Yale in that alleles and frequencies agree with what could be expected for the geographic region and other population data in the general region. There are no significant deviations from Hardy-Weinberg ratios.
Allele frequencies for all 55 SNPs in all 125 population samples are accessible in ALFRED. In many cases additional populations have been studied for some of the SNPs and thus those SNPs have data on more than 125 population samples in ALFRED. In FROG-kb the “Kidd Lab – Set of 55 AISNPs” has complete data for all 125 reference population samples. The completeness of the data allows likelihoods and likelihood ratios to be calculated for all of these population samples for any input DNA profile for the 55 AISNPs (or a subset).
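The following minimal sketch shows how such likelihoods and likelihood ratios could be computed from allele frequencies. It assumes Hardy-Weinberg genotype proportions within each population and independence between SNPs; the frequency floor, the function names, and the example frequencies are hypothetical and are not taken from FROG-kb.

```python
from math import exp, log


def genotype_log_likelihood(genotype, freqs, floor=1e-3):
    """Log probability of a multi-SNP genotype in one population.

    genotype: dict mapping SNP id -> copies of the reference allele (0, 1, 2)
    freqs:    dict mapping SNP id -> reference allele frequency in the population
    floor:    assumed lower bound that keeps a zero frequency from giving log(0)
    """
    total = 0.0
    for snp, copies in genotype.items():
        p = min(max(freqs[snp], floor), 1.0 - floor)
        q = 1.0 - p
        # Hardy-Weinberg genotype probabilities for 0, 1, or 2 copies.
        total += log((q * q, 2 * p * q, p * p)[copies])
    return total


def likelihood_ratios(genotype, populations):
    """Ratio of the best population's likelihood to each population's likelihood."""
    log_l = {name: genotype_log_likelihood(genotype, f)
             for name, f in populations.items()}
    best = max(log_l.values())
    return {name: exp(best - value) for name, value in log_l.items()}


# Invented two-SNP example; real profiles would use the 55 AISNPs (or a subset).
profile = {"snp1": 2, "snp2": 0}
reference = {
    "PopA": {"snp1": 0.9, "snp2": 0.1},
    "PopB": {"snp1": 0.5, "snp2": 0.5},
}
print(likelihood_ratios(profile, reference))
```

Working in log space avoids numerical underflow when probabilities across many SNPs are multiplied, and a subset of SNPs is handled simply by passing a shorter profile.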
As an example of the added information provided by the new populations, we have summarized the results from FROG-kb analyses of an individual from the new population sample of Libyans (Table 1). We have listed the top 30 populations by the probability calculated by FROG-kb [3] of this individual originating from each of the population samples listed; the other 95 population samples representing other parts of the world had smaller probabilities. By the rules of likelihood, the population with the greatest probability of this genotype becomes the most likely population of origin. Likelihood ratios indicate how much more likely the best population is compared to others. By convention, a population with a ratio of 100 or more is significantly less likely to be the origin of the sample. Nineteen of the populations in Table 1 have ratios less than 100 and cannot be eliminated as the origin of this individual.

Table 1. Top 30 likelihoods calculated by FROG-kb [3] for the 55 AISNP set from the 125 current reference populations for a Libyan individual.

Populations | Probability of genotype in each population | Likelihood ratio |
---|---|---|
▶Palestinian Arabs | 1.20E−13 | 1.0 |
Sousse, Tunisia | 5.80E−14 | 2.1 |
Turkish Cypriots | 5.70E−14 | 2.2 |
Mehdia, Tunisia | 5.10E−14 | 2.4 |
*Libyans, Libya | 5.00E−14 | 2.4 |
Nebeur, Tunisia | 3.90E−14 | 3.1 |
Kairoun, Tunisia | 2.30E−14 | 5.5 |
▶Kuwaiti | 1.90E−14 | 6.5 |
Smar, Tunisia | 1.10E−14 | 11 |
Kerkennah, Tunisia | 7.70E−15 | 16 |
▶Druze | 6.00E−15 | 21 |
▶Sardinians | 4.90E−15 | 25 |
Turkish | 3.80E−15 | 33 |
Kesra, Tunisia | 3.70E−15 | 33 |
Tajiks | 2.70E−15 | 45 |
▶Adygei | 2.30E−15 | 53 |
Iranians | 2.30E−15 | 53 |
▶Greeks | 1.90E−15 | 66 |
▶Iberian (IBS) | 1.30E−15 | 94 |
▶Ashkenazi | 1.10E−15 | 110 |
▶Negroid Makrani | 1.10E−15 | 110 |
▶Mohanna | 4.80E−16 | 260 |
▶Roman Jews | 3.80E−16 | 330 |
▶Hungarians | 2.20E−16 | 560 |
▶Chuvash | 1.80E−16 | 690 |
▶Russians, Vologda | 1.20E−16 | 1.10E+03 |
▶Pathans | 9.60E−17 | 1.30E+03 |
▶Toscani (TSI) | 5.00E−17 | 2.50E+03 |
▶Yemenite Jews | 2.20E−17 | 5.60E+03 |
Gujarati (GIH) | 2.00E−17 | 6.10E+03 |
Results from FROG-kb [3] for a Libyan individual's 55 AISNP genotypes. The population sample from which the individual was taken is indicated by the asterisk; population samples previously used for FROG-kb calculations are indicated with ▶. The likelihood ratio is calculated as the probability of the best population divided by the probability of the specified population. Only those populations with a likelihood ratio greater than 100 can be significantly eliminated as a population of origin, but the new populations clearly give results favoring North Africa as opposed to Southwest Asia (Palestinian Arabs). Ethiopians, Somali, African Americans, and all Sub-Saharan populations tested had likelihood ratios from 10^6 to 10^55, clearly excluding all other populations from the continent of Africa.
a Where the symbol ▶ precedes a population name, the population was one of the 73 included in a previous publication, Kidd et al. [1]. The * precedes the name of the population to which the Libyan individual belongs; this individual's genotypes were employed in the FROG-kb calculation.
4. Discussion and conclusion
The additional population samples raise the 55 AISNP panel to having the largest number of reference population samples (125) and individuals (nearly 7000) of any public forensic ancestry panel. The absence of other population samples with data for all of these AISNPs illustrates the huge empty matrix problem with forensic panels of SNPs: different populations in the published literature have been typed for different sets of SNPs making comparison and integration impossible.
The value of more population samples is indicated by the results in Table 1. Without the North African samples from Tunisia and Libya, this Libyan individual's most likely populations of origin would be the Palestinian Arab sample, the Kuwaiti sample, and other Southwest Asian samples. The most likely broad region would have been suggested by those results, but the additional reference populations shift the interpretation toward North Africa. There will almost always be several populations of possible origin that are not significantly excluded, and the denser the biogeographic coverage of a region, the more one expects to see several populations with low likelihood ratios. Thoughtful interpretation of the results will always be necessary for any forensic ancestry panel, especially since the population of origin may not be among the reference population samples. With more highly informative markers also tested on all 125 population samples (and more), it may be possible to narrow the range of possible ancestral populations. It is our opinion that this set of 55 AISNPs is not the ultimate final panel, nor are all of these 55 likely to be included in an improved panel.
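The comparison with and without the North African samples can be reproduced from the probabilities printed in Table 1. The sketch below is illustrative only: the helper names are hypothetical, the North African samples are identified simply by a name ending in "Tunisia" or "Libya", and because the printed probabilities are rounded to two significant figures the recomputed ratios can differ slightly from the printed ones.

```python
# Genotype probabilities for the Libyan individual, copied from Table 1.
TABLE_1 = {
    "Palestinian Arabs": 1.20e-13, "Sousse, Tunisia": 5.80e-14,
    "Turkish Cypriots": 5.70e-14, "Mehdia, Tunisia": 5.10e-14,
    "Libyans, Libya": 5.00e-14, "Nebeur, Tunisia": 3.90e-14,
    "Kairoun, Tunisia": 2.30e-14, "Kuwaiti": 1.90e-14,
    "Smar, Tunisia": 1.10e-14, "Kerkennah, Tunisia": 7.70e-15,
    "Druze": 6.00e-15, "Sardinians": 4.90e-15,
    "Turkish": 3.80e-15, "Kesra, Tunisia": 3.70e-15,
    "Tajiks": 2.70e-15, "Adygei": 2.30e-15,
    "Iranians": 2.30e-15, "Greeks": 1.90e-15,
    "Iberian (IBS)": 1.30e-15, "Ashkenazi": 1.10e-15,
    "Negroid Makrani": 1.10e-15, "Mohanna": 4.80e-16,
    "Roman Jews": 3.80e-16, "Hungarians": 2.20e-16,
    "Chuvash": 1.80e-16, "Russians, Vologda": 1.20e-16,
    "Pathans": 9.60e-17, "Toscani (TSI)": 5.00e-17,
    "Yemenite Jews": 2.20e-17, "Gujarati (GIH)": 2.00e-17,
}


def ratios_from_probabilities(probs):
    """Best probability divided by each population's probability."""
    best = max(probs.values())
    return {pop: best / p for pop, p in probs.items()}


def not_excluded(probs, threshold=100.0):
    """Populations whose likelihood ratio is below the conventional cutoff."""
    ratios = ratios_from_probabilities(probs)
    return [pop for pop, ratio in ratios.items() if ratio < threshold]


def is_north_african(pop):
    """Name-based filter for the Tunisian and Libyan samples in Table 1."""
    return pop.endswith(("Tunisia", "Libya"))


without_north_africa = {pop: p for pop, p in TABLE_1.items()
                        if not is_north_african(pop)}

print(len(not_excluded(TABLE_1)), "populations not excluded with all samples")
print(len(not_excluded(without_north_africa)), "not excluded without North Africa")
print(not_excluded(without_north_africa)[:3])
```

Removing the Tunisian and Libyan samples leaves the Palestinian Arab sample as the best match and the remaining candidates outside North Africa, which is the shift in interpretation discussed above.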
The ideal forensic ancestry inference resource will consist of a large number of highly informative AISNPs with full data on a large number of population samples representing all regions of the world. In this context, the next best panel we are aware of for global ancestry is the one from the Seldin Lab [5,6]; we are currently working in our lab and with our collaborators to include most of those SNPs on these population samples. We are also adding at least the most informative SNPs from several of the other published ancestry panels, such as [7,8,9]. We encourage other researchers to consider adding their unique populations to this growing dataset of population samples which are all tested for the same set of ancestry informative SNPs.
Conflicts of interest
None.
Acknowledgments
This work was funded primarily by NIJ Grants 2013-DN-BX-K023 and 2014-DN-BX-K030 to KKK awarded by the National Institute of Justice, Office of Justice Programs, U.S. Department of Justice and Grant BCS-1444279 from the US National Science Foundation. Points of view in this presentation are those of the authors and do not necessarily represent the official position or policies of the U.S. Department of Justice or the Federal Bureau of Investigation. This work was partially funded by the National Natural Science Foundation of China (No. 81471828) to CXL. OB was supported by the Scientific and Technological Research Council of Turkey (TÜBİTAK) under 2219-Grant Program. This research was supported in part by an appointment to the Visiting Scientist Program at the Federal Bureau of Investigation (FBI) Laboratory Division, administered by the Oak Ridge Institute for Science and Education, through an interagency agreement between the US Department of Energy and the FBI. This is publication number 15-16 of the Laboratory Division of the Federal Bureau of Investigation. Special thanks are due to the many hundreds of individuals who volunteered to give blood or saliva samples for studies of gene frequency variation and to the many colleagues who helped us collect the samples. In addition, some of the cell lines were obtained from the National Laboratory for the Genetics of Israeli Populations at Tel Aviv University, and the African American samples were obtained from the Coriell Institute for Medical Research, Camden, New Jersey.
References
- [1] Progress toward an efficient panel of SNPs for ancestry inference. Forensic Sci. Int.: Genet. 2014; 10: 23-32
- [2] ALFRED: an allele frequency resource for research and teaching. Nucleic Acids Res. 2012; 40: D1010-D1015
- [3] Introducing the forensic research/reference on genetics knowledge base, FROG-kb. Investig. Genet. 2012; 3: 18
- [4] Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics. 2003; 164: 1567-1587
- [5] Ancestry informative marker sets for determining continental origin and admixture proportions in common populations in America. Hum. Mutat. 2009; 30: 69-78
- [6] Analyses of a set of 128 ancestry informative single-nucleotide polymorphisms in a global set of 119 population samples. Investig. Genet. 2011; 2: 1
- [7] Evaluating self-declared ancestry of U.S. Americans with autosomal, Y-chromosomal and mitochondrial DNA. Hum. Mutat. 2010; 31: E1875-E1893
- [8] Inferring ancestral origin using a single multiplex assay of ancestry-informative marker SNPs. Forensic Sci. Int.: Genet. 2007; 1: 273-280
- [9] Eurasiaplex: A forensic SNP assay for differentiating European and South Asian ancestries. Forensic Sci. Int.: Genet. 2013; 7: 359-366
Article info
Publication history
Published online: August 13, 2015
Accepted: August 7, 2015
Received in revised form: July 16, 2015
Received: June 5, 2015
Copyright
© 2015 The Authors. Published by Elsevier Inc.
User license
Creative Commons Attribution – NonCommercial – NoDerivs (CC BY-NC-ND 4.0)