Research Article | Volume 61, 102780, November 2022

Eurasiaplex-2: Shifting the focus to SNPs with high population specificity increases the power of forensic ancestry marker sets

Open Access | Published: September 20, 2022 | DOI: https://doi.org/10.1016/j.fsigen.2022.102780

      Highlights

      • New South Asian-informative forensic ancestry marker panel of 36 SNPs compiled, called Eurasiaplex-2.
      • SNPs selected to have zero or near-zero South Asian-specific allele frequencies in all other populations outside the Indian sub-continent.
      • Survey of 4097 worldwide samples shows an average of 11–14 South Asian-specific genotypes in South Asians vs. 0.2 in all other population samples.
      • Forensic ancestry markers with near-absolute specificity, like the SNPs of Eurasiaplex-2, offer potential for highly informative panels differentiating worldwide populations.

      Abstract

      To compile a new South Asian-informative panel of forensic ancestry SNPs, we changed the strategy for selecting the most powerful markers by targeting polymorphisms with near-absolute specificity, where the South Asian-informative allele is absent from all other populations or present at frequencies below 0.001 (one in a thousand). More than 120 candidate SNPs were identified from 1000 Genomes datasets that satisfied an allele frequency screen of ≥ 0.1 (10 % or more) in South Asians and ≤ 0.001 (0.1 % or less) in African, East Asian, and European populations. From the candidate pool, a final panel of 36 SNPs, widely distributed across most autosomes, was selected with allele frequencies in the five 1000 Genomes South Asian populations ranging from 0.15 to 0.4. Slightly lower average allele frequencies, but consistent patterns of informativeness, were observed in the gnomAD South Asian datasets used to validate the 1000 Genomes variant annotations. We named the panel of 36 South Asian-specific SNPs Eurasiaplex-2 and evaluated its informativeness by compiling worldwide population data from 4097 samples in four genome variation databases that largely complement the global sampling of 1000 Genomes. Consistent patterns of allele frequency distribution, specific to South Asia, were observed in all populations in, or sited close to, the Indian sub-continent. Pakistani populations from the HGDP-CEPH panel had markedly lower allele frequencies, highlighting the need to develop a statistical system to evaluate the ancestry inference value of counting the number of population-specific alleles present in an individual.

      Keywords

      1. Introduction

      We developed the original Eurasiaplex forensic single nucleotide polymorphism (SNP) ancestry panel in 2013 [
      • Phillips C.
      • Freire Aradas A.
      • Kriegel A.K.
      • Fondevila M.
      • Bulbul O.
      • Santos C.
      • Serrulla Rech F.
      • Perez Carceles M.D.
      • Carracedo Á.
      • Schneider P.M.
      • Lareu M.V.
      Eurasiaplex: a forensic SNP assay for differentiating European and South Asian ancestries.
      ] specifically to enhance the distinction of European and South Asian ancestries. These population groups are geographically close and lack physical barriers to migration, so they are genetically less well differentiated than other continentally defined population groups. For a sizeable proportion of their genomic variation, populations of the Indian sub-continent show allele frequencies positioned in the middle of an allele frequency cline running between Europe and East Asia. Key additional patterns of variation are defined by differing ratios of variability from the inferred founding populations of Ancestral North Indians and Ancestral South Indians [
      • Reich D.
      • Thangaraj K.
      • Patterson N.
      • Price A.L.
      • Singh L.
      Reconstructing Indian population history.
      ,
      • Majumder P.P.
      The human genetic history of South Asia.
      ]. Such variation therefore tends to have limited informativeness for distinguishing South Asian individuals from Europeans or East Asians. Nevertheless, the Eurasiaplex SNPs have proved to be a useful set for supplementing other forensic ancestry panels that have a stronger emphasis on differentiating the five continentally based population groups of Africa, Europe, East Asia, America, and Oceania [
      • Phillips C.
      • Parson W.
      • Lundsberg B.
      • Santos C.
      • Freire-Aradas A.
      • Torres M.
      • Eduardoff M.
      • Børsting C.
      • Johansen P.
      • Fondevila M.
      • et al.
      Building a forensic ancestry panel from the ground up: the EUROFORGEN Global AIM-SNP set.
      ,
      • Kidd J.R.
      • Friedlaender F.R.
      • Speed W.C.
      • Pakstis A.J.
      • De La Vega F.M.
      • Kidd K.K.
      Analyses of a set of 128 ancestry informative single-nucleotide polymorphisms in a global set of 119 population samples.
      ,
      • de la Puente M.
      • Ruiz-Ramírez J.
      • Ambroa-Conde A.
      • Xavier C.
      • Pardo-Seco J.
      • Álvarez-Dios J.
      • Freire-Aradas A.
      • Mosquera-Miguel A.
      • Gross T.E.
      • Cheung E.Y.Y.
      • et al.
      Development and evaluation of the ancestry informative marker panel of the VISAGE basic tool.
      ]. When considering the development of a new set of South Asian-informative SNPs to improve the power of massively parallel sequencing (MPS) ancestry tests for forensic use, we decided to revise the strategy for selecting optimum AIMs. This was accomplished by a change of focus away from allele frequency contrasts between South Asia and Europe and/or East Asia, towards selecting SNPs with near-absolute population specificity, defined in this respect as variants with zero or extremely low allele frequencies in population groups outside the targeted region. Therefore, despite the new variants identified having allele frequencies as low as 0.1 in South Asian populations, in all other regions the allele frequency of the specific allele is generally considerably lower, with values of 0.001 (1-in-1000) or less. Although an allele frequency of 0.1 is seemingly uncommon variation, 18 % of genotypes will be heterozygotes with the specific allele, and when specific allele frequencies are as high as 0.4, 64 % of genotypes in total carry that allele. These frequencies of a specific allele at any one locus contrast with those in individuals from other regions, including areas neighbouring South Asia, of less than 1 in 1000. As a result, when panels of thirty or more markers with near-absolute specificity are compiled, between 12 and 18 specific-allele genotypes are detected in individuals from the target population group, compared to a maximum of one, two or, much more rarely, three genotypes in individuals from elsewhere.
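      The genotype proportions quoted above follow from Hardy-Weinberg expectations. The short sketch below reproduces them and, as an illustration only, estimates how many specific-allele genotypes one individual would be expected to carry. The uniform frequency of 0.2 across 36 loci is a hypothetical simplification, not a value taken from the panel, and the function names are our own.

```python
def hardy_weinberg(p):
    """Genotype proportions for a specific allele at frequency p,
    assuming Hardy-Weinberg equilibrium."""
    q = 1.0 - p
    het = 2.0 * p * q   # genotypes with one copy of the specific allele
    hom = p * p         # genotypes with two copies
    return {"het": het, "hom": hom, "carrier": het + hom}


def expected_specific_genotypes(freqs):
    """Expected number of loci at which one individual carries the
    specific allele, summing carrier proportions over loci."""
    return sum(hardy_weinberg(p)["carrier"] for p in freqs)


if __name__ == "__main__":
    for p in (0.1, 0.4):
        g = hardy_weinberg(p)
        print(f"p={p}: het={g['het']:.2f} hom={g['hom']:.2f} carrier={g['carrier']:.2f}")
    # Hypothetical panel: 36 loci sharing one frequency of 0.2 (assumption)
    print(round(expected_specific_genotypes([0.2] * 36), 2))
```

      Under these assumptions the expected count falls inside the range of 12 to 18 genotypes described for panels of thirty or more markers.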
      To develop a second-generation forensic AIMs panel for the analysis of South Asian ancestry, which we called Eurasiaplex-2, approximately 120 candidate SNPs were compiled with absolute or near-absolute specificity to South Asia, a region extending from the Indian sub-continent into Afghanistan in the northwest, and Myanmar along with closely sited parts of the SE Asian archipelago in the east. We used the five 1000 Genomes populations with South Asian origins to detect SNPs with specific alleles that had frequencies ranging from 10 % to 40 % in these populations, contrasting with frequencies of zero, or in the range of 1-in-200 to 1-in-1000, in the project’s populations from East Asia, Europe, and Africa. Once identified, the best markers were compiled into a smaller panel of 36 AIMs suitable for inclusion in future globally applicable ancestry panels for SNP genotyping using MPS. The 36 SNPs selected for the Eurasiaplex-2 panel were cross-checked for consistent South Asian-specific allele frequency patterns amongst the widely dispersed population sampling of five other whole-genome-sequence human diversity projects.

      2. Methods and materials

      2.1 Changing the concept of population informativeness when selecting forensic ancestry SNPs

      Fig. 1 details three ancestry SNPs of potential interest for inclusion in a panel of South Asian-informative markers. SNP rs10008492 was chosen for the original Eurasiaplex panel as there is evident differentiation in the rs10008492-C allele frequencies (blue segment) between South Asian and European populations, although it is also evident that this allele only very weakly differentiates East Asian populations. This is reflected in the contrasting pairwise In values of 0.16 and 0.02, respectively, shown for each population comparison. The In divergence metric is widely used to gauge population differentiations [
      • Rosenberg N.A.
      • Li L.M.
      • Ward R.
      • Pritchard J.K.
      Informativeness of genetic markers for inference of ancestry.
      ] and was the basis for the first Eurasiaplex SNP selection process (Fig. 1 of [
      • Phillips C.
      • Freire Aradas A.
      • Kriegel A.K.
      • Fondevila M.
      • Bulbul O.
      • Santos C.
      • Serrulla Rech F.
      • Perez Carceles M.D.
      • Carracedo Á.
      • Schneider P.M.
      • Lareu M.V.
      Eurasiaplex: a forensic SNP assay for differentiating European and South Asian ancestries.
      ]). Another simple metric often used to indicate ancestry inference power is the allele frequency differential (δ): the absolute difference between one population’s allele frequency and another’s. In this case, the European-South Asian δ is 0.5, a comparatively high value. SNP rs6053171 provides very good differentiation between Europeans and East Asians, and South Asians have intermediate frequencies for both alleles, consequently giving high In and δ when comparing South Asians to both other populations. This SNP was chosen in a study by Pfaffelhuber et al. in 2020 [
      • Pfaffelhuber P.
      • Grundner-Culemann F.
      • Lipphardt V.
      • Baumdicker F.
      How to choose sets of ancestry informative markers: a supervised feature selection approach.
      ] as part of a set of twelve loci to compile an optimum ancestry SNP set for distinguishing African, European, East Asian and South Asian populations, selected using a supervised feature selection system [
      • Pfaffelhuber P.
      • Grundner-Culemann F.
      • Lipphardt V.
      • Baumdicker F.
      How to choose sets of ancestry informative markers: a supervised feature selection approach.
      ]. However, given a South Asian heterozygosity of 48 %, over half of genotypes are homozygous TT or GG and therefore these individuals are not distinguished from Europeans (69 % of TT homozygotes) or from East Asians (31 % of GG). The third SNP, rs371763923, has the lowest In and δ values of all three loci, so would not be amongst the top selections as an ancestry marker. However, the power of this type of SNP’s allele frequency distribution lies in the zero frequency for the rs371763923-G allele (yellow segment) in both Europeans and East Asians. In the 33 % of South Asians where the G allele is detected as a heterozygote, or the 4 % as a homozygote, these genotypes strongly signal origins from this population group as they are not observed elsewhere. Supplementary Fig. S1 details all the individual population allele frequencies in each SNP, indicating there are also zero rs371763923-G allele frequencies in African, Native American, Oceanian and Middle East populations, making this SNP universally specific for South Asia. Only one population, 1000 Genomes KHV (Kinh in Ho Chi Minh City, Vietnam), has two individuals with rs371763923-G alleles, representing < 1 % overall frequency. We selected multiple, well-spaced SNPs on each chromosome that display this kind of highly specific frequency distribution, by applying the strict criterion of the lowest possible specific allele frequency across all populations outside of South Asia. In all cases, the South Asian-specific allele was the Reference Sequence (RefSeq) alternative allele, not the reference allele (herein, Alt and Ref alleles, respectively).
      Fig. 1. Three different types of ancestry SNP with potentially South Asian-informative allele frequency distributions. Values of the well-established In divergence measure of population diversity are shown for each pairwise comparison: European-South Asian left, and South Asian-East Asian right. SNP rs10008492 was part of the original Eurasiaplex panel and is informative for the differentiation of Europeans, but not for East Asians. SNP rs6053171 has high divergence values, but South Asian homozygotes are uninformative for either Europeans or East Asians. Only SNP rs371763923 is equally informative for both comparisons and, despite relatively low divergence values, over 33 % of genotypes would be highly informative heterozygotes or GG homozygotes that are not found in the other populations.
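      The two metrics discussed in Section 2.1 can be computed directly from a pair of allele frequencies. The sketch below implements δ and the pairwise In of Rosenberg et al. for a biallelic SNP, using natural logarithms and equal weighting of the two populations. The frequencies passed in the example are illustrative placeholders, since the exact values plotted in Fig. 1 are not given in the text.

```python
import math


def delta(p1, p2):
    """Allele frequency differential between two populations."""
    return abs(p1 - p2)


def _xlogx(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0


def informativeness_in(p1, p2):
    """Pairwise In for a biallelic SNP.

    p1 and p2 are the frequencies of the same allele in two populations;
    the second allele has frequencies 1 - p1 and 1 - p2.
    """
    total = 0.0
    for a, b in ((p1, p2), (1.0 - p1, 1.0 - p2)):
        mean = (a + b) / 2.0
        total += -_xlogx(mean) + (_xlogx(a) + _xlogx(b)) / 2.0
    return total


if __name__ == "__main__":
    # Illustrative only: a population-specific allele at 0.2 vs. 0 elsewhere
    print(delta(0.2, 0.0), round(informativeness_in(0.2, 0.0), 4))
    # Illustrative only: a conventional AIM with a large frequency contrast
    print(delta(0.8, 0.3), round(informativeness_in(0.8, 0.3), 4))
```

      In this illustration the population-specific allele scores lower on both metrics than the conventional marker, which is the ranking problem the text describes: neither metric captures that the specific-allele genotypes are effectively unobserved elsewhere.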

      2.2 Marker selection

      BCFtools was used to make selections of suitable candidate SNPs from publicly available 1000 Genomes Phase III variant catalogues [
      • The 1000 Genomes Project Consortium
      • Auton A.
      • Brooks L.D.
      • Durbin R.M.
      • Garrison E.P.
      • Kang H.M.
      • Korbel J.O.
      • Marchini J.L.
      • McCarthy S.
      • McVean G.A.
      • et al.
      A global reference for human genetic variation.
      ]. The chromosome-based VCF data from the 1000 Genomes FTP site was searched using the simple allele frequency intersect of: ‘> 0.1 in South Asian, < 0.01 in African, East Asian, European and admixed American population sample’ frequency cut-offs. Multiple-allele variants were excluded, and X-/Y-chromosome variants were identified but not compiled into the final candidate lists for each chromosome. Lists of candidate SNPs were assembled in Excel per chromosome for further comparisons of allele frequencies in African, East Asian, European, and American population samples, to maximise the level of South Asian specificity. It should be emphasised that we routinely compile 1000 Genomes variant data from the high sequence coverage datasets generated by the NYGC sequencing of the project's sample set, which has greatly improved the quality of SNP genotype calls across the human genome [

      M. Byrska-Bishop, U.S. Evani, X. Zhao, A.O. Basile, H.J. Abel, A.A. Regier, A. André Corvelo, W.E. Clarke, R. Musunuri, K. Nagulapalli, et al., High coverage whole genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios, bioRxiv preprint, posted February 7 2021 doi: 〈https://doi.org/10.1101/2021.02.06.430068〉.

      ]. The SNP rs3857620 illustrates the problem of searching for ancestry markers in variant catalogues based on low coverage sequencing data, which may continue to harbour incorrectly called genotypes. This SNP was identified by Zhao et al. in 2019 [
      • Zhao S.
      • Shi C.-M.
      • Ma L.
      • Liu Q.
      • Liu Y.
      • Wu F.
      • Chi L.
      • Chen H.
      AIM-SNPtag: a computationally efficient approach for developing ancestry-informative SNP panels.
      ] as a South Asian-informative marker suitable for a small-scale forensic ancestry panel proposed to consist of 36 SNPs in total. SNP rs3857620 was also adopted for the VISAGE Enhanced Tool for Ancestry and Appearance in a much larger set of SNPs genotyped with MPS [
      • Ruiz-Ramírez J.
      • de la Puente M.
      • Xavier C.
      • Ambroa-Conde A.
      • Álvarez-Dios J.
      • Freire-Aradas A.
      • Mosquera-Miguel A.
      • Ralf A.
      • Amory C.
      • Katsara M.A.
      • et al.
      Development and evaluations of the ancestry informative markers of the VISAGE enhanced tool for appearance and ancestry.
      ]. However, the genotypes currently listed in the 1000 Genomes Ensembl portal [] indicate a 46 % frequency of the A-allele in South Asian populations vs. 0–13 % in the other population groups, which contradicts the A-allele frequency estimate of less than 1 % across all 1000 Genomes populations sequenced at high coverage. For this reason, we cross-checked all current human genome variant databases to ensure each SNP of interest showed consistent patterns across all datasets.
      Once compiled, single SNPs were selected from each cluster of markers on any one chromosome segment showing near-identical allele frequency patterns, to optimise the genomic distribution of the final marker set. An exception was made to this rule for chromosome 16, which had a very large extended haplotype of markers with high South Asian specificity at 16p11.2-q11.2. In this case, multiple SNPs were chosen at widely dispersed positions which were clearly part of a large-scale chromosome segment where haplotypes of specific SNP variants had been preserved at very high frequencies in many South Asian populations.
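      The allele frequency intersect described in Section 2.2 can be sketched as a simple per-record filter. The example below is not the BCFtools procedure used in the study; it is a plain Python illustration that assumes each VCF record carries per-group INFO tags named SAS_AF, AFR_AF, EAS_AF, EUR_AF and AMR_AF, and the records, identifiers and positions shown are hypothetical.

```python
OTHER_GROUPS = ("AFR_AF", "EAS_AF", "EUR_AF", "AMR_AF")


def parse_info(info):
    """Split a VCF INFO string into a key -> value dictionary."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def passes_intersect(vcf_line, sas_min=0.1, other_max=0.01):
    """True if a biallelic record has SAS_AF > sas_min and every other
    group frequency < other_max (the cut-offs quoted in the text)."""
    cols = vcf_line.rstrip("\n").split("\t")
    if "," in cols[4]:          # ALT column: exclude multiple-allele variants
        return False
    info = parse_info(cols[7])
    try:
        sas = float(info["SAS_AF"])
        others = [float(info[tag]) for tag in OTHER_GROUPS]
    except (KeyError, ValueError):
        return False            # skip records lacking usable frequencies
    return sas > sas_min and all(f < other_max for f in others)


if __name__ == "__main__":
    # Hypothetical records, for illustration only
    specific = "\t".join(["1", "1000", "rs_hypothetical_1", "A", "G", ".", "PASS",
                          "AFR_AF=0;AMR_AF=0;EAS_AF=0.002;EUR_AF=0;SAS_AF=0.21"])
    shared = "\t".join(["1", "2000", "rs_hypothetical_2", "C", "T", ".", "PASS",
                        "AFR_AF=0.05;AMR_AF=0.1;EAS_AF=0.3;EUR_AF=0.2;SAS_AF=0.3"])
    print(passes_intersect(specific), passes_intersect(shared))
```

      Candidates passing such a screen would still need the cross-database checks described above, since a frequency taken from one catalogue can be misleading.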

      2.3 Human genome variant databases accessed

      The SNP variant catalogue of 1000 Genomes Phase III was interrogated with the South Asian-targeted allele frequency intersect described in Section 2.2. All genotypes were cross-checked for accuracy with the NYGC high sequence coverage dataset for each candidate SNP in turn. To further check accuracy of allele frequency estimates made from each set of 1000 Genomes genotypes, the gnomAD (Genome Aggregation Database [
      • Lek M.
      • Karczewski K.J.
      • Minikel E.V.
      • Samocha K.E.
      • Banks E.
      • Fennell T.
      • O’Donnell-Luria A.H.
      • Ware J.S.
      • Hill A.J.
      • Cummings B.B.
      • et al.
      Analysis of protein-coding genetic variation in 60,706 humans.
      ]) v.3.1.2 dataset was checked for all Eurasiaplex-2 candidate SNPs. The gnomAD database is the largest publicly available collection of population variation compiled from large-scale human genome sequencing projects. It only reports allele frequencies per population but compiles the most up-to-date data from 1000 Genomes (i.e., the NYGC high sequence coverage variant data), whole-genome sequence data for the widely used HGDP-CEPH diversity panel, and more than 152,000 samples from large-scale African, European (Finnish and non-Finnish compiled separately), East Asian, South Asian, Middle Eastern, Latino (admixed American), Ashkenazi Jewish and Amish population samples. Studies of patterns of human variation on this scale provide very accurate allele frequency estimates, and gnomAD data is particularly sensitive to very rare variation such as a tri-allelic polymorphism where a second alternative allele (Alt-2) at a SNP site is present in one population at a very low frequency. Genotypes from the HGDP-CEPH panel were obtained from this project’s FTP site [
      • Bergström A.
      • McCarthy S.A.
      • Hui R.
      • Almarri M.A.
      • Ayub Q.
      • Danecek P.
      • Chen Y.
      • Felkel S.
      • Hallast P.
      • Kamm J.
      • et al.
      Insights into human genetic variation and population history from 929 diverse genomes.
      ].
      Human variant data from the Simons Foundation human genome diversity project (herein, SGDP [
      • Mallick S.
      • Li H.
      • Lipson M.
      • Mathieson I.
      • Gymrek M.
      • Racimo F.
      • Zhao M.
      • Chennagiri N.
      • Nordenfelt S.
      • Tandon A.
      • et al.
      The Simons Genome Diversity Project: 300 genomes from 142 diverse populations.
      ]) offers information on geographic areas outside of those covered by the 1000 Genomes or HGDP-CEPH sampling regimes; in particular, Northeast Asia (broadly, Siberia east towards the Bering Straits); Southeast Island Asia; and Central South Asia (broadly, the Caucasus, east towards the central Asian Steppe immediately north of the Hindu Kush, Kashmir, and Tibet). There are 278 samples with genome-wide SNP datasets, of which 22 overlap with 1000 Genomes and 133 overlap with the HGDP-CEPH panel samples, leaving 123 unique samples from 67 populations, each represented by a very limited number of 1, 2 or 3 samples. The Estonian Biocentre genome diversity panel (herein, EGDP [
      • Pagani L.
      • Lawson D.J.
      • Jagoda E.
      • Mörseburg A.
      • Eriksson A.
      • Mitt M.
      • Clemente F.
      • Hudjashov G.
      • DeGiorgio M.
      • Saag L.
      • et al.
      Genomic analyses inform on migration events during the peopling of Eurasia.
      ]) largely mirrors the sampling regimes of SGDP by being mainly samples of 1, 2 or up to 6 individuals per population or region. EGDP has 402 individual samples from 121 populations that almost all complement the SGDP samples by providing extensive geographic coverage of Siberian, Northeast Asian, Eastern European, and Central South Asian regions. We compiled the 123 SGDP-unique genotype datasets and 402 EGDP genotype datasets for the candidate Eurasiaplex-2 SNPs.
      Lastly, we collected genotypes for the Eurasiaplex-2 final selection of 36 SNPs from the Singapore Genome Variation Project (herein SGVP), a whole-genome sequencing study from 2014 which included 36 individuals of Indian descent residing in Singapore, plus 96 Malays in Singapore [
      • Wong L.-P.
      • Ong R.T.
      • Poh W.T.
      • Liu X.
      • Chen P.
      • Li R.
      • Koi-Yau Lam K.
      • Esakimuthu Pillai N.
      • Sim K.-S.
      • Xu H.
      • et al.
      Deep whole-genome sequencing of 100 southeast Asian Malays.
      ,
      • Wong L.-P.
      • Kuan-Han Lai J.
      • Saw W.-Y.
      • Ong R.T.
      • Youzhi Cheng A.
      • Esakimuthu Pillai N.
      • Liu X.
      • Xu W.
      • Chen P.
      • Foo J.-N.
      • et al.
      Insights into the genetic structure and diversity of 38 South Asian Indians from deep whole-genome sequencing.
      ].

      2.4 Statistical considerations

      The power of specific variation to signal a particular population is evident in Bayes analysis, which generates a likelihood ratio (LR) between two possible populations-of-origin: cumulative probabilities become very high for SNPs with specific alleles at even moderate frequencies. The highest likelihood ratio for an rs10008492-TT or rs6053171-GG homozygote comparing South Asian and European allele frequencies (shown in Fig. 1) produces a probability of ∼4.7 times more likely South Asian. The same likelihood test for an rs371763923-AG heterozygote, applying a conservative ‘global’ G allele frequency of 1 % for all non-South Asian populations, produces a probability of 16.75 times more likely South Asian. Given most populations outside of South Asia have a zero frequency for the specific allele of most of the SNPs chosen for Eurasiaplex-2, Snipper avoids zero-value numerators in LR calculations by applying the default value of 1/(n + 1), where n is the sample size for that population.
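      The single-locus likelihood ratio for the rs371763923-AG heterozygote can be reproduced with Hardy-Weinberg genotype frequencies. The sketch below is not Snipper's implementation. It assumes a South Asian G allele frequency of 0.21, a value inferred here from the 33 % heterozygote proportion quoted in Section 2.1 and not stated directly in the text, and it reads the zero-frequency default as 1/(n + 1).

```python
def genotype_frequency(copies, p):
    """Hardy-Weinberg frequency of a genotype carrying `copies`
    (0, 1 or 2) of an allele whose population frequency is p."""
    q = 1.0 - p
    return {0: q * q, 1: 2.0 * p * q, 2: p * p}[copies]


def likelihood_ratio(copies, p_target, p_other):
    """LR of the genotype in the target population vs. another population."""
    return genotype_frequency(copies, p_target) / genotype_frequency(copies, p_other)


def floor_zero_frequency(freq, n):
    """Replace a zero allele frequency with 1/(n + 1), where n is the
    population sample size (our reading of the default described above)."""
    return freq if freq > 0 else 1.0 / (n + 1)


if __name__ == "__main__":
    # Heterozygote: assumed South Asian frequency 0.21 vs. 'global' 0.01
    print(round(likelihood_ratio(1, 0.21, 0.01), 2))
    # Hypothetical population of 99 samples with no observed specific allele
    print(floor_zero_frequency(0.0, 99))
```

      With these inputs the heterozygote LR is close to the value of 16.75 quoted above; small differences reflect rounding of the assumed frequency.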
      We explored the application of Bayes analysis using the Snipper multiple profiles SNP classifier [], which accepts multiple SNP profiles and generates principal component analysis (PCA) plots in the same test. Allowance was made for non-independence of linked SNPs when applying the chromosome 16 haplotype SNPs by choosing the ‘Hardy-Weinberg principle need not apply’ option in Snipper, which adjusts for association of allele frequencies amongst closely sited SNPs on the same chromosome. Additionally, estimations were made of likely recombination rates amongst the haplotype component SNPs by measuring the recombination fraction values between these markers using the HapMap genetic map for this chromosome, as previously described [
      • Phillips C.
      • Ballard D.
      • Gill P.
      • Syndercombe Court D.
      • Carracedo A.
      • Lareu M.V.
      The recombination landscape around forensic STRs: accurate measurement of genetic distances between syntenic STR pairs using HapMap high density SNP data.
      ].
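      As an illustration of how genetic map positions translate into recombination fractions, the sketch below applies Haldane's map function to the distance between two markers. The text does not state which map function was used for the chromosome 16 SNPs, so the choice of Haldane's function is an assumption, and the centimorgan positions in the example are hypothetical.

```python
import math


def map_distance_cm(pos_a_cm, pos_b_cm):
    """Genetic distance in centimorgans between two map positions."""
    return abs(pos_a_cm - pos_b_cm)


def haldane_recombination_fraction(distance_cm):
    """Haldane's map function: r = 0.5 * (1 - exp(-2d)), with d in Morgans.
    Assumes no crossover interference."""
    d_morgans = distance_cm / 100.0
    return 0.5 * (1.0 - math.exp(-2.0 * d_morgans))


if __name__ == "__main__":
    # Hypothetical genetic map positions (cM) for two linked SNPs
    d = map_distance_cm(55.2, 56.2)
    print(round(haldane_recombination_fraction(d), 4))
```

      For small distances the recombination fraction is close to the map distance expressed in Morgans, and it approaches 0.5 for unlinked markers.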
      An alternative to Bayes LR tests and PCA is to run a genetic cluster algorithm such as STRUCTURE [
      • Porras-Hurtado L.
      • Ruiz Y.
      • Santos C.
      • Phillips C.
      • Carracedo Á.
      • Lareu M.V.
      An overview of STRUCTURE: applications, parameter settings, and supporting software.
      ]. Since we were only compiling markers specific to one population group, there is limited information that can be obtained from STRUCTURE analyses seeking two genetic clusters (i.e., setting analysis runs for a K value of 2). Nevertheless, we performed a simple comparison of STRUCTURE analyses of 1000 Genomes populations using the Eurasiaplex and Eurasiaplex-2 SNP sets to explore the extra power potentially gained from alleles with absolute specificity to a single population group.
      A much simpler and potentially informative alternative to all three of the established population analysis systems is to count the number of specific alleles found in any one individual and assess whether these match the patterns observed across the whole region of interest. In the current study this was done on a large scale with the five South Asian 1000 Genomes sample sets and the HGDP-CEPH samples, and on a much smaller scale with the SGDP and EGDP samples from the Indian sub-continent, which are geographically dispersed and cover a wide range of different populations.
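      The counting approach described above needs very little machinery. The sketch below counts, for one individual, the genotypes that carry at least one copy of the South Asian-specific (Alt) allele. The three SNPs and their Alt alleles are taken from Table 1; the two genotype profiles are hypothetical, and the function name is our own.

```python
# South Asian-specific (Alt) alleles for three panel SNPs, from Table 1
SPECIFIC_ALLELES = {
    "rs191008849": "C",
    "rs370300597": "G",
    "rs371763923": "G",
}


def count_specific_genotypes(profile, specific_alleles=SPECIFIC_ALLELES):
    """Number of SNPs at which the genotype carries the specific allele.

    `profile` maps rs-number -> genotype string such as "AG".
    Missing genotypes are simply not counted.
    """
    return sum(
        1
        for snp, allele in specific_alleles.items()
        if allele in profile.get(snp, "")
    )


if __name__ == "__main__":
    # Hypothetical profiles, for illustration only
    profile_a = {"rs191008849": "TT", "rs370300597": "GG", "rs371763923": "AG"}
    profile_b = {"rs191008849": "TT", "rs370300597": "CC", "rs371763923": "AA"}
    print(count_specific_genotypes(profile_a), count_specific_genotypes(profile_b))
```

      Applied to the full panel, such a count would be compared against the distribution seen in reference samples from the region of interest.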

      3. Results

      3.1 Screening South Asian-specific candidate SNPs

      A total of 123 candidate South Asian-specific SNPs were compiled by applying the allele frequency intersect to the 1000 Genomes Phase III variant catalogue. The full candidate SNP set is listed with summary genomic details and complete genotype data (including 1000 Genomes NYGC high coverage, HGDP-CEPH, SGDP and EGDP genotypes) in Supplementary Tables S1A and S1B, respectively. Genotype concordance was checked by comparing the currently published 1000 Genomes Phase III data used for the allele frequency screening with the high coverage genotype calls from the NYGC re-sequenced 1000 Genomes samples. The 2,505 sample-by-sample comparisons for each of the 123 candidate SNPs are listed in Supplementary Table S1C.
      An initial filter set was applied to exclude SNPs that had one of three characteristics: i. genotype discordancy rates higher than 20 incompatibilities (approximately 1 % or more differences in genotype calls); ii. SNPs that lost the expected specific allele pattern when the NYGC genotypes of South Asian samples were compared to those of the other populations; iii. SNP pairs which were physically well separated on the same chromosome segment, but which showed identical allele frequencies indicating they were in linkage disequilibrium (in such cases, one SNP was chosen). Note that the third screening rule was not applied to SNPs in the chromosome-16 extended haplotype. Fig. 2A shows the ten SNPs with more than 20 discordant genotypes, excluded from further consideration. No suitable SNPs were identified on chromosome 21, and the single SNP on chromosome 22, rs113693449, was excluded due to a high level of genotyping discordancy. Fig. 2B provides summary allele frequency charts for the SNPs that were expected to have South Asian-specific alleles in searches of the 1000 Genomes Phase III data, which were not detected in the high coverage NYGC data: rs3857620, rs199671447 and rs113693449. The NYGC data for SNP rs11103281 maintained the allele expected in South Asian samples, but it was also detected in the other population groups so lacked specificity. All four of the above SNPs had zero South Asian-specific allele frequencies in the Gujarati in Houston US (GIH), which suggested discrepancies in the way these SNPs had been genotyped: notably, the GIH population samples were originally studied by HapMap, and this data may simply have been merged with the South Asian populations added to 1000 Genomes Phase III studies. SNP rs9915709 had a much higher frequency in African samples than South Asians, therefore although this SNP was listed, in practice it would be less informative than other SNPs with zero, or near-zero allele frequencies across all non-South Asian populations. Finally, SNP rs371441513 had the highest level of genotype discordancy, which suggests it has sequencing issues, meaning it is unlikely to be reliably genotyped with any assay developed for Eurasiaplex-2 SNPs.
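      The first screening rule can be sketched as a sample-by-sample comparison of two genotype call sets. The example below is a minimal illustration, not the procedure used to build Supplementary Table S1C: genotypes are treated as unordered allele pairs, only samples present in both call sets are compared, and the sample and SNP identifiers are hypothetical. The threshold of 20 discordant genotypes is the one quoted in the text.

```python
def count_discordant(calls_a, calls_b):
    """Count samples whose genotype differs between two call sets.

    Each argument maps sample ID -> genotype string (e.g. "AG").
    Allele order is ignored, so "AG" and "GA" are concordant.
    """
    shared = calls_a.keys() & calls_b.keys()
    return sum(1 for s in shared if sorted(calls_a[s]) != sorted(calls_b[s]))


def flag_discordant_snps(low_cov, high_cov, max_discordant=20):
    """Return {snp: n_discordant} for SNPs exceeding the threshold.

    Both arguments map SNP ID -> {sample ID -> genotype}.
    """
    flagged = {}
    for snp in low_cov.keys() & high_cov.keys():
        n = count_discordant(low_cov[snp], high_cov[snp])
        if n > max_discordant:
            flagged[snp] = n
    return flagged


if __name__ == "__main__":
    # Hypothetical call sets: 30 samples, 25 of them discordant at snp_x
    low = {"snp_x": {f"s{i}": "AA" for i in range(30)},
           "snp_y": {f"s{i}": "AG" for i in range(30)}}
    high = {"snp_x": {f"s{i}": ("AG" if i < 25 else "AA") for i in range(30)},
            "snp_y": {f"s{i}": "GA" for i in range(30)}}
    print(flag_discordant_snps(low, high))
```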
      Fig. 2. (A) Ten Eurasiaplex-2 candidate SNPs with discordant genotypes between the current 1000 Genomes Phase III data (2–3x sequence coverage) and the high coverage (30x) NYGC re-sequenced samples. Although the top four have relatively low numbers of discordant genotypes, the other six indicate sequence alignment problems or complex sites (e.g., with closely positioned Indels) and all were rejected to minimise the risk of genotyping problems using MPS. (B) Three Eurasiaplex-2 candidate SNPs with misleading allele frequency information in the current 1000 Genomes Phase III variant dataset (upper left-hand pie charts for South Asian populations and population group summaries), contrasting with the allele frequency information from the higher coverage sequence analysis data for the same samples, plus gnomAD South Asian data, where frequency estimates are based on ∼4800 samples (upper right-hand charts). Lower pie charts show additional problems of insufficient specificity in rs9915709 (South Asian-specific allele at a higher frequency in African populations) and rs11103281 (all population groups have the South Asian-specific allele at >0.05 frequencies), plus SNP rs371441513 with the highest recorded genotype discordancy rate indicating a potentially complex variant site.

      3.2 Selecting a core set of South Asian-specific SNPs for Eurasiaplex-2

      3.2.1 Patterns of South Asian-specific genotype distributions

      Table 1 outlines genomic details and summary allele frequencies of the 36 SNPs selected for Eurasiaplex-2. It is noteworthy that only the two SNPs rs77510889 and rs17158407 were previously identified in the 1000 Genomes Phase I variant catalogue (accessible in the SPSmart ENGINES genome browser [
      • Amigo J.
      • Salas A.
      • Phillips C.
      ENGINES: exploring single nucleotide variation in entire human genomes.
      ]). This is mainly due to an absence of South Asian population samples in this first 1000 Genomes SNP genotyping project phase, so a SNP with the Alt allele only present in South Asian populations would consist entirely of Ref allele homozygotes in all Phase I populations and thus not be identified as a variant. SNPs were given internal codes based on their chromosome and RefSeq 5′ to 3′ locations, from 1A to 20 (a code without a letter suffix, such as 20, indicates a single SNP on that chromosome). Generally, gnomAD South Asian-specific allele frequencies were slightly lower than most or all of those in 1000 Genomes. With the exception of rs374908464, the CEPH Pakistani allele frequencies, averaged across the eight populations, are substantially lower than those of either the 1000 Genomes or gnomAD South Asian samples. The overall average South Asian-specific allele frequency of 0.085 in CEPH Pakistani samples is two to three times lower, and 27 of the 36 SNPs would not meet the selection criterion of > 0.1 allele frequency in these samples.
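      Table 1. Genomic details and South Asian-specific allele frequencies from 1000 Genomes, HGDP-CEPH and gnomAD databases of an optimum set of 36 SNPs.
      Column groups: genomic details (No. to Gene); frequency of the alternative (South Asian-specific) allele in 1000 Genomes groups/South Asian populations (African to STU); HGDP-CEPH; gnomAD.
      No. | Code | Chr. | GRCh37 | GRCh38 | rs-number | Ref. | Alt. | Gene | African | European | East Asian | BEB | GIH | ITU | PJL | STU | HGDP-CEPH Pakistani | gnomAD South Asian
      1 | 1A | 1 | 27588988 | 27262497 | rs191008849 | T | C | WDTC1 | 0 | 0.0010 | 0.0010 | 0.2267 | 0.2500 | 0.2353 | 0.1927 | 0.2353 | 0.2171 | 0.1866
      2 | 1B | 1 | 207023473 | 206850128 | rs370300597 | C | G |  | 0 | 0 | 0 | 0.2093 | 0.1977 | 0.1961 | 0.1302 | 0.2549 | 0.2072 | 0.1459
      3 | 2A | 2 | 18702265 | 18520999 | rs373262633 | A | G | LOC105373454 | 0 | 0 | 0 | 0.1802 | 0.1977 | 0.2451 | 0.1354 | 0.1765 | 0.1836 | 0.1336
      4 | 2B | 2 | 98816440 | 98199977 | rs183145214 | A | G | VWA3B | 0 | 0 | 0.0050 | 0.2151 | 0.1744 | 0.1520 | 0.2031 | 0.2157 | 0.1873 | 0.1861
      5 | 3A | 3 | 44250057 | 44208565 | rs578118259 | T | G |  | 0 | 0 | 0.0010 | 0.2558 | 0.2791 | 0.2843 | 0.1875 | 0.2353 | 0.2444 | 0.1817
      6 | 3B | 3 | 159603038 | 159885249 | rs369609492 | C | A | SCHIP1 | 0 | 0 | 0 | 0.2500 | 0.2209 | 0.2157 | 0.0938 | 0.1912 | 0.1836 | 0.1572
      7 | 3C | 3 | 167730032 | 168012244 | rs375081853 | A | G | GOLIM4 | 0 | 0 | 0.0010 | 0.1570 | 0.1512 | 0.2059 | 0.1615 | 0.1912 | 0.1873 | 0.1588
      8 | 4A | 4 | 115827968 | 114906812 | rs182767282 | T | C | NDST4 | 0 | 0.0010 | 0.0010 | 0.1453 | 0.1919 | 0.2157 | 0.2031 | 0.2598 | 0.2270 | 0.1703
      9 | 4B | 4 | 152007237 | 151086085 | rs146398591 | A | G |  | 0 | 0.0010 | 0.0030 | 0.2733 | 0.2500 | 0.3235 | 0.1875 | 0.2794 | 0.2457 | 0.2192
      10 | 4C | 4 | 167677008 | 166755857 | rs554572765 | A | C | SPOCK3 | 0 | 0 | 0 | 0.1860 | 0.1919 | 0.1912 | 0.1458 | 0.2010 | 0.1811 | 0.1068
      11 | 5 | 5 | 33704125 | 33704020 | rs375710694 | T | C | ADAMTS12 | 0 | 0 | 0.0020 | 0.1512 | 0.1628 | 0.2206 | 0.1615 | 0.2059 | 0.1923 | 0.1468
      12 | 6A | 6 | 117206569 | 116885406 | rs186371551 | G | A | RFX6 | 0 | 0.0050 | 0.0050 | 0.1279 | 0.1686 | 0.2304 | 0.1823 | 0.1716 | 0.2084 | 0.1703
      13 | 6B | 6 | 130116672 | 129795527 | rs368650154 | C | T |  | 0 | 0 | 0.0010 | 0.1686 | 0.1395 | 0.2255 | 0.1615 | 0.2402 | 0.1935 | 0.1578
      14 | 6C | 6 | 154459950 | 154138815 | rs368661757 | C | A | OPRM1 | 0 | 0 | 0.0010 | 0.2384 | 0.2500 | 0.2451 | 0.2135 | 0.2549 | 0.2444 | 0.1946
      15 | 7 | 7 | 50238881 | 50199285 | rs368444091 | C | T |  | 0 | 0 | 0 | 0.1686 | 0.1860 | 0.1814 | 0.1667 | 0.1912 | 0.1774 | 0.1396
      16 | 9 | 9 | 97560517 | 94798235 | rs187619767 | C | T | AOPEP | 0 | 0 | 0.0020 | 0.2442 | 0.2849 | 0.1618 | 0.1458 | 0.2255 | 0.1849 | 0.1740
      17 | 10 | 10 | 122095086 | 120335574 | rs77510889* | A | G |  | 0.0714 | 0 | 0.0020 | 0.2267 | 0.2093 | 0.2304 | 0.2031 | 0.2206 | 0.2171 | 0.1593
      18 | 11A | 11 | 29262859 | 29241312 | rs370097977 | T | C |  | 0 | 0 | 0.0050 | 0.1628 | 0.1686 | 0.1912 | 0.2031 | 0.2451 | 0.2258 | 0.1529
      19 | 11B | 11 | 59462759 | 59695286 | rs375766368 | G | A |  | 0 | 0 | 0.0010 | 0.1453 | 0.1512 | 0.2157 | 0.1771 | 0.1520 | 0.1824 | 0.1222
      20 | 11C | 11 | 72175159 | 72464115 | rs377589165 | G | A |  | 0 | 0 | 0 | 0.1512 | 0.1860 | 0.2206 | 0.1615 | 0.1863 | 0.1873 | 0.1514
      21 | 12A | 12 | 4268703 | 4159537 | rs376263717 | C | T |  | 0 | 0.0020 | 0.0010 | 0.2267 | 0.2035 | 0.2598 | 0.1510 | 0.2010 | 0.1911 | 0.1655
      22 | 12B | 12 | 22570119 | 22417185 | rs371763923 | A | G |  | 0 | 0 | 0.0020 | 0.2500 | 0.2442 | 0.2059 | 0.1510 | 0.2304 | 0.2022 | 0.1636
      23 | 12C | 12 | 50428379 | 50034596 | rs368764180 | A | G | RACGAP1 | 0 | 0 | 0.0020 | 0.1570 | 0.1395 | 0.1961 | 0.0885 | 0.1863 | 0.1787 | 0.1388
      24 | 13 | 13 | 56057499 | 55483364 | rs184748067 | G | A |  | 0 | 0.0060 | 0.0069 | 0.1744 | 0.1802 | 0.2353 | 0.1615 | 0.2304 | 0.1998 | 0.1600
      25 | 14 | 14 | 65712298 | 65245580 | rs189013802 | G | A |  | 0 | 0 | 0.0079 | 0.1744 | 0.1744 | 0.2402 | 0.1823 | 0.2059 | 0.1873 | 0.1372
      26 | 15 | 15 | 83236825 | 82568075 | rs17158407* | C | T | CPEB1 | 0 | 0.0010 | 0.0020 | 0.2558 | 0.2558 | 0.2990 | 0.3125 | 0.3333 | 0.2990 | 0.2681
      27 | 16A | 16 | 3178971 | 3128970 | rs368479296 | C | T | ZNF213-AS1 | 0 | 0 | 0.0020 | 0.1802 | 0.1802 | 0.1422 | 0.1667 | 0.2108 | 0.1836 | 0.1645
      28 | 16B | 16 | 23053815 | 23042494 | rs376893831 | G | T |  | 0 | 0 | 0.0010 | 0.2849 | 0.2674 | 0.1569 | 0.1823 | 0.1961 | 0.2146 | 0.1991
      29 | 16C | 16 | 28588059 | 28576738 | rs370130302 | C | G | SGF29 | 0 | 0 | 0.0010 | 0.2907 | 0.2674 | 0.1961 | 0.1719 | 0.2206 | 0.2134 | 0.2024
      30 | 16D | 16 | 33921593 | 34119126 | rs368738705 | C | T |  | 0 | 0 | 0 | 0.4186 | 0.4302 | 0.4853 | 0.3385 | 0.4412 | 0.4007 | 0.3329
      31 | 16E | 16 | 46499858 | 46465946 | rs368538881 | C | T |  | 0 | 0 | 0 | 0.4128 | 0.4070 | 0.4951 | 0.3021 | 0.4608 | 0.4020 | 0.3549
      32 | 16F | 16 | 48327788 | 48293877 | rs377323011 | A | G | LONP2 | 0 | 0 | 0.0010 | 0.3314 | 0.2965 | 0.4314 | 0.2760 | 0.3775 | 0.3362 | 0.2742
      33 | 17A | 17 | 43964966 | 45887600 | rs369091847 | A | T | MAPT-AS1 | 0 | 0 | 0 | 0.1337 | 0.1512 | 0.2402 | 0.1250 | 0.2451 | 0.2035 | 0.1604
      34 | 17B | 17 | 80660204 | 82702328 | rs376153825 | G | C | LOC105376791 | 0 | 0 | 0.0010 | 0.1279 | 0.1279 | 0.1765 | 0.1771 | 0.1716 | 0.1762 | 0.1273
      35 | 19 | 19 | 8371240 | 8306356 | rs374908464 | A | G | CD320 | 0 | 0 | 0.0020 | 0.1802 | 0.2093 | 0.2500 | 0.1719 | 0.1078 | 0.1985 | 0.1740
      36 | 20 | 20 | 4987550 | 5006904 | rs186201674 | C | T | SLC23A2 | 0.001 | 0.0040 | 0.0050 | 0.2093 | 0.2209 | 0.2255 | 0.1875 | 0.2451 | 0.2109 | 0.1545
      Average |  |  |  |  |  |  |  |  | 0.002 | 0.001 | 0.002 | 0.214 | 0.216 | 0.240 | 0.182 | 0.233 | 0.219 | 0.178
      * SNP also identified in 1000 Genomes Phase I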
      The full list of genotypes and summary allele frequency estimates for the 36 SNPs in each population group is given in Supplementary Table S1D. For each of the 4097 samples listed in this table, numbers of South Asian-specific genotypes and alleles were counted. The individual South Asian-specific genotype counts are plotted as blue bars in Fig. 3A (the four 1000 Genomes population groups), and Fig. 3B (all other samples from HGDP-CEPH, SGDP, EGDP and SGVP SNP databases), with the small CEPH Uyghur and EGDP Roma population samples highlighted as red bars for clarity. Aligned above each set of bar plots in Fig. 3A and B are graphic representations of South Asian-specific allele homozygotes (red lines) and heterozygotes (orange lines), with non-specific allele homozygotes in grey. All 36 of these non-specific alleles are the RefSeq reference allele. These graphics highlight the large number of informative genotypes recorded in the 1000 Genomes South Asians, with only rs77510889 (internal code ‘SNP 10’) indicating a higher-than-average number of South Asian-specific genotypes in 1000 Genomes African and CEPH African populations. Note that four SNPs were not genotyped in the EGDP, five in the SGVP Singapore Indian, and two in the Singapore Malay genome datasets. In each of the other databases, South Asian samples are clearly indicated by blue bars with prominent heights and the corresponding dense patterns of orange and red lines. The average number of informative genotypes per individual is given above each South Asian population box (values for the individual CEPH Pakistani populations are given below the bar plots). The contrast is evident between the CEPH Pakistani populations and 1000 Genomes South Asian populations, with an average of ∼14 specific genotypes per individual in the BEB, ITU, GIH and STU populations, dropping to less than 12 in PJL (Punjabi from Lahore, Pakistan).
The CEPH Pakistani populations have much lower average numbers of informative genotypes per individual, which range from ∼8 in the Sindhi to ∼2 in the Hazara samples. In the other datasets, the SGDP and EGDP South Asian, plus SGVP Singapore Indian samples have close to an average of 11 informative genotypes per individual (adjusted for the missing SNPs), which can be taken to represent an overall median number of informative genotypes for this target population group in the 36 SNPs selected. The other 1000 Genomes population groups have the expected very low average informative genotype value of less than 0.2, allowing a degree of differentiation to be made between East Asians and the South Asian-related samples of CEPH Uyghur, occupying regions to the northeast of the Indian sub-continent; and EGDP Roma, a trans-national cultural isolate suggested to have originated from a proto-Romani population living in northwest India [
      • Gómez-Carballa A.
      • Pardo-Seco J.
      • Fachal L.
      • Vega A.
      • Cebey M.
      • Martinón-Torres N.
      • Martinón-Torres F.
      • Salas A.
      Indian signatures in the westernmost edge of the European Romani diaspora: New insight from mitogenomes.
      ]. The Uyghur have relatively low average informative genotype levels of 1.8, but the Roma, although only five individuals and based on 32/36 SNPs, show a high average value of ∼6.7 informative genotypes.
      Fig. 3. (A) South Asian-specific genotype distributions and total number of informative genotypes in each 1000 Genomes sample; specific allele homozygotes marked in red and heterozygotes in orange (both genotypes counted singly) of 36 Eurasiaplex-2 SNPs in twenty 1000 Genomes populations. The average numbers of informative genotypes are shown as group-wide values for Africans, Europeans, and East Asians, and for individual populations for South Asians. Internal codes are used to label each SNP, which are detailed in B. (B) South Asian-specific genotype distributions and total number of informative genotypes in four whole-genome-sequencing human diversity projects, additional to 1000 Genomes: HGDP-CEPH diversity panel; Simons Foundation genome diversity project (SGDP); Estonian Biocentre diversity project (EGDP); Singapore genome variation project (SGVP). HGDP-CEPH Uyghur and EGDP Roma geographic outlier samples are marked in red. The average numbers of informative genotypes are shown individually for eight HGDP-CEPH Pakistani populations and HGDP-CEPH Uyghur; SGDP South Asian samples; EGDP South Asian, SE Asian and Roma samples; SGVP Singapore Indian and Singapore Malay samples. Note EGDP lacks data for four SNPs, SGVP lacks data for five in Singapore Indian and two in Singapore Malay samples. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

      3.2.2 South Asian-specific allele frequency estimates from Eurasiaplex-2 SNP genotypes

      Although genotype frequencies provide indications of the population informativeness of the SNPs selected for Eurasiaplex-2, examination of allele frequencies allows a clearer overview of how the SNP variability is distributed across the Indian sub-continent. When population-wide average South Asian-specific allele frequencies are calculated for each of the 36 SNPs from 1000 Genomes combined South Asian populations, gnomAD samples (approximately 5000 South Asians, with no specific geographic data provided), and CEPH Pakistani datasets, the contrast between Pakistan and the rest of South Asia is further underlined. Fig. 4 gives bar plots for the South Asian-specific average allele frequencies per SNP in each dataset, ranking the markers from most informative to least, left to right. The Fig. 4 plots suggest a close match in frequency estimates between 1000 Genomes and gnomAD data, but much lower specific allele frequency estimates for Pakistanis that are between 10 % and 25 % of the other dataset values. The reciprocal bar plots in Fig. 4 use a 100-fold smaller scale and show most South Asian-specific allele frequencies in other parts of the world rarely exceed 0.001–0.003, or a maximum of 1 in 300. The boxed values for rs77510889 indicate high frequencies for the G-allele outside of South Asia, which exclude population data where this allele was at a particularly high frequency - notably the HGDP-CEPH hunter-gatherer populations of San, Mbuti Pygmies and Biaka Pygmies. South Asian-specific alleles in the rest of the world for rs186371551 and rs184748067 also had higher-than-average frequencies but were more widely dispersed.
      Fig. 4. Allele frequency spectra of the 36 Eurasiaplex-2 SNPs. Markers are arranged in descending 1000 Genomes South Asian-specific average allele frequency and show bars for 1000 Genomes South Asians, gnomAD South Asians and CEPH Pakistanis. Rest-of-the-World average frequencies are compiled from 1000 Genomes African, European, and East Asian average frequencies, plus gnomAD and HGDP-CEPH average non-South Asian population data. The Rest-of-the-World axis shows 1/100th allele frequency values compared to those of South Asian populations. The boxed values of rs77510889 exclude high frequencies for the South Asian-specific allele amongst HGDP-CEPH African Khoisan, Biaka Pygmies and Mbuti Pygmies (rs77510889-G allele frequency of 0.23 in HGDP-CEPH African hunter-gatherer populations vs. 0.06 in HGDP-CEPH Pakistani populations).
      Although sample sizes differ greatly between the populations sampled by 1000 Genomes, CEPH, SGDP and EGDP (approximately 100, 25, 2–3, and 2–6, respectively), it is instructive to map the distribution of South Asian-specific allele frequencies in all the populations studied that are located in, or near, the Indian sub-continent. Fig. 5 provides pie charts of the average percentage of South Asian-specific alleles in each such population sample from 1000 Genomes, CEPH, SGDP and EGDP (percentage values adjusted for the four missing SNPs in EGDP), mapped to their approximate geographic locations. The UK-resident 1000 Genomes Indian Telugu (ITU) and Sri Lankan Tamil (STU), and the US-resident Gujarati (GIH) populations are placed in their approximate locations of origin, and the EGDP Roma have no sampling location described. The CEPH Cambodian and EGDP Aeta from the Philippines are also too distant from the centre of South Asia to be easily placed on this map. Note that the average percentage of South Asian-specific alleles in all other population samples not shown in Fig. 5 was less than 0.5 %, except for the two SGVP samples: Indians in Singapore = 20 %; Malays in Singapore = 0.76 %. Patterns of average allele frequency distributions in the 31 population samples shown in Fig. 5 indicate a high percentage value in most populations within India and Bangladesh, but a sharp drop in these values in populations from regions to the northwest and east. A notable exception is the 1000 Genomes Punjabi from Lahore, Pakistan, a population group occupying the most easterly part of Pakistan at the northwest corner of India. The smallest recorded average percentages of South Asian-specific alleles were observed in the most geographically distant populations: Iranians in the west (1.38 %), Uyghur in the north (2.5 %) and Cambodians in the east (2 %).
      Fig. 5. The average percentage of South Asian-specific alleles (from a total of 72) in each population sample with a recorded value higher than 0.5 %. All such populations are located within or closely sited to the Indian sub-continent apart from the CEPH Cambodian and EGDP Aeta of the Philippines. Populations to the west, east and north of India tend to show much lower average percent specific alleles compared to the maximum value of 24 % seen in the 1000 Genomes STU and ITU population samples. These population samples based in the UK, and the GIH in the US, are positioned in their approximate geographic location. EGDP Roma samples have no stated sampling location.
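
      The percentages mapped in Fig. 5 can be read as the mean specific-allele count divided by the number of alleles typed. The following minimal Python sketch assumes that the adjustment for missing SNPs simply shrinks the denominator to twice the number of genotyped SNPs; the function name and inputs are illustrative and are not taken from the study's own workflow.

```python
def percent_specific_alleles(allele_counts, n_snps_typed=36):
    """Average percentage of South Asian-specific alleles in a population sample.

    allele_counts: per-individual counts of South Asian-specific alleles.
    n_snps_typed:  number of panel SNPs with genotype data (36 for the full
                   panel; 32 is assumed here for a dataset missing four SNPs).
    """
    if not allele_counts:
        raise ValueError("at least one individual is required")
    max_alleles = 2 * n_snps_typed  # 72 alleles for the full 36-SNP panel
    mean_count = sum(allele_counts) / len(allele_counts)
    return 100.0 * mean_count / max_alleles


# Hypothetical counts: two individuals typed for all 36 SNPs,
# and one individual typed for only 32 SNPs.
print(percent_specific_alleles([18, 18]))
print(percent_specific_alleles([16], n_snps_typed=32))
```

      Scaling the denominator rather than the count keeps percentages comparable between datasets with different numbers of typed SNPs, under the assumption that the missing SNPs are no more or less informative than the rest.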

      3.3 Analysis of the six Eurasiaplex-2 SNPs on chromosome 16

      Six SNPs chosen from chromosome 16 (C16) had the highest levels of South Asian specificity and three were clustered around the centromere, potentially meaning they could show full or high levels of allelic association due to reduced recombination, precluding their use as independent loci. Table 2 summarises the recombination rates between the six SNPs using the HapMap genetic map database to estimate map distances in Centimorgans (cM) and Kosambi-adjusted recombination fractions (Rc). The Rc estimates in Table 2 indicate the three 5′ SNPs rs368479296-rs376893831-rs370130302 show minimal linkage with 32 % and 11 % recombination fractions, but the other three 3′ centromeric SNPs rs368738705-rs368538881-rs377323011 have a much lower level of recombination of ∼0.35 % across the 3-SNP span. The top half of Fig. 6 outlines the structural landscape of these three C16 centromeric SNPs and the common genotype combinations they form in South Asian vs. other populations.
      Table 2. HapMap genetic map analysis of the six chromosome 16 South Asian-specific SNPs in Eurasiaplex-2. The map distance was estimated in Centimorgans (cM) and the recombination fraction (Rc) calculated from the cM values using Kosambi adjusted data. The 5′ SNPs rs368479296-rs376893831-rs370130302 show very little linkage with 32 % and 11 % recombination fractions, but the other three 3′ SNPs rs368738705-rs368538881-rs377323011 have a much lower level of recombination of ∼0.35 % across the 3-SNP span.
      Internal ID | SNP | GRCh37 position | GRCh38 position | cM inter-SNP distance | Kosambi-adjusted Rc
      16A | rs368479296 | 3178971 | 3128970 |  | 
      16B | rs376893831 | 23053815 | 23042494 | 38.681 | 0.324515
      16C | rs370130302 | 28588059 | 28576738 | 11.3898 | 0.111968
      16D | rs368738705 | 33921593 | 34119126 | 1.793 | 0.017922
      16E | rs368538881 | 46499858 | 46465946 | 0.0422 | 0.000422
      16F | rs377323011 | 48327788 | 48293877 | 0.3118 | 0.003118
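
      The Kosambi-adjusted Rc values in Table 2 are consistent with the standard Kosambi mapping function, r = 0.5 × tanh(2d), with d in Morgans. A minimal Python sketch of that conversion is given below; the function name is illustrative, and the only inputs are the inter-SNP cM distances listed in Table 2.

```python
import math


def kosambi_recombination_fraction(distance_cm):
    """Convert a genetic map distance in centimorgans to a recombination fraction."""
    morgans = distance_cm / 100.0
    return 0.5 * math.tanh(2.0 * morgans)


# Inter-SNP distances (cM) taken from Table 2, keyed by internal SNP code
table2_cm = {"16B": 38.681, "16C": 11.3898, "16D": 1.793, "16E": 0.0422, "16F": 0.3118}

for code, cm in table2_cm.items():
    print(code, round(kosambi_recombination_fraction(cm), 6))

# Recombination fraction summed across the two centromeric intervals (16D-16E, 16E-16F)
span = kosambi_recombination_fraction(0.0422) + kosambi_recombination_fraction(0.3118)
print(f"3-SNP centromeric span: {100 * span:.2f} %")
```

      For the very short centromeric intervals the function is effectively linear, so the recombination fraction is almost the same as the map distance divided by 100.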
      Fig. 6. The six SNPs of chromosome 16 (red bars) and the haplotype landscape around SNPs 4–6 [rs368738705-rs368538881-rs377323011] closest to the centromere, where Centimorgan (cM) genetic map distances are particularly small (red box). As all three SNPs have alphabetic ordering for the Ref and Alt (South Asian-specific) alleles, i.e., CT, CT, AG, respectively, all 1000 Genomes heterozygous genotypes were alphabetised, and homozygote genotypes counted in order to record recombination between the root haplotype CCA and the South Asian-specific haplotype TTG. In this way rs368738705 CC or TT homozygote genotypes indicated recombination between rs368738705-rs368538881 (‘4 × 5 6’); rs377323011 AA or GG indicated recombination between rs368538881-rs377323011 (‘4 5 × 6’); and double sequential recombination events (‘4 × 5 × 6’) were recorded as rs368538881 CC or TT homozygotes. Counts are given for the 1000 Genomes South Asians (1KG SAS) vs. all other 1000 Genomes populations (1KG Other, excluding admixed individuals), indicating widespread disruption of CCA and TTG haplotypes and likely minor levels of association amongst the South Asian-specific alleles of these three SNPs. The single CCG haplotype recorded in a non-South Asian individual was inferred from CC-CC-AG genotypes. Mb: megabase; KHV: Kinh in Ho Chi Minh City, Vietnam.
      To explore the possibility of association between the three centromeric SNPs rs368738705-rs368538881-rs377323011, we decided to treat them as a haplotype and gauge haplotype diversity in the 1000 Genomes samples (excluding admixed samples). Although the Rc values between these SNPs are very low, the physical distances in megabases (Mb) are much bigger, with the 12.35 Mb span between rs368738705 and rs368538881 alone representing nearly 14 % of the total C16 length. Therefore, it would not be possible for 1000 Genomes to accomplish accurate phasing of alleles for this series of SNPs over such distances. We decided to convert any localised phasing (i.e., a SNP allele’s phase with reference to its immediate neighbours) made by 1000 Genomes of rs368738705-rs368538881-rs377323011 heterozygotes into alphabetic order, respectively: TC>CT, TC>CT, GA>AG. This created the root haplotype of CCA, universally present in all 1000 Genomes non-South Asian samples, plus the South Asian-specific TTG haplotype, exclusively confined to these populations in 1000 Genomes. In this way, any 3-SNP genotypes which are not CC-CC-AA, TT-TT-GG, or CT-CT-AG represent disruptions to the South Asian-specific haplotypes, which signify reduced allelic association. We counted the derived haplotypes amongst South Asian individuals as a homozygous genotype at one SNP combined with heterozygous genotypes at the other two, specifically: CC-CT-AG and TT-CT-AG in rs368738705; CT-CC-AG and CT-TT-AG in rs368538881; CT-CT-AA and CT-CT-GG in rs377323011. These patterns can be interpreted to indicate disrupting recombination between rs368738705-rs368538881, double recombination between rs368738705-rs368538881 and rs368538881-rs377323011, or recombination between rs368538881-rs377323011, respectively. Although true phasing cannot be achieved, the extent to which the above six derived haplotypes occur in 1000 Genomes populations will indicate the level of disruption of allelic association amongst the three SNPs. 
The rs368738705-rs368538881-rs377323011 inferred haplotypes for all population samples are listed in Supplementary File S1E and the numbers of each haplotype in 1000 Genomes populations are summarised in the lower half of Fig. 6. Except for a singleton CCG haplotype (present as CC-CC-AG genotypes in a Vietnamese KHV sample), all 3,021 other non-South Asian haplotypes from 1000 Genomes were the CCA root haplotype. Amongst the South Asian samples, 206 haplotypes were inferred to be the specific TTG haplotype and 464 the non-specific CCA haplotype, but a significant number of South Asian haplotypes, a total of 308, were derived, i.e., inferred to be different combinations of alleles to either CCA or TTG. This would suggest there is very little association between the centromeric C16 SNPs. Furthermore, levels of recombination between these three SNPs are likely to be higher, given that the South Asian individuals with CT-CT-AG genotypes (88 of 489, 18 %) cannot be reliably phased.
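
The genotype-based reasoning above can be sketched in a few lines of Python. The sketch only sorts a 3-SNP genotype into the categories named in the text (intact root CCA, intact specific TTG, the unphaseable triple heterozygote, or a disrupted combination); it does not attempt true phasing, and the function and category names are illustrative assumptions rather than part of the published analysis.

```python
ROOT = ("CC", "CC", "AA")          # two copies of the root haplotype CCA
SPECIFIC = ("TT", "TT", "GG")      # two copies of the South Asian-specific haplotype TTG
TRIPLE_HET = ("CT", "CT", "AG")    # compatible with CCA + TTG, but phase cannot be confirmed


def alphabetise(genotype):
    """Put the two alleles of a genotype in alphabetic order, e.g. 'TC' -> 'CT'."""
    return "".join(sorted(genotype))


def classify_three_snp_genotype(genotypes):
    """Classify a 3-SNP genotype, given in chromosome order, by haplotype disruption."""
    ordered = tuple(alphabetise(g) for g in genotypes)
    if ordered == ROOT:
        return "root"
    if ordered == SPECIFIC:
        return "specific"
    if ordered == TRIPLE_HET:
        return "unphased"
    return "disrupted"  # any other combination implies a derived haplotype


print(classify_three_snp_genotype(("TC", "TC", "GA")))
print(classify_three_snp_genotype(("CC", "CT", "AG")))
```

Counting the "disrupted" category across a population sample gives a lower bound on haplotype disruption, because some of the "unphased" triple heterozygotes may also carry derived haplotypes.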

      3.4 Statistical analyses

      3.4.1 Conventional Bayes analysis of South Asian population variability

      The results of the Bayes analysis likelihood assessments and PCA patterns generated by Snipper are summarised in Supplementary File S1. First, evaluations were made of pairwise cumulative Divergence (In) values calculated for South Asian vs. European, and vs. East Asian populations, using 1000 Genomes data and comparing the 23 SNPs of the original Eurasiaplex with the 36 of the Eurasiaplex-2 panel. Supplementary File S1.1 shows that the cumulative In for South Asians vs. Europeans at the point of 23 SNPs has similar values in Eurasiaplex (1.87) and Eurasiaplex-2 (2.01), but for South Asian vs. East Asian variation Eurasiaplex only reaches 0.77, compared to 1.99 from 23 Eurasiaplex-2 SNPs. This highlights how specific-allele frequencies of zero, or near zero, outside of the targeted population produce almost identical In values in all population comparisons and for each SNP added. This is expressed in the cumulative In chart in Supplementary File S1 as two diverging and flattening curves in Eurasiaplex, compared with two straight lines with identical trajectories in Eurasiaplex-2. Therefore, when the final cumulative values are calculated for the 36 Eurasiaplex-2 SNPs, the two population comparisons have almost identical values of 2.84 (vs. Europe) and 2.79 (vs. East Asia). When population-specific SNPs are combined in the future to differentiate all the main population groups, it will be straightforward to balance the In values for each comparison, as this will just entail adjusting the number of SNPs needed to reach a final cumulative value that can be matched across all populations.
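
      To illustrate why near-zero frequencies outside the target group give almost the same Divergence whichever population is compared, the following minimal Python sketch computes Rosenberg's informativeness for assignment (In) for one biallelic SNP and two populations. It assumes that this is the In statistic reported by Snipper and that natural logarithms are used; the allele frequencies in the example are illustrative and are not values from the supplementary files.

```python
import math


def _x_log_x(p):
    """Return p * ln(p), using the limit value 0 when p is 0."""
    return p * math.log(p) if p > 0 else 0.0


def informativeness(freq_pop1, freq_pop2):
    """Pairwise In for a biallelic SNP, from one allele's frequency in each population."""
    total = 0.0
    # Sum over both alleles: the specific allele and its alternative
    for p1, p2 in ((freq_pop1, freq_pop2), (1.0 - freq_pop1, 1.0 - freq_pop2)):
        mean = (p1 + p2) / 2.0
        total += -_x_log_x(mean) + 0.5 * (_x_log_x(p1) + _x_log_x(p2))
    return total


# A specific allele at 0.22 in the target group, compared with two
# other groups where it is absent or present at 0.001.
print(round(informativeness(0.22, 0.0), 4))
print(round(informativeness(0.22, 0.001), 4))
```

      Because the second population contributes almost nothing when its frequency is at or near zero, the per-SNP value depends almost entirely on the target-population frequency, which is the pattern of parallel cumulative lines described above.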
      Second, the distribution of likelihood ratios (LR) from the comparison of 1000 Genomes South Asian and African likelihoods (Africans produced the second highest likelihoods in five population comparisons) using Bayes analyses in Snipper, was compiled in a chart of ranked LRs shown in Supplementary File S1.2. These values are generally much higher than those observed with alternative ancestry SNPs [
      • Phillips C.
      • Parson W.
      • Lundsberg B.
      • Santos C.
      • Freire-Aradas A.
      • Torres M.
      • Eduardoff M.
      • Børsting C.
      • Johansen P.
      • Fondevila M.
      • et al.
      Building a forensic ancestry panel from the ground up: the EUROFORGEN Global AIM-SNP set.
      ,
      • de la Puente M.
      • Ruiz-Ramírez J.
      • Ambroa-Conde A.
      • Xavier C.
      • Pardo-Seco J.
      • Álvarez-Dios J.
      • Freire-Aradas A.
      • Mosquera-Miguel A.
      • Gross T.E.
      • Cheung E.Y.Y.
      • et al.
      Development and evaluation of the ancestry informative marker panel of the VISAGE basic tool.
      ], with the bulk of samples producing LR values in excess of 1E+12, or ‘a trillion times more likely to be from South Asia than Africa’. Nevertheless, five individuals had LR values below ‘2000 times more likely’, all corresponding to samples with the lowest number (6) of South Asian-specific alleles. Third, two PCA plots are shown in Supplementary File S1.3 from analyses in Snipper of 1000 Genomes YRI, CEU, CHB, and CEPH Native American, Oceanian populations compared with CEPH Pakistanis (Plot 1), and compared with 1000 Genomes GIH (Plot 2). These plots illustrate the very tightly distributed set of PCA points in populations outside of South Asia obtained with Eurasiaplex-2 SNPs. The target population PCA point distributions are different between Pakistanis and GIH, with Pakistanis equally diffuse in distribution, but overlapping with the much smaller area of the 2D plot occupied by populations outside of South Asia. While PCA itself would not be a system of choice for assigning ancestry and this represents analysis with a single set of population-specific markers, it is interesting that zero or near-zero allele frequencies in most populations cause samples, even in large numbers, to occupy a very small area of the PCA plot.
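
      The scale of these likelihood ratios can be understood from a simple naive Bayes calculation. The minimal Python sketch below assumes independent SNPs, Hardy-Weinberg genotype probabilities, and a small frequency floor to avoid zero likelihoods; the floor value and the function names are assumptions made for illustration and do not describe how Snipper itself handles zero frequencies.

```python
import math


def genotype_log_likelihood(alt_counts, alt_freqs, floor=0.001):
    """Natural-log likelihood of a multi-SNP profile in one population.

    alt_counts: number of specific (Alt) alleles per SNP (0, 1 or 2).
    alt_freqs:  frequency of the specific allele per SNP in the population.
    floor:      assumed minimum frequency, so absent alleles do not give log(0).
    """
    log_likelihood = 0.0
    for count, freq in zip(alt_counts, alt_freqs):
        p = min(max(freq, floor), 1.0 - floor)
        q = 1.0 - p
        genotype_prob = {0: q * q, 1: 2.0 * p * q, 2: p * p}[count]
        log_likelihood += math.log(genotype_prob)
    return log_likelihood


def log10_likelihood_ratio(alt_counts, freqs_a, freqs_b, floor=0.001):
    """log10 of likelihood(population A) / likelihood(population B)."""
    diff = (genotype_log_likelihood(alt_counts, freqs_a, floor)
            - genotype_log_likelihood(alt_counts, freqs_b, floor))
    return diff / math.log(10)


# Illustrative profile: six heterozygous SNPs out of 36, with the specific
# allele at 0.2 in population A and absent (floored) in population B.
counts = [1] * 6 + [0] * 30
print(round(log10_likelihood_ratio(counts, [0.2] * 36, [0.0] * 36), 2))
```

      Under these assumptions each specific allele observed adds roughly two orders of magnitude to the ratio, while each SNP without the specific allele subtracts a little, so profiles with many specific alleles quickly reach very large values.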

      3.4.2 Genetic cluster analysis with STRUCTURE comparing Eurasiaplex and Eurasiaplex-2 SNPs

      Supplementary File S1.4 shows the cluster plots from separate STRUCTURE analysis of 1000 Genomes populations using the original 23 Eurasiaplex SNPs and the 36 Eurasiaplex-2 SNPs. Patterns show that the Eurasiaplex-2 SNP set clearly distinguishes South Asians from all other 1000 Genomes populations at K:2, with a clean set of columns and some minor mixed cluster proportions in the PJL. As there is almost no genetic variation present in non-South Asian samples for Eurasiaplex-2 SNPs, no other genetic clusters are identified at K:3, K:4, or higher K values (data not shown). It is noteworthy that the South Asian-specific rs77510889-G allele detected in African populations did not produce an African cluster for any higher K values analysed. In contrast, the 23 Eurasiaplex SNPs distinguish Europeans as the first major genetic cluster at K:2, then Africans at K:3, with South Asians only emerging as a differentiated population group at K:4. To some degree, these patterns are likely to reflect the selection of the original Eurasiaplex SNPs that had strongly contrasting allele frequencies between Europeans and other population groups, including small, but above-average allele frequency differences between Europeans and South Asians.

      3.4.3 Exploration of a simple South Asian-specific allele counting system

      Arguably, a much more straightforward system of population assignment than Bayes analysis can be achieved by simply counting the number of South Asian-specific Eurasiaplex-2 alleles in an individual. Fig. 3 illustrating the worldwide distribution of genotype counts, and Fig. 5, those of allele counts greater than 0.5 %, show they are both almost completely confined to the Indian sub-continent and adjoining areas. These patterns show that highly contrasted allele counts in individuals from South Asia vs. individuals from other regions will give strong indications of origins from the regions targeted by Eurasiaplex-2, provided there is no overlap between minimum and maximum counts from each set of populations. Fig. 7 plots the distribution of South Asian-specific allele counts in the four 1000 Genomes population groups plus CEPH Pakistanis. In the 1000 Genomes Africans, Europeans and East Asians, the majority of samples, some 80–90 %, have no specific alleles present in any genotypes. Africans have 16 % of genotypes with a single specific allele, mainly due to the rs77510889-G allele in these populations, but that represents the upper limit of specific allele counts in this population group. Only East Asians have a few individuals with more than one specific allele, with five samples having two alleles and a singleton with a maximum of three. The lower limit of specific alleles in South Asians is 5, with a singleton sample with this number, and then ten with 6 alleles, meaning there is no overlap in counts between the South Asians of 1000 Genomes and all individuals from the other population groups. Although a count of three, four or five specific alleles might be considered ambiguous, such a small number of individuals could stay unassigned. Therefore, there would be a very low probability of incorrect assignment of individuals using a lower limit of six specific alleles to signal South Asian origins.
      Fig. 7. Distribution of South Asian-specific allele counts in the four 1000 Genomes population groups, showing a bell-shaped distribution of specific alleles in the South Asian populations, with the lowest value represented by a single South Asian individual with five specific alleles. There is no overlap with the distribution of specific allele counts in the other 1000 Genomes populations, where a single East Asian individual has the maximum of three specific alleles. CEPH Pakistani samples show a degree of overlap, with a single individual having no specific alleles and more than half of Pakistani samples having fewer than the nominal lower limit of six specific alleles used to signify South Asian origins.
      The CEPH Pakistanis show a distribution with a greater degree of overlap with populations outside of South Asia, which is not unexpected, but strict adherence to a minimum of six specific alleles means that just over half of Pakistanis (88 of 168, with five or fewer South Asian-specific alleles) are not assigned as South Asian. We intend to develop a statistical system for the handling of population-specific allele counts, based on hypothesis testing where the null hypothesis represents origins in the specific-allele target population. This will be more easily accomplished when a globally applicable set of population-specific SNPs has been compiled for all population groups.
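
      The counting approach explored in this section can be sketched as follows. This minimal Python example counts specific alleles in a genotype profile and applies the nominal lower limit of six, leaving counts of three to five unassigned as suggested above. Labelling counts of two or fewer as "not South Asian" is an assumption made for the sketch, as are the function names and the example profile.

```python
def count_specific_alleles(genotypes, specific_alleles):
    """Count South Asian-specific alleles in one individual's profile.

    genotypes:        dict of rs-number -> two-letter genotype, e.g. "AG".
    specific_alleles: dict of rs-number -> the specific (Alt) allele.
    SNPs missing from either dict are skipped.
    """
    return sum(
        genotype.count(specific_alleles[snp])
        for snp, genotype in genotypes.items()
        if snp in specific_alleles
    )


def assign_by_count(n_specific, lower_limit=6, max_outside=2):
    """Assign origin from a specific-allele count using fixed thresholds."""
    if n_specific >= lower_limit:
        return "South Asian"
    if n_specific > max_outside:
        return "unassigned"  # ambiguous counts of three to five
    return "not South Asian"  # assumed label for counts of two or fewer


# Alt alleles for two panel SNPs, as listed in Table 1
panel = {"rs77510889": "G", "rs17158407": "T"}
profile = {"rs77510889": "AG", "rs17158407": "TT"}  # hypothetical individual

n = count_specific_alleles(profile, panel)
print(n, assign_by_count(n))
```

      Fixed thresholds of this kind are easy to apply but, as the CEPH Pakistani results show, they leave many individuals from the edges of the target region unassigned, which is the motivation for a formal hypothesis-testing framework.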

      4. Discussion

      By constructing a SNP panel composed of a completely new type of ancestry marker focused on variation that is specific to the single targeted South Asia population group, rather than using SNPs with highly contrasting but shared variation across multiple populations, we have identified a characteristic signature of South Asian origin in almost all individuals from the Indian sub-continent. This specific-allele signature, illustrated by the dense pattern of orange and red bars in Fig. 3, is clearly observed across a large, broadly-based collection of South Asian samples taken from across the world. Apart from rs77510889, that showed a relatively high frequency of the South Asian-specific G-allele in Africans, all other SNP alleles chosen to be specific to South Asia had frequencies below 0.005 (0.5 %) in populations outside this region. Applying the same allele frequency intersect to other 1000 Genomes population groups plus Oceanian, American, and Middle East populations represented in the HGDP-CEPH panel, will allow the compilation of a global population-specific ancestry marker set, with marker numbers appropriate for MPS-scale SNP multiplexes. The process of building a large MPS multiplex requires some adjustments of targeted SNPs with poor context sequence or flanking region variation that interferes with sequence alignments or primer binding. For this reason, we chose to report a full list of suitable South Asian-specific candidate loci, rather than build a small-scale multiplex of 30–40 markers, as we have done for many forensic SNP panels previously [
      • Phillips C.
      • Freire Aradas A.
      • Kriegel A.K.
      • Fondevila M.
      • Bulbul O.
      • Santos C.
      • Serrulla Rech F.
      • Perez Carceles M.D.
      • Carracedo Á.
      • Schneider P.M.
      • Lareu M.V.
      Eurasiaplex: a forensic SNP assay for differentiating European and South Asian ancestries.
      ,
      • de la Puente M.
      • Santos C.
      • Fondevila M.
      • Manzo L.
      • EUROFORGEN-NoE Consortium
      • Carracedo A.
      • Lareu M.V.
      • Phillips C.
      The Global AIMs Nano set: a 31-plex SNaPshot assay of ancestry-informative SNPs.
      ,
      • Phillips C.
      • Manzo L.
      • de la Puente M.
      • Fondevila M.
      • Lareu M.V.
      The MASTiFF panel - a versatile multiple-allele SNP test for forensics.
      ]. The near-equal cumulative Divergence values between South Asians vs. East Asians and vs. Europeans shown by the 36 SNPs in Eurasiaplex-2, will make the process of balancing SNPs specific for each population group relatively straightforward, as the number of markers can be adjusted to produce a comparable average number of population-specific alleles per individual from that population.
      The variant data and their statistical treatment, which we have briefly explored in this study, require a rethink of how best to adapt highly population-specific SNP allele patterns of variation into a forensic ancestry prediction framework. We do not feel that Bayes analysis or PCA will provide the necessary detailed assessments of the number of alleles specific to a given population that are detected in an individual. We expect STRUCTURE to be more sensitive to specific-allele patterns as well as being able to efficiently analyse co-ancestry in individuals with admixed backgrounds [
      • de la Puente M.
      • Ruiz-Ramírez J.
      • Ambroa-Conde A.
      • Xavier C.
      • Pardo-Seco J.
      • Álvarez-Dios J.
      • Freire-Aradas A.
      • Mosquera-Miguel A.
      • Gross T.E.
      • Cheung E.Y.Y.
      • et al.
      Development and evaluation of the ancestry informative marker panel of the VISAGE basic tool.
      ,
      • Ruiz-Ramírez J.
      • de la Puente M.
      • Xavier C.
      • Ambroa-Conde A.
      • Álvarez-Dios J.
      • Freire-Aradas A.
      • Mosquera-Miguel A.
      • Ralf A.
      • Amory C.
      • Katsara M.A.
      • et al.
      Development and evaluations of the ancestry informative markers of the VISAGE enhanced tool for appearance and ancestry.
      ]. Until a global panel of population-specific SNPs is constructed, this expectation will need to be tested in silico once enough candidate SNPs for each population group have been compiled. It would also be useful to begin developing a statistical framework centred on formal testing of the hypotheses H1: the individual originates from the target population vs. H2: the individual originates from another population.
      An ancestry SNP selection process that enriches for markers with near-absolute specificity is more likely to detect variation that arose in the targeted population through evolutionary genetic processes less common than those underlying traditionally compiled ancestry sets (i.e., natural selection and genetic drift). These processes might include recent mutation events creating new SNP variants confined to a specific geographic region [
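      One possible shape for such a count-based test is sketched below. This is not the authors' method; it is an illustration that assumes independent loci, Hardy-Weinberg proportions, and a small non-zero frequency floor (0.001 here) for the non-target population so that the likelihood under H2 is never exactly zero. All function names and frequencies are hypothetical.

```python
import math


def count_distribution(freqs):
    """Exact distribution of the total number of specific alleles carried
    across independent biallelic loci, each in Hardy-Weinberg proportions."""
    dist = [1.0]  # probability of carrying 0 alleles before any locus is added
    for p in freqs:
        q = 1.0 - p
        locus = [q * q, 2.0 * p * q, p * p]  # P(0), P(1), P(2) alleles
        new = [0.0] * (len(dist) + 2)
        for k, pk in enumerate(dist):
            for g, pg in enumerate(locus):
                new[k + g] += pk * pg
        dist = new
    return dist


def count_likelihood_ratio(k, target_freqs, other_freqs):
    """P(k specific alleles | H1: target population) divided by
    P(k specific alleles | H2: another population)."""
    h1 = count_distribution(target_freqs)[k]
    h2 = count_distribution(other_freqs)[k]
    return math.inf if h2 == 0.0 else h1 / h2


def tail_probability(k, freqs):
    """P(at least k specific alleles) under the given frequencies."""
    return sum(count_distribution(freqs)[k:])


# Hypothetical 36-SNP panel: frequency 0.25 in the target population and a
# floor of 0.001 in the alternative population.
target = [0.25] * 36
other = [0.001] * 36
observed = 12
print(count_likelihood_ratio(observed, target, other))
print(tail_probability(observed, other))
```

      The tail probability gives the chance that an individual from the alternative population carries at least the observed number of specific alleles, which is the kind of quantity a formal H1 vs. H2 test would need to report alongside the likelihood ratio.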
      • Williams L.M.
      • Oleksiak M.F.
      Ecologically and evolutionarily important SNPs identified in natural populations.
      ]; gene flow from Hominin introgression taking place in a particular locality [
      • Sankararaman S.
      • Patterson N.
      • Li H.
      • Pääbo S.
      • Reich D.
      The date of interbreeding between Neandertals and modern humans.
      ,
      • Huerta-Sánchez E.
      • Casey F.P.
      Archaic inheritance: supporting high-altitude life in Tibet.
      ]; localised selective sweeps, which might favour certain low-frequency variants that then become region-specific [
      • Chen H.
      • Patterson N.
      • Reich D.
      Population differentiation as a test for selective sweeps.
      ]. Any of these processes could have occurred after the South Asian root populations of Ancestral North Indians and Ancestral South Indians [
      • Reich D.
      • Thangaraj K.
      • Patterson N.
      • Price A.L.
      • Singh L.
      Reconstructing Indian population history.
      ,
      • Majumder P.P.
      The human genetic history of South Asia.
      ] separated from other Eurasian groups of populations [
      • Narasimhan V.M.
      • Patterson N.
      • Moorjani P.
      • Rohland N.
      • Bernardos R.
      • Mallick S.
      • Lazaridis I.
      • Nakatsuka N.
      • Olalde I.
      • Lipson M.
      • et al.
      The formation of human populations in South and Central Asia.
      ]. There is also the additional complexity of the caste system in Indian populations creating highly stratified distributions of variability across the sub-continent (e.g., the Brahmin caste has higher Iranian ancestry than other Indian castes and this differentiation would be maintained by reduced outbreeding across castes [
      • Debortoli G.
      • Abbatangelo C.
      • Ceballos F.
      • Fortes-Lima C.
      • Norton H.L.
      • Ozarkar S.
      • Parra E.J.
      • Jonnalagadda M.
      Novel insights on demographic history of tribal and caste groups from West Maharashtra (India) using genome-wide data.
      ]). The Silk Roads were a strong driver of East-West gene flow across the central parts of Eurasia, but they were largely routed to the north of the Himalayas, which acted as a lengthy barrier to mass movements into the Indian sub-continent. Overall, the South Asian SNP variation we have identified and compiled represents only a very small proportion of total genomic variability, but the markers have maintained their high levels of specificity by consistently showing zero, or near-zero, allele frequencies in every region outside the Indian sub-continent studied so far.

      Acknowledgements

      M.d.l.P. is supported by a post-doctorate grant funded by the Consellería de Cultura, Educación e Ordenación Universitaria e da Consellería de Economía, Emprego e Industria from Xunta de Galicia, Spain (ED481D-2021-008). J.R. is supported by the “Programa de axudas á etapa predoutoral” funded by the Consellería de Cultura, Educación e Ordenación Universitaria e da Consellería de Economía, Emprego e Industria from Xunta de Galicia, Spain (ED481A-2020-039).

      Appendix A. Supplementary material

      References

        • Phillips C.
        • Freire Aradas A.
        • Kriegel A.K.
        • Fondevila M.
        • Bulbul O.
        • Santos C.
        • Serrulla Rech F.
        • Perez Carceles M.D.
        • Carracedo Á.
        • Schneider P.M.
        • Lareu M.V.
        Eurasiaplex: a forensic SNP assay for differentiating European and South Asian ancestries.
        Forensic Sci. Int. Genet. 2013; 7: 359-366
        • Reich D.
        • Thangaraj K.
        • Patterson N.
        • Price A.L.
        • Singh L.
        Reconstructing Indian population history.
        Nature. 2009; 461: 489-494
        • Majumder P.P.
        The human genetic history of South Asia.
        Curr. Biol. 2010; 20: R184-R187
        • Phillips C.
        • Parson W.
        • Lundsberg B.
        • Santos C.
        • Freire-Aradas A.
        • Torres M.
        • Eduardoff M.
        • Børsting C.
        • Johansen P.
        • Fondevila M.
        • et al.
        Building a forensic ancestry panel from the ground up: the EUROFORGEN Global AIM-SNP set.
        Forensic Sci. Int. Genet. 2014; 11: 13-25
        • Kidd J.R.
        • Friedlaender F.R.
        • Speed W.C.
        • Pakstis A.J.
        • De La Vega F.M.
        • Kidd K.K.
        Analyses of a set of 128 ancestry informative single-nucleotide polymorphisms in a global set of 119 population samples.
        Invest. Genet. 2011; 2: 1
        • de la Puente M.
        • Ruiz-Ramírez J.
        • Ambroa-Conde A.
        • Xavier C.
        • Pardo-Seco J.
        • Álvarez-Dios J.
        • Freire-Aradas A.
        • Mosquera-Miguel A.
        • Gross T.E.
        • Cheung E.Y.Y.
        • et al.
        Development and evaluation of the ancestry informative marker panel of the VISAGE basic tool.
        Genes. 2021; 12: 1284
        • Rosenberg N.A.
        • Li L.M.
        • Ward R.
        • Pritchard J.K.
        Informativeness of genetic markers for inference of ancestry.
        Am. J. Hum. Genet. 2003; 73: 1402-1422
        • Pfaffelhuber P.
        • Grundner-Culemann F.
        • Lipphardt V.
        • Baumdicker F.
        How to choose sets of ancestry informative markers: a supervised feature selection approach.
        Forensic Sci. Int. Genet. 2020; 46: 102259
        • The 1000 Genomes Project Consortium
        • Auton A.
        • Brooks L.D.
        • Durbin R.M.
        • Garrison E.P.
        • Kang H.M.
        • Korbel J.O.
        • Marchini J.L.
        • McCarthy S.
        • McVean G.A.
        • et al.
        A global reference for human genetic variation.
        Nature. 2015; 526: 68-74
      1. M. Byrska-Bishop, U.S. Evani, X. Zhao, A.O. Basile, H.J. Abel, A.A. Regier, A. Corvelo, W.E. Clarke, R. Musunuri, K. Nagulapalli, et al., High coverage whole genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios, bioRxiv preprint, posted February 7, 2021, doi: 〈https://doi.org/10.1101/2021.02.06.430068〉.

        • Zhao S.
        • Shi C.-M.
        • Ma L.
        • Liu Q.
        • Liu Y.
        • Wu F.
        • Chi L.
        • Chen H.
        AIM-SNPtag: a computationally efficient approach for developing ancestry-informative SNP panels.
        Forensic Sci. Int. Genet. 2019; 38: 245-253
        • Ruiz-Ramírez J.
        • de la Puente M.
        • Xavier C.
        • Ambroa-Conde A.
        • Álvarez-Dios J.
        • Freire-Aradas A.
        • Mosquera-Miguel A.
        • Ralf A.
        • Amory C.
        • Katsara M.A.
        • et al.
        Development and evaluations of the ancestry informative markers of the VISAGE enhanced tool for appearance and ancestry.
        Forensic Sci. Int. Genet. 2022;
      2. 〈http://www.ensembl.org/Homo_sapiens/Variation/Explore?r=6:60527829–60528829;v=rs3857620;vdb=variation;vf=169483878〉 (Accessed June 2022).

        • Lek M.
        • Karczewski K.J.
        • Minikel E.V.
        • Samocha K.E.
        • Banks E.
        • Fennell T.
        • O’Donnell-Luria A.H.
        • Ware J.S.
        • Hill A.J.
        • Cummings B.B.
        • et al.
        Analysis of protein-coding genetic variation in 60,706 humans.
        Nature. 2016; 536: 285-291
        • Bergström A.
        • McCarthy S.A.
        • Hui R.
        • Almarri M.A.
        • Ayub Q.
        • Danecek P.
        • Chen Y.
        • Felkel S.
        • Hallast P.
        • Kamm J.
        • et al.
        Insights into human genetic variation and population history from 929 diverse genomes.
        Science. 2020; 367: 1339-1349
        • Mallick S.
        • Li H.
        • Lipson M.
        • Mathieson I.
        • Gymrek M.
        • Racimo F.
        • Zhao M.
        • Chennagiri N.
        • Nordenfelt S.
        • Tandon A.
        • et al.
        The Simons Genome Diversity Project: 300 genomes from 142 diverse populations.
        Nature. 2016; 538: 201-206
        • Pagani L.
        • Lawson D.J.
        • Jagoda E.
        • Mörseburg A.
        • Eriksson A.
        • Mitt M.
        • Clemente F.
        • Hudjashov G.
        • DeGiorgio M.
        • Saag L.
        • et al.
        Genomic analyses inform on migration events during the peopling of Eurasia.
        Nature. 2016; 538: 238-242
        • Wong L.-P.
        • Ong R.T.
        • Poh W.T.
        • Liu X.
        • Chen P.
        • Li R.
        • Koi-Yau Lam K.
        • Esakimuthu Pillai N.
        • Sim K.-S.
        • Xu H.
        • et al.
        Deep whole-genome sequencing of 100 southeast Asian Malays.
        Am. J. Hum. Genet. 2013; 92: 52-66
        • Wong L.-P.
        • Kuan-Han Lai J.
        • Saw W.-Y.
        • Ong R.T.
        • Youzhi Cheng A.
        • Esakimuthu Pillai N.
        • Liu X.
        • Xu W.
        • Chen P.
        • Foo J.-N.
        • et al.
        Insights into the genetic structure and diversity of 38 South Asian Indians from deep whole-genome sequencing.
        PLoS Genet. 2014; 10: e1004377
      3. 〈http://mathgene.usc.es/snipper/analysismultipleprofiles.html〉.

        • Phillips C.
        • Ballard D.
        • Gill P.
        • Syndercombe Court D.
        • Carracedo A.
        • Lareu M.V.
        The recombination landscape around forensic STRs: accurate measurement of genetic distances between syntenic STR pairs using HapMap high density SNP data.
        Forensic Sci. Int. Genet. 2012; 6: 354-365
        • Porras-Hurtado L.
        • Ruiz Y.
        • Santos C.
        • Phillips C.
        • Carracedo Á.
        • Lareu M.V.
        An overview of STRUCTURE: applications, parameter settings, and supporting software.
        Front. Genet. 2013; 4: 98
        • Amigo J.
        • Salas A.
        • Phillips C.
        ENGINES: exploring single nucleotide variation in entire human genomes.
        BMC Bioinf. 2011; 12: 105
        • Gómez-Carballa A.
        • Pardo-Seco J.
        • Fachal L.
        • Vega A.
        • Cebey M.
        • Martinón-Torres N.
        • Martinón-Torres F.
        • Salas A.
        Indian signatures in the westernmost edge of the European Romani diaspora: New insight from mitogenomes.
        PLoS One. 2013; 8: e75397
        • de la Puente M.
        • Santos C.
        • Fondevila M.
        • Manzo L.
        • EUROFORGEN-NoE Consortium
        • Carracedo A.
        • Lareu M.V.
        • Phillips C.
        The Global AIMs Nano set: a 31-plex SNaPshot assay of ancestry-informative SNPs.
        Forensic Sci. Int. Genet. 2016; 22: 81-88
        • Phillips C.
        • Manzo L.
        • de la Puente M.
        • Fondevila M.
        • Lareu M.V.
        The MASTiFF panel - a versatile multiple-allele SNP test for forensics.
        Int. J. Leg. Med. 2020; 134: 441-450
        • Williams L.M.
        • Oleksiak M.F.
        Ecologically and evolutionarily important SNPs identified in natural populations.
        Mol. Biol. Evol. 2011; 28: 1817-1826
        • Sankararaman S.
        • Patterson N.
        • Li H.
        • Pääbo S.
        • Reich D.
        The date of interbreeding between Neandertals and modern humans.
        PLoS Genet. 2012; 8: e1002947
        • Huerta-Sánchez E.
        • Casey F.P.
        Archaic inheritance: supporting high-altitude life in Tibet.
        J. Appl. Physiol. 2015; 119: 1129-1134
        • Chen H.
        • Patterson N.
        • Reich D.
        Population differentiation as a test for selective sweeps.
        Genome Res. 2010; 20: 393-402
        • Narasimhan V.M.
        • Patterson N.
        • Moorjani P.
        • Rohland N.
        • Bernardos R.
        • Mallick S.
        • Lazaridis I.
        • Nakatsuka N.
        • Olalde I.
        • Lipson M.
        • et al.
        The formation of human populations in South and Central Asia.
        Science. 2019; 365: eaat7487
        • Debortoli G.
        • Abbatangelo C.
        • Ceballos F.
        • Fortes-Lima C.
        • Norton H.L.
        • Ozarkar S.
        • Parra E.J.
        • Jonnalagadda M.
        Novel insights on demographic history of tribal and caste groups from West Maharashtra (India) using genome-wide data.
        Sci. Rep. 2020; 10: 10075