Research Article | Volume 64, 102853, May 2023

Development and evaluations of the ancestry informative markers of the VISAGE Enhanced Tool for Appearance and Ancestry

Open Access. Published: March 03, 2023. DOI: https://doi.org/10.1016/j.fsigen.2023.102853

      Highlights

      • 226 ancestry (BGA) markers compiled for VISAGE Enhanced Tool (ET) combined with 184 appearance markers in one MPS assay.
      • Autosomal BGA SNP number in ET reduced to allow inclusion of 85 Y-SNPs, 16 X-SNPs and 21 Microhaplotypes (MHs).
      • Extra BGA markers give enhanced detail of co-ancestry patterns in admixed males, and MH loci allow ancestry-based deconvolution of mixed DNA.
      • Comprehensive reference population datasets and analyses of global distribution of variation in the ET BGA markers outlined.
      • Expanded Middle East-informative SNPs enhance differentiation of these populations particularly when combined with nested K:5 STRUCTURE runs.

      Abstract

      The VISAGE Enhanced Tool for Appearance and Ancestry (ET) has been designed to combine markers for the prediction of bio-geographical ancestry plus a range of externally visible characteristics into a single massively parallel sequencing (MPS) assay. We describe the development of the ancestry panel markers used in ET, and the enhanced analyses they provide compared to previous MPS-based forensic ancestry assays. As well as established autosomal single nucleotide polymorphisms (SNPs) that differentiate sub-Saharan African, European, East Asian, South Asian, Native American, and Oceanian populations, ET includes autosomal SNPs able to efficiently differentiate populations from Middle East regions. The ability of the ET autosomal ancestry SNPs to distinguish Middle East populations from other continentally defined population groups is such that characteristic patterns for this region can be discerned in genetic cluster analysis using STRUCTURE. Joint cluster membership estimates showing individual co-ancestry that signals North African or East African origins were detected, or cluster patterns were seen that indicate origins from central and Eastern regions of the Middle East. In addition to an augmented panel of autosomal SNPs, ET includes panels of 85 Y-SNPs, 16 X-SNPs and 21 autosomal Microhaplotypes. The Y- and X-SNPs provide a distinct method for obtaining extra detail about co-ancestry patterns identified in males with admixed backgrounds. This study used the 1000 Genomes admixed African and admixed American sample sets to fully explore these enhancements to the analysis of individual co-ancestry. Samples from urban and rural Brazil with contrasting distributions of African, European, and Native American co-ancestry were also studied to gauge the efficiency of combining Y- and X-SNP data for this purpose. 
The small panel of Microhaplotypes incorporated in ET was selected because these loci showed the highest levels of haplotype diversity amongst the seven population groups we sought to differentiate. Microhaplotype data was not formally combined with single-site SNP genotypes to analyse ancestry. However, the haplotype sequence reads obtained with ET from these loci create an effective system for de-convoluting two-contributor mixed DNA. We performed simple mixture experiments to demonstrate that, when the contributors have different ancestries and the mixture ratios are imbalanced (i.e., not 1:1 mixtures), the ET Microhaplotype panel is an informative system for inferring the ancestry of each contributor.

      1. Introduction

      The VISible Attributes through GEnomics (VISAGE) Consortium was initiated in 2017 specifically to develop new massively parallel sequencing (MPS) tools to genotype single nucleotide polymorphisms (SNPs) for the prediction of bio-geographical ancestry (BGA) [
      • Phillips C.
      Forensic genetic analysis of bio-geographical ancestry.
      ] and a range of externally visible characteristics (EVCs) [
      • Kayser M.
      Forensic DNA Phenotyping: Predicting human appearance from crime scene material for investigative purposes.
      ] that contribute to the appearance of an unidentified suspect who has left contact trace DNA at the crime-scene. The SNP genotyping tests for BGA and EVC prediction run in parallel to dedicated MPS assays for age estimation based on quantitative DNA methylation analysis [
      • Freire-Aradas A.
      • Phillips C.
      • Lareu M.V.
      Forensic individual age estimation with DNA: from initial approaches to methylation tests.
      ]. VISAGE used a two-stage program to develop the MPS toolbox for DNA-based prediction of ancestry, appearance, and age. In the first stage, two prototype Basic Tools (BT) were created: the VISAGE BT for Appearance and Ancestry, which combined in one MPS assay 41 markers for predicting eye, hair, and skin colour with 115 ancestry-informative SNPs to analyse BGA [
      • de la Puente M.
      • Ruiz-Ramírez J.
      • Ambroa-Conde A.
      • Xavier C.
      • Pardo-Seco J.
      • Álvarez-Dios J.
      • Freire-Aradas A.
      • Mosquera-Miguel A.
      • Gross T.E.
      • Cheung E.Y.Y.
      • et al.
      Development and evaluation of the ancestry informative marker panel of the VISAGE basic tool.
      ,
      • Xavier C.
      • de la Puente M.
      • Mosquera-Miguel A.
      • Freire-Aradas A.
      • Kalamara V.
      • Vidaki A.
      • Gross T.E.
      • Revoir A.
      • Pośpiech E.
      • Kartasińska E.
      • et al.
      Development and validation of the VISAGE AmpliSeq basic tool to predict appearance and ancestry from DNA.
      ,
      • Palencia-Madrid L.
      • Xavier C.
      • de la Puente M.
      • Hohoff C.
      • Phillips C.
      • Kayser M.
      • Parson W.
      VISAGE consortium, evaluation of the VISAGE basic tool for appearance and ancestry prediction using PowerSeq chemistry on the MiSeq FGx system.
      ]; and the VISAGE BT for age estimation from blood that combined in one MPS assay 32 CpGs from five genes [
      • Heidegger A.
      • Xavier C.
      • Niederstätter H.
      • de la Puente M.
      • Pośpiech E.
      • Pisarek A.
      • Kayser M.
      • Branicki W.
      • Parson W.
      VISAGE consortium, development and optimization of the VISAGE basic prototype tool for forensic age estimation.
      ].
      Once the BT assays had been comprehensively optimised and their forensic performance evaluated on the Ion S5 (Thermo Fisher Scientific) and MiSeq (Illumina) MPS platforms, VISAGE moved to the second stage of MPS tool design with much more ambitious developmental targets for the Enhanced Tools (ET): The VISAGE ET for Appearance and Ancestry and two separate age tools: the VISAGE ET for age estimation from somatic tissue and the VISAGE ET for age estimation from semen. For the VISAGE ET for Appearance and Ancestry assay, new phenotyping SNPs were introduced for an expanded range of common EVCs beyond, but including, eye, hair and skin colour, which were combined with new BGA SNPs. Additional BGA SNPs focussed on the following objectives: i. the efficient differentiation of Middle East population variation from other Eurasian populations by selecting an expanded panel of SNPs focussed on Middle East regions; ii. the addition of gonosomal SNPs (X and Y) to obtain more detailed analysis of co-ancestry patterns in persons with admixed backgrounds; iii. the inclusion of markers providing a system to estimate the ancestry of the components in simple, 2-way mixed DNA, commonly encountered in forensic analyses. The ET toolbox expanded the age estimation MPS sequencing to eight combined CpG clusters analysing somatic tissue methylation patterns in blood, buccal cells and bones [
      • Woźniak A.
      • Heidegger A.
      • Piniewska-Róg D.
      • Pośpiech E.
      • Xavier C.
      • Pisarek A.
      • Kartasińska E.
      • Boroń M.
      • Freire-Aradas A.
      • Wojtas M.
      • et al.
      Development of the VISAGE enhanced tool and statistical models for epigenetic age estimation in blood, buccal cells and bones.
      ], and in a separate test, 13 CpG clusters for analysis of semen [
      • Pisarek A.
      • Pośpiech E.
      • Heidegger A.
      • Xavier C.
      • Papież A.
      • Piniewska-Róg D.
      • Kalamara V.
      • Potabattula R.
      • Bochenek M.
      • Sikora-Polaczek M.
      • et al.
      Epigenetic age prediction in semen - marker selection and model development.
      ,
      • Heidegger A.
      • Pisarek A.
      • de la Puente M.
      • Niederstätter H.
      • Pośpiech E.
      • Woźniak A.
      • Schury N.
      • Unterländer M.
      • Sidstedt M.
      • Junker K.
      • et al.
      Development and inter-laboratory validation of the VISAGE enhanced tool for age estimation from semen using quantitative DNA methylation analysis.
      ]. Therefore, the ET assays comprised a single combined appearance and ancestry MPS multiplex plus somatic or semen age estimation multiplexes running in parallel workflows in the same way as BT-based analyses. A key part of the development of the VISAGE toolbox was the design, optimisation and implementation of an integrated interpretation framework which includes software for combined statistical consideration of DNA information predicting appearance, age, and ancestry delivered by the ET assays.
      For the ET ancestry panel, Middle East informative BGA SNPs were expanded from 12 to 29, but the overall number of binary autosomal BGA SNPs was reduced by ∼25%. To analyse co-ancestry patterns in persons with admixed backgrounds, the two most informative marker sets complementing autosomal SNPs are Y-SNPs and mitochondrial DNA (mtDNA) SNPs. However, mtDNA was not considered for ET, as the target DNA copy number is substantially higher than that of genomic DNA extracted from the same forensic sample. Additionally, a very large number of SNPs would need to be genotyped. To compensate for the lack of mtDNA data, 16 X-SNPs were included to analyse the maternal lineage in admixed persons, alongside a core set of 85 Y-SNPs to analyse the paternal lineage in males. Both X- and Y-SNP sets provide highly informative data with which to compare the co-ancestry ratios estimated from autosomal BGA SNPs. Lastly, 21 Microhaplotypes (MHs) with ancestry informative properties [
      • de la Puente M.
      • Ruiz-Ramírez M.J.
      • Ambroa-Conde A.
      • Xavier C.
      • Amigo J.
      • Casares de Cal M.A.
      • Gómez-Tato A.
      • Carracedo A.
      • Parson W.
      • Phillips C.
      • Lareu M.V.
      Broadening the applicability of a custom multi-platform panel of Microhaplotypes: Bio-geographical ancestry inference and expanded reference data.
      ] were included to improve the analysis of mixed DNA by measuring sequence imbalance and/or detecting more than two haplotypes per locus across multiple MHs, when such mixtures occur.
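As a rough illustration of the two mixture signals described above (more than two haplotypes at a locus, or imbalanced reads between two haplotypes), the following Python sketch flags Microhaplotype loci from per-haplotype read counts. The function name, locus labels, read counts, and both thresholds (a 5% minimum read fraction and a 0.6 balance ratio) are hypothetical values chosen for illustration; they are not taken from the ET evaluations.

```python
def flag_mixture_loci(reads_by_locus, min_fraction=0.05, balance_ratio=0.6):
    """Return {locus: reason} for loci whose reads suggest more than one contributor.

    reads_by_locus maps a locus name to {haplotype sequence: read count}.
    Both thresholds are illustrative assumptions, not validated settings.
    """
    flags = {}
    for locus, hap_reads in reads_by_locus.items():
        total = sum(hap_reads.values())
        if total == 0:
            continue  # no coverage, nothing to assess
        # Keep only haplotypes above the assumed noise threshold, largest first.
        called = sorted(
            (n for n in hap_reads.values() if n / total >= min_fraction),
            reverse=True,
        )
        if len(called) > 2:
            flags[locus] = "more than two haplotypes"
        elif len(called) == 2 and called[1] / called[0] < balance_ratio:
            flags[locus] = "imbalanced haplotype reads"
    return flags


# Hypothetical read counts for three made-up loci.
example_reads = {
    "MH-01": {"ACG": 480, "ATG": 510},              # balanced heterozygote
    "MH-02": {"GGT": 700, "GAT": 180, "CAT": 120},  # three haplotypes
    "MH-03": {"TTC": 800, "TCC": 200},              # two haplotypes, imbalanced
}
mixture_flags = flag_mixture_loci(example_reads)
print(mixture_flags)
```

A single flagged locus is weak evidence on its own; the text above refers to signals observed across multiple MHs, so in practice the flags would be considered jointly.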
      In the current study, we outline the selection of ancestry markers for the VISAGE ET for Appearance and Ancestry, the performance of these loci for ancestry inference using established statistical methodology, and the use of the specialist X-SNP, Y-SNP and MH marker sets added to the ET ancestry panel for co-ancestry analysis and ancestry-based deconvolution of simple DNA mixtures.

      2. Materials and methods

      2.1 Selection of ancestry markers for ET

      2.1.1 Autosomal BGA SNPs

      The previous targeted population differentiations of the BT BGA panel, which was composed entirely of autosomal SNPs, were Sub-Saharan Africa (herein Africa, unless specified as the geographically and genetically distinct North Africa or East Africa), Europe, East Asia, South Asia, America (i.e., Native American populations), and Oceania. These datasets are abbreviated to AFR, EUR, EAS, SAS, AMR and OCE, respectively. ET expanded the above population divisions to include Middle East populations (ME), located in regions ranging from North Africa bounded by the Sahara, eastwards to Iran and southwards towards the regions adjacent to the Horn of Africa. Originally, no distinction was made between North African variation and that shown by other Middle East populations when selecting candidate BGA SNPs. An additional 12 or more BT SNPs had previously exhibited strong allele frequency contrasts between Middle East populations and Europeans or South Asians, so were also considered. The main source of ME-informative SNPs was the EUROFORGEN NAME panel [
      • Pereira V.
      • Freire-Aradas A.
      • Ballard D.
      • Børsting C.
      • Diez V.
      • Pruszkowska-Przybylska P.
      • Ribeiro J.
      • Achakzai N.M.
      • Aliferi A.
      • Bulbul O.
      • et al.
      Development and validation of the EUROFORGEN NAME (North African and Middle Eastern) ancestry panel.
      ] that previously compiled a total of 111 SNPs. Fig. 1 shows the proportion of autosomal binary and tri-allelic SNPs in both BT and ET ancestry panels, indicating autosomal SNPs comprised 46% of the ancestry markers in ET. Autosomal binary SNP numbers were reduced from BT to ET for all target population groups, ranging from a 21% reduction for SAS to over 87% reduction for OCE. The number of tri-allelic SNPs was increased, but in all markers, there was only limited commonality with BT BGA SNPs - i.e., no population used a simple subset of previously compiled BT BGA SNPs, but each was re-configured to include more powerful ancestry markers to compensate for a reduced number of autosomal SNPs overall, as outlined in Fig. 1. There was also a degree of adjustment for varied population informativeness, measured during searches by calculating Population Specific Divergence (i.e., Shannon’s Divergence metric applied to the comparison of one population with all others in the classification system, herein denoted by: In AFR; In EUR; In EAS; etc.) using the Snipper SNP analysis portal, as previously described [
      • Phillips C.
      • Parson W.
      • Lundsberg B.
      • Santos C.
      • Freire-Aradas A.
      • Torres M.
      • Eduardoff M.
      • Børsting C.
      • Johansen P.
      • Fondevila M.
      • et al.
      Building a forensic ancestry panel from the ground up: the EUROFORGEN Global AIM-SNP set.
      ,
      • Galanter J.M.
      • Fernandez-Lopez J.C.
      • Gignoux C.R.
      • Barnholtz-Sloan J.
      • Fernandez-Rozadilla C.
      • Via M.
      • Hidalgo-Miranda A.
      • Contreras A.V.
      • Figueroa L.U.
      • Raska P.
      • et al.
      Development of a panel of genome-wide ancestry informative markers to study admixture throughout the Americas.
      ]. Notably, many SNPs specifically targeted to differentiate populations outside of Africa and Oceania also had informative patterns of variation in both of these populations. Many of the tri-allelic SNPs selected were chosen because of above-average levels of divergence between South Asia and Europe for allele-2 and/or allele-3.
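To make the idea of a population-specific divergence concrete, the sketch below computes a Rosenberg-style informativeness-for-assignment value for one target population against the unweighted average of all other populations. This is only an assumed formulation: the exact calculation performed by the Snipper portal is not given here and may differ, and the allele frequencies are invented for illustration.

```python
import math


def informativeness(freqs_a, freqs_b):
    """Two-group informativeness for assignment (natural log, equal group weights).

    Equals 0 when both groups share identical allele frequencies and
    ln(2) when the groups are fixed for different alleles.
    """
    total = 0.0
    for pa, pb in zip(freqs_a, freqs_b):
        mean = (pa + pb) / 2
        if mean > 0:
            total -= mean * math.log(mean)
        for p in (pa, pb):
            if p > 0:
                total += 0.5 * p * math.log(p)
    return total


def population_specific_divergence(allele_freqs, target):
    """Compare the target population with the unweighted mean of all others (assumption)."""
    others = [f for pop, f in allele_freqs.items() if pop != target]
    n_alleles = len(allele_freqs[target])
    pooled = [sum(f[j] for f in others) / len(others) for j in range(n_alleles)]
    return informativeness(allele_freqs[target], pooled)


# Invented allele frequencies for one binary SNP in three population groups.
snp_freqs = {"AFR": [0.95, 0.05], "EUR": [0.10, 0.90], "EAS": [0.12, 0.88]}
div_afr = population_specific_divergence(snp_freqs, "AFR")
div_eur = population_specific_divergence(snp_freqs, "EUR")
print(f"AFR: {div_afr:.3f}  EUR: {div_eur:.3f}")
```

With these made-up frequencies the SNP scores much higher for AFR than for EUR, which is the kind of contrast used to decide which population comparison a candidate marker supports.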
      Fig. 1. Proportion of BGA SNPs and ancestry markers in the VISAGE Basic Tool (BT) and the VISAGE Enhanced Tool (ET). Amongst the binary autosomal BGA SNPs, all population-indicative sets were reduced in number, apart from Middle East (ME) informative SNPs, which were more than doubled in number. The expansion in multiplex space dedicated to ancestry markers in ET was occupied with ancestry-informative Microhaplotypes, Y-SNPs, X-SNPs, and more tri-allelic BGA SNPs. Light grey circles (left) denote BGA SNPs retained; dark grey circles (right) denote novel BGA SNPs introduced to ET to improve each population differentiation.
      The bulk of autosomal BGA SNPs selected for ET were identified from previous forensic ancestry panels, using the HGDP-CEPH human diversity panel [
      • Phillips C.
      • Freire Aradas A.
      • Kriegel A.K.
      • Fondevila M.
      • Bulbul O.
      • Santos C.
      • Serrulla Rech F.
      • Perez Carceles M.D.
      • Carracedo A.
      • Schneider P.M.
      • Lareu M.V.
      Eurasiaplex: a forensic SNP assay for differentiating European and South Asian ancestries.
      ,
      • Santos C.
      • Phillips C.
      • Fondevila M.
      • Daniel R.
      • van Oorschot R.A.H.
      • Burchard E.G.
      • Schanfield M.S.
      • Souto L.J.
      • Uacyisrael J.
      • Via M.
      • et al.
      Pacifiplex: An ancestry-informative SNP panel centred on Australia and the Pacific region.
      ,
      • Carvalho Gontijo C.
      • Porras-Hurtado L.G.
      • Freire-Aradas A.
      • Fondevila M.
      • Santos C.
      • Salas A.
      • Henao J.
      • Isaza C.
      • Beltrán L.
      • Nogueira Silbiger V.
      • et al.
      PIMA: A population informative multiplex for the Americas.
      ] and 1000 Genomes Phase III SNP data [
      • The 1000 Genomes Project Consortium
      • Auton A.
      • Brooks L.D.
      • Durbin R.M.
      • Garrison E.P.
      • Kang H.M.
      • Korbel J.O.
      • Marchini J.L.
      • McCarthy S.
      • McVean G.A.
      • et al.
      A global reference for human genetic variation.
      ], (HGDP-CEPH population descriptions, grouping and sample sizes as outlined in [
      • Amigo J.
      • Phillips C.
      • Lareu M.
      • Carracedo Á.
      The SNPforID browser: an online tool for query and display of frequency data from the SNPforID project.
      ]; and for 1000 Genomes populations in [
      • The 1000 Genomes Project Consortium
      • Auton A.
      • Brooks L.D.
      • Durbin R.M.
      • Garrison E.P.
      • Kang H.M.
      • Korbel J.O.
      • Marchini J.L.
      • McCarthy S.
      • McVean G.A.
      • et al.
      A global reference for human genetic variation.
      ] – also see Section 2.2.1). Such population sample sets are increasingly being enhanced with more detailed and comprehensive whole-genome-sequence based variant catalogs. We took advantage of a series of recently published studies that provide high quality variant calls from higher levels of sequence coverage of the human genome [
      • Bergström A.
      • McCarthy S.A.
      • Hui R.
      • Almarri M.A.
      • Ayub Q.
      • Danecek P.
      • Chen Y.
      • Felkel S.
      • Hallast P.
      • Kamm J.
      • et al.
      Insights into human genetic variation and population history from 929 diverse genomes.
      ,
      • Byrska-Bishop M.
      • Evani U.S.
      • Zhao X.
      • Basile A.O.
      • Abel H.J.
      • Regier A.A.
      • Corvelo A.
      • Clarke W.E.
      • Musunuri R.
      • Nagulapalli K.
      • et al.
      High coverage whole-genome-sequencing of the expanded 1000 Genomes Project cohort including 602 trios.
      ,
      • Almarri M.A.
      • Haber M.
      • Lootah R.A.
      • Hallast P.
      • Al Turki S.
      • Martin H.C.
      • Xue Y.
      • Tyler-Smith C.
      The genomic history of the Middle East.
      ] to compile the most up-to-date allele frequency estimates for each ET BGA SNP. At the same time, identical data was collected for the EVC SNPs of ET to explore whether additional SNPs can improve population differentiations beyond the three overlapping loci for appearance and ancestry analysis used in BT and ET (rs16891982 in SLC45A2, rs1426654 in SLC24A5, rs12913832 in HERC2). Lastly, we analysed genome-wide patterns of population variation in tri-allelic SNPs in the human genome from detailed scrutiny of a full dataset of these markers we had previously compiled [
      • Phillips C.
      • Amigo J.
      • Tillmar A.O.
      • Peck M.A.
      • de la Puente M.
      • Ruiz-Ramírez J.
      • Bittner F.
      • Idrizbegović Š.
      • Wang Y.
      • Parsons T.J.
      • et al.
      A compilation of tri-allelic SNPs from 1000 Genomes and use of the most polymorphic loci for a large-scale human identification panel.
      ]. The allele frequency data were then used to estimate and compile markers with the maximum In POP values and then to balance the panel composition by adjusting relative numbers of BGA SNPs for the continental comparisons as previously described [
      • de la Puente M.
      • Ruiz-Ramírez J.
      • Ambroa-Conde A.
      • Xavier C.
      • Pardo-Seco J.
      • Álvarez-Dios J.
      • Freire-Aradas A.
      • Mosquera-Miguel A.
      • Gross T.E.
      • Cheung E.Y.Y.
      • et al.
      Development and evaluation of the ancestry informative marker panel of the VISAGE basic tool.
      ,
      • Phillips C.
      • Parson W.
      • Lundsberg B.
      • Santos C.
      • Freire-Aradas A.
      • Torres M.
      • Eduardoff M.
      • Børsting C.
      • Johansen P.
      • Fondevila M.
      • et al.
      Building a forensic ancestry panel from the ground up: the EUROFORGEN Global AIM-SNP set.
      ], but ignoring In SAS and In ME calculations and marker balance.

      2.1.2 Y-SNPs

      A total of 85 Y-SNPs were selected to create a set intended to achieve an optimal balance between detecting all broadly defined global Y-haplogroups and providing additional resolution within certain haplogroups, in a way that could be informative for forensic ancestry analysis, while at the same time, occupying the minimum multiplex space in ET. Supplementary Fig. S1 illustrates three examples of carefully selected Y-SNPs amongst the 85 that all belong to haplogroup R1a, but exhibit geographic frequency distributions that are very different, namely: R1a-Z284 = Northwest Europe; R1a-Z282 = East Europe; R1a-Z93 = West/Central/South Asia. The Y-SNP selection process also made use of compilations of the most informative Y-SNPs identified from the more extensive 859 Y-SNP MPS assays designed to analyse 640 Y-haplogroups [
      • Ralf A.
      • van Oven M.
      • Montiel González D.
      • de Knijff P.
      • van der Beek K.
      • Wootton S.
      • Lagacé R.
      • Kayser M.
      Forensic Y-SNP analysis beyond SNaPshot: High-resolution Y-chromosomal haplogrouping from low quality and quantity DNA using Ion AmpliSeq and targeted massively parallel sequencing.
      ]. The genomic details and geographic distribution summaries of the 85 Y-SNPs incorporated in ET are detailed in Table 1.
      Table 1. Genomic details and geographic distribution summaries of the 85 Y-SNPs incorporated in ET. NA: no information available.
      No. | Marker name | SNP-ID | Position GRCh37 | Position GRCh38 | Substitution | ISOGG Nomenclature | Geographic distribution
      1 | V148 | rs181335666 | 6788191 | 6920150 | G->A | A0 | Central Africa, West Africa
      2 | L1086 | NA | 2826312 | 2958271 | A->T | A00 | Central Africa
      3 | V168 | rs191505182 | 17947672 | 15835792 | G->A | A1 |
      4 | M31 | rs369315948 | 21739754 | 19577868 | G->C | A1a | West Africa, North Africa
      5 | V50 | rs189205028 | 6845936 | 6977895 | T->C | A1b1a | Southern Africa, Central Africa
      6 | M32 | rs558241924 | 21740436 | 19578550 | T->C | A1b1b | East Africa, Southern Africa
      7 | M13 | rs3904 | 21722098 | 19560212 | G->C | A1b1b2b | Central Africa, East Africa
      8 | M42 | rs2032630 | 21866840 | 19704954 | A->T | BT |
      9 | M181 | rs2032599 | 14851554 | 12739620 | T->C | B | Central Africa, Southern Africa, East Africa
      10 | M168 | rs2032595 | 14813991 | 12702062 | C->T | CT |
      11 | M145 | rs3848982 | 21717208 | 19555322 | C->T | DE |
      12 | M174 | rs2032602 | 14954280 | 12842354 | T->C | D | East Asia
      13 | F6251 | NA | 7681275 | 7813234 | C->T | D1a | East Asia, Central Asia
      14 | M55 | rs2032621 | 21872738 | 19710852 | T->C | D1b | Japan
      15 | L1378 | rs893924838 | 2828140 | 2960099 | C->T | D2 | SE Asia
      16 | M96 | rs9306841 | 21778998 | 19617112 | C->G | E | Africa, West Asia, Southern Europe
      17 | M33 | rs368762706 | 21740450 | 19578564 | A->C | E1a | West Africa
      18 | V38 | rs768983 | 6818291 | 6950250 | C->T | E1b1a | Sub Saharan Africa
      19 | M215 | rs2032654 | 15467824 | 13355944 | A->G | E1b1b |
      20 | V32 | rs371254614 | 6932821 | 7064780 | G->C | E1b1b1a1a1b | East Africa
      21 | V13 | rs368031074 | 6842263 | 6974222 | G->A | E1b1b1a1b1a | Southern Europe
      22 | M81 | rs2032640 | 21892572 | 19730686 | C->T | E1b1b1b1a | Northern Africa
      23 | M123 | rs371143248 | 21764586 | 19602700 | C->T | E1b1b1b2a1 | East Africa, West Asia
      24 | M75 | rs2032639 | 21890177 | 19728291 | G->A | E2 | Sub Saharan Africa
      25 | P143 | rs4141886 | 14197867 | 12077161 | G->A | CF |
      26 | M130 | rs35284970 | 2734854 | 2866813 | C->T | C | Central, North & SE Asia, N America, East Asia, Near Oceania, Australia, Remote Oceania
      27 | M38 | rs369611932 | 21742158 | 19580272 | T->G | C1b3a | Oceania / Indonesia
      28 | M347 | rs868363758 | 2877479 | 3009438 | A->G | C1b3b | Australia
      29 | M217 | rs2032668 | 15437333 | 13325453 | A->C | C2 | South Asia, Southern East Asia, Northern East Asia
      30 | P39 | rs887450245 | 14484581 | 12363850 | G->A | C2b1a1a1 | Northern America
      31 | M48 | rs373681213 | 21749881 | 19587995 | A->G | C2b1a1b | Siberia / Northern East Asia
      32 | M89 | rs2032652 | 21917313 | 19755427 | C->T | F |
      33 | M201 | rs2032636 | 15027529 | 12915617 | G->T | G | West Asia, South-West Asia, Europe, Central Asia
      34 | M285 | rs13447378 | 22741740 | 20579854 | G->C | G1 | South-West Asia, Central Asia
      35 | P287 | rs4116820 | 22072097 | 19910211 | G->T | G2 | West Asia, South-West Asia, Europe, Central Asia
      36 | L901 | rs567848586 | 17844304 | 15732424 | C->T | H | South Asia, Eastern Europe, South-West Europe, Western Europe
      37 | P96 | rs1027017284 | 14869743 | 12757813 | C->A | H2 | Eastern Europe, South-West Europe, Western Europe
      38 | M170 | rs2032597 | 14847792 | 12735858 | A->C | I | Europe, West Asia
      39 | M253 | rs9341296 | 15022707 | 12910796 | C->T | I1 | North-Europe, West Europe
      40 | M438 | rs17307294 | 16638804 | 14526924 | A->G | I2 | South Europe, Central Europe, East Europe
      41 | M436 | rs17315680 | 18747493 | 16635613 | G->C | I2a1b | North-Europe, West Europe
      42 | M429 | rs17306671 | 14031334 | 11910628 | T->A | IJ |
      43 | M522 | rs9786714 | 7173143 | 7305102 | G->A | IJK |
      44 | M304 | rs13447352 | 22749853 | 20587967 | A->C | J | W Asia, North Africa, Horn of Africa, S Europe, Central Asia, South Asia
      45 | M267 | rs9341313 | 22741818 | 20579932 | T->G | J1 | Northern Africa, Horn of Africa, West Asia, South Asia
      46 | M172 | rs2032604 | 14969634 | 12857709 | T->G | J2 | Southern Europe, West Asia
      47 | M9 | rs3900 | 21730257 | 19568371 | C->G | K |
      48 | M526 | rs2033003 | 23550924 | 21389038 | A->C | K2 |
      49 | M20 | rs3911 | 21733454 | 19571568 | A->G | L | South Asia, West Asia
      50 | P326 | rs372687543 | 8467290 | 8599249 | T->C | LT [K1] |
      51 | P256 | P256 | 8685231 | 8817190 | G->A | M or K2b1b | Near Oceania, Wallacea, Australia, Remote Oceania ???
      52 | M231 | rs9341278 | 15469724 | 13357844 | G->A | N | Northern Asia, Central Asia, Americas
      53 | M46 | rs34442126 | 14922583 | 12810648 | T->C | N1a1 | Siberia / East Asia
      54 | VL29 | rs752512309 | 14570424 | 12458624 | T->C | N1a1a1a1a1a | NE Europe, Eastern Europe, Central Asia
      55 | B479 | NA | 26271075 | 24124928 | C->A | N1a1a1a1a1c∼ | East Asia
      56 | Z1936 | rs774008164 | 21463326 | 19301440 | C->T | N1a1a1a1a2 | NE Europe, Eastern Europe, Central Asia
      57 | F4205 | rs1028202961 | 16331432 | 14219552 | A->G | N1a1a1a1a3a | Mongolia
      58 | B202 | NA | 2880546 | 3012505 | T->C | N1a1a1a1a3b | Russian Far East
      59 | M2118 | rs571876713 | 23259624 | 21097738 | A->G | N1a1a1a1a4 | Russian Far East
      60 | F2930 | rs528311746 | 19080602 | 16968722 | G->A | N1b | East Asia
      61 | P186 | rs16981290 | 7568568 | 7700527 | C->A | O | East Asia, SE Asia, South Asia, Oceania
      62 | M119 | rs72613040 | 21762685 | 19600799 | T->G | O1a | SE Asia, East Asia, Oceania
      63 | P31 | rs200861659 | 14495243 | 12383440 | T->C | O1b | South Asia, SE Asia
      64 | M176 | rs11575897 | 2655180 | 2787139 | G->A | O1b2 | East Asia
      65 | M122 | rs78149062 | 21764674 | 19602788 | A->G | O2 | East Asia, Oceania
      66 | JST-002611 | rs2075181 | 7546726 | 7678685 | G->A | O2a1b | East Asia
      67 | P201 | rs2267801 | 2828196 | 2960155 | T->C | O2a2 | Oceania, East Asia
      68 | P295 | rs895530 | 7963031 | 8094990 | T->G | P or K2b2 |
      69 | M242 | rs8179021 | 15018582 | 12906671 | C->T | Q | Northern Asia, Central Asia, America
      70 | M3 | rs3894 | 19096363 | 16984483 | G->A | Q1b1a1a | America
      71 | M207 | rs2032658 | 15581983 | 13470103 | A->G | R | Europe, West Asia, Central Asia, South Asia, North Africa, Central Africa
      72 | M173 | rs2032624 | 15026424 | 12914512 | A->C | R1 |
      73 | M420 | rs17250535 | 23473201 | 21311315 | T->A | R1a |
      74 | Z282 | rs112563127 | 15588401 | 13476521 | T->C | R1a1a1b1a | Eastern Europe, Balkan
      75 | Z284 | rs767265794 | 8717196 | 8849155 | C->G | R1a1a1b1a3a | Northern Europe
      76 | Z93 | rs566323605 | 7552356 | 7684315 | G->A | R1a1a1b2 | South Asia, Middle East, Central Asia
      77 | M343 | rs9786184 | 2887824 | 3019783 | C->A | R1b | Western Europe
      78 | U106 | rs16981293 | 8796078 | 8928037 | C->T | R1b1a1b1a1a1 | Western Europe
      79 | P312 | rs34276300 | 22157311 | 19995425 | C->A | R1b1a1b1a1a2 | Western Europe
      80 | L21 | rs11799226 | 15654428 | 13542548 | C->G | R1b1a1b1a1a2c1 | Western Europe
      81 | CTS1078 | rs567703217 | 7186135 | 7318094 | G->C | R1b1a1b1b | Caucasus, Balkan, Middle East
      82 | V88 | rs180946844 | 4862861 | 4994820 | C->T | R1b1b | Sub Saharan Africa
      83 | M479 | rs372157627 | 20834667 | 18672781 | C->T | R2 | South Asia
      84 | B254 | rs372295336 | 14102580 | 11981874 | C->A | S | Oceania, East Asia, Australia
      85 | M184 | rs20320 | 14898163 | 12786229 | G->A | T | West Asia, Horn of Africa, North Africa, Southern Europe, South Asia
      To interpret the Y-SNP data generated, a Y-haplogroup reference database was required, and creating such a database involved the challenging task of compiling disparate published Y-SNP population data. Although many different geographic regions have been studied since Y-SNP genotyping became established, almost every published dataset has analysed different sets of Y-SNPs. Some studies have only focused on broad haplogroups, while others generated high-resolution Y-SNP data within a certain haplogroup. To make a reference database that was compatible with the Y-SNPs included in ET, the genotypes reported in each individual paper were inspected manually; compatible data was included in the database and incompatible data discarded. In some cases, the absence of certain haplogroups in a population sample could be inferred. For example, if 100 males were typed, of which 70 belonged to haplogroup R1b and the remaining 30 to haplogroup I, then the frequency of all Y-SNPs belonging to any other haplogroup was almost certain to be 0, even if those Y-SNPs had not been genotyped in the original study. Ninety Y-SNP studies, plus the data published by 1000 Genomes, were used to create the Y-haplogroup database. These studies combined 84,269 genotyped males, of which 35,624 (42%) could be assigned to one of the haplogroups defined by the 85 ET Y-SNPs.
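The zero-frequency inference described above can be sketched in a few lines of Python. This is a simplified illustration, not the procedure used to build the actual database: the function name and the four-haplogroup panel are hypothetical, and the sketch ignores the nesting of haplogroups (for example, R1b sitting within R). A frequency of 0.0 is recorded only when every typed male was assigned to an observed haplogroup; otherwise the value is left as None to mark missing data.

```python
def haplogroup_frequencies(observed_counts, n_typed, panel_haplogroups):
    """Return {haplogroup: frequency, 0.0, or None} for one published sample.

    observed_counts: males assigned to each haplogroup in the study.
    n_typed: total number of males typed in the study.
    None marks haplogroups whose frequency cannot be inferred (missing data).
    """
    assigned = sum(observed_counts.values())
    # Zeros are only inferred when all typed males are accounted for.
    fully_assigned = assigned == n_typed
    freqs = {}
    for hg in panel_haplogroups:
        if hg in observed_counts:
            freqs[hg] = observed_counts[hg] / n_typed
        elif fully_assigned:
            freqs[hg] = 0.0
        else:
            freqs[hg] = None
    return freqs


panel = ["R1b", "I", "J", "E"]  # hypothetical, non-nested subset of haplogroups

# The example from the text: 100 males, 70 R1b and 30 I.
complete_study = haplogroup_frequencies({"R1b": 70, "I": 30}, 100, panel)

# A study where 10 males could not be assigned: no zeros are inferred.
partial_study = haplogroup_frequencies({"R1b": 70, "I": 20}, 100, panel)

print(complete_study)
print(partial_study)
```

Keeping None separate from 0.0 mirrors the distinction between missing data and zero observations that matters when haplogroup frequencies are later mapped by region.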
      The compiled Y-SNP population database then formed the basis for a mapping module within the VISAGE ET interpretative software. The module generated charts which visualised the frequency distributions of the inferred haplogroup in populations or regions covered by reference studies and compatible with the 85 Y-SNPs. The distribution maps of Supplementary Fig. S1 illustrate the efforts to make a clear distinction between zero observations and missing data for those regions lacking genotype observations.
      ET Y-SNP data was analysed in male samples in the VISAGE Study populations and compared to X-SNP data. Haplogroup assignments were made using the extensive population data compiled for the ET Y-SNP panel selection and used to generate the geographic distribution charts shown in Supplementary Fig. S1. We did not formally collect Y-SNP data from the 1KG, CEPH or Sanger ME datasets, as these were quite incomplete. Furthermore, we chose not to infer that all SNP data absent from each project’s VCF files meant the male samples had the RefSeq reference allele by default.

      2.1.3 X-SNPs

      In a previous unpublished survey of X chromosome SNP data, made to compare variation across the major continental population groups of the HGDP-CEPH diversity panel from 650,000 genotyped SNPs [
      • Li J.Z.
      • Absher D.M.
      • Tang H.
      • Southwick A.M.
      • Casto A.M.
      • Ramachandran S.
      • Cann H.M.
      • Barsh G.S.
      • Feldman M.
      • Cavalli-Sforza L.L.
      • Myers R.M.
      Worldwide human relationships inferred from genome-wide patterns of variation.
      ], we identified a small number of X-SNPs with highly stratified allele frequency distributions. Sets of two to four SNPs were compiled that were informative for AFR, EUR, EAS, AMR or OCE population differentiations to create a compact X-SNP panel of 16 markers distributed across the full length of the X chromosome. Five of these 16 SNPs were regularly spaced around the centromere but located in a region with very low recombination (Rc rates graphically summarised in Fig. 5 of [
      • Phillips C.
      • Ballard D.
      • Gill P.
      • Court D.S.
      • Carracedo A.
      • Lareu M.V.
      The recombination landscape around forensic STRs: accurate measurement of genetic distances between syntenic STR pairs using HapMap high density SNP data.
      ]) and so were treated as a single haplotype block. The most recently published genomic data with genotypes for all the BGA SNPs of ET from high sequence coverage analysis of 1000 Genomes samples [
      • Byrska-Bishop M.
      • Evani U.S.
      • Zhao X.
      • Basile A.O.
      • Abel H.J.
      • Regier A.A.
      • Corvelo A.
      • Clarke W.E.
      • Musunuri R.
      • Nagulapalli K.
      • et al.
      High coverage whole-genome-sequencing of the expanded 1000 Genomes Project cohort including 602 trios.
      ] has phased the SNP genotypes in all chromosomes, so X-SNP genotypes from females were collected as haplotypes for the centromeric 5-SNP haplotype block, and from males as single chromosome haplotype data (thus, phased by default). All other X-SNP data was compiled with the same approach used for autosomal variants, but accounting for hemizygosity in males when estimating allele frequencies.
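      The hemizygosity adjustment described above can be illustrated with a minimal sketch. It assumes a hypothetical input layout of (sex, alleles) pairs per sample; the function name and data format are illustrative only and are not taken from the VISAGE software. Each male contributes one X chromosome to the allele count and each female contributes two.

```python
from collections import Counter


def x_allele_frequencies(genotypes):
    """Estimate X-SNP allele frequencies from (sex, alleles) pairs.

    Hypothetical input: sex is "M" or "F"; alleles is a tuple of called
    alleles, or an empty tuple for a missing genotype.
    """
    counts = Counter()
    for sex, alleles in genotypes:
        if not alleles:
            continue  # skip missing genotypes
        if sex == "M":
            # Males carry one X: count a single allele, even when the
            # caller reports the genotype as a diploid homozygote.
            if len(set(alleles)) != 1:
                raise ValueError("heterozygous X-SNP call in a male sample")
            counts[alleles[0]] += 1
        else:
            if len(alleles) != 2:
                raise ValueError("female X-SNP genotype needs two alleles")
            counts.update(alleles)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {allele: n / total for allele, n in counts.items()}


samples = [
    ("M", ("A", "A")),  # diploid-style call for a hemizygous male
    ("M", ("G",)),
    ("F", ("A", "G")),
    ("F", ("G", "G")),
]
freqs = x_allele_frequencies(samples)
```

      In this toy example six X chromosomes are counted (one per male, two per female), rather than the eight that a naive diploid count would give.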
      In an operational setting, a forensic ancestry test using ET that analysed co-ancestry patterns would compare Y-SNP data and single chromosome X-SNP genotypes in male samples alone, so phasing into haplotype combinations would not be necessary. We collected the phased data from 1000 Genomes female samples in addition to male genotypes in order to provide the most complete analysis of population variation across the major population groups represented in 1000 Genomes and added X-SNP genotype data for AMR and OCE from whole-genome-sequence analyses of the HGDP-CEPH diversity panel samples. An important parallel study was to assess the viability of X-SNP analysis in the admixed African and admixed American population samples of 1000 Genomes (labelled by this project as ACB, ASW African and MXL, CLM, PUR, PEL American [
      • The 1000 Genomes Project Consortium
      • Auton A.
      • Brooks L.D.
      • Durbin R.M.
      • Garrison E.P.
      • Kang H.M.
      • Korbel J.O.
      • Marchini J.L.
      • McCarthy S.
      • McVean G.A.
      • et al.
      A global reference for human genetic variation.
      ]) - where X chromosomes of varied ancestral lineages are going to be present in a large proportion of these individuals and a degree of recombination may have disrupted the population stratification shown by the selected X-SNPs in the AFR, EUR and AMR admixture contributor populations.

      2.1.4 Microhaplotypes

      We chose Microhaplotypes for incorporation into ET from two sets we had previously designed for MPS sequence analysis [
      • de la Puente M.
      • Ruiz-Ramírez M.J.
      • Ambroa-Conde A.
      • Xavier C.
      • Amigo J.
      • Casares de Cal M.A.
      • Gómez-Tato A.
      • Carracedo A.
      • Parson W.
      • Phillips C.
      • Lareu M.V.
      Broadening the applicability of a custom multi-platform panel of Microhaplotypes: Bio-geographical ancestry inference and expanded reference data.
      ,
      • Phillips C.
      • McNevin D.
      • Kidd K.K.
      • Lagacé R.
      • Wootton S.
      • de la Puente M.
      • Freire-Aradas A.
      • Mosquera-Miguel A.
      • Eduardoff M.
      • Gross T.E.
      • et al.
      MAPlex-A massively parallel sequencing ancestry analysis multiplex for Asia-Pacific populations.
      ] that had been selected and characterised for their ancestry informativeness properties. Carefully selected ancestry informative MH loci will have multiple haplotypes with contrasting population frequencies [
      • Cheung E.Y.Y.
      • Phillips C.
      • Eduardoff M.
      • Lareu M.V.
      • McNevin D.
      Performance of ancestry-informative SNP and microhaplotype markers.
      ], and potentially allow simple mixed DNA deconvolution with the possibility to assign ancestry to components in simple 2-way mixtures, particularly if they are present in unequal ratios [
      • Kidd K.K.
      • Speed W.C.
      • Pakstis A.J.
      • Podini D.S.
      • Lagacé R.
      • Chang J.
      • Wootton S.
      • Haigh E.
      • Soundararajan U.
      Evaluating 130 microhaplotypes across a global set of 83 populations.
      ]. From 22 MHs originally chosen, 21 were successfully incorporated into the ET assay, comprising 8 from the MAPlex BGA panel [
      • Phillips C.
      • McNevin D.
      • Kidd K.K.
      • Lagacé R.
      • Wootton S.
      • de la Puente M.
      • Freire-Aradas A.
      • Mosquera-Miguel A.
      • Eduardoff M.
      • Gross T.E.
      • et al.
      MAPlex-A massively parallel sequencing ancestry analysis multiplex for Asia-Pacific populations.
      ], and 13 from a panel of 113 MHs designed for forensic identification but including several with ancestry informative haplotype distributions [
      • de la Puente M.
      • Ruiz-Ramírez M.J.
      • Ambroa-Conde A.
      • Xavier C.
      • Amigo J.
      • Casares de Cal M.A.
      • Gómez-Tato A.
      • Carracedo A.
      • Parson W.
      • Phillips C.
      • Lareu M.V.
      Broadening the applicability of a custom multi-platform panel of Microhaplotypes: Bio-geographical ancestry inference and expanded reference data.
      ]. Six of the eight MAPlex MH loci were shortened from the original much longer loci containing more SNPs [
      • Kidd K.K.
      • Speed W.C.
      • Pakstis A.J.
      • Podini D.S.
      • Lagacé R.
      • Chang J.
      • Wootton S.
      • Haigh E.
      • Soundararajan U.
      Evaluating 130 microhaplotypes across a global set of 83 populations.
      ] to ensure forensic sensitivity when analysing degraded DNA, by amplifying size-reduced sequences of comparable length to single-site SNP targets. Details of the SNP sets of the 21 MHs selected for ET, and of size reductions where made, are outlined in Table 2.
      Table 2. Genomic details of the 21 Microhaplotypes incorporated in ET. The six loci reduced in size in ET designs to enhance their forensic sensitivity are marked with an asterisk in the Original MH nomenclature column. Spans are in nucleotides; a hyphen indicates no entry.
      Principal SNPs in the haplotype | Extra SNPs in MPS output | Internal MH name | Original MH nomenclature | Principal component SNPs | Extra SNPs in Ion S5 MPS sequence output | 5′ coordinate: GRCh37 | 3′ coordinate: GRCh37 | 5′ coordinate: GRCh38 | 3′ coordinate: GRCh38 | MH span in nucleotides | Original MH span
      4 | 1 | 1pA | - | rs28503881-rs4648788-rs72634811-rs28689700 | rs532405039 | 1529950 | 1529998 | 1594570 | 1594618 | 48 | -
      3 | 2 | MH01 | mh01KK-01* | rs6663840-rs58111155-rs6688969 | rs199565833 / rs548721351 | 3743319 | 3743391 | 3826755 | 3826827 | 72 | 259
      3 | - | 1pD | - | rs6702428-rs12031966-rs6687440 | - | 106770076 | 106770110 | 106227454 | 106227488 | 34 | -
      3 | - | MH03 | mh02KK-134* | rs12469721-rs3101043-rs3111398 | - | 161079411 | 161079450 | 160222900 | 160222939 | 39 | 103
      3 | 2 | MH04 | mh02KK-136 | rs6714835-rs6756898-rs12617010 | rs530973697 / rs546011313 | 228092389 | 228092459 | 227227673 | 227227743 | 70 | 70
      5 | 1 | 3pB | - | rs11129981-rs11129982-rs75361533-rs11129983-rs1896565 | rs528474614 | 42924625 | 42924691 | 42883133 | 42883199 | 66 | -
      5 | 5 | 3qC | - | rs6583335-rs9848767-rs843520-rs9833841-rs965140 | rs559681042 / rs552643442 / rs550318827 / rs60667153 / rs183434367 | 196379897 | 196379993 | 196653026 | 196653122 | 96 | -
      4 | 1 | 4qD | - | rs34521178-rs4533811-rs4450974-rs61132367 | rs531239419 | 182795889 | 182795939 | 181874736 | 181874786 | 50 | -
      4 | 3 | 7pB | - | rs6951954-rs6969555-rs2158900-rs73080042 | rs139000977 / rs185814343 / rs552428908 | 25447589 | 25447640 | 25407970 | 25408021 | 51 | -
      4 | 1 | 8pA | - | rs10097211-rs80063668-rs73660014-rs7007616 | rs538206051 | 3306430 | 3306458 | 3448908 | 3448936 | 28 | -
      5 | 6 | 8pB | - | rs34821009-rs7822905-rs7836134-rs7822909-rs6474278 | rs577517386 / rs539800640 / rs113457629 / rs188201066 / rs113010596 / rs565537969 | 40664194 | 40664243 | 40806675 | 40806724 | 49 | -
      5 | 1 | 9pA | - | rs1408329-rs11789647-rs12555748-rs1535838-rs1408330 | rs567753466 | 2288647 | 2288718 | 2288647 | 2288718 | 71 | -
      3 | 4 | 10pB | - | rs11816330-rs10828819-rs4749046 | rs570240814 / rs536076967 / rs555668598 / rs572123381 | 25839394 | 25839446 | 25550465 | 25550517 | 52 | -
      3 | 2 | MH11 | mh11KK-180* | rs4752778-rs74047734-rs7112918-rs4752777 | rs140892495 / rs555496836 | 1690950 | 1690984 | 1669720 | 1669754 | 34 | 193
      4 | 1 | 12qB | - | rs11177060-rs2111058-rs10878750-rs11835920 | rs571889826 | 68508276 | 68508353 | 68114496 | 68114573 | 77 | -
      4 | - | 15qD | - | rs1816771-rs74033914-rs5007156-rs4965040 | - | 98255928 | 98255978 | 97712698 | 97712748 | 50 | -
      4 | 2 | MH18 | mh16KK-255* | rs16956011-rs3934954-rs3934955-rs3934956 | rs576469239 / rs184092108 | 81970353 | 81970407 | 81936748 | 81936802 | 54 | 142
      4 | 1 | MH20 | mh18KK-293 | rs621320-rs621340-rs678179-rs621766 | rs80093367 | 76089886 | 76089968 | 78329886 | 78329968 | 82 | 82
      3 | 2 | MH21 | mh21KK-315* | rs6517970-rs202132081-rs8131148-rs6517971 | rs533846035 / rs538072435 | 21880158 | 21880231 | 20507846 | 20507919 | 73 | 145
      3 | 2 | MH22 | mh21KK-324* | rs2838868-rs7279250-rs8133697 | rs537553521 / rs567533147 | 46714641 | 46714707 | 45294726 | 45294792 | 66 | 158
      5 | 2 | 22qB | - | rs4925431-rs4925399-rs4925432-rs4925400-rs77899570 | rs192804904 / rs537823715 | 49060976 | 49061028 | 48665164 | 48665216 | 52 | -

      2.2 Reference and test population data

      2.2.1 Public population data from human genome sequencing projects

      A comprehensive population dataset for ET BGA markers was generated by compiling publicly available online whole-genome-sequencing variant data for 3570 samples, published by three major human genome projects [
      • Bergström A.
      • McCarthy S.A.
      • Hui R.
      • Almarri M.A.
      • Ayub Q.
      • Danecek P.
      • Chen Y.
      • Felkel S.
      • Hallast P.
      • Kamm J.
      • et al.
      Insights into human genetic variation and population history from 929 diverse genomes.
      ,
      • Byrska-Bishop M.
      • Evani U.S.
      • Zhao X.
      • Basile A.O.
      • Abel H.J.
      • Regier A.A.
      • Corvelo A.
      • Clarke W.E.
      • Musunuri R.
      • Nagulapalli K.
      • et al.
      High coverage whole-genome-sequencing of the expanded 1000 Genomes Project cohort including 602 trios.
      ,
      • Almarri M.A.
      • Haber M.
      • Lootah R.A.
      • Hallast P.
      • Al Turki S.
      • Martin H.C.
      • Xue Y.
      • Tyler-Smith C.
      The genomic history of the Middle East.
      ]. This population data comprised 2504 1000 Genomes project samples (herein 1KG) now consisting of a revised, higher quality variant dataset based on an average 30x sequence coverage [
      • Byrska-Bishop M.
      • Evani U.S.
      • Zhao X.
      • Basile A.O.
      • Abel H.J.
      • Regier A.A.
      • Corvelo A.
      • Clarke W.E.
      • Musunuri R.
      • Nagulapalli K.
      • et al.
      High coverage whole-genome-sequencing of the expanded 1000 Genomes Project cohort including 602 trios.
      ]; 929 HGDP-CEPH human diversity panel samples (CEPH [
      • Bergström A.
      • McCarthy S.A.
      • Hui R.
      • Almarri M.A.
      • Ayub Q.
      • Danecek P.
      • Chen Y.
      • Felkel S.
      • Hallast P.
      • Kamm J.
      • et al.
      Insights into human genetic variation and population history from 929 diverse genomes.
      ]), and 137 Middle East samples from the analysis of 8 populations by Almarri et al. in 2021 [
      • Almarri M.A.
      • Haber M.
      • Lootah R.A.
      • Hallast P.
      • Al Turki S.
      • Martin H.C.
      • Xue Y.
      • Tyler-Smith C.
      The genomic history of the Middle East.
      ], which we refer to collectively as the ‘Sanger ME’ dataset. We also added 130 samples from the Simons Foundation human genome diversity panel (SGDP [
      • Mallick S.
      • Li H.
      • Lipson M.
      • Mathieson I.
      • Gymrek M.
      • Racimo F.
      • Zhao M.
      • Chennagiri N.
      • Nordenfelt S.
      • Tandon A.
      • et al.
      The Simons Genome Diversity Project: 300 genomes from 142 diverse populations.
      ]) excluding samples that overlap with those of 1KG or CEPH, and 402 samples from the Estonian Biocentre human genome diversity panel (EGDP [
      • Pagani L.
      • Lawson D.J.
      • Jagoda E.
      • Mörseburg A.
      • Eriksson A.
      • Mitt M.
      • Clemente F.
      • Hudjashov G.
      • DeGiorgio M.
      • Saag L.
      • et al.
      Genomic analyses inform on migration events during the peopling of Eurasia.
      ]). Some genotype gaps exist in certain sample panels, notably all the tri-allelic SNP genotypes are missing from EGDP and there is a wide-scale absence of many MH component SNPs from EGDP data. The core ET BGA SNP dataset centred on 1KG, CEPH and Sanger ME SNP genotypes and haplotypes, and we used this data to create a standardised population reference set and to perform most of the evaluations of the ET BGA SNPs’ population differentiation capabilities. SGDP and EGDP data is included as testing sample sets for users to make their own explorations.

      2.2.2 VISAGE in-house study populations

      A range of VISAGE participant laboratory in-house population sample sets (herein Study populations) were genotyped with the ET MPS assay. These sets were chosen to cover geographic gaps in under-represented regions, particularly the Middle East, comprising: 32 individuals from Morocco; 30 from Eritrea; 16 from Somalia; 30 from Central Iraq; 29 from the Kurdistan region of Iraq; 29 Turkish-origin individuals resident in Germany; 41 from Fiji; 19 from rural Brazil (Kalunga individuals, Goiás State), and 16 from urban Brazil (residents of the City of Brasília).
      Informed consent was obtained from all Study population donors, which comprised samples previously obtained from: i. Moroccans resident in Madrid collected in 2008 by the Comisaría General de Policía Científica, Madrid, with written informed consent obtained from donors regarding the use of anonymised samples for the characterisation of population variation; ii. Eritrean, Somali, Central Iraqi, Kurdish Iraqi, and Turkish resident in Germany (co-authors P.M.S., T.E.G.) collected according to the guidelines of the Declaration of Helsinki and approved by the Institutional Review Board of the Faculty of Medicine, University of Cologne, Germany, reference no. 17–416 (dated 16.5.2018); iii. Fijian island samples obtained by Fiji Police Forensic Biology and DNA Laboratory (co-author J.U.), with written informed consent obtained from donors regarding the use of anonymised samples for the characterisation of population variation; iv. Brazilian samples obtained in Brazil (co-authors S.O., M.K.-G., C.C.-G.) with ethical approval from Universidade de Brasília reference No. CAAE: 16542613.8.0000.0030 (rural) and CAAE: 72917916.3.0000.0030 (urban).

      2.2.3 Compilation of standardised reference population datasets

      A standardised seven-population group reference dataset was constructed to enable end-users to make population analyses independently of the VISAGE ET interpretative software. The reference dataset consisted of: Africans represented by 108 1KG Yoruba from Nigeria (YRI); Europeans by 99 1KG NW Europeans from Utah (CEU); East Asians by 103 1KG Han Chinese from Beijing (CHB); South Asians by 103 1KG Gujarati from Houston (GIH); Middle East by 161 HGDP-CEPH Israeli Arabs from Palestinian, Druze and Bedouin populations plus Algerian Mozabite - the latter sample divided into an eighth reference population representing North Africa in STRUCTURE analyses of Eurasians; Oceanians by 28 HGDP-CEPH Papuans from Bougainville and Papua New Guinea; Native Americans by 79 samples, comprising 61 HGDP-CEPH samples from Maya, Pima, Colombian and Amazonian Surui and Karitiana populations, supplemented by 18 1KG Peruvians from Lima, Peru (PEL) which we had previously analysed to indicate no detectable non-American co-ancestry (from analysis of 572,743 Affymetrix Human Origins SNPs, see Table 10.5 of [
      • Phillips C.
      • Amigo J.
      • McNevin D.
      • de la Puente M.
      • Cheung E.Y.Y.
      • Lareu M.V.
      Online population data resources for forensic SNP analysis with Massively Parallel Sequencing: An overview of online population data for forensic purposes.
      ]).
      The 1KG admixed populations, comprising 96 African Caribbean individuals in Barbados (ACB), 61 Americans of African Ancestry in SW USA (ASW), 64 individuals with Mexican Ancestry from Los Angeles USA (MXL), 94 Colombians from Medellin, Colombia (CLM), 104 Puerto Ricans from Puerto Rico (PUR), and 67 of 85 PEL (i.e., with detected co-ancestry), were used as the testing sample set for evaluating the admixture analysis capabilities of ET by comparison with co-ancestry estimates provided by the 1000 Genomes project (personal communication, Adam Auton, Albert Einstein College of Medicine, NYC, USA).

      2.3 Evaluation of ancestry and co-ancestry analysis using ET BGA SNPs

      The efficiency of the ET autosomal BGA SNPs to infer an individual’s population of origin was assessed for a seven-group differentiation of African, European, East Asian, South Asian, American, Oceanian and Middle East ancestries. For BGA prediction within the ET integrated interpretation framework, VISAGE has implemented dedicated software using a strictly Bayesian approach that applies a flat prior probability model and multiple logistic regression to assign a forensic sample to one of the above seven possible ancestry classes. In the reported study we did not apply the alternative likelihood ratio analyses that form the core of the Snipper web portal [

      Available online: http://mathgene.usc.es/Snipper/ Multiple profiles classifier at: 〈http://mathgene.usc.es/snipper/analysismultipleprofiles.html〉 (both accessed 1st February 2023).

      ], but instead relied on STRUCTURE analysis [
      • Pritchard J.K.
      • Stephens M.
      • Donnelly P.
      Inference of population structure using multilocus genotype data.
      ] to assess the ability of the ET ancestry markers to discern complex ancestry patterns in population samples from the Middle East, as well as populations representing regions where admixture to varying degrees is the predominant demographic pattern observed.
      We developed a two-stage nested STRUCTURE analysis approach which analysed the test population sets (POPFLAG=0) with the reference population dataset (POPFLAG=1). The first stage used a reference dataset of five continental populations of AFR, EUR, EAS, AMR and OCE at K:5, and the second, Eurasian-focussed STRUCTURE analysis used a reference population dataset at K:6 consisting of AFR, EUR, SAS, EAS, ME and a sixth North African (NAF) population. The division of Middle East and North African ancestries followed the observation of consistent separation of the NAF Algerian Mozabite reference population from the ME HGDP-CEPH Israeli Arab reference populations at K:6. This approach was adopted after originally evaluating a dual K:5 run orientated towards west Eurasia (AFR, EUR, NAF, ME, SAS) and east Eurasia (EUR, NAF, ME, SAS, EAS), although using these two slightly different K:5 runs did not show any advantage over a K:6 Eurasian analysis. STRUCTURE was run with 100,000 burn-in steps and 100,000 MCMC steps, using correlated allele frequencies under the Admixture model. Cluster membership proportion plots were constructed with CLUMPAK v.1.1 [
      • Kopelman N.M.
      • Mayzel J.
      • Jakobsson M.
      • Rosenberg N.A.
      • Mayrose I.
      Clumpak: a program for identifying clustering modes and packaging population structure inferences across K.
      ]. Optimum ‘K’ genetic cluster values were inferred by calculating mean ΔK and L(K) values using standard protocols [
      • Evanno G.
      • Regnaut S.
      • Goudet J.
      Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study.
      ,
      • Santos C.
      • Phillips C.
      • Gomez-Tato A.
      • Alvarez-Dios J.
      • Carracedo A.
      • Lareu M.V.
      Inference of ancestry in forensic analysis II: analysis of genetic data.
      ].
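      As an illustration of the ΔK calculation cited above, the following minimal sketch computes ΔK from replicate L(K) values using the commonly applied form |L(K+1) − 2L(K) + L(K−1)| / s[L(K)], with means taken across replicate runs. The function name and the toy log-likelihood values are hypothetical and are not results from this study.

```python
from statistics import mean, stdev


def evanno_delta_k(ln_prob):
    """Compute Delta-K for each interior K.

    ln_prob maps K to a list of L(K) = ln P(D|K) values from replicate
    STRUCTURE runs (at least two replicates per K).
    """
    mean_l = {k: mean(values) for k, values in ln_prob.items()}
    delta_k = {}
    for k in sorted(ln_prob):
        if k - 1 not in ln_prob or k + 1 not in ln_prob:
            continue  # the second-order difference needs both neighbours
        sd = stdev(ln_prob[k])
        if sd == 0:
            continue  # Delta-K is undefined when replicates are identical
        second_diff = abs(mean_l[k + 1] - 2 * mean_l[k] + mean_l[k - 1])
        delta_k[k] = second_diff / sd
    return delta_k


# Toy replicate values, for illustration only.
toy_runs = {
    1: [-1000.0, -1002.0],
    2: [-900.0, -902.0],
    3: [-600.0, -602.0],
    4: [-590.0, -592.0],
}
dk = evanno_delta_k(toy_runs)
best_k = max(dk, key=dk.get)
```

      ΔK cannot be evaluated at the smallest or largest K tested, so the range of K values run should extend at least one step either side of the values of interest.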
      The ability of ET X-SNPs to infer the ancestry of an individual’s X-chromosome complement was evaluated using Principal Component Analysis (PCA), by uploading reference data from 1000 Genomes for simple three-way comparisons based on AFR-EUR-AMR, using the ‘Classification of multiple profiles with a custom Excel file of populations’ option in the Snipper web portal [

      Available online: http://mathgene.usc.es/Snipper/ Multiple profiles classifier at: 〈http://mathgene.usc.es/snipper/analysismultipleprofiles.html〉 (both accessed 1st February 2023).

      ]. The multiple profiles classifier provides a Bayes likelihood ratio and PCA analysis, which is now based on the three 2D plots of principal component (PC) 1 vs PC2, PC1 vs PC3, and PC2 vs PC3. X-chromosome ancestries were assigned based on the position of an admixed study sample in relation to these three reference population PCA clusters, and left unassigned if this lay in the region of minor overlap between clusters. Positions were judged to be equidistant from two adjacent cluster centroids by visual inspection of PCA chart data. Such points, occupying intermediate positions where cluster overlap can occur, were interpreted to indicate recombination of contributor population X-SNP alleles or, alternatively, the presence of two X-chromosomes with different ancestries in females. Initial explorations of five-way analyses with PCA of the 16 X-SNPs used the above three populations plus CEPH Oceanians, and 1KG East Asians (CHB), and results were compared using a single Bayes likelihood ratio test in Snipper.
      The 5-SNP centromeric haplotype was compiled in parallel to the full set of 16 SNPs to assess the effect of limited recombination rates on preserving haplotype structures in admixed individuals (with a focus on 1KG ACB and ASW, but also reviewing patterns in 1KG CLM, MXL, PUR, PEL). However, to generate PCAs we used the ‘Naive Bayes (Hardy-Weinberg principle need not apply)’ option in Snipper which adjusts the likelihood calculations to account for non-independence of variables, typically seen with syntenic marker sets.
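      The PCA-based assignment described in this section was made by visual inspection. Purely as an illustration, a numerical analogue can be sketched as a nearest-centroid rule that leaves a sample unassigned when its two closest centroids are almost equidistant. The function names, the tolerance value and the toy PC coordinates below are all assumptions and were not part of the published workflow.

```python
from math import dist
from statistics import mean


def centroid(points):
    """Mean position of a set of points in PC space."""
    return tuple(mean(axis) for axis in zip(*points))


def assign_by_centroid(sample, reference, tolerance=0.1):
    """Assign a sample to the population with the nearest centroid.

    Returns "unassigned" when the relative difference between the two
    smallest centroid distances is within `tolerance` (an arbitrary,
    illustrative threshold standing in for visual judgement).
    """
    centroids = {pop: centroid(pts) for pop, pts in reference.items()}
    ranked = sorted((dist(sample, c), pop) for pop, c in centroids.items())
    (d1, nearest), (d2, _) = ranked[0], ranked[1]
    if d2 == 0 or (d2 - d1) / d2 <= tolerance:
        return "unassigned"
    return nearest


# Toy PC1-PC3 coordinates, for illustration only.
reference_pcs = {
    "AFR": [(-2.0, 0.0, 0.0), (-2.2, 0.1, 0.0)],
    "EUR": [(2.0, 0.0, 0.0), (2.2, -0.1, 0.0)],
    "AMR": [(0.0, 3.0, 0.0), (0.0, 3.2, 0.1)],
}
clear_call = assign_by_centroid((-1.9, 0.0, 0.0), reference_pcs)
ambiguous_call = assign_by_centroid((0.0, 0.0, 0.0), reference_pcs)
```

      A rule of this kind would need calibrating against the reference cluster spread before any practical use, since centroid distance alone ignores the shape and overlap of the clusters.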

      2.4 Microhaplotype reconstruction from ET data and pilot experiments to evaluate ancestry-based deconvolution of simple mixed DNA

      The haplotypes of each MH locus, identified as combinations of composite SNP alleles on the same sequence strand, need to be reconstructed from sequence data obtained from the ET MPS run. For this reason, we previously developed a custom MH calling pipeline for MPS sequence data from the Ion S5 platform [
      • de la Puente M.
      • Phillips C.
      • Xavier C.
      • Amigo J.
      • Carracedo A.
      • Parson W.
      • Lareu M.V.
      Building a custom large-scale panel of novel microhaplotypes for forensic identification using MiSeq and Ion S5 massively parallel sequencing systems.
      ]. In brief: i. a synthetic partial reference genome is constructed from 100 kb sequence segments extracted from the GRCh37/hg19 genome assembly that contain each MH amplicon; ii. raw reads in FASTQ format are aligned to the partial reference genome using the Burrows-Wheeler aligner (BWA) [
      • Li H.
      • Durbin R.
      Fast and accurate short read alignment with Burrows-Wheeler transform.
      ]; iii. alignments are then processed with SAMtools [
      • Li H.
      • Handsaker B.
      • Wysoker A.
      • Fennell T.
      • Ruan J.
      • Homer N.
      • Marth G.
      • Abecasis G.
      • Durbin R.
      The sequence Alignment/Map format and SAMtools.
      ] to create the required input files for running the microhaplot R package [

      N. Thomas, R Package - Microhaplot, (2019) 〈https://github.com/ngthomas/microhaplot〉. (Accessed 1st February 2023).

      ], comprising a VCF file of composite SNPs of each MH and alignments in SAM format, sorted and filtered to remove short reads (<100 bp) and low-quality alignments (mapping quality < 30); iv. microhaplot output provides a raw table of allele strings and depth per MH, which are then filtered by minimum coverage per allele (min_cov), set at 15 reads, and minimum allele read frequency (min_allele_frequency), set at 0.02 for mixtures (0.1 for single-donor samples). All scripts and guidelines for processing raw MPS reads to obtain phased MH alleles are available on GitHub (https://github.com/MariadelaPuente/VISAGE_ET_Microhaplotyper).
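      The final filtering step (iv) can be sketched as follows. This is a minimal illustration and not the code of the published pipeline: the input layout is hypothetical, and allele read frequency is assumed here to be the allele depth divided by the total depth at the locus, which may differ from the definition used in the authors' scripts.

```python
def filter_mh_alleles(read_depths, mixture=False, min_cov=15,
                      min_af_single=0.1, min_af_mixture=0.02):
    """Filter microhaplotype alleles by read depth and read frequency.

    read_depths maps locus name -> {haplotype string: read depth}.
    Thresholds follow the values stated in the text; the frequency
    definition (depth / total locus depth) is an assumption.
    """
    min_af = min_af_mixture if mixture else min_af_single
    kept = {}
    for locus, alleles in read_depths.items():
        total = sum(alleles.values())
        if total == 0:
            kept[locus] = {}
            continue
        kept[locus] = {
            hap: depth
            for hap, depth in alleles.items()
            if depth >= min_cov and depth / total >= min_af
        }
    return kept


# Toy depths for one locus, for illustration only.
toy_depths = {"MH01": {"ACG": 500, "ATG": 40, "GTG": 12, "ACA": 20}}
single_donor = filter_mh_alleles(toy_depths)
two_way_mix = filter_mh_alleles(toy_depths, mixture=True)
```

      With these toy depths, the single-donor threshold retains only the dominant haplotype, while the lower mixture threshold also retains the two minor haplotypes that reach 15 reads.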
      As a pilot study of the viability of using MH data to infer the ancestry of components in simple 2-way mixed DNA, we combined two Coriell control DNA samples NA07000 and NA18498 at 1:1, 1:3 and 1:9 ratios. Coriell samples NA07000 and NA18498 have EUR and AFR ancestries, respectively; and the mixed DNA was run with the ET MPS genotyping assay using an optimised MPS protocol, with the sequence output processed as outlined above to reconstruct the haplotypes of the 21 MH loci. For the ancestry-based deconvolution of the multiple sequences observed in the mixtures, three analysts were asked to independently assign haplotypes based on the proportion of sequence reads recorded for each allele and with prior knowledge of the mixture proportions in each sample. A consensus 21 MH set of profiles was generated from the mixtures, adopting a conservative approach when there were discrepancies amongst analysts, i.e., the profile with fewer alleles assigned was used. Finally, each 21 MH profile was analysed in STRUCTURE, alongside a standard reference set of 1KG phased haplotypes for the samples from YRI, CEU, CHB, i.e., forming a simplified three-way AFR-EUR-EAS ancestry inference test that exploits the limits of differentiation for these continental population groups when analysing MH loci alone.

      3. Results

      3.1 Characteristics of autosomal BGA SNPs selected for ET

      The genomic characteristics of the 104 autosomal BGA SNPs selected for ET are detailed in Table 3, with markers divided into the population differentiation they provide. The full genotype grids compiled from online datasets and in-house genotyping of population samples with ET are provided in Supplementary Table S1A. Allele frequency estimates for these SNPs are listed in Supplementary Table S1B, with summary frequencies for the 1KG and CEPH population groups, but individually for the Sanger ME and VISAGE study populations, as the gnomAD v.3.1.2 variant database [
      • Lek M.
      • Karczewski K.J.
      • Minikel E.V.
      • Samocha K.E.
      • Banks E.
      • Fennell T.
      • O’Donnell-Luria A.H.
      • Ware J.S.
      • Hill J.A.J.
      • Cummings B.B.
      • et al.
      Analysis of protein-coding genetic variation in 60,706 humans.
      ] lists individual population allele frequencies for nearly all SNPs. One SNP that was originally thought to be tri-allelic, rs6504633, was in fact tetra-allelic - i.e., showing four common nucleotide substitution alleles in some populations [
      • Phillips C.
      • Amigo J.
      • Carracedo A.
      • Lareu M.V.
      Tetra-allelic SNPs: Informative forensic markers compiled from public whole-genome sequence data.
      ].
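      Table 3. Genomic details of autosomal ancestry informative SNPs incorporated in ET. SNP rs3857620 was the only redundant marker in terms of uninformative allele frequencies. SNPs are listed in each population set in descending order of differentiation power. BT: originally in the VISAGE BT ancestry panel; EVC: shared with the EVC informative SNP set; Chr: chromosome.
      Marker set | No | SNP | Source | Chr | GRCh37 coordinate | GRCh38 coordinate
      African | 1 | rs2814778 | BT | 1 | 159174683 | 159204893
      African | 2 | rs1871534 | Novel | 8 | 145639681 | 144414297
      African | 3 | rs2789823 | BT | 9 | 136769888 | 133904766
      African | 4 | rs1197062 | BT | 17 | 58641118 | 60563757
      African | 5 | rs9479657 | Novel | 6 | 153928396 | 153607261
      European | 1 | rs16891982 | EVC BT | 5 | 33951693 | 33951588
      European | 2 | rs1426654 | EVC BT | 15 | 48426484 | 48134287
      European | 3 | rs12913832 | EVC BT | 15 | 28365618 | 28120472
      European | 4 | rs12142199 | BT | 1 | 1249187 | 1313807
      European | 5 | rs8072587 | BT | 17 | 19211073 | 19307760
      European | 6 | rs10962599 | BT | 9 | 16795286 | 16795288
      European | 7 | rs9522149 | BT | 13 | 111827167 | 111174820
      European | 8 | rs2196051 | BT | 8 | 122124302 | 121112062
      European | 9 | rs1924381 | BT | 13 | 72321856 | 71747724
      European | 10 | rs2715883 | BT | 11 | 120133494 | 120262785
      European | 11 | rs1592672 | Novel | 12 | 80128593 | 79734813
      East Asian | 1 | rs3827760 | BT | 2 | 109513601 | 108897145
      East Asian | 2 | rs1545397 | Novel | 15 | 28187772 | 27942626
      East Asian | 3 | rs1229984 | BT | 4 | 100239319 | 99318162
      East Asian | 4 | rs6437783 | Novel | 3 | 108172817 | 108453970
      East Asian | 5 | rs1371048 | BT | 15 | 64161351 | 63869152
      East Asian | 6 | rs881929 | Novel | 2 | 145753166 | 144995599
      East Asian | 7 | rs4657449 | BT | 16 | 31079371 | 31068050
      Oceanian | 1 | rs4471745 | Novel | 1 | 165465281 | 165496044
      Oceanian | 2 | rs3751050 | BT | 17 | 53568884 | 55491523
      Oceanian | 3 | rs10954737 | Novel | 11 | 9091244 | 9069697
      Tri-allelic SNPs | 1 | rs1074689 | Novel | 16 | 52216074 | 52182162
      Tri-allelic SNPs | 2 | rs1150911 | Novel | 1 | 228494382 | 228306681
      Tri-allelic SNPs | 3 | rs12629397 | Novel | 3 | 65814779 | 65829104
      Tri-allelic SNPs | 4 | rs1382568 | Novel | 8 | 11351220 | 11493711
      Tri-allelic SNPs | 5 | rs1398461 | BT | 13 | 83839778 | 83265643
      Tri-allelic SNPs | 6 | rs17287498 | Novel | 10 | 54530788 | 52771028
      Tri-allelic SNPs | 7 | rs2375771 | Novel | 4 | 187371930 | 186450776
      Tri-allelic SNPs | 8 | rs2387842 | Novel | 12 | 38736442 | 38342640
      Tri-allelic SNPs | 9 | rs2585339 | BT | 14 | 49134978 | 48665775
      Tri-allelic SNPs | 10 | rs2605361 | BT | 12 | 74903531 | 74509751
      Tri-allelic SNPs | 11 | rs2737126 | BT | 17 | 3618815 | 3715521
      Tri-allelic SNPs | 12 | rs392461 | Novel | 5 | 81720271 | 82424452
      Tri-allelic SNPs | 13 | rs393953 | Novel | 21 | 43389036 | 41968927
      Tri-allelic SNPs | 14 | rs408046 | Novel | 15 | 80031510 | 79739168
      Tri-allelic SNPs | 15 | rs4540055 | BT | 4 | 38803255 | 38801634
      Tri-allelic SNPs | 16 | rs5030240 | Novel | 11 | 32424389 | 32402843
      Tri-allelic SNPs | 17 | rs556365 | Novel | 16 | 65927802 | 65893899
      Tri-allelic SNPs | 18 | rs6588145 | Novel | 1 | 65859784 | 65394101
      Tri-allelic SNPs | 19 | rs6933094 | BT | 6 | 150297603 | 149976467
      Tri-allelic SNPs | 20 | rs7171818 | Novel | 15 | 58855169 | 58562970
      Tri-allelic SNPs | 21 | rs776912 | BT | 1 | 10847784 | 10787727
      Tri-allelic SNPs | 22 | rs7989291 | Novel | 13 | 57572989 | 56998855
      Tri-allelic SNPs | 23 | rs809540 | Novel | 2 | 7879001 | 7738870
      Tri-allelic SNPs | 24 | rs914468 | Novel | 20 | 62100463 | 63469110
      Tri-allelic SNPs | 25 | rs9845503 | Novel | 3 | 59700977 | 59715251
      Tri-allelic SNPs | 26 | rs6504633 (tetra-allelic SNP) | Novel | 17 | 48112927 | 50035563
      American | 1 | rs12498138 | BT | 7 | 83533047 | 83903731
      American | 2 | rs12594144 | BT | 20 | 62157718 | 63526365
      American | 3 | rs7151991 | Novel | 3 | 121459589 | 121740742
      American | 4 | rs17130385 | BT | 14 | 32635572 | 32166366
      American | 5 | rs3737576 | BT | 10 | 115196019 | 113436260
      American | 6 | rs6088466 | Novel | 1 | 101709563 | 101244007
      American | 7 | rs9847307 | Novel | 20 | 32913534 | 34325728
      American | 8 | rs11960137 | Novel | 3 | 64525713 | 64540037
      American | 9 | rs2024566 | Novel | 5 | 155338081 | 155911071
      South Asian | 1 | rs182857716 | Novel | 22 | 41697338 | 41301334
      South Asian | 2 | rs367953206 | Novel | 16 | 48221771 | 48187860
      South Asian | 3 | rs3857620 | Novel | 6 | 57496076 | 57629240
      South Asian | 4 | rs1757928 | BT | 4 | 130022161 | 129101006
      South Asian | 5 | rs2472304 | BT | 15 | 75044238 | 74751897
      South Asian | 6 | rs12405776 | Novel | 1 | 242431557 | 242268255
      South Asian | 7 | rs2026999 | BT | 9 | 103140157 | 100377875
      South Asian | 8 | rs3844336 | BT | 8 | 62214766 | 61302207
      South Asian | 9 | rs1796048 | BT | 2 | 97643576 | 96977839
      South Asian | 10 | rs1567803 | Novel | 2 | 101343018 | 100726556
      South Asian | 11 | rs6754311 | Novel | 2 | 136707982 | 135950412
      South Asian | 12 | rs13280988 | BT | 8 | 112370516 | 111358287
      South Asian | 13 | rs17625895 | BT | 16 | 25775102 | 25763781
      South Asian | 14 | rs10764919 | BT | 10 | 131663651 | 129865387
      South Asian | 15 | rs1040934 | BT | 10 | 78066260 | 76306502
      Middle East | 1 | rs1024124 | Novel | 15 | 33617064 | 33324863
      Middle East | 2 | rs12880237 | Novel | 14 | 68621818 | 68155101
      Middle East | 3 | rs1317026 | Novel | 6 | 161154955 | 160733923
      Middle East | 4 | rs1495085 | BT | 8 | 15298515 | 15441006
      Middle East | 5 | rs166054 | Novel | 16 | 11285202 | 11191345
      Middle East | 6 | rs17086288 | Novel | 6 | 124210612 | 123889467
      Middle East | 7 | rs2156208 | Novel | 18 | 60131306 | 62464073
      Middle East | 8 | rs234623 | Novel | 20 | 57488964 | 58913909
      Middle East | 9 | rs262037 | Novel | 5 | 177990886 | 178563885
      Middle East | 10 | rs2835133 | Novel | 21 | 37133457 | 35761159
      Middle East | 11 | rs310362 | Novel | 8 | 59925618 | 59013059
      Middle East | 12 | rs3852253 | Novel | 7 | 18866190 | 18826567
      Middle East | 13 | rs3862700 | Novel | 18 | 67862224 | 70194988
      Middle East | 14 | rs4308478 | BT | 5 | 136334314 | 136998625
      Middle East | 15 | rs4465645 | Novel | 17 | 50832843 | 52755483
      Middle East | 16 | rs4737753 | BT | 8 | 54701811 | 53789251
      Middle East | 17 | rs487750 | Novel | 9 | 138603740 | 135711894
      Middle East | 18 | rs6496996 | Novel | 15 | 93402496 | 92859266
      Middle East | 19 | rs6701640 | Novel | 1 | 170696474 | 170727333
      Middle East | 20 | rs6894681 | Novel | 5 | 127218995 | 127883303
      Middle East | 21 | rs7252391 | Novel | 19 | 44142771 | 43638619
      Middle East | 22 | rs7594173 | Novel | 2 | 32900330 | 32675263
      Middle East | 23 | rs7816786 | Novel | 8 | 101349662 | 100337434
      Middle East | 24 | rs7975017 | Novel | 12 | 26428793 | 26275860
      Middle East | 25 | rs848461 | Novel | 7 | 77582265 | 77952948
      Middle East | 26 | rs9467370 | Novel | 6 | 24968682 | 24968454
      Middle East | 27 | rs9817359 | Novel | 3 | 76473163 | 76424012
      Middle East | 28 | rs9899480 | Novel | 17 | 36185665 | 37826045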
      We also list in Supplementary Table S1A the 184 EVC-SNP marker details and genotypes. Some of the ET EVC SNPs showed potentially informative allele frequency distributions across the global population groups of this study (the informativeness of individual EVC-SNPs for the relevant populations is marked in Supplementary Table S1A) and we wished to explore whether they can improve ancestry inferences when combined with the dedicated BGA SNPs of ET (see Section 3.5.3).
      Cumulative population-specific Divergence values were calculated (individual SNP data not shown) for the five main continental population groups and were quite comparable for In EAS = 6.789, In OCE = 6.857 and In AME = 7.6324, indicating that, despite quite different numbers of BGA SNPs targeting these populations, they were well balanced. However, In AFR = 13.154 and In EUR = 11.351 are much higher Divergence values. These reflect a bias towards selecting BGA SNPs that could most efficiently differentiate Middle East and South Asian populations from those of Europe, while African-informative allele frequency distributions are seen in almost all BGA SNPs selected, particularly Middle East-informative markers. This is the reason why only five SNPs specifically targeting African population differentiation were selected for the ET ancestry SNP set.
      Only one selected SNP that was successfully incorporated into the ET MPS assay failed to produce the expected genotypes in the populations studied. This was highlighted when we reviewed the more detailed genomic datasets generated from high coverage sequencing of 1000 Genomes samples published by the New York Genome Center [Byrska-Bishop M., Evani U.S., Zhao X., Basile A.O., Abel H.J., Regier A.A., Corvelo A., Clarke W.E., Musunuri R., Nagulapalli K., et al. High coverage whole-genome-sequencing of the expanded 1000 Genomes Project cohort including 602 trios]. The uninformative BGA SNP was rs3857620, with an average rs3857620-A allele frequency in 1000 Genomes SAS populations of ∼46%, in contrast to 0% in EUR and 2% in EAS - suggesting a highly informative marker. However, the data from the high coverage sequencing variant calls indicate this SNP has almost no variation in the rs3857620-A allele, with rs3857620-G present in all populations at 99–100%. The rs3857620 frequencies in 1KG and CEPH population groups are summarised in Supplementary Fig. S2. Overall, this uninformative marker illustrates the importance of cross-checks between online SNP databases, as the main 1KG Phase III data portal of Ensembl [] continues to show inaccurate allele frequency data for this SNP at the time of writing, while gnomAD correctly compiles all the HGDP-CEPH and 1KG high sequence coverage rs3857620 allele frequencies that match what we observed.

      3.2 Characteristics of X-SNPs selected for ET

      The full set of X-SNP genotypes for all online genome datasets and study populations, apart from those of EGDP, are listed in Supplementary Table S1C. This data is arranged in an identical grid to the autosomal SNPs in Supplementary Table S1A for sample order and population grouping. Allele frequency estimates for the 16 ancestry-informative X-SNPs are provided in Supplementary Table S1D.
      The chromosomal distribution and six-population summary allele frequencies of the 16 X-SNPs selected for inclusion in ET are shown in Fig. 2. A distinction is made between the five p-arm SNPs plus six q-arm SNPs compiled, and the five SNPs forming a tightly linked set of markers and consequent 5-allele haplotypes on the 3′ side of the X centromere. The Kosambi-adjusted recombination fractions (Rc values) are shown for the X-centromeric haplotype block (herein 5-SXC) indicating minimal recombination in this region, with close to zero recombination likely between the rs6655556-rs5937317 SNP pair at the 3′ end of the block.
      Fig. 2. X chromosome ideogram showing the positions of 16 ancestry-informative X-SNPs selected for ET. The grey bar on the 3′ side of the centromere defines the position of the 5-SNP X centromeric haplotype block (5-SXC) with estimated recombination fraction (Rc) values between the five component X-SNPs shown above each marker pair. Pie charts summarise the allele frequency distributions in six population groups, based on the detailed genotype data in . The population(s) most differentiated by each X-SNP are marked with red triangles.
      The autosomal BGA SNPs of ET efficiently differentiate each of the seven population groups targeted by the SNP selection made for the panel (see Section 3.5), and this extends to the analysis of co-ancestry patterns detected in individuals with admixed backgrounds. However, the panel of 16 X-SNPs was selected to differentiate only the five main continental groups, and we found no indications of strong divergence between South Asian or Middle East populations and the other continental groups. Since there is no need to improve differentiation of unadmixed individuals by combining autosomal and X-SNPs, we advocate analysing X-SNP data separately (along with Y-SNP data in males), when an admixed background is inferred from patterns detected in autosomal BGA SNPs (see Section 3.5.2), so analyses will benefit from inference of contributor ancestries (which in male samples comprise matrilineal and patrilineal components). Given the limits of X-SNP differentiations, any analyses indicating South Asian or Middle East co-ancestry would not gain extra information on the likely admixture contributors from X-SNP genotypes.

      3.2.1 16 X-SNPs

      We adopted an approach for analysing the ancestry of an individual’s X-chromosome complement by treating the complete 16 X-SNP genotypes and the 5-SXC haplotypes as two separate datasets. The full set of 16 X-SNPs can be treated as a linked group of syntenic markers analysed with a likelihood ratio test in Snipper that is adjusted for association of the alleles tested. If such a test provides a high likelihood value (more than 100 times more likely for a given population) and the sample occupies a position in a PCA plot within a reference population cluster of points, the ancestry inference for the X-chromosome(s) of the individual can be considered to be reasonably secure. Analysis of all seven population groups targeted by ET indicated SAS and ME populations are not well differentiated using the 16 X-SNPs alone (data not shown), and PCA patterns lack clearly defined clusters for these two population groups that would aid easy interpretation. Therefore, we evaluated a five-population ancestry analysis of X-SNPs by testing a random selection of 20 SGDP samples, comprising four from each population group. Supplementary Fig. S3 shows the Bayes likelihood ratio values and PCA plots for this test and indicates all samples were well classified, although American and Oceanian PCA positions require scrutiny of the PC1 vs PC3 and PC2 vs PC3 plot patterns, and the ‘Korean-1’ likelihood, though correctly assigned as EAS, was below the ‘100 times more likely’ threshold value.
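The likelihood-ratio threshold described above can be illustrated with a minimal sketch. It assumes independent loci and a hemizygous (male) X profile, so it does not reproduce the association adjustment that Snipper applies to linked X-SNPs. The marker names, allele labels, allele frequencies and function names are all invented for illustration and are not ET reference data.

```python
import math

# Hypothetical frequencies of allele "A" per marker in three reference groups.
# Values are invented for illustration only.
REF_FREQS = {
    "AFR": {"snp1": 0.90, "snp2": 0.10, "snp3": 0.80},
    "EUR": {"snp1": 0.20, "snp2": 0.85, "snp3": 0.30},
    "EAS": {"snp1": 0.15, "snp2": 0.20, "snp3": 0.95},
}


def log_likelihood(profile, freqs, floor=0.001):
    """Log-likelihood of a hemizygous X profile, assuming independent loci."""
    total = 0.0
    for marker, allele in profile.items():
        p = freqs[marker] if allele == "A" else 1.0 - freqs[marker]
        total += math.log(max(p, floor))  # floor avoids log(0) for unseen alleles
    return total


def classify(profile, ref=REF_FREQS, threshold=100.0):
    """Return (best population, ratio over the runner-up, ratio > threshold)."""
    ranked = sorted(
        ((log_likelihood(profile, f), pop) for pop, f in ref.items()),
        reverse=True,
    )
    (best_ll, best_pop), (second_ll, _) = ranked[0], ranked[1]
    ratio = math.exp(best_ll - second_ll)
    return best_pop, ratio, ratio > threshold


# Allele "B" stands for the alternative allele at each bi-allelic marker.
best, lr, secure = classify({"snp1": "A", "snp2": "B", "snp3": "A"})
```

With these invented values the top-ranked population is AFR, but its ratio over the runner-up stays below 100, so the inference would not meet the threshold even though the ranking is clear.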
      Although this Bayes-PCA test illustrates the effectiveness of the X-SNPs compiled for ET and can be run independently of an autosomal SNP analysis, there is little reason to perform such a test if the sample gives no indications of admixture. When admixture is detected from the autosomal SNP patterns, then the 16 X-SNP data can be more informative, particularly when the sample is male, enabling X-SNP and Y-SNP genotypes to be directly compared. Furthermore, an atypical X chromosome can still occur and will be undetected if there are no apparent co-ancestry patterns amongst the autosomal SNP data. As an example of this phenomenon, the very evident pink reference sample point within the EUR cluster of the PCAs in Fig. 3 is a male PEL sample HG02265, which gave > 99% AMR genetic cluster membership proportions in both 1KG and our own analyses.
      Fig. 3. PCA plots of AFR, EUR, AMR reference populations and admixed African populations from 1000 Genomes, using 16 X-SNP genotype data. 3A: African Caribbeans from Barbados (ACB); 3B: African Americans from SW USA (ASW), each with male and female separately analysed; 3C: VISAGE Study population Urban Brazilian males and Rural Brazilian males. Pie charts combine data from both PCAs for ACB and ASW, with proportions of each identified X ancestry in each set of samples (Brazil populations with individual pie charts). Blank pie chart segments and grey dots represent undetermined PCA points occupying intermediate positions between the reference population PCA clusters. The genetic cluster plots in ACB and ASW show the thirty highest non-African co-ancestry proportions in each population taken from the 1000 Genomes own genetic cluster analyses using genome-wide SNP data (), with dots next to each sample denoting those with an identified non-African PCA position. Note that this compares the autosomal ancestry inference of the 1000 Genomes samples with that made here for the X chromosome, if not African.
      To evaluate how the full set of 16 X-SNP genotypes performed with admixed individuals, we used the 1KG admixed American genotypes listed in Supplementary Table S1C and applied a PCA test with reduced reference data comprising AFR (YRI), EUR (CEU) and AMR (CEPH-PEL) genotypes. Results from PCA tests for ACB and ASW males and females, tested separately, are shown in Fig. 3A and 3B, alongside the genetic cluster analysis of 1000 Genomes for the 30 samples in each population with the lowest proportions of AFR co-ancestry (note 1000 Genomes used the ADMIXTURE genetic cluster algorithm, not STRUCTURE, but results are directly comparable). The overall percentage proportions of African, European, and American inferred X chromosome ancestries identified amongst male and female ACB, ASW and Study Brazilians using PCA tests are summarised in Table 4.
      Table 4. Proportions of African, European and American X chromosome ancestries (inferred using 16-SNP PCA tests) identified amongst African Caribbeans in Barbados (ACB); African Americans from southwest USA (ASW); and Study Brazilians from a rural region sample (Kalunga, Goiás State) and an urban sample (Brasilia). The increased levels of European co-ancestry in ASW compared to ACB, and the dominance of European co-ancestry in urban Brazilians compared to rural Brazilians, are both evident in individuals whose X chromosome ancestries could be successfully inferred from their PCA positions.
      Admixed population samples | Total no. of samples | Undetermined PCA position | X Ancestry Inference Success Rate | % AFR X chromosomes | % EUR X chromosomes | % AMR X chromosomes
      ACB Males | 47 | 2 | 96% | 90% | 6% | -
      ACB Females | 49 | 2 | 96% | 94% | 2% | -
      ASW Males | 26 | 4 | 85% | 65% | 20% | -
      ASW Females | 35 | 8 | 77% | 60% | 11% | 6%
      Urban Brazil Males* | 16 | 2 | 87% | 19% | 62% | 6%
      Rural Brazil Males* | 18 | 3 | 83% | 61% | 17% | 6%
      * Only one female in these population samples
      Taking each sample set in turn, there are three ACB males inferred to have EUR X chromosomes, two that occupy PCA space between the AFR and EUR clusters and so are undetermined, and the rest (42) are inferred to have AFR X chromosomes. This corresponds to proportions of ∼6% of ACB males with a EUR X and ∼90% with an AFR X, with only 4% of ACB males having X chromosome patterns that could not be discerned. The individuals in intermediate PCA space or on the edge of the reference clusters could be interpreted to show some recombination, but this would only be discernible in males, since such positions in females can equally represent a heterologous X chromosome pair with different ancestries. There were five ASW males with an inferred EUR X (∼19%), 17 with an inferred AFR X (∼65%), and four undetermined.
      Notably, amongst ACB females there were only two undetermined individuals with intermediate PCA positions and just one inferred EUR X chromosome pair. ASW females were the only sample set to show inferred AME X chromosome pairs, but both correlate well with the genetic cluster patterns from 1000 Genomes’ analyses, particularly sample NA20134, which has no detectable AFR ancestry with either test (although self-declared to have four African American grandparents). There were eight undetermined X ancestries in this set, although the PCA positions of NA19625 and NA20299 between AFR and AMR reference clusters match well with the genetic cluster proportions of both co-ancestries detected by 1000 Genomes. However, it is not possible to say whether these patterns are due to recombination amongst the 16 X-SNPs or to heterologous X chromosome ancestries. Four inferred EUR X chromosome pairs also reflect the higher levels of EUR co-ancestry in ASW compared to ACB, with a consequent smaller proportion of 21/35 inferred AFR X chromosome pairs (60%).
      The Brazilian samples were assessed in the same way to evaluate how successfully a de novo sample set could be analysed using PCA cluster analysis of X-SNP data. This sample of 16 urban male Brazilians (Brasília) and 18 rural male Brazilians (Kalunga, Goiás State) included a single rural female Brazilian, who was not analysed. Comparisons of the EUR, AFR and AMR X ancestries inferred from the PCA patterns showed a marked contrast between the urban and rural samples, with urban Brazilians having the highest proportion of EUR X chromosomes (62%) and lowest AFR (19%) amongst the four admixed populations studied here, whereas the rural Brazilians showed 17% EUR X chromosomes and 61% AFR - akin to the proportions seen in ASW. A single male from each Brazilian sample had an AMR X chromosome. Results from the PCA tests along with pie-chart summaries of the above ancestry proportions of each population sample (i.e., males and females combined) are shown in Fig. 3. Similar analyses were made of male and female CLM, MXL, PUR and PEL, shown in Supplementary Figs S4A-S4H. However, the three-way co-ancestry patterns common to these population samples (i.e., AFR-EUR-AMR in varied co-ancestry ratios) make it difficult to infer X ancestries with the same level of confidence as ACB, ASW and Study Brazilians.
      These results indicate that well separated clusters are obtained in PCA for 16 X-SNPs using reference data representing the three major contributor populations of admixed Americans. The system provides a simple way to compare the positions of individuals with unknown ancestry when they show signals of admixture in their autosomal SNP genetic cluster patterns. Our assessments suggest PCA provides a secure inference of the X ancestry in the majority of ACB and ASW test samples, and males have the advantage of Y-SNP genotype data from ET with which to compare the pattern of variation observed in the X-SNPs. When the levels of minority co-ancestry are low the X ancestry inference success rates are consequently higher, likely due to a smaller proportion of intermediate PCA positions (undetermined ancestry) caused by heterologous X pairs in females and reduced disruptive recombination in the X chromosomes of males.
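The idea of reading an X ancestry from a PCA position can be sketched as a simple rule, assuming the PC1/PC2 coordinates have already been computed. The assignment rule below (nearest reference centroid, accepted only within a multiple of that cluster's own spread) is an assumption made for illustration; the study describes inspecting PCA plots and does not state a numerical criterion. All coordinates and the function names are invented.

```python
import math


def centroid(points):
    """Mean position of a list of (PC1, PC2) points."""
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def assign_by_cluster(sample, reference, slack=1.5):
    """Assign a PCA point to the nearest reference cluster, or 'undetermined'.

    A point is assigned only if its distance to the nearest centroid is at most
    `slack` times the largest centroid-to-member distance in that cluster.
    """
    best_pop, best_dist, best_radius = None, math.inf, 0.0
    for pop, points in reference.items():
        c = centroid(points)
        d = math.dist(sample, c)
        if d < best_dist:
            radius = max(math.dist(p, c) for p in points)
            best_pop, best_dist, best_radius = pop, d, radius
    return best_pop if best_dist <= slack * best_radius else "undetermined"


# Invented PC1/PC2 coordinates standing in for reference clusters.
REFERENCE = {
    "AFR": [(-4.0, 0.2), (-3.6, -0.1), (-4.2, 0.0)],
    "EUR": [(3.0, 2.1), (3.3, 1.8), (2.8, 2.0)],
    "AMR": [(2.9, -2.5), (3.2, -2.2), (3.1, -2.7)],
}

inside_afr = assign_by_cluster((-3.9, 0.1), REFERENCE)
intermediate = assign_by_cluster((-0.5, 1.0), REFERENCE)
```

A point inside a reference cluster is assigned to that population, while a point lying between clusters is returned as undetermined, which corresponds to the grey dots and blank pie chart segments described for Fig. 3.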
      All X-SNP data uploaded to the Snipper Bayes-PCA analysis portal are provided as a series of Excel worksheets (that can be made active individually for uploading to Snipper by placing in the leftmost ‘worksheet 1’ position in the uploaded file) in Supplementary File S1.

      3.2.2 The 5-SNP X centromere haplotype

      Although recombination was not observed to be a frequent disruptor of the 16 X-SNP genotype combinations across the full chromosome length in ACB and ASW males, representing only 5–15% of intermediate, and therefore undetermined, PCA positions, it is instructive to focus on the 5-SXC haplotypes, which show much lower levels of allelic assortment than the whole chromosome. The estimated recombination rate [Phillips C., Ballard D., Gill P., Court D.S., Carracedo A., Lareu M.V. The recombination landscape around forensic STRs: accurate measurement of genetic distances between syntenic STR pairs using HapMap high density SNP data] of the first 5′ p-arm ET X-SNP to the 5′-most 5-SXC SNP is 45.3%, and the 3′-most 5-SXC SNP to last 3′ q-arm X-SNP is 45%, compared with an estimated 6.8% across the 5-SXC haplotype span. The full details of the centimorgan values and Kosambi-adjusted Rc estimates [Phillips C. et al. The recombination landscape around forensic STRs] between each of the 16 X-SNPs of ET are listed in Supplementary Table S1C, rows 3–5. As a result of these estimated recombination rates, population-specific X-SNP patterns will become assorted in individuals with co-ancestry within a few generations, but the 5-SXC allelic combinations will stay intact across a much longer period of time (i.e., many more generations) after each admixture event. As such, the 5-SXC provides a more secure way to track the X ancestry of individuals with admixed backgrounds of unknown time-depth.
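The relationship between map distance and recombination fraction, and the consequence for haplotype persistence, can be sketched as follows. The Kosambi map function is the standard one; the persistence calculation is a simplifying assumption added here (independent transmissions, each with the same recombination fraction, counting only meioses in which the X chromosome can recombine) and is not a calculation reported in the study. Function names are illustrative.

```python
import math


def kosambi_recombination_fraction(centimorgans):
    """Kosambi map function: r = 0.5 * tanh(2d), with d in Morgans."""
    d = centimorgans / 100.0
    return 0.5 * math.tanh(2.0 * d)


def kosambi_map_distance(r):
    """Inverse Kosambi function, returning centimorgans for 0 <= r < 0.5."""
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


def prob_transmitted_intact(r, meioses):
    """Probability that an interval is passed on as a non-recombinant in every
    one of `meioses` independent transmissions (simplifying assumption)."""
    return (1.0 - r) ** meioses


# Recombination fractions quoted in the text: 6.8% across the 5-SXC span,
# 45.3% between the first p-arm X-SNP and the 5'-most 5-SXC SNP.
block_intact = prob_transmitted_intact(0.068, 5)
arm_intact = prob_transmitted_intact(0.453, 5)
```

Under these assumptions, after five such transmissions the 5-SXC span remains non-recombinant with a probability of roughly 0.70, against roughly 0.05 for the interval with a 45.3% recombination fraction, which is the contrast the paragraph above describes qualitatively.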
      Analysis of 5-SXC haplotype frequency distributions in 1KG and CEPH populations is outlined in Fig. 4, with the underlying data for all populations studied listed in Supplementary Table S1C. It is evident from these haplotype frequency plots that African and East Asian 5-SXC haplotypes are highly specific, with four ‘signature’ haplotypes accounting for 90.3% of those observed in 1KG AFR (CGTTT; CGCTT; CGTCT; AGTTT, with CGTTT alone forming almost 60% of observed haplotypes), and 88.9% of 1KG EAS (AATCT; CATCT; AACCT; CACCT, with AATCT also near 60% of observed haplotypes). 1KG EUR have five specific haplotypes with a collective frequency of 75.5%, and CEPH AMR four haplotypes with a collective frequency of 60%. It should be noted that the absence of the CGCTC haplotype is likely to be due to inaccurate phasing of these SNPs in the CEPH genome data, since CGCTC is found at an equivalent frequency to AACTT in 1KG admixed American populations (suggesting the collective AMR-specific haplotype frequency of 60% is likely to be an underestimate by ∼8%). 1KG SAS almost exclusively show combinations of European and East Asian specific haplotypes and lack South Asian specific haplotypes, although AATTT accounts for 9.5% of total 5-SXC haplotypes in this population and, while present in EUR at 4% frequency, is not frequent in AFR or EAS. In contrast to SAS, CEPH OCE have a frequent, highly indicative CATTT haplotype accounting for 46.2% of all haplotypes observed in this sample, about ten times more frequent than the 4.8% of CATTT haplotypes observed in 1KG SAS. The Study Fijian 5-SXC haplotype data are included (haplotypes in females were not phased but inferred), and the CATTT and AATTT haplotypes indicate that Oceanian ancestries and South Asian admixture are observable characteristics of this population sample.
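The collective frequency of a set of 'signature' haplotypes, as quoted above, is a simple tally over observed haplotypes. The sketch below shows that tally on an invented list of ten haplotypes; only the four AFR signature haplotype strings are taken from the text, and the function names are illustrative.

```python
from collections import Counter


def haplotype_frequencies(haplotypes):
    """Relative frequency of each observed haplotype string."""
    counts = Counter(haplotypes)
    total = sum(counts.values())
    return {hap: n / total for hap, n in counts.items()}


def collective_frequency(freqs, signature):
    """Summed frequency of the haplotypes in a signature set."""
    return sum(freqs.get(hap, 0.0) for hap in signature)


# AFR signature haplotypes as listed in the text.
AFR_SIGNATURE = {"CGTTT", "CGCTT", "CGTCT", "AGTTT"}

# Invented observations for illustration only (not study data).
observed = ["CGTTT"] * 6 + ["CGCTT"] * 2 + ["AGTTT"] + ["AATCT"]

freqs = haplotype_frequencies(observed)
afr_collective = collective_frequency(freqs, AFR_SIGNATURE)
```

In this toy sample nine of the ten haplotypes fall in the AFR signature set, giving a collective frequency of 0.9; the same tally applied to phased reference data would yield values such as those reported for Fig. 4.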
      Fig. 4. Haplotype frequency estimates for the 5-SNP X centromeric haplotype block (5-SXC) in 1000 Genomes AFR, EUR, SAS, EAS, admixed African and admixed American populations, plus CEPH AMR (includes 18 1KG PEL), CEPH OCE and Study Fijians (to allow comparison with the relatively small sample size for CEPH OCE). All three main population groups have 4–5 specific haplotypes shown boxed and with a collective haplotype frequency. However, SAS mainly comprises an equal combination of EUR and EAS haplotypes, with AATTT the most differentiated haplotype compared to other populations. Almost half of OCE 5-SXC haplotypes are CATTT, making this haplotype highly informative, although it is found at 3–4% frequencies in other populations. Eleven haplotypes are not charted as they were observed at frequencies less than 3% of total variation in any one population.
      The 1KG admixed African American populations indicate little or no disruption of the 5-SXC haplotype combinations in those individuals with admixed backgrounds, and it is possible to infer a sex bias (AFR females-EUR males) in the admixture profiles of the ACB and ASW, since very few AATTC EUR-specific haplotypes are observed in these samples, in line with the co-ancestry proportions estimated from PCA analyses shown in Table 4. Interestingly, the very similar total EUR-specific and AMR-specific haplotypes observed in the four admixed American populations of 1000 Genomes provides evidence of further sex bias (AMR females-EUR males), since EUR co-ancestry is the dominant component in CLM, and PUR, and about half of MXL autosomal SNP patterns (no formal Y-SNP analysis made). Although the 5-SXC haplotypes cannot be phased and therefore must be inferred, the real power of using a tightly linked combination of SNPs is their persistence as population-specific haplotypes in individuals that are likely to have had admixture events occurring in their family histories some time ago. Once again, males avoid phasing issues and allow comparison with patterns in all 16 X-SNPs, as well as the highly informative Y-SNP genotypes generated by ET.
      An important point to highlight is the genotyping performance in the ET assay of rs5937317 - the key EUR-informative 3′ bounding SNP of the 5-SXC haplotype. The below-average sequence coverage observed for this SNP in MPS analysis means it has a much higher genotyping no-call rate than the other 15 X-SNPs, even when applying a more relaxed sequence coverage threshold of a minimum 20 reads. We observed a 26% no-call rate for this SNP when analysing the VISAGE study population samples, which is certain to be higher with forensic DNA. In sharp contrast, the other 15 X-SNPs gave a single no-call from 3600 genotypes successfully obtained. Therefore, to maintain the significant ancestry-informativeness value of the X-SNP panel in ET, detailed above, either AmpliSeq MPS primer design adjustments will be necessary, or closely sited substitute SNPs with comparable patterns of EUR-informative allele distributions should be identified.
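The effect of a minimum coverage threshold on the no-call rate can be shown with a short sketch. The 20-read minimum is the relaxed threshold mentioned above; the read depths and the function name are invented for illustration and do not come from the study data.

```python
def genotype_call_summary(read_depths, min_reads=20):
    """Count calls and no-calls for one SNP from per-sample read depths."""
    total = len(read_depths)
    no_calls = sum(1 for depth in read_depths if depth < min_reads)
    return {
        "calls": total - no_calls,
        "no_calls": no_calls,
        "no_call_rate": no_calls / total if total else 0.0,
    }


# Invented per-sample read depths for a poorly covered SNP.
depths = [12, 45, 19, 20, 88, 7, 150, 33]
summary = genotype_call_summary(depths)
```

Here three of the eight invented samples fall below 20 reads, so they would be recorded as no-calls; a sample with exactly 20 reads still passes the threshold.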

      3.3 Characteristics of Microhaplotypes selected for ET and mixed DNA sequence analysis

      3.3.1 Patterns of variation in the 21 Microhaplotypes

      Supplementary Fig. S5 provides scaled positional genomic maps outlining the component SNPs in the haplotype structure of the 21 MHs. These maps also include low frequency SNPs that were reported by the Ion S5 genotyping software but not compiled when reconstructing the haplotypes of each locus. Our experience with developing MH loci for forensic MPS assays [de la Puente M., Ruiz-Ramírez M.J., Ambroa-Conde A., Xavier C., Amigo J., Casares de Cal M.A., Gómez-Tato A., Carracedo A., Parson W., Phillips C., Lareu M.V. Broadening the applicability of a custom multi-platform panel of Microhaplotypes: Bio-geographical ancestry inference and expanded reference data; Phillips C., McNevin D., Kidd K.K., Lagacé R., Wootton S., de la Puente M., Freire-Aradas A., Mosquera-Miguel A., Eduardoff M., Gross T.E., et al. MAPlex-A massively parallel sequencing ancestry analysis multiplex for Asia-Pacific populations] has been that low frequency SNPs, within a sequence segment whose common haplotypes are made up of much more polymorphic SNPs, create too much data complexity. These extra SNPs do not merit compilation, since their variation minimally changes the distribution of haplotype variability. In terms of ancestry analysis, certain SNP alleles do vary considerably between populations and create the ancestry-informativeness in the selected MHs, even if there is little or no variation recorded in some populations. The MH maps in Supplementary Fig. S5 are placed above bar plots summarising the population distribution of haplotype frequencies in the 21 MHs. Very rare haplotypes observed in one or two, or many populations, are marked in red or yellow to aid visibility. The underlying haplotype data in all populations, including the VISAGE Study populations, are compiled in Supplementary Table S2. We included both CEPH and Sanger Middle East population data in the bar plots for cross-comparison purposes, as the CEPH whole-genome-sequencing data was not phased, so allelic combinations need to be inferred from the common haplotypes of each MH in other Eurasian population data. Therefore, some discrepancies occur, but these are mainly lower frequency haplotypes, and the Sanger ME haplotype frequencies should be considered the most reliable for reference purposes.
      Although any review of MH haplotype frequency distributions is rather subjective, this is the first analysis of Middle East variation for these types of loci, so it is useful to discern the overall patterns of variation in the data. In line with previous observations [de la Puente M. et al. Broadening the applicability of a custom multi-platform panel of Microhaplotypes: Bio-geographical ancestry inference and expanded reference data; Phillips C. et al. MAPlex-A massively parallel sequencing ancestry analysis multiplex for Asia-Pacific populations], AFR haplotype frequencies show consistently higher levels of variation compared to other populations. The opposite characteristic applies to OCE levels of polymorphism in these loci, where fewer haplotypes are observed and one or two haplotypes predominate in many of the ET MHs, e.g., in 8pB the CATCA haplotype alone accounts for almost 85% of the total variation in Oceanians. The AME haplotype frequencies also represent lower levels of polymorphism, but only for certain MHs (notably 1pD, 10pB, 15qD, and the TCT haplotype in MH03, at >75% frequency). In general, neither the CEPH nor Sanger ME haplotype distributions are very differentiated from EUR populations, and with the exception of 3qC, the same low level of differentiation applies to SAS vs EUR. The development of ancestry predictive systems that use single-site SNP genotypes in combination with haplotypes from a series of MHs has not been completed in either the VISAGE interpretative software or the research laboratories supporting the development of the VISAGE ET assay. Therefore, the MH data generated by ET has not been integrated in order to enhance the autosomal BGA SNP data forming the core of most forensic ancestry prediction systems. Despite the complications of using mixed marker datasets for Bayes analysis, MH data is suitable for inclusion in SNP-based analysis runs with STRUCTURE. However, we did not formally test the addition of the 21 MH loci to the 104 autosomal SNPs using STRUCTURE, although experience indicates no improvement is seen in the differentiation of the inferred genetic clusters from this algorithm when marker data is extended in this way.

      3.3.2 Pilot studies to evaluate ancestry-based deconvolution of simple mixtures using Microhaplotypes

      Fig. 5 summarises the findings of the pilot study of ancestry-based mixed DNA deconvolution made on the 2-way mixture constructed from EUR and AFR Coriell control DNAs. First, the STRUCTURE analysis of the standard reference populations of YRI, CEU and CHB indicates well differentiated genetic clusters for each population, despite being based on data from just 21 MH loci (Fig. 5A). Once it was confirmed that analysis of representative populations from the three main continental population groups was sufficiently informative, the cluster membership proportions for the two control DNAs were obtained and showed that STRUCTURE analysis of the haplotypes inferred from sequence read ratios would provide the means to infer the ancestry of each haplotype detected. Second, the volunteer scrutineers tasked with manually recognising the mixture component haplotypes produced inferences that were compiled as 21 consensus MH profiles for the 1:1, 3:1, 9:1 mixture ratios. When a scrutineer inferred an MH profile with more haplotypes than the others, the consensus profile defaulted to the fewest haplotypes and was therefore a conservative estimate. It was not possible to reliably deconvolute the 1:1 mixture, as sequence read ratios from both contributors were mostly closely matched. As the mixture ratio became more skewed, it was easier to distinguish the major and minor contributors as indicated by Fig. 5B, where 81% of major contributor haplotypes and 55% of minor contributor haplotypes could be inferred in the 3:1 mixture. These values improved to 100% and 64% respectively in the 9:1 ratio, underlining the fact that more accentuated mixture ratios are easier to deconvolute in this way.
      There is a very slight indication of incomplete profile reconstruction in the STRUCTURE plots for the minor contributor of the 3:1 ratio, with an increase in the negligible EAS co-ancestry proportion, but the sequences detected from this mixture component would be unequivocally inferred to show AFR ancestry. Typical sequence coverage data (numbers of reads) are shown for example locus MH22 in Fig. 5C, indicating the difference in read ratios can reflect the mixture ratio to a large extent, and it is clear that the CAA/CTA haplotypes belong to the minor contributor in both 3:1 and 9:1 mixture ratios. However, these sequence coverage readings also show that balanced mixtures do not necessarily lead to balanced reads between the component haplotypes detected. The full set of plots of sequence coverage of the identified haplotypes in the 1:1, 1:3 and 1:9 mixture ratios are given in Supplementary Fig. S6.
      Fig. 5. Pilot study to evaluate the ability of the 21 Microhaplotypes of ET to detect mixed DNA and identify the ancestry of contributors in simple 2-way mixtures. 5A: STRUCTURE analysis of AFR, EUR, EAS reference populations indicates the 21 MHs differentiate these populations efficiently and MH profiles comprising haplotypes deconvoluted from a mixture can be included to infer their likely ancestry. 5B: Percentage of component SNP alleles called for MH profiles identified by a panel of scrutineers of the MPS data for 3:1 and 9:1 mixture ratios (1:1 was not successfully deconvoluted and is not shown). STRUCTURE cluster plots for the deconvoluted MH profiles shown right. 5C: Example sequence coverage output for each identified haplotype in MH22. All three mixture ratios allow ‘pairing’ of two haplotypes per contributor, but this was not possible for the 1:1 mixture ratio in other MH loci or when fewer than four haplotypes are present.
      In the 1:1 mixture ratio, Supplementary Fig. S6 shows almost all MHs have more than 2 haplotypes (12 with 3 haplotypes, five with 4 haplotypes), and only MH18 and MH21 have the same level of sequence coverage for each of two haplotypes identified in these loci (MH locus 7pB has two haplotypes with a very skewed coverage ratio). These data emphasise the power of Microhaplotypes to detect mixtures even when full deconvolution is not feasible because individual haplotype combinations cannot be inferred from the sequence coverage ratios. Therefore, this pilot study using the 21 MH loci of ET suggests an efficient system for detecting mixed DNA can be applied independently of the single-site SNP data and can alert the user to the presence of a mixture. When two contributors are present in the mixed DNA at unequal proportions there is a good opportunity to identify individual haplotype pairs from each contributor, and if they have contrasting ancestries amongst African, European, or East Asian populations-of-origin, there is the ability to use STRUCTURE to identify these ancestries and assign them to both contributors.
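The two steps described above, detecting a mixture from the number of haplotypes at a locus and pairing haplotypes into major and minor contributors from read depth, can be sketched as follows. In the pilot study the pairing was done manually by scrutineers, so the numerical rules here (a 20-read minimum and a two-fold depth gap between contributors) are assumptions for illustration. The read counts and the two major-contributor haplotype labels are invented; CAA and CTA are the MH22 minor-contributor haplotypes named in the text.

```python
def detect_mixture(hap_reads, min_reads=20):
    """More than two supported haplotypes at one locus signals a mixture."""
    supported = [h for h, n in hap_reads.items() if n >= min_reads]
    return len(supported) > 2


def pair_contributors(hap_reads, min_ratio=2.0):
    """Split exactly four haplotypes into major and minor pairs by read depth.

    Returns None when the locus does not have four haplotypes or when the
    depths of the two pairs are too close to separate (assumed rule).
    """
    if len(hap_reads) != 4:
        return None
    ranked = sorted(hap_reads, key=hap_reads.get, reverse=True)
    major, minor = ranked[:2], ranked[2:]
    # The weakest major haplotype must clearly exceed the strongest minor one.
    if hap_reads[major[1]] < min_ratio * hap_reads[minor[0]]:
        return None
    return {"major": sorted(major), "minor": sorted(minor)}


# Invented read counts; "TAG" and "TTG" are hypothetical haplotype labels.
skewed = {"TAG": 900, "TTG": 850, "CAA": 300, "CTA": 280}
balanced = {"TAG": 500, "TTG": 450, "CAA": 480, "CTA": 520}

skewed_pairs = pair_contributors(skewed)
balanced_pairs = pair_contributors(balanced)
```

Both invented loci flag a mixture, but only the skewed one can be split into contributor pairs, which mirrors the observation that the 1:1 mixture could be detected but not reliably deconvoluted.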
      Beyond MH haplotype analysis, bi-allelic SNPs have limited capabilities for mixture deconvolution from MPS sequence data [Bleka Ø., Eduardoff M., Santos C., Phillips C., Parson W., Gill P. Open source software EuroForMix can be used to analyse complex SNP mixtures] and, given the power of multiple-haplotype MH loci to detect simple mixed DNA components, we would advocate discounting single-site SNP data when such mixtures are detected. In contrast, tri-allelic SNPs can detect simple mixtures more efficiently from the detection of three different alleles in the sequence read data for each nucleotide. Compared to the use of MHs to deconvolute mixtures as described above, tri-allelic SNPs have much more limited power, but can add detail to the observations based on the MH sequence data.
      The parallel study of this paper, describing the development and inter-laboratory evaluation of the VISAGE ET MPS assay [
      • Xavier C.
      • de la Puente M.
      • Mosquera-Miguel M.
      • Freire-Aradas A.
      • Kalamara V.
      • Revoir A.
      • Gross T.E.
      • Schneider P.M.
      • Ames C.
      • Hohoff C.
      • et al.
      Development and inter-laboratory evaluation of the VISAGE Enhanced Tool for appearance and ancestry inference from DNA.
      ], examined the tri-allelic data from the same NA07000-NA18498 mixture series. We briefly summarise these findings below, which added evaluation of increased heterozygosity, and skews in the sequence coverage of the detected alleles of each tri-allelic SNP, in addition to recording the presence of three alleles. The level of tri-allelic SNP heterozygosity increased from ∼25% for the contributors to more than 56% in the 1:1 and 1:3 mixture ratios, falling to lower levels in the 1:9 mixture ratio. The observed skews in the sequence coverage of the alleles of the tri-allelic SNPs were close to expectations in the 1:3 and 1:9 ratios, suggesting tri-allelic SNPs are informative markers for the analysis of mixed DNA from MPS data beyond the simple sequence coverage skews occurring with bi-allelic SNPs. Finally, three-allele patterns were found in 11 of the 26 tri-allelic SNPs of ET, which matches the expected number from examination of the contributor SNP genotypes.
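      The expected coverage skews referred to above follow from simple arithmetic on the contributor genotypes and the mixture ratio. The following sketch shows that calculation for one tri-allelic SNP; the genotypes are hypothetical, and equal amplification and sequencing efficiency of all alleles is assumed.

```python
from collections import Counter
from typing import Dict, Tuple

def expected_allele_fractions(genotype_a: Tuple[str, str],
                              genotype_b: Tuple[str, str],
                              parts_a: float,
                              parts_b: float) -> Dict[str, float]:
    """Expected read fraction per allele in a two-person mixture.

    Each contributor donates two allele copies, weighted by that
    contributor's share of the total DNA (assumes no allele-specific bias).
    """
    total = parts_a + parts_b
    fractions = Counter()
    for allele in genotype_a:
        fractions[allele] += parts_a / total / 2
    for allele in genotype_b:
        fractions[allele] += parts_b / total / 2
    return dict(fractions)

# Hypothetical genotypes A/C and C/T mixed at 3:1
fractions_3_to_1 = expected_allele_fractions(("A", "C"), ("C", "T"), 3, 1)
print(fractions_3_to_1)  # A 0.375, C 0.5, T 0.125
```

      With these genotypes all three alleles are seen, and the allele unique to the minor contributor falls to 5% of reads at a 9:1 ratio, which illustrates why three-allele patterns become harder to detect as the mixture becomes more unbalanced.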

      3.4 Y-SNP genotypes in VISAGE Study populations

      Although Y-SNP data was not compiled from the main human genome variant datasets because of incomplete data, all Y-SNP genotypes obtained for the VISAGE Study population males have been compiled and are listed in Supplementary Table S3. The 5-SXC haplotypes are also listed alongside the haplogroup manually inferred from the Y-SNP alleles in each sample and the description of the region where that haplogroup is most commonly observed.
      Although a thorough analysis of the distribution of Y variability in the East African, Middle East and Fijian Study populations was not made, we decided the Brazilian samples would provide an informative pilot study for the comparison of X- and Y-SNP data in two populations likely to have different admixture histories. The Brazilian rural sample is of Kalungas who descended from escaped slaves and have lived in remote settlements in Goiás State for about 250 years. In contrast, the Brazilian urban sample is from Brasília, the capital of Brazil. Compilation of the X- and Y-SNP based matrilineal and patrilineal ancestries is summarised in Supplementary Fig. S7. These data revealed a noticeable European male – African female sex-biased admixture ratio in both population samples, which was much more marked in the rural Kalungas. The urban Brazilian males had 6% African Y haplogroups and 45% African-specific 5-SXC haplotypes, while the rural Brazilian males had 39% African Y haplogroups and 92% African-specific 5-SXC haplotypes. European Y haplogroups were found in 94% of urban Brazilian males (19% were designated as ‘ME-EUR’ with a haplogroup distribution including East European, Caucasus and Middle East regions), alongside 18% European-specific 5-SXC haplotypes, while the rural Kalungas had 67% European Y haplogroups and 8% European-specific 5-SXC haplotypes. An interesting point of comparison with the X-Y data is an independent study of the same Kalunga samples using 46 autosomal ancestry-informative Indels, by Carvalho Gontijo et al., in 2018 [
      • Carvalho Gontijo C.
      • Macêdo Mendes F.
      • Santos C.A.
      • de M.
      • Klautau-Guimarães N.
      • Lareu M.V.
      • Carracedo A.
      • Phillips C.
      • Oliveira S.F.
      Ancestry analysis in rural Brazilian populations of African descent.
      ]. Carvalho Gontijo’s study detected ∼68% African and ∼25% European co-ancestry proportions, with the other 7% American (plus a marginal East Asian proportion). These proportions are broadly positioned between the two contributor population ratios with a significant European male – African female sex bias, as we indicated with the X- and Y-SNP data for the Kalungas.
      The X ancestry profile of the urban Brazilians showed equal 18% proportions of EUR, EAS and AMR-specific 5-SXC haplotypes. Therefore, a discernible European male sex bias exists in both samples, but is particularly strong in the isolated rural sample, where almost all the observed matrilineal X haplotypes are African-specific and two thirds of the patrilineal Y haplogroups have their most common distribution in Europe. Although this simply represents an initial exploration of data where X- and Y-SNP genotypes can be compared, it suggests a forensic ancestry test that combines each gonosomal marker set will have a degree of power to analyse patrilineal and matrilineal patterns in persons with admixed backgrounds. This is encouraging, given the early decision by VISAGE not to pursue mtDNA analysis as part of the ET assay.

      3.5 STRUCTURE analysis of ET BGA SNP data

      3.5.1 Worldwide population structure patterns inferred from the autosomal BGA SNPs of ET

      We rely on STRUCTURE to analyse the ancestry of unknown donors in forensic DNA tests for several reasons: i. we have found it to be the most effective way to detect and analyse co-ancestry patterns in individuals with admixed backgrounds; ii. although not part of our studies here, STRUCTURE can combine and analyse variant data from different types of genomic marker, so MH loci, STRs, SNPs and Indels can be analysed together in a single run; iii. the availability of detailed genotype data from whole-genome-sequence variant datasets allows a wide range of reference populations to be compiled for almost any marker set, and nearly all SNPs identified in the human genome to date. Combining unknown forensic sample data marked as POPFLAG= 0, with size-adjusted reference data (approximately 100 samples per reference population) marked as POPFLAG= 1, provides an effective way to examine the likely ancestry of the unknown samples. A common problem with the use of STRUCTURE is overfitting the data to a number of inferred genetic clusters (K) greater than the actual clusters that can be properly discerned with the markers used. Since forensic BGA marker sets are limited in number to preserve assay sensitivity, the initial analysis of samples of unknown ancestry with STRUCTURE requires a cautious exploration of each K value, generally from K:2 to K:8. Our experience has indicated that data overfitting - when too many clusters are inferred and individual population groups begin to show irregular within-population cluster membership proportions - can occur after K:5, analysing a continentally-based reference population set of AFR, EUR, EAS, AMR, OCE that includes the less well differentiated population groups of ME and SAS. To counter these effects and to provide the optimum differentiation of genetic clusters, we have adopted a ‘nested’ approach to STRUCTURE analyses that runs a five-continent reference set with the unknown sample(s) set at K:5 expected clusters. 
Depending on the cluster membership patterns found in the POPFLAG= 0 samples, another K:5 run analyses the samples with a Eurasian sub-continental reference set of AFR, EUR, ME, SAS, EAS. We have found this improves the cluster patterns detected in admixed samples, which are predominantly from the Americas and therefore show co-ancestry proportions in varying degrees from AFR, EUR and/or AMR contributing populations. One problem can be the detection of SAS co-ancestry in the second Eurasian-centred STRUCTURE run, and in such cases the initial run’s reference population data can be adjusted for K:5 expected clusters by swapping out the OCE populations. One example of when the exploratory STRUCTURE runs can require adjustment depending on the results of both analyses, is the 1KG sample HG01880 shown in Fig. 3, with ∼30% SAS co-ancestry detected from 1000 Genomes’ own genetic structure analyses [
      • The 1000 Genomes Project Consortium
      • Auton A.
      • Brooks L.D.
      • Durbin R.M.
      • Garrison E.P.
      • Kang H.M.
      • Korbel J.O.
      • Marchini J.L.
      • McCarthy S.
      • McVean G.A.
      • et al.
      A global reference for human genetic variation.
      ]. Because we do not include SAS reference data in the first STRUCTURE run this would go undetected until the Eurasian sub-continental reference data run was completed, and a new run made with OCE reference genotypes swapped out for SAS.
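      As an illustration of how reference and unknown samples are combined in one input file, the sketch below writes genotype rows in the two-rows-per-individual layout that STRUCTURE accepts, with label, population identifier and POPFLAG columns ahead of the allele data. The sample names, the integer allele coding (A=1, C=2, G=3, T=4), the placeholder population identifier for the unknown sample and the use of -9 for missing data are assumptions for illustration; they must match the settings in the parameter files of an actual run.

```python
from typing import List, Optional, Tuple

Genotype = Optional[Tuple[int, int]]  # None marks a missing genotype

def structure_rows(sample_id: str, pop_id: int, pop_flag: int,
                   genotypes: List[Genotype], missing: int = -9) -> List[str]:
    """Build the two rows (one per allele copy) for one diploid sample.

    pop_flag=1 marks a reference sample whose population is used as prior
    information; pop_flag=0 marks an unknown sample to be estimated.
    """
    rows = []
    for copy in (0, 1):
        alleles = [missing if g is None else g[copy] for g in genotypes]
        fields = [sample_id, pop_id, pop_flag, *alleles]
        rows.append(" ".join(str(f) for f in fields))
    return rows

# Hypothetical reference sample (population 1) with one missing SNP
reference = structure_rows("REF_AFR_001", 1, 1, [(1, 1), (2, 4), None])
# Hypothetical unknown sample; population 6 is only a placeholder
unknown = structure_rows("UNKNOWN_01", 6, 0, [(1, 3), (2, 2), (4, 4)])
print("\n".join(reference + unknown))
```

      For a nested analysis of the kind described above, the same unknown rows would be written into a second file alongside a different set of reference rows (for example, with OCE reference samples swapped for SAS).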
      Applying the Continental K:5 - Eurasian Sub-Continental K:5 nested approach described above to the full range of 1KG, CEPH, Sanger ME and VISAGE Study populations produced a generally robust identification of the majority cluster membership proportions in each sample. The minority cluster membership patterns in almost all samples produced a coherent pattern which matched the geographic location of the populations analysed, particularly those from the Middle East regions. When performing these STRUCTURE analyses, we consistently observed well differentiated genetic cluster patterns at K:6 in the Eurasian Sub-Continental runs when the CEPH Mozabite Algerian samples were included as a sixth population reference set marked as ‘North African’ (NAF) POPFLAG=1. For this reason, we show the K:6 patterns generated using six reference populations that include distinct NAF and ME reference datasets (ME comprising the three Israeli Arab populations of Bedouin, Palestinian and Druze). Fig. 6 displays, in approximate geographic locations-of-sampling, the STRUCTURE cluster plot segments for the populations from each dataset that show detectable and varying degrees of co-ancestry. The cluster plots are generally arranged in descending order of major co-ancestry components and have been expanded two-fold to show individual cluster patterns more clearly. The HGDP-CEPH Sardinian, Tuscan, Adygei EUR populations and the Pakistani SAS populations showed some co-ancestry patterns but are excluded for clarity. All other populations not shown in Fig. 6 had single cluster membership patterns matching those reported in numerous studies of the same samples using several forensic BGA SNP sets [
      • Phillips C.
      Forensic genetic analysis of bio-geographical ancestry.
      ,
      • de la Puente M.
      • Ruiz-Ramírez J.
      • Ambroa-Conde A.
      • Xavier C.
      • Pardo-Seco J.
      • Álvarez-Dios J.
      • Freire-Aradas A.
      • Mosquera-Miguel A.
      • Gross T.E.
      • Cheung E.Y.Y.
      • et al.
      Development and evaluation of the ancestry informative marker panel of the VISAGE basic tool.
      ,
      • Phillips C.
      • Parson W.
      • Lundsberg B.
      • Santos C.
      • Freire-Aradas A.
      • Torres M.
      • Eduardoff M.
      • Børsting C.
      • Johansen P.
      • Fondevila M.
      • et al.
      Building a forensic ancestry panel from the ground up: the EUROFORGEN Global AIM-SNP set.
      ,
      • Galanter J.M.
      • Fernandez-Lopez J.C.
      • Gignoux C.R.
      • Barnholtz-Sloan J.
      • Fernandez-Rozadilla C.
      • Via M.
      • Hidalgo-Miranda A.
      • Contreras A.V.
      • Figueroa L.U.
      • Raska P.
      • et al.
      Development of a panel of genome-wide ancestry informative markers to study admixture throughout the Americas.
      ,
      • Phillips C.
      • Freire Aradas A.
      • Kriegel A.K.
      • Fondevila M.
      • Bulbul O.
      • Santos C.
      • Serrulla Rech F.
      • Perez Carceles M.D.
      • Carracedo A.
      • Schneider P.M.
      • Lareu M.V.
      Eurasiaplex: a forensic SNP assay for differentiating European and South Asian ancestries.
      ,
      • Santos C.
      • Phillips C.
      • Fondevila M.
      • Daniel R.
      • van Oorschot R.A.H.
      • Burchard E.G.
      • Schanfield M.S.
      • Souto L.J.
      • Uacyisrael J.
      • Via M.
      • et al.
      Pacifiplex: An ancestry-informative SNP panel centred on Australia and the Pacific region.
      ,
      • Carvalho Gontijo C.
      • Porras-Hurtado L.G.
      • Freire-Aradas A.
      • Fondevila M.
      • Santos C.
      • Salas A.
      • Henao J.
      • Isaza C.
      • Beltrán L.
      • Nogueira Silbiger V.
      • et al.
      PIMA: A population informative multiplex for the Americas.
      ,
      • The 1000 Genomes Project Consortium
      • Auton A.
      • Brooks L.D.
      • Durbin R.M.
      • Garrison E.P.
      • Kang H.M.
      • Korbel J.O.
      • Marchini J.L.
      • McCarthy S.
      • McVean G.A.
      • et al.
      A global reference for human genetic variation.
      ]. Therefore, we concentrated on results from admixed 1KG samples and the three VISAGE Study populations outside of Eurasia analysed with the initial K:5 STRUCTURE run; and the nine Sanger ME plus four VISAGE Study populations from North African, East African or Middle East regions, analysed with the Eurasian K:6 STRUCTURE run, which included a sixth NAF reference dataset. These runs analysed the 104 autosomal BGA SNPs in ET, comprising bi-allelic and tri-allelic loci. The average cluster membership proportions for the initial K:5 STRUCTURE run and the Eurasian K:6 STRUCTURE run for all 1KG, CEPH, Sanger ME and Study population samples included in each analysis, plus the corresponding segmented cluster plots from this data, are listed in full in Supplementary Tables S4A and S4B for Continental and Eurasian datasets, respectively.
      Fig. 6. Cluster plots of STRUCTURE analyses of selected 1KG, CEPH, Sanger ME and Study populations. Nested STRUCTURE analyses consisted of first stage K:5 runs using Five-Continental reference population datasets (POPFLAG=1) comprising 1KG AFR (YRI); EUR (CEU); EAS (CHB); 2 CEPH OCE populations; 5 CEPH AMR populations plus a subset of 1KG PEL with no non-AMR co-ancestry. Populations studied (POPFLAG=0) are shown left and right of the central group of populations, comprising six 1KG admixed African and American populations; 67 PEL with detected non-AMR co-ancestry; Study Brazilian rural and urban populations; two CEPH East Asian populations with co-ancestry from other populations; Study Fijians. The central group of Middle East region populations was analysed with the second stage K:6 runs using Eurasian Sub-Continental reference population datasets, comprising 1KG YRI; CEPH Algerian Mozabite; 3 CEPH Israeli Arab populations; 1KG CEU; 1KG SAS (GIH); 1KG CHB. Populations tested were five VISAGE Study populations and nine Sanger ME populations (Emirati A-D and Saudi A-B are arranged separately but not located to a specific region). The three samples in ASW and ACB with the highest levels of non-AFR co-ancestry are shown on the right as expanded columns.
      We first review the Five-Continental K:5 reference and study population cluster plots. The five admixed 1KG population cluster patterns shown top left in Fig. 6, plus the 67/85 admixed PEL, are discussed in the next section. In the reference population cluster plot the inability to match the numbers of Oceanian reference samples to those of the other populations is evident, so a degree of bias may have occurred in identifying and quantifying the OCE cluster membership proportions when detected as co-ancestry components in Study Fijians and CEPH Cambodians. Nevertheless, the large-scale reduction of OCE-informative SNPs from 23 in BT to 3 in ET has not affected the ability of the ET BGA SNPs to differentiate this population group. In fact, the first OCE sample set of Papua New Guinea is distinguishable from the second of Bougainvillean samples, with the detectable presence of EAS co-ancestry in the latter. One other cluster pattern to highlight is the 5th AMR sample set comprising CEPH Maya, which shows EUR co-ancestry at a higher level than the other CEPH AMR sample sets (set 2 = Karitiana; 3 = Surui; 4 = Colombians; 5 = Maya; 6 = Pima). This represents a close match to patterns obtained from the two landmark studies of the HGDP-CEPH diversity panel, using larger marker sets (See Fig. 1 of [
      • Rosenberg N.A.
      • Pritchard J.K.
      • Weber J.L.
      • Cann H.M.
      • Kidd K.K.
      • Zhivotovsky L.A.
      • Feldman M.W.
      Genetic structure of human populations.
      ], and Fig. 1 of [
      • Li J.Z.
      • Absher D.M.
      • Tang H.
      • Southwick A.M.
      • Casto A.M.
      • Ramachandran S.
      • Cann H.M.
      • Barsh G.S.
      • Feldman M.
      • Cavalli-Sforza L.L.
      • Myers R.M.
      Worldwide human relationships inferred from genome-wide patterns of variation.
      ]). The Study Fijian plot indicates most samples would be identified as having OCE origin but note five of the rightmost columns are self-declared Indo-Fijians likely to have SAS co-ancestry, which would be undetected with this reference population data absent from the Continental STRUCTURE run, but present in the Eurasian STRUCTURE run. This exemplifies the need to adjust reference data according to both STRUCTURE analyses (Fijian cluster plots from Eurasian STRUCTURE analysis runs not shown). Lastly, the two Study Brazilian sample cluster plots illustrate the contrast in admixture patterns between them. The rural Brazilian sample has predominant AFR co-ancestry (apart from the rightmost two individuals), contrasting with urban Brazilians, who show predominant EUR co-ancestry, apart from the two rightmost individuals. The two Brazilian samples inferred to have AMR X chromosomes (Fig. 3B) showed 3% (rural, K113) and 10% (urban, BSB228) AMR co-ancestry in this analysis.
      The Eurasian sub-Continental K:6 reference and study cluster plots illustrate the successful differentiation of NAF and ME populations, although this was based on the single CEPH Algerian reference population, which could lead to biased analysis due to possible stratification of SNP variation in a population not necessarily representative of variability across a wider region. Therefore, the cluster patterns detected in the Study Moroccan sample are particularly relevant. Study Moroccans show a broad range of NAF co-ancestry proportions from 5–95% in two-thirds of samples, with the majority of Moroccans showing slightly higher proportions of ME co-ancestry than NAF, apart from two samples with AFR-EUR co-ancestry, and two with ME-EUR co-ancestry. AFR co-ancestry is detectable in 9 of the 27 Algerian reference samples, with majority AFR co-ancestry proportions in three. The other twelve Middle East region populations provide cluster patterns well matched to their locations. It is not possible to identify the Emirati A-D populations, but these appear to show a progression in SAS co-ancestry proportions in at least half of samples from C and D. The other Sanger ME populations show predominant ME cluster membership proportions in almost all samples, so would be distinguishable from a European individual apart from (rightmost) Turkish and Syrian samples, which retain a detectable ME co-ancestry. Considering the Middle East population sample set as a whole, a consistent geographic pattern is evident for the majority of samples in each population. This comprises i. a strong presence of the red NAF genetic cluster in half of samples from the Northwest corner of this region, which is shared with the grey ME cluster; ii. a detectable co-ancestry presence of the blue EUR cluster in about a third of samples in the North or Northeast corner, with a predominant ME co-ancestry in these samples from Turkey, Syria, and Iraq (plus minor SAS co-ancestry in most samples); iii. 
two East African sample sets with equal proportions of AFR and NAF-ME cluster memberships, in patterns which are generally distinct from the other ME populations; iv. a predominant ME cluster, mostly >90%, in a majority of samples from populations around the Saudi Arabian Peninsula, comprising nearly all Yemeni, Saudi A and B, Emirati A and B, and half of the Iraqis and Syrians. Therefore, using a second STRUCTURE analysis with six reference populations, it is possible to identify ME co-ancestry in the majority of ‘unknown’ test samples in this study, with a NAF co-ancestry signal detected in half of Moroccans. As a rule of thumb, the presence of AFR and ME, and/or NAF joint cluster memberships suggests a pattern characteristic of East African ancestry. The presence of 15–25% ME co-ancestry membership proportions in 4/99 CEU reference samples suggests a conservative approach would be to infer Middle East ancestry using a threshold of 20–25% or higher ME and/or NAF co-ancestry proportions. Note that this would identify most of the Study Turkish samples as having distinct patterns compared to Europeans. Even applying a stringent threshold of 25% minimum ME/NAF membership proportions to signify Middle East ancestry, rates of non-inference are low amongst these test populations. The two East African populations would have 2% non-inference; Emirati 12%; Moroccans 3%; Iraqis 10%; Turkish 18%, with secure ME inferences possible for all Syrian, Saudi Arabian and Yemeni samples.
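      The threshold rule described above can be expressed as a short sketch. Summing the ME and NAF membership proportions before comparing them to the threshold is one reading of ‘ME and/or NAF’ and is an assumption here, as are the example membership values; the 25% default corresponds to the stringent threshold quoted in the text.

```python
from typing import Dict, List

def infer_middle_east(memberships: Dict[str, float],
                      threshold: float = 0.25) -> bool:
    """True if combined ME + NAF cluster membership reaches the threshold.

    Combining the two clusters by simple addition is an assumption.
    """
    combined = memberships.get("ME", 0.0) + memberships.get("NAF", 0.0)
    return combined >= threshold

def non_inference_rate(samples: List[Dict[str, float]],
                       threshold: float = 0.25) -> float:
    """Fraction of samples for which Middle East ancestry is not inferred."""
    missed = sum(1 for s in samples if not infer_middle_east(s, threshold))
    return missed / len(samples)

# Hypothetical cluster membership proportions for four test samples
test_samples = [
    {"ME": 0.92, "EUR": 0.05, "SAS": 0.03},
    {"ME": 0.15, "NAF": 0.20, "AFR": 0.65},
    {"ME": 0.18, "EUR": 0.82},
    {"ME": 0.55, "NAF": 0.40, "EUR": 0.05},
]
calls = [infer_middle_east(s) for s in test_samples]
rate = non_inference_rate(test_samples)
print(calls, rate)
```

      Lowering the threshold reduces non-inference but increases the chance that a European sample with low-level ME membership, such as the four CEU reference samples noted above, would be misclassified.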

      3.5.2 Analysing co-ancestry in admixed population samples with STRUCTURE

      In a criminal investigation, a forensic ancestry test that can reliably identify co-ancestry in a person with an admixed background would, in such cases, provide important information about the likely appearance of a suspect. When previously evaluating the ability of the VISAGE BT ancestry panel to detect admixture and estimate the co-ancestry proportions in such a sample, we made a formal comparison between the cluster membership patterns from analysing the same 504 1KG admixed samples with 572,000 Human Origins array SNPs vs the 115 BGA SNPs of BT. With the BGA SNPs of ET we did a similar comparison of the same samples but used the co-ancestry proportions estimated from genome-wide SNP data published by 1000 Genomes [
      • The 1000 Genomes Project Consortium
      • Auton A.
      • Brooks L.D.
      • Durbin R.M.
      • Garrison E.P.
      • Kang H.M.
      • Korbel J.O.
      • Marchini J.L.
      • McCarthy S.
      • McVean G.A.
      • et al.
      A global reference for human genetic variation.
      ]. Supplementary Figs S8A-S8D show the cluster plots from both analyses with the sample order dictated by the 1KG data arranged by descending majority co-ancestry membership proportions in each population. These plots show the complete 1KG sample set in Supplementary Fig. S8A, followed by expanded plots for admixed Africans ACB, ASW in Supplementary Fig. S8B, and admixed Americans CLM, PEL, PUR, MXL in Supplementary Fig. S8C. Supplementary Fig. S8D shows the correlation analyses and r² values used to gauge the levels of correlation between the co-ancestry proportion estimates made with each SNP set, combining AFR and AMR co-ancestry proportions into a single value and comparing EUR co-ancestry proportion estimates directly.
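      The comparison described above, in which AFR and AMR proportions are combined into a single value before correlating the two sets of estimates, can be sketched as follows. The per-sample proportions are hypothetical, and the squared Pearson correlation is assumed to be the r² statistic intended.

```python
from typing import Dict, List, Sequence

def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

def combined_afr_amr(samples: List[Dict[str, float]]) -> List[float]:
    """Sum AFR and AMR co-ancestry proportions for each sample."""
    return [s.get("AFR", 0.0) + s.get("AMR", 0.0) for s in samples]

# Hypothetical estimates for the same three samples from two SNP sets
genome_wide = [
    {"AFR": 0.80, "AMR": 0.02, "EUR": 0.18},
    {"AFR": 0.10, "AMR": 0.45, "EUR": 0.45},
    {"AFR": 0.05, "AMR": 0.15, "EUR": 0.80},
]
panel = [
    {"AFR": 0.76, "AMR": 0.06, "EUR": 0.18},
    {"AFR": 0.12, "AMR": 0.38, "EUR": 0.50},
    {"AFR": 0.08, "AMR": 0.17, "EUR": 0.75},
]
r2_combined = r_squared(combined_afr_amr(genome_wide), combined_afr_amr(panel))
r2_eur = r_squared([s["EUR"] for s in genome_wide], [s["EUR"] for s in panel])
print(round(r2_combined, 3), round(r2_eur, 3))
```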
      Several factors are evident from a review of the correlation values and STRUCTURE cluster plots produced by ET BGA SNP analyses. First, there is a good match between both SNP sets in the estimates of majority co-ancestry across all samples and populations, particularly when this is above 90%. Consequently, r² values are highest for comparisons of AFR co-ancestry proportion estimates in ACB and ASW, and those for EUR in CLM and MXL. Closely matched cluster plot patterns and correlation values are also seen in PEL, although combining AFR-AMR cluster proportion estimates to simplify analysis reduces these correlation values. Lastly, the three co-ancestry outliers (rightmost columns) in ACB and ASW, which are also highlighted in Fig. 4 and 8, present good cluster plot matches, with the SAS co-ancestry proportion recognised by the ET BGA SNPs when this population reference dataset is included as POPFLAG=1 genotypes. The three ASW outliers indicate an overestimation of AMR cluster proportions with ET BGA SNPs, and the same marginal but consistent effect is seen in PEL and MXL cluster plot patterns. The worst correlation and cluster plot matches are observed in the PUR comparisons. This appears to stem from a higher level of three-way admixture in this population, although CLM have similar admixture patterns, but produce much better correlation values of r² = 0.758 for the combined AFR/AMR co-ancestry proportion estimates, compared to r² = 0.334 in PUR. Much of the AMR co-ancestry estimation in PUR is eroded by many samples with EAS and SAS co-ancestry proportions, and it might be beneficial to consider a K:3 STRUCTURE analysis with AFR, EUR and AMR reference datasets, when three components of admixture are identified in unknown samples and two are either EUR and AMR, or EUR and AFR. A review of the cluster plot for PUR in Fig. 6 indicates this population is generally problematic to analyse for co-ancestry and a significant proportion of samples have low-level EAS and OCE co-ancestry proportions, when the 1KG data suggests these should be recognised as AMR co-ancestry. It is noteworthy that similar studies of the BGA SNPs in BT gave the lowest r² value for the PUR combined AFR/AMR co-ancestry proportion estimates of 0.446.
      Overall, genetic cluster differentiations become less reliable in individuals with three different co-ancestry components, so STRUCTURE analyses of populations such as Brazil must be approached with caution. Three-way admixture continues to present a considerable challenge for STRUCTURE-based analysis of co-ancestry patterns when using ancestry tests on a much smaller scale than those used for population genetics studies. Therefore, a prudent measure is to explore a series of K:3 runs with different combinations of reference population datasets. Although we do not present further autosomal SNP analysis data for Fijians, this population would be optimally analysed with EUR, SAS, EAS and OCE reference data in various combinations.

      3.5.3 Comparisons of STRUCTURE analyses using 104 BGA SNPs vs combined 104 BGA plus 184 autosomal EVC SNPs

      As three EVC-SNPs have been shared for pigmentation trait prediction and ancestry analysis purposes in both VISAGE SNP genotyping assays, it was considered worthwhile to formally evaluate the effect of combining all 104 autosomal BGA SNPs in ET with the 184 autosomal EVC-SNPs. The same Continental K:5 - Eurasian Sub-Continental K:6 nested analysis was made of five and six population reference datasets, respectively, as described in Section 3.5.1. Supplementary Fig. S9 shows the K:5 and K:6 cluster plots for both SNP sets, with accompanying Evanno charts of DeltaK and L(K) [
      • Evanno G.
      • Regnaut S.
      • Goudet J.
      Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study.
      ]. The overall quality of genetic clusters is noticeably reduced in the expanded 288 autosomal SNP dataset compared to the dedicated BGA SNP dataset, particularly for the less divergent populations of NAF, ME and SAS, where a significant number of mixed genetic cluster patterns are observed amongst these three populations, using all 288 SNPs.
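      For readers unfamiliar with the Evanno charts mentioned above, the sketch below shows one common way of computing DeltaK from the L(K) values of replicate STRUCTURE runs: the absolute second-order difference of the mean L(K), divided by the standard deviation of L(K) across replicates. The log-likelihood values are hypothetical, and taking the replicate means before differencing is an assumption about the implementation, not a description of the analysis performed in this study.

```python
from statistics import mean, stdev
from typing import Dict, List

def delta_k(lnp: Dict[int, List[float]]) -> Dict[int, float]:
    """Evanno-style DeltaK for each K that has both neighbours K-1 and K+1.

    lnp maps K -> list of L(K) values from replicate runs (at least two).
    """
    result = {}
    for k in sorted(lnp):
        if k - 1 not in lnp or k + 1 not in lnp:
            continue  # second-order difference needs both neighbours
        second_diff = abs(mean(lnp[k + 1]) - 2 * mean(lnp[k]) + mean(lnp[k - 1]))
        spread = stdev(lnp[k])
        result[k] = second_diff / spread if spread > 0 else float("inf")
    return result

# Hypothetical L(K) values from three replicate runs at K = 4, 5 and 6
example_lnp = {
    4: [-5200.0, -5210.0, -5190.0],
    5: [-5000.0, -5004.0, -4996.0],
    6: [-4990.0, -5010.0, -4970.0],
}
dk = delta_k(example_lnp)
print(dk)  # {5: 47.5}
```

      A peak in DeltaK is conventionally taken as an indication of the best-supported K, but it cannot be evaluated for the lowest or highest K tested.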

      4. Discussion

      The studies described here have largely concentrated on the added benefit brought by including Y-SNP, X-SNP, and MH markers that all have strong population differentiation properties in the ET ancestry panel. While it is important to acknowledge that many of the autosomal SNPs originally part of the VISAGE BT ancestry panel were replaced with new markers for ET, most of these new BGA SNPs are already well established for forensic use. It was only necessary to adjust the balance of markers towards EUR, EAS and AMR differentiations, and reduce those for AFR and OCE. Expanding the set of ME-informative SNPs in ET has provided considerable benefits in terms of the successful identification of ME and NAF genetic clusters when STRUCTURE is run at K:6 with Eurasian-orientated reference populations (grey and red genetic clusters, respectively, in Fig. 6). Our analyses show in almost all central Middle East population samples from the Sanger ME variant datasets and VISAGE Study samples from these regions, there are majority membership proportions from one or both ME and NAF genetic clusters. Where samples are from regions on the periphery of the central Middle East area, the other genetic clusters the STRUCTURE analyses identified correspond well to their geographic position in this broadly-based region. Specifically, many Turkish show EUR co-ancestry; East Africans have predominant AFR co-ancestries; and the Sanger Emirati in the East (although we cannot place populations A-D in specific geographic positions) show SAS co-ancestry in many individuals. Although it is not appropriate or viable to use STRUCTURE to assign a sample to a specific population, the patterns we have generated with ‘nested’ Eurasian reference population STRUCTURE runs allow a sample with ‘grey’ and/or ‘red’ clusters proportions above 10% to be identified as coming from the Middle East, and exclude an origin from sub-Saharan Africa, Europe, South Asia, or East Asia. 
In many cases, individuals show a characteristic signature of North African or East African population origins as distinct from the central Middle Eastern regions shown in Fig. 6. Therefore, we consider the goal set by VISAGE of developing an ancestry panel that can efficiently differentiate Middle East population origins from the neighbouring population groups, was largely met and did not require a very large expansion of BGA SNP numbers in ET to accomplish this goal. The adaptation of STRUCTURE runs into a nested approach which analyses a reduced set of reference populations with a narrow range of possible K values, has helped to focus ancestry analyses on the most appropriate regions and as our analyses show, enables more detailed genetic cluster differentiations to be made for the Middle East.
      The other expansion made for the ET ancestry panel, that of broadening the types of ancestry informative markers to include X-SNPs, Y-SNPs and MH loci, has more specialised applications in ancestry analyses used for forensic casework. With the distinctions that can be reliably made between AFR, EUR and AMR co-ancestries with the autosomal BGA SNPs of ET, admixed American individuals can be detected and then analysed efficiently. Consequently, more detail is obtained for male samples by adding the analysis of patterns of variation observed in X and Y chromosome markers. The level of detail we were able to achieve in the analysis of Brazilian samples, which are often too complex in their co-ancestry patterns to be easily studied with small-scale marker sets, highlights the power of combining marker sets with slightly contrasted genetic histories. Such histories often follow admixture events from up to three different contributing populations, and with the complicating effect of varied sex bias in different parts of the same geographic region. Nevertheless, we highlight the problems we encountered in reliably differentiating three-way co-ancestry cluster patterns in Puerto Ricans (PUR) and obtaining comparable data to those of the genome-wide SNP data from 1000 Genomes. Therefore, it is necessary to remain cautious when three different co-ancestries are detected in an individual, as small-scale autosomal BGA SNP panels may not reliably measure their relative proportions compared to genome-wide data.
      A key characteristic favouring the use of Microhaplotypes in forensic DNA analysis has been their ability to analyse mixed DNA without the hindrance of non-allelic PCR stutter products complicating the patterns seen [Bennett L. et al., Mixture deconvolution by massively parallel sequencing of microhaplotypes]. Previously, we developed an approach for analysing mixed DNA with MHs which specifically exploited MH loci with strongly contrasting haplotype frequencies in different population groups. Although this was a simple pilot study limited to a single mixed DNA at a few ratios, we have demonstrated that the 21 MHs chosen for ET successfully assign ancestries to the components of 2-way mixed DNA, notably when there is imbalance in their ratios, which makes sequence comparisons easier to achieve. This approach is helped by the ease with which MH loci can be analysed with STRUCTURE and the differentiation they provide of Europe, Africa, and East Asia. Although such analyses are not amenable to a high-throughput MPS pipeline, since haplotypes must be reconstructed locus-by-locus and their sequence ratios estimated, the ability to detect the likely ancestry of contributors could potentially provide key extra information for investigators.
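      To illustrate how contrasting haplotype frequencies can point to the ancestry of a separated mixture component, the sketch below scores a contributor's microhaplotype profile against several populations. It is not the STRUCTURE analysis used in the study: a simpler likelihood comparison is swapped in, assuming Hardy-Weinberg proportions and independent loci. The locus names, haplotypes, frequencies and the function name are all invented for illustration.

```python
import math


def population_log_likelihoods(profile, hap_freqs, floor=0.001):
    """Log-likelihood of a diploid microhaplotype profile in each population.

    profile:   {locus: (haplotype_a, haplotype_b)}
    hap_freqs: {population: {locus: {haplotype: frequency}}}
    floor:     minimum frequency used for haplotypes unseen in a population
               (an assumption, to avoid taking the log of zero)
    """
    scores = {}
    for population, loci in hap_freqs.items():
        total = 0.0
        for locus, (hap_a, hap_b) in profile.items():
            p = max(loci[locus].get(hap_a, 0.0), floor)
            q = max(loci[locus].get(hap_b, 0.0), floor)
            # Hardy-Weinberg genotype probability: p^2 or 2pq
            genotype_prob = p * p if hap_a == hap_b else 2.0 * p * q
            total += math.log(genotype_prob)
        scores[population] = total
    return scores


# Invented toy frequencies for two hypothetical loci (not ET reference data)
TOY_FREQS = {
    "AFR": {"MH01": {"ACG": 0.70, "ATG": 0.20, "GTA": 0.10},
            "MH02": {"CC": 0.60, "CT": 0.30, "TT": 0.10}},
    "EUR": {"MH01": {"ACG": 0.10, "ATG": 0.80, "GTA": 0.10},
            "MH02": {"CC": 0.10, "CT": 0.20, "TT": 0.70}},
    "EAS": {"MH01": {"ACG": 0.05, "ATG": 0.15, "GTA": 0.80},
            "MH02": {"CC": 0.20, "CT": 0.70, "TT": 0.10}},
}

# Hypothetical profiles of the major and minor contributors after the
# haplotypes have been separated using their sequence read ratios
major_profile = {"MH01": ("ATG", "ATG"), "MH02": ("TT", "CT")}
minor_profile = {"MH01": ("ACG", "ACG"), "MH02": ("CC", "CC")}

major_scores = population_log_likelihoods(major_profile, TOY_FREQS)
minor_scores = population_log_likelihoods(minor_profile, TOY_FREQS)
major_best = max(major_scores, key=major_scores.get)
minor_best = max(minor_scores, key=minor_scores.get)
```

      With real data the separation step is the harder part, since shared haplotypes and balanced mixture ratios blur the assignment of haplotypes to contributors.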
      The adaptation of the BT ancestry panel, comprising mainly established autosomal forensic BGA SNPs, into the much more broadly based set of BGA markers in ET represents a considerable enhancement of the scope and power of forensic ancestry analysis using MPS, as the chosen name for the VISAGE Enhanced Tool implies.

      Acknowledgments

      The study was supported by the European Union’s Horizon 2020 Research and Innovation Programme under grant agreement No. 740580 within the framework of the VISible Attributes through GEnomics (VISAGE) Project and Consortium. M.d.l.P. is supported by a post-doctorate grant funded by the Consellería de Cultura, Educación e Ordenación Universitaria e da Consellería de Economía, Emprego e Industria from Xunta de Galicia, Spain (ED481D-2021–008). J.R. is supported by the “Programa de axudas á etapa predoutoral” funded by the Consellería de Cultura, Educación e Ordenación Universitaria e da Consellería de Economía, Emprego e Industria from Xunta de Galicia, Spain (ED481A-2020/039). C.P., A.F.A., A.M.M., M.d.l.P., M.V.L. and the work to compile ancestry informative tri-allelic SNPs and microhaplotypes are supported by MAPA, ‘Multiple Allele Polymorphism Analysis’ (BIO2016–78525-R), a research project funded by the Spanish Research State Agency (AEI) and co-financed with ERDF funds. The population studies by S.O. at the University of Santiago de Compostela were financed by the Fundação de Apoio a Pesquisa do Distrito Federal (FAPDF), Brazil.
      The authors gratefully acknowledge the sharing of genetic cluster analysis information from the 1000 Genomes Phase III SNP data, kindly provided by Adam Auton, Department of Genetics, Albert Einstein College of Medicine, Bronx, NYC, USA. The authors thank Luciana Maia Escher dos Santos and Sabrina Guimarães Paiva for their dedicated work in the collection of samples from rural and urban Brazil used in this study. All STRUCTURE analyses were performed by the FinisTerrae II supercomputer at the Centro de Supercomputación de Galicia, Santiago de Compostela (CESGA), Spain.

      Appendix A

      Centres and investigators of the VISible Attributes through GEnomics (VISAGE) Consortium, Website: http://www.visage-h2020.eu/ (accessed 1st February 2023).
      • Erasmus MC University Medical Center Rotterdam, Rotterdam, the Netherlands: Manfred Kayser, Vivian Kalamara, Arwin Ralf, Athina Vidaki.
      • Jagiellonian University, Krakow, Poland: Wojciech Branicki, Ewelina Pośpiech, Aleksandra Pisarek.
      • Universidade de Santiago de Compostela, Santiago de Compostela, Spain: Ángel Carracedo, Maria Victoria Lareu, Christopher Phillips, Ana Freire-Aradas, Ana Mosquera-Miguel, María de la Puente.
      • Medizinische Universität Innsbruck, Innsbruck, Austria: Walther Parson, Catarina Xavier, Antonia Heidegger, Harald Niederstätter.
      • Universität zu Köln, Cologne, Germany: Michael Nothnagel, Maria-Alexandra Katsara, Tarek Khellaf.
      • King’s College London, London, UK: Barbara Prainsack, Gabrielle Samuel.
      • Klinikum der Universität zu Köln, Cologne, Germany: Peter M. Schneider, Theresa E. Gross, Jan Fleckhaus, Elaine Cheung.
      • Bundeskriminalamt, Wiesbaden, Germany: Ingo Bastisch, Nathalie Schury, Jens Teodoridis, Martina Unterländer.
      • Institut National de Police Scientifique, Lyon, France: François-Xavier Laurent, Caroline Bouakaze, Yann Chantrel, Anna Delest, Clémence Hollard, Ayhan Ulus, Julien Vannier.
      • Netherlands Forensic Institute, The Hague, the Netherlands: Titia Sijen, Kris van der Gaag, Marina Ventayol-Garcia.
      • National Forensic Centre, Swedish Police Authority, Linköping, Sweden: Johannes Hedman, Klara Junker, Maja Sidstedt.
      • Metropolitan Police Service, London, United Kingdom: Shazia Khan, Carole E. Ames, Andrew Revoir.
      • Centralne Laboratorium Kryminalistyczne Policji, Warsaw, Poland: Magdalena Spólnicka, Ewa Kartasinska, Anna Woźniak.

      Appendix B. Supplementary material

      References

        • Phillips C.
        Forensic genetic analysis of bio-geographical ancestry.
        Forensic Sci. Int. Genet. 2015; 18: 49-65
        • Kayser M.
        Forensic DNA Phenotyping: Predicting human appearance from crime scene material for investigative purposes.
        Forensic Sci. Int. Genet. 2015; 18: 33-48
        • Freire-Aradas A.
        • Phillips C.
        • Lareu M.V.
        Forensic individual age estimation with DNA: from initial approaches to methylation tests.
        Forensic Sci. Rev. 2017; 29: 121-144
        • de la Puente M.
        • Ruiz-Ramírez J.
        • Ambroa-Conde A.
        • Xavier C.
        • Pardo-Seco J.
        • Álvarez-Dios J.
        • Freire-Aradas A.
        • Mosquera-Miguel A.
        • Gross T.E.
        • Cheung E.Y.Y.
        • et al.
        Development and evaluation of the ancestry informative marker panel of the VISAGE basic tool.
        Genes. 2021; 12: 1284
        • Xavier C.
        • de la Puente M.
        • Mosquera-Miguel A.
        • Freire-Aradas A.
        • Kalamara V.
        • Vidaki A.
        • Gross T.E.
        • Revoir A.
        • Pośpiech E.
        • Kartasinśka E.
        • et al.
        Development and validation of the VISAGE AmpliSeq basic tool to predict appearance and ancestry from DNA.
        Forensic Sci. Int. Genet. 2020; 48: 102336
        • Palencia-Madrid L.
        • Xavier C.
        • de la Puente M.
        • Hohoff C.
        • Phillips C.
        • Kayser M.
        • Parson W.
        VISAGE consortium, evaluation of the VISAGE basic tool for appearance and ancestry prediction using PowerSeq chemistry on the MiSeq FGx system.
        Genes. 2020; 11: 708
        • Heidegger A.
        • Xavier C.
        • Niederstätter H.
        • de la Puente M.
        • Pośpiech E.
        • Pisarek A.
        • Kayser M.
        • Branicki W.
        • Parson W.
        VISAGE consortium, development and optimization of the VISAGE basic prototype tool for forensic age estimation.
        Forensic Sci. Int. Genet. 2020; 48: 102322
        • Woźniak A.
        • Heidegger A.
        • Piniewska-Róg D.
        • Pośpiech E.
        • Xavier C.
        • Pisarek A.
        • Kartasińska E.
        • Boroń M.
        • Freire-Aradas A.
        • Wojtas M.
        • et al.
        Development of the VISAGE enhanced tool and statistical models for epigenetic age estimation in blood, buccal cells and bones.
        Aging. 2021; 13: 6459-6484
        • Pisarek A.
        • Pośpiech E.
        • Heidegger A.
        • Xavier C.
        • Papież A.
        • Piniewska-Róg D.
        • Kalamara V.
        • Potabattula R.
        • Bochenek M.
        • Sikora-Polaczek M.
        • et al.
        Epigenetic age prediction in semen - marker selection and model development.
        Aging. 2021; 13: 19145-19164
        • Heidegger A.
        • Pisarek A.
        • de la Puente M.
        • Niederstätter H.
        • Pośpiech E.
        • Woźniak A.
        • Schury N.
        • Unterländer M.
        • Sidstedt M.
        • Junker K.
        • et al.
        Development and inter-laboratory validation of the VISAGE enhanced tool for age estimation from semen using quantitative DNA methylation analysis.
        Forensic Sci. Int. Genet. 2020; 56: 102596
        • de la Puente M.
        • Ruiz-Ramírez M.J.
        • Ambroa-Conde A.
        • Xavier C.
        • Amigo J.
        • Casares de Cal M.A.
        • Gómez-Tato A.
        • Carracedo A.
        • Parson W.
        • Phillips C.
        • Lareu M.V.
        Broadening the applicability of a custom multi-platform panel of Microhaplotypes: Bio-geographical ancestry inference and expanded reference data.
        Front. Genet. 2020; 11: 581041
        • Pereira V.
        • Freire-Aradas A.
        • Ballard D.
        • Børsting C.
        • Diez V.
        • Pruszkowska-Przybylska P.
        • Ribeiro J.
        • Achakzai N.M.
        • Aliferi A.
        • Bulbul O.
        • et al.
        Development and validation of the EUROFORGEN NAME (North African and Middle Eastern) ancestry panel.
        Forensic Sci. Int. Genet. 2019; 42: 260-267
        • Phillips C.
        • Parson W.
        • Lundsberg B.
        • Santos C.
        • Freire-Aradas A.
        • Torres M.
        • Eduardoff M.
        • Børsting C.
        • Johansen P.
        • Fondevila M.
        • et al.
        Building a forensic ancestry panel from the ground up: the EUROFORGEN Global AIM-SNP set.
        Forensic Sci. Int. Genet. 2014; 11: 13-25
        • Galanter J.M.
        • Fernandez-Lopez J.C.
        • Gignoux C.R.
        • Barnholtz-Sloan J.
        • Fernandez-Rozadilla C.
        • Via M.
        • Hidalgo-Miranda A.
        • Contreras A.V.
        • Figueroa L.U.
        • Raska P.
        • et al.
        Development of a panel of genome-wide ancestry informative markers to study admixture throughout the Americas.
        PLoS Genet. 2012; 8: e1002554
        • Phillips C.
        • Freire Aradas A.
        • Kriegel A.K.
        • Fondevila M.
        • Bulbul O.
        • Santos C.
        • Serrulla Rech F.
        • Perez Carceles M.D.
        • Carracedo A.
        • Schneider P.M.
        • Lareu M.V.
        Eurasiaplex: a forensic SNP assay for differentiating European and South Asian ancestries.
        Forensic Sci. Int. Genet. 2013; 7: 359-366
        • Santos C.
        • Phillips C.
        • Fondevila M.
        • Daniel R.
        • van Oorschot R.A.H.
        • Burchard E.G.
        • Schanfield M.S.
        • Souto L.J.
        • Uacyisrael J.
        • Via M.
        • et al.
        Pacifiplex: An ancestry-informative SNP panel centred on Australia and the Pacific region.
        Forensic Sci. Int. Genet. 2016; 20: 71-80
        • Carvalho Gontijo C.
        • Porras-Hurtado L.G.
        • Freire-Aradas A.
        • Fondevila M.
        • Santos C.
        • Salas A.
        • Henao J.
        • Isaza C.
        • Beltrán L.
        • Nogueira Silbiger V.
        • et al.
        PIMA: A population informative multiplex for the Americas.
        Forensic Sci. Int. Genet. 2020; 44: 102200
        • The 1000 Genomes Project Consortium
        • Auton A.
        • Brooks L.D.
        • Durbin R.M.
        • Garrison E.P.
        • Kang H.M.
        • Korbel J.O.
        • Marchini J.L.
        • McCarthy S.
        • McVean G.A.
        • et al.
        A global reference for human genetic variation.
        Nature. 2015; 526: 68-74
        • Amigo J.
        • Phillips C.
        • Lareu M.
        • Carracedo Á.
        The SNPforID browser: an online tool for query and display of frequency data from the SNPforID project.
        Int. J. Leg. Med. 2008; 122: 435-440
        • Bergström A.
        • McCarthy S.A.
        • Hui R.
        • Almarri M.A.
        • Ayub Q.
        • Danecek P.
        • Chen Y.
        • Felkel S.
        • Hallast P.
        • Kamm J.
        • et al.
        Insights into human genetic variation and population history from 929 diverse genomes.
        Science. 2020; 367: 1339-1349
        • Byrska-Bishop M.
        • Evani U.S.
        • Zhao X.
        • Basile A.O.
        • Abel H.J.
        • Regier A.A.
        • Corvelo A.
        • Clarke W.E.
        • Musunuri R.
        • Nagulapalli K.
        • et al.
        High coverage whole-genome-sequencing of the expanded 1000 Genomes Project cohort including 602 trios.
        Cell. 2022; 185: 3426-3440 (VCF data available online: https://www.internationalgenome.org/dataportal/data-collection/30x-grch38)
        • Almarri M.A.
        • Haber M.
        • Lootah R.A.
        • Hallast P.
        • Al Turki S.
        • Martin H.C.
        • Xue Y.
        • Tyler-Smith C.
        The genomic history of the Middle East.
        Cell. 2021; 184: 4612-4625
        • Phillips C.
        • Amigo J.
        • Tillmar A.O.
        • Peck M.A.
        • de la Puente M.
        • Ruiz-Ramírez J.
        • Bittner F.
        • Idrizbegović Š.
        • Wang Y.
        • Parsons T.J.
        • et al.
        A compilation of tri-allelic SNPs from 1000 Genomes and use of the most polymorphic loci for a large-scale human identification panel.
        Forensic Sci. Int. Genet. 2020; 46: 102232
        • Ralf A.
        • van Oven M.
        • Montiel González D.
        • de Knijff P.
        • van der Beek K.
        • Wootton S.
        • Lagacé R.
        • Kayser M.
        Forensic Y-SNP analysis beyond SNaPshot: High-resolution Y-chromosomal haplogrouping from low quality and quantity DNA using Ion AmpliSeq and targeted massively parallel sequencing.
        Forensic Sci. Int. Genet. 2019; 41: 93-106
        • Li J.Z.
        • Absher D.M.
        • Tang H.
        • Southwick A.M.
        • Casto A.M.
        • Ramachandran S.
        • Cann H.M.
        • Barsh G.S.
        • Feldman M.
        • Cavalli-Sforza L.L.
        • Myers R.M.
        Worldwide human relationships inferred from genome-wide patterns of variation.
        Science. 2008; 319: 1100-1104
        • Phillips C.
        • Ballard D.
        • Gill P.
        • Court D.S.
        • Carracedo A.
        • Lareu M.V.
        The recombination landscape around forensic STRs: accurate measurement of genetic distances between syntenic STR pairs using HapMap high density SNP data.
        Forensic Sci. Int. Genet. 2012; 6: 345-365
        • Phillips C.
        • McNevin D.
        • Kidd K.K.
        • Lagacé R.
        • Wootton S.
        • de la Puente M.
        • Freire-Aradas A.
        • Mosquera-Miguel A.
        • Eduardoff M.
        • Gross T.E.
        • et al.
        MAPlex-A massively parallel sequencing ancestry analysis multiplex for Asia-Pacific populations.
        Forensic Sci. Int. Genet. 2019; 42: 213-226
        • Cheung E.Y.Y.
        • Phillips C.
        • Eduardoff M.
        • Lareu M.V.
        • McNevin D.
        Performance of ancestry-informative SNP and microhaplotype markers.
        Forensic Sci. Int. Genet. 2019; 43: 102141
        • Kidd K.K.
        • Speed W.C.
        • Pakstis A.J.
        • Podini D.S.
        • Lagacé R.
        • Chang J.
        • Wootton S.
        • Haigh E.
        • Soundararajan U.
        Evaluating 130 microhaplotypes across a global set of 83 populations.
        Forensic Sci. Int. Genet. 2017; 6: 29-37
        • Mallick S.
        • Li H.
        • Lipson M.
        • Mathieson I.
        • Gymrek M.
        • Racimo F.
        • Zhao M.
        • Chennagiri N.
        • Nordenfelt S.
        • Tandon A.
        • et al.
        The simons genome diversity project: 300 genomes from 142 diverse populations.
        Nature. 2016; 538: 201-206
        • Pagani L.
        • Lawson D.J.
        • Jagoda E.
        • Mörseburg A.
        • Eriksson A.
        • Mitt M.
        • Clemente F.
        • Hudjashov G.
        • DeGiorgio M.
        • Saag L.
        • et al.
        Genomic analyses inform on migration events during the peopling of Eurasia.
        Nature. 2016; 538: 238-242
        • Phillips C.
        • Amigo J.
        • McNevin D.
        • de la Puente M.
        • Cheung E.Y.Y.
        • Lareu M.V.
        Online population data resources for forensic SNP analysis with Massively Parallel Sequencing: An overview of online population data for forensic purposes.
        in: Pilli E., Berti A. (Eds.), Forensic DNA Analysis: Technological Development and Innovative Applications. CRC Press, Boca Raton, FL, USA, 2021
      1. Available online: http://mathgene.usc.es/Snipper/ Multiple profiles classifier at: 〈http://mathgene.usc.es/snipper/analysismultipleprofiles.html〉 (both accessed 1st February 2023).

        • Pritchard J.K.
        • Stephens M.
        • Donnelly P.
        Inference of population structure using multilocus genotype data.
        Genetics. 2000; 155: 945-959
        • Kopelman N.M.
        • Mayzel J.
        • Jakobsson M.
        • Rosenberg N.A.
        • Mayrose I.
        Clumpak: a program for identifying clustering modes and packaging population structure inferences across K.
        Mol. Ecol. Resour. 2015; 15: 1179-1191
        • Evanno G.
        • Regnaut S.
        • Goudet J.
        Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study.
        Mol. Ecol. 2005; 14: 2611-2620
        • Santos C.
        • Phillips C.
        • Gomez-Tato A.
        • Alvarez-Dios J.
        • Carracedo A.
        • Lareu M.V.
        Inference of ancestry in forensic analysis II: analysis of genetic data.
        Methods Mol. Biol. 2016; 1420: 255-285
        • de la Puente M.
        • Phillips C.
        • Xavier C.
        • Amigo J.
        • Carracedo A.
        • Parson W.
        • Lareu M.V.
        Building a custom large-scale panel of novel microhaplotypes for forensic identification using MiSeq and Ion S5 massively parallel sequencing systems.
        Forensic Sci. Int. Genet. 2020; 48: 102213
        • Li H.
        • Durbin R.
        Fast and accurate short read alignment with Burrows-Wheeler transform.
        Bioinformatics. 2009; 25: 1754-1760
        • Li H.
        • Handsaker B.
        • Wysoker A.
        • Fennell T.
        • Ruan J.
        • Homer N.
        • Marth G.
        • Abecasis G.
        • Durbin R.
        The sequence Alignment/Map format and SAMtools.
        Bioinformatics. 2009; 25: 2078-2079
      2. N. Thomas, R Package - Microhaplot, (2019) 〈https://github.com/ngthomas/microhaplot〉. (Accessed 1st February 2023).

        • Phillips C.
        • Amigo J.
        • Carracedo A.
        • Lareu M.V.
        Tetra-allelic SNPs: Informative forensic markers compiled from public whole-genome sequence data.
        Forensic Sci. Int. Genet. 2015; 19: 100-106
        • Lek M.
        • Karczewski K.J.
        • Minikel E.V.
        • Samocha K.E.
        • Banks E.
        • Fennell T.
        • O’Donnell-Luria A.H.
        • Ware J.S.
        • Hill J.A.J.
        • Cummings B.B.
        • et al.
        Analysis of protein-coding genetic variation in 60,706 humans.
        Nature. 2016; 536: 285-291
      3. 〈http://www.ensembl.org/Homo_sapiens/Variation/Population?db=core;r=6:60527829–60528829;v=rs3857620;vdb=variation;vf=169483878〉, (Accessed 1st February 2023).

        • Bleka Ø.
        • Eduardoff M.
        • Santos C.
        • Phillips C.
        • Parson W.
        • Gill P.
        Open source software EuroForMix can be used to analyse complex SNP mixtures.
        Forensic Sci. Int. Genet. 2017; 31: 105-110
        • Xavier C.
        • de la Puente M.
        • Mosquera-Miguel M.
        • Freire-Aradas A.
        • Kalamara V.
        • Revoir A.
        • Gross T.E.
        • Schneider P.M.
        • Ames C.
        • Hohoff C.
        • et al.
        Development and inter-laboratory evaluation of the VISAGE Enhanced Tool for appearance and ancestry inference from DNA.
        Forensic Sci. Int. Genet. 2022; 61: 102779
        • Carvalho Gontijo C.
        • Macêdo Mendes F.
        • Santos C.A.
        • de M.
        • Klautau-Guimarães N.
        • Lareu M.V.
        • Carracedo A.
        • Phillips C.
        • Oliveira S.F.
        Ancestry analysis in rural Brazilian populations of African descent.
        Forensic Sci. Int. Genet. 2018; 36: 160-166
        • Rosenberg N.A.
        • Pritchard J.K.
        • Weber J.L.
        • Cann H.M.
        • Kidd K.K.
        • Zhivotovsky L.A.
        • Feldman M.W.
        Genetic structure of human populations.
        Science. 2002; 298: 2381-2385
        • Bennett L.
        • Oldoni F.
        • Long K.
        • Cisana S.
        • Madella K.
        • Wootton S.
        • Chang J.
        • Hasegawa R.
        • Lagacé R.
        • Kidd K.K.
        • Podini D.
        Mixture deconvolution by massively parallel sequencing of microhaplotypes.
        Int. J. Leg. Med. 2019; 133: 719-729