If you don't remember your password, you can reset it by entering your email address and clicking the Reset Password button. You will then receive an email that contains a secure link for resetting your password
If the address matches a valid account an email will be sent to __email__ with instructions for resetting your password
Sequencing of autosomal, mitochondrial and Y-chromosomal forensic markers in the People of the British Isles cohort detects population structure dominated by patrilineages
Well-characterised population samples from People of the British Isles cohort.
•
Simultaneous sequence analysis of Y, autosomal and mitochondrial forensic markers.
•
Increased diversity of STRs via sequencing.
•
Patrilineal population structure suggests male-biased Anglo-Saxon contribution.
•
Some autosomal structure at sequence level only.
Abstract
Short tandem repeat (STR) polymorphisms are traditionally assessed by measuring allele lengths via capillary electrophoresis (CE). Massively parallel sequencing (MPS) reveals differences among alleles of the same length, thus improving discrimination, but also identifying groups of alleles likely related by descent. These may have relatively restricted geographical distributions and thus MPS could detect population structure more effectively than CE-based analysis. We addressed this question by applying an MPS multiplex, the Promega PowerSeq™ Auto/Mito/Y System prototype, to 362 individuals chosen to represent a wide geographical spread from the People of the British Isles (PoBI) cohort, which represents at least three generations of local rural ancestry. As well as 22 autosomal STRs (aSTRs; equivalent to PowerPlex Fusion loci) the system sequences 23 Y-STRs (the PowerPlexY 23 loci) and the control region (CR) of mitochondrial DNA (mtDNA), allowing population structure to be compared across biparentally and uniparentally inherited segments of the genome. For all loci, FST-based tests of population structure were done based on historical, linguistic, and geographical partitions, and for aSTRs the clustering algorithm STRUCTURE was also applied. STRs were considered using both length and sequence. Sequencing increased aSTR allele diversity by 87.5% compared to CE-based designations, reducing random match probability to 1.25E-30, compared to a CE-based 6.72E-27. Significant population structure was detectable in just one pairwise comparison (Central/South East England compared to the rest), and for sequence-based alleles only. The 362 samples carried 308 distinct mtDNA CR haplotypes corresponding to 13 broad haplogroups, representing a haplotype diversity of 0.9985 ( ± 0.0005), and a haplotype match probability of 0.0043. No significant population structure was observed. Y-STR haplotypes belonged to ten broad predicted Y-haplogroups. Allele diversity increased by 33% when considered at the sequence rather than length level, although haplotype diversity was unchanged at 0.999969 ( ± 0.000001); haplotype match probability was 2.79E-03. In contrast to the biparentally and maternally inherited loci, Y-STR haplotypes showed significant population structure at several levels, but most markedly in a comparison of regions subject to Anglo-Saxon influence in the east with the rest of the sample. This was evident for both length- and sequence-based allele designations, with no systematic difference between the two. We conclude that MPS analysis of aSTRs or Y-STRs does not generally reveal stronger population structure than length-based analysis, that UK maternal lineages are not significantly structured, and that Y-STR haplotypes reveal significant population structure that may reflect the Anglo-Saxon migrations to Britain in the 6th century.
The traditional approach to analysing short tandem repeats (STRs) uses capillary electrophoresis (CE) to distinguish alleles by fragment length. Alleles reach these lengths via a multitude of mutational pathways, so in effect each length-based category represents a set of alleles with diverse evolutionary and mutational histories. This fact is becoming increasingly clear with the application of massively parallel sequencing (MPS), which resolves STR alleles of the same lengths (isometric alleles) via detection of repeat pattern variation (RPV) within arrays, and single nucleotide polymorphisms (SNPs) and indels both in flanking and repeat sequences [
], but in doing so it identifies groups of alleles that are more likely to be related by descent since they share features, such as RPV and SNPs, that have relatively low mutation rates. Such allele groups may have relatively restricted geographical distributions, and this then raises the question of whether sequence-based analysis will reveal a higher degree of population structure than analysis by length alone. For STRs on the male-specific region of the Y chromosome (MSY; Y-STRs), sequence-based features show strong associations with SNP-defined haplogroups [
]. For autosomal STRs (aSTRs), sequence analysis in five continental groups with the ForenSeq™ DNA Signature Prep Kit showed no difference in FST compared to CE-based measures [
]. However, little work has so far been done to assess detailed population structure at a finer geographical scale using sequence-based analysis and to compare this to traditional length-based analysis.
In this study we address this question by applying an MPS multiplex, the Promega PowerSeq™ Auto/Mito/Y System prototype, which analyses 22 aSTRs and 23 Y-STRs, as well as the control region of mitochondrial DNA (mtDNA), to a carefully recruited and well-characterised population set. The inclusion of mtDNA allows a comparison of structuring across biparentally inherited and the two uniparentally inherited segments of the genome, and the potential detection of sex-biased effects at the population level.
The population samples analysed here were collected as part of the People of the British Isles (PoBI) project [
]. DNA donors were recruited to represent local ancestry, relatively unaffected by recent migration, by stipulating that all four grandparents’ birthplaces should fall within an 80-km radius in a rural area. Given this recruitment strategy and the fact that most DNA donors were over 60 years of age [
], the PoBI study effectively sampled the genomes of ancestors who were born in the late nineteenth century in rural locations: this should maximise the opportunity to observe any population structure in forensic markers.
Members of the PoBI cohort were previously typed using genome-wide SNP chips [
], a haplotype-based clustering algorithm that captures subtle levels of genetic differentiation by modelling linkage disequilibrium (LD) between SNPs. Individuals were assigned to one of 17 clusters, and the geographical distribution of cluster membership was assessed post facto by placing each individual on a map at the centroid of grandparental birthplaces. This revealed a high degree of population structure (Fig. 1a), with clusters on the map mostly non-overlapping, and striking examples of local differentiation. Westerly regions sometimes grouped together as ‘Celtic’ because of a history of speaking a Celtic language showed considerable heterogeneity, while 51% of the individuals in the study formed a large eastern cluster (red ellipse in Fig. 1a) including central and southern England and extending up the east coast to north Yorkshire. Analysis of this major cluster in terms of contributions from European populations suggested ~35% ancestry from a north-west German group, interpreted as an influence of Anglo-Saxon mass migration. Absence of any cluster corresponding to the Danelaw, the region under Scandinavian administrative control from the 9th – 11th centuries (Fig. 1b), was taken as evidence of a relatively minor contribution from Scandinavian Viking migrants [
]. Discovery of this structure relied upon the analysis of a very large number of autosome-wide markers, and consideration of LD between them. With the small number of forensic markers analysed here we do not expect to observe similar signals, but it is of interest to ask if any significant structure emerges, and furthermore to compare different components of the genome in this respect.
Fig. 1FineSTRUCTURE clusters in PoBI autosomal SNP data, cultural subdivisions, and geographical origins of samples used in this study. a) Each of the 17 fineSTRUCTURE clusters found in PoBI autosomal SNP data on 2039 individuals is shown as an ellipse representing the 90% probability region of a fitted t-distribution, with the cluster’s symbol in the centre. The tree (bottom left) shows the hierarchical order of cluster merging, and to the right of the tree the five divisions of clusters tested in this study are indicated, corresponding to Fig. S1d. b) The extent of regions of Anglo-Saxon influence (based on
; in blue) the Danelaw (in yellow), and membership of both categories (in green). Map images in parts a and b modified from an original work by Wikishire, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=36830415. c) Geographical origins of samples of PoBI males used in this study. Circles indicate the 37 mostly county-based sample locations, with areas proportional to sample size. Colours indicate assignment to the ten regional groupings as indicated in the key. SCO: Scotland; IRE: Ireland plus the Isle of Man; NW: North West; NE: North East; WAL: Wales, WM: West Midlands; EM: East Midlands; EA: East; SW: South West; SE: South East. Map generated using Microreact
], chosen to represent the geographical spread of the whole cohort. We sequence 22 aSTRs, the mtDNA control region, and 23 Y-STRs, describe the observed diversity, compare length- and sequence-based data on STRs, and ask if significant population structure can be observed, based on a range of historical, linguistic, and geographical divisions of the sample set.
2. Materials and methods
2.1 DNA samples and classification of individuals
The 362 individuals analysed in this study were selected from the males of the People of the British Isles (PoBI) sample set [
]. Ethical review was undertaken by the Leeds (West) Research Ethics Committee (Reference 05/Q1205/35). In selecting samples for this study, we aimed to cover a wide geographical range evenly.
In the complete PoBI dataset each individual was assigned to one of 47 mostly county-based geographical groups [
]. In the smaller subset chosen here some neighbouring groups were combined to increase and balance population size, leading to 37 geographically based populations (Fig. 1c; Table S1). Five larger-scale divisions of the sample set were also considered for analysis, to allow tests of our data against patterns of geography, historical regions, or previously described regions of genetic differentiation (Table S1; Fig. S1): (i) Ten large geographic administrative regions: SCO: Scotland; IRE: Ireland plus the Isle of Man; NW: North West; NE: North East; WAL: Wales, WM: West Midlands; EM: East Midlands; EA: East; SW: South West; SE: South East (Fig. 1c). (ii) A two-way split following regions of Anglo-Saxon control around 600 CE (as defined in [
]) compared to the rest of the British Isles (Fig. 1b; Fig S1a). (iii) A two-way split reflecting the historically distinct Danelaw region compared to the remaining ‘non-Danelaw’ regions (Fig. 1b; Fig. S1b). (iv) Divisions reflecting five major PoBI genetic clusters identified from fineSTRUCTURE analysis of autosomal SNP array data [
]: Orkney; Wales; Scotland, Ireland, Isle of Man, Cumbria and the North East; Cornwall; and the remaining areas of Central and South East England (Fig. 1; Fig. S1c). These are among the deepest splits in the fineSTRUCTURE cladogram ([
] excluding the north vs south Wales split; Fig. 1a); (v) A two-way division reflecting the PoBI fineSTRUCTURE-based Central/South England cluster compared to the remainder (represented by the red square symbol in Fig. 1a; Fig. S1d).
2.2 DNA quantitation, PCR amplification, and sequencing
DNA samples were quantified using the Qubit® dsDNA BR assay kit (Thermo Fisher Scientific) following the manufacturer’s recommended protocol.
The Promega PowerSeq™ Auto/Mito/Y System prototype was applied following the manufacturer’s recommended protocol, with 0.5 ng genomic DNA as template, to amplify the following loci in a single multiplex: 22 autosomal STRs (D1S1656, D2S1338, D2S441, D3S1358, D5S818, D7S820, D8S1179, D10S1248, D12S391, D13S317, D16S539, D18S51, D19S433, D21S11, D22S1045, CSF1P0, FGA, PentaD, PentaE, TH01, TPOX, vWA - equivalent to the PowerPlex Fusion loci), the Amelogenin sex-test marker, 23 Y-STRs (DYS19, DYS385a,b, DYS389I/II, DYS390, DYS391, DYS392, DYS393, DYS437, DYS438, DYS439, DYS448, DYS456, DYS458, DYS481, DYS533, DYS549, DYS570, DYS576, DYS635, DYS643, Y-GATA-H4 - equivalent to the PowerPlexY 23 loci) [
The generated amplicons were purified and quantified, then ~500 ng each were taken through TruSeq™ (Illumina) DNA PCR-free HT library preparation. Sequencing of the prepared libraries followed previously described procedures [
The generated single-end reads were exported in FASTQ format to be analysed by other software, including quality control. Analysis of autosomal and Y-STR sequences using FDSTools v1.1.1 [
FDSTools: a software package for analysis of massively parallel sequencing data with the ability to recognise and correct STR stutter and other PCR or sequencing noise.
] provided allele length and sequence variant information for Amelogenin, 23 Y-STRs and 22 autosomal STRs using the default settings for the inbuilt PowerSeq anchor sets. Sequences of these loci were also analysed using STRait Razor 3.0 [
], and no discrepancies were found between the two tools. Mitochondrial control region reads were analysed using a standard variant calling pipeline and applying the Overarching Read Enrichment Option (OREO) corrections for reference sequence bias [
], which had been typed with the Affymetrix SNP6.0 and Illumina 1.2 M Duo SNP microarrays. These chips include respectively 411 and 223 SNPs for mtDNA, and 901 and 1850 for MSY: to ensure robust MSY haplogroup definition we focused on SNPs included in the current International Society for Genetic Genealogy (ISOGG) SNP list (isogg.org/tree/ISOGG_YDNA_SNP_Index.html) and excluded SNPs that were not phylogenetically concordant with other more derived SNPs, or between microarrays. Our approach allowed definition of 80 different mtDNA haplogroups, and 31 different MSY haplogroups in this dataset (Table S2). Sequence-based Y-STR and autosomal STR data were converted into length-based CE-equivalent data for some analyses. Previous direct comparison of CE (PowerPlex Y23; Promega) and MPS (Promega PowerSeq™ Auto/Mito/Y System prototype) for Y-STR data in a set of 100 diverse global samples [
] demonstrated 99.83% concordance among 2311 alleles called. In addition, a subset of 74 of the samples analysed here (20%) have also been typed by CE using the PowerPlex Y23 Y-STR multiplex and submitted to the Y Haplotype Reference Database [
] as part of submission YA004317; all these haplotypes were completely concordant between CE and MPS. Other independent studies support concordance with CE for aSTR data generated using the PowerSeq system [
], including molecular diversity indices, and F-statistics for population comparisons and differentiation tests. All tests were Bonferroni corrected (see Supplementary Tables). Traditionally, analysis of length-based STR data has been done using the FST analog RST, which incorporates the single-step mutation model [
]. Since our analysis identifies sequence variants that conform better to an infinite alleles model, we use only FST throughout here, even for length-based comparisons. To visualise genetic distances among the samples, pairwise FST values were used to construct multidimensional scaling (MDS) plots using the R software ‘MASS’ package and principal component analysis (PCA) plots using the R ‘stats’ package and ‘factoextra’ package to visualise the outputs. STRAF 1.0.5 [
], using both length- and sequence-based allele definitions, and settings of 100,000 burn-in runs, 100,000 MCMC repetitions, each in 20 iterations from k = 1–10. The results were analysed in Structure Harvester Web v0.6.94 [
]. The desktop version of the Y-STR-based NevGen Y-DNA Haplogroup Predictor software (Gentula and Nevski, Serbian DNA Project, 2015, available at: https://www.nevgen.org/ - with NevGenGeneralPredictor_23_STR.dat file dated 15–10–19) was used to predict MSY haplogroups.
] were used to visualise phylogenetic relationships of mtDNA and Y-STR haplotypes using Network 5.0.1.1 and Network Publisher 2.1.2.5 (both from Fluxus Technology Ltd.; www.fluxus-engineering.com/). For the mtDNA CR, variable positions were coded into a ‘DNA nucleotide data’ input file (with weighting of 10 for substitutions, 20 for insertions/deletions and 1 for the poly-C stretch and other ambiguous regions); for Y-STR data, length-based alleles were coded into an ‘STR data’ input file (weighting of 5 for rapidly mutating [RM] Y-STRs and 10 for the remainder). Sequence-based alleles were coded into a ‘combined data’ input file of STR and binary data (weighting as before for Y-STR repeat components, plus weighting of 40 for flanking SNPs and indels). A network addressing only the sequence variation of these alleles regardless of allele length was constructed by coding into ‘binary data’ input files (weighting for events within RM-Y-STR repeat components of 5, other Y-STR repeat components 10, and any SNPs and indels 99). In all Y-STR-based networks the bilocal DYS385a,b was omitted, and for other occasional duplicated alleles the allele showing the higher depth of coverage was used in network construction; no null alleles were observed in the dataset.
Sample sets and their distinct features were encoded in Microreact [
Using the Promega PowerSeq™ Auto/Mito/Y System prototype, the sequences of 22 autosomal STRs (aSTRs), the control region of mtDNA, and 23 Y-STRs were determined in a single MPS multiplex in each of 362 males from the People of the British Isles (PoBI) sample set [
], which had previously been analysed using genome-wide SNP-chips. One individual was excluded from autosomal STR analysis due to low read counts affecting accurate genotyping of heterozygous alleles, leaving data on 361 samples for aSTRs and 362 for the uniparentally inherited markers.
Details of run statistics, comparative run efficiencies and coverage values for each marker type can be found in [
]. Fig. S2 shows mean coverage per marker for all individuals analysed.
3.1 Variation in aSTR sequences and population structure analysis
In total, 14,571 aSTR calls were generated in the set of 361 samples, representing 435 distinct sequence-based alleles across 22 STR loci (Table S3). No duplications or null alleles were observed. Of the 435 alleles, 31 have not yet been catalogued in the STRSeq database [
]; 14 include flanking variants, four carry indels, and 13 are combinations of repeat blocks not yet recorded. Within this PoBI sample subset, eight individuals also form part of the GBR (British in England and Scotland) population of the 1000 Genomes Project [
]: among the sequenced alleles in these individuals, 40 calls of 11 different aSTR flanking-region SNPs were identified, and all were found to be concordant with 1000 Genomes Project sequence data (Table S4). Amelogenin amplicons were also sequenced: a previously described SNP (rs188865418) was observed in AmelX in one individual, but based on sequence read depths there were no discordances with expected 46,XY sex chromosome constitutions.
Observed overall allele frequencies by length are given for the 22 aSTR loci in Table S5, and frequencies of the sequenced alleles in Table S3. Sequencing increases allele diversity by 87.5% overall compared to length-based designations, with only two loci that did not improve in resolution (D22S1045 and PentaE). The sample set contained 1592 pairs of alleles that were homozygous by length: 277 of these (17.4%) were distinguishable as isometric heterozygotes by sequencing. This discrimination increased the overall observed proportion of heterozygous alleles from a length-based 80.84% to a sequence-based 84.18%. Forensic statistics were calculated based on both aSTR lengths and sequences using STRAF 1.0.5 [
], and are given in Table S6. Random match probability is 1.25E-30, compared to a value of 6.72E-27 when length only (CE equivalent) is considered.
Allele frequency tables for each of the ten geographical regions are given in Table S7. The observed aSTR length and sequence variants were used to test FST-based population differentiation in the 361 samples at the level of 37 individual populations, ten regions (Fig. 1), and the various other regional groupings reflecting geography and history (Fig. S1). Only one of these comparisons showed any significant differentiation (Table S8): the PoBI-defined Central/South East England cluster was significantly different from the remainder of the dataset when aSTR sequences were considered, but not for aSTR lengths. As well as this FST-based assessment, the data (both length- and sequence-based) were analysed using STRUCTURE [
]. For both data types, the most likely value of k (cluster number) = 1, indicating a lack of population structure, and visual inspection of cluster plots at higher values of k confirms this (data not shown).
3.2 Variation in the mtDNA control region and population structure analysis
Among the 362 samples analysed for mtDNA, 308 distinct CR haplotypes were observed containing a total of 212 polymorphic sites; this represents a haplotype diversity of 0.9985 ( ± 0.0005), and a haplotype match probability of 0.0043.
Haplogroups were predicted from CR sequence data using both EMPOP [
]. The two approaches were generally concordant: 86% matched exactly (both broad and high-definition haplogroup designations were identical), and the remaining 14% matched at the broader haplogroup level, with one predictor being slightly more specific than the other.
In the 362 sequenced samples, CR sequences in 268 could be compared to available SNP chip data in the same individuals from PoBI, which overlapped with a total of 48 targeted mtDNA variants and detected 35 variable CR positions in this subset of samples [
]. Comparing MPS-derived variants in the control region itself with PoBI data indicated that SNP-chip calls here were unreliable (data not shown), likely reflecting the high degree of CR sequence variation and the consequent difficulty of reliable probe design for SNP typing. This does not, therefore, provide a useful direct validation for the data obtained here. As an indirect comparison, predicted mtDNA haplogroups based on CR sequences were compared with SNP chip-based haplogroup designations from whole mtDNA, which is based on more robustly called coding region variants (Table S3). Overall, 96% of samples were either concordant or broadly correctly predicted based on phylogenetic considerations: the 4% discordance is likely due to the limited predictability of some haplogroups from CR variants in our sequenced set.
The 308 distinct mtDNA CR haplotypes correspond to 13 broad haplogroups (Fig. 2). A summary of haplogroup frequencies for each of the ten large geographical regions is shown in Table 1, and higher resolution haplogroup predictions for each sample are listed in Table S1. The range of haplogroups observed in the sample is typical of that observed in previous surveys of western Europe [
], and visual inspection of haplogroup distributions shows no clear geographical pattern (Fig. 2a). The observed CR variants were used to build a MJ phylogenetic network; as with Fig. 2a, this shows no obvious structuring relating to geographic origin (Fig. 2b), but haplotype clusters correlate well with the predicted broad haplogroups (Fig. 2c) and also with the more highly resolved haplogroup designations (Fig. S3).
Fig. 2Geographical distributions of predicted mtDNA haplogroups, and networks of control region (CR) haplotypes. a) Map showing distribution of predicted haplogroups. Pie charts and their sectors are proportional to sample size, and coloured by broad haplogroup as shown in the key. b) Median-joining (MJ) haplotype network based on mtDNA CR variants and coloured by assignment to the ten regional groupings. Abbreviations as in Fig. 1. c) MJ haplotype network based on mtDNA CR variants and coloured by predicted broad haplogroup. For correspondence with higher resolution haplogroups see Fig. S3. In both networks, circles represent haplotypes with areas proportional to sample size. Lines between circles represent mutational steps, the shortest line representing a single-nucleotide change.
Table 1Frequencies of predicted mtDNA haplogroups in ten geographical regions.
Geographical region
Hg
SCO
IRE
NW
NE
WAL
WM
EM
EA
SW
SE
All
H(xH1,H2)
0.105
0.091
0.233
0.167
0.152
0.138
0.250
0.207
0.207
0.125
0.171
H1
0.289
0.182
0.200
0.167
0.152
0.207
0.205
0.138
0.190
0.208
0.196
H2a
0.132
0.182
0.067
0.056
0.087
0.103
0.045
0.172
0.103
0.188
0.113
HV
0.026
0.000
0.000
0.000
0.022
0.000
0.045
0.000
0.052
0.000
0.019
I
0.026
0.000
0.033
0.000
0.000
0.034
0.000
0.069
0.017
0.042
0.022
I1a1
0.000
0.000
0.000
0.000
0.065
0.034
0.000
0.000
0.000
0.021
0.014
J1
0.026
0.045
0.000
0.000
0.000
0.034
0.000
0.000
0.000
0.000
0.008
J1b
0.026
0.000
0.000
0.000
0.022
0.000
0.000
0.000
0.000
0.000
0.006
J1c
0.132
0.091
0.033
0.056
0.109
0.034
0.068
0.034
0.086
0.083
0.077
J2
0.000
0.000
0.000
0.000
0.000
0.034
0.023
0.034
0.017
0.042
0.017
K
0.026
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.003
K1
0.026
0.136
0.033
0.000
0.043
0.034
0.091
0.034
0.052
0.000
0.044
K2
0.000
0.045
0.033
0.056
0.000
0.000
0.023
0.034
0.034
0.042
0.025
N1a1
0.000
0.000
0.000
0.000
0.000
0.000
0.023
0.000
0.000
0.021
0.006
P2
0.000
0.000
0.033
0.000
0.000
0.034
0.000
0.000
0.000
0.000
0.006
R9
0.000
0.000
0.000
0.000
0.000
0.034
0.000
0.000
0.000
0.000
0.003
T
0.000
0.000
0.067
0.056
0.087
0.000
0.000
0.034
0.000
0.000
0.022
T1
0.000
0.000
0.033
0.167
0.022
0.034
0.000
0.069
0.000
0.021
0.025
T2
0.000
0.000
0.033
0.111
0.065
0.034
0.045
0.069
0.086
0.021
0.047
U2,U3,U4, U6,U7
0.053
0.045
0.033
0.000
0.022
0.034
0.023
0.034
0.052
0.000
0.030
U5a
0.053
0.091
0.067
0.167
0.043
0.034
0.068
0.034
0.052
0.104
0.066
U5b
0.053
0.000
0.000
0.000
0.022
0.138
0.091
0.000
0.017
0.042
0.039
V
0.026
0.045
0.000
0.000
0.043
0.000
0.000
0.000
0.000
0.021
0.014
W
0.000
0.000
0.067
0.000
0.022
0.000
0.000
0.034
0.000
0.021
0.014
X2
0.000
0.045
0.033
0.000
0.022
0.000
0.000
0.000
0.034
0.000
0.014
n
38
22
30
18
46
29
44
29
58
48
362
Hg: haplogroup; SCO: Scotland; IRE: Ireland plus the Isle of Man; NW: North West; NE: North East; WAL: Wales, WM: West Midlands; EM: East Midlands; EA: East; SW: South West; SE: South East.
The observed CR variants were used in FST-based tests of population differentiation at the level of 37 individual populations, ten regions (Fig. 1) and the various other regional groupings (Figure S1). None of these comparisons showed any significant differentiation (Table S9). The mtDNA CR data obtained here therefore suggest a high degree of homogeneity, and no evidence of stratification of maternal lineages in the British Isles.
3.3 Variation in Y-STR sequences and population structure analysis
A total of 8334 Y-STR alleles were sequenced in 362 samples (assuming two alleles for each homoallelic combination of DYS385a and b). Allele structures and sequence strings are given in Table S10, and length-based (CE equivalent) and sequence-based haplotypes are listed in Table S11. The 207 distinguishable alleles were compared with a set of 2311 (267 distinct structures) sequenced using the same technology in a diverse global sample of 100 males [
], revealing 49 previously unobserved alleles. Sixteen of these are singletons resulting from SNPs in flanking regions or within repeat arrays, while nine displayed lengths at the peripheries of the previously observed allele size ranges. The called variants included eight supernumerary alleles at six loci (descriptions and interpretations are given in Table S12).
As has been observed previously in sequence-based analysis of Y-STR alleles (e.g. [
]), MPS provides an increase in diversity: in this sample the 207 distinguishable alleles represent an increase of 33% compared to the 139 that would be detectable by CE (Figure S4). A total of 360 distinct sequence-level Y-STR haplotypes were found in the 362 analysed samples, two sample pairs (Table S11) sharing variants of both length and sequence; in both cases these pairs are in the same sub-population (one pair in Banff, and one pair in Forest of Dean). This represents a haplotype diversity of 0.999969 ( ± 0.000001), and a haplotype match probability of 2.79E-03. Forensic statistics were calculated based on Y-STR lengths and sequences using STRAF 1.0.5 [
CE-equivalent Y-STR haplotypes were used to predict MSY haplogroups via the NevGen tool. Of the 362 samples, predictions in 267 (74%) could be compared to directly determined haplogroups based on available Y-chromosomal SNP chip data from the same individuals. These comparisons were completely concordant (Table S1), giving us confidence that predictions are accurate for this population.
The analysed samples belong to ten broad predicted haplogroups (Fig. 3), typical of those found in previous studies of western Europe [
]; frequencies are represented in Fig. 3a and listed in Table 2. Visual inspection of the geographical distribution suggests some degree of structuring - for example, the generally high-frequency haplogroup R1b is most prevalent in the west, with some sub-populations in Cornwall and Wales being fixed for this lineage.
Fig. 3Geographical distributions of predicted MSY haplogroups, and networks of Y-STR haplotypes. a) Map showing distribution of predicted haplogroups. Pie charts and their sectors are proportional to sample size, and coloured by broad haplogroup as shown in the key. b) Median-joining (MJ) haplotype network based on Y-STR length variants and coloured by assignment to the ten regional groupings. Abbreviations as in Fig. 1. c) MJ haplotype network based on Y-STR length variants and coloured by predicted broad haplogroup. In both networks, circles represent haplotypes with areas proportional to sample size. Lines between circles represent mutational steps, the shortest line representing a single repeat unit change. For networks including Y-STR information from sequencing, see Figure S5.
Table 2Frequencies of predicted MSY haplogroups in ten geographical regions.
Geographical region
Hg
SCO
IRE
NW
NE
WAL
WM
EM
EA
SW
SE
All
E1b
0.000
0.000
0.000
0.056
0.022
0.069
0.023
0.000
0.000
0.021
0.017
F
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.021
0.003
G2a
0.000
0.091
0.067
0.056
0.022
0.034
0.023
0.034
0.000
0.021
0.028
I1
0.053
0.045
0.133
0.222
0.043
0.138
0.136
0.103
0.086
0.208
0.113
I2
0.000
0.000
0.033
0.111
0.000
0.000
0.000
0.034
0.000
0.000
0.011
I2a
0.026
0.136
0.067
0.111
0.043
0.000
0.023
0.138
0.086
0.104
0.069
J2a
0.000
0.000
0.033
0.056
0.022
0.000
0.023
0.034
0.017
0.000
0.017
J2b
0.026
0.000
0.000
0.000
0.000
0.000
0.000
0.034
0.000
0.042
0.011
N1c
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.017
0.000
0.003
R1a
0.158
0.045
0.033
0.000
0.022
0.000
0.045
0.034
0.052
0.021
0.044
R1b
0.737
0.682
0.633
0.389
0.826
0.759
0.705
0.586
0.741
0.563
0.682
T
0.000
0.000
0.000
0.000
0.000
0.000
0.023
0.000
0.000
0.000
0.003
n
38
22
30
18
46
29
44
29
58
48
362
Hg: haplogroup; SCO: Scotland; IRE: Ireland plus the Isle of Man; NW: North West; NE: North East; WAL: Wales, WM: West Midlands; EM: East Midlands; EA: East; SW: South West; SE: South East.
Haplotypes based on Y-STR allele lengths were used to build a MJ network. No obvious geographical structure emerged from this network (Fig. 3b), but haplotype clusters correlate well with the predicted broad haplogroups (Fig. 3c). Sequencing Y-STR alleles can reveal an additional layer of variation within arrays, repeat pattern variation (RPV), which involves changes in the number of repeat units in the different components of a compound array. Such variation might reveal clearer population structuring and haplogroup association. The effect of including these RPVs within a network is shown in Figure S5a-c. Again, there is no obvious stratification relating to the ten large geographic regions, but the clusters correlate well with the predicted major haplogroups and even better with the more highly resolved haplogroup designations. Alleles can be further resolved by taking SNPs and indels into consideration; many of these occur in the flanking region outside the array itself. In networks based on complete allele sequences (Figure S5d-f) clusters again associate well with the predicted main haplogroups, and even more strongly with the more resolved haplogroup designations.
To better understand the effect of sequence variants, a network using only Y-STR sequence features, flanking SNPs and indels was built and annotated to show the characteristic variants that are associated with particular haplogroups (Fig. 4). Superhaplogroup P samples carry an insertion of two times two extra repeat units ([TAGA]2[TACA]2) in their DYS635 alleles; within that phylogenetic group, hg R1a samples have a characteristic DYS393 structure in which the first nucleotide of the first repeat is changed from A to C; furthermore, a sublineage of R1a (R1a-GML2 or ~R1a1a1a [
]) has a DYS390 structure including a last repeat change from TAGA to GAGA. In another example, a SNP within the last repeat of DYS458 is characteristic of hg J2a Y-chromosomes. However, there are also sequence features that can arise multiple times; for example, the initial CTG-repeat of DYS481 can expand to (CTG[2]), which is characteristic of all hg G2a lineages tested here, but also occurred in hg I2. Such Y-haplogroup associated sequence variants have also been observed in other datasets [
Fig. 4Network representing array sequence features, SNPs and indels of the sequenced Y-STRs. Length information is omitted to show only structural features of arrays, and SNPs and indels within arrays or within flanking sequences. Circles represent haplotypes with areas proportional to sample size, and coloured by haplogroup as shown in the key to the left. Lines between circles represent mutational steps, the shortest line representing a single mutational event.
As noted above, inspection of the geographical distribution of haplogroups suggests some structure in MSY lineages (Fig. 3a). Both observed Y-STR length and sequence variation were used in FST-based tests (Table S14) of population differentiation at the level of individual populations, regions (Fig. 1) and the various other regional groupings (Figure S1). This revealed a contrast to the lack of population structure observed for aSTRs and mtDNA: Table 3 brings together comparable results for the three different genome components. Based on both allele lengths and sequences, AMOVA analysis shows no significant overall differentiation at the level of 37 populations, ten regions, and five PoBI cluster-defined regions; FST values are similar between the length- and sequence-based comparisons in each case. In pairwise comparisons for the other geographical divisions, the region of Anglo-Saxon influence and the Central/South East England cluster of the PoBI study are significantly differentiated from the rest both at the level of Y-STR sequences and lengths. The highest and most significant FST values are observed for the Central/South East England cluster comparisons. By contrast, the historical Danelaw region shows no significant differentiation either by Y-STR length or based on sequence data. Fig. 5 shows MDS plots at the population level, based on FST values for both length- and sequence-based Y-STR data.
Table 3AMOVA and pairwise FST results for different genome components.
Fig. 5MDS plots illustrating regional differentiation according to historical and cultural divisions, based on FST calculated from Y-STR lengths and sequences. a) Plot based on Y-STR length data; b) plot based on Y-STR sequence data. Populations within regions of Anglo-Saxon influence or within the Danelaw are highlighted by colours as shown in the key below. Population abbreviations: ARG: Argyll; BAN: Banff And Buchan; BER: Berkshire + Buckinghamshire; CHE: Cheshire; COR: Cornwall; CUM: Cumbria; DEV: Devon; DOR: Dorset; ESS: Essex + Cambridgeshire + Bedfordshire; FOD: Forest Of Dean; GLO: Gloucestershire; HAM: Hampshire; HER: Herefordshire + Shropshire; IoI: Island of Ireland; IOM: Isle of Man; KEN: Kent; LAN: Lancashire; LEI: Leicestershire; LIN: Lincolnshire; MWA: Mid Wales; NEA: North East; NOR: Norfolk; NOT: Nottinghamshire + Derbyshire; NPE: North Pembrokeshire; NTH: Northampton; NWA: North Wales; ORK: Orkney; OXF: Oxfordshire; SCO: Scotland; SPE: South Pembrokeshire; STA: Staffordshire; SUF: Suffolk; SUS: Sussex; WAL: Wales; WIL: Wiltshire; WOR: Worcestershire + Warwickshire; YOR: Yorkshire.
The Promega PowerSeq™ Auto/Mito/Y System prototype is one of several MPS systems developed to simultaneously amplify multiple forensically informative segments of the human genome. Its 56 amplicons allow variation in the biparentally inherited autosomes and the uniparentally inherited MSY and mtDNA to be surveyed via a single PCR, and here we have applied this to a carefully recruited population sample from the British Isles, allowing us to assess the population structure detectable using sequence-based approaches. From the People of the British Isles cohort [
] 362 males were selected for their wide geographical coverage, and their forensic markers sequenced, including 22 aSTRs, the mtDNA control region (in ten amplicons), and 23 Y-STRs. The autosomal and MSY markers are equivalent to the widely used PowerPlex Fusion and PowerPlexY 23 loci (Promega), and therefore our study contributes reference data relevant to users of those kits.
As expected from previous STR-based studies, sequencing of both autosomal and Y-STRs increases observed allelic diversity compared to length-based approaches. For aSTRs, this increase was 87.5%, reducing random match probability from 6.72E-27 to 1.25E-30, while for Y-STRs allelic diversity increased by 33%, though with no concomitant change in observed haplotype diversity or random match probability.
Autosomal STRs are suited to individual identification thanks to their independent assortment, multiallelic nature, and high mutation rates, which lead to high heterozygosity. FST-based analysis of a very large global set of forensic aSTR genotypes (the 13 CODIS loci) suggested that these forensic loci systematically underestimate the levels of genetic variation among human populations, likely due to their ascertainment for inter-individual discrimination [
]. By contrast, another study applying the model-based clustering algorithm STRUCTURE in the Human Genome Diversity Project (HGDP) panel showed that the CODIS loci produce clustering patterns similar to those of other independent sets of 13 tetranucleotide aSTRs [
], and concluded that while forensic aSTRs display relatively low FST, their high heterozygosity strengthens ancestry inference compared to less heterozygous STR sets. MPS analysis of aSTRs, by increasing observed heterozygosity and by identifying rare and locally distributed sequence variants, might increase ancestry informativeness; analysis at the continental scale [
In fact, our analysis at the local scale of the UK using both FST and STRUCTURE analyses finds little evidence for autosomal population structure (Table 3; Table S8); the only significant signal observable is of differentiation between the major ‘red’ Central/South East England cluster of PoBI ([
]; Fig. 1a) and the rest of the dataset. The fact that this differentiation is only observed with aSTR sequences rather than lengths gives some support to the idea that such data are more structured geographically. Given the conclusions of genome-wide SNP analysis, which considered 500,000 biallelic markers in 2039 individuals [
], the relative lack of observable population structure with 22 aSTRs and a smaller sample size is not surprising. Analysis of the SNP-chip data using a frequency-based clustering algorithm, STRUCTURE, was able to identify just three clusters, corresponding to Orkney, Wales, and the remainder of the dataset (primarily England and Scotland). Only with the application of the LD-sensitive fineSTRUCTURE was the subtle structure of multiple geographically differentiated clusters evident [
Mitochondrial DNA is, of course, not individually identifying, although the CR sequences analysed here yield a haplotype diversity of 0.9985 ( ± 0.0005). At a global scale mtDNA is well known to show strong population structure [
] with different spectra of haplogroups predominant in the indigenous peoples of different continents. At the more local scale of Western Europe, however, there is a lack of maternal lineage differentiation [
]. In this context, our finding that mtDNA CR haplotypes show no significant geographical differentiation within Britain (Table 3; Table S9) is not surprising; it may be ascribable to the effect of patrilocality (the tendency of women to move to a husband’s place of birth on marriage) which over time is expected to have a homogenising effect on mtDNA landscapes [
In contrast to the maternal and biparental picture, the distribution of Y-STR haplotypes showed marked and significant substructuring of male lineages (Table 3; Table S14). In all comparisons, the use of sequence- as opposed to length-based allele designations does not reveal higher degrees of geographical differentiation. The highest FST values are seen for the Anglo-Saxon regional comparison [
], which could be taken to support the idea of Anglo-Saxon mass migration around 500 CE as the major historical source for Y-chromosomal differentiation within the British Isles (as has been previously suggested [
]). Given this differentiation, it is unsurprising to see significant differences when the PoBI Central/Southern England comparison is carried out. It is also unsurprising that the Danelaw (which spans from the east to the west of Britain) shows no significant differentiation from the remainder of the sample set (Table 3). However, here we have performed an overall Y-chromosomal comparison: comparisons of the Danelaw differentiation using specific lineages of putative Scandinavian origin may show significant differences, as has been indicated for haplogroup R1a [
Our study focused on a sample set that was carefully curated to reflect at least three generations of local ancestry in rural locations. It is thus highly atypical of the current population of the United Kingdom which, like those of many modern nations, contains a rich diversity of groups of much more diverse ancestries, and is primarily (81.5%, based on the 2011 England & Wales census; www.ethnicity-facts-figures.service.gov.uk) resident in urban areas. The census lists 18 standardised ethnic groups, and among these the PoBI cohort best corresponds to the ‘White British’ category. The overall census proportion of that group is high, at 80.5%, but varies regionally from 44.9% in London to 93.6% in the North-East (and 97.1% in Scotland; www.scotlandscensus.gov.uk). Other ethnic groups have a tendency to reside in urban areas - for example 98.2% of people self-identifying as Black African live in cities compared to 78.2% of White British. Beyond the dominant metropolis of London, particular ethnic groups tend to be concentrated in specific regions - for example self-identified Asians in the West Midlands constitute 10.8% of that local population. This ethnic diversity and its non-uniform geographical distribution, together with the well-established differentiation between continental groups for aSTRs [
] suggests that the general population of the UK today is in practice highly structured, a fact that must be accounted for in any consideration of forensic evidence.
TIH was supported by a Biotechnology and Biological Sciences Research Council iCASE studentship, grant BB/M016706/1, partnered by Key Forensic Services. WFB and KH were supported by Wellcome Trust grants 072974 and 088262. We thank all DNA donors, the many workers who supported the PoBI project, and Andy Hopwood and Nikki Peake of Promega for access to the PowerSeq™ Auto/Mito/Y System prototype kit. We acknowledge NUCLEUS Genomic Services at the University of Leicester for training and access to Illumina MiSeq sequencing. This research used the SPECTRE High Performance Computing Facility at the University of Leicester for data analysis.
FDSTools: a software package for analysis of massively parallel sequencing data with the ability to recognise and correct STR stutter and other PCR or sequencing noise.