• A small panel of 24 highly informative microhaplotype loci has been identified.
• The MH panel is good for individualization, ancestry inference, and mixture analysis.
• The MHs are more informative than the 24 augmented CODIS STRs typed by CE.
• The 24 MH panel is good enough to be a stand-alone panel for forensic casework.
Abstract
A small panel of highly informative loci that can be genotyped on the same equipment as the standard CODIS short tandem repeat (STR) markers has strong potential for application in forensic casework. Single nucleotide polymorphisms (SNPs) can be typed by a couple of methods on capillary electrophoresis (CE) machines and on sequencers, but the amount of information relative to the laboratory effort has hindered use of SNPs in actual casework. Insertion-deletion markers (InDels) suffer from similar problems. Microhaplotypes (MHs) are much more informative per locus but have similar technical difficulties unless they are typed by massively parallel sequencing (MPS). As forensic labs are acquiring sequencing machines, MHs become more likely to be used in casework, especially if multiplexed with STRs. Here we present the details of a multipurpose panel of 24 MHs with the highest effective number of alleles (Ae) from previous work. An augmented STR panel of 24 loci (20 CODIS markers plus four commonly typed STRs) is also considered. The Ae and ancestry informativeness (In) distributions of these two datasets are compared. The MH panel is shown to have better individualization and population distinction than the augmented CODIS STRs. We note that the 24 MHs should be better for mixture analyses than the STRs. Finally, we suggest that a commercial kit including both the standard CODIS markers and this set of 24 MH would greatly improve the discrimination power over that of current commercial assays.
]. Such short molecular regions with several SNPs within an amplicon can have very high heterozygosity when evaluated as haplotypes. Such regions are best typed by massively parallel sequencing (MPS). Several panels of microhaplotypes have been proposed and a few have been typed by MPS on several populations [
] and can provide highly significant individualization. Presently, use of MHs in casework only occurs, if at all, subsequent to traditional short tandem repeat (STR) analysis. Microhaplotypes, as proposed in 2013 [
]. The MHs provide excellent mixture, ancestry analysis, and individualization and the CODIS STRs provide nearly unique individualization and mixture interpretation when probabilistic genotyping is used.
It is likely that the CODIS STR loci will remain the standard for forensic casework for a considerable period because of the large reference databases of criminal results for the CODIS markers and the financial burden of implementing a new technology in forensic laboratories. Any new marker routinely used in casework will need to be used in conjunction with the standard STR markers and it should use the same PCR and analysis equipment. With current capillary electrophoresis (CE) methods the likelihood of incorporating MHs is low. However, as forensic laboratories begin to have MPS equipment and sequencing begins to be used for STR typing, the possibility of incorporating MHs becomes tenable. For MHs to be incorporated into routine casework it would be most efficient to multiplex STRs and MHs into a single assay. The ForenSeq DNA Signature Prep Kit (Verogen) provides a proof of principle that multiplexing STRs and SNP-based markers is possible. The question then arises: which MHs should be incorporated?
Our long term objective has been to identify MHs worthy of converting to MPS [
]. To evaluate whether a set of highly informative loci might be appropriate for multiplexing with STRs in forensic casework applications we selected 24 loci to match the number in the augmented STR set (Table 1 and Table S1). We could have used a larger number of microhaplotypes, but the goal was a subset with the same number of loci as the STR set. This eliminates the statistical issues arising from using different numbers of markers when comparing forensic characteristics of different panels. We chose the MHs with the highest average effective number of alleles (Ae) values [
] by MH name. Ideally the additional information provided by these 24 MHs would help in addressing all of the major forensic questions: individualization, ancestry inference, kinship analysis, and mixture resolution.
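The ranking criterion can be illustrated with a minimal sketch. It assumes the standard definition of the effective number of alleles, Ae = 1 / Σ p_i², computed per population and then averaged across populations; the function names and the haplotype frequencies below are hypothetical and are not data from this study.

```python
def effective_number_of_alleles(freqs):
    """Return Ae = 1 / sum(p_i^2) for one locus in one population."""
    if abs(sum(freqs) - 1.0) > 1e-6:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 / sum(p * p for p in freqs)


def average_ae(freqs_by_population):
    """Average the per-population Ae values for one locus."""
    values = [effective_number_of_alleles(f) for f in freqs_by_population.values()]
    return sum(values) / len(values)


# Hypothetical haplotype frequencies for one locus (not data from this study).
example_locus = {
    "PopA": [0.25, 0.25, 0.25, 0.25],  # four equally frequent haplotypes
    "PopB": [0.50, 0.30, 0.20],        # three haplotypes, uneven frequencies
}

if __name__ == "__main__":
    for name, freqs in example_locus.items():
        print(name, round(effective_number_of_alleles(freqs), 3))
    print("average Ae:", round(average_ae(example_locus), 3))
```

Under this definition, a locus with four equally frequent haplotypes has Ae = 4, and any unevenness in the frequencies lowers Ae below the count of observed haplotypes.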
Table 1. The 24 microhaplotypes. Characteristics for the 79-population analysis.
]. The actual Ae and In values are given in Supplemental Table S1. The sequence amplicons used for these 24 MHs, as for all 90 MHs, are concentrated around the upper end of the size range of the CODIS STR amplicons (~100 bp to ~300 bp). However, these MH amplicons could be made smaller for many loci should that be an issue in optimizing the multiplex (Table S1). While most microhaplotypes published to date have relatively low Ae values (< 4), this selection shows that it is possible to find MHs that are highly heterozygous, with Ae values well above 4. Such loci are rare but do exist and can be found [
Fig. 1. Scatterplot of the 24 microhaplotypes by their Ae and In values for 79 populations. The values are in Supplemental Table S1. The values plotted are based on Pakstis et al.
The choice of 24 MH loci for our comparison study was motivated by the availability of data on a global distribution of populations with genotypes for the 20 CODIS markers plus four other STR loci commonly studied. The STR allele frequencies were downloaded for the 57 available population samples from the popSTR [
] online database (http://spsmart.cesga.es/about.php?dataset=strs_local) for the 20 Core CODIS STRs plus D6S1043, SE33, Penta D, and Penta E. The database is version 5.1.1 of SPSmart and the data were last updated in July of 2015 (dbSNP version 132). Some of the population samples downloaded lacked allele frequencies for various STRs. A final set of 32 populations had data on all 24 loci. We refer to these STRs as the augmented CODIS markers. For comparison to the MH data in Fig. 1, the Ae and In values of the 24 augmented CODIS markers are plotted in Fig. 2 and the values are in Table S2. Table S3 shows the chromosomal locations of the 24 STRs and the 24 MHs. Having selected these two sets of markers—24 MHs and 24 STRs—we proceed to document and compare their forensic characteristics. We show that the MH panel would provide valuable additional information if integrated into casework analyses; indeed, the 24 MHs are generally better than the 24 augmented CODIS markers.
Fig. 2. Scatterplot of 24 augmented CODIS markers for Ae and In based on 32 populations. The same scale is used for both Fig. 1 and Fig. 2 to allow better visual comparison. Because SE33 has such a high average Ae (14.69, with In = 0.674), it is out of range and does not appear in this plot.
The 24 augmented CODIS markers have a noteworthy range of average Ae values (Fig. 2). As a test of the generality of this specific finding, we compared the popSTR average Ae values by locus with individual locus values for the four U.S. Census groups in the NIST data. The graph (Fig. S1), sorted by the augmented CODIS values for popSTR loci, shows that all of the data sources have a high correlation of values for the various loci.
2. Documentation of value for individualization
The population specific combined random match probability (RMP) of these markers shows the same geographic gradient seen for other panels of microhaplotypes (Fig. 3) [
], the RMP range necessarily involves much larger values: from 10⁻²⁴ in South American Indians down to 10⁻³⁹ for most African populations. This gradient reflects the most common genotype frequency which ranges from 10⁻¹⁷ down to 10⁻³⁸. Stated in words, the American Indians have the fewest different alleles and therefore the highest genotype frequencies leading to the highest probability of a random match. All other populations show values intermediate between the larger values for the American Indian populations and the much smaller values in the African populations. The African populations have the most alleles and the lowest genotype frequencies and the lowest probabilities of a random match.
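The two quantities discussed here can be sketched as follows. This is a minimal illustration, assuming Hardy-Weinberg genotype proportions, no population-substructure correction, and independence across loci; the function names and the example frequencies are hypothetical and the calculation is not necessarily the one used for Fig. 3.

```python
import math
from itertools import combinations


def genotype_frequencies(freqs):
    """Expected genotype frequencies at one locus under Hardy-Weinberg."""
    homozygotes = [p * p for p in freqs]
    heterozygotes = [2 * p * q for p, q in combinations(freqs, 2)]
    return homozygotes + heterozygotes


def locus_rmp(freqs):
    """Random match probability: sum of squared genotype frequencies."""
    return sum(g * g for g in genotype_frequencies(freqs))


def locus_most_common_genotype(freqs):
    """Frequency of the single most common genotype at the locus."""
    return max(genotype_frequencies(freqs))


def combined_neg_log10(loci, per_locus):
    """-log10 of the product of a per-locus statistic over independent loci."""
    return -sum(math.log10(per_locus(freqs)) for freqs in loci)


# Hypothetical panel: 24 loci, each with four equally frequent haplotypes.
panel = [[0.25, 0.25, 0.25, 0.25]] * 24

if __name__ == "__main__":
    print("-log10 RMP:", round(combined_neg_log10(panel, locus_rmp), 2))
    print("-log10 most common genotype:",
          round(combined_neg_log10(panel, locus_most_common_genotype), 2))
```

Summing logarithms rather than multiplying raw probabilities keeps the very small combined values numerically stable and yields directly the negative-logarithm scale used in the plots.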
Fig. 3. A negative logarithm plot of the population specific Random Match Probabilities and most common genotype frequencies for the 24 MH. Populations are in the same order as in Pakstis et al.
and Fig. 4. For most of the world the RMP is around or less than 10⁻³⁰. See Table S4 for the population names corresponding to the 3-character abbreviations.
Fig. 4. A negative logarithm plot of the population specific RMP values for the augmented CODIS panel of 24 loci. (Data from popSTR downloaded January, 2022.) The values for this global set of populations fall around 10⁻²⁸ except for the Pacific and Native American populations. Note that the Dominicans are Afro-Caribbean and are “American” by geography, not ancestry.
] (RMP values for Africans as low as 10⁻¹¹⁵), they are also a few orders of magnitude smaller than the range of the values for the augmented CODIS STR panel (Fig. S2 compares Fig. 3 with Fig. 4). The stand-alone values for these 24 MHs are at least as probative as the STRs for most of the world. The relevant RMP when MHs and STRs are multiplexed would be at most 10⁻⁶⁰. Thus, this small but highly selected panel has forensic value, exceeding the individualization value of the augmented CODIS panel. The information can be combined with the STR data if the markers are genetically independent. As seen in Supplemental Table S3 most pairs of markers are more than a megabase apart. The few that are closer are still far enough apart that no LD is expected.
Kinship and parentage testing are related to the heterozygosity of the markers [
]. The Ae value can be a measure of the statistical power for parentage testing. As average Ae values are higher the statistical power for clarifying more distant relationships is also higher [
3. Documentation of value for biogeographic ancestry
Several methods are used to show the ability of a panel of ancestry informative markers (AIMs) to illuminate population relationships. STRUCTURE and Principal Components Analysis (PCA) are two that are commonly used. We have used both. We note that many previously published studies of SNPs on these population samples have demonstrated that they show no significant deviation from random mating expectations.
] was used to evaluate and illustrate the clustering of individuals into a predefined number of genetically similar groups based on the set of 24 MH loci. The STRUCTURE analysis parameters were: 10,000 burn-in iterations and 10,000 Markov Chain Monte Carlo iterations, admixture model, correlated allele frequencies, and 20 independent replicates per predefined number of clusters (K) from K = 5 to K = 10. The input data file contained the genotypes of each individual, with no prior information on population membership. Graphic output then grouped the individuals into their populations of origin with cluster inference indicated by color. The results for K = 8 and K = 9 are shown in Fig. 5.
Fig. 5. STRUCTURE results for highest likelihood runs at K = 8 and 9 for the 24-microhaplotype, 79-population dataset. Each fine vertical line represents one individual. Blowups are shown for regions of several small populations to make the clustering and population labels clearer.
] and in Fig. 3 but here the 79 populations were analyzed using only the 24 MH selected for this study (Fig. 1; Table 1 and S1). The 24 MH STRUCTURE result in Fig. 5 is very similar to the corresponding figure of K = 6 and K = 7 in Pakstis et al. [
PCA was performed with XLSTAT 2017 (http://www.xlstat.com/en/about-us/company.html) to compare the similarities and differences among the populations. The PCA of the 79 populations based on the 24 MH loci (Fig. 6) shows a complete separation on PC1 of the sub-Saharan African populations from the remaining populations and a distribution of those remaining on PC2 from a cluster of Europeans to a cluster of East Asians. This analysis and the STRUCTURE analysis document that the MH markers contain significant global ancestry information. Although these 24 MHs were selected for high Ae, they also have high In: the In range is from 0.25 to 0.86 (Fig. 1, Table S1).
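For readers unfamiliar with In, the following minimal sketch assumes that In denotes Rosenberg's informativeness for assignment, computed with natural logarithms from the allele (haplotype) frequencies of one locus in K populations: In = Σ_j [ −p̄_j ln p̄_j + (1/K) Σ_i p_ij ln p_ij ], where p̄_j is the mean frequency of allele j across populations. The function name and the example frequencies are hypothetical.

```python
import math


def informativeness_in(freqs_by_pop):
    """Informativeness for assignment (In) for one locus.

    freqs_by_pop: list of K lists, each holding the frequencies of the same
    alleles (in the same order) in one population. Terms with frequency 0
    contribute 0, following the convention 0 * ln(0) = 0.
    """
    k = len(freqs_by_pop)
    n_alleles = len(freqs_by_pop[0])
    total = 0.0
    for j in range(n_alleles):
        mean_p = sum(pop[j] for pop in freqs_by_pop) / k
        if mean_p > 0:
            total -= mean_p * math.log(mean_p)
        for pop in freqs_by_pop:
            if pop[j] > 0:
                total += pop[j] * math.log(pop[j]) / k
    return total


if __name__ == "__main__":
    # Two hypothetical populations fixed for different alleles: maximal In.
    print(round(informativeness_in([[1.0, 0.0], [0.0, 1.0]]), 4))
    # Identical frequencies in both populations: no ancestry information.
    print(round(informativeness_in([[0.6, 0.4], [0.6, 0.4]]), 4))
```

Under this definition In is zero when all populations share the same frequencies and reaches ln K when each population is fixed for a different allele, which is why a locus can have high Ae yet only modest In if its many alleles are similarly frequent everywhere.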
Fig. 6. PCA of 79 populations based on the 24 microhaplotypes.
The 24 augmented CODIS markers have not been the subject of any STRUCTURE analyses or PCA that we are aware of. The popSTR dataset does not contain the individual-specific genotype profiles that would allow STRUCTURE analysis of the populations. However, other statistical approaches have shown that they can provide some ancestry information [
]. We have used PCA on the population frequencies of the 24 STR loci (Fig. 7). We see that relationships similar to those in Fig. 5 and Fig. 6 exist. While the populations with data for the STR PCA are fewer and different from the 79 used in Fig. 5 and Fig. 6, a global distribution of populations exists in both datasets. It is clear that combining the MH and STR data on a single set of populations would at minimum reinforce the major clusters and may clarify the relationships of many intermediate populations.
Fig. 7. PCA of 32 populations based on the augmented CODIS data from the popSTR database.
]. High Ae is especially relevant to better mixture resolution. We note that mixture analyses have two different objectives. One is a forensic question of whether a known individual might or might not have contributed to a mixture. While such probabilistic genotyping [
] is now common with forensic STR loci, interpretation requires population allele frequencies and is complicated by stutter especially when a minor contributor is in the stutter range of a major contributor to the mixture. Microhaplotypes have an advantage in absence of stutter. For many microhaplotypes good population frequency data, including for the 90 MH that include these selected 24, are becoming available [
]. The second objective is the complete deconvolution of the mixture to estimate the genotypes contributing to the mixture. Again, absence of stutter helps resolve the alleles in a mixture but only quantitative data and allele frequency data can provide the additional conversion into genotypes of individuals. The combinatorics of multiple loci complicates knowing the multiple locus genotype even with perfect single locus deconvolution.
The maximum amount of information about a two-person mixture occurs when there are four alleles (haplotypes) observed at a locus in the analysis of a mixture. Obviously, this cannot occur when the haplotyped locus has only two or three alleles in the population. If only two alleles (haplotypes) exist in the population, as is generally the case for individual SNPs, one can infer a mixture if there are very different quantitative values for the two alleles. If three alleles exist in the population, similar quantitative differences will allow some inference of genotypes although the existence of a mixture can be certain if three alleles are seen in an analysis [
]. It is only possible to see four alleles in a two-person mixture if at least four or more haplotypes (alleles) exist in the population. The probability of fully resolving the mixture at a locus will be a function of the allele frequencies in the population.
It is possible to estimate a probability of seeing three or four alleles at a locus as proof that a mixture exists if some simplifying assumptions are made. Actual probabilities are functions of the array of allele frequencies of the persons in the mixture. That is too complex to deal with other than by simulations; instead, a simplifying assumption of an effective number of alleles is an approximation. For simplicity we are using the immediately lower integer for each Ae value to give a minimum estimate of observing all four alleles in a 2-person mixture (Table 2).
Table 2. Probabilities of finding four different alleles in a mixture detection analysis for a two-person mixture. The Ae values are based on the observation for the 24 loci in Table S1 using the 30-population dataset sequenced in Gandotra et al.
]. In this case we are calculating the probability of seeing 4 alleles for a two-person mixture given the number of alleles with the integer Ae value. Table 2 shows that whether we use the Ae values based on 30 populations [
], the probability of seeing at least one locus of the 24 loci with 4 alleles is greater than 0.999. Note, this is a conservative estimate using the lower bound of each Ae interval; the true estimate using the exact Ae values would be higher. As the Ae increases, the number of combinations of four different alleles increases even as the allele frequencies become smaller. The result is an increasing probability of at least one locus having four different alleles in a two-person mixture.
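One way to make this calculation concrete is sketched below. It assumes that each locus behaves as if it had n equally frequent alleles, with n taken as the integer part of Ae, that the two contributors are unrelated, and that loci are independent. Under those assumptions the four alleles in a two-person mixture are all different with probability (n−1)(n−2)(n−3)/n³. The function names and the Ae values are hypothetical, and the exact procedure behind Table 2 may differ.

```python
import math


def prob_four_distinct(n_alleles):
    """P(four different alleles) when two unrelated people each contribute
    two alleles drawn from n equally frequent alleles."""
    if n_alleles < 4:
        return 0.0  # four different alleles cannot be observed
    n = n_alleles
    return (n - 1) * (n - 2) * (n - 3) / n ** 3


def prob_at_least_one_locus(ae_values):
    """P(at least one locus shows four alleles), using floor(Ae) per locus."""
    p_none = 1.0
    for ae in ae_values:
        p_none *= 1.0 - prob_four_distinct(math.floor(ae))
    return 1.0 - p_none


# Hypothetical Ae values for a small panel (not values from Table S1).
example_ae = [4.9, 5.3, 6.1, 7.8]

if __name__ == "__main__":
    for n in (4, 5, 6):
        print(n, round(prob_four_distinct(n), 4))
    print("at least one locus:", round(prob_at_least_one_locus(example_ae), 4))
```

With this approximation the per-locus probability is about 0.094 for n = 4, 0.192 for n = 5, and 0.278 for n = 6, so the panel-wide probability climbs quickly as loci with higher Ae are added. Rounding Ae down keeps the estimate conservative, consistent with the lower-bound reasoning in the text.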
Results from actual mixture studies illustrate the value of the high Ae markers in this set of 24 MHs (Fig. 8). These examples are based on the SNPs originally used to define the loci (cf. ALFRED; https://alfred.med.yale.edu) and incorporated in the ThermoFisher software accompanying the 74-locus multiplex [
]. In the four mixture examples (Fig. 8) at least one locus (illustrated) allows an estimate of the minimum number of contributors in the mixture. In some cases reasonable quantitative considerations can help estimate the genotypes contributing to the actual mixture. Even if full deconvolution is not possible, a valid estimate of the minimum number of contributors is important for probabilistic genotyping [
Fig. 8. Examples of mixture results. The mixture ratios are given for each example and the read numbers for the haplotypes seen are plotted. The four-person results for mh13KK-218 indicate at least four persons contributed to the mixture. The four-person results for mh21KK-320 indicate at least three persons contributed. The three-person results for mh13KK-218 indicate at least three persons contributed. The three-person results for mh02KK-134 indicate at least three persons contributed to the mixture.
To date most studies of MHs have assumed by default that their use would be relevant independently of or subsequently to ordinary casework analysis of a sample with forensic STR loci. A major focus has been on demonstrating value for ancestry inference (e.g., de la Puente et al. [
], two areas of weakness for the forensic STR loci. Our analyses are directed toward documenting that a small, selected set of MHs can address casework issues and supplement the CODIS markers. Actual use in casework is becoming possible as more and more labs are considering sequencing for casework analyses.
We have compared the two sets of markers in Fig. 1 and Fig. 2 for their Ae and In values. We show that the 24 MHs we selected are better, on average, than the 24 augmented CODIS markers in terms of both Ae and In. All of the MHs have an Ae that exceeds 4.8 whereas 10 of the STR loci fall below that value. Only 8 of the MHs have an In value below 0.40 whereas only 3 of the augmented CODIS markers have an In value above 0.40.
In Table S1 we show that the In and Ae values of the 24 MH are dependent on the set of populations used to determine those values. The 30 populations are mostly 1000 Genomes [
]. Inclusion of those additional populations may have resulted in an increase in Ae from that based on just the 1000 Genomes alone. The 79 populations include more East Asian and Native American populations, populations that generally have lower values of Ae but would contribute to a higher In.
The RMP values of the two marker sets illustrate several points. First, the reference populations for the STR loci constitute a poor global reference. Second, data for the MH reference populations show the large differences in the allele frequencies also seen for several panels of SNPs. Although the STR data show no large difference in RMPs across Africa and Eurasia, the set of East Asian populations is not broad. One possible explanation for the absence of a difference is the higher mutation rate for the STRs compared to MHs that would counter the loss of alleles by random genetic drift. Alternatively, the one East Asian population may just be an outlier.
With the exception of three loci, the values of average Ae are > 5.0 for the 24 MH markers in this panel. To account for differences in sets of populations studied, an Ae of 4.5 seems to be a good working level for future selection of candidate high-Ae markers. Other studies on fewer populations have found some of these 24 MHs to have high Ae. Turchi [
]. They should be evaluated in efforts toward a better panel. A problem with comparing studies for statistics like Ae is that different sets of populations have been used. The Ae rankings can differ depending on the population (Table S1). However, a large global panel should be adequate for identifying the markers with very high Ae. Also, while Ae and In are correlated theoretically [
], the correlation in our studies is weak at the lower levels of Ae. Considerations of In value may be relevant in decisions among individual MHs when the number of MHs is to be kept small. The 1000 Genomes dataset can be used for comparison but does not have a good representation of Native Americans.
An advantage of the mMHseq methodology used for the 90 microhaplotypes [
] and applicable to the 24 MHs in this study is that markers can easily be removed or added. We expect this initial set may be modified when more high Ae markers have been tested on a large global set of populations comparable to the 79 populations studied for these 24 MHs. A future research project will be to find more microhaplotypes that have an average Ae > 4.5 for a large set of populations. Given the lower levels of genetic variation in human populations located farther from Africa, MHs with high Ae will be less common and will require a focused search in those populations. It is especially important to find more markers that have higher Ae values for the East Asian, Pacific, and Native American populations. While the Pacific populations are small in a global context, the Native American and East Asian (including Chinese) populations are not. Fortunately, there are resources that will allow such searches.
] deliberately searched for MH with Ae > 4 using the 1000 Genomes data. They were successful in identifying, in just the Chinese, many with an average Ae > 5 and a few with an average Ae > 6. These loci should help balance the RMP for East Asians. Because the analyses [
] were in the 1000 Genomes database, it is a close approximation to compare the Ae rankings of the two sets. Sorting the two sets of MH together by Ae shows that the top 24 MHs consist of 14 MHs from Gandotra [
] (Table S5). Given the different rankings when more populations are typed (cf. Table S1), this sorting is not final but an example of the need to compare using the same global set of populations. Some of the Wu markers might displace some of the top 24 of the Gandotra markers were identical reference populations used. Resolution of an improved set comparable to these best 24 MHs by Ae will necessarily await more comparable population sets. Thus, data suggest an improved panel could be developed from MHs already identified. What is missing is a comparison based on studies of the same populations.
Ultimately, a set of MHs needs to be agreed upon by the forensic community. Such agreement should enhance the development of a commercial panel, one that optimizes the multiplexing of the STRs and MHs. Software to separate the interpretation of the different amplicons—those for STRs and those for MHs—from one sequencing run will need to be written but the software already exists for each type of sequence alone. Moreover, Verogen already markets a kit that multiplexes STRs and SNPs, providing a proof of principle that STR loci can be multiplexed with small amplicons containing SNP-like information.
Microhaplotypes are often considered much less heterozygous than STRs with one estimate that 86 % of ~ 380 MHs had Ae values ranging from 2.0 to 4.0. MHs with Ae values > 5 were especially rare based on review of 7 different publications on MHs [
]. Many of those studies used a minimum of the 1000 Genomes data; so, there is a global, albeit imperfect, perspective. Our study shows that, while rare, MHs with high Ae values do exist in sufficient numbers for meaningful analyses.
A very large MH array may be difficult to multiplex while preserving the depth of coverage needed to identify alleles of the minor contributors in mixtures. To avoid this potential problem we have focused on a smaller panel of size comparable to the augmented CODIS panel. Amplicons could be made smaller for many loci in future iterations of a MH panel for multiplexing should that be an issue in optimizing the multiplex. Even if these MH are never multiplexed with the standard CODIS markers, this 24-MH panel is an excellent stand-alone panel for follow-up testing when information from STR analyses is insufficient in casework. Of course, the entire set of 90 MHs is an even better panel for forensic analysis if MH analyses by MPS are an independent follow-up to STR typing by CE.
6. Conclusion and recommendation
This panel of 24 microhaplotypes has been shown to be excellent, for its size, for individualization, for ancestry inference, and, in theory, for mixture analysis. It has the advantage of using the same sequencing analysis as is becoming useful for the forensic standard STRs. We propose that a panel of markers for forensic casework be developed to include these MHs in addition to the CODIS markers. We recommend that the set of 24 microhaplotypes in this study be that initial addition to the new casework kit. We think that these 24 MH loci are adequately spaced among the CODIS markers to be statistically independent for forensic analyses. We have shown that the 24 MHs are very informative and add forensic value in individualization, ancestry inference, and mixture resolution. They are worth incorporating into a forensic casework panel. Indeed, as the database of MHs from casework accumulates, MHs will become sufficient to be a casework panel by themselves.
Funding
This work was funded in part by Grant 2018-75-CX-0041 awarded to KKK by the National Institute of Justice (NIJ), Office of Justice Programs, United States Department of Justice; by United States National Institutes of Health Grant R01 HD102537 to CS; and by NIJ Grant 2017-DN-BX-0164 to DP. Points of view in this presentation are those of the authors and do not necessarily represent the official position or policies of the U.S. Department of Justice.
CRediT authorship contribution statement
KKK and AJP designed the study, analyzed the data, and wrote the initial draft of the paper. All authors read the paper and helped edit the initial draft.
The authors thank Dr. Francoise R. Friedlaender for her expert help in formatting and labeling the STRUCTURE bar plots. Special thanks go to the many hundreds of individuals who volunteered to give blood or saliva samples for studies of gene frequency variation and to the many colleagues who helped collect the samples. Some cell lines were obtained from the National Laboratory for the Genetics of Israeli Populations at Tel Aviv University.