Technische Universität Dresden, Faculty of Computer Science, Dresden, 01187, Germany
Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, 01307, Germany
Center for Systems Biology Dresden, Dresden, 01307, Germany
Highlights
• Convergence criteria strongly influence the precision of MCMC DNA mixture algorithms.
• Our Hamiltonian Monte Carlo sampler can fulfil strict convergence criteria.
• Our tool significantly reduces the standard deviation of the log-likelihood ratios.
• It is the first probabilistic genotyping algorithm to benefit from GPU acceleration.
Abstract
Motivation: Analysing mixed DNA profiles is a common task in forensic genetics. Due to the complexity of the data, such analysis is often performed using Markov Chain Monte Carlo (MCMC)-based genotyping algorithms. These trade off precision against execution time. When default settings (including default chain lengths) are used, changes as large as 10-fold in the inferred log-likelihood ratios (LR) are observed when the software is run twice on the same case. So far, this uncertainty has been attributed to the stochasticity of MCMC algorithms. Since LRs translate directly to the strength of the evidence in a criminal trial, forensic laboratories desire LRs with small run-to-run variability.
Results:
We present the use of a Hamiltonian Monte Carlo (HMC) algorithm that reduces run-to-run variability in forensic DNA mixture deconvolution by around an order of magnitude without increased runtime. We achieve this by enforcing strict convergence criteria. We show that the choice of convergence metric strongly influences precision. We validate our method by reproducing previously published results for benchmark DNA mixtures (MIX05, MIX13, and ProvedIt). We also present a complete software implementation of our algorithm that is able to leverage GPU acceleration for the inference process. In the benchmark mixtures, on consumer-grade hardware, the runtime is less than 7 min for 3 contributors, less than 35 min for 4 contributors, and less than an hour for 5 contributors with one known contributor.
Investigators present at a crime scene identify and collect the available physical evidence. As a part of this evidence, DNA samples containing material from multiple contributors (i.e. mixed DNA samples) are often collected. The resulting short tandem repeat data are occasionally affected by stochastic events such as severe peak height imbalance, drop-outs, and drop-ins [
Executive Office of the President. Report to the President - Forensic Science in Criminal Courts: Ensuring Scientific Validity of Feature-Comparison Methods.
]. The recommended approach for reporting the results of DNA mixture analysis is the likelihood ratio (LR):
(1) LR = P(E | H_p) / P(E | H_d) = [ Σ_G P(E | G, H_p) P(G | H_p) ] / [ Σ_G P(E | G, H_d) P(G | H_d) ],
where E is the observed EPG and G represents a genotype set, i.e. a list of tuples denoting the allele designations of the contributors. The summations are over all possible genotype sets G. H_p and H_d are the hypotheses of the prosecutor and the defendant, respectively. A hypothesis assumes inclusion of certain contributors (suspect, victim, etc.) in the mixture, as well as the background allele frequencies of the populations the contributors allegedly belong to. Usually, the difference between the hypothesis of the prosecutor and the hypothesis of the defendant is the inclusion of the suspect in the former. In the setting we consider, the number of contributors is fixed beforehand. P(G | H) can be estimated based on the background frequencies of the alleles in the populations of interest for any hypothesis H [
]. Probabilistic genotyping (PG) refers to the set of statistical methods used to compute LR for given EPG data.
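To make the structure of Eq. (1) concrete, the following minimal sketch (in Python, with purely illustrative numbers that are not taken from any case in this article) computes an LR by summing per-genotype-set likelihoods weighted by their prior probabilities under each hypothesis.

```python
import numpy as np

# Hypothetical per-genotype-set likelihoods P(E | G, H) and priors P(G | H).
# The genotype sets and numbers below are illustrative only.
def likelihood_ratio(lik_p, prior_p, lik_d, prior_d):
    """LR = sum_G P(E|G,Hp) P(G|Hp) / sum_G P(E|G,Hd) P(G|Hd)  (Eq. 1)."""
    num = np.dot(lik_p, prior_p)    # numerator: prosecutor's hypothesis
    den = np.dot(lik_d, prior_d)    # denominator: defendant's hypothesis
    return num / den

# Under Hp the suspect is fixed, so fewer genotype sets remain possible.
lik_p   = np.array([2.1e-12, 4.0e-13])           # P(E | G, Hp) for 2 sets
prior_p = np.array([0.7, 0.3])                   # P(G | Hp)
lik_d   = np.array([2.1e-12, 4.0e-13, 5.0e-15])  # P(E | G, Hd) for 3 sets
prior_d = np.array([0.02, 0.01, 0.97])           # P(G | Hd) from allele freqs

print(likelihood_ratio(lik_p, prior_p, lik_d, prior_d))
```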
In order to estimate P(E | G, H), assumptions about the underlying data-generating process are made. These assumptions lead to a probabilistic model with latent variables θ. Models where the probability is estimated based on the heights of the EPG peaks are called “fully continuous”. Some of the main approaches used to infer such models are: finding the most likely set of parameters by maximum likelihood estimation [
J. R. Stat. Soc. Ser. C Appl. Stat. 2015; 64: 1-48 (accepted version; published in final form at http://dx.doi.org/10.1111/rssc.12071)
], or Bayesian inference, in which the latent parameters are integrated out:
(2) P(E | H) = ∫ P(E | θ, H) π(θ) dθ = ∫ Σ_G P(E | G, θ, H) P(G | H) π(θ) dθ,
with π(θ) the prior probability density function (PDF) of the parameters. In an ideal scenario, LR is independent of any choices made by the laboratory technician and of any random confounding factors. In practice, however, LR depends on the variation in the samples from the crime scene, stochastic events occurring during PCR amplification, allele frequency sampling, parameter settings in the data-processing software, and the stochasticity of the inference algorithm itself, leading to run-to-run variability.
So far, this run-to-run variability has been attributed to the inherent stochasticity of the Markov Chain Monte Carlo (MCMC) methods used to estimate mixture model parameters [
]. However, as we show here, the apparent run-to-run variability is more likely caused by the choice of convergence criteria used in the MCMC sampler. This is supported by our results demonstrating that run-to-run variability can be reduced when using an MCMC method with strict convergence criteria. For this, we formulate a probabilistic model of DNA mixture deconvolution that only has continuous model parameters, marginalising over the discrete dimensions (genotype sets). While such marginalised models can be properly convergence-controlled, they are generally more expensive to compute. As we show here, however, the intrinsic structure of the posterior probability distribution can be exploited by Hamiltonian Monte Carlo (HMC), maintaining the runtimes of conventional MCMC solutions. We show that the strict convergence criteria afforded by our method significantly reduce run-to-run variability. We further present a deduplication system that efficiently handles the combinatorial growth in the number of genotype sets with increasing numbers of contributors.
1.1 Precision of DNA mixture deconvolution
We define the precision of DNA mixture deconvolution as the inverse of the variance across the results of runs with identical hyper-parametrisation on the same EPG data for the same hypotheses. Precision has to be considered in addition to the accuracy of a PG system [
]: “The argument is that the existence of variability [across PG runs — our note] raises doubts about whether any of the results should be accepted.” [
] suggest a scale that defines an LR between 100 and 1000 as “moderately strong support” and LR between 1000 and 10000 as “strong support”. Let us assume that we use a PG system, which, for the given case, outputs results from a normal distribution: . A single run of the software would give “strong support” in of cases. A technique that reports confidence intervals [
], however, would provide the conservative answer of “moderately strong support”. This highlights the importance of high precision (i.e. low variance) in PG results.
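The effect of run-to-run variability on verbal reporting scales can be illustrated with a short sketch; the distribution parameters below are hypothetical and do not correspond to the case discussed above.

```python
from scipy.stats import norm

# Illustrative only: suppose a PG system returns log10(LR) values that vary
# from run to run as N(mu, sigma). With mu just above the "strong support"
# threshold of log10(LR) = 3, a non-negligible fraction of single runs would
# fall back into the "moderately strong support" band (2 <= log10 LR < 3).
mu, sigma = 3.2, 0.3            # hypothetical run-to-run distribution
p_strong = 1 - norm.cdf(3, loc=mu, scale=sigma)
print(f"fraction of runs reporting 'strong support': {p_strong:.1%}")

# A report based on a lower confidence bound (e.g. the 5th percentile) would
# instead state the more conservative "moderately strong support":
print(f"5th percentile of log10(LR): {norm.ppf(0.05, loc=mu, scale=sigma):.2f}")
```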
The precision of available commercial solutions has been quantified in several studies [
]. A standard deviation of LR of has been reported between identical runs on a three-contributor mixture (sample ‘3-2’) when the TrueAllele® software was used [
]. In order to increase precision, it has been recommended to increase the number of MCMC iterations, at the expense of a larger computational runtime [
]. A widely used convergence diagnostic is the Gelman-Rubin (GR) statistic, which compares pooled and within-chain variances of samples to indicate possible convergence. For given model parameters, this diagnostic has a value close to 1 if the samples from different chains result in similar estimates for the marginal distribution. Since actual convergence cannot be quantified as long as the true posterior distribution is not known, convergence criteria measure the stability of samples, and the term “convergence” in MCMC is always relative to the chosen metric.
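For reference, a minimal implementation of the classic (univariate) GR statistic, computed from the pooled and within-chain variances as described above, could look as follows; it is a generic sketch, not the diagnostic code of any particular PG software.

```python
import numpy as np

def gelman_rubin(chains):
    """Classic (univariate) Gelman-Rubin statistic for one parameter.

    chains: array of shape (m, n) -- m chains, n post burn-in samples each.
    Returns the potential scale reduction factor; values near 1 indicate
    that the between-chain and within-chain variance estimates agree.
    """
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()          # within-chain variance
    B = n * chain_means.var(ddof=1)                # between-chain variance
    var_hat = (n - 1) / n * W + B / n              # pooled variance estimate
    return np.sqrt(var_hat / W)

# Example: 4 chains, 1000 samples each, drawn from the same distribution.
rng = np.random.default_rng(0)
chains = rng.normal(size=(4, 1000))
print(gelman_rubin(chains))   # close to 1 for well-mixed chains
```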
Some of the available PG software solutions provide users with convergence diagnostics. STRmix™ [
] for example calculates the ratio of pooled and within-chain variances of the likelihood of the model (personal communication, Kevin Cheng, Institute of Environmental Science and Research, Ltd., Auckland). GenoProof Mixture [
] reports the univariate GRs for the continuous parameters. By default, both software solutions use a predefined constant number of post burn-in samples and then report the value of the diagnostic to the user. If the desired threshold (by default 1.2 in STRmix™, 1.05 in GenoProof Mixture) of the convergence diagnostic has not been achieved, the software offers an option to run additional iterations. The default GR threshold should be rather low. The authors of the diagnostic state [
]: “The condition of GR near 1 depends on the problem at hand; for most examples, values below 1.1 are acceptable, but for a final analysis in a critical problem, a higher level of precision may be required.” [
] This seems even more important since the statistical models used in both of these software tools combine continuous (e.g. peak intensities) and discrete (e.g. genotype sets) dimensions. However, GR cannot monitor convergence in discrete dimensions. It is therefore possible that convergence is deduced purely from the continuous parameters, while the genotype set distributions may not have converged at all, offering a possible explanation for the large run-to-run variability observed despite low GR thresholds. In our model, we avoid the issue of assessing convergence of genotype sets by marginalising them out (i.e., the genotype set is not part of the MCMC sampling process).
1.2 Trade-off with execution time
In forensic DNA mixture deconvolution, computational runtime is of great importance, since:
• Laboratories might have to run software multiple times with different hyper-parametrisations in order to check the robustness of the results or to test hypotheses with different numbers of contributors, different analytical thresholds, different priors, etc.
• Laboratories might want to quantify the precision of the results over several identical replicates.
• Forensic laboratories are often working under time pressure, e.g., if cases attract great media attention or laws limit detention time without charges.
In addition to the efficiency of the software implementation, there are multiple factors that influence the execution time, including the number of contributors, the number of alleles per locus, the techniques used to limit the number of considered genotype sets, the convergence criteria, the choice of the optimisation problem (maximum likelihood vs. Bayesian inference), the specification of the model, the priors used, etc.
In general, there is a trade-off between precision and runtime. Lower runtimes can trivially be achieved by running a smaller number of MCMC iterations, at the expense of precision. Achieving perfect precision is theoretically possible, if runtime is not a priority, by integrating over the latent variables (see Eq. (2)).
The model presented here integrates over the discrete dimensions, i.e., marginalises over genotype sets in order to be able to properly monitor convergence. Such a marginalisation imposes additional computational cost, and a proper estimation of the posterior requires well-suited computational methods. We render our method practically applicable by using a Hamiltonian Monte Carlo (HMC) sampler and by introducing a deduplication system for genotype sets. Thanks to these contributions, we are able to use a strict GR threshold of 1.05 with similar or faster runtime than existing solutions.
2. Materials and methods
We base our probabilistic genotyping model on the work by Taylor et al. [
]. The main assumptions in this model are the independence of the loci data given the nuisance parameters and the log-normal distribution of the ratio of the observed EPG peak heights to the peak heights predicted by the generative model (called “expected” peak heights). The generative model consists of several steps, as illustrated in Fig. 1: First, the expected contributions are computed for each genotype set from the considered set of parameters. Then, peak stutter models are applied to predict expected peaks. The next step is to calculate the standard deviation of the log-normal distribution. Finally, the likelihood of the observed data given the parameters is calculated as the product of the likelihoods of all peaks. The model handles stochastic dropout and drop-in events.
Fig. 1 Illustration of the present probabilistic genotyping model. Top left: the observed peaks at two selected loci from the green channel of the analysed mixture. Centre left: We analyse the genotype set [(15,16),(14,Q)] for locus D3S1358 and [(6,9),(7,9.3)] for locus TH01. The catch-all dropout allele Q denotes any dropped-out peaks. The lines show the expected contributions for all alleles. The function is decreasing within a locus due to degradation modelling. The TH01 locus was modelled with a larger amplification efficiency parameter than D3S1358. Bottom left: Expected peaks are created by applying stutter ratios for two contributors. In this example, we consider only backward stutter for illustration purposes. Composed peaks (i.e. those that consist of both allelic and stutter contributions) are 14 and 15 at D3S1358, and 6 at TH01. Top right: PDF of the log-normal model for peak 15 at D3S1358 as a function of the expected peak height. Centre right: dropout probability as a function of expected peak height. Bottom right: drop-in probability as a function of observed peak height. Peak 18 at D3S1358 is a drop-in, since it is observed but not expected in the considered genotype set. The dotted lines denote the analytical thresholds.
We consider the posterior probability of the parameters θ given the evidence E:
(3) p(θ | E, H) = p(θ) Σ_G P(E | G, θ, H) P(G | H) / P(E | H).
P(G | H) is assumed to be a flat prior. This assumption leads to genotype weights that are independent of the person(s) of interest. In Bayesian inference, the evidence P(E | H) is usually neglected, as it is too expensive to compute and constant w.r.t. the model parameters. We thus obtain the unnormalised posterior
(4) p*(θ | E, H) = p(θ) Σ_G P(E | G, θ, H) P(G | H),
which we use for estimating (see Eq. (2)). We assume peak heights to be conditionally independent given θ and the genotype set, and the alleles of a contributor in different loci to be independent of each other. To provide a mathematical formulation of the model, we denote:
• H_{l,a,r}: random variable of the observed peak height at locus l, allele a, replicate r;
• μ_{l,a,r}: expected peak height at locus l, allele a, replicate r;
• AT_{l,r}: the analytical threshold of the EPG for locus l and replicate r;
• N(μ, σ): a normal distribution with mean μ and standard deviation σ;
• σ_t: the peak height variance parameter for peak type t (e.g. stutter).
The resulting model is a function of the continuous parameters θ and the genotype sets. The likelihood of the observed EPG given the parameters θ and a genotype set G is then:
(5)
where denotes the available data at locus for replicate , and is the genotype set at locus . The conditional probability is modelled as:
(6)
The inner multiplications are performed over the set of observed peaks and the set of hypothetical dropout peaks. This model formulates separately the relative likelihood of observed peaks and the probabilities of dropout events. In the following, we abbreviate the notation for to .
The genotype set weights are calculated by:
(7)
where θ_s denotes the s-th sample from the posterior.
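The following sketch illustrates how such genotype-set weights can be obtained from posterior samples under a flat prior over genotype sets; the function and array shapes are our own illustration and may differ from the exact formulation of Eq. (7).

```python
import numpy as np

def genotype_set_weights(log_lik):
    """Posterior genotype-set weights, sketched for a flat prior over G.

    log_lik: array of shape (S, K) with log P(E | G_k, theta_s) for S
    posterior samples theta_s and K candidate genotype sets G_k.
    Each sample contributes a normalised distribution over genotype sets;
    the weights are the average of these distributions over the samples.
    """
    # normalise per sample in a numerically stable way
    log_lik = log_lik - log_lik.max(axis=1, keepdims=True)
    per_sample = np.exp(log_lik)
    per_sample /= per_sample.sum(axis=1, keepdims=True)
    return per_sample.mean(axis=0)

# Toy example with 3 posterior samples and 2 genotype sets.
print(genotype_set_weights(np.log([[1e-9, 4e-10],
                                   [2e-9, 1e-10],
                                   [8e-10, 6e-10]])))
```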
2.1 Observed peaks
In case a peak is observed (i.e. its height is at or above the analytical threshold), our model considers a compound distribution that combines sub-models for expected peaks and for drop-in events:
(8)
For the drop-in events, we use the model introduced by Euroformix [
Two hyper-parameters based on the level of noise in negative controls are required: the drop-in rate and the rate parameter of the exponential distribution. For the expected peaks, we assume a log-normal distribution following previous works [
The standard deviation of this distribution depends on the components of the expected peaks and their heights:
(11)
This means that for a single allele a, we consider contributions from the allelic peak at a, the backward stutter from the peak at a+1, the forward stutter from the peak at a-1, and the double backward stutter from the peak at a+2. Our method currently does not consider half-backward stutter.
We use one parameter for the allelic peak standard deviation and a different one for the stutter peak standard deviations. Individual parameters for the standard deviations of different types of stutter could be introduced without significant changes to the model. The expected peak heights are:
The product contribution at a selected allele is defined as:
(14)
where is the molecular weight of the allele, and is the average of the largest and smallest observed molecular weights on the EPG within the called peaks. The integer is 1 if the genotype of contributor includes allele in locus , and 0 otherwise. It is equal to 2 in case of a homozygote. The scalars are the locus-specific amplification efficiencies (LSAE), are the weights of the contributors that sum up to 1, are the degradation parameters, and is the total allelic product expressed in relative fluorescent units (RFU).
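As an illustration of how the factors listed above combine, the sketch below evaluates a product contribution of the general form described in this paragraph; the parameter names and the exact functional form are our reconstruction, not necessarily identical to Eq. (14).

```python
import numpy as np

def product_contribution(total_rfu, lsae, weight, copies, degradation,
                         mol_weight, mean_mol_weight):
    """Sketch of the per-allele product contribution described above.

    The exact parametrisation in the paper may differ; this only combines
    the named factors: total allelic product (RFU), locus-specific
    amplification efficiency (LSAE), contributor weight, number of allele
    copies (0, 1 or 2), and an exponential degradation term that decreases
    with the molecular weight of the allele relative to the EPG average.
    """
    return (total_rfu * lsae * weight * copies
            * degradation ** (mol_weight - mean_mol_weight))

# Illustrative values only.
print(product_contribution(total_rfu=2000.0, lsae=1.05, weight=0.6,
                           copies=1, degradation=0.995,
                           mol_weight=250.0, mean_mol_weight=220.0))
```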
Eq. (12) includes the normalised stutter ratios and the product contributions from a contributor , . To obtain , the unnormalised stutter ratios are deduced from unambiguous profiles (see Supplementary Materials for the list of models with unnormalised stutter ratios). This is done using linear regressions based on allele designation or longest uninterrupted sequences [
]. The rationale behind the formula is explained in Section 1 of the Supplementary Material:
(16)
2.2 Dropout events
In case of a dropout, the peak is unobserved because it is below the analytical threshold. This corresponds directly to
(17)
See Supplementary Material section 1 for details.
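The overall per-peak likelihood structure of Sections 2.1 and 2.2 (a log-normal likelihood for expected peaks, an exponential drop-in model for unexpected peaks, and a below-threshold probability for dropouts) can be sketched as follows; all constants and the drop-in parametrisation are illustrative, and the exact formulation is given by Eqs. (8) to (17).

```python
import numpy as np
from scipy.stats import lognorm, expon

def log_peak_likelihood(observed, expected, sigma, at,
                        drop_in_rate=0.05, drop_in_lambda=0.01):
    """Sketch of the per-peak likelihood structure described above.

    observed: observed peak height (0 means no peak above the analytical
    threshold `at`). expected: expected peak height from the generative
    model. sigma: standard deviation of the log-normal model. The drop-in
    sub-model and all constants are illustrative; the paper's exact
    formulation (Eqs. (8)-(17)) may differ.
    """
    if observed >= at:
        if expected > 0:
            # expected peak: log-normal model for the observed height
            p = lognorm.pdf(observed, s=sigma, scale=expected)
        else:
            # unexpected peak: drop-in modelled with an exponential tail
            p = drop_in_rate * expon.pdf(observed - at, scale=1.0 / drop_in_lambda)
        return np.log(p)
    # dropout: probability that the peak falls below the analytical threshold
    return lognorm.logcdf(at, s=sigma, scale=expected)

print(log_peak_likelihood(observed=850.0, expected=900.0, sigma=0.25, at=50.0))
print(log_peak_likelihood(observed=0.0,   expected=60.0,  sigma=0.40, at=50.0))
```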
2.3 Parameters of the model and priors
Table 1 summarises the parameters of our model. In typical scenarios, the total number of posterior dimensions will range from 22 (2-contributor, 1-replicate Identifiler™ mixture without modelling of the AMEL locus) to 49 (5-contributor, 3-replicate Fusion 6C mixture with AMEL locus included).
Table 1 The parameters of our model. One contributor weight is omitted, as it can be inferred from the other weights (the sum of all weights has to be 1). is the number of loci, is the number of replicates, and is the number of contributors. We further reduce the number of parameters if multiple replicates are performed using the same kit. In such a case, the analysis shares the LSAEs across the replicates. The domain boundaries are treated as described in Subsection 3.1 of the Supplementary Materials.
We define the prior probability of the parameters as:
(18)
Here, the prior on the logarithm of the LSAE prevents the amplification efficiencies from drifting too far away from 1. The prior variance is determined by a hyper-parameter to be optimised by the laboratory [
] unless specified otherwise. We consider dropout allelic contributions as separate peaks, i.e., (Q,Q) is considered heterozygous. The total number of genotype sets is reduced by considering at most one drop-in per locus and by using a drop-in cap. Peaks which are in a stutter position but are not included in the genotype set definition are considered drop-ins if they are abnormally tall w.r.t. their origin peak; see the maximum stutter ratios in Table 3. In summary, if we consider a genotype set for a locus containing a tall peak, that genotype set will not be removed during preprocessing if (a sketch of this filter is given after the list):
• the corresponding allele score is a part of any of the genotypes from the set;
• or the height is lower than the drop-in cap and there is no other peak that can be explained as a drop-in;
• or the peak can be explained as a stutter of another peak with an observed stutter ratio lower than the maximum stutter ratio.
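A minimal sketch of this pre-processing filter for a single tall peak is given below; the function and parameter names are ours and only mirror the three rules listed above.

```python
def keep_genotype_set(peak_allele, peak_height, genotype_set_alleles,
                      parent_height, observed_stutter_ratio,
                      max_stutter_ratio, drop_in_cap, other_drop_in):
    """Sketch of the pre-processing filter for one tall peak (illustrative).

    A candidate genotype set is kept if the peak is explained by the set
    itself, by a permissible drop-in, or by stutter from a parent peak.
    Parameter names are ours; the rules follow the list above.
    """
    if peak_allele in genotype_set_alleles:
        return True                                  # allelic explanation
    if peak_height < drop_in_cap and not other_drop_in:
        return True                                  # plausible drop-in
    if parent_height > 0 and observed_stutter_ratio <= max_stutter_ratio:
        return True                                  # plausible stutter
    return False

# Example: a 150 RFU peak in stutter position of a 2000 RFU parent peak.
print(keep_genotype_set(peak_allele=13, peak_height=150.0,
                        genotype_set_alleles={12, 15},
                        parent_height=2000.0, observed_stutter_ratio=0.075,
                        max_stutter_ratio=0.2, drop_in_cap=120.0,
                        other_drop_in=False))
```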
2.5 Sampling algorithm
The marginalised (over genotype sets ) PG model from Eq. (4) is prohibitively costly to infer using random-walk MCMC, due to the large number of log-probabilities that need to be calculated when all possible genotype sets are considered. We thus reduce the number of required calculations of log-probabilities by using an adaptive-proposal sampler that has been successfully used in other fields: Hamiltonian Monte Carlo (HMC) [
]. The difference between HMC and random-walk MCMC is how the proposal distribution is chosen. HMC simulates physical (i.e. Hamiltonian) system dynamics instead of choosing a random point from the neighbourhood of the current sample. This renders HMC very efficient for posteriors with multi-modal or multi-funnel shapes, significant parameter correlations, and/or high dimensionality [
]. Unlike GenoProof Mixture, which only changes the value of a single continuous parameter in each iteration, our sampler considers multi-variate moves. The proposal distribution for these moves is dynamically adapted across iterations. It is determined in each iteration based on the local gradient of the log-probability of the model [
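A minimal sketch of such a gradient-based, multi-variate sampler, written with TensorFlow Probability and a stand-in log-posterior (not our actual mixture model), could look as follows.

```python
import tensorflow as tf
import tensorflow_probability as tfp

# Illustrative stand-in for the (marginalised) unnormalised log-posterior;
# the real model sums over genotype sets inside this function.
def target_log_prob(theta):
    return -0.5 * tf.reduce_sum(theta ** 2, axis=-1)

n_chains, n_dims = 4, 30
kernel = tfp.mcmc.SimpleStepSizeAdaptation(
    tfp.mcmc.HamiltonianMonteCarlo(
        target_log_prob_fn=target_log_prob,
        step_size=0.1,
        num_leapfrog_steps=10),
    num_adaptation_steps=200)

samples = tfp.mcmc.sample_chain(
    num_results=300,
    num_burnin_steps=300,
    current_state=tf.zeros([n_chains, n_dims]),
    kernel=kernel,
    trace_fn=None)

# Gelman-Rubin diagnostic per dimension (shape: [n_dims]); values near 1
# indicate that the chains agree on the marginal distributions.
print(tfp.mcmc.potential_scale_reduction(samples, independent_chain_ndims=1))
```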
Thanks to the portability of the TensorFlow library, we can implement both CPU and GPU versions of our inference procedure. However, a naïve TensorFlow implementation would perform below expectations due to the combinatorial growth of the number of possible genotype sets with increasing number of contributors. We therefore introduce an important performance improvement: a deduplication system that can be used with any model that considers all genotypes in a single iteration (e.g. our work, Euroformix). Additional performance optimisations are described in Section 3 of the Supplementary Material.
For the deduplication system, we consider a single locus. The expected peak heights and the standard deviation of their distribution (Eq. (10)) depend on the genotype sets. If we consider the output for a single allelic position, these values depend only on the continuous parameters and on the alleles at the allelic and stutter (backward, forward, double backward) positions. When multiple genotype sets are considered, the same values are needed multiple times. As an example, consider the locus shown in Fig. 2 and the likely genotype sets from Table 2 for two contributors. Computing every expected peak (and the likelihoods of observing the ratios between the observed and expected peaks) for genotype set {(10,13), (12,15)} entails computations that are also identically required for the other genotype sets. Our deduplication system ensures that each such computation is performed only once, and the result is cached and reused. This leads to large savings in multi-contributor mixtures. As an example, deduplication reduces the number of evaluations of the likelihood of the ratio between observed and expected peak heights in the D18S51 locus of the MIX13 Case 5 mixture from 1.3 million to 86 thousand when 4 unknown contributors are considered. We use deduplication during both the calculation of the log-probability and the calculation of the HMC gradient.
The deduplication system works as follows: Before a run is performed, we precompute the indices of all duplicate entries and create two data structures, one containing the information required to evaluate the deduplicated expected peak heights, and one containing the indices for a gather operation to be performed after the log-probabilities of the deduplicated peaks have been calculated. This gather operation then unfolds the deduplicated results into a matrix that stores the log-probabilities per replicate, allelic position, chain, and genotype set. This procedure can be thought of as a cache: the forward propagation first generates the cache by calculating the deduplicated log-probabilities. Then, the cache is used during back-propagation to provide the results for all genotype sets.
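The sketch below illustrates the idea of the deduplication with a toy example: unique input configurations are identified once before sampling, the expensive function is evaluated only on those, and a gather operation unfolds the results for all genotype sets; because gradients flow through the gather, the same cache also serves back-propagation. The data and the stand-in function are illustrative only.

```python
import numpy as np
import tensorflow as tf

# Illustrative: rows describe the inputs of one expected-peak computation
# (e.g. allele copies at the allelic and stutter positions, per contributor).
# Identical rows across genotype sets would trigger identical computations.
configs = np.array([[1, 0, 2],
                    [0, 1, 1],
                    [1, 0, 2],   # duplicate of row 0
                    [0, 1, 1],   # duplicate of row 1
                    [2, 0, 0]])

# Precompute deduplication indices once, before sampling starts.
unique_rows, gather_idx = np.unique(configs, axis=0, return_inverse=True)

def expensive_log_prob(rows):
    # stand-in for the per-peak log-likelihood evaluation
    return tf.reduce_sum(tf.cast(rows, tf.float32) ** 2, axis=-1)

# Evaluate only the unique configurations, then unfold with a gather.
dedup_values = expensive_log_prob(tf.constant(unique_rows))
all_values = tf.gather(dedup_values, gather_idx)   # one entry per genotype set
print(all_values.numpy())
```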
Fig. 2Example of a locus from a DNA mixture with two contributors.
Table 2 Three likely genotype sets and the resulting contributions to the peaks of the locus in Fig. 2 from two contributors. For each contributor, we list the contribution from a single copy of each allele and the contribution from stutter originating from that allele. For simplicity, only single-backward stutter and a single replicate are considered. Duplicate entries are highlighted with the same colour.
3. Results and discussions
Reliable forensic genotyping should minimise the number of false inclusions and false exclusions, exhibit high precision and have low inference runtime. We quantify these aspects for our proposed solution on publicly available test mixtures from previously published benchmarks: the ProvedIt dataset [
We denote the contributors in the hypotheses by plus-delimited strings. U stands for an unknown contributor, W is a witness, V is a victim, and all other entries denote suspects. The hypothesis V+W+S+U for example has 4 contributors: the victim, the witness, the suspect, and one unknown person. In all benchmark cases, the defendant’s hypothesis is the prosecutor’s hypothesis with the suspect replaced by an unknown contributor. All benchmark mixtures were created in laboratories with known ground-truth genotypes of the contributors.
Table 3Hyper-parameter values used in the present benchmarks. SR stands for stutter ratio. The values of the variance rates and the standard deviation of the locus-specific amplification efficiency (LSAE) were adjusted to our formulation of the model (i.e., natural logarithm instead of , 1/mean for LSAE standard deviation).
The linear stutter models are fitted on single-source profiles (forward, backward, and double-backward stutter for ProvedIt and MIX13) or on data provided by the kit manufacturer (only backward stutter for MIX05). The stutter models are available in Section 4 of the Supplementary Material.
All experiments are performed on affordable hardware in the cloud. We use NC8as_T4_v3 instances from Azure Cloud (8 vCPUs, Nvidia Tesla T4 GPU with 16 GB VRAM). An exception was made for ProvedIt Sample 3, which does not fit within 16 GB of VRAM. For this case, we rented a Google Cloud virtual machine with an Nvidia A100 GPU with 40 GB VRAM.
Table 4Accuracy of our method. Average LRs over 10 runs for our method (“HMC”) in comparison with other solutions on the MIX05 and MIX13 benchmarks with known ground truth. In most cases, all tested methods correctly return LRs larger than 1 for ground-truth inclusion or smaller than 1 for ground-truth exclusion. For MIX13, we compare with the reported results of Euroformix (“EFM”) version 1.11.4 and STRmix™
]. For MIX05, we simultaneously analyse replicates from different kits: ABI's COFiler, ABI's SGM Plus, Promega's Powerplex 16, and ABI's Profiler Plus. For MIX13, we follow the published studies in using only the ABI AmpFLSTR IdentiFiler Plus replicate. All cases are analysed with a global analytical threshold of 50 and the ground-truth number of contributors, with two exceptions: Case 5 from MIX13 is also analysed with 3 contributors (since most laboratories taking part in the original study estimated this number), and Case 2 from MIX13 uses an analytical threshold of 30 (following the recommendation from NIST). For the capillary electrophoresis fragment analysis files, we use default GeneMapper™ ID-X 1.4 analysis settings. The results are presented in Table 4. Our algorithm reproduces most of the results of other solutions, suggesting its validity. Similar to other solutions, our algorithm provides more conservative LR values when a smaller number of contributors is chosen [
]. The only case in which our model provided LR larger than 1 for a false suspect is S05C in MIX13 Case 5. The genotype of this suspect had been deliberately constructed to share alleles with the true contributors in every locus. 74 out of 108 laboratories have included this suspect in the original study [
]; our method excludes it in a 3-contributor scenario. For MIX13 Case 4, our algorithm provides a higher LR than the reciprocal of the random match probability. An explanation for this behaviour is given in Subsection 2.1 of the Supplementary Material.
Fig. 3 Precision of our HMC method using different stopping criteria in comparison with STRmix™. We use four different GeneMapper™ analysis methods (A, B, C, D) [14] on Sample 1. Every case is run 100 times, and the resulting per-run LRs are shown as dots. The corresponding maximum likelihood estimations of a normal distribution are shown in the plots below. With standard stopping criteria, our method (cyan) reduces the standard deviation by around 10-fold compared with STRmix™ (blue). From the published STRmix™ results, we ignored the result provided by participant L4A1 (no known contributor included) and the second run of participant L1A1 (missing in the plots of the original work [14]).
Our method also displays high precision for the simpler ‘Sample 2’ in the case of Analysis Method A.
Four different analysis methods of GeneMapper™ are used in comparison (called A, B, C, and D). Analysis method A includes smoothing, no normalisation, and a baseline window of 51. The other methods differ slightly: B uses normalisation, C uses a baseline window of 33, and D does not use smoothing. The results are shown in Fig. 3. For STRmix™, a 10-fold run-to-run variability in the LRs is observed with the default stopping criterion (blue), which has been attributed to the stochastic nature of MCMC [
]. In our HMC method, we check the GR diagnostic every 300 iterations and stop when it is below 1.05 for all parameters. This results in a roughly 10-fold reduction in the LR standard deviation (cyan). For GeneMapper™ analysis method A, the standard deviation of the LR is 10.08 times lower in our method than in STRmix™ [
] (Fig. 3A). For analysis method B (Fig. 3B), it is 8.76 times lower, for analysis method C 10.99 times lower (Fig. 3C), and for analysis method D 10.76 times lower (Fig. 3D).
In the same Fig. 3, we also quantify the influence of the convergence diagnostic on the precision of our algorithm by testing two alternative stopping criteria: In the first, we calculate the mean of the GR values and stop when this mean falls below 1.2 (orange). In the second, we use the same convergence metric as STRmix™, but with our threshold value of 1.05 (brown). Due to the efficiency of our HMC sampler, we were unable to simulate chains that resulted in values of this metric reaching as high as 1.2. Interestingly, however, we find that the STRmix™ criterion was sometimes satisfied at threshold 1.05 while some GR values were still above 1.2.
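The stopping rule used for our HMC runs can be sketched as follows: extend the chains in blocks of 300 iterations until the largest univariate GR value over all parameters falls below 1.05. The sampling function below is a stand-in, not our actual HMC kernel.

```python
import numpy as np

def max_gelman_rubin(chains):
    """Largest univariate GR value over all parameters.
    chains: array (m_chains, n_samples, n_params)."""
    m, n, _ = chains.shape
    means = chains.mean(axis=1)                      # (m, p)
    W = chains.var(axis=1, ddof=1).mean(axis=0)      # (p,)
    B = n * means.var(axis=0, ddof=1)                # (p,)
    return float(np.sqrt(((n - 1) / n * W + B / n) / W).max())

def draw_more_samples(state, n_iter, rng):
    """Stand-in for running n_iter further HMC iterations per chain."""
    return rng.normal(size=(state.shape[0], n_iter, state.shape[2]))

# Illustrative stopping rule: extend the chains in blocks of 300 iterations
# until every parameter's GR value drops below 1.05.
rng = np.random.default_rng(1)
chains = rng.normal(size=(4, 300, 10))               # 4 chains, 10 parameters
while max_gelman_rubin(chains) >= 1.05:
    chains = np.concatenate(
        [chains, draw_more_samples(chains, 300, rng)], axis=1)
print(chains.shape[1], "post burn-in iterations per chain")
```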
Taken together, these results show that our approach is able to significantly improve precision over the STRmix™ method, and that this improvement is due to the stricter convergence criteria, as enabled by our purely continuous model parametrisation with genotype sets marginalised out.
Fig. 4 Precision for computationally challenging ProvedIt mixtures. Panel titles indicate the case (top row) and the suspect (bottom row). Each combination is run 10 times, and we plot the sub-source LR. We do not plot exclusion scenarios with LR = 0 for all runs. These are: Sample 2b: non-contributors C18C_Cauc, C99B_AA, and ZT80925_Hisp; Sample 3: non-contributors GT38089_Cauc, JT51484_AA, and WT51359_Cauc. Inclusion cases with the correct suspect are plotted in cyan, correct exclusion cases in orange.
Finally, we quantify the performance of our HMC method on the most challenging cases of mixtures with 3 to 5 contributors. This serves to test our approach on cases that are not simple to resolve and that are challenging both from a precision point of view and for computational runtime. For these cases, we use the set of low analytical thresholds from Riman et al. [
] analysed here without a known contributor (referred to here as Sample 1b and Sample 2b, respectively), and the following two 5-person mixtures from ProvedIt with one known contributor:
• A05_RD14-0003-30_31_32_33_34-1;1;1;1;1-M3I22-0.315GF-Q1.3_01.15 s (named here Sample 3, known contributor is Contributor 31)
• E04_RD14-0003-48_49_50_29_30-1;1;2;4;1-M2d-0.279GF-Q2.1_05.15 s (named here Sample 4, known contributor is Contributor 29)
We use previously suggested analytical thresholds: 35 RFU for the blue channel, 65 for the green, 45 for the yellow, 50 for the red, and 60 for the purple [
]. The filtered .csv data are used. For each case, we construct all possible prosecutor hypotheses with 1 suspect. We also construct the same number of false hypotheses by choosing the suspects randomly from the NIST 1036 U.S. Population Dataset [
]. To quantify precision, we run our method 10 times for each ProvedIt case. The results are shown in Fig. 4. In all cases, our algorithm correctly classifies contributors and non-contributors with high precision. In all but one case where the true contributors are considered, the difference between extremal values of is under 0.2. The only exception is Sample 2b when the suspect is Contributor 30. This is further analysed in Section 2 of the Supplementary Material.
The inference runtimes on the benchmark cloud instances are shown in Table 5. We show the results for all the mixtures we analysed with 3 or more unknown contributors. The results are better than the reported runtimes of previous versions of PG software solutions [
]. This suggests that despite the increased computational complexity of our marginalised model, the efficiency of HMC sampling and the efficient GPU implementation recover state-of-the-art runtimes as required for practical use of the method.
Table 5Inference runtimes. We report the average (over 10 repetitions) inference times on a single-GPU cloud instance for the listed cases with different numbers of contributors (NoC) in minutes (m) and seconds (s).
4. Conclusions
High precision, i.e. low run-to-run variability, of the results provided by probabilistic genotyping methods is key to building trust and to ensuring reliable discriminatory power of the analyses. While run-to-run variability has previously been attributed to the inherent stochasticity of MCMC algorithms [
], we have shown that it can be significantly reduced by an adjusted model formulation and stricter convergence criteria. We hypothesised that the convergence of probabilistic genotyping models is hard to assess if models contain both continuous and discrete (e.g. genotype sets) dimensions. We therefore presented a model where the discrete dimensions are marginalised out, leading to a purely continuous and differentiable formulation. We achieved state-of-the-art inference runtimes by implementing an efficient marginalisation strategy using a deduplication system, exploiting GPU acceleration, and using an HMC sampler.
The benchmark experiments presented have shown a reduction in the standard deviation of the resulting log-likelihood ratios by around an order of magnitude when using our method compared to the state-of-the-art STRmix™ software. They have also provided validation of the inference results against known ground truth by close reproduction of previously published results.
In the future, we plan to compare our method with other PG software on the ProvedIt benchmarks [
], provide a comparative analysis of the two main algorithmic approaches (Bayesian inference vs. maximum likelihood estimation) when the same probabilistic model is used, and work on further improvements of the model.
In addition to the run-to-run variability of the inference algorithm, the overall precision observed on a sample in the laboratory also depends on multiple other factors, including the frequency of the suspect’s genotype in the background population, the proportion of the suspect’s template, the quality of the sample, and the hyper-parametrisation of the data-processing methods. All of these must therefore be fixed when comparing different probabilistic genotyping algorithms. However, it might be insightful to explore which of these factors have the largest influence on the precision of final results, and to bound the precision in the worst case.
Acknowledgements
We thank Dr. Sachin Krishnan (Center for Advanced Systems Understanding, Görlitz) and Zofia Dziedzic (University of Wrocław) for discussions on improving the mathematical notation, and Kevin Cheng (Institute of Environmental Science and Research, Ltd., Auckland) for valuable insights on the published STRmix™ results and the diagnostics used in PG software tools.
Appendix A. Supplementary data
The following is the Supplementary material related to this article.