Research Paper| Volume 60, 102744, September 2022

Ok

# Hamiltonian Monte Carlo with strict convergence criteria reduces run-to-run variability in forensic DNA mixture deconvolution

Open AccessPublished:July 11, 2022

## Highlights

• Convergence criteria strongly influence precision of MCMC DNA mixture algorithms.
• Our Hamiltonian Monte Carlo can fulfil strong convergence criteria.
• Our tool significantly reduces standard deviation of the log-likelihood ratios.
• It is the first probabilistic genotyping algorithm to benefit from GPU acceleration.

## Abstract

Motivation: Analysing mixed DNA profiles is a common task in forensic genetics. Due to the complexity of the data, such analysis is often performed using Markov Chain Monte Carlo (MCMC)-based genotyping algorithms. These trade off precision against execution time. When default settings (including default chain lengths) are used, as large as a 10-fold changes in inferred log-likelihood ratios (LR) are observed when the software is run twice on the same case. So far, this uncertainty has been attributed to the stochasticity of MCMC algorithms. Since LRs translate directly to strength of the evidence in a criminal trial, forensic laboratories desire LR with small run-to-run variability.

### Results:

We present the use of a Hamiltonian Monte Carlo (HMC) algorithm that reduces run-to-run variability in forensic DNA mixture deconvolution by around an order of magnitude without increased runtime. We achieve this by enforcing strict convergence criteria. We show that the choice of convergence metric strongly influences precision. We validate our method by reproducing previously published results for benchmark DNA mixtures (MIX05, MIX13, and ProvedIt). We also present a complete software implementation of our algorithm that is able to leverage GPU acceleration for the inference process. In the benchmark mixtures, on consumer-grade hardware, the runtime is less than 7 min for 3 contributors, less than 35 min for 4 contributors, and less than an hour for 5 contributors with one known contributor.

## 1. Introduction

Investigators present at a crime scene identify and collect the available physical evidence. As a part of this evidence, DNA samples containing material from multiple contributors (i.e. mixed DNA samples) are often collected. The resulting short tandem repeat data are occasionally affected by stochastic events such as severe peak height imbalance, drop-outs, and drop-ins [
• Balding D.J.
• Buckleton J.
Interpreting low template DNA profiles.
], especially in case of low-template samples. Manual analysis of the electropherograms (EPG) might be unreliable and biased [
• Thompson W.C.
Painting the target around the matching profile: the texas sharpshooter fallacy in forensic DNA interpretation.
,
Executive Office Executive Office of the President
Report to the President - Forensic Science in Criminal Courts: Ensuring Scientific Validity of Feature-Comparison Methods.
]. Therefore, laboratories routinely rely on validated statistical software when analysing complex DNA mixtures [
Executive Office Executive Office of the President
Report to the President - Forensic Science in Criminal Courts: Ensuring Scientific Validity of Feature-Comparison Methods.
].
The recommended metric [
• Aitken C.
• Nordgaard A.
• Taroni F.
• Biedermann A.
Commentary: Likelihood ratio as weight of forensic evidence: A closer look.
] for reporting results of DNA mixture analysis is the likelihood ratio (LR):
$LR=P(V|Hp)P(V|Hd)=∑jP(V|Sj)P(Sj|Hp)∑jP(V|Sj)P(Sj|Hd),$
(1)

where $V$ is the observed EPG, $Sj$ represents a genotype set—a list of tuples denoting the allele designations of contributors. The summations are over all possible genotype sets $j$. $Hp$ and $Hd$ are the hypotheses of the prosecutor and the defendant respectively. A hypothesis assumes inclusion of certain contributors (suspect, victim, etc.) in the mixture, as well as the background allele frequencies of the populations the contributors allegedly belong to. Usually, the difference between the hypothesis of the prosecutor and the hypothesis of the defendant is the inclusion of the suspect in the former. In the setting we consider, the number of considered contributors is fixed beforehand. $P(Sj|Hn)$ can be estimated based on the background frequencies of the alleles in the populations of interest for any hypothesis $Hn$, $n={p,d}$ [
• Bright J.-A.
• Coble M.
Forensic DNA Profiling : A Practical Guide to Assigning Likelihood Ratios.
]. Probabilistic genotyping (PG) refers to the set of statistical methods used to compute LR for given EPG data.
In order to estimate $P(V|Sj)$, assumptions about the underlying data-generating process are made. These assumptions lead to a probabilistic model $P(V|M,Sj)$ with latent variables $M$. Models where probability is estimated based on the heights of the EPG peaks are called “fully continuous”. Some of the main approaches used to infer such models are: finding the most likely set of parameters by maximum likelihood estimation [
• Bleka Ø.
• Storvik G.
• Gill P.
EuroForMix: An open source software based on a continuous model to evaluate STR DNA profiles from a mixture of contributors with artefacts.
,
• Cowell R.
• Graversen T.
• Lauritzen S.L.
• Mortera J.
Analysis of forensic DNA mixtures with artefacts.
• Cowell R.G.
• Graversen T.
• Lauritzen S.L.
• Mortera J.
Analysis of forensic dna mixtures with artefacts.
] or estimating the expression [
• Bleka Ø.
• Storvik G.
• Gill P.
EuroForMix: An open source software based on a continuous model to evaluate STR DNA profiles from a mixture of contributors with artefacts.
,
• Perlin M.W.
• Legler M.M.
• Spencer C.E.
• Smith J.L.
• Allan W.P.
• Belrose J.L.
• Duceman B.W.
Validating TrueAllele ® DNA mixture interpretation.
,
• Taylor D.
• Bright J.-A.
• Buckleton J.
The interpretation of single source and mixed DNA profiles.
]:
$P(V|Sj)=∫MP(V|m,Sj)f(m)dm$
(2)

with the prior probability density function (PDF) of the parameters $f(⋅)$. In an ideal scenario, LR is independent of any choices made by the laboratory technician and of any random confounding factors. In practice, however, LR depends on the variation in the samples from the crime scene, stochastic events occurring during PCR amplification, allele frequency sampling, parameter settings in the data-processing software,
E.g. GeneMapper™.
hyper-parametrisation of the PG software, etc. [
• Morrison G.S.
Special issue on measuring and reporting the precision of forensic likelihood ratios: Introduction to the debate.
,
• Taylor D.
• Bright J.-A.
• Buckleton J.
• Curran J.
An illustration of the effect of various sources of uncertainty on DNA likelihood ratio calculations.
]. Still, even when fixing all of these influences across identical runs on the same EPG data, residual run-to-run variability remains [
• Perlin M.W.
• Legler M.M.
• Spencer C.E.
• Smith J.L.
• Allan W.P.
• Belrose J.L.
• Duceman B.W.
Validating TrueAllele ® DNA mixture interpretation.
,
• Taylor D.
• Bright J.-A.
• Buckleton J.
The interpretation of single source and mixed DNA profiles.
,
• Bauer D.W.
• Butt N.
• Hornyak J.M.
• Perlin M.W.
Validating TrueAllele ® interpretation of DNA mixtures containing up to ten unknown contributors.
,
• Bright J.-A.
• Taylor D.
• McGovern C.
• Cooper S.
• Russell L.
• Abarno D.
• Buckleton J.
Developmental validation of STRmix™, expert software for the interpretation of forensic DNA profiles.
,
• Bright J.-A.
• et al.
STRmix™ collaborative exercise on DNA mixture interpretation.
].
So far, this run-to-run variability has been attributed to the inherent stochasticity of the Markov Chain Monte Carlo (MCMC) methods used to estimate mixture model parameters [
• Bright J.-A.
• et al.
STRmix™ collaborative exercise on DNA mixture interpretation.
,
• Bright J.-A.
• Stevenson K.E.
• Curran J.M.
• Buckleton J.S.
The variability in likelihood ratios due to different mechanisms.
]. However, as we show here, the apparent run-to-run variability is more likely caused by the choice of convergence criteria used in the MCMC sampler. This is supported by our results demonstrating that run-to-run variability can be reduced when using an MCMC method with strict convergence criteria. For this, we formulate a probabilistic model of DNA mixture deconvolution that only has continuous model parameters, marginalising over the discrete dimensions (genotype sets). While such marginalised models can be properly convergence-controlled, they are generally more expensive to compute. As we show here, however, the intrinsic structure of the posterior probability distribution can be exploited by Hamiltonian Monte Carlo (HMC), maintaining the runtimes of conventional MCMC solutions. We show that the strict convergence criteria afforded by our method significantly reduce run-to-run variability. We further present a deduplication system that efficiently handles the combinatorial growth in the number of genotype sets with increasing numbers of contributors.

### 1.1 Precision of DNA mixture deconvolution

We define precision of DNA mixture deconvolution as the inverse variance between results of runs with identical hyper-parametrisation on the same EPG data for the same hypotheses. Precision has to be considered in addition to the accuracy of a PG system [
• Morrison G.S.
Special issue on measuring and reporting the precision of forensic likelihood ratios: Introduction to the debate.
], as also the authors of PG algorithms note [
• Gill P.
• Benschop C.
• Buckleton J.
• Bleka Ø.
• Taylor D.
A review of probabilistic genotyping systems: EuroForMix, DNAStatistX and STRmix™.
]: “The argument is that the existence of variability [across PG runs — our note] raises doubts about whether any of the results should be accepted.” [
• Gill P.
• Benschop C.
• Buckleton J.
• Bleka Ø.
• Taylor D.
A review of probabilistic genotyping systems: EuroForMix, DNAStatistX and STRmix™.
]
Courts are often unaware of run-to-run variability, as expert witnesses usually report a single LR number [
• Garrett B.L.
• Crozier W.E.
Error rates, likelihood ratios, and jury evaluation of forensic evidence.
]. The issue is even more severe when the verbal scale for reporting LRs is used [
Association of Forensic Science Providers
Standards for the formulation of evaluative forensic science expert opinion.
,
European Network of Forensic Science Institutes
ENFSI guideline for evaluative reporting in forensic science.
]. The European Network of Forensic Science Institutes [
European Network of Forensic Science Institutes
ENFSI guideline for evaluative reporting in forensic science.
] suggest a scale that defines an LR between 100 and 1000 as “moderately strong support” and LR between 1000 and 10000 as “strong support”. Let us assume that we use a PG system, which, for the given case, outputs results from a normal distribution: $log10LR∼N(2.3,0.5)$. A single run of the software would give “strong support” in $≈73%$ of cases. A technique that reports confidence intervals [
• Bright J.-A.
• Coble M.
Forensic DNA Profiling : A Practical Guide to Assigning Likelihood Ratios.
,
• Bright J.-A.
• Lee S.-I.
• Buckleton J.
• Taylor D.
Revisiting the STRmix™ likelihood ratio probability interval coverage considering multiple factors.
], however, would provide the conservative answer of “moderately strong support”. This highlights the importance of high precision (i.e. low variance) in PG results.
The precision of available commercial solutions has been quantified in several studies [
• Perlin M.W.
• Legler M.M.
• Spencer C.E.
• Smith J.L.
• Allan W.P.
• Belrose J.L.
• Duceman B.W.
Validating TrueAllele ® DNA mixture interpretation.
,
• Taylor D.
• Bright J.-A.
• Buckleton J.
The interpretation of single source and mixed DNA profiles.
,
• Bauer D.W.
• Butt N.
• Hornyak J.M.
• Perlin M.W.
Validating TrueAllele ® interpretation of DNA mixtures containing up to ten unknown contributors.
,
• Bright J.-A.
• Taylor D.
• McGovern C.
• Cooper S.
• Russell L.
• Abarno D.
• Buckleton J.
Developmental validation of STRmix™, expert software for the interpretation of forensic DNA profiles.
,
• Bright J.-A.
• et al.
STRmix™ collaborative exercise on DNA mixture interpretation.
]. A standard deviation of LR of $>104$ has been reported between identical runs on a three-contributor mixture (sample ‘3-2’) when the TrueAllele® software was used [
• Bauer D.W.
• Butt N.
• Hornyak J.M.
• Perlin M.W.
Validating TrueAllele ® interpretation of DNA mixtures containing up to ten unknown contributors.
]. Results obtained with the STRmix™ software displayed 10-fold LR difference across runs [
• Bright J.-A.
• Taylor D.
• McGovern C.
• Cooper S.
• Russell L.
• Abarno D.
• Buckleton J.
Developmental validation of STRmix™, expert software for the interpretation of forensic DNA profiles.
,
• Bright J.-A.
• et al.
STRmix™ collaborative exercise on DNA mixture interpretation.
]. In order to increase precision, it has been recommended to increase the number of MCMC iterations, at the expense of a larger computational runtime [
• Bright J.-A.
• Taylor D.
• McGovern C.
• Cooper S.
• Russell L.
• Abarno D.
• Buckleton J.
Developmental validation of STRmix™, expert software for the interpretation of forensic DNA profiles.
,
• Riman S.
• Iyer H.
• Vallone P.M.
Examining performance and likelihood ratios for two likelihood ratio systems using the PROVEDIt dataset.
].
To determine when to terminate an MCMC sampler in Bayesian inference, convergence criteria are used [
• Gelman A.
Bayesian Data Analysis.
]. The most popular criterion is the univariate Gelman–Rubin (GR) diagnostic [
• Gelman A.
Bayesian Data Analysis.
,
• Brooks S.P.
• Gelman A.
General methods for monitoring convergence of iterative simulations.
], which compares pooled and within-chain variances of samples to indicate possible convergence. For given model parameters, this diagnostic has a value close to 1 if the samples from different chains result in similar estimates for the marginal distribution. Since actual convergence cannot be quantified as long as the true posterior distribution is not known, convergence criteria measure the stability of samples, and the term “convergence” in MCMC is always relative to the chosen metric.
Some of the available PG software solutions provide users with convergence diagnostics. STRmix™ [
• Taylor D.
• Bright J.-A.
• Buckleton J.
The interpretation of single source and mixed DNA profiles.
] for example calculates the ratio of pooled and within-chain variances of the likelihood of the model (personal communication, Kevin Cheng, Institute of Environmental Science and Research, Ltd., Auckland). GenoProof Mixture [
• Götz F.M.
• Schönborn H.
• Borsdorf V.
• Pflugbeil A.-M.
• Labudde D.
Genoproof Mixture 3—new software and process to resolve complex DNA mixtures.
] reports the univariate GRs for the continuous parameters. By default, both software solutions use a predefined constant number of post burn-in samples and then report the value of the diagnostic to the user. If the desired threshold (by default 1.2 in STRmix™, 1.05 in GenoProof Mixture) of the convergence diagnostic has not been achieved, the software offers an option to run additional iterations. The default GR threshold should be rather low. The authors of the diagnostic state [
• Gelman A.
Bayesian Data Analysis.
]: “The condition of GR near 1 depends on the problem at hand; for most examples, values below 1.1 are acceptable, but for a final analysis in a critical problem, a higher level of precision may be required.” [
• Gelman A.
Bayesian Data Analysis.
]
Providing evidence in court should be considered a critical problem, as the consequences of wrong or doubtful answers are significant [
• Kwong K.
The algorithm says you did it: The use of black box algorithms to analyze complex DNA evidence.
]. Other scientists researching convergence diagnostics noted [
• Vats D.
• Knudson C.
Revisiting the Gelman–Rubin diagnostic.
]: “We argue that a cutoff of GR $≤$1.1 is much too high to yield reasonable estimates of target quantities.” [
• Vats D.
• Knudson C.
Revisiting the Gelman–Rubin diagnostic.
]
This seems even more important since the statistical models used in both of these software tools combine continuous (e.g. peak intensities) and discrete (e.g. genotype sets) dimensions. However, GR cannot monitor convergence in discrete dimensions. It is therefore possible that convergence is deduced purely from the continuous parameters, while the genotype set distributions may not have converged at all, offering a possible explanation for the large run-to-run variability observed despite low GR thresholds. In our model, we avoid the issue of assessing convergence of genotype sets by marginalising them out (i.e., the genotype set is not part of the MCMC sampling process).

### 1.2 Trade-off with execution time

In forensic DNA mixture deconvolution, computational runtime is of great importance, since:
• Laboratories might have to run software multiple times with different hyper-parametrisations in order to check the robustness of the results or to test hypotheses with different numbers of contributors, different analytical thresholds, different priors, etc.
• Laboratories might want to quantify the precision of the results over several identical replicates.
• Forensic laboratories are often working under time pressure, e.g., if cases attract great media attention or laws limit detention time without charges.
In addition to the efficiency of the software implementation, there are multiple factors that influence the execution time, including the number of contributors, the number of alleles per locus, the techniques used to limit the number of considered genotype sets, the convergence criteria, the choice of the optimisation problem (maximum likelihood vs. Bayesian inference), the specification of the model, the priors used, etc.
In general, there is a trade-off between the precision and runtime. Lower runtimes can trivially be achieved by running a smaller number of MCMC iterations, at the expense of precision. Achieving perfect precision is theoretically possible, if runtime is not a priority, by integrating over the latent variables (see Eq. (2)).
The model presented here integrates over the discrete dimensions, i.e., marginalises over genotype sets in order to be able to properly monitor convergence. Such a marginalisation imposes additional computational cost, and a proper estimation of the posterior requires well-suited computational methods. We render our method practically applicable by using a Hamiltonian Monte Carlo (HMC) sampler and by introducing a deduplication system for genotype sets. Thanks to these contributions, we are able to use a strict GR threshold of 1.05 with similar or faster runtime than existing solutions.

## 2. Materials and methods

We base our probabilistic genotyping model on the work by Taylor et al. [
• Taylor D.
• Bright J.-A.
• Buckleton J.
The interpretation of single source and mixed DNA profiles.
], due to the large number of studies that describe and evaluate this model [
• Bleka Ø.
• Storvik G.
• Gill P.
EuroForMix: An open source software based on a continuous model to evaluate STR DNA profiles from a mixture of contributors with artefacts.
,
• Taylor D.
• Bright J.-A.
• Buckleton J.
The interpretation of single source and mixed DNA profiles.
,
• Taylor D.
• Bright J.-A.
• Buckleton J.
Using probabilistic theory to develop interpretation guidelines for Y-STR profiles.
,
• Buckleton J.S.
• Lohmueller K.E.
• Inman K.
• Cheng K.
• Curran J.M.
• Pugh S.N.
• Bright J.-A.
• Taylor D.A.
Testing whether stutter and low-level DNA peaks are additive.
,
• Cheng K.
• Bright J.-A.
• Kerr Z.
• Taylor D.
• Ciecko A.
• Curran J.
• Buckleton J.
Examining the additivity of peak heights in forensic DNA profiles.
]. The main assumptions in this model are the independence of the loci data given the nuissance parameters and the log-normal distribution of the ratio of the observed EPG peak heights to the peak heights predicted by the generative model (called “expected” peak heights). The generative model consists of several steps, as illustrated in Fig. 1: First, the expected contributions are computed for each genotype set from the considered set of parameters. Then, peak stutter models are applied to predict expected peaks. The next step is to calculate the standard deviation $σ$ of the log-normal distribution. Finally, the likelihood of the observed data given the parameters is calculated as the product of the likelihoods of all peaks. The model handles stochastic dropout and drop-in events.
We consider the posterior probability of parameters $M$ given the evidence $V$
$P(M|V)=P(V|M)P(M)P(V)=P(M)∑jP(V|M,Sj)P(Sj|M)P(V).$
(3)

$P(Sj|M)$ is assumed to be a flat prior. This assumption leads to genotype weights that are independent from the person(s) of the interest. In Bayesian inference, evidence is usually neglected as it is too expensive to compute and constant w.r.t. model parameters. We thus obtain the unnormalised posterior
$P(M|V)∝P(M)∑jP(V|M,Sj),$
(4)

which we use for estimating $P(V|Sj)$ (see Eq. (2)). We assume peak heights to be conditionally independent given $Sj$ and $M$, and alleles of a contributor in different loci to be independent from each other. To provide a mathematical formulation of the model, we denote:
• $Oarl$: random variable of observed peak height at locus $l$, allele $a$, replicate $r$;
• $xarl$: expected peak height at locus $l$, allele $a$, replicate $r$;
• $fX$: the PDF of a random variable $X$;
• $Q$: the “catch-all” dropout allele [
• Taylor D.
• Bright J.-A.
• Buckleton J.
The interpretation of single source and mixed DNA profiles.
];
• $hrl$: the analytical threshold of the EPG for locus $l$ and replicate $r$;
• $N(0,s2)$: a normal distribution with mean $0$ and standard deviation $s$;
• $cψ$: the peak height variance parameter for peak type $ψ$ (e.g. stutter).
The resulting model is a function $xarl(M,Sj)$ of $M$ and $Sj$. The likelihood of the observed EPG given parameters $M$ and a genotype set $Sj$ is then:
$∑jP(V|M=m,Sj)=∏l∑j∏rP(Vl,r|m,Sj,l),$
(5)

where $Vl,r$ denotes the available data at locus $l$ for replicate $r$, and $Sj,l$ is the genotype set $j$ at locus $l$. The conditional probability is modelled as:
$P(Vl,r|m,Sj,l)=((∏b∈BPObrl∣xbrl(m,Sj,l),Obrl>0)∏d∈DPdropout∣xdrl(m,Sj,l))dm.$
(6)

The inner multiplications are performed over the set $B$ of observed peaks and the set $D$ of hypothetical dropout peaks. This model formulates separately the relative likelihood of observed peaks and the probabilities of dropout events. In the following, we abbreviate the notation for $xarl(m,Sj,l)$ to $xarl$.
The genotype set weights are calculated by:
$P(V|Sj,l,l)∝∑m∈M′P(m)∏rP(Vl,r|m,Sj,l),$
(7)

where $M′$ is the sample from the posterior.

### 2.1 Observed peaks

In case a peak is observed (i.e. $Oarl>0$) our model considers a compound distribution that combines sub-models for peaks that are expected ($fZ$) and for drop-in events $fOarl∣xarl=0,Oarl>0$:
$POarl∣xarl,Oarl>hrl∝(1−drate)fZ,ifxarl>0dratefOarl∣xarl=0,Oarl>hrl,otherwise.$
(8)

For the drop-in events, we use the model introduced by Euroformix [
• Bleka Ø.
• Storvik G.
• Gill P.
EuroForMix: An open source software based on a continuous model to evaluate STR DNA profiles from a mixture of contributors with artefacts.
]:
$xarl=0,Oarl>hrl⟹Oarl−hrl∼Exp(λ).$
(9)

Two hyper-parameters based on the level of noise in negative controls are required: the drop-in rate $drate$ and the $λ$ of the exponential distribution. For the expected peaks, we assume a log-normal distribution following previous works [
• Taylor D.
• Bright J.-A.
• Buckleton J.
The interpretation of single source and mixed DNA profiles.
]:
$Z=lnOarlxarl∼N(0,σarl(x)2).$
(10)

The standard deviation of this distribution depends on the components of the expected peaks and their heights:
$σarl(x)=1xarl∑ψ∈Ψ(a)cψxψarlχψrl.$
(11)

We define $ψ∈Ψ(a)={a+2,a+1,a,a−1}$. This means that for a single allele $a$, we consider contributions from the allelic peak $a$, the backward stutter from the $a+1$ peak, the forward stutter from the $a−1$ peak, and the double backward stutter from the $a+2$ peak. Our method currently does not consider half-backward stutter.
We use one parameter for allelic peak standard deviation ($cψ=cp$ when $ψ=a$) and a different one for stutter peak standard deviations ($cψ=cs$ when $ψ≠a$). Individual parameters for the standard deviations of different types of stutter could be introduced without significant changes to the model. The expected peak heights are:
$xψarl=∑nρ(2a−ψ)arltψnrl$$xarl=∑ψ∈Ψ(a)xψarl.$

The product contribution at a selected allele $a$ is defined as:
$tanrl=ζanlαrlwne0.001δrn(mal−m¯)tr′,$
(14)

where $mal$ is the molecular weight of the allele, and $m¯$ is the average of the largest and smallest observed molecular weights on the EPG within the called peaks. The integer $ζanl$ is 1 if the genotype of contributor $n$ includes allele $a$ in locus $l$, and 0 otherwise. It is equal to 2 in case of a homozygote. The scalars $αrl$ are the locus-specific amplification efficiencies (LSAE), $wn$ are the weights of the contributors that sum up to 1, $δrn$ are the degradation parameters, and $tr′$ is the total allelic product expressed in relative fluorescent units (RFU).
Eq. (12) includes the normalised stutter ratios $ρ$ and the product contributions from a contributor $n$, $tψnrl$. To obtain $ρ$, the unnormalised stutter ratios $π$ are deduced from unambiguous profiles (see Supplementary Materials for the list of models with unnormalised stutter ratios). This is done using linear regressions based on allele designation or longest uninterrupted sequences [
• Taylor D.
• Bright J.-A.
• Buckleton J.
The interpretation of single source and mixed DNA profiles.
]. Normalisation is subsequently required, since multiple types of stutter are considered at the same time:
$ρψ′arl=1+π(a−2)arl+π(a−1)arl+π(a+1)arl−1,ifψ′=aπψ′arlρaarl,otherwise.$
(15)

The sum in Eq. (12) is over the assumed contributors. Finally, $χψrl$ models the fact that peak variance is inversely proportional to peak height []. The rationale behind the formula is explained in Section 1 of the Supplementary Material:
$χψrl=1000xψrl+1+xψrl.$
(16)

### 2.2 Dropout events

In case of a dropout, the peak is unobserved because it is below the analytical threshold. This corresponds directly to
$Pdropout∣xarl=PZ≤lnhrlxarl.$
(17)

See Supplementary Material section 1 for details.

### 2.3 Parameters of the model and priors

Table 1 summarises the parameters of our model. In typical scenarios, the total number of posterior dimensions will range from 22 (2-contributor, 1-replicate Identifiler™ mixture without modelling of the AMEL locus) to 49 (5-contributor, 3-replicate Fusion 6C mixture with AMEL locus included).
Table 1The parameters of our model. One contributor weight is omitted, as it can be inferred from the other weights (the sum of all weights has to be 1). $L$ is the number of loci, $R$ is the number of replicates, and $N$ is the number of contributors. We further reduce the number of parameters if multiple replicates are performed using the same kit. In such a case, the analysis shares the LSAEs $αrl$ across the replicates. The domain boundaries are treated as described in Subsection 3.1 of the Supplementary Materials.
ParameterMeaningDomainQuantity
$cs,cp$Stutter and peak height standard deviation$R>0$2
$tr′$Total allelic product$R>0$$R$
$wn$Weights of contributors(0,1)$N−1$
$δrn$Degradation parameters$R>0$$RN$
$αrl$Locus-specific amplification efficiencies (LSAE)$R>0$$Lall$
$i2$LSAE variance$R>0$1
We define the prior probability of the parameters as:
$P(M)=P(α|i)P(i)P(cs)P(cp).$
(18)

Here, $P(α|i)$ is the prior ln$(αrl)∼N(0,i2)$ that prevents the amplification efficiencies from drifting away from 1 too far. The prior variance $i2∼Exp(σα)$, where $σα$ is a hyper-parameter to be optimised by the laboratory [
• Bright J.-A.
• Taylor D.
• Buckleton J.
Statement on minor miscoding in BN formulae STRmix V1.08 with additional comments.
]. $P(cs)$ and $P(cp)$ are priors on the peak height standard deviation parameters, which are also present in STRmix™ [
• Bright J.-A.
• Taylor D.
• Buckleton J.
Statement on minor miscoding in BN formulae STRmix V1.08 with additional comments.
].

### 2.4 Other considerations

In order to provide conservative estimates of LR, we use Balding–Nichols sub-population correction [
• Bright J.-A.
• Coble M.
Forensic DNA Profiling : A Practical Guide to Assigning Likelihood Ratios.
,
• Balding D.J.
• Nichols R.A.
DNA profile match probability calculation: how to allow for population stratification, relatedness, database selection and single bands.
], and we report sub-source LRs [
• Taylor D.
• Bright J.-A.
• Buckleton J.
The ‘factor of two’ issue in mixed DNA profiles.
] unless specified otherwise. We consider dropout allelic contributions as separate peaks, i.e., (Q,Q) is considered heterozygous. The total number of genotype sets is reduced by considering at most one drop-in per locus and using a drop-in cap. Peaks which are in stutter position and are not included in the genotype set definition are considered drop-ins if they are abnormally tall w.r.t. the origin, see the maximum stutter ratios in Table 3. In summary, if we consider a genotype set for a locus with a tall peak, that genotype set will not be removed during preprocessing if:
• the corresponding allele score is a part of any of the genotypes from the set;
• or the height is lower than the drop-in cap and there is no other peak that can be explained as drop-in;
• or the peak can be explained as a stutter of another peak with an observed stutter ratio lower than the maximum stutter ratio.

### 2.5 Sampling algorithm

The marginalised (over genotype sets $Sj$) PG model from Eq. (4) is prohibitively costly to infer using random-walk MCMC, due to the large number of log-probabilities that need to be calculated when all possible genotype sets are considered. We thus reduce the number of required calculations of log-probabilities by using an adaptive-proposal sampler that has been successfully used in other fields: Hamiltonian Monte Carlo (HMC) [
• Duane S.
• Kennedy A.
• Pendleton B.J.
• Roweth D.
Hybrid Monte Carlo.
]. The difference between HMC and random-walk MCMC is how the proposal distribution is chosen. HMC simulates physical (i.e. Hamiltonian) system dynamics instead of choosing a random point from the neighbourhood of the current sample. This renders HMC very efficient for posteriors with multi-modal or multi-funnel shapes, significant parameter correlations, and/or high dimensionality [
• Betancourt M.
A conceptual introduction to Hamiltonian Monte Carlo.
]. Unlike GenoProof Mixture, which only changes the value of a single continuous parameter in each iteration, our sampler considers multi-variate moves. The proposal distribution for these moves is dynamically adapted across iterations. It is determined in each iteration based on the local gradient of the log-probability of the model [
• Duane S.
• Kennedy A.
• Pendleton B.J.
• Roweth D.
Hybrid Monte Carlo.
]. This is possible because our model is differentiable, as the discrete genotype sets are marginalised out.
To compute the model gradients, we rely on the proven automatic differentiation framework in TensorFlow Probability [
• Dillon J.V.
• Langmore I.
• Tran D.
• Brevdo E.
• Vasudevan S.
• Moore D.
• Patton B.
• Alemi A.
• Hoffman M.
• Saurous R.A.
TensorFlow distributions.
], where HMC is also implemented. Thanks to the portability of the TensorFlow library, we can implement both CPU and GPU versions of our inference procedure. However, a naïve TensorFlow implementation would perform below expectations due to the combinatorial growth of the number of possible genotype sets with increasing number of contributors. We therefore introduce an important performance improvement: a deduplication system that can be used with any model that considers all genotypes in a single iteration (e.g. our work, Euroformix). Additional performance optimisations are described in Section 3 of the Supplementary Material.
For the deduplication system, we consider a single locus $l$. The expected peak heights $xarl$ and the standard deviation $σarl(x)$ of their distribution (Eq. (10)) depend on the genotype sets. If we consider the output for a single allelic position, these values depend only on the continuous parameters and on the alleles at the allelic and stutter (backward, forward, double backward) positions. When multiple genotype sets are considered, the same values are needed multiple times. As an example, consider the locus shown in Fig. 2 and the likely genotype sets from Table 2 for two contributors. Computing every expected peak (and the likelihoods of observing the ratios between the observed and expected peaks) for genotype set {(10,13), (12,15)} entails computations that are also identically required for the other genotype sets. Our deduplication system ensures that each such computation is performed only once, and the result is cached and reused. This leads to large savings in multi-contributor mixtures. As an example, deduplication reduces the number of evaluations of the likelihood ratio of observed and expected peak heights in the D18S51 locus of the MIX13 Case 5 mixture from $≈$1.3 million to $≈$86 thousand when 4 unknown contributors are considered in $Hd$. We use deduplication during both the calculation of the log-probability and the calculation of the HMC gradient. The deduplication system works as follows: Before a run is performed, we precompute the indices of all duplicate entries in the functions and create two data structures, one containing the information required to evaluate the deduplicated expected peak heights, and one containing the indices for a gather operation to be performed after the log-probabilities of the deduplicated peaks have been calculated. This gather operation then unfolds the deduplicated results into a matrix that stores the log-probabilities per replicate, allelic position, chain, and genotype set. This procedure can be thought of as a cache: the forward propagation first generates the cache by calculating the deduplicated log-probabilities. Then, the cache is used during back-propagation to provide the results for all genotype sets.
Table 2Three likely genotype sets $Sj$ and the resulting contributions to the peaks of the locus in Fig. 2 from two contributors. $Cn,a=ρaa1tan1$ is the contribution of contributor $n$ from a single copy (i.e. $r=1$) of allele $a$. $Cn,a′=ρ(a−1)a1tan1$ is the contribution from stutter originating from allele $a$. For simplicity, only single-backward stutter and a single replicate are considered. Duplicate entries are highlighted with the same colour.

## 3. Results and discussions

Reliable forensic genotyping should minimise the number of false inclusions and false exclusions, exhibit high precision and have low inference runtime. We quantify these aspects for our proposed solution on publicly available test mixtures from previously published benchmarks: the ProvedIt dataset [
• Alfonse L.E.
• Garrett A.D.
• Lun D.S.
• Duffy K.R.
• Grgicak C.M.
A large-scale dataset of single and mixed-source short tandem repeat profiles to inform human identification strategies: PROVEDIt.
] and the MIX05 and MIX13 studies [
• Butler J.M.
• Kline M.C.
• Coble M.D.
NIST interlaboratory studies involving DNA mixtures (MIX05 and MIX13): Variation observed and lessons learned.
]. For the GlobalFiler™ mixtures from the ProvedIt dataset, we use the hyper-parametrisation suggested by Riman et al. [
• Riman S.
• Iyer H.
• Vallone P.M.
Examining performance and likelihood ratios for two likelihood ratio systems using the PROVEDIt dataset.
] (Table 3). For the MIX05 and MIX13 datasets, we use the hyper-parametrisation from Buckleton et al. [
• Buckleton J.S.
• Bright J.-A.
• Cheng K.
• Budowle B.
• Coble M.D.
NIST interlaboratory studies involving DNA mixtures (MIX13): A modern analysis.
]. For MIX05, MIX13, and precision studies we use the FBI extended caucasian population [
• Moretti T.R.
• Moreno L.I.
• Smerick J.B.
• Pignone M.L.
• Hizon R.
• Buckleton J.S.
• Bright J.-A.
• Onorato A.J.
Population data on the expanded CODIS core STR loci for eleven populations of significance for forensic DNA analyses in the united states.
] genetic background model. For the challenging mixtures of Section 3.3, we use the NIST 1036-Caucasian background allele frequencies [
• Steffen C.R.
• Coble M.D.
• Gettings K.B.
• Vallone P.M.
Corrigendum to ‘US population data for 29 autosomal STR loci’[Forensic Sci. Int. Genet. 7 (2013) e82–e83].
].
We denote the contributors in the hypotheses by plus-delimited strings. U stands for an unknown contributor, W is a witness, V is a victim, and all other entries denote suspects. The hypothesis V+W+S+U for example has 4 contributors: the victim, the witness, the suspect, and one unknown person. In all benchmark cases, the defendant’s hypothesis is the prosecutor’s hypothesis with the suspect replaced by an unknown contributor. All benchmark mixtures were created in laboratories with known ground-truth genotypes of the contributors.
Table 3Hyper-parameter values used in the present benchmarks. SR stands for stutter ratio. The values of the variance rates and the standard deviation of the locus-specific amplification efficiency (LSAE) were adjusted to our formulation of the model (i.e., natural logarithm instead of $log10$, 1/mean for LSAE standard deviation).
Hyper-parameterProvedItMIX05, MIX13
Drop-in rate ($drate$)0.00150
Drop-in ($λ$)0.032N/A
Allele variance ($cp2$) shape5.6533.57
Allele variance ($cp2$) rate15.75.196
Stutter variance ($cs2$) shape1.5016.97
Stutter variance ($cs2$) rate148.4629.279
LSAE ($αrl$) standard deviation32.25833.333
Rare allele frequency$2.5÷361$$2.5÷202$
Drop-in cap3 $hrl$
Max. observed backward SR0.3
Max. observed forward SR0.15
Max. observed double backward SR0.05
Wright’s $FST$ for Balding–Nichols0.01
Number of chains4
Number of burnin steps1200
Leapfrog steps per sample10
GR stopping threshold1.05
The linear stutter models are fitted on single-source profiles (forward, backward, and double-backward stutter for ProvedIt and MIX13) or on data provided by the kit manufacturer (only backward stutter for MIX05). The stutter models are available in Section 4 of the Supplementary Material.
All experiments are performed on affordable hardware in the cloud. We use NC8as_T4_v3 instances from Azure Cloud (8 vCPUs, Nvidia Tesla T4 GPU with 16GB VRAM). An exception has been made for ProvedIt Sample 3, which does not fit within 16 GB of VRAM. For this case we rented a Google Cloud virtual machine with a Nvidia A100 GPU with 40GB VRAM.
Table 4Accuracy of our method. Average LRs over 10 runs for our method (“HMC”) in comparison with other solutions on the MIX05 and MIX13 benchmarks with known ground truth. In most cases, all tested methods correctly return LRs larger than 1 for ground-truth inclusion or smaller than 1 for ground-truth exclusion. For MIX13, we compare with the reported results of Euroformix (“EFM”) version 1.11.4 and STRmix™
• Buckleton J.S.
• Bright J.-A.
• Cheng K.
• Budowle B.
• Coble M.D.
NIST interlaboratory studies involving DNA mixtures (MIX13): A modern analysis.
. The TPOX locus was ignored in the MIX05 Case 4, due to a tri-allelic pattern of the perpetrator.
Case$Hp$Truth$log10LR$
HMCEFMSTRmix™
MIX05 Case 1Perpetrator+UIncl.19.34
MIX05 Case 2Perpetrator+UIncl.25.67
MIX05 Case 3Perpetrator+UIncl.22.32
MIX05 Case 4Perpetrator+UIncl.10.44
MIX13 Case 1V+S01AIncl.20.1520.1820.15
MIX13 Case 2S02A+U+UIncl.17.0317.2816.98
MIX13 Case 2S02B+U+UIncl.7.507.887.26
MIX13 Case 2S02C+U+UIncl.5.416.115.83
MIX13 Case 2S02D+U+UExcl.−16.18−2.36−14.03
MIX13 Case 3V+W+S03AIncl.7.876.827.69
MIX13 Case 3V+W+S03BExcl.$−∞$$−∞$$−∞$
MIX13 Case 4V+SIncl.20.2319.9120.15
MIX13 Case 5S05A+U+UIncl.3.389.263.45
MIX13 Case 5S05B+U+UIncl.1.619.383.32
MIX13 Case 5S05C+U+UExcl.−8.666.45−9.22
MIX13 Case 5S05A+U+U+UIncl.6.20
MIX13 Case 5S05B+U+U+UIncl.5.96
MIX13 Case 5S05C+U+U+UExcl.2.63

### 3.1 Accuracy: MIX05 and MIX13 benchmarks

We first benchmark the performance of our method on inter-laboratory studies organised by NIST: MIX05 and MIX13 [
• Butler J.M.
• Kline M.C.
• Coble M.D.
NIST interlaboratory studies involving DNA mixtures (MIX05 and MIX13): Variation observed and lessons learned.
]. For MIX05, we analyse simultaneously replicates from different kits: ABI’s COFiler, ABI’s SGM Plus, Promega’s Powerplex 16, and ABI’s Profiler Plus. For MIX13, we follow the published studies in using only ABI’s AmpFLSTR IdentiFiler Plus replicate. All cases are analysed with a global analytical threshold of 50 and the ground truth number of contributors with two exceptions: Case 5 from MIX13 is also analysed with 3 contributors (since most laboratories taking part in the original study estimated this number), and Case 2 from MIX13 uses an analytical threshold of 30 (following the recommendation from NIST). For the capillary electrophoresis fragment analysis files, we use default GeneMapper™ ID-X 1.4 analysis settings. The results are presented in Table 4. Our algorithm reproduces most of the results of other solutions, suggesting its validity. Similar to other solutions, our algorithm provides more conservative LR values when a smaller number of contributors is chosen [
• Buckleton J.S.
• Bright J.-A.
• Cheng K.
• Budowle B.
• Coble M.D.
NIST interlaboratory studies involving DNA mixtures (MIX13): A modern analysis.
]. The only case in which our model provided LR larger than 1 for a false suspect is S05C in MIX13 Case 5. The genotype of this suspect had been deliberately constructed to share alleles with the true contributors in every locus. 74 out of 108 laboratories have included this suspect in the original study [
• Butler J.M.
• Kline M.C.
• Coble M.D.
NIST interlaboratory studies involving DNA mixtures (MIX05 and MIX13): Variation observed and lessons learned.
]; our method excludes it in a 3-contributor scenario. For MIX13 Case 4, our algorithm provides a higher LR than the reciprocal of the random match probability. An explanation for this behaviour is given in Subsection 2.1 of the Supplementary Material.

### 3.2 Precision: ProvedIt benchmark

Next we analyse the precision of our method in comparison with STRmix™ on the ProvedIt inter-laboratory benchmark [
• Bright J.-A.
• et al.
STRmix™ collaborative exercise on DNA mixture interpretation.
]. We use the same analytical thresholds as Kelly et al. [
• Kelly H.
• Bright J.-A.
• Kruijver M.
• Cooper S.
• Taylor D.
• Duke K.
• Strong M.
• Beamer V.
• Buettner C.
• Buckleton J.
A sensitivity analysis to determine the robustness of STRmix™ with respect to laboratory calibration.
] (75, 100, 60, 80, and 100 RFU for the blue, green, yellow, red, and purple dyes, respectively) and report sub-sub-source LRs following Ref. [
• Bright J.-A.
• et al.
STRmix™ collaborative exercise on DNA mixture interpretation.
].
We focus on the ‘Sample 1’ case, which was previously used to determine the precision of STRmix™ [
• Bright J.-A.
• et al.
STRmix™ collaborative exercise on DNA mixture interpretation.
].
Our method displays high precision also for a simpler ‘Sample 2’ with $log10LR=29.0144±0.00254$ in case of Analysis Method A.
Four different analysis methods of GeneMapper™ are used in comparison (called A, B, C, and D). Analysis method A includes smoothing, no normalisation, and a baseline window of 51. The other methods differ slightly: B uses normalisation, C uses a baseline window of 33, and D does not use smoothing. The results are shown in Fig. 3. For STRmix™, a 10-fold run-to-run variability in the LRs is observed with the default stopping criterion (blue), which has been attributed to the stochastic nature of MCMC [
• Bright J.-A.
• et al.
STRmix™ collaborative exercise on DNA mixture interpretation.
]. In our HMC method, we check the GR diagnostic every 300 iterations and stop when it is below 1.05 for all parameters. This results in roughly 10 orders of magnitude reduction in the LR standard deviation (cyan). For GeneMapper™ analysis method A, the standard deviation of $log10LR$ is 10.08 times lower in our method than in STRmix™ [
• Bright J.-A.
• et al.
STRmix™ collaborative exercise on DNA mixture interpretation.
] (Fig. 3A). For analysis method B (Fig. 3B), it is 8.76 times lower, for analysis method C 10.99 times lower (Fig. 3C), and for analysis method D 10.76 times lower (Fig. 3D).
In the same Fig. 3, we also quantify the influence of the convergence diagnostic on the precision of our algorithm by testing two alternative stopping criteria: In the first, we calculate the mean of GR values and stop when this mean is $<$1.2 (orange). In the second, we use the same convergence metric as STRmix™, but with our threshold value of 1.05 (brown). Due to the efficiency of our HMC sampler, we were unable to simulate chains that resulted in values of this metric reaching as high as 1.2. Interestingly, however, we find that the STRmix™ criterion was sometimes satisfied with threshold 1.05 when some GR values were still $>$1.2.
Taken together, these results show that our approach is able to significantly improve precision over the STRmix™ method, and that this improvement is due to the stricter convergence criteria, as enabled by our purely continuous model parametrisation with genotype sets marginalised out.

### 3.3 Performance: computationally challenging mixtures

Finally, we quantify the performance of our HMC method on the most challenging cases of mixtures with 3 to 5 contributors. This serves to test our approach on cases that are not simple to resolve and that are challenging both from a precision point of view and for computational runtime. For these cases, we use the set of low analytical thresholds from Riman et al. [
• Riman S.
• Iyer H.
• Vallone P.M.
Examining performance and likelihood ratios for two likelihood ratio systems using the PROVEDIt dataset.
].
We analyse 4 mixtures: Sample 1 and Sample 2 from Bright et al. [
• Bright J.-A.
• et al.
STRmix™ collaborative exercise on DNA mixture interpretation.
] analysed here without a known contributor (referred here to as Sample 1b and Sample 2b, respectively), and two following 5-person mixtures from ProvedIt with one known contributor:
• A05_RD14-0003-30_31_32_33_34-1;1;1;1;1-M3I22-0.315GF-Q1.3_01.15 s
(named here Sample 3, known contributor is Contributor 31)
• E04_RD14-0003-48_49_50_29_30-1;1;2;4;1-M2d-0.279GF-Q2.1_05.15 s
(named here Sample 4, known contributor is Contributor 29)
We use previously suggested analytical thresholds: 35 RFU for the blue channel, 65 for the green, 45 for the yellow, 50 for the red, and 60 for the purple [
• Riman S.
• Iyer H.
• Vallone P.M.
Examining performance and likelihood ratios for two likelihood ratio systems using the PROVEDIt dataset.
]. The filtered .csv data are used. For each case, we construct all possible prosecutor hypotheses with 1 suspect. We also construct the same number of false hypotheses by choosing the suspects randomly from the NIST 1036 U.S. Population Dataset [
• Steffen C.R.
• Coble M.D.
• Gettings K.B.
• Vallone P.M.
Corrigendum to ‘US population data for 29 autosomal STR loci’[Forensic Sci. Int. Genet. 7 (2013) e82–e83].
]. To quantify precision, we run our method 10 times for each ProvedIt case. The results are shown in Fig. 4. In all cases, our algorithm correctly classifies contributors and non-contributors with high precision. In all but one case where the true contributors are considered, the difference between extremal values of $log10LR$ is under 0.2. The only exception is Sample 2b when the suspect is Contributor 30. This is further analysed in Section 2 of the Supplementary Material.
The inference runtimes on the benchmark cloud instances are shown in Table 5. We show the results for all the mixtures we analysed with 3 or more unknown contributors. The results are better than the reported runtimes of previous versions of PG software solutions [
• Bright J.-A.
• Taylor D.
• McGovern C.
• Cooper S.
• Russell L.
• Abarno D.
• Buckleton J.
Developmental validation of STRmix™, expert software for the interpretation of forensic DNA profiles.
]. This suggests that despite the increased computational complexity of our marginalised model, the efficiency of HMC sampling and the efficient GPU implementation recover state-of-the-art runtimes as required for practical use of the method.
Table 5Inference runtimes. We report the average (over 10 repetitions) inference times on a single-GPU cloud instance for the listed cases with different numbers of contributors (NoC) in minutes (m) and seconds (s).
CaseNoCNoC knownInference time
MIX13 Case 2303 m 59 s
MIX13 Case 3301 m 59 s
MIX13 Case 5304 m 37 s
MIX13 Case 54031 m 15 s
ProvedIt Sample 1414 m 13 s
ProvedIt Sample 1b4019 m 39 s
ProvedIt Sample 2b306 m 55 s
ProvedIt Sample 35159 m 55 s
ProvedIt Sample 45127 m 15 s

## 4. Conclusions and future work

High precision, i.e. low run-to-run variability, of the results provided by probabilistic genotyping methods is key to building trust and to ensuring reliable discriminatory power of the analyses. While run-to-run variability has previously been attributed to the inherent stochasticity of MCMC algorithms [
• Bright J.-A.
• et al.
STRmix™ collaborative exercise on DNA mixture interpretation.
,
• Bright J.-A.
• Stevenson K.E.
• Curran J.M.
• Buckleton J.S.
The variability in likelihood ratios due to different mechanisms.
], we have shown that it can be significantly reduced by an adjusted model formulation and stricter convergence criteria. We hypothesised that the convergence of probabilistic genotyping models is hard to assess if models contain both continuous and discrete (e.g. genotype sets) dimensions. We therefore presented a model where the discrete dimensions are marginalised out, leading to a purely continuous and differentiable formulation. We achieved state-of-the-art inference runtimes by implementing an efficient marginalisation strategy using a deduplication system, exploiting GPU acceleration, and using a HMC sampler.
The benchmark experiments presented have shown a reduction in the standard deviation of the resulting log-likelihood ratios by around an order of magnitude when using our method compared to the state-of-the-art STRmix™ software. They have also provided validation of the inference results against known ground truth by close reproduction of previously published results.
In the future, we plan to compare our method with other PG software on the ProvedIt benchmarks [
• Riman S.
• Iyer H.
• Vallone P.M.
Examining performance and likelihood ratios for two likelihood ratio systems using the PROVEDIt dataset.
,
• Cheng K.
• Bleka Ø.
• Gill P.
• Curran J.
• Bright J.-A.
• Taylor D.
• Buckleton J.
A comparison of likelihood ratios obtained from EuroForMix and STRmix™.
], provide a comparative analysis of the two main algorithmic approaches (Bayesian inference vs. maximum likelihood estimation) when the same probabilistic model is used, and work on further improvements of the model.
In addition to the run-to-run variability of the inference algorithm, the overall precision observed on a sample in the laboratory also depends on multiple other factors, including the frequency of the suspect’s genotype in the background population, the proportion of the suspect’s template, the quality of the sample, and the hyper-parametrisation of the data-processing methods. All of these must therefore be fixed when comparing different probabilistic genotyping algorithms. However, it might be insightful to explore which of these factors have the largest influence on the precision of final results, and to bound the precision in the worst case.

## Acknowledgements

We thank Dr. Sachin Krishnan (Center for Advanced Systems Understanding, Görlitz) and Zofia Dziedzic (University of Wrocław) for discussions on improving the mathematical notation, and Kevin Cheng (Institute of Environmental Science and Research, Ltd., Auckland) for valuable insights on the published STRmix™ results and the diagnostics used in PG software tools

## Appendix A. Supplementary data

• MMC S1

Further details on experiments, stutter models and memory consumption data.

## References

• Balding D.J.
• Buckleton J.
Interpreting low template DNA profiles.
Forensic Sci. Int.: Genetics. 2009; 4: 1-10
• Thompson W.C.
Painting the target around the matching profile: the texas sharpshooter fallacy in forensic DNA interpretation.
Law, Probab. Risk. 2009; 8: 257-276
• Executive Office Executive Office of the President
Report to the President - Forensic Science in Criminal Courts: Ensuring Scientific Validity of Feature-Comparison Methods.
CreateSpace Independent Publishing Platform, 2016
• Aitken C.
• Nordgaard A.
• Taroni F.
• Biedermann A.
Commentary: Likelihood ratio as weight of forensic evidence: A closer look.
Front. Genetics. 2018; 9
• Bright J.-A.
• Coble M.
Forensic DNA Profiling : A Practical Guide to Assigning Likelihood Ratios.
CRC Press, Boca Raton2021
• Bleka Ø.
• Storvik G.
• Gill P.
EuroForMix: An open source software based on a continuous model to evaluate STR DNA profiles from a mixture of contributors with artefacts.
Forensic Sci. Int.: Genetics. 2016; 21: 35-44
• Cowell R.
• Graversen T.
• Lauritzen S.L.
• Mortera J.
Analysis of forensic DNA mixtures with artefacts.
J. R. Stat. Soc. Ser. C. Appl. Stat. 2014;
• Cowell R.G.
• Graversen T.
• Lauritzen S.L.
• Mortera J.
Analysis of forensic dna mixtures with artefacts.
J. R. Stat. Soc. Ser. C. Appl. Stat. 2015; 64 (This is the accepted version of the following article) (which has been published in final form at http://dx.doi.org/10.1111/rssc.12071): 1-48
• Perlin M.W.
• Legler M.M.
• Spencer C.E.
• Smith J.L.
• Allan W.P.
• Belrose J.L.
• Duceman B.W.
Validating TrueAllele ® DNA mixture interpretation.
J. Forensic Sci. 2011; 56: 1430-1447
• Taylor D.
• Bright J.-A.
• Buckleton J.
The interpretation of single source and mixed DNA profiles.
Forensic Sci. Int.: Genetics. 2013; 7: 516-528
• Morrison G.S.
Special issue on measuring and reporting the precision of forensic likelihood ratios: Introduction to the debate.
Sci. Justice. 2016; 56: 371-373
• Taylor D.
• Bright J.-A.
• Buckleton J.
• Curran J.
An illustration of the effect of various sources of uncertainty on DNA likelihood ratio calculations.
Forensic Sci. Int.: Genetics. 2014; 11: 56-63
• Bauer D.W.
• Butt N.
• Hornyak J.M.
• Perlin M.W.
Validating TrueAllele ® interpretation of DNA mixtures containing up to ten unknown contributors.
J. Forensic Sci. 2019; 65: 380-398
• Bright J.-A.
• Taylor D.
• McGovern C.
• Cooper S.
• Russell L.
• Abarno D.
• Buckleton J.
Developmental validation of STRmix™, expert software for the interpretation of forensic DNA profiles.
Forensic Sci. Int.: Genetics. 2016; 23: 226-239
• Bright J.-A.
• et al.
STRmix™ collaborative exercise on DNA mixture interpretation.
Forensic Sci. Int.: Genetics. 2019; 40: 1-8
• Bright J.-A.
• Stevenson K.E.
• Curran J.M.
• Buckleton J.S.
The variability in likelihood ratios due to different mechanisms.
Forensic Sci. Int.: Genetics. 2015; 14: 187-190
• Gill P.
• Benschop C.
• Buckleton J.
• Bleka Ø.
• Taylor D.
A review of probabilistic genotyping systems: EuroForMix, DNAStatistX and STRmix™.
Genes. 2021; 12: 1559
• Garrett B.L.
• Crozier W.E.
Error rates, likelihood ratios, and jury evaluation of forensic evidence.
J. Forensic Sci. 2020; 65: 1199-1209
• Association of Forensic Science Providers
Standards for the formulation of evaluative forensic science expert opinion.
Sci. Justice. 2009; 49: 161-164
• European Network of Forensic Science Institutes
ENFSI guideline for evaluative reporting in forensic science.
2015
• Bright J.-A.
• Lee S.-I.
• Buckleton J.
• Taylor D.
Revisiting the STRmix™ likelihood ratio probability interval coverage considering multiple factors.
2021 (Biorxiv Preprint Server)
• Riman S.
• Iyer H.
• Vallone P.M.
Examining performance and likelihood ratios for two likelihood ratio systems using the PROVEDIt dataset.
PLOS ONE. 2021; 16e0256714
• Gelman A.
Bayesian Data Analysis.
second ed. Chapman & Hall/CRC, Boca Raton, Fla2004
• Brooks S.P.
• Gelman A.
General methods for monitoring convergence of iterative simulations.
J. Comput. Graph. Statist. 1998; 7: 434-455
• Götz F.M.
• Schönborn H.
• Borsdorf V.
• Pflugbeil A.-M.
• Labudde D.
Genoproof Mixture 3—new software and process to resolve complex DNA mixtures.
Forensic Sci. Int.: Genetics. 2017; 6: e549-e551
• Kwong K.
The algorithm says you did it: The use of black box algorithms to analyze complex DNA evidence.
Harvard J. Law Technol. 2017; 31: 275
• Vats D.
• Knudson C.
Revisiting the Gelman–Rubin diagnostic.
Statist. Sci. 2021; 36
• Taylor D.
• Bright J.-A.
• Buckleton J.
Using probabilistic theory to develop interpretation guidelines for Y-STR profiles.
Forensic Sci. Int.: Genetics. 2016; 21: 22-34
• Buckleton J.S.
• Lohmueller K.E.
• Inman K.
• Cheng K.
• Curran J.M.
• Pugh S.N.
• Bright J.-A.
• Taylor D.A.
Testing whether stutter and low-level DNA peaks are additive.
Forensic Sci. Int.: Genetics. 2019; 43102166
• Cheng K.
• Bright J.-A.
• Kerr Z.
• Taylor D.
• Ciecko A.
• Curran J.
• Buckleton J.
Examining the additivity of peak heights in forensic DNA profiles.
Australian J. Forensic Sci. 2020; : 1-15
1. J.S. Buckleton, The probability of dropout and drop-in, URL https://johnbuckleton.files.wordpress.com/2019/08/the-probability-of-dropout-and-drop-in.pdf.

• Bright J.-A.
• Taylor D.
• Buckleton J.
Statement on minor miscoding in BN formulae STRmix V1.08 with additional comments.
2016
• Balding D.J.
• Nichols R.A.
DNA profile match probability calculation: how to allow for population stratification, relatedness, database selection and single bands.
Forensic Sci. Int. 1994; 64: 125-140
• Taylor D.
• Bright J.-A.
• Buckleton J.
The ‘factor of two’ issue in mixed DNA profiles.
J. Theoret. Biol. 2014; 363: 300-306
• Duane S.
• Kennedy A.
• Pendleton B.J.
• Roweth D.
Hybrid Monte Carlo.
Phys. Lett. B. 1987; 195: 216-222
• Betancourt M.
A conceptual introduction to Hamiltonian Monte Carlo.
2018
• Dillon J.V.
• Langmore I.
• Tran D.
• Brevdo E.
• Vasudevan S.
• Moore D.
• Patton B.
• Alemi A.
• Hoffman M.
• Saurous R.A.
TensorFlow distributions.
2017
• Alfonse L.E.
• Garrett A.D.
• Lun D.S.
• Duffy K.R.
• Grgicak C.M.
A large-scale dataset of single and mixed-source short tandem repeat profiles to inform human identification strategies: PROVEDIt.
Forensic Sci. Int.: Genetics. 2018; 32: 62-70
• Butler J.M.
• Kline M.C.
• Coble M.D.
NIST interlaboratory studies involving DNA mixtures (MIX05 and MIX13): Variation observed and lessons learned.
Forensic Sci. Int.: Genetics. 2018; 37: 81-94
• Buckleton J.S.
• Bright J.-A.
• Cheng K.
• Budowle B.
• Coble M.D.
NIST interlaboratory studies involving DNA mixtures (MIX13): A modern analysis.
Forensic Sci. Int.: Genetics. 2018; 37: 172-179
• Moretti T.R.
• Moreno L.I.
• Smerick J.B.
• Pignone M.L.
• Hizon R.
• Buckleton J.S.
• Bright J.-A.
• Onorato A.J.
Population data on the expanded CODIS core STR loci for eleven populations of significance for forensic DNA analyses in the united states.
Forensic Sci. Int.: Genetics. 2016; 25: 175-181
• Steffen C.R.
• Coble M.D.
• Gettings K.B.
• Vallone P.M.
Corrigendum to ‘US population data for 29 autosomal STR loci’[Forensic Sci. Int. Genet. 7 (2013) e82–e83].
Forensic Sci. Int.: Genetics. 2017; 31: e36-e40
• Kelly H.
• Bright J.-A.
• Kruijver M.
• Cooper S.
• Taylor D.
• Duke K.
• Strong M.
• Beamer V.
• Buettner C.
• Buckleton J.
A sensitivity analysis to determine the robustness of STRmix™ with respect to laboratory calibration.
Forensic Sci. Int.: Genetics. 2018; 35: 113-122
• Cheng K.
• Bleka Ø.
• Gill P.
• Curran J.
• Bright J.-A.
• Taylor D.
• Buckleton J.
A comparison of likelihood ratios obtained from EuroForMix and STRmix™.
J. Forensic Sci. 2021;