Research Paper| Volume 61, 102781, November 2022
• PDF [2 MB]PDF [2 MB]
• Top

# MPSproto: An extension of EuroForMix to evaluate MPS-STR mixtures

Open AccessPublished:September 26, 2022

## Highlights

• MPSproto requires calibration of stutters and noise.
• MPSproto increases the sensitivity of MPS-STR mixture analyses.
• MPSproto is a good alternative to EuroForMix.
• The gamma model in MPSproto performed best overall.

## Abstract

We have developed MPSproto as an extension of EuroForMix to improve handling of stutter artefacts and other typing errors that commonly occur in MPS-STR data. MPSproto implements two models for read depth: gamma and negative binomial. It differs from EuroForMix in that calibration is required before mixtures are interpreted. In this study a mixture dataset (2–4 persons) was revisited, where EuroForMix interpretation of MPS-STR mixtures using the LUS+ format was first described; the performance of this model was compared to the MPSproto models. Results indicated that, overall, the MPSproto models performed better than the conventional EuroForMix model, and the gamma model implemented in MPSproto performed best. Differences were highlighted and further investigated to establish causality. Goodness of fit tests showed that the MPSproto models were adequate for the sequence reads when a low analytical threshold was applied.

## 1. Introduction

Probabilistic genotyping software such as EuroForMix [
• Bleka Ø.
• Storvik G.
• Gill P.
EuroForMix: an open source software based on a continuous model to evaluate STR DNA profiles from a mixture of contributors with artefacts.
] (see Coble et al. [
• Coble M.D.
• Bright J.-A.
Probabilistic genotyping software: an overview.
] for an overview of different software) are important for the interpretation of mixtures of DNA from multiple contributors. However, all models of existing software packages were designed to use data generated with capillary electrophoresis (CE) based technology. During the last decade a new technology known as Massively Parallel Sequencing (MPS) has gained attention and is already viewed as a viable alternative to CE by the forensic community [
• Bruijns B.
• Tiggelaar R.
• Gardeniers H.
‘Massively parallel sequencing techniques for forensics: a review.
,
• Barrio P.A.
• et al.
The first GHEP-ISFG collaborative exercise on forensic applications of massively parallel sequencing.
,
• Alonso A.
• et al.
Current state-of-art of STR sequencing in forensic genetics.
]. Whereas CE provides information based on the number of short tandem repeats (STR), accompanied by a peak height (RFU) measurement, MPS also provides the actual underlying sequence of the repeats, which is accompanied by the number of reads (read depth). For instance, a length-based homozygote genotype of ‘10’/‘10’ could be composed of two different sequences that are isometric but heterozygous, e.g., ‘[ATCG]10’/‘[ATCG]6 ATCT [ATCG]3’. The detection of increased variability via sequencing, along with the potential to examine many more autosomal markers, leads to increased discriminatory power compared to the traditional CE approach.
However, if the DNA amount is low the discriminatory power may not necessarily be increased; in this scenario it would be more likely to observe an allele of a homozygous genotype than both alleles of a heterozygous genotype, especially if a high analytical threshold is used [
• Benschop C.C.G.
• et al.
Application of a probabilistic genotyping software to MPS mixture STR data is supported by similar trends in LRs compared with CE data’.
]. Hence to leverage the potential for increased discriminatory power with MPS, the analytical threshold must remain low. Yet, use of a low analytical threshold may increase the challenges of STR mixture interpretation – first, by the inclusion of more sequence errors (typically observed with MPS technology), and secondly, the alleles from a major contributor may produce stutter artifacts in the range of a minor donor, making it hard to distinguish between the donor alleles and stutter [
• van der Gaag K.J.
• et al.
Massively parallel sequencing of short tandem repeats—Population data and mixture analysis results for the PowerSeq™ system.
]. This latter issue is circumvented by developing suitable probabilistic genotyping models, where there is no need to definitively distinguish stutters and alleles [
• Taylor D.
• Bright J.A.
• Buckleton J.
‘The interpretation of single source and mixed DNA profiles.
,
• Cowell R.G.
• Graversen T.
• Lauritzen S.L.
• Mortera J.
Analysis of forensic DNA mixtures with artefacts.
]. There are also several additional challenges with MPS data, including relatively high imbalance both between and within markers [
• Hussing C.
• Huber C.
• Bytyci R.
• Mogensen H.S.
• Morling N.
• Børsting C.
Sequencing of 231 forensic genetic markers using the MiSeq FGx™ forensic genomics system – an evaluation of the assay and software.
], and increased stutter sizes and multiple stutter types – some of which are not observed with CE [
• Li R.
• et al.
Characterizing stutter variants in forensic STRs with massively parallel sequencing.
]. For instance, the ‘n0’ stutter is composed of both backward and forward stutters in the same sequence strand, hence there is no net change in repeat length.
Vilsen et al. [
• Vilsen S.B.
• Tvedebrink T.
• Eriksen P.S.
• Hussing C.
• Børsting C.
• Morling N.
Modelling allelic drop-outs in STR sequencing data generated by MPS.
] established negative binomial models for read depth because reads are discrete counts. In contrast, EuroForMix is based on the gamma model which assumes a continuous outcome. Furthermore, Vilsen et al. [
• Vilsen S.B.
• Tvedebrink T.
• Eriksen P.S.
• Hussing C.
• Børsting C.
• Morling N.
Modelling allelic drop-outs in STR sequencing data generated by MPS.
] suggested that parameters for inter-locus balance and stutter models could be estimated based on a calibration dataset before attempting to interpret complex mixtures. As part of the stutter model, Vilsen et al. [
• Vilsen S.B.
• et al.
Stutter analysis of complex STR MPS data.
] used the concept of block length of missing motif (BLMM) to describe stutter size. For instance, if an allele is ‘[ATCG]6 ATCT [ATCG]3’, and a stutter product ‘[ATCG]5 ATCT [ATCG]3’ was produced from it, then the BLMM value of the stutter product is 6, which is also the longest uninterrupted stretch (LUS) value. If instead the stutter product ‘[ATCG]6 ATCT [ATCG]2’ was produced, the BLMM value would be 3, whereas the LUS value would still be 6. It is well established that the expected stutter proportion increase with the value of BLMM [
• Li R.
• et al.
Characterizing stutter variants in forensic STRs with massively parallel sequencing.
,
• Vilsen S.B.
• et al.
Stutter analysis of complex STR MPS data.
,
• Cheng K.
• et al.
Modeling allelic analyte signals for aSTRs in NGS DNA profiles.
,
• Agudo M.M.
• Aanes H.
• Roseth A.
• Albert M.
• Gill P.
• Bleka Ø.
A comprehensive characterization of MPS-STR stutter artefacts.
].
EuroForMix (from version 1.11.3) supports data given in “LUS format” [
• Just R.S.
• Irwin J.A.
Use of the LUS in sequence allele designations to facilitate probabilistic genotyping of NGS-based STR typing results.
,
• Bleka Ø.
• Just R.
• Le J.
• Gill P.
An examination of STR nomenclatures, filters and models for MPS mixture interpretation.
]. Here sequences are converted to a format of type “x_y” where x is the regular CE designation based on the number of repeats, and y is the longest uninterrupted sequence (LUS). EuroForMix uses this format to identify backward or forward stutters where the LUS information is utilized. This improves the discriminatory power when DNA quantities are high. However, for minor contributors evaluated in a prior study, the performance was similar to that achieved with ordinary CE nomenclature [
• Bleka Ø.
• Just R.
• Le J.
• Gill P.
An examination of STR nomenclatures, filters and models for MPS mixture interpretation.
]. In that study, which was based on ForenSeq DNA Signature Prep Kit (Verogen, Inc.) data developed using the manufacturer’s protocol, the largest influence on performance was related to how the data were pre-filtered: use of a static analytical threshold with no stutter prefiltering performed best, since potential alleles from minor contributors were not removed. Also, that study considered an analytical threshold of 30 reads since EuroForMix had not been adapted to work with lower values. Hence, this provides a motivation to develop a model which simultaneously lowers the analytical threshold and applies a probabilistic model to evaluate stutters and potential sequence errors.
Multiple types of stutters have been observed for MPS-STRs: n-1, n-2, n+1, n0, and n-1 from a region other than the LUS [
• Li R.
• et al.
Characterizing stutter variants in forensic STRs with massively parallel sequencing.
,
• Agudo M.M.
• Aanes H.
• Roseth A.
• Albert M.
• Gill P.
• Bleka Ø.
A comprehensive characterization of MPS-STR stutter artefacts.
]. The different stutter types are easy to recognize if a bracketed format is used to represent the sequence string [
• Agudo M.M.
• Aanes H.
• Roseth A.
• Albert M.
• Gill P.
• Bleka Ø.
A comprehensive characterization of MPS-STR stutter artefacts.
]. The bracket format facilitates analyst visualization since the compact form is more easily read and interpreted. Recently, the lusSTR tool was developed to automatically convert sequences into this format [

R. Mitchell , D. Standage, lusSTR. Bioforensics, 2021. (Online). https://github.com/bioforensics/lusSTR.

]. In our previous study [
• Agudo M.M.
• Aanes H.
• Roseth A.
• Albert M.
• Gill P.
• Bleka Ø.
A comprehensive characterization of MPS-STR stutter artefacts.
] we used the bracket format to identify many of the afore-mentioned stutter types, and we quantified stutter proportion sizes using a beta regression with BLMM as an explanatory variable. In the present work we continue to use the bracket format in an adapted model for MPS-STRs and show how a calibrated stutter model, similar to [
• Agudo M.M.
• Aanes H.
• Roseth A.
• Albert M.
• Gill P.
• Bleka Ø.
A comprehensive characterization of MPS-STR stutter artefacts.
], can be directly integrated into probabilistic genotyping.
This paper is structured as follows: The Materials and methods section introduces the dataset used for model calibration and mixture evaluation. Subsequently, the mathematical details of MPSproto are presented, and we describe how calibration of the model was carried out. Then a mixture evaluation study is described, in which several kinds of models were compared, implemented as part of either MPSproto or EuroForMix. The Results section then describes the outcomes of the comparison study, including model performance and goodness of fit.

## 2. Materials and methods

### 2.1 Data

MPS-STR data developed using the ForenSeq™ DNA Signature Prep Kit (Verogen, Inc.) with DNA Primer Mix B according to the manufacturer’s protocols were considered. The ForenSeq UAS version 1.3 (Verogen, Inc.) was used to obtain the sequences and read depths. This version has a hard-coded detection threshold, resulting in sequences with a minimum of 11 reads in the exported data.
A total of 30 undegraded samples were used for model calibration of MPSproto: each was run with template quantities of 2 ng, 1 ng and 0.5 ng, for a total of 90 single-source profiles. Libraries were pooled in sets of 32 for sequencing (30 single-source libraries plus a positive and negative amplification control per run).
The dataset used for the mixture evaluation study was the same as that used by Bleka et al. [
• Bleka Ø.
• Just R.
• Le J.
• Gill P.
An examination of STR nomenclatures, filters and models for MPS mixture interpretation.
], composed of four anonymized samples (these are named ‘Ref1’,…, ‘Ref4’) from an African American population dataset [
• Gettings K.B.
• Borsuk L.A.
• Steffen C.R.
• Kiesler K.M.
• Vallone P.M.
Sequence-based U.S. population data for 27 autosomal STR loci.
]. ForenSeq libraries were pooled in sets of 32 for MiSeq FGx sequencing (30 mixture libraries plus a positive and negative control per run). A list of the mixtures is given in Table S2 for ease of reference.
Allele frequencies were also based on the African American dataset [

L.I. Moreno, T.R. Moretti, Short Tandem Repeat Genotypes of Samples From Eleven Populations Comprising the FBI’s Population Database, 1, 2019.

], where the alleles were sequence-based. New allele observations were imputed with the minimum allele frequency 0.00146 (this was the minimum observed frequency), and frequencies were normalized afterwards (ensuring sum equal to 1). Allele frequencies are provided, together with the mixture dataset from http://www.euroformix.com/MPSproto. The $Fst$ coancestry coefficient (also known as the “theta-correction”) was set to 0.01 for all calculations.
All sequences were evaluated using a re-implementation of lusSTR, LUSstrR (v0.2.2) [

Ø. Bleka, LUSstrR. 2022 (Online).https://github.com/oyvble/LUSstrR.

]. The “forward bracket format” was used for MPSproto and the LUS+ format was used for EuroForMix. MPSproto requires that sequences are converted into bracket format of type “$x1a1x2a2…$” where each block consists of a motif $x$ which is repeated $a$ times, and the blocks are separated with a whitespace. If $a=1$, then no brackets are used. For instance, for an allele sequence with $x1$=ATCG, $x2$=ATG, $a1=5$, and $a2=1$, the bracketed format of the allele would be “$ATCG5ATG$”.

### 2.2 The MPSproto model

#### 2.2.1 Models for sequence reads

The MPSproto model was based on an extension of EuroForMix with marker efficiency parameters included, inspired by Vilsen et al. [
• Vilsen S.B.
• Tvedebrink T.
• Eriksen P.S.
• Hussing C.
• Børsting C.
• Morling N.
Modelling allelic drop-outs in STR sequencing data generated by MPS.
]. The implemented version we used, v0.8.1, supports two different distributions for the read depths: gamma (GA) and negative binomial (NB). GA assumes that read depths are continuous, as the model was originally designed for CE-based data (i.e., for peak heights). NB is a discrete distribution, based on read depth, and is proposed as a suitable choice for MPS data. We parameterized the NB model such that $μ$ and $ω$ had the same interpretation as for the GA model: the expectation and the coefficient of variation of a heterozygous read depth in a single-source profile without any degradation or stutter effects.
The models for signal intensity (read depth) $Ym,a$ for sequence $a$ at marker $m$, all observed above analytical threshold $Tm>0$ (marker specific), are defined as follows:
-GA: gamma(shape=$ψm,a$, scale=$μω2$)
-NB: negative-binomial(mu=$ψm,a$, size=$μ/(μω2−1)$)
where, without assuming any stutter model, we have:
$ψm,a=Amω−2∑k=1Kπknm,a,k$
(1)

for the GA model, and for the NB model we have:
$ψm,a=Amμ∑k=1Kπknm,a,k$
(2)

Here the models assume that the evidence is comprised of K individuals (K-person mixture), where each individual contributes with mixture proportion parameter $πk∈0,1$. $nm,a,k$ is the allele contribution (0, 1 or 2) decided by the assumed genotypes vector $gm=g1,g2,…,gK$: $nm,a,k=Igk,1=a+Igk,2=a$, where $gk=gk,1/gk,2$ and $I$ is the indicator function. The marker efficiency parameters $A=(A1,…,AM)$ are restricted such that $1M∑Am$= 1. Formulae for the probability density and cumulative functions $f$ and $F$ of the read depths, and the derivation of the size argument are included in the Appendix B, Appendix B.

#### 2.2.2 Stutter model

MPSproto extends the stutter model of EuroForMix by allowing a general set of stutter types and has already been described in [
• Agudo M.M.
• Aanes H.
• Roseth A.
• Albert M.
• Gill P.
• Bleka Ø.
A comprehensive characterization of MPS-STR stutter artefacts.
]. We define $ϵm,at$ as the expected stutter proportion of a particular stutter type $t$ coming from donor allele $a′$. Likewise, we define $ϵm,a′′t$ to be the stutter product coming from donor allele $a$. Then, considering only stutter type $t$, the shape or mu argument (GA or NB model) is modified to:
$ψm,a=1−ϵm,a′′tψm,a*+ϵm,atψm,a′*$

where $*$ indicates the $ψm,a$ identity without assuming any stutter model (Eq. 1 for the GA model or Eq. 2 for the NB model). The first term is the “donates” part whereas the second term is the “receives” part. When several stutter types ($t=1,…$) are defined, the formula extends to:
$ψm,a=1−∑tϵm,a"t1ψm,a*+∑tϵm,atψm,a*,$

As an example, consider contributions of allele ‘4’, ‘5’, and ‘6’ (i.e., $ψ4*,ψ5*,ψ6*$ $>0$) and assume that only backward (BW) and forward (FW) stutter types are used. The expected stutter proportions are estimated at $a=5$, $ϵ5BW$ and $ϵ5FW$, the BW stutter at allele $a=4$, $ϵ4BW$, and the FW stutter at allele $a=6$, $ϵ6FW$. The formula above hence gives:
$ψ5=1−ϵ4BW−ϵ6FWψ5*+ϵ5FWψ4*+ϵ5BWψ6*$

In MPSproto we model the expected stutter proportion of allele $a$ for stutter type $t$ with a logit-link function:
$ϵm,at=logit−1(β0m,t+β1m,tx1,am,t+β2m,tx2,am,t)$
(3)

where $βm,t=(β0m,t,β1m,t,β2m,t)$ are model coefficients of the regression and $x1,am,t$ is the BLMM for stutter $a$ and $x2,am,t$ is the BLMM for a different motif (also for stutter $a$). The latter is used for n0 stutters where two parts of the sequence stutter simultaneously, but there is no net change of the size of the stutter product compared to its parent allele.

#### 2.2.3 Noise model

Sequences in the calibration dataset which do not originate from any of the known contributors or defined stutters types are classified as noise. The noise model consists of two parts and is carried out separately for each marker: 1) the number of unique noise sequences and 2) the read depth for the corresponding noise sequences.
• (A)
The number of (unique) noise sequences, $k$, follows a geometric distribution with parameter $p$:
$Pr(k|p)=p1−pk$

• (B)
The read depth of a noise sequence, $y$, is proportional to a Type I Pareto distribution with parameter $λ$ [
• Krishnaji N.
Characterization of the Pareto distribution through a model of underreported incomes.
]:
$dy|λ=c*λTmλy−(λ+1)$

where $Tm$ is the analytical threshold for a particular marker. We included an upper limit cap of $y=1000$ read depth for noise, such that the normalizing constant $c$ is calculated as$c=1/(∑x=Tx=1000λTmλx−(λ+1))$
Importantly, the noise parameters, $λ$ and $p$, are marker specific and must be decided based on a “noise dataset”, i.e., neither contributor alleles nor stutter.

#### 2.2.4 Dropout and degradation model

In MPSproto, dropout is modelled the same way as described for EuroForMix: the dropout probability of an unobserved allele is the integral from zero to $Tm$ for GA (continuous) or the sum from zero to $Tm−1$ for NB (discrete). As with EFM, the compound Q-allele (indicated as ‘99’ in the MPSproto program) was used to designate a dropout allele, representing all the non-observed alleles from the allele outcome (defined by the allele frequency information).
MPSproto also supports degradation modeling, where $ψm,a*$ is scaled with $τ(bm,a−100)/125$, where $τ$ is the degradation slope parameter and $bm,a$ is the fragment length of allele $a$ at marker $m$. To enable this, sequences in the evidence profile must be in the following format: “RU:Sequence”, where “RU” is the repeat unit (CE format), “Sequence” is the bracket format, and the two are separated by a colon. For instance, sequence [ATCG]10 must be described in the format “10:[ATCG]10”. The corresponding kit (e.g., “ForenSeq”) must also be specified, and the fragment length of each common RU must be defined. For the “ForenSeq” kit, the fragment lengths for each CE allele can be obtained from https://verogen.com/wp‐content/uploads/2018/08/ForenSeq‐DNA‐Prep‐Guide‐VD2018005‐A.pdf

#### 2.2.5 Implementation

MPSproto is implemented as an R-package available from github (https://github.com/oyvble/MPSproto) or the webpage http://www.euroformix.com/MPSproto. Numerical tests of the likelihood functions (for both GA and NB models) are included in the package, where calculations from the C++ code are compared to calculations from the R code.

### 2.3 Calibration of MPSproto

The model calibration of MPSproto consists of three main steps:
• 1)
Estimation of marker amplification efficiencies
• 2)
Identification of stutter types and estimation of regression coefficients
• 3)
Estimation of the noise model
The stutter and noise calibrations are conducted on a per-marker basis where all sequences are assumed to be above the defined analytical threshold, which can be specified differently for each marker. In this paper $Tm=11$ was used for all markers analysed with MPSproto.

#### 2.3.1 Estimating marker amplification efficiency

To estimate the marker amplification efficiency parameters $A=(A1,…,AM)$, the read depth sum per marker per sample is used. For each calibration sample $s=1,…,S$, the sum of reads per marker is calculated as $ys,m=Σays,m,a$. If $ys,m≥Tm$ (no marker dropout), we assume that the sum is gamma distributed (for the GA model), with $2Amωs−2$ as the shape argument and $μsωs2$ as the scale argument. On the other hand, if $ys,m (marker dropout is present), then the dropout probability is calculated as
$PrYs,m

where $Fs,m$is the cumulative density function of the gamma distribution (applied with same shape and scale argument, see Appendix). The estimated marker amplification efficiency parameters are obtained by maximizing the likelihood
$LA,μ,ω=∏s=1S∏m=1MPrYs,m=ys,m|Am,μs,ωs.$

#### 2.3.2 Stutter calibration

The following candidate stutter types are considered:
• BW1: Backward stutter (n-1) where the longest repeat is stuttering.
• FW1: Forward stutter (n+1) where the longest repeat is stuttering.
• DBW1: Double backward stutter (n-2) where the longest repeat is stuttering.
• BW2: Backward stutter (n-1) where a repeat other than the longest repeat is stuttering.
• FWBW: Longest repeat produces forward stutter, and a repeat other than the longest repeat produces backward stutter (n0)
• BWFW: Longest repeat produces backward stutter, and a repeat other than the longest repeat produces forward stutter (n0)
This stutter naming format is required by MPSproto and the analysis is carried out separately per marker. All stutter relations are identified based on the bracketed format: the internal function getStutteredSequence is applied on all donor alleles and compared with artefacts. Stutter artefacts are not considered in the calibration if multiple parental allele candidates are observed.
This is only true for the calibration model. All stutters are considered by the interpretation model, with no removal.
For instance, if artefact [ATCG]11 together with donor alleles [ATCG]10 and [ATCG]12 is observed, then the artefact could either be a forward stutter of [ATCG]10, a backward stutter of [ATCG]12, or both, and hence would be removed. Similarly, as carried out by [
• Agudo M.M.
• Aanes H.
• Roseth A.
• Albert M.
• Gill P.
• Bleka Ø.
A comprehensive characterization of MPS-STR stutter artefacts.
], beta regression models were applied per stutter type to estimate the expected stutter proportion in Eq.3. Here, the slope parameter(s) $β1$ (and $β2$) are only included if estimated as non-negative and significantly different from zero (p-value < 0.05 applied).
For the calibration performed in the present study, a minimum of three stutter artefacts were required per stutter type. Artefacts which occurred fewer than three times were instead used to estimate the noise model (see Section 2.3.3).

#### 2.3.3 Estimating noise model

Sequences that are not known donor alleles or are not recognized as stutter are assigned as noise. The procedure to estimate the noise model parameters ($p,λ)$ follows, carried out for each marker separately:
• (A)
Let the number of noise sequences per sample be given as $x1,x2,…,xS$. An additional observation is always appended as $xS+1=1+maxx1,…,xS$ to make the model more robust for new observations. The $p$ parameter is then estimated as $pˆ=x̅+1−1$, where $x̅$ is the mean of the observations.
• (B)
Let the read depth of the noise sequences be given vectorized as $y1,y2,…,yn$. An additional observation is always appended as $yn+1=Tm$ to make the model more robust for new observations. If $n=0$ (no observations) two additional observations, $yn+2=Tm$ and$yn+3=Tm+1$, are appended to enable the model to be fitted. The $λ$ parameter is then estimated using the maximum likelihood estimate, obtained as $λˆ$.

#### 2.3.4 Implementation and functionalities

The entire calibration procedure can be performed by running the function calibrateModel in MPSproto. This eases the calibration steps for users. Additionally, a tutorial is available at https://github.com/oyvble/MPSproto/blob/master/doc/MPSproto_tutorial.html.

### 2.4 Inference of MPS evidence and LR calculations

#### 2.4.1 Definition of the likelihood function and the calculation of LR

For a new evidence profile, the unknown model parameters are $θ$=$π,μ,ω,τ$, whereas the stutter model parameters, marker amplification efficiency and noise model are based on the calibration. As with EuroForMix, the LR calculation is based on optimizing the likelihood function over the model parameters for a specific proposition $H$:
$Lθ|H=∏m=1M∑gm∈GmHPrEm|θ,gmPr(gm|H)$

where for a given marker $m$, $Em$ is the evidence information (sequences and read depths), and $gm=(g1,…,gK)$ is the joint genotype combination of $K$ contributors, traversed over the joint genotype outcome $GmH$ defined from proposition $H$. More specifically:
$PrEm|θ,gm=Pr(k|pˆm)∏a∈Emf(ym,a|θ,gm)∏a∉EmF(Tm|θ,gm)$

where $k$ is the assumed number of noise sequences for a given genotype combination (not derived from a contributor or a stutter). If a sequence $a$ is assumed to be a noise sequence, then $fym,a|θ,gm=d(ym,a|λˆm)$. The calculation of the genotype probabilities, $Pr(gm|H)$, includes the $Fst$coancestry coefficient to take sub-population structure into account, with formulae described in [
• Bleka Ø.
• Storvik G.
• Gill P.
EuroForMix: an open source software based on a continuous model to evaluate STR DNA profiles from a mixture of contributors with artefacts.
]. Finally, the LR for evaluating the evidence given two hypotheses, $Hp$ and $Hd$, is calculated using the maximum likelihood approach:
$LR=maxθpL(θp|Hp)/maxθdL(θd|Hd).$

#### 2.4.2 Mixture evaluation study

LRs for the 2 and 3-person mixtures were calculated by comparing the following propositions, where the person of interest (POI) is the hypothesized contributor:
• -
$Hp$: “POI + ($K−1$) unknown unrelated are contributors”
• -
$Hd$: “$K$ unknown unrelated are contributors”
where $K$ is specified as the number of known contributors to the constructed mixture. For 4-person mixtures, the major contributor (always designated as ‘Ref1’) was conditioned under both propositions such that the hypotheses were modified to:
• -
$Hp$: “POI + Ref1 + 2 unknown unrelated are contributors”
• -
$Hd$: “Ref1 + 3 unknown unrelated are contributors”
LR calculations were performed using each of the following models:
• 1.
MPSproto applied with gamma model (GA)
• 2.
MPSproto applied with negative binomial model (NB)
• 3.
EuroForMix (EFM)
For MPSproto interpretations, a calibrated model was first estimated from the single-source data described in Section 2.1 using the steps described in Section 2.3. For LR calculations an analytical threshold $Tm$ of T = 11 was applied to match the value used for the calibration. The degradation model was not used. A tutorial describing how to use MPSproto to calculate the LR can be found at https://github.com/oyvble/MPSproto/blob/master/doc/MPSproto_tutorial_advanced.html.
The EuroForMix evaluation originally performed in Bleka et al. [
• Bleka Ø.
• Just R.
• Le J.
• Gill P.
An examination of STR nomenclatures, filters and models for MPS mixture interpretation.
] was repeated in the present study to account for slight updates to the allele frequency file. The same drop-in model as described by [
• Bleka Ø.
• Just R.
• Le J.
• Gill P.
An examination of STR nomenclatures, filters and models for MPS mixture interpretation.
] was used, where the drop-in probability equals 0.05 and the hyperparameter lambda equals 0.01. The degradation model was always applied. No stutter filters were applied, and both forward and backward stutter models in EuroForMix utilized the LUS+ format. If EuroForMix failed to converge for any of the propositions, the model configuration was reduced to calculate with the backward stutter model only; if this also failed, the model was further reduced to apply no stutter model. A static analytical threshold of T = 30 was mainly applied since this was also used in [
• Bleka Ø.
• Just R.
• Le J.
• Gill P.
An examination of STR nomenclatures, filters and models for MPS mixture interpretation.
]. However, we also investigated the effect when lowering the threshold to T = 20 or T = 11.
LR calculations were carried out for all true contributors as the POI ($Hp$ true), except for the 4-person mixtures where Ref1 was a known contributor under both propositions as describe above. This comprised a total 160 $Hp$ true comparisons. Examples where models obtained very different LR values were inspected in detail: the LR per marker and model fits were further investigated.
Non-contributor tests ($Hd$ true) were conducted for additional reference profiles (‘Ref5’,…, ‘Ref10’). Ref3 and Ref4 were also non-contributors for the 2-person mixtures, and Ref4 was also a non-contributor for the 3-person mixtures. This comprised a total of 420 non-contributor comparisons.
The following maximum likelihood function optimizer parameters were used for both MPSproto and EuroForMix: Seed= 1234; number of required optimizations (nDone)= 3; optimization accuracy (steptol)= 0.001.

#### 2.4.3 Model comparisons

For the model comparison we plotted the pairwise LR values for each model and highlighted differences, using Receiver Operating Characteristics (ROC) curves [
• Zweig M.H.
• Campbell G.
‘Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine.
]. For a certain threshold $t$, a true positive event is when $LR>=t$ when $Hp$ is true. If instead $Hd$ was true, the event would be a false positive. The true positive rate is defined as the proportion of true positive events out of all $Hp$ true events, whereas the false positive rate is defined as the proportion of false positive events out of all $Hd$ true events. We also calculated the area under the ROC curve (AUC) for each model using the pROC R-package, including the 95 % confidence interval (function delong).
Goodness of fit for the fitted models was quantified using the Kolmogorov-Smirnov test by comparing an empirical cumulative density function against the cumulative uniform distribution. This was done for each optimized likelihood ($Hp$ and $Hd$ separately) of each comparison and model. The empirical cumulative values were obtained by sorting the cumulative probabilities obtained from the R-package function validMLE (MPSproto) or validMLEmodels (EuroForMix). A small p-value (e.g. <0.01) from the test indicates a non-adequate model for the read depth.

## 3. Results

### 3.1 Calibration of MPSproto for the mixture evaluation study

All 90 single-source profiles were used to estimate the marker amplification efficiency parameters (Table 1). Only five marker dropouts were registered.
Table 1Gives a summary of the parameter estimates from the MPSproto calibration. The estimated parameters varied across the markers, illustrating that assignment of universal parameters would not be ideal.
MarkerAplambdaBW1FW1DBW1BW2FWBWBWFW
CSF1PO0.3240.98915.6-4.44/0.133NANANANANA
D10S12480.6290.913.85-3.91/0.114NA-3.83NANANA
D12S3910.5740.6233.13-3.23/0.123NA-5.47/0.139-4.99/0.227-3.78NA
D13S3170.8220.854.08-5.73/0.201-4.02NANANANA
D16S5391.140.7653.13-3.79/0.147-4.4-7.06/0.246NANANA
D17S13010.5430.98915.6-4.25/0.2NA-3.87NANANA
D18S510.9680.7782.41-3.69/0.101-6.74/0.16-6.02/0.11NANANA
D19S4331.120.2043.55-3.74/0.102NANANANANA
D1S16560.3050.8674.26-4.3/0.17NANANANANA
D20S4822.440.8673.12-2.13-3.97-4.87NANANA
D21S110.5330.7463.87-3.54/0.0843-3.61NA-3.36NANA
D22S10451.440.9486.04-3.6/0.118-5.57/0.146-6.2/0.136NANANA
D2S13381.210.4461.85-2.63/0.0505NA-5.37/0.0982-6.92/0.427-4.38-4.37
D2S4411.140.9013.39-3.56-4.9NANANA-4.31
D3S13582.080.2862.43-4.15/0.127-6.85/0.177-7.05/0.178NA-4.69-4.83
D4S24080.9930.8924.01-5.27/0.208-6.86/0.23NANANANA
D5S8180.290.96812.4-2.85NANANANANA
D6S10431.090.853.03-3.67/0.1-7.42/0.259-4.59NANANA
D7S8200.7910.4672.24-5.17/0.222-4.35NANANANA
D8S11791.480.8753.21-3.39/0.122-6.83/0.217-6.3/0.178NANANA
D9S11221.550.8123.46-4.77/0.207-4.19-7.53/0.231NANANA
FGA1.050.6553.56-3.03/0.0873-5.78/0.112-5.79/0.135NANANA
PENTA E0.2560.9686.17-2.5NANANANANA
TH012.770.7282.97-4.19/0.193-4.81-5-5.25NANA
TPOX0.8940.9383.81-5/0.167NANANANANA
VWA0.3820.9583.09-3.56/0.12NANANANANA
PENTA D0.1810.4312.83-5.7/0.189NANANANANA
Table 1: Estimated parameters used for the calibrated MPSproto model. A is the estimated marker amplification efficiency, p is geometrical model parameter for the number of noise alleles and lambda is pareto model parameter for the noise size. The remaining columns indicate the stutter coefficients (intercept and slope) of the different stutter types separated by a backslash. Slope was only included if significant. “NA” means that the stutter type was not modelled.
In total, 3240 sequences unambiguously identified as stutter (as defined in Section 2.3.2) were used to estimate stutter regression coefficients (Table 1). In addition, 1181 sequences were identified as “noise” and were used to estimate the noise model parameters (Table 1). Eleven of these were originally denoted as stutter, but, as noted in Section 2.3.2, were re-categorized as noise since there were too few observations (< 3) for a given stutter type and marker. The vast majority of the noise sequences (1163, or 98.5 %) differed by a single base compared to one of the parent alleles. The number of observations per marker and per stutter/noise type is given in Table S1 (supplementary).

### 3.2 LR for $Hp$true comparisons

LR values obtained using the different models were compared in terms of “ban” unit (for instance, “3 bans” is same as log10LR=3). Results of all $Hp$ true calculations can be found in Supplementary Table S3. Overall, the GA model returned much greater LR values than the EFM (T = 30) model (Fig. 1A). There were a few exceptions (datapoints to the right of the line of equality in Fig. 1A), but only one with more than 2 bans in difference: Comparison Ref3 to mixture 4Q_0.067ng_10‐2‐2‐1 gave 3.6 bans for EFM but only 0.6/− 0.11 for GA/NB. By contrast, there were 56 instances where the GA model resulted in LRs more than 6 bans greater than for EFM (see Supplemental Table S4). This magnitude of difference was observed across the entirety of the LR range of EFM. For instance, for Ref3 compared to mixture 3Q_1.35ng_20‐5‐1, EFM obtained 0.2 bans, whereas an LR of more than 15 bans were observed for GA. The cause of this difference was mainly due to a high number of dropouts: 14 allele dropouts for EFM, but only 3 for GA. Several of the other comparisons which returned a difference greater than 15 bans also had a large difference in the number of dropouts. Details are highlighted in Appendix B, Appendix B.
Comparing the results obtained with the GA and NB models, it was observed that LRs for GA were almost always larger than for NB, however the values were quite similar for most comparisons (Fig. 1B): 76 % returned a difference of less than 2 bans, whereas for 20 % the GA model produced LRs 2–6 bans higher than NB. There were only two instances where the LR using the NB model was more than 2 bans higher than for GA: Ref2 compared to 2Q_0.075ng_2‐1 returned 13.7 and 10.9 bans respectively, and Ref2 compared to 4Q_0.67ng_2‐1‐1‐1 returned 13.2 and 9.4 bans. Conversely, there were four instances in which the GA model resulted in an LR more than 6 bans higher than that obtained with the NB model (see Table S5 and Supplemental Section 1 for details). A comparison of the LRs obtained using the NB model and EFM is illustrated in Fig S1 in the supplementary material, showing lesser LR differences compared to the GA vs EFM comparison.
LRs were large (nearly always > 10 bans) for all three models when there was no allelic dropout for the POI (Fig. 2). However, they were drastically reduced when there was at least one allele dropout, reflecting a reduced quantity of DNA and loss of information. At any given level of allelic dropout the LRs produced by the different models were similar, although there was a tendency for EFM to return larger LR values when at least seven dropouts were present. Overall, the GA model typically returned larger LR values compared to the NB model across the full dropout range.
As Fig. 2 indicates, the LR trended towards 1 as the number of dropouts increased and was sometimes less than 1. The fewest number of dropouts that resulted in an LR below 1 was inspected for each model: this was nine for the NB model (Ref3 compared to 3Q_0.34ng_20-5-1 gave log10LR=−0.27), 20 for the EFM model (Ref4 compared to 4P_0.67ng_20-5-5-1 gave log10LR=−0.04) and 22 for the GA model (Ref2 compared to 4P_0.067ng_10-2-2-1 gave log10LR=−1.148). A list of all instances which the LR was less than 1 for any of the models can be found in Table S6 in the Supplementary. We also created the same figure for non-contributors (Fig S2 in Supplementary). Here the lowest number of mismatches that were obtained for non-contributors with LR above 1 was found to be 18 for EFM (T = 30) and 10 for both MPSproto models. The GA model gave consistently larger LR values for the non-contributors compared to the NB model across the full dropout range.

### 3.3 ROC analysis and LR for $Hd$true comparisons

The ROC analysis indicated a slightly better AUC value for the NB model compared to the other two models (Fig. 3). The MPSproto models (GA and NB) resulted in higher true positive rate compared to EFM (T = 30) for false positive rate values up to approximately 0.1. For this range, the GA model returned higher true positive rates than the NB model. However, for higher false positive rates, the GA model returned fewer true positive rates than both the NB and EFM (T = 30) model, which behaved very similarly to each other.
With an LR threshold equal to 1, the number of false negatives/false positives for the three models, GA, NB and EFM (T = 30) were 7/66, 10/36 and 10/39, respectively (Table 2). When the LR threshold was increased to 10, the numbers for the three models were 9/0, 15/3 and 21/0, respectively. Consequently, LRs were low when false positives occurred (Fig. 1): For the NB model three circumstances were obtained where the LR was greater than 10 (references Ref4, Ref7 and Ref10 compared to sample 3P_0.067ng_20-5-1), with largest being LR= 63.4 (Ref7). All remaining non-contributor LR values were below 10 for all models. A list of all instances of non-contributor LRs above log10LR= 0.5, for all models, can be found in Table S7 in the Supplementary material.
Table 2Gives an overview over the true positive rate (TPR), false positive rate (FPR), the number of false negatives (FN) and the number of false positives (FP) for different LR thresholds (Thresh) and models. The NB and GA models both used analytical threshold T = 11.
ThreshModelTPRFPRFNFP
1EFM (T = 11)0.9310.02381110
1EFM (T = 20)0.9690.0548523
1EFM (T = 30)0.9380.09051038
1NB0.9380.08571036
1GA0.9560.157766
10EFM (T = 11)0.9060150
10EFM (T = 20)0.9120140
10EFM (T = 30)0.8690210
10NB0.9060.00714153
10GA0.944090

### 3.4 Goodness of fit

The Kolmogorov-Smirnov p-values between for the two MPSproto models were uniformly distributed (see marginal distributions in the margin of the figures), indicating adequate models (Fig S3). Also, the p-values seemed independent (no pattern observed), indicating different model properties. On the other hand, the p-values for EFM (T = 30) clustered closer to 0, indicating a poor model fit for most of the samples (Fig S4).

### 3.5 Reducing the analytical threshold for EuroForMix

Reducing the analytical threshold for EFM further down to T = 20 or T = 11 (the latter value used for both MPSproto models) also reduced the LR differences compared to the MPSproto models, especially for comparisons in the range of log10LR < 10 (Fig S5 in supplementary). However, the reduction of the EFM analytical threshold in general caused a further reduction in the KS p-values, and hence even worse model fits (Fig S6 in supplementary). Nevertheless, the reduced thresholds of T = 20 increased the sensitivity of the EFM model since fewer false negatives were obtained (Table 2). The AUC was also highest for this model (Fig S7 in supplementary). However, reducing the threshold further to T = 11 reduced the sensitivity. We found that when the LR threshold was increased to 10, the false positives vanished for all almost all models (a few were obtained for the NB model). For this LR threshold, the number of false negatives was only nine for the GA model, followed by EFM with a lowered threshold: 14 for T = 20 and 15 for T = 11.
Although most of the LR values increased using EFM when applying T = 20 over T = 30, there was one example were this was not the case: Ref2 compared with 2Q_1.5ng_40-1 recorded a reduction of approximately 6 bans (log10LR=−6.7 instead of log10LR=0.1 obtained with T = 30). Here, around 9 bans were obtained for the MPSproto models. The reason for the large reduction was due to artefacts that could not be modelled by EFM; instead, they were therefore modelled as originating from an unknown contributor. For this example, model validation failed for EFM, but not for MPSproto (GA): the KS p-value under $Hp$ was 0.004/5e-06 for EFM T = 30/T = 20.

## 4. Discussion

### 4.1 Summary and overall findings

In this study the development and testing of a new probabilistic genotyping software, MPSproto, is described to evaluate MPS-STR mixtures. MPSproto is an extension of EuroForMix, and was inspired by the work of Vilsen et al. [
• Vilsen S.B.
• Tvedebrink T.
• Eriksen P.S.
• Hussing C.
• Børsting C.
• Morling N.
Modelling allelic drop-outs in STR sequencing data generated by MPS.
], who included marker amplification efficiency parameters to handle the large inter-locus balance that has been observed for MPS-STRs. MPSproto was already described and applied to an illustrative example in the discussion of Agudo et al. [
• Agudo M.M.
• Aanes H.
• Roseth A.
• Albert M.
• Gill P.
• Bleka Ø.
A comprehensive characterization of MPS-STR stutter artefacts.
]; who focused on developing a framework to evaluate complex sequence stutters, describing their sizes using the bracket format nomenclature. The software includes two distribution options for the sequence read depths: the gamma (GA) model which is continuous, and the negative binomial (NB) model which is discrete.
In a departure from the methods used by EuroForMix, MPSproto requires calibration prior to implementation and evaluation of questioned DNA-profiles. The calibration dataset must be comprised of single-source profiles from donors whose alleles are known, and ideally should not include any degraded profiles, since the purpose of calibration is to characterize inter-locus variation. To simplify MPSproto calibration for users, the calibrateModel function in the software is used to perform all necessary calibration steps.
To evaluate the performance of the two MPSproto models compared to EuroForMix, the analysis described by Bleka et al. [
• Bleka Ø.
• Just R.
• Le J.
• Gill P.
An examination of STR nomenclatures, filters and models for MPS mixture interpretation.
] was revisited for EuroForMix interpretation of 60 autosomal STR mixtures sequenced using the ForenSeq DNA Signature Prep Kit. Different analytical thresholds were applied: T = 11 for MPSproto versus T = 30 for EuroForMix, since for the latter, this was the threshold used in [
• Bleka Ø.
• Just R.
• Le J.
• Gill P.
An examination of STR nomenclatures, filters and models for MPS mixture interpretation.
] and EuroForMix has not been adapted to work with lower values. However, we investigated the effect of reducing thresholds applied to EuroForMix: T = 20 and T = 11.
With this dataset, the MPSproto GA model performed considerably better than EuroForMix (T = 30), mainly because fewer dropouts were observed due to the lower analytical threshold applied to the former model. However, lowering the EuroForMix threshold reduced the differences, especially when T = 20 was applied instead. However, a worse model fit was returned for EuroForMix, since this also increased the number of artefacts that could not be properly modelled.
Overall, the GA model performed better than the NB model, since higher LRs for true contributors were obtained for the former. Additionally, the former model obtained a higher true positive rate for false positive rate values up to approximately 0.1. There were some situations where the GA model far outperformed the NB model. For example, when Ref2 was interrogated as a contributor to mixture 2P_0.75ng_20–1, the LR resulting from use of the GA model was log10LR= 10.7 versus log10LR= 4.0 for the NB model. With this mixture, Ref2 was a minor contributor with 11 dropouts, and the LR difference between the two models was largest for the D2S441 marker where Ref2 had one allele dropout. In the evaluation of this mixture, a lower penalty for dropout was applied when the GA model was used compared to that applied with the NB model. With further analysis we estimated the dropout probability of the Ref2 allele to be approximately 21 % for the GA model and only 2 % for the NB model (see Section 4.2 for more details). This indicates that the GA model is less strict and would be preferable to evaluate minor contributors compared to the NB model when an analytical threshold of 11 reads is applied (see Section 4.2 for further discussion of thresholds). However, the consequence is that non-contributors would be “excluded” to a lesser degree (albeit with low probative LRs).
Findings regarding model performance in this study do not necessarily support the suggestion by Vilsen et al. [
• Vilsen S.B.
• Tvedebrink T.
• Eriksen P.S.
• Hussing C.
• Børsting C.
• Morling N.
Modelling allelic drop-outs in STR sequencing data generated by MPS.
] that the NB distribution provides a better model than GA when the DNA quantities are low: The GA model seems to provide a higher dropout probability than for the NB model. However, it is possible that our conclusion in this respect could be different if lower analytical thresholds were applied to the mixture dataset; indeed, the effect of applying alternative analytical thresholds was not explored here. Regardless, there may be situations where it could be better to use the NB model – for instance, if the model fit using GA was poor, and the model fit diagnostics indicated that NB was better. For this reason, it may be worthwhile to compute the LR for both models when minor contributors are evaluated as the POI, followed by an examination of the model diagnostics to select the best fit model.
In the present study, LR values for non-contributors mostly did not exceed approximately 10, regardless of the program or model employed for interpretation (there were three observations above LR=10 for the NB model). All LR values greater than 1 were from individuals unrelated to the mixture donors (no check of relatedness amongst non-contributors was performed). Estimating the number of contributors to an unknown sample can be challenging and can impact interpretation results – typically in the form of false negatives for true donors when the contributor number is underestimated, and an increased number of false positives for non-contributors when the contributor number is overestimated [
• Benschop C.C.G.
• et al.
Application of a probabilistic genotyping software to MPS mixture STR data is supported by similar trends in LRs compared with CE data’.
]. In this study the ground truth number of contributors to the constructed mixtures was used for all evaluations (MPSproto and EuroForMix); with donor DNA inputs as low as 2 pg (see Supplemental Table S2), the contributor number used may therefore have been higher than the apparent number of contributors due to high levels of allelic dropout.

### 4.2 Model considerations

#### 4.2.1 Stutters

As with EuroForMix, MPSproto uses stutter proportions instead of stutter ratios. The difference between the programs is in the way that stutter proportions are assigned: EuroForMix estimates stutter proportion parameter(s) when the questioned profile is evaluated, whereas MPSproto requires a calibration of the parameter in advance using a single-source dataset. The calibration of the stutter model for MPSproto is somewhat challenging, since it requires 1) the donor alleles to be known, 2) heterozygous allele pairs with sufficient separation (to not mask stutter effects), and 3) a sufficient number of stutter products (per sequence and across different alleles).
Rather than modelling stutter proportions directly, it may also be possible to take a fitted linear model of a stutter ratio, $SR,$ and convert it to a stutter proportion, $Sp$. Vilsen et al. [
• Vilsen S.B.
• et al.
Stutter analysis of complex STR MPS data.
] suggested model $SR=β(BLMM−1)$ for the expected stutter ratio. Conversion back to the expected stutter proportion could be accomplished using the transformation $Sp=1/(1+1/SR)$.

#### 4.2.2 Noise

As with stutter, MPSproto requires calibration of the noise parameters in advance using a single-source dataset. MPSproto models the number of noise sequences per marker using a geometrical distribution, and the noise read depth using a discretized pareto distribution. These models provided a good fit for the calibration dataset used in this study. The pareto distribution is a useful choice since it can consider higher noise levels than for instance the exponential distribution, due to heavier tails.
Importantly, as the noise parameter developed during calibration is dependent on the analytical threshold selected, the analytical threshold used for evaluation should be the same as that used for the calibration.
The noise model may be affected by the number of libraries multiplexed for sequencing. For example, the expectation is that a sequencing pool composed of 90 libraries will result in reduced quantity of noise signals/sequences at reduced read depths as compared to a sequencing a pool of 30 libraries. Therefore, a calibration may be needed for each distinct protocol used for data generation. In the calibration dataset we observed that most of the noise sequences (98.5%) appeared to be errors of a single base compared to a parental allele sequence. Hence, there is potential to model the single base error sequencies separately from other type of noise, which may help remedy the above-mentioned issue.
Vilsen et al. [
• Vilsen S.B.
• Tvedebrink T.
• Mogensen H.S.
• Morling N.
Modelling noise in second generation sequencing forensic genetics STR data using a one-inflated (zero-truncated) negative binomial model.
] proposed a one-inflated (zero-truncated) negative binomial distribution for the noise read depth, including all sequences down to singletons (i.e., analytical threshold T = 1). Our approach to handling noise sequences differs: we recommend avoiding an analytical threshold that is too low because of the way that the MPSproto noise model is defined (it may no longer fit). An alternative approach is to apply corrections: either of the type employed by FDStools [
• Hoogenboom J.
• van der Gaag K.J.
• de Leeuw R.H.
• Sijen T.
• de Knijff P.
• Laros J.F.J.
‘FDSTools: a software package for analysis of massively parallel sequencing data with the ability to recognise and correct STR stutter and other PCR or sequencing noise’.
], which associates noise with parental sequences and combines the read depths (see Benschop et al. [
• Benschop C.C.G.
• et al.
Application of a probabilistic genotyping software to MPS mixture STR data is supported by similar trends in LRs compared with CE data’.
] for a comparison study), or a method to reduce the number of base-calling errors such as that introduced by Vilsen [

S.B. Vilsen, Statistical Modelling of Massively Parallel Sequencing Data in Forensic Genetics, Aalborg University, Aalborg.

].

#### 4.2.3 Thresholds and data filtering

Probabilistic genotyping (PG) systems are well established for CE based applications to deal with stutters. However, interpretation for MPS-STR data lags because of the current lack of availability of such systems, and accordingly has relied on defining thresholds to facilitate interpretation. Experience with PG systems applied to CE has shown that evaluation of evidence that is based upon thresholds, where binary decisions are required to designate alleles, are always suboptimal [
• Taylor D.
• Buckleton J.
• Bright J.-A.
Does the use of probabilistic genotyping change the way we should view sub-threshold data?.
]. This is because information is lost or thrown away. This in turn results in reports that either understate the value of the evidence when true alleles that support $Hp$ are removed (i.e., dropouts for the person of interest); or overstate the value of the evidence when true alleles that support $Hd$ are removed (i.e., dropouts for the unknown contributors). Given the increased complexity, it can be argued that the inherent advantages of MPS will never be properly realized without integration of PG systems for evidence evaluation. Use of a threshold intended to eliminate noise cannot be avoided with the current implementation of MPSproto, but steps have been taken to model noise, reducing the threshold to only 11 reads – which is substantially lower than that used in conventional systems.

#### 4.2.4 Marker amplification efficiencies

On the basis of prior findings regarding variability in marker amplification efficiencies with the ForenSeq DNA Signature Prep Kit [
• Hussing C.
• Huber C.
• Bytyci R.
• Mogensen H.S.
• Morling N.
• Børsting C.
Sequencing of 231 forensic genetic markers using the MiSeq FGx™ forensic genomics system – an evaluation of the assay and software.
,
• Cheng K.
• et al.
Modeling allelic analyte signals for aSTRs in NGS DNA profiles.
,
• Didier M.
• et al.
Establishing STR and identity SNP analysis thresholds for reliable interpretation and practical implementation of MPS gDNA casework.
], MPSproto includes a function to adjust the marker amplification efficiency estimates towards those obtained from a questioned DNA profile by utilizing a prior. To enable this, standard deviations of the amplification efficiencies must be specified in addition to the mean (per marker), and the user can choose between normal distribution or log-normal distribution for the prior. We did not use this functionality in this study, although it would be worthwhile for future investigations, where the prior is, for example, determined from the calibration dataset.

Similar to EuroForMix, the MPSproto model also includes the possibility to consider a degradation model for evaluation of questioned DNA profiles. In this study, both the calibration and mixture datasets were constructed without degradation exhibited. When a degradation model was applied in MPSproto, the degradation slope estimates were all close to one (not shown), confirming the pristine state of the test samples. It is highly recommended that the calibration dataset exhibits little or no degradation, so that the estimated marker efficiencies are not affected before the full model is applied to questioned DNA profiles (which may be highly degraded).
EuroForMix was used with the degradation model turned on since more of the data is explained with rather than without using it. The underlying reason is that the marker amplification efficiencies of samples in the mixture dataset decay with the fragment length.

#### 4.2.6 Data format

The format required for MPSproto interpretation of MPS-STR sequences does not necessitate the use of any specific alignment or data analysis software. Any program that can be used to perform marker identification and sequence allele calling from FASTQ files (such as STRaitRazor [
• Warshauer D.H.
• et al.
STRait Razor: a length-based forensic STR allele-calling tool for use with second generation sequencing data.
], FDStools [
• Hoogenboom J.
• van der Gaag K.J.
• de Leeuw R.H.
• Sijen T.
• de Knijff P.
• Laros J.F.J.
‘FDSTools: a software package for analysis of massively parallel sequencing data with the ability to recognise and correct STR stutter and other PCR or sequencing noise’.
], and STRinNGS [
• Jønck C.G.
• Qian X.
• Simayijiang H.
• Børsting C.
STRinNGS v2.0: Improved tool for analysis and reporting of STR sequencing data.
,
• Ganschow S.
• Silvery J.
• Kalinowski J.
• Tiemann C.
toaSTR: a web application for forensic STR genotyping by massively parallel sequencing.
]), along with bracketing of the sequences, could be used as part of a pipeline to produce the data for MPSproto input. In this paper, sequences were obtained from ForenSeq typing results using the ForenSeq UAS v1.3, and these were further converted into a bracket format (forward strand; using the lusSTR program [

R. Mitchell , D. Standage, lusSTR. Bioforensics, 2021. (Online). https://github.com/bioforensics/lusSTR.

,

Ø. Bleka, LUSstrR. 2022 (Online).https://github.com/oyvble/LUSstrR.

]) as recommended by an International Society for Forensic Genetics (ISFG) Commission [
• Parson W.
• et al.
Massively parallel sequencing of forensic STRs: considerations of the DNA commission of the International Society for Forensic Genetics (ISFG) on minimal nomenclature requirements.
].

#### 4.2.7 Model differences

We revisited the interpretation where reference Ref2 was compared to sample 2P_0.75ng_20‐1 since this was one of the examples where the two MPSproto models differed most (log10LR=6.7 in difference). The largest LR difference was observed for D2S441 which was estimated as 1.14 in marker amplification efficiency (Table 1). One of the heterozygous alleles of Ref2 dropped out. Fig. 4 illustrates how the dropout estimate became larger for the GA model (dropout probability 21 %) than for the NB model (dropout probability equal 2 %), and hence was penalized less in the LR calculation. The reason for this is that the probability density function of GA is skewed towards zero whereas for NB it is not. The read depth distribution for the GA model is also wider than for the NB model, which means that the former also has a wider heterozygous balance distribution compared to the latter.

### 4.3 Current implementation and future work

As with EuroForMix, the current version of MPSproto (v0.8.1) includes functionality to analyse PCR replicates. Benschop et al. [
• Benschop C.C.G.
• et al.
Application of a probabilistic genotyping software to MPS mixture STR data is supported by similar trends in LRs compared with CE data’.
] demonstrated that concurrent interpretation of PCR replicates in EuroForMix can produce improved results. We will examine the effects of using PCR replicates in a future study.
MPSproto can also be used to evaluate conventional CE-STR data (and accordingly, a CE module is included for the calibration step). Many stutter types are supported for this module: backward, forward, double-backward, double forward, triple backward and 2-bp stutters (half stutters). Here CE allele lengths are used to determine the BLMM. We will compare the performance between MPSproto and EuroForMix for the interpretation of CE data in a future study.
The current implementation MPSproto does not support combination of CE-STR with MPS-STR data for simultaneous evaluation. For this to be possible the model would need to be extended so that typing results for each marker can be differentiated according to CE or MPS origin, as a given marker could have observations from either CE typing, MPS typing, or both. For instance, the ForenSeq DNA Signature Prep Kit types 22 autosomal STRs that are also typed using the PowerPlex® Fusion 6C System Kit (Promega Corporation).
EuroForMix has been used to interpret SNP typing results [
• Bleka Ø.
• Eduardoff M.
• Santos C.
• Phillips C.
• Parson W.
• Gill P.
Open source software EuroForMix can be used to analyse complex SNP mixtures.
,
• Hwa H.-L.
• et al.
Massively parallel sequencing analysis of nondegraded and degraded DNA mixtures using the ForenSeq™ system in combination with EuroForMix software.
] (though Yang et al. [
• Yang T.-W.
• et al.
DNA mixture interpretation using linear regression and neural networks on massively parallel sequencing data of single nucleotide polymorphisms.
] showed better performance with machine learning approaches). Accordingly, MPSproto can also be used to interpret SNP data. For this application, the MPSproto stutter model would be turned off and not defined, and the degradation model would not be used. However, calibration is still needed to define the marker amplification efficiency and noise parameters to be used. Provided STR and SNP markers are neither linked nor in linkage disequilibrium, the LRs for the two systems could be multiplied together. The performance of MPSproto with SNPs will be described in a future paper.
The current version of MPSproto does not have a graphical user interface (GUI). As most forensic practitioners are not used to running programs from the command line, a near-term priority is to develop a user-friendly GUI for both model calibration and profile interpretation. Additionally, as MPSproto requires specific formatting for the input sequence data, we aim to integrate the LUSstrR conversion tool [

Ø. Bleka, LUSstrR. 2022 (Online).https://github.com/oyvble/LUSstrR.

] into MPSproto to ease translation of sequence strings into the proper format. Other MPS-STR assays such as the PowerSeq GY System (Promega Corporation), the Precision ID GlobalFiler NGS STR Panel (Thermo Fisher Scientific), or custom assays, could be used with MPSproto, provided that the sequences are converted to the proper format. However, to unlock the degradation model for these assays, the fragment lengths of common RU alleles must be defined in the MPSproto kit file; at present, only information about the ForenSeq DNA Signature Prep Kit is included.
Finally, there is no relationship testing module included in the current version of MPSproto, but we aim to include this in a future version.

## 5. Conclusion

In this paper we found that the developed MPSproto models are good alternatives to EuroForMix for the interpretation of MPS-STR mixtures typed using ForenSeq DNA Signature Prep Kit. MPSproto enables increased sensitivity by lowering the analytical threshold whilst demonstrating a characteristically good fit to the data. Although, it requires a model calibration of stutters, marker efficiency and noise before the interpretation.

## Conflict of interest

The authors declare that they have no conflict of interest.

## Acknowledgements

This work was funded under Agreement No. HSHQDC-15-C-00064 awarded to Battelle National Biodefense Institute (BNBI) by the Department of Homeland Security (DHS) Science and Technology Directorate (S&T) for the management and operation of the National Biodefense Analysis and Countermeasures Center (NBACC), a Federally Funded Research and Development Center. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of DHS or the U.S. Government. DHS does not endorse any products or commercial services mentioned in this presentation. In no event shall DHS, BNBI or NBACC have any responsibility or liability for any use, misuse, inability to use, or reliance upon the information contained herein. In addition, no warranty of fitness for a particular purpose, merchantability, accuracy or adequacy is provided regarding the contents of this document. All research involving living individuals, their data, or their biospecimens was conducted in compliance with the Federal Policy for the Protection of Human Subjects (The Common Rule, codified for DHS as 6 CFR 46), DHS Management Directive 026-04, and any other applicable statutory requirements. Research involving human subjects has only been initiated after the following has occurred: the need for IRB review has been determined, IRB approval has been obtained as applicable, and DHS Compliance Assurance Program Office certification or concurrence has been issued. Notice: This manuscript has been authored by Battelle National Biodefense Institute, LLC under Contract No. HSHQDC-15-C-00064 with DHS. The US Government (USG) retains and the publisher, by accepting the article for publication, acknowledges that the USG retains a non-exclusive, paid up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for USG purposes.

## Appendix A

### Mathematical description of the distributions

Gamma distribution:
$X=x∼gammashape=α,scale=β$

Continuous outcome: $x∈0,∞$.
Probability density function:$fx=1Γαβαxα−1e−xβ$
Cumulative distribution function:$Fx=∫0xf(t)dt$

### Negative binomial distribution (re-parameterized version):

$X=x∼negBinommu=μ,size=η$

Discrete outcome: $x={0,1,2…}$.
Probability density function:$fx=Γ(x+η)Γx+1Γημxηημ+ηx+η$
Cumulative distribution function:$Fx=∑t=0t=x−1f(t)$

### Deriving the relation between size parameter and the coefficient of variation

We assume following parameterization: $EX=μ$ and $VarX=σ2$.
Then the size parameter can be written as $η=μ2/(σ2−μ)$.
Since the coefficient of variation $CVX=σμ=ω$, then $η=μ2(μω)2−μ=μμω2−1$.
Computational sidenote: Very large values of $η$ can cause computational issues which is solvable by restricting it below a value $η0$. It can be derived that the restriction $η<η0$ is equivalent to $μ>ω2−1η0−1$. Hence smaller values of $ω$ would require a large value of $μ$. We used a value of $η0=1e4$ in the C++ function which avoids the maximum likelihood optimization to crash using parallelization with OpenMP.

## Appendix B. Supplementary material

• Supplementary material

.
• Supplementary material

.

## References

• Bleka Ø.
• Storvik G.
• Gill P.
EuroForMix: an open source software based on a continuous model to evaluate STR DNA profiles from a mixture of contributors with artefacts.
Forensic Sci. Int. Genet. 2016; 21: 35-44https://doi.org/10.1016/j.fsigen.2015.11.008
• Coble M.D.
• Bright J.-A.
Probabilistic genotyping software: an overview.
Forensic Sci. Int. Genet. 2019; 38: 219-224https://doi.org/10.1016/j.fsigen.2018.11.009
• Bruijns B.
• Tiggelaar R.
• Gardeniers H.
‘Massively parallel sequencing techniques for forensics: a review.
Electrophoresis. 2018; 39: 2642-2654https://doi.org/10.1002/elps.201800082
• Barrio P.A.
• et al.
The first GHEP-ISFG collaborative exercise on forensic applications of massively parallel sequencing.
Forensic Sci. Int. Genet. 2020; 49https://doi.org/10.1016/j.fsigen.2020.102391
• Alonso A.
• et al.
Current state-of-art of STR sequencing in forensic genetics.
Electrophoresis. 2018; 39: 2655-2668https://doi.org/10.1002/elps.201800030
• Benschop C.C.G.
• et al.
Application of a probabilistic genotyping software to MPS mixture STR data is supported by similar trends in LRs compared with CE data’.
Forensic Sci. Int. Genet. 2021; 52https://doi.org/10.1016/j.fsigen.2021.102489
• van der Gaag K.J.
• et al.
Massively parallel sequencing of short tandem repeats—Population data and mixture analysis results for the PowerSeq™ system.
Forensic Sci. Int. Genet. 2016; 24: 86-96https://doi.org/10.1016/j.fsigen.2016.05.016
• Taylor D.
• Bright J.A.
• Buckleton J.
‘The interpretation of single source and mixed DNA profiles.
Forensic Sci. Int. Genet. 2013; 7: 516-528
• Cowell R.G.
• Graversen T.
• Lauritzen S.L.
• Mortera J.
Analysis of forensic DNA mixtures with artefacts.
Appl. Stat. 2015; 64: 1-32
• Hussing C.
• Huber C.
• Bytyci R.
• Mogensen H.S.
• Morling N.
• Børsting C.
Sequencing of 231 forensic genetic markers using the MiSeq FGx™ forensic genomics system – an evaluation of the assay and software.
Null. 2018; 3: 111-123https://doi.org/10.1080/20961790.2018.1446672
• Li R.
• et al.
Characterizing stutter variants in forensic STRs with massively parallel sequencing.
Forensic Sci. Int. Genet. 2020; 45https://doi.org/10.1016/j.fsigen.2019.102225
• Vilsen S.B.
• Tvedebrink T.
• Eriksen P.S.
• Hussing C.
• Børsting C.
• Morling N.
Modelling allelic drop-outs in STR sequencing data generated by MPS.
Forensic Sci. Int. Genet. 2018; 37: 6-12https://doi.org/10.1016/j.fsigen.2018.07.017
• Vilsen S.B.
• et al.
Stutter analysis of complex STR MPS data.
Forensic Sci. Int. Genet. 2018; 35: 107-112https://doi.org/10.1016/j.fsigen.2018.04.003
• Cheng K.
• et al.
Modeling allelic analyte signals for aSTRs in NGS DNA profiles.
J. Forensic Sci. 2021; 66: 1234-1245https://doi.org/10.1111/1556-4029.14685
• Agudo M.M.
• Aanes H.
• Roseth A.
• Albert M.
• Gill P.
• Bleka Ø.
A comprehensive characterization of MPS-STR stutter artefacts.
Forensic Sci. Int. Genet. 2022; 102728https://doi.org/10.1016/j.fsigen.2022.102728
• Just R.S.
• Irwin J.A.
Use of the LUS in sequence allele designations to facilitate probabilistic genotyping of NGS-based STR typing results.
Forensic Sci. Int. Genet. 2018; 34: 197-205https://doi.org/10.1016/j.fsigen.2018.02.016
• Bleka Ø.
• Just R.
• Le J.
• Gill P.
An examination of STR nomenclatures, filters and models for MPS mixture interpretation.
Forensic Sci. Int. Genet. 2020; 48102319https://doi.org/10.1016/j.fsigen.2020.102319
1. R. Mitchell , D. Standage, lusSTR. Bioforensics, 2021. (Online). https://github.com/bioforensics/lusSTR.

• Gettings K.B.
• Borsuk L.A.
• Steffen C.R.
• Kiesler K.M.
• Vallone P.M.
Sequence-based U.S. population data for 27 autosomal STR loci.
Forensic Sci. Int. Genet. 2018; 37: 106-115https://doi.org/10.1016/j.fsigen.2018.07.013
2. L.I. Moreno, T.R. Moretti, Short Tandem Repeat Genotypes of Samples From Eleven Populations Comprising the FBI’s Population Database, 1, 2019.

3. Ø. Bleka, LUSstrR. 2022 (Online).https://github.com/oyvble/LUSstrR.

• Krishnaji N.
Characterization of the Pareto distribution through a model of underreported incomes.
Econometrica. 1970; 38: 251-255
• Zweig M.H.
• Campbell G.
‘Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine.
Clin. Chem. 1993; 39: 561-577
• Vilsen S.B.
• Tvedebrink T.
• Mogensen H.S.
• Morling N.
Modelling noise in second generation sequencing forensic genetics STR data using a one-inflated (zero-truncated) negative binomial model.
Forensic Sci. Int. Genet. Suppl. Ser. 2015; 5: e416-e417https://doi.org/10.1016/j.fsigss.2015.09.165
• Hoogenboom J.
• van der Gaag K.J.
• de Leeuw R.H.
• Sijen T.
• de Knijff P.
• Laros J.F.J.
‘FDSTools: a software package for analysis of massively parallel sequencing data with the ability to recognise and correct STR stutter and other PCR or sequencing noise’.
Forensic Sci. Int. Genet. 2017; 27: 27-40https://doi.org/10.1016/j.fsigen.2016.11.007
4. S.B. Vilsen, Statistical Modelling of Massively Parallel Sequencing Data in Forensic Genetics, Aalborg University, Aalborg.

• Taylor D.
• Buckleton J.
• Bright J.-A.
Does the use of probabilistic genotyping change the way we should view sub-threshold data?.
Null. 2017; 49: 78-92https://doi.org/10.1080/00450618.2015.1122082
• Didier M.
• et al.
Establishing STR and identity SNP analysis thresholds for reliable interpretation and practical implementation of MPS gDNA casework.
Forensic Sci. Int. Genet. Suppl. Ser. 2019; 7: 363-364https://doi.org/10.1016/j.fsigss.2019.10.013
• Warshauer D.H.
• et al.
STRait Razor: a length-based forensic STR allele-calling tool for use with second generation sequencing data.
Forensic Sci. Int. Genet. 2013; 7: 409-417https://doi.org/10.1016/j.fsigen.2013.04.005
• Jønck C.G.
• Qian X.
• Simayijiang H.
• Børsting C.
STRinNGS v2.0: Improved tool for analysis and reporting of STR sequencing data.
Forensic Sci. Int. Genet. 2020; 48102331https://doi.org/10.1016/j.fsigen.2020.102331
• Ganschow S.
• Silvery J.
• Kalinowski J.
• Tiemann C.
toaSTR: a web application for forensic STR genotyping by massively parallel sequencing.
Forensic Sci. Int. Genet. 2018; 37: 21-28https://doi.org/10.1016/j.fsigen.2018.07.006
• Parson W.
• et al.
Massively parallel sequencing of forensic STRs: considerations of the DNA commission of the International Society for Forensic Genetics (ISFG) on minimal nomenclature requirements.
Forensic Sci. Int. Genet. 2016; 22: 54-63https://doi.org/10.1016/j.fsigen.2016.01.009
• Bleka Ø.
• Eduardoff M.
• Santos C.
• Phillips C.
• Parson W.
• Gill P.
Open source software EuroForMix can be used to analyse complex SNP mixtures.
Forensic Sci. Int. Genet. 2017; 31: 105-110https://doi.org/10.1016/j.fsigen.2017.08.001
• Hwa H.-L.
• et al.
Massively parallel sequencing analysis of nondegraded and degraded DNA mixtures using the ForenSeq™ system in combination with EuroForMix software.
Int. J. Leg. Med. 2019; 133: 25-37https://doi.org/10.1007/s00414-018-1961-y
• Yang T.-W.
• et al.
DNA mixture interpretation using linear regression and neural networks on massively parallel sequencing data of single nucleotide polymorphisms.
Null. 2022; 54: 150-162https://doi.org/10.1080/00450618.2020.1807050