Advertisement

A comprehensive characterization of MPS-STR stutter artefacts

Open AccessPublished:May 26, 2022DOI:https://doi.org/10.1016/j.fsigen.2022.102728

      Highlights

      • Stutter artefacts from STRs in MPS forensic data are characterized.
      • Stutter proportions and candidate variables are investigated via beta regression.
      • Parental uninterrupted stretch (PTUS) as explanatory variable.
      • Estimates are integrated into probabilistic genotyping model MPSproto.

      Abstract

      Interpretation of DNA evidence involving mixtures is challenging when alleles from minor contributors coincide with stutters from major contributors. To accommodate this, it is important to have a good understanding of stutter sequence formation trends. Here, multiple stutter types were characterized based on massively parallel sequencing (MPS) data from 387 single source samples, using the Verogen ForenSeq™ DNA Signature Prep kit. A beta regression model was used to investigate the relationship between the stutter proportion and candidate explanatory variables. In the final model, stutter proportions were explained by the length of the parental uninterrupted stretch (PTUS), which is comparable to block length of the missing motif (BLMM). Also, different stutter types (n+1, n-1, n+2, n-2, n0) were analyzed separately per locus. The fitted stutter models were then integrated into an extended probabilistic genotyping model based on EuroForMix (MPSproto). An illustrative minor/major mock mixture example is discussed. Evaluation of multiple types of stutters on a per locus basis improved the probabilistic genotyping result compared to the conventional EuroForMix model, using the LUS+ nomenclature.

      Keywords

      1. Introduction

      The detection and identification of short tandem repeats (STR) regions from human DNA has been the cornerstone of forensic genetics over previous decades. In recent years there has been a shift of interest from traditional capillary gel electrophoresis (CE) towards MPS [
      • Bruijns B.
      • Tiggelaar R.
      • Gardeniers H.
      Massively parallel sequencing techniques for forensics: a review.
      ,
      • Barrio P.A.
      • García Ó.
      • Phillips C.
      • Prieto L.
      • Gusmão L.
      • Fernández C.
      • et al.
      The first GHEP-ISFG collaborative exercise on forensic applications of massively parallel sequencing.
      ,
      • Alonso A.
      • Barrio P.A.
      • Müller P.
      • Köcher S.
      • Berger B.
      • Martin P.
      • et al.
      Current state-of-art of STR sequencing in forensic genetics.
      ,
      • Alonso A.
      • Müller P.
      • Roewer L.
      • Willuweit S.
      • Budowle B.
      • Parson W.
      European survey on forensic applications of massively parallel sequencing.
      ]. These two methods differ greatly in the type of output signal: whilst for CE, fluorescence intensity is represented by allele peaks, in MPS, read counts relate to abundance of allele sequences [
      • Hoogenboom J.
      • van der Gaag K.J.
      • de Leeuw R.H.
      • Sijen T.
      • de Knijff P.
      • Laros J.F.J.
      FDSTools: a software package for analysis of massively parallel sequencing data with the ability to recognise and correct STR stutter and other PCR or sequencing noise.
      ]. With MPS, it is possible to detect sequence variants in both the repeat region, and in the flanking regions of STRs. There has also been an increase in multiplexing capability leading to the possibility to reduce amplicon fragment length, which is useful when dealing with degraded DNA [
      • Hussing C.
      • Huber C.
      • Bytyci R.
      • Mogensen H.S.
      • Morling N.
      • Børsting C.
      Sequencing of 231 forensic genetic markers using the MiSeq FGxTM forensic genomics system – an evaluation of the assay and software.
      ,
      • Vilsen S.B.
      • Tvedebrink T.
      • Eriksen P.S.
      • Bøsting C.
      • Hussing C.
      • Mogensen H.S.
      • et al.
      Stutter analysis of complex STR MPS data.
      ,
      • de Knijff P.
      From next generation sequencing to now generation sequencing in forensics.
      ].
      However, the use of MPS poses new challenges; i.e., there is strong reliance on bioinformatics tools [
      • Liu Y.Y.
      • Harbison S.
      A review of bioinformatic methods for forensic DNA analyses.
      ,
      • Cao M.D.
      • Balasubramanian S.
      • Boden M.
      Sequencing technologies and tools for short tandem repeat variation detection.
      ]. To date, there is no standard approach to classify sequences into alleles, stutter and noise categories; furthermore, analytical thresholds are defined by criteria that are subjectively based [
      • Riman S.
      • Iyer H.
      • Borsuk L.A.
      • Vallone P.M.
      Understanding the behavior of stutter through the sequencing of STR alleles.
      ,
      • Young B.
      • King J.L.
      • Budowle B.
      • Armogida L.
      A technique for setting analytical thresholds in massively parallel sequencing-based forensic DNA analysis.
      ].
      During PCR amplification artefactual products arise, the most common being stutters [
      • Walsh P.S.
      • Fildes N.J.
      • Reynolds R.
      Sequence analysis and characterization of stutter products at the tetranucleotide repeat locus VWA.
      ,
      • Hauge X.Y.
      • Litt M.
      A study of the origin of ‘shadow bands’ seen when typing dinucleotide repeat polymorphisms by the PCR.
      ,
      • Meldgaard M.
      • Morling N.
      Detection and quantitative characterization of artificial extra peaks following polymerase chain reaction amplification of 14 short tandem repeat systems used in forensic investigations.
      ]. A probable mechanism for the generation of these in-vitro sequences is slipped-strand mispairing [
      • Levinson G.
      • Gutman G.A.
      Slipped-strand mispairing: a major mechanism for DNA sequence evolution.
      ]. Stutters are generated due to displacement and further misalignment of complementary DNA strands [
      • Levinson G.
      • Gutman G.A.
      Slipped-strand mispairing: a major mechanism for DNA sequence evolution.
      ,
      • Klintschar M.
      • Wiegand P.
      Polymerase slippage in relation to the uniformity of tetrameric repeat stretches.
      ,
      • Schlötterer C.
      • Tautz D.
      Slippage synthesis of simple sequence DNA.
      ]. This can lead to either additional or fewer copies of the repeat unit in the newly synthesized sequence compared to the true allele (template strand). Deletions of repeat units lead to backward, or reverse stutters (these are the most common type observed), whereas insertions lead to forward stutters [

      Gill P., Bleka Ø, Hansson O., Benschop C., Haned H. Chapter 2 - Empirical characterization of DNA profiles. In: Forensic Practitioner’s Guide to the Interpretation of Complex DNA Profiles [Internet]. Academic Press; 2020. p. 55–88. Available from: 〈https://doi.org/10.1016/B978–0-12–820562-4.00010–9〉.

      ,

      Butler JM. Forensic DNA typing: biology, technology, and genetics of STR markers. 2nd ed. Amsterdam; Boston: Elsevier Academic Press; 2005. 660 p.

      ].
      The amount of stutter is measured relative to the parental template as the stutter ratio (Eq. 1), or proportion (Eq. 2):
      SR=NSANA,
      (1)


      SP=NSANSA+NA
      (2)


      where NA is the read count for allele A, and NSA is the read count for stutter SA originating from (parental) allele A.
      The size (read count) of stutters depends upon several factors: the number of STR repeats; AT-content; repeat length; and importantly the input DNA quantity [

      Gill P., Bleka Ø, Hansson O., Benschop C., Haned H. Chapter 2 - Empirical characterization of DNA profiles. In: Forensic Practitioner’s Guide to the Interpretation of Complex DNA Profiles [Internet]. Academic Press; 2020. p. 55–88. Available from: 〈https://doi.org/10.1016/B978–0-12–820562-4.00010–9〉.

      ,

      Butler JM. Forensic DNA typing: biology, technology, and genetics of STR markers. 2nd ed. Amsterdam; Boston: Elsevier Academic Press; 2005. 660 p.

      ]. An understanding of potential causes of variation in stutter formation is useful to interpret complex DNA-mixtures, which typically consist of major/minor contributors [
      • Gill P.
      • Brenner C.H.
      • Buckleton J.S.
      • Carracedo A.
      • Krawczak M.
      • Mayr W.R.
      • et al.
      DNA commission of the international society of forensic genetics: recommendations on the interpretation of mixtures.
      ]. Interpretation is challenging; either stutters can be mistaken as alleles from the minor contributor, else alleles from the minor may be mistaken as stutters from the major. In many situations it would be impossible to distinguish these two events. It is therefore important to incorporate a comprehensive stutter model since it strongly affects the likelihood ratio (LR) in probabilistic genotyping.
      To adopt stutters into MPS probabilistic models, it is necessary to characterize their behavior. Extensive analyses have been reported for CE forensic data, and variability has been explained mostly by (1) number of nucleotides of the core repetition unit (or motif), (2) AT-content and (3) the length of the uninterrupted stretch (LUS) [
      • Walsh P.S.
      • Fildes N.J.
      • Reynolds R.
      Sequence analysis and characterization of stutter products at the tetranucleotide repeat locus VWA.
      ,
      • Gibb A.J.
      • Huell A.L.
      • Simmons M.C.
      • Brown R.M.
      Characterisation of forward stutter in the AmpFlSTR® SGM Plus® PCR.
      ,
      • Brookes C.
      • Bright J.A.
      • Harbison S.
      • Buckleton J.
      Characterising stutter in forensic STR multiplexes.
      ,
      • Westen A.A.
      • Grol L.J.W.
      • Harteveld J.
      • Matai A.S.
      • de Knijff P.
      • Sijen T.
      Assessment of the stochastic threshold, back- and forward stutter filters and low template techniques for NGM.
      ].
      Bright et al. [
      • Bright J.A.
      • Taylor D.
      • Curran J.M.
      • Buckleton J.S.
      Developing allelic and stutter peak height models for a continuous method of DNA interpretation.
      ] proposed a linear model, based on CE observations, to describe the relationship between stutter ratio and LUS, which confirmed that alleles with larger LUS values generated larger stutter peaks than those with smaller. Afterwards, other models based on different families of distributions for stutter ratio were suggested to predict stutter formation [
      • Bright J.A.
      • Curran J.M.
      Investigation into stutter ratio variability between different laboratories.
      ,
      • Kelly H.
      • Bright J.A.
      • Buckleton J.S.
      • Curran J.M.
      Identifying and modelling the drivers of stutter in forensic DNA profiles.
      ]. Also, it has been shown that the identity of the locus has a significant effect on the stutter ratio in CE data [
      • Kelly H.
      • Bright J.A.
      • Buckleton J.S.
      • Curran J.M.
      Identifying and modelling the drivers of stutter in forensic DNA profiles.
      ]. Vilsen et al. [
      • Vilsen S.B.
      • Tvedebrink T.
      • Eriksen P.S.
      • Bøsting C.
      • Hussing C.
      • Mogensen H.S.
      • et al.
      Stutter analysis of complex STR MPS data.
      ] introduced an alternative linear model for stutter formation in MPS forensic data, where the explanatory variable is the block length of the missing motif (BLMM). Other models have characterized stutters from compound forensic markers D2S1338 and D12S391 [
      • Woerner A.E.
      • King J.L.
      • Budowle B.
      Compound stutter in D2S1338 and D12S391.
      ]. However, there is still a need for researchers to investigate alternative models and their respective impact on the interpretation of the DNA evidence.
      This paper describes a robust pipeline for the analysis of STRs in MPS forensic data used to interpret evidence where stutters are characterized using a beta regression model. Stutter types n+1, n-1, n+2, n-2 and the less common, n0 [
      • Li R.
      • Wu R.
      • Li H.
      • Zhang Y.
      • Peng D.
      • Wang N.
      • et al.
      Characterizing stutter variants in forensic STRs with massively parallel sequencing.
      ], were evaluated. This study relates the proportion of stutter sequences and parental template to candidate explanatory variables. In the discussion, we illustrate how the regression estimates can be directly integrated into a probabilistic genotyping model to improve the interpretation of DNA mixtures.

      2. Methods

      2.1 Ethical declaration

      This study was approved by the Data protection officer (DPO) at Oslo University Hospital with case numbers 20/13587, 20/16592 and 21/15006 .

      2.2 DNA samples

      Analyses were performed on genomic DNA that was previously isolated from blood samples donated by 387 unrelated Norwegian individuals as described by Dupuy et al. [
      • Dupuy B.M.
      • Stenersen M.
      • Lu T.T.
      • Olaisen B.
      Geographical heterogeneity of Y-chromosomal lineages in Norway.
      ]. The reference samples originated from diverse Southern locations in Norway.

      2.3 MPS analysis: Library preparation and sequencing

      STR libraries were prepared using the Verogen ForenSeq™ DNA Signature prep kit, following the manufacturer’s manual [

      Verogen. ForenSeq DNA Signature Prep Reference Guide. 2018;42.

      ]; 1 ng of genomic DNA was used as input. In this study, forensic loci were amplified using DNA Primer Set A, targeting 27 autosomal STRs (see Supplementary Table S1), 94 identity Informative SNPs (iSNPs), 24 Y-STRs and 7 XSTRs. Pooled samples were quantified with a Qubit™ 4 Fluorometer [

      Thermo Fisher Scientific I. Qubit 4 Fluorometer User Guide. 2018;72.

      ], and sequenced using MiSeq FGx™ according to the instrument´s Reference Guide [

      Illumina. MiSeq FGx Instrument Reference Guide. 2015;84.

      ]. Five sequencing runs were performed with 94, 96, 96, 32 and 92 samples respectively (including positive and negative controls) (See Supplementary Table S6B).

      2.4 Bioinformatics analyses

      After sequencing, FASTQ files were exported from UAS (v2.0), and quality assessment of the reads was performed with FastQC v0.11.9 [

      Andrews,Simon. Babraham Bioinformatics - FastQC A Quality Control tool for High Throughput Sequence Data [Internet]. [cited 2021 Aug 2]. Available from: 〈https://www.bioinformatics.babraham.ac.uk/projects/fastqc/〉.

      ] (See Fig. 1). Results across samples were summarized with MultiQC [
      • Ewels P.
      • Magnusson M.
      • Lundin S.
      • Käller M.
      MultiQC: summarize analysis results for multiple tools and samples in a single report.
      ].
      STRait Razor v3.0 [
      • Woerner A.E.
      • King J.L.
      • Budowle B.
      Fast STR allele identification with STRait Razor 3.0.
      ] was utilized for the identification of the alleles in every autosomal marker. The program uses the flanks of the repeat region and the repeat motif to assign reads to the correct marker. Only Read 1 sequences were used in this analysis, as it is expected that the full STR region is covered by the first 351 sequencing cycles. The analysis was run with the default parameters (only one bp difference was allowed in the anchor, and none in the motif); using the library configuration file Forenseq.config.strs [
      • Woerner A.E.
      • King J.L.
      • Budowle B.
      Fast STR allele identification with STRait Razor 3.0.
      ] (only STRs were included in the search, see Supplementary file). STRait Razor generates one text file for each sample where each unique sequence is represented once and the number of reads supporting the sequence is reported.
      Next, STR profiles generated by STRait Razor, were analyzed with lusSTR v0.3 [

      Mitchell, R., Standage, D. lusSTR [Internet]. bioforensics; 2021 [cited 2021 Aug 2]. Available from: 〈https://github.com/bioforensics/lusSTR〉.

      ]. In this step, extracted sequences are annotated into a short bracket format (forward strand direction).

      2.5 Data processing and identification of stutters

      Data preparation, calculation of stutter proportions, and statistical modelling, was performed in R (v 4.0.2) [

      R Core Team (2021). R: A language and environment for statistical computing. [Internet]. Vienna, Austria: R Foundation for Statistical Computing; Available from: URL 〈https://www.R-project.org/〉.

      ]. Scripts with all commands and functions are contained in the R package brackconv (https://github.com/Mariamsp/brackconv). After the analysis with lusSTR, we obtained 786,549 sequences, which were classified as either noise, allele, or stutter. A threshold of 10 reads was applied to filter out noise. Most of these sequences are singletons and they are likely due to sequencing errors [
      • Sharma V.
      • Chow H.Y.
      • Siegel D.
      • Wurmbach E.
      Qualitative and quantitative assessment of Illumina’s forensic STR and SNP kits on MiSeq FGxTM..
      ,

      Fox EJ, Reid-Bayliss KS. Accuracy of Next Generation Sequencing Platforms. J Gener Seq Appl [Internet]. 2014 [cited 2021 Jun 4];01(01). Available from: 〈https://www.omicsonline.org/open-access/accuracy-of-next-generation-sequencing-platforms-jngsa.1000106.php?aid=28132〉.

      ]. To classify sequences, heterozygote balance (Hb) values were calculated [
      • Clayton T.M.
      • Whitaker J.P.
      • Sparkes R.
      • Gill P.
      Analysis and interpretation of mixed forensic stains using DNA STR profiling.
      ]:
      Hb=Na2Na1
      (3)


      where Na1 and Na2 are the highest and second highest read counts respectively (hence Hb1) per sample and locus combination. The sequence with the highest read counts (a1) is always classified as a true allele, whereas the one with the second highest (a2) is classified as a second true allele only if Hb>0.3 (motivation for this criterion is described in Section 3.2).
      After the classification of the true alleles, a final filter was applied to omit ambiguous stutter sequences where origin cannot be identified. For example, consider the alleles [ATCG]11 and [ATCG]13, with a stutter observed as [ATCG]12. It cannot be distinguished if this allele is a forward-stutter from the [ATCG]11, a backward-stutter from [ATCG]13, or a combination of these two. Therefore, stutter observation [ATCG]12 would be removed from the analysis.
      Afterwards, we extracted the most frequent stutter types (~ 95% of all stutters) (see Supplementary Tables S2 and S3):
      • 1.
        Backward stutters: n-1 and n-2 types
      • 2.
        Forward stutters: n+1 and n+2 types
      • 3.
        n0 stutter: n-1 and n+1 products in the same sequence
      The final table consisted of 7760 stutter sequences and their respective parental allele (see Supplementary Table S4).

      2.6 Statistical analysis of stutters

      2.6.1 Beta regression model

      The aim of the stutter characterization is to study stutter proportions in relation with other variables (explanatory). Under this scenario, the stutter proportion is considered as a continuous dependent (response) variable that is beta distributed; values are skewed and restricted to the interval [0,1]. Beta regression has been described as a model that is tailored for these situations [
      • Ferrari S.
      • Cribari-Neto F.
      Beta regression for modelling rates and proportions.
      ,
      • Kieschnick R.
      • McCullough B.D.
      Regression analysis of variates observed on (0, 1): percentages, proportions and fractions.
      ,
      • Bayer F.M.
      • Cribari-Neto F.
      Model selection criteria in beta regression with varying dispersion.
      ,

      Thomopoulos NT. Statistical Distributions [Internet]. Cham: Springer International Publishing; 2017 [cited 2021 Aug 19]. Available from: 〈http://link.springer.com/10.1007/978–3-319–65112-5〉.

      ,

      Seefeld K., Ed M., Linder E. Statistics Using R with Biological Examples [Internet]. Durham, NH: University of New Hampshire, Department of Mathematics & Statistics; 2007. Available from: 〈https://cran.r-project.org/doc/contrib/Seefeld_StatsRBio.pdf〉.

      ] and it assumes that the dependent variable has an expected mean E(y)=μ and a variance VAR(y)=μ(1μ)/(1+φ) where φ is a precision parameter, and larger φ leads to smaller variability (dispersion) of the data for a given μ (see Supplementary Section A for a more detailed description of the model). Different model configurations were assessed using the betareg [
      • Cribari-Neto F.
      • Zeileis A.
      Beta regression in R.
      ] R-package (v3.1–4), and in all of them, a logit link function was selected for mapping the original [0,1] stutter proportion values into the real numbers line. Fitted curves with the estimated stutter proportion mean and quantile confidence intervals were obtained using the predict function. The precision parameter φ can depend on the explanatory variables, similarly as for the mean μ. Hence, models were compared for a regression with a constant precision parameter (no explanatory variables), versus a regression with a variable precision parameter where explanatory variables were included.
      As explained below (Section 2.6.2), several variables were assessed for their inclusion in the model, but the length of the parental uninterrupted stretch (PTUS) was considered pivotal for defining the analysis workflow (see Supplementary Fig. S1). This is a continuous variable that refers to the number of repeats contained in the parental uninterrupted stretch (or number of repeats of the parental motif); e.g., in the locus D12S391, we observed the parental allele [AGAT]13 [AGAC]9 and the correspondent stutter [AGAT]12 [AGAC]9, in which case, PTUS = 13, as the motif AGAT, has 13 repeats.

      2.6.2 Characterization of LUS stutters

      In this first part, stutter sequences from D13S317, D8S1179, D9S1122, D1S1656 and 18 more markers, were analyzed (see Supplementary Table S1 and Fig. S1). All sequences have a common feature, only the longest uninterrupted stretch (LUS) of the parental allele lead to a stutter, meaning that non-LUS stutters were not observed. Therefore, PTUS equals only to LUS.

      2.6.2.1 Investigated variables

      We were interested in exploring the appropriate variables that explained the change of stutter proportions. Thus, a comprehensive model comparison was carried out where the following explanatory variables were considered (see Supplementary Table S5):
      • 1.
        PTUS (abbreviation of parental uninterrupted stretch) (continuous): Number of repeats of the parental uninterrupted stretch (parental motif).
      • 2.
        LUSRep (continuous): Number of repeats of the longest uninterrupted stretch of the parental allele (the motif which the stutter originates from).
      • 3.
        MotifDiff (factor): Sequence motif that stutters and difference in number of repeats compared to the parental allele.
      • 4.
        stutterType (factor): Type of stutter generated from the parental allele. It can be a longer sequence (e.g., n+1) or shorter (e.g., n-1). This variable was used to analyse data across all categories of stutter types.
      • 5.
        Locus (factor): 27 STR markers were analysed. This variable was included in the model when analysis across all loci was performed. See Supplementary Table S1 with the full list of markers.
      • 6.
        Complexity (factor): STRs are divided into different categories depending on the repeat pattern [
        • Butler J.
        • Hill C.
        Biology and genetics of new autosomal STR loci useful for forensic DNA analysis.
        ]. These are the following:
        • 6.1
          Simple repeats (repeat units/motifs are identical in length and sequence).
          • E.g., [ATCT]11.
        • 6.2
          Compound repeats (two or more simple adjacent repeats/motifs).
          • E.g., TCTA TCTG [TCTA]12.
        • 6.3
          Complex repeats (combination of repeat units/motifs with variable length and/or sequence).
          • E.g., [TCTA]6 [TCTG]5 [TCTA]3 TA [TCTA]3 TCA [TCTA]2 TCCATA [TCTA]10.

      2.6.2.2 Model comparison

      Different model configurations of the locus and stutter categories were investigated:
      • 1.
        All loci and stutter types together (ALL model)
      • 2.
        Per marker, but all stutter types together (per-marker model)
      • 3.
        Per stutter type, but all markers together (per-stuttertype model)
      • 4.
        Per marker and per stutter type (per-marker-stuttertype model)
      A minimum number of five observations were required for a marker to be recorded within a particular stutter type. To decide the best fit model, Bayesian information criterion (BIC) [
      • Gelman A.
      • Carlin B.J.
      • Stern S.H.
      • Dunson B.D.
      • Vehtari A.
      • Rubin B.D.
      ,

      Shmueli G. To Explain or to Predict? Stat Sci [Internet]. 2010 Aug 1 [cited 2022 Jan 20];25(3). Available from: 〈https://projecteuclid.org/journals/statistical-science/volume-25/issue-3/To-Explain-or-to-Predict/10.1214/10-STS330.full〉.

      ] was calculated and the model configuration with the lowest score was selected. BIC was the preferred method based on the premise that the interest lies in selecting the best model that accurately infers the “true distribution”[

      Shmueli G. To Explain or to Predict? Stat Sci [Internet]. 2010 Aug 1 [cited 2022 Jan 20];25(3). Available from: 〈https://projecteuclid.org/journals/statistical-science/volume-25/issue-3/To-Explain-or-to-Predict/10.1214/10-STS330.full〉.

      ]. Afterwards, the selected model was used to characterize the stutter proportions.

      2.6.3 Characterization of LUS and non-LUS stutters

      In this second part (see Supplementary Table S1 and Fig. S1), stutters were observed from two different parental motifs, either LUS or non-LUS. In another words, PTUS could equal to LUS or non-LUS values. Stutters were analyzed in markers D12S391, D21S11, D2S1338, D6S1043, TH01. As an example, D12S391 with the generic structure [AGAT]n1 [AGAC]n2, can lead to n-1 stutters from both motifs:
      1) Parental [AGAT]11 [AGAC]9 and stutter [AGAT]10 [AGAC]9.
      2) Parental [AGAT]14 [AGAC]9 and stutter [AGAT]14 [AGAC]8.
      The stutter characterization was carried out separately by loci and stutter type; at least five observations were required. For the n-1 stutter type, observations were subdivided into two feature groups (Table 1): MotifDiff and MotifType, where the latter has a binary outcome: stutters originating from the LUS or not (non-LUS; i.e., a different parental uninterrupted stretch). In Table 1, two examples are shown: For D12S39, n-1 stutters can arise from variation in repetitions of two different motifs, AGAT and AGAC. AGAT is part of the LUS and AGAC is part of the second longest parental uninterrupted stretch (non-LUS). For the D6S1043 locus a single ATCT motif was observed, but it is contained either in the LUS or in a different parental uninterrupted stretch.
      Table 1Examples of loci from multiple motif origin dataset.
      LocusStutter typeGroupSequence typeSequence (bracket format)
      MotifDiffMotifType
      D12S391n-1AGAT:−1LUSParental[AGAT]14 [AGAC]9
      Stutter[AGAT]13 [AGAC]9
      AGAC:−1non-LUSParental[AGAT]14 [AGAC]9
      Stutter[AGAT]14 [AGAC]8
      D6S1043n-1ATCT:−1LUSParental[ATCT]6 ATGT [ATCT]13
      Stutter[ATCT]6 ATGT [ATCT]12
      ATCT:−1non-LUSParental[ATCT]6 ATGT [ATCT]13
      Stutter[ATCT]5 ATGT [ATCT]13
      Beta regression models were fitted for each MotifType group, and the parental uninterrupted stretch was used as the explanatory variable PTUS. The linear predictor was defined as
      β0+β1PTUS
      (4)


      For situations where the variable parental uninterrupted stretch has a small range of values (lack of variability) and the initial model fails, we propose an intercept regression model, where the mean value of the stutter proportion is constant (this is equivalent to β1=0).
      In a separate analysis, dependency was investigated between pairs of n-1 stutter sequences that were originated from the same parental (Table 1), where one has PTUS equal the LUS and the other is equal the non-LUS. Residuals were extracted for all pairs (based on model described above) and the correlation was measured with the Pearson coefficient (95% confidence intervals were also calculated).

      2.6.4 Characterization of n0 stutters

      We observed sequences that simultaneously stutter in two different motifs from the same parental allele (see Supplementary Table S1 and Fig. S1). These are less common and were described in [
      • Li R.
      • Wu R.
      • Li H.
      • Zhang Y.
      • Peng D.
      • Wang N.
      • et al.
      Characterizing stutter variants in forensic STRs with massively parallel sequencing.
      ] as n0, since there is no net increase or decrease in sequence length. Two variants of this type of stutter were found:
      1) For marker D12S391 the parental sequence [AGAT]13 [AGAC]11 can originate [AGAT]14 [AGAC]10 stutter sequences. Both the LUS and the second longest uninterrupted stretch (non-LUS) are different from the parental sequence. This is termed as n+1 / n-1; a combination of a forward stutter from LUS and a backward stutter from non-LUS.
      2) In marker D3S1358 the parental and stutter sequences are TCTA [TCTG]3 [TCTA]13 and TCTA [TCTG]2 [TCTA]14, respectively. This type of combination is termed as n-1 / n+1, where backward and forward formations are observed together.
      The n0 stutters were fitted using a beta regression where the two PTUS variables, one for each of the corresponding motifs, were included as explanatory variables:
      β0+β1PTUS1+β2PTUS2
      (5)


      3. Results

      3.1 Quality check with FastQC

      The quality of samples was evaluated with FastQC and results were summarized with MultiQC. There was a trend in decreasing base quality towards the 3′ end of reads, as expected, in a library of different amplicon lengths (see Supplementary Table S6).

      3.2 Genotype identification

      A minimum threshold of 10 reads is expected to remove sequences containing errors produced during MPS library prep (PCR amplification), and sequencing. The application of this filter resulted in the removal of 748553 noise sequences and approximately 86% of these, corresponded to singletons (see Supplementary Table S7). In Section 2.5 a heuristic approach was described to classify true allele pairs and stutters by specifying an Hb threshold of 0.3. Fig. 2 represents all a2 sequences that are compatible with being a n-1 stutter of the first true allele a1. Two distinctive clusters of a2 sequences were observed based on the read counts and Hb. Cluster 1 (Hb ≤ 0.3) is assumed to correspond to n-1 stutters, whilst sequences in cluster 2, those with Hb > 0.5, are defined as the second true allele. Supplementary Fig. S2 shows the same data, but categorized by marker, confirming the distinct separation of the two clusters.
      Fig. 2
      Fig. 2Read count and Hb of sequences classified as a2 in n-1 stutter position. The red solid line indicates a Hb value of 0.3. The dashed blue line indicates the minimum threshold of 10 reads.
      After running the analysis to classify all sequences, a total of 10151 a1(highest read) and 8092 a2 (lowest read) alleles were counted. Finally, 7760 stutters were identified. Please see Supplementary Table S8 for a detailed summary with the number of observations for each combination of marker and stutter type.

      3.3 Stutter analysis

      3.3.1 Characterization of LUS stutters

      3.3.1.1 Model comparison

      The effect of different predictors of stutter proportion was assessed (see Supplementary Table S5) with a beta regression for 22 markers, where only the LUS of the parental sequence leads to a stutter, meaning that PTUS equals only to LUS. Table 2 shows the two best-fit models with their respective BIC values where a constant precision was considered. Following the explanation in 2.6.1, 2.6.2, models were built differently as follows:
      Table 2Two best-fit beta regression models with constant precision parameter. logLik is the natural logarithm of the likelihood value and nObs is the number of observations.
      Linear predictor (Model)Model TypelogLikParam (n)nObsBIC
      1)β0+β1PTUSper marker / stutter type15497.858891354813-29851
      2)β0+β1PTUS+β2ATcontentSeq1per marker / stutter type15530.109661464813-29822
      - Model 1: Stutter types and markers were evaluated separately, where PTUS was used as the explanatory variable.
      - Model 2: Same as in Model 1, but the AT content of the parental sequence ATcontentSeq1 was also included to explain the variation of stutter proportions.

      3.3.1.2 Extension of the precision model

      We performed and extended analysis of Model 1, where we investigated whether the precision is best fitted as a constant (small model) or whether it should be modelled as a separate regression with PTUS as an explanatory variable (large model). As observed from the BIC scores in Table 3, the large model did not perform better compared to the small model. This may be explained because BIC gives a larger penalty per parameter when the dataset is large, and therefore favors simpler models.
      Table 3Comparison of beta regression with constant or variable precision. logLik is the natural logarithm of the likelihood value and nObs is the number of observations.
      Beta regression model typelogLiknParamnObsBIC
      Constant precision (small model)15497.861354813-29851
      Variable precision (large model)15566.321804813-29606.4
      A second evaluation of data for the small and large models was performed with probability-probability (P-P) plots (Supplementary Fig. S3). In both models, the points were following very close to the identity line, indicating adequate models.

      3.3.1.3 Stutter proportion sizes

      Overall, largest proportions were observed with n-1 stutters compared to other types (Fig. 3A; Supplementary Fig. S4 and S5), although there are some exceptions e.g., D13S317 (0.01–0.05; Fig. 3B). In Fig. 3C, n-2 stutters are described for D8S1179 and in Fig. 3D n+1 stutters are represented for marker D9S1122. Patterns are similar for all markers, with proportion trends maintained in an interval of approximate 0.01–0.05 (Supplementary Fig. S4 and S5).
      Fig. 3
      Fig. 3Proportions for different stutter types explained by the length of the parental uninterrupted stretch (PTUS). n-1 stutters are plot for markers (A) D1S1656 and (B) D13S317, in (C) n-2 sequences for D8S1179 and in (D) n+1 stutters of marker D9S1122. Expected mean responses (fitted curves) are represented by solid orange lines.
      Expected means (μ) were plotted, and a consistent pattern was observed across a total of 22 loci, where stutter proportions increased with the length of the parental uninterrupted stretch (PTUS).
      Supplementary Fig. S6 illustrates 95% confidence intervals of the estimates of PTUS slopes in n-1 (A), n-2 (B) and n+1 (C) stutter proportion regressions. When intervals included 0 or negative estimates, a new model was built based on the intercept value only, i.e., the expected stutter proportion is constant for all possible lengths of PTUS. See also Supplementary Table S8 for details about the estimated coefficients.

      3.3.2 Characterization of LUS and non-LUS stutters

      Markers D12S391, D21S11, D2S1338, D6S1043 and TH01 (x.3 microvariants) were analysed separately (estimated coefficients provided in Supplementary Table S8). Backward (n-1) stutters generated from these loci follow different patterns of formation that correspond to respective parental LUS or non-LUS regions.
      A simple linear predictor β0+β1PTUS was applied to obtain the coefficient regression estimates for the two stutter types (LUS vs non-LUS). Fig. 4 shows that for markers D2S1338, D12S391, D21S11 and D6S1043, stutter proportion varies with the length of the parental uninterrupted stretch (PTUS). The fitted curves showed different trends between groups of stutters from LUS and non-LUS within the same locus. Non-LUS stuttered less than LUS.
      Fig. 4
      Fig. 4Proportion of n-1 stutter sequences explained by the length of the parental uninterrupted stretch in (A) D2S1338, (B) D12S391, (C) D21S11 and (D) D6S1043. Stutter sequences were grouped by type of parental uninterrupted stretch (LUS or non-LUS). Fitted curves were also plot for showing the trend of the stutter proportions.
      For the TH01 marker, only the 9.3 microvariant lead to a non-LUS stutter product, and a constant model was applied for modeling these observations (See n-1, n+1 and n-2 stutter analysis in Supplementary Fig. S7).
      As described in Section 2.6.3, we also assessed the dependency between pairs of n-1 stutter sequences, where one is LUS, and other is non-LUS. The correlations were calculated as: 0.066, 0.2, 0.034 and 0.32 (Supplementary Fig. S8), respectively for markers D12S391, D21S11, D2S1338 and D6S1043. None of the markers showed significant dependency (i.e., the 95% CI included zero).

      3.3.3 Characterization of n0 stutters

      Coefficient estimates were calculated for n0 stutters observed in loci D12S391, D2S1338, D3S1358, D19S433, D21S11 and D2S441 (Supplementary Table S8). As explained in Section 2.6.4, these sequences originate from simultaneous variation in length of two parental uninterrupted stretches (PTUS1 and PTUS2). Fig. 5. shows the trend for n-1 / n+1 combinations in D3S1358, where the n-1 stutter occurs in the 5′ repeat region and the n+1 occurs in the 3′ repeat region. The expected stutter proportion increases with the lengths of PTUS1 and PTUS2. Similar findings were observed for marker D12S391, but for the n+1 / n-1 combination (Supplementary Fig. S9).
      Fig. 5
      Fig. 5Proportion of the n0 stutter combination n-1 / n+1 in marker D3S1358. Values of first and second parental uninterrupted stretches have been combined into a new PTUS factor variable to facilitate a 2D plot of the observations. Points correspond with observed sequences and short red lines with the expected stutter proportion for each category of the new factor variable.

      4. Discussion

      4.1 Setting the analytical thresholds

      A pivotal part of the analysis of MPS sequences and the effective resolution of DNA mixtures [
      • Young B.
      • King J.L.
      • Budowle B.
      • Armogida L.
      A technique for setting analytical thresholds in massively parallel sequencing-based forensic DNA analysis.
      ,
      • Zeng X.
      • King J.L.
      • Budowle B.
      Investigation of the STR loci noise distributions of PowerSeqTM auto system.
      ], relates to the establishment of optimal analytical thresholds. Setting thresholds, allows categorisation of noise, allele and stutter sequences. There are multiple strategies to assign different sequences but they are all heuristic and there is no preferred approach. Some authors have calculated thresholds based on rules that balance the cost of encountering more or less drop-in / drop-out events [
      • Young B.
      • King J.L.
      • Budowle B.
      • Armogida L.
      A technique for setting analytical thresholds in massively parallel sequencing-based forensic DNA analysis.
      ,
      • Riman S.
      • Iyer H.
      • Borsuk L.A.
      • Vallone P.M.
      Understanding the characteristics of sequence-based single-source DNA profiles.
      ]. Others set a minimal threshold and designate “true alleles” based on a heterozygote balance of 0.4 [
      • Vilsen S.B.
      • Tvedebrink T.
      • Eriksen P.S.
      • Bøsting C.
      • Hussing C.
      • Mogensen H.S.
      • et al.
      Stutter analysis of complex STR MPS data.
      ,
      • Li R.
      • Wu R.
      • Li H.
      • Zhang Y.
      • Peng D.
      • Wang N.
      • et al.
      Characterizing stutter variants in forensic STRs with massively parallel sequencing.
      ]. FDSTools [
      • Hoogenboom J.
      • van der Gaag K.J.
      • de Leeuw R.H.
      • Sijen T.
      • de Knijff P.
      • Laros J.F.J.
      FDSTools: a software package for analysis of massively parallel sequencing data with the ability to recognise and correct STR stutter and other PCR or sequencing noise.
      ], for reference samples, sets a noise threshold of 15% of the most frequent allele for a given locus, where the expectation is that the one or two of the most abundant sequences are allelic in origin; sequences are defined as stutters when read counts are less than 30% of the allele with largest read counts. MPSproto (Section 4.8) is a continuous probabilistic genotyping model, that differs from previous approaches because it does not depend upon heuristics other than a minimum analytical threshold of 10 reads to remove noise sequences.
      The MPS data produced for this paper was based on five runs in total (Supplementary Table S6B), where four of them contained more than 90 samples, but only one had 32. With this experimental setup, there is an expectation of a lower number of reads per sample, compared to runs with fewer samples. This may have an impact by lowering the number of observed stutters, as these can be under the 10 reads threshold. Consequently, it would be more likely to observe uncommon stutters (e.g., n-2 and n+1) with runs containing fewer samples.
      In this study, we did not run any additional CE analysis to confirm genotypes. This would be beneficial for the analysis of e.g., locus D22S1045, where high imbalance has been detected, and concordance studies could avoid inclusion of ‘false stutters’.

      4.2 Stutter ratio vs stutter proportion

      There is a one-to-one transformation between stutter ratio and stutter proportions, it is solely a matter of preference which to use. Taking the logarithm of stutter ratios enables use of ordinary linear regression models [
      • Bright J.A.
      • Taylor D.
      • Curran J.M.
      • Buckleton J.S.
      Developing allelic and stutter peak height models for a continuous method of DNA interpretation.
      ,
      • Bright J.A.
      • Curran J.M.
      Investigation into stutter ratio variability between different laboratories.
      ,
      • Kelly H.
      • Bright J.A.
      • Buckleton J.S.
      • Curran J.M.
      Identifying and modelling the drivers of stutter in forensic DNA profiles.
      ]. Stutter proportions were chosen to be consistent with EuroForMix. The coefficients estimated from beta regressions were plugged directly into MPSproto to utilize the same stutter model definition as in EuroForMix. A model comparison of stutter ratio versus proportions was not within the scope of this paper.

      4.3 Model comparison for LUS stutters

      In this part of the study, stutter sequences originated from a PTUS that was the same as the LUS. Multiple variables were evaluated to model stutter proportions. Finally, the model with the lowest BIC score was selected, and only PTUS was adopted as explanatory variable. The second best-fit corresponded to a model that included the AT-content as a second explanatory variable. Brookes et al. [
      • Brookes C.
      • Bright J.A.
      • Harbison S.
      • Buckleton J.
      Characterising stutter in forensic STR multiplexes.
      ] described a possible impact of AT-content in stutter formation but we did not further evaluate this model here.

      4.4 Characterization of LUS stutters

      Proportions for stutter types n-1, n-2, n+1 and n+2 were evaluated. In this dataset, for most of the markers, n-1 stutter proportions increased with the length of the parental uninterrupted stretch (PTUS) (Fig. 3 and Supplementary Fig. S4). This aligns with the regression proposed by Vilsen et al. [
      • Vilsen S.B.
      • Tvedebrink T.
      • Eriksen P.S.
      • Bøsting C.
      • Hussing C.
      • Mogensen H.S.
      • et al.
      Stutter analysis of complex STR MPS data.
      ], using the block length of the missing motif (BLMM). For this particular analysis, all of the PTUS corresponds to the LUS, consequently, results are also consistent with previous studies showing an association between the amount of stutter and the LUS [
      • Walsh P.S.
      • Fildes N.J.
      • Reynolds R.
      Sequence analysis and characterization of stutter products at the tetranucleotide repeat locus VWA.
      ,
      • Brookes C.
      • Bright J.A.
      • Harbison S.
      • Buckleton J.
      Characterising stutter in forensic STR multiplexes.
      ,
      • Bright J.A.
      • Taylor D.
      • Curran J.M.
      • Buckleton J.S.
      Developing allelic and stutter peak height models for a continuous method of DNA interpretation.
      ,
      • Bright J.A.
      • Curran J.M.
      Investigation into stutter ratio variability between different laboratories.
      ,
      • Kelly H.
      • Bright J.A.
      • Buckleton J.S.
      • Curran J.M.
      Identifying and modelling the drivers of stutter in forensic DNA profiles.
      ].
      For the remaining variants, a clear result was not obtained (Fig. 3 and Supplementary, Fig. S5), and for many examples, an intercept model (i.e., a model where the beta regression slope coefficient = 0) was used to describe mean stutter proportion. Such models return a constant value irrespective of the PTUS length (Supplementary Table S8).
      In general, there were too few observations to characterize n+2 stutters. Although we obtained 13 observations for marker D22S1045 which returned an estimated mean stutter proportion of ≈ 0.15. We suspect that this high estimate is caused by interference from other stutter types or high heterozygous balance.
      D2S441 n-1 stutter proportions were poorly described by PTUS, and an intercept (constant) model was used to fit the data. Bright et al. have reported a similar behavior [
      • Bright J.A.
      • Taylor D.
      • Curran J.M.
      • Buckleton J.S.
      Developing allelic and stutter peak height models for a continuous method of DNA interpretation.
      ]. For this stutter type, proportions could be modelled as two separate regressions: (1) PTUS with lengths of 7–9 repeats and (2) PTUS with lengths of 10–13 repeats of the proportions (Supplementary Fig. S4 plot A). This is consistent with the study in [
      • Westen A.A.
      • Grol L.J.W.
      • Harteveld J.
      • Matai A.S.
      • de Knijff P.
      • Sijen T.
      Assessment of the stochastic threshold, back- and forward stutter filters and low template techniques for NGM.
      ]. The disruption of the continuous trend could be explained by several factors. Allele microvariants 11.3 and 12.3 ([TCTA]4 TCA [TCTA]7 and [TCTA]4 TCA [TCTA]8), which are categorized in (1), have been observed. In both microvariants, the LUS (also the PTUS) is at the 3′ end of the repeat region, whilst for all other alleles in the population, LUS is at the 5′ end. Also, in parental sequences from stutters in (1), the motif TCTG was observed (e.g., parental allele [TCTA]8 TCTG TCTA). In addition, the repeat motif TTTA was detected in both categories. Phillips et al. [
      • Phillips C.
      • Fernandez-Formoso L.
      • Garcia-Magariños M.
      • Porras L.
      • Tvedebrink T.
      • Amigo J.
      • et al.
      Analysis of global variability in 15 established and 5 new European Standard Set (ESS) STRs using the CEPH human genome diversity panel.
      ] described the presence of the SNP A >G (TCT[A/G]) in the repeat region and the TTTA motif. These differences in the sequence structures, may explain the double pattern in the stutter proportions. A more complex model would take account of both sets of repeats.

      4.5 Characterization of LUS and non-LUS stutters

      For some markers, it was observed that multiple motifs of the parental repeat region can lead to correspondingly different n-1 stutter sequences where PTUS is either LUS or non-LUS. In D12S391, motif AGAT was found in a PTUS that coincides with LUS, whereas the motif AGAC formed a PTUS that belonged to the non-LUS category. In Fig. 4B trends from separate beta regression models for each group of stutters (LUS or non-LUS) were plotted. The longer the PTUS, the larger the stutter proportion. Results were similar to Just et al. [
      • Just R.S.
      • Irwin J.A.
      Use of the LUS in sequence allele designations to facilitate probabilistic genotyping of NGS-based STR typing results.
      ] where LUS was included as a predictor of the stutter rates. Woerner et al. [
      • Woerner A.E.
      • King J.L.
      • Budowle B.
      Compound stutter in D2S1338 and D12S391.
      ] investigated stutter rates in D12S391 and D2S1338. They proposed to simultaneously model the stutter formation from both parental uninterrupted stretches (LUS and non-LUS) with a multivariate regression to account for dependency. Although, we did not find any significant dependency in our study, further modelling of these stutters could be investigated in a future.
      All loci were identified as compound, except for TH01, which is a simple STR. Inclusion of this marker was because of allele 9.3, since the same motif sequence [AATG] is found in LUS and non-LUS parental stretches. Only 13 stutters were observed from the non-LUS parental regions, and an intercept model was built. Stutter for the remaining alleles of TH01, were modelled as described for LUS stutters.
      In general, stutter proportions for non-LUS parental uninterrupted stretches are distinctly smaller compared to LUS stutters. For D6S1043 locus, in Fig. 4D, stutters from non-LUS regions are minimal (close to 0) and they are only observed when the length (number of repeats) of the PTUS is equal to five or six. DNA slippage seems to occur more frequently in LUS regions; we observed that larger stutter proportions (between 0.04 and 0.13) are associated with larger PTUS values in a wider interval of 9–15 repeats. For both LUS and non-LUS groups, the same sequence motif ATCT was observed. The motif is at the 5′ end of the repeat region in the LUS group and at the 3′ end, in the non-LUS group. This pattern was also observed in locus D21S11, where sequence motifs TCTG (non-LUS) and TCTA (LUS) are 5′ and 3‘, respectively.

      4.6 Characterization of n0 stutters

      A total of 180 observations corresponds to stutter sequences that cannot be classified into any of the well-described stutters: n-1, n-2, n+1 or n+2. The main characteristic of these sequences is that two PTUS simultaneously stutter (typically n-1 and n + 1), so that there is no net change in length. These events were termed n0 [
      • Li R.
      • Wu R.
      • Li H.
      • Zhang Y.
      • Peng D.
      • Wang N.
      • et al.
      Characterizing stutter variants in forensic STRs with massively parallel sequencing.
      ]. Young et al. [
      • Young B.
      • King J.L.
      • Budowle B.
      • Armogida L.
      A technique for setting analytical thresholds in massively parallel sequencing-based forensic DNA analysis.
      ] briefly mentioned that stutters may also be the same length as the parental allele. In a recent publication, Woerner et al. [
      • Woerner A.E.
      • Mandape S.
      • King J.L.
      • Muenzler M.
      • Crysup B.
      • Budowle B.
      Reducing noise and stutter in short tandem repeat loci with unique molecular identifiers.
      ], considered n0 stutters as an example of “reciprocal slippage events”. It is unclear what process leads to n0 stutters, but [
      • Li R.
      • Wu R.
      • Li H.
      • Zhang Y.
      • Peng D.
      • Wang N.
      • et al.
      Characterizing stutter variants in forensic STRs with massively parallel sequencing.
      ] hypothesized direct formation of the sequence from the parental allele in a single step. In our study, the stutter proportion sizes of n0 sequences increased with lengths of PTUS1 and PTUS2 (see Fig. 5).

      4.7 Future research. Improvement of present model

      The results obtained from this study provides scope for future research. For many years, AT-content has been considered as a possible factor contributing to enhanced stutter formation. We observed that the second-best model included AT-content as a second predictor variable. Improved beta regression models could include AT-content of the parental allele sequence. In addition, it has been shown that flanking regions have an impact on stutter rates [
      • Woerner A.E.
      • King J.L.
      • Budowle B.
      Compound stutter in D2S1338 and D12S391.
      ,
      • Woerner A.
      • King J.
      • Budowle B.
      Flanking variation influences rates of stutter in simple repeats.
      ], and this merits further investigation.

      4.8 MPSProto

      Bleka et al. [
      • Bleka Ø.
      • Just R.
      • Le J.
      • Gill P.
      An examination of STR nomenclatures, filters and models for MPS mixture interpretation.
      ] and chapter 13 of Gill et al. [

      Gill P., Bleka Ø, Benschop C., Haned H. Forensic Practitioner’s Guide to the Interpretation of Complex DNA Profiles [Internet]. First. Elsevier; 2020 [cited 2021 Nov 17]. Available from: 〈https://linkinghub.elsevier.com/retrieve/pii/C20190012332〉.

      ] described an adaptation of EuroForMix to analyse MPS-STRs using LUS nomenclature. In this earlier work, only n-1 stutter was modelled without stutter filters. Furthermore, the stutter parameter was applied universally. As indicated in [
      • Bleka Ø.
      • Just R.
      • Le J.
      • Gill P.
      An examination of STR nomenclatures, filters and models for MPS mixture interpretation.
      ], the application of stutter filters is generally unsatisfactory because there is a loss of information that in turn leads to reduced LRs for minor contributors (Hp=true).

      4.8.1 Extension of the EuroForMix model (MPSproto)

      We extended the EuroForMix model as a new tool called MPSproto which is available from https://github.com/oyvble/MPSproto; an outline of the theory that is used is also provided for the interested reader. This will be expanded into a separate publication, along with a validation study. The main focus of this study is to characterize stutters, since when present, they can compromise mixture interpretation. With MPS-STR analyses, full sequences are obtained, thus it is possible to investigate the great diversity and complexity of stutters that are encountered. The original EuroForMix adaptation took account of MPS-STRs [
      • Bleka Ø.
      • Just R.
      • Le J.
      • Gill P.
      An examination of STR nomenclatures, filters and models for MPS mixture interpretation.
      ] and utilized LUS sequences based upon the standard approach used for CE-STR analysis. MPSproto has been extended specifically to accommodate the special challenges presented by analysis of MPS-STR mixtures and there are important differences that have been introduced. In particular, instead of carrying out analyses based upon generalized per sample parameters, these are now marker-specific. Consequently, MPSproto requires information from a calibration exercise that uses control data samples from single contributors. There are three new parts to the model:
      • 1.
        Stutter model,
      • 2.
        Noise model and
      • 3.
        Marker Efficiency.
      1) The stutter model:
      The focus of this paper is to characterize multiple kinds of stutters with beta regressions outlined in Section 2.6. In EuroForMix, stutters were modelled with a single expected stutter proportion parameter which was estimated based on data across all markers [
      • Bleka Ø.
      • Storvik G.
      • Gill P.
      EuroForMix: An open source software based on a continuous model to evaluate STR DNA profiles from a mixture of contributors with artefacts.
      ,
      • Gill P.
      • Bleka Ø
      • Hansson O.
      • Benschop C.
      • Haned H.
      Chapter 7 - The quantitative (continuous) model theory.
      ]. The difference with MPSproto is that the expected stutter proportion for multiple stutter types (n+1, n-1, n+2, n-2, n0, etc) are determined by their respective per marker beta regression models illustrated in 2.6, 3.3: MPSproto takes the regression coefficients, provided per stutter type and per marker, as input (see Eq. 5 and Supplementary Table S8).
      As discussed in Section 4.4, caution is needed with the n+2 stutter model for D22S1045 (a higher expected stutter proportion was estimated). The effect for the MPSproto interpretation on the mixture is that sequences in the n+2 position of a major contributor would be more likely to be a stutter than from an allele.
      2) The noise model:
      With MPS there are numerous low-read sequences that do not originate from stutters or known contributors. Many are unclassified sequence artefacts. Instead of using the conventional drop-in model of EuroForMix, this has now been replaced by a "noise model" consisting of two parts:
      a) The number of noise sequences follows a geometric distribution.
      b) The read count of the noise sequence follows a discretized pareto distribution.
      These two models were calibrated based on a “noise dataset”, i.e., alleles that could not be attributed to either known contributors or stutters.
      It is important to note that the calibration noise model should reflect the experimental regime in use. For example, a model based upon c. 90 pooled samples may have lower noise size per sample compared to a model that is based upon 30 samples. In future, we aim to develop a noise model that is able to incorporate this effect.
      3) Marker efficiency:
      As provided by [
      • Cheng K.
      • Lin M.
      • Moreno L.
      • Skillman J.
      • Hickey S.
      • Cuenca D.
      • et al.
      Modeling allelic analyte signals for aSTRs in NGS DNA profiles.
      ], we included locus-specific marker efficiency parameters. This was achieved by scaling the shape argument of the gamma distribution with a marker specific parameter. These were universally decided based on maximum likelihood estimation where the sum of the reads per-marker was used as input (details provided through the MPSproto github link).

      4.8.2 An illustration of mixture interpretation using MPSproto and EuroForMix

      The sample considered was a constructed major-minor mixture profile (Ref1/Ref2) where Ref2 is minor (methodology will be described elsewhere). Calibration of stutter model is important when minor contributors are in range of stutters of a major (Mx=5–10%). Without such a model, minor contributor alleles would be filtered and represent lost information. An example is provided to show the benefits.
      The alternative propositions considered are:
      • Hp: “Ref1 + Ref2 are contributors”
      • Hd: “Ref1 + 1 (unrelated) unknown are contributors”
      An Fst= 0.01 was used as "theta” correction for all calculations.
      For MPSproto we used an analytical threshold of T = 10 reads. Only one allele dropout for Ref2 genotype was observed (Penta E). MPSproto estimated Ref2 as Mx= 5.5% mixture proportion and log10LR= 17.5 was calculated. The model validation PP-plots showed a good fit under both propositions (see Supplementary Fig. S10).
      We also applied EuroForMix to the same sample using the recommended approach from [
      • Bleka Ø.
      • Just R.
      • Le J.
      • Gill P.
      An examination of STR nomenclatures, filters and models for MPS mixture interpretation.
      ]. All sequences were converted into LUS+ nomenclature (including the frequency database) and the sequences in the mixtures were filtered with a static threshold of T = 30 (no stutter filter was applied). We modelled both backward and forward stutter but not degradation. The default drop-in model (prob=0.05 and lambda=0.01) was applied. An LR of log10LR= 10.0 was calculated.

      4.8.3 Comparison of the two models

      We note that the PHvar parameter (coefficient-of-variation) is considerably smaller for MPSproto (0.20) than for EuroForMix (0.47), meaning that calibration of the former model captures much more of the variation.
      The LR-per marker for the two models is illustrated side by side in Fig. 6. For MPSproto all markers showed either inclusionary or neutral support for Ref2, except for marker D9S1122 (log10LR=−0.5). For this marker, EuroForMix provided inclusionary support. The reason for this large difference is the lower T = 10 value for MPSproto compared to T = 30 for EuroForMix. Allele [TAGA]12 (with 27 reads) is interpreted by MPSproto as noise under Hp and as a minor unknown contribution under Hd, whereas for EuroForMix it is removed (since it is below T = 30). There are several markers where MPSproto provided considerably better results compared to EuroForMix: D12S391, D1S1656, FGA, Penta E and VWA, whereas EuroForMix was slightly better for D21S11. Details are as follows:
      • D12S391: An allele dropout for Ref2 using T = 30. There are several stutter artefacts that cannot be accounted for by EuroForMix, but they are for MPSproto.
      • D1S1656: An allele dropout for Ref2 using T = 30.
      • FGA: An additional artefact that is not accounted for using EuroForMix, whereas it is for MPSproto.
      • Penta E: Both alleles of Ref2 drop out using T = 30, but only one for T = 10.
      • vWA: A homozygous allele dropout for Ref2 using T = 30.
      • D21S11: A n+1 stutter artefact (12 reads) was not captured by the MPSproto stutter model (too few observations in the calibration dataset to be part of the model). The LR increased by log10LR= 0.3 when this stutter was accounted for in MPSproto.
      Fig. 6
      Fig. 6LR-per marker comparison between MPSproto and EuroForMix.

      5. Conclusion

      Stutters from MPS-STRs contain more information than their CE counterparts. Unless they are taken into account, their presence places severe restrictions on interpretation of mixtures, particularly when a minor contributor is the person of interest. The alternative strategy that is usually employed is heuristic; stutters and noise are removed using filters; the analytical threshold will be set to a high level. Unfortunately, this will also remove 'true' alleles from minor contributors, so that the LR will often be underestimated if Hp is true. We have shown that taking account of stutter on a per marker basis increases the amount of information that is used by the continuous MPSproto model. This, in turn, eliminates the requirement for stutter filters and enables the analytical threshold to be considerably reduced. The advantage is that this increases the sensitivity of tests when low level (minor) contributors are analyzed, since 'true' alleles are no longer discarded. However, more noise artefacts are present which must be accommodated, especially if few samples are run in batches, and this increases the complexity of the model. Consequently, it is not possible, at present, to reduce the analytical threshold to zero; the current choice of selecting the value at 10 reads is a compromise solution which is considerably lower than that applied in current practice.

      CRediT authorship contribution statement

      Maria Martin Agudo: Conceptualization, Methodology, Writing – review & editing, Software, Formal analysis, Data curation, Investigation. Håvard Aanes: Conceptualization, Methodology, Writing – review & editing, Supervision. Arne Roseth: Investigation. Michel Albert: Investigation. Peter Gill: Conceptualization, Methodology, Writing – review & editing, Supervision. Øyvind Bleka: Conceptualization, Methodology, Writing – review & editing, Software, Formal analysis, Data curation, Investigation, Supervision.

      Acknowledgements

      This work is part of the PhD of Maria Martin Agudo and The project is funded by the Department of Forensic Sciences, Oslo University Hospital, Norway. We would like to thank Dr. Berit Myhre Dupuy for donating the samples that were used in this study. Also, we would like to thank Dr. Eirik Natås Hanssen for helping us in obtaining the samples and for all the support received throughout the development of the study.

      Appendix A. Supplementary material

      References

        • Bruijns B.
        • Tiggelaar R.
        • Gardeniers H.
        Massively parallel sequencing techniques for forensics: a review.
        Electrophoresis. 2018; 39: 2642-2654
        • Barrio P.A.
        • García Ó.
        • Phillips C.
        • Prieto L.
        • Gusmão L.
        • Fernández C.
        • et al.
        The first GHEP-ISFG collaborative exercise on forensic applications of massively parallel sequencing.
        Forensic Sci. Int. Genet. 2020; 49102391
        • Alonso A.
        • Barrio P.A.
        • Müller P.
        • Köcher S.
        • Berger B.
        • Martin P.
        • et al.
        Current state-of-art of STR sequencing in forensic genetics.
        Electrophoresis. 2018; 39: 2655-2668
        • Alonso A.
        • Müller P.
        • Roewer L.
        • Willuweit S.
        • Budowle B.
        • Parson W.
        European survey on forensic applications of massively parallel sequencing.
        Forensic Sci. Int. Genet. 2017; 29: e23-e25
        • Hoogenboom J.
        • van der Gaag K.J.
        • de Leeuw R.H.
        • Sijen T.
        • de Knijff P.
        • Laros J.F.J.
        FDSTools: a software package for analysis of massively parallel sequencing data with the ability to recognise and correct STR stutter and other PCR or sequencing noise.
        Forensic Sci. Int. Genet. 2017; 27: 27-40
        • Hussing C.
        • Huber C.
        • Bytyci R.
        • Mogensen H.S.
        • Morling N.
        • Børsting C.
        Sequencing of 231 forensic genetic markers using the MiSeq FGxTM forensic genomics system – an evaluation of the assay and software.
        Forensic Sci. Res. 2018; 3: 111-123
        • Vilsen S.B.
        • Tvedebrink T.
        • Eriksen P.S.
        • Bøsting C.
        • Hussing C.
        • Mogensen H.S.
        • et al.
        Stutter analysis of complex STR MPS data.
        Forensic Sci. Int. Genet. 2018; 35: 107-112
        • de Knijff P.
        From next generation sequencing to now generation sequencing in forensics.
        Forensic Sci. Int. Genet. 2019; 38: 175-180
        • Liu Y.Y.
        • Harbison S.
        A review of bioinformatic methods for forensic DNA analyses.
        Forensic Sci. Int. Genet. 2018; 33: 117-128
        • Cao M.D.
        • Balasubramanian S.
        • Boden M.
        Sequencing technologies and tools for short tandem repeat variation detection.
        Brief. Bioinform. 2015; 16: 193-204
        • Riman S.
        • Iyer H.
        • Borsuk L.A.
        • Vallone P.M.
        Understanding the behavior of stutter through the sequencing of STR alleles.
        Forensic Sci. Int Genet. Suppl. Ser. 2019; 7: 115-116
        • Young B.
        • King J.L.
        • Budowle B.
        • Armogida L.
        A technique for setting analytical thresholds in massively parallel sequencing-based forensic DNA analysis.
        in: Kalendar R. PLOS One. 12. 2017
        • Walsh P.S.
        • Fildes N.J.
        • Reynolds R.
        Sequence analysis and characterization of stutter products at the tetranucleotide repeat locus VWA.
        Nucleic Acids Res. 1996; 24: 2807-2812
        • Hauge X.Y.
        • Litt M.
        A study of the origin of ‘shadow bands’ seen when typing dinucleotide repeat polymorphisms by the PCR.
        Hum. Mol. Genet. 1993; 2: 411-415
        • Meldgaard M.
        • Morling N.
        Detection and quantitative characterization of artificial extra peaks following polymerase chain reaction amplification of 14 short tandem repeat systems used in forensic investigations.
        Electrophoresis. 1997; 18: 1928-1935
        • Levinson G.
        • Gutman G.A.
        Slipped-strand mispairing: a major mechanism for DNA sequence evolution.
        Mol. Biol. Evol. 1987; 4: 203-221
        • Klintschar M.
        • Wiegand P.
        Polymerase slippage in relation to the uniformity of tetrameric repeat stretches.
        Forensic Sci. Int. 2003; 135: 163-166
        • Schlötterer C.
        • Tautz D.
        Slippage synthesis of simple sequence DNA.
        Nucleic Acids Res. 1992; 20: 211-215
      1. Gill P., Bleka Ø, Hansson O., Benschop C., Haned H. Chapter 2 - Empirical characterization of DNA profiles. In: Forensic Practitioner’s Guide to the Interpretation of Complex DNA Profiles [Internet]. Academic Press; 2020. p. 55–88. Available from: 〈https://doi.org/10.1016/B978–0-12–820562-4.00010–9〉.

      2. Butler JM. Forensic DNA typing: biology, technology, and genetics of STR markers. 2nd ed. Amsterdam; Boston: Elsevier Academic Press; 2005. 660 p.

        • Gill P.
        • Brenner C.H.
        • Buckleton J.S.
        • Carracedo A.
        • Krawczak M.
        • Mayr W.R.
        • et al.
        DNA commission of the international society of forensic genetics: recommendations on the interpretation of mixtures.
        Forensic Sci. Int. 2006; 160: 90-101
        • Gibb A.J.
        • Huell A.L.
        • Simmons M.C.
        • Brown R.M.
        Characterisation of forward stutter in the AmpFlSTR® SGM Plus® PCR.
        Sci. Justice. 2009; 49: 24-31
        • Brookes C.
        • Bright J.A.
        • Harbison S.
        • Buckleton J.
        Characterising stutter in forensic STR multiplexes.
        Forensic Sci. Int. Genet. 2012; 6: 58-63
        • Westen A.A.
        • Grol L.J.W.
        • Harteveld J.
        • Matai A.S.
        • de Knijff P.
        • Sijen T.
        Assessment of the stochastic threshold, back- and forward stutter filters and low template techniques for NGM.
        Forensic Sci. Int. Genet. 2012; 6: 708-715
        • Bright J.A.
        • Taylor D.
        • Curran J.M.
        • Buckleton J.S.
        Developing allelic and stutter peak height models for a continuous method of DNA interpretation.
        Forensic Sci. Int. Genet. 2013; 7: 296-304
        • Bright J.A.
        • Curran J.M.
        Investigation into stutter ratio variability between different laboratories.
        Forensic Sci. Int. Genet. 2014; 13: 79-81
        • Kelly H.
        • Bright J.A.
        • Buckleton J.S.
        • Curran J.M.
        Identifying and modelling the drivers of stutter in forensic DNA profiles.
        Aust. J. Forensic Sci. 2014; 46: 194-203
        • Woerner A.E.
        • King J.L.
        • Budowle B.
        Compound stutter in D2S1338 and D12S391.
        Forensic Sci. Int. Genet. 2019; 39: 50-56
        • Li R.
        • Wu R.
        • Li H.
        • Zhang Y.
        • Peng D.
        • Wang N.
        • et al.
        Characterizing stutter variants in forensic STRs with massively parallel sequencing.
        Forensic Sci. Int Genet. 2020; 45102225
        • Dupuy B.M.
        • Stenersen M.
        • Lu T.T.
        • Olaisen B.
        Geographical heterogeneity of Y-chromosomal lineages in Norway.
        Forensic Sci. Int. 2006; 164: 10-19
      3. Verogen. ForenSeq DNA Signature Prep Reference Guide. 2018;42.

      4. Thermo Fisher Scientific I. Qubit 4 Fluorometer User Guide. 2018;72.

      5. Illumina. MiSeq FGx Instrument Reference Guide. 2015;84.

      6. Andrews,Simon. Babraham Bioinformatics - FastQC A Quality Control tool for High Throughput Sequence Data [Internet]. [cited 2021 Aug 2]. Available from: 〈https://www.bioinformatics.babraham.ac.uk/projects/fastqc/〉.

        • Ewels P.
        • Magnusson M.
        • Lundin S.
        • Käller M.
        MultiQC: summarize analysis results for multiple tools and samples in a single report.
        Bioinformatics. 2016; 32: 3047-3048
        • Woerner A.E.
        • King J.L.
        • Budowle B.
        Fast STR allele identification with STRait Razor 3.0.
        Forensic Sci. Int Genet. 2017; Sep; 30: 18-23
      7. Mitchell, R., Standage, D. lusSTR [Internet]. bioforensics; 2021 [cited 2021 Aug 2]. Available from: 〈https://github.com/bioforensics/lusSTR〉.

      8. R Core Team (2021). R: A language and environment for statistical computing. [Internet]. Vienna, Austria: R Foundation for Statistical Computing; Available from: URL 〈https://www.R-project.org/〉.

        • Sharma V.
        • Chow H.Y.
        • Siegel D.
        • Wurmbach E.
        Qualitative and quantitative assessment of Illumina’s forensic STR and SNP kits on MiSeq FGxTM..
        in: Kalendar R. PLOS One. 12. 2017
      9. Fox EJ, Reid-Bayliss KS. Accuracy of Next Generation Sequencing Platforms. J Gener Seq Appl [Internet]. 2014 [cited 2021 Jun 4];01(01). Available from: 〈https://www.omicsonline.org/open-access/accuracy-of-next-generation-sequencing-platforms-jngsa.1000106.php?aid=28132〉.

        • Clayton T.M.
        • Whitaker J.P.
        • Sparkes R.
        • Gill P.
        Analysis and interpretation of mixed forensic stains using DNA STR profiling.
        Forensic Sci. Int. 1998; 91: 55-70
        • Ferrari S.
        • Cribari-Neto F.
        Beta regression for modelling rates and proportions.
        J. Appl. Stat. 2004; 31: 799-815
        • Kieschnick R.
        • McCullough B.D.
        Regression analysis of variates observed on (0, 1): percentages, proportions and fractions.
        Stat. Model. 2003; 3: 193-213
        • Bayer F.M.
        • Cribari-Neto F.
        Model selection criteria in beta regression with varying dispersion.
        Commun. Stat. Simul. Comput. 2017; 46 (729:46)
      10. Thomopoulos NT. Statistical Distributions [Internet]. Cham: Springer International Publishing; 2017 [cited 2021 Aug 19]. Available from: 〈http://link.springer.com/10.1007/978–3-319–65112-5〉.

      11. Seefeld K., Ed M., Linder E. Statistics Using R with Biological Examples [Internet]. Durham, NH: University of New Hampshire, Department of Mathematics & Statistics; 2007. Available from: 〈https://cran.r-project.org/doc/contrib/Seefeld_StatsRBio.pdf〉.

        • Cribari-Neto F.
        • Zeileis A.
        Beta regression in R.
        J. Stat. Softw. 2021; 34 ([Internet]. 2010 [cited 2021 Apr 24];34(2))
        • Butler J.
        • Hill C.
        Biology and genetics of new autosomal STR loci useful for forensic DNA analysis.
        Forensic Sci. Rev. 2012; 24: 15-26
        • Gelman A.
        • Carlin B.J.
        • Stern S.H.
        • Dunson B.D.
        • Vehtari A.
        • Rubin B.D.
        Bayesian Data Analysis. third ed. CRC Press, Boca Raton2014: 639
      12. Shmueli G. To Explain or to Predict? Stat Sci [Internet]. 2010 Aug 1 [cited 2022 Jan 20];25(3). Available from: 〈https://projecteuclid.org/journals/statistical-science/volume-25/issue-3/To-Explain-or-to-Predict/10.1214/10-STS330.full〉.

        • Zeng X.
        • King J.L.
        • Budowle B.
        Investigation of the STR loci noise distributions of PowerSeqTM auto system.
        Croat. Med J. 2017; 58: 214-221
        • Riman S.
        • Iyer H.
        • Borsuk L.A.
        • Vallone P.M.
        Understanding the characteristics of sequence-based single-source DNA profiles.
        Forensic Sci. Int. Genet. 2020; 44102192
        • Phillips C.
        • Fernandez-Formoso L.
        • Garcia-Magariños M.
        • Porras L.
        • Tvedebrink T.
        • Amigo J.
        • et al.
        Analysis of global variability in 15 established and 5 new European Standard Set (ESS) STRs using the CEPH human genome diversity panel.
        Forensic Sci. Int. Genet. 2011; 5: 155-169
        • Just R.S.
        • Irwin J.A.
        Use of the LUS in sequence allele designations to facilitate probabilistic genotyping of NGS-based STR typing results.
        Forensic Sci. Int. Genet. 2018; 34: 197-205
        • Woerner A.E.
        • Mandape S.
        • King J.L.
        • Muenzler M.
        • Crysup B.
        • Budowle B.
        Reducing noise and stutter in short tandem repeat loci with unique molecular identifiers.
        Forensic Sci. Int. Genet. 2021; 51102459
        • Woerner A.
        • King J.
        • Budowle B.
        Flanking variation influences rates of stutter in simple repeats.
        Genes. 2017; 8: 329
        • Bleka Ø.
        • Just R.
        • Le J.
        • Gill P.
        An examination of STR nomenclatures, filters and models for MPS mixture interpretation.
        Forensic Sci. Int. Genet. 2020; 48102319
      13. Gill P., Bleka Ø, Benschop C., Haned H. Forensic Practitioner’s Guide to the Interpretation of Complex DNA Profiles [Internet]. First. Elsevier; 2020 [cited 2021 Nov 17]. Available from: 〈https://linkinghub.elsevier.com/retrieve/pii/C20190012332〉.

        • Bleka Ø.
        • Storvik G.
        • Gill P.
        EuroForMix: An open source software based on a continuous model to evaluate STR DNA profiles from a mixture of contributors with artefacts.
        Forensic Sci. Int. Genet. 2016; 21: 35-44
        • Gill P.
        • Bleka Ø
        • Hansson O.
        • Benschop C.
        • Haned H.
        Chapter 7 - The quantitative (continuous) model theory.
        Forensic Practitioner’s Guide to the Interpretation of Complex DNA Profiles. London: Academic Press, 2020: 181-238https://doi.org/10.1016/B978–0-12–820562-4.00010–9
        • Cheng K.
        • Lin M.
        • Moreno L.
        • Skillman J.
        • Hickey S.
        • Cuenca D.
        • et al.
        Modeling allelic analyte signals for aSTRs in NGS DNA profiles.
        J. Forensic Sci. 2021; 66: 1234-1245