1. Introduction
Probabilistic genotyping software such as EuroForMix [
[1]- Bleka Ø.
- Storvik G.
- Gill P.
EuroForMix: an open source software based on a continuous model to evaluate STR DNA profiles from a mixture of contributors with artefacts.
] (see Coble et al. [
[2]Probabilistic genotyping software: an overview.
] for an overview of different software) is important for the interpretation of mixtures of DNA from multiple contributors. However, the models in existing software packages were all designed for data generated with capillary electrophoresis (CE) based technology. During the last decade, massively parallel sequencing (MPS) has gained attention and is already viewed by the forensic community as a viable alternative to CE [
3- Bruijns B.
- Tiggelaar R.
- Gardeniers H.
Massively parallel sequencing techniques for forensics: a review.
,
4The first GHEP-ISFG collaborative exercise on forensic applications of massively parallel sequencing.
,
5Current state-of-art of STR sequencing in forensic genetics.
]. Whereas CE provides information based on the number of short tandem repeats (STR), accompanied by a peak height (RFU) measurement, MPS also provides the actual underlying sequence of the repeats, which is accompanied by the number of reads (read depth). For instance, a length-based homozygote genotype of ‘10’/‘10’ could be composed of two different sequences that are isometric but heterozygous, e.g., ‘[ATCG]10’/‘[ATCG]6 ATCT [ATCG]3’. The detection of increased variability via sequencing, along with the potential to examine many more autosomal markers, leads to increased discriminatory power compared to the traditional CE approach.
However, if the DNA amount is low the discriminatory power may not necessarily be increased; in this scenario it would be more likely to observe an allele of a homozygous genotype than both alleles of a heterozygous genotype, especially if a high analytical threshold is used [
[6]Application of a probabilistic genotyping software to MPS mixture STR data is supported by similar trends in LRs compared with CE data.
]. Hence to leverage the potential for increased discriminatory power with MPS, the analytical threshold must remain low. Yet, use of a low analytical threshold may increase the challenges of STR mixture interpretation – first, by the inclusion of more sequence errors (typically observed with MPS technology), and secondly, the alleles from a major contributor may produce stutter artifacts in the range of a minor donor, making it hard to distinguish between the donor alleles and stutter [
[7]Massively parallel sequencing of short tandem repeats—Population data and mixture analysis results for the PowerSeq™ system.
]. This latter issue is circumvented by developing suitable probabilistic genotyping models, where there is no need to definitively distinguish stutters and alleles [
8- Taylor D.
- Bright J.A.
- Buckleton J.
The interpretation of single source and mixed DNA profiles.
,
9- Cowell R.G.
- Graversen T.
- Lauritzen S.L.
- Mortera J.
Analysis of forensic DNA mixtures with artefacts.
]. There are also several additional challenges with MPS data, including relatively high imbalance both between and within markers [
[10]- Hussing C.
- Huber C.
- Bytyci R.
- Mogensen H.S.
- Morling N.
- Børsting C.
Sequencing of 231 forensic genetic markers using the MiSeq FGx™ forensic genomics system – an evaluation of the assay and software.
], and increased stutter sizes and multiple stutter types – some of which are not observed with CE [
[11]Characterizing stutter variants in forensic STRs with massively parallel sequencing.
]. For instance, the ‘n0’ stutter is composed of both backward and forward stutters in the same sequence strand, hence there is no net change in repeat length.
Vilsen et al. [
[12]- Vilsen S.B.
- Tvedebrink T.
- Eriksen P.S.
- Hussing C.
- Børsting C.
- Morling N.
Modelling allelic drop-outs in STR sequencing data generated by MPS.
] established negative binomial models for read depth because reads are discrete counts. In contrast, EuroForMix is based on the gamma model, which assumes a continuous outcome. Furthermore, Vilsen et al. [12] suggested that parameters for inter-locus balance and stutter models could be estimated from a calibration dataset before attempting to interpret complex mixtures. As part of the stutter model, Vilsen et al. [
[13]Stutter analysis of complex STR MPS data.
] used the concept of block length of missing motif (BLMM) to describe stutter size. For instance, if an allele is ‘[ATCG]6 ATCT [ATCG]3’ and the stutter product ‘[ATCG]5 ATCT [ATCG]3’ is produced from it, then the BLMM value of the stutter product is 6, which equals the longest uninterrupted stretch (LUS) value. If instead the stutter product ‘[ATCG]6 ATCT [ATCG]2’ were produced, the BLMM value would be 3, whereas the LUS value would still be 6. It is well established that the expected stutter proportion increases with the value of BLMM [
11Characterizing stutter variants in forensic STRs with massively parallel sequencing.
,
13Stutter analysis of complex STR MPS data.
,
14Modeling allelic analyte signals for aSTRs in NGS DNA profiles.
,
15- Agudo M.M.
- Aanes H.
- Roseth A.
- Albert M.
- Gill P.
- Bleka Ø.
A comprehensive characterization of MPS-STR stutter artefacts.
].
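The BLMM and LUS values in the example above can be reproduced with a small parser for the bracket format. The following is a minimal Python sketch and is not part of MPSproto or lusSTR. The function names `parse_bracket`, `lus` and `blmm` are hypothetical, and the sketch assumes that the parent allele and the stutter product share the same block structure and differ by exactly one motif in a single block.

```python
import re

def parse_bracket(seq):
    """Parse '[ATCG]6 ATCT [ATCG]3' into [('ATCG', 6), ('ATCT', 1), ('ATCG', 3)]."""
    blocks = []
    for token in seq.split():
        match = re.fullmatch(r"\[([ACGT]+)\](\d+)", token)
        if match:
            blocks.append((match.group(1), int(match.group(2))))
        else:
            # A bare motif such as 'ATCT' is treated as a block of length 1.
            blocks.append((token, 1))
    return blocks

def lus(seq):
    """Longest uninterrupted stretch, taken here as the largest block count."""
    return max(count for _, count in parse_bracket(seq))

def blmm(parent, stutter):
    """Length, in the parent, of the block that lost one motif in the stutter."""
    p, s = parse_bracket(parent), parse_bracket(stutter)
    if len(p) != len(s):
        raise ValueError("block structures differ; not handled in this sketch")
    changed = [(pb, sb) for pb, sb in zip(p, s) if pb != sb]
    if len(changed) != 1:
        raise ValueError("expected exactly one changed block")
    (motif_p, n_p), (motif_s, n_s) = changed[0]
    if motif_p != motif_s or n_p - n_s != 1:
        raise ValueError("expected the loss of a single motif")
    return n_p

parent = "[ATCG]6 ATCT [ATCG]3"
print(blmm(parent, "[ATCG]5 ATCT [ATCG]3"), lus(parent))  # BLMM 6, LUS 6
print(blmm(parent, "[ATCG]6 ATCT [ATCG]2"), lus(parent))  # BLMM 3, LUS 6
```

In this simplified reading the LUS is a property of the parent allele alone, whereas the BLMM depends on which block the stutter product lost its motif from.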
EuroForMix (from version 1.11.3) supports data given in “LUS format” [
16Use of the LUS in sequence allele designations to facilitate probabilistic genotyping of NGS-based STR typing results.
,
17- Bleka Ø.
- Just R.
- Le J.
- Gill P.
An examination of STR nomenclatures, filters and models for MPS mixture interpretation.
]. Here sequences are converted to a format of type “x_y”, where x is the regular CE designation based on the number of repeats and y is the length of the longest uninterrupted stretch (LUS). EuroForMix uses the LUS information in this format to identify backward or forward stutters. This improves the discriminatory power when DNA quantities are high. However, for minor contributors evaluated in a prior study, the performance was similar to that achieved with ordinary CE nomenclature [17]. In that study, which was based on ForenSeq DNA Signature Prep Kit (Verogen, Inc.) data generated using the manufacturer’s protocol, the largest influence on performance came from how the data were pre-filtered: a static analytical threshold with no stutter prefiltering performed best, since potential alleles from minor contributors were not removed. That study also used an analytical threshold of 30 reads, since EuroForMix had not been adapted to work with lower values. This motivates the development of a model that both lowers the analytical threshold and applies a probabilistic model to evaluate stutters and potential sequence errors.
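To make the “x_y” designation concrete, the sketch below converts a bracketed sequence into that form. It is an illustration only, not the conversion used by EuroForMix or lusSTR. The function name `to_lus_format` is hypothetical, and the sketch assumes that every block is a complete repeat unit (so that the total block count equals the CE designation) and that the LUS can be taken as the largest block count. Partial repeats, flanking sequence and marker-specific reference motifs are not handled.

```python
import re

def to_lus_format(bracketed):
    """Convert e.g. '[ATCG]6 ATCT [ATCG]3' into the 'x_y' form '10_6'."""
    total = 0    # x: number of repeat units (length-based CE designation)
    longest = 0  # y: longest uninterrupted stretch (largest block count)
    for token in bracketed.split():
        match = re.fullmatch(r"\[([ACGT]+)\](\d+)", token)
        count = int(match.group(2)) if match else 1  # bare motif counts once
        total += count
        longest = max(longest, count)
    return f"{total}_{longest}"

# Two sequences with the same length-based designation '10'
print(to_lus_format("[ATCG]10"))              # 10_10
print(to_lus_format("[ATCG]6 ATCT [ATCG]3"))  # 10_6
```

The two alleles share the CE designation 10 but receive different LUS designations, which is the extra resolution the LUS format provides. Sequences that also share the same LUS would still be indistinguishable in this format.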
Multiple types of stutters have been observed for MPS-STRs: n-1, n-2, n+1, n0, and n-1 from a region other than the LUS [11,15]. The different stutter types are easy to recognize if a bracketed format is used to represent the sequence string [15]. The bracket format facilitates analyst visualization since the compact form is more easily read and interpreted. Recently, the lusSTR tool was developed to automatically convert sequences into this format [
[18]R. Mitchell, D. Standage, lusSTR. Bioforensics, 2021. (Online). https://github.com/bioforensics/lusSTR.
]. In our previous study [15] we used the bracket format to identify many of the aforementioned stutter types, and we quantified stutter proportions using a beta regression with BLMM as an explanatory variable. In the present work we continue to use the bracket format in an adapted model for MPS-STRs and show how a calibrated stutter model, similar to that in [15], can be directly integrated into probabilistic genotyping.
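The way the bracket format exposes the stutter types listed above can be sketched as a comparison of block counts between a parent allele and a candidate sequence. This is a simplified illustration rather than the classification implemented in MPSproto. The function names are hypothetical, the parent and candidate are assumed to have the same motifs in the same order, and the LUS block is taken to be any block with the largest count.

```python
import re

def parse_bracket(seq):
    """Parse '[ATCG]6 ATCT [ATCG]3' into [('ATCG', 6), ('ATCT', 1), ('ATCG', 3)]."""
    blocks = []
    for token in seq.split():
        match = re.fullmatch(r"\[([ACGT]+)\](\d+)", token)
        blocks.append((match.group(1), int(match.group(2))) if match else (token, 1))
    return blocks

def classify_stutter(parent, candidate):
    """Label the candidate relative to the parent using per-block count changes."""
    p, c = parse_bracket(parent), parse_bracket(candidate)
    if [motif for motif, _ in p] != [motif for motif, _ in c]:
        return "other"  # different block structure: outside this sketch
    changes = [(i, cn - pn)
               for i, ((_, pn), (_, cn)) in enumerate(zip(p, c)) if cn != pn]
    deltas = sorted(delta for _, delta in changes)
    lus_length = max(count for _, count in p)
    if deltas == [-1]:
        block_index = changes[0][0]
        return "n-1 (LUS)" if p[block_index][1] == lus_length else "n-1 (non-LUS)"
    if deltas == [-2]:
        return "n-2"
    if deltas == [1]:
        return "n+1"
    if deltas == [-1, 1]:
        return "n0"  # one block loses a motif, another gains one: no net change
    return "other"

parent = "[ATCG]6 ATCT [ATCG]3"
for candidate in ["[ATCG]5 ATCT [ATCG]3", "[ATCG]6 ATCT [ATCG]2",
                  "[ATCG]4 ATCT [ATCG]3", "[ATCG]7 ATCT [ATCG]3",
                  "[ATCG]5 ATCT [ATCG]4"]:
    print(candidate, "->", classify_stutter(parent, candidate))
```

A length-based comparison would place the n0 product at the same position as its parent, and would not separate the two n-1 products, which is why the sequence-level view is needed for these types.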
This paper is structured as follows: The Materials and methods section introduces the dataset used for model calibration and mixture evaluation. Subsequently, the mathematical details of MPSproto are presented, and we describe how calibration of the model was carried out. Then a mixture evaluation study is described, in which several kinds of models were compared, implemented as part of either MPSproto or EuroForMix. The Results section then describes the outcomes of the comparison study, including model performance and goodness of fit.
4. Discussion
4.1 Summary and overall findings
This study describes the development and testing of MPSproto, a new probabilistic genotyping software for the evaluation of MPS-STR mixtures. MPSproto is an extension of EuroForMix, and was inspired by the work of Vilsen et al. [12], who included marker amplification efficiency parameters to handle the large inter-locus imbalance that has been observed for MPS-STRs. MPSproto was previously described and applied to an illustrative example in the discussion of Agudo et al. [15], who focused on developing a framework to evaluate complex sequence stutters, describing their sizes using the bracket format nomenclature. The software includes two distribution options for the sequence read depths: the gamma (GA) model, which is continuous, and the negative binomial (NB) model, which is discrete.
In a departure from the methods used by EuroForMix, MPSproto requires calibration prior to implementation and evaluation of questioned DNA profiles. The calibration dataset must consist of single-source profiles from donors whose alleles are known, and ideally should not include any degraded profiles, since the purpose of calibration is to characterize inter-locus variation. To simplify MPSproto calibration for users, the calibrateModel function in the software performs all necessary calibration steps.
To evaluate the performance of the two MPSproto models compared to EuroForMix, the analysis described by Bleka et al. [17] was revisited for EuroForMix interpretation of 60 autosomal STR mixtures sequenced using the ForenSeq DNA Signature Prep Kit. Different analytical thresholds were applied: T = 11 for MPSproto versus T = 30 for EuroForMix, since the latter was the threshold used in [17] and EuroForMix has not been adapted to work with lower values. We also investigated the effect of reducing the threshold applied to EuroForMix to T = 20 and T = 11.
With this dataset, the MPSproto GA model performed considerably better than EuroForMix (T = 30), mainly because the lower analytical threshold applied to the former resulted in fewer dropouts. Lowering the EuroForMix threshold reduced the differences, especially when T = 20 was applied. However, the model fit for EuroForMix then worsened, since the lower threshold also increased the number of artefacts that could not be properly modelled.
Overall, the GA model performed better than the NB model, since higher LRs for true contributors were obtained for the former. Additionally, the former model obtained a higher true positive rate for false positive rate values up to approximately 0.1. There were some situations where the GA model far outperformed the NB model. For example, when Ref2 was interrogated as a contributor to mixture 2P_0.75ng_20–1, the LR resulting from use of the GA model was log10LR= 10.7 versus log10LR= 4.0 for the NB model. With this mixture, Ref2 was a minor contributor with 11 dropouts, and the LR difference between the two models was largest for the D2S441 marker where Ref2 had one allele dropout. In the evaluation of this mixture, a lower penalty for dropout was applied when the GA model was used compared to that applied with the NB model. With further analysis we estimated the dropout probability of the Ref2 allele to be approximately 21 % for the GA model and only 2 % for the NB model (see
Section 4.2 for more details). This indicates that the GA model is less strict and would be preferable to evaluate minor contributors compared to the NB model when an analytical threshold of 11 reads is applied (see
Section 4.2 for further discussion of thresholds). However, the consequence is that non-contributors would be “excluded” to a lesser degree (albeit with low probative LRs).
Findings regarding model performance in this study do not necessarily support the suggestion by Vilsen et al. [12] that the NB distribution provides a better model than GA when DNA quantities are low: the GA model seems to provide a higher dropout probability than the NB model. However, our conclusion in this respect could be different if lower analytical thresholds were applied to the mixture dataset; indeed, the effect of applying alternative analytical thresholds was not explored here. Regardless, there may be situations where it could be better to use the NB model, for instance if the model fit using GA was poor and the model fit diagnostics indicated that NB was better. For this reason, it may be worthwhile to compute the LR for both models when minor contributors are evaluated as the person of interest (POI), followed by an examination of the model diagnostics to select the best-fitting model.
In the present study, LR values for non-contributors mostly did not exceed approximately 10, regardless of the program or model employed for interpretation (there were three observations above LR=10 for the NB model). All LR values greater than 1 were from individuals unrelated to the mixture donors (no check of relatedness amongst non-contributors was performed). Estimating the number of contributors to an unknown sample can be challenging and can impact interpretation results – typically in the form of false negatives for true donors when the contributor number is underestimated, and an increased number of false positives for non-contributors when the contributor number is overestimated [
[6]Application of a probabilistic genotyping software to MPS mixture STR data is supported by similar trends in LRs compared with CE data.
]. In this study the ground truth number of contributors to the constructed mixtures was used for all evaluations (MPSproto and EuroForMix); with donor DNA inputs as low as 2 pg (see
Supplemental Table S2), the contributor number used may therefore have been higher than the apparent number of contributors due to high levels of allelic dropout.
4.2 Model considerations
4.2.1 Stutters
As with EuroForMix, MPSproto uses stutter proportions instead of stutter ratios. The difference between the programs is in the way that stutter proportions are assigned: EuroForMix estimates stutter proportion parameter(s) when the questioned profile is evaluated, whereas MPSproto requires a calibration of the parameter in advance using a single-source dataset. The calibration of the stutter model for MPSproto is somewhat challenging, since it requires 1) the donor alleles to be known, 2) heterozygous allele pairs with sufficient separation (to not mask stutter effects), and 3) a sufficient number of stutter products (per sequence and across different alleles).
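Rather than modelling stutter proportions directly, it may also be possible to take a fitted linear model of a stutter ratio, $SR = y_S / y_P$ (the read depth $y_S$ of the stutter product divided by the read depth $y_P$ of its parent allele), and convert it to a stutter proportion, $SP = y_S / (y_S + y_P)$. Vilsen et al. [13] suggested a model that is linear in BLMM, $E[SR] = \beta_0 + \beta_1 \cdot \mathrm{BLMM}$, for the expected stutter ratio. Conversion back to the expected stutter proportion could be accomplished using the transformation $E[SP] = E[SR] / (1 + E[SR])$.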
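The conversion between the two stutter measures can be sketched in a few lines of Python. The sketch assumes the usual definitions, namely that the stutter ratio is the stutter read depth divided by the parent read depth and that the stutter proportion is the stutter read depth divided by the sum of the two. The function names and the regression coefficients in the usage lines are hypothetical and are not fitted values from this or any other study.

```python
def ratio_to_proportion(sr):
    """SP = SR / (1 + SR), with SR = stutter / parent."""
    if sr < 0:
        raise ValueError("a stutter ratio cannot be negative")
    return sr / (1.0 + sr)

def proportion_to_ratio(sp):
    """Inverse mapping: SR = SP / (1 - SP), defined for 0 <= SP < 1."""
    if not 0 <= sp < 1:
        raise ValueError("a stutter proportion must lie in [0, 1)")
    return sp / (1.0 - sp)

def expected_proportion_from_linear_ratio(blmm, intercept, slope):
    """Plug a linear-in-BLMM expected ratio into the ratio-to-proportion map."""
    expected_ratio = max(intercept + slope * blmm, 0.0)  # clip at zero
    return ratio_to_proportion(expected_ratio)

# Hypothetical coefficients, for illustration only
for blmm_value in (3, 6, 12):
    print(blmm_value, expected_proportion_from_linear_ratio(blmm_value, -0.01, 0.01))
```

Because the mapping is non-linear, applying it to the expected ratio gives an approximation to the expected proportion rather than an exact identity. The approximation is closest when the ratios are small and vary little around their mean.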
4.2.2 Noise
As with stutter, MPSproto requires calibration of the noise parameters in advance using a single-source dataset. MPSproto models the number of noise sequences per marker using a geometric distribution, and the noise read depth using a discretized Pareto distribution. These models provided a good fit for the calibration dataset used in this study. The Pareto distribution is a useful choice because its heavier tails can accommodate higher noise levels than, for instance, the exponential distribution.
Importantly, as the noise parameter developed during calibration is dependent on the analytical threshold selected, the analytical threshold used for evaluation should be the same as that used for the calibration.
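The structure of such a noise model, and its dependence on the analytical threshold, can be sketched as follows. This is an illustration under stated assumptions and not the MPSproto implementation: the geometric count is assumed to start at zero, and the discretized Pareto is assumed to be obtained by placing the continuous Pareto scale at the analytical threshold T and taking differences of its survival function at integer read depths. The function and parameter names are hypothetical.

```python
def noise_count_pmf(n, p):
    """Geometric probability of observing n = 0, 1, 2, ... noise sequences."""
    return (1.0 - p) ** n * p

def noise_depth_tail(y, threshold, shape):
    """P(Y >= y) for a Pareto law whose scale equals the analytical threshold."""
    if y <= threshold:
        return 1.0
    return (threshold / y) ** shape

def noise_depth_pmf(y, threshold, shape):
    """Discretized Pareto: P(Y = y) = P(Y >= y) - P(Y >= y + 1) for y >= T."""
    if y < threshold:
        return 0.0
    return noise_depth_tail(y, threshold, shape) - noise_depth_tail(y + 1, threshold, shape)

T = 11  # analytical threshold used in the text
for depth in (11, 20, 50, 200):
    print(depth, noise_depth_pmf(depth, T, shape=2.0))
```

Under this construction the threshold enters the read depth distribution directly, which illustrates why a calibration performed at one threshold cannot simply be reused at another. The tail probability decays polynomially in the read depth, whereas an exponential tail decays geometrically, which is the sense in which the Pareto tail is heavier.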
The noise model may be affected by the number of libraries multiplexed for sequencing. For example, a sequencing pool composed of 90 libraries is expected to produce fewer noise sequences, at lower read depths, than a pool of 30 libraries. Therefore, a calibration may be needed for each distinct protocol used for data generation. In the calibration dataset we observed that most of the noise sequences (98.5%) appeared to be errors of a single base compared to a parental allele sequence. Hence, there is potential to model the single-base error sequences separately from other types of noise, which may help remedy the above-mentioned issue.
Vilsen et al. [
[24]- Vilsen S.B.
- Tvedebrink T.
- Mogensen H.S.
- Morling N.
Modelling noise in second generation sequencing forensic genetics STR data using a one-inflated (zero-truncated) negative binomial model.
] proposed a one-inflated (zero-truncated) negative binomial distribution for the noise read depth, including all sequences down to singletons (i.e., analytical threshold T = 1). Our approach to handling noise sequences differs: we recommend avoiding an analytical threshold that is too low, because the MPSproto noise model as defined may then no longer fit. An alternative approach is to apply corrections: either of the type employed by FDSTools [
[25]- Hoogenboom J.
- van der Gaag K.J.
- de Leeuw R.H.
- Sijen T.
- de Knijff P.
- Laros J.F.J.
‘FDSTools: a software package for analysis of massively parallel sequencing data with the ability to recognise and correct STR stutter and other PCR or sequencing noise’.
], which associates noise with parental sequences and combines the read depths (see Benschop et al. [
[6]Application of a probabilistic genotyping software to MPS mixture STR data is supported by similar trends in LRs compared with CE data.
] for a comparison study), or a method to reduce the number of base-calling errors such as that introduced by Vilsen [
[26]S.B. Vilsen, Statistical Modelling of Massively Parallel Sequencing Data in Forensic Genetics, Aalborg University, Aalborg.
].
4.2.3 Thresholds and data filtering
Probabilistic genotyping (PG) systems that deal with stutters are well established for CE-based applications. However, interpretation of MPS-STR data lags behind because such systems are not yet available, and has accordingly relied on defining thresholds to facilitate interpretation. Experience with PG systems applied to CE has shown that evaluation of evidence based upon thresholds, where binary decisions are required to designate alleles, is always suboptimal [
[27]- Taylor D.
- Buckleton J.
- Bright J.-A.
Does the use of probabilistic genotyping change the way we should view sub-threshold data?.
]. This is because information is lost or thrown away. This in turn results in reports that either understate the value of the evidence when true alleles that support the proposition that the person of interest is a contributor ($H_p$) are removed (i.e., dropouts for the person of interest), or overstate the value of the evidence when true alleles that support the alternative proposition ($H_d$) are removed (i.e., dropouts for the unknown contributors). Given the increased complexity, it can be argued that the inherent advantages of MPS will never be properly realized without integration of PG systems for evidence evaluation. Use of a threshold intended to eliminate noise cannot be avoided with the current implementation of MPSproto, but steps have been taken to model noise, reducing the threshold to only 11 reads, which is substantially lower than that used in conventional systems.
4.2.4 Marker amplification efficiencies
On the basis of prior findings regarding variability in marker amplification efficiencies with the ForenSeq DNA Signature Prep Kit [
10- Hussing C.
- Huber C.
- Bytyci R.
- Mogensen H.S.
- Morling N.
- Børsting C.
Sequencing of 231 forensic genetic markers using the MiSeq FGx™ forensic genomics system – an evaluation of the assay and software.
,
14Modeling allelic analyte signals for aSTRs in NGS DNA profiles.
,
28Establishing STR and identity SNP analysis thresholds for reliable interpretation and practical implementation of MPS gDNA casework.
], MPSproto includes a function to adjust the marker amplification efficiency estimates towards those obtained from a questioned DNA profile by utilizing a prior. To enable this, standard deviations of the amplification efficiencies must be specified in addition to the mean (per marker), and the user can choose between normal distribution or log-normal distribution for the prior. We did not use this functionality in this study, although it would be worthwhile for future investigations, where the prior is, for example, determined from the calibration dataset.
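One way such a prior could enter the evaluation is as a penalty added to the log-likelihood of the questioned profile, so that the per-marker efficiencies are pulled towards their calibrated means. The sketch below illustrates that idea and is not the MPSproto implementation. All function names are hypothetical, and the log-normal variant assumes that the user-specified mean and standard deviation are on the natural scale and are converted by moment matching, which may differ from how the software interprets them.

```python
import math

def normal_log_prior(x, mean, sd):
    """Log density of a normal prior centred on the calibrated efficiency."""
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd * math.sqrt(2.0 * math.pi))

def lognormal_params(mean, sd):
    """Moment-match a log-normal to a natural-scale mean and standard deviation."""
    sigma2 = math.log(1.0 + (sd / mean) ** 2)
    return math.log(mean) - 0.5 * sigma2, math.sqrt(sigma2)

def lognormal_log_prior(x, mean, sd):
    """Log density of the moment-matched log-normal prior (zero mass at x <= 0)."""
    if x <= 0:
        return -math.inf
    mu_log, sigma = lognormal_params(mean, sd)
    z = (math.log(x) - mu_log) / sigma
    return -0.5 * z * z - math.log(x * sigma * math.sqrt(2.0 * math.pi))

def penalized_loglik(loglik, efficiencies, means, sds, log_prior=normal_log_prior):
    """Profile log-likelihood plus the summed log prior over all markers."""
    return loglik + sum(log_prior(a, m, s)
                        for a, m, s in zip(efficiencies, means, sds))

# Hypothetical values for two markers
print(penalized_loglik(-120.0, [1.10, 0.85], [1.14, 0.90], [0.2, 0.2]))
```

A small standard deviation keeps the efficiencies close to the calibrated values, while a large one lets the questioned profile dominate. The log-normal choice additionally rules out non-positive efficiencies.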
4.2.5 Degradation
Similar to EuroForMix, the MPSproto model also includes the possibility to consider a degradation model for evaluation of questioned DNA profiles. In this study, both the calibration and mixture datasets were constructed from samples that exhibited no degradation. When a degradation model was applied in MPSproto, the degradation slope estimates were all close to one (not shown), confirming the pristine state of the test samples. It is highly recommended that the calibration dataset exhibits little or no degradation, so that the estimated marker efficiencies are not affected before the full model is applied to questioned DNA profiles (which may be highly degraded).
EuroForMix was used with the degradation model turned on, since this explained more of the data than the model without degradation. The underlying reason is that the marker amplification efficiencies of samples in the mixture dataset decay with fragment length.
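The role of the degradation slope can be illustrated with an exponential decay in fragment length, in which a slope of one leaves the expected signal unchanged. The sketch below assumes a EuroForMix-style form in which the expected signal is multiplied by the slope raised to the power (fragment length - 125)/100. The offset and divisor are assumptions of this sketch and may not match the scaling used internally by either program, and the function name is hypothetical.

```python
def degradation_factor(fragment_length_bp, slope, offset_bp=125.0, scale_bp=100.0):
    """Multiplier on the expected signal for an allele of the given fragment length."""
    if not 0 < slope <= 1:
        raise ValueError("the degradation slope is expected to lie in (0, 1]")
    return slope ** ((fragment_length_bp - offset_bp) / scale_bp)

# With slope 1 there is no decay; with slope 0.8 longer fragments lose signal
for length in (125, 225, 325):
    print(length, degradation_factor(length, 1.0), degradation_factor(length, 0.8))
```

In a model without marker-specific efficiencies, a decay of this kind can absorb efficiency differences that happen to correlate with fragment length, which is consistent with the explanation given above for turning the degradation model on in EuroForMix.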
4.2.6 Data format
The format required for MPSproto interpretation of MPS-STR sequences does not necessitate the use of any specific alignment or data analysis software. Any program that can be used to perform marker identification and sequence allele calling from FASTQ files (such as STRaitRazor [
[29]STRait Razor: a length-based forensic STR allele-calling tool for use with second generation sequencing data.
], FDSTools [25], and STRinNGS [
30- Jønck C.G.
- Qian X.
- Simayijiang H.
- Børsting C.
STRinNGS v2.0: Improved tool for analysis and reporting of STR sequencing data.
,
31- Ganschow S.
- Silvery J.
- Kalinowski J.
- Tiemann C.
toaSTR: a web application for forensic STR genotyping by massively parallel sequencing.
]), along with bracketing of the sequences, could be used as part of a pipeline to produce the data for MPSproto input. In this paper, sequences were obtained from ForenSeq typing results using the ForenSeq UAS v1.3, and these were further converted into a bracket format (forward strand; using the lusSTR program [
18R. Mitchell, D. Standage, lusSTR. Bioforensics, 2021. (Online). https://github.com/bioforensics/lusSTR.
,
21Ø. Bleka, LUSstrR. 2022 (Online). https://github.com/oyvble/LUSstrR.
]) as recommended by an International Society for Forensic Genetics (ISFG) Commission [
[32]Massively parallel sequencing of forensic STRs: considerations of the DNA commission of the International Society for Forensic Genetics (ISFG) on minimal nomenclature requirements.
].
4.2.7 Model differences
We revisited the interpretation where reference Ref2 was compared to sample 2P_0.75ng_20‐1, since this was one of the examples where the two MPSproto models differed most (a difference of 6.7 in log10LR). The largest LR difference was observed for D2S441, for which the marker amplification efficiency was estimated as 1.14 (
Table 1). One of the heterozygous alleles of Ref2 dropped out.
Fig. 4 illustrates how the dropout estimate became larger for the GA model (dropout probability 21 %) than for the NB model (dropout probability 2 %), and hence was penalized less in the LR calculation. The reason for this is that the probability density function of GA is skewed towards zero whereas that of NB is not. The read depth distribution for the GA model is also wider than for the NB model, which means that the former also has a wider heterozygous balance distribution than the latter.
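The comparison of dropout probabilities can be sketched by treating dropout as the event that the read depth of an allele falls below the analytical threshold T. The code below is an illustration and not the MPSproto computation. It assumes that both distributions are specified through an expected read depth `mu` and a coefficient of variation `omega`, with gamma shape 1/omega^2 and scale mu*omega^2, and negative binomial size 1/(omega^2 - 1/mu). The fitted parameters differ between the two models in practice, so the values printed here are not expected to reproduce the 21 % and 2 % reported above.

```python
import math

def gamma_cdf(x, shape, scale):
    """Regularized lower incomplete gamma via its power series (moderate x/scale)."""
    if x <= 0:
        return 0.0
    z = x / scale
    term = 1.0 / shape
    total = term
    for n in range(1, 10000):
        term *= z / (shape + n)
        total += term
        if term < 1e-16 * total:
            break
    return min(1.0, total * math.exp(-z + shape * math.log(z) - math.lgamma(shape)))

def gamma_dropout(mu, omega, threshold):
    """P(Y < T) under a gamma model with mean mu and coefficient of variation omega."""
    return gamma_cdf(threshold, 1.0 / omega ** 2, mu * omega ** 2)

def nb_pmf(y, mu, size):
    """Negative binomial probability in the mean/size parameterization."""
    return math.exp(math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
                    + size * math.log(size / (size + mu))
                    + y * math.log(mu / (size + mu)))

def nb_dropout(mu, omega, threshold):
    """P(Y <= T - 1) under a negative binomial with the same mean and CV."""
    excess = omega ** 2 - 1.0 / mu
    if excess <= 0:
        raise ValueError("requires omega^2 > 1/mu (overdispersion)")
    size = 1.0 / excess
    return sum(nb_pmf(y, mu, size) for y in range(threshold))

# Hypothetical expected read depth and CV for a minor-contributor allele
print(gamma_dropout(40.0, 0.5, 11), nb_dropout(40.0, 0.5, 11))
```

Computing both quantities for the parameter estimates of a given mixture is one way to see how much of an LR difference between the two models is attributable to the dropout term.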
4.3 Current implementation and future work
As with EuroForMix, the current version of MPSproto (v0.8.1) includes functionality to analyse PCR replicates. Benschop et al. [
[6]Application of a probabilistic genotyping software to MPS mixture STR data is supported by similar trends in LRs compared with CE data.
] demonstrated that concurrent interpretation of PCR replicates in EuroForMix can produce improved results. We will examine the effects of using PCR replicates in a future study.
MPSproto can also be used to evaluate conventional CE-STR data (and accordingly, a CE module is included for the calibration step). Many stutter types are supported for this module: backward, forward, double-backward, double-forward, triple-backward and 2-bp stutters (half stutters). Here CE allele lengths are used to determine the BLMM. We will compare the performance of MPSproto and EuroForMix for the interpretation of CE data in a future study.
The current implementation of MPSproto does not support combination of CE-STR with MPS-STR data for simultaneous evaluation. For this to be possible the model would need to be extended so that typing results for each marker can be differentiated according to CE or MPS origin, as a given marker could have observations from either CE typing, MPS typing, or both. For instance, the ForenSeq DNA Signature Prep Kit types 22 autosomal STRs that are also typed using the PowerPlex® Fusion 6C System Kit (Promega Corporation).
EuroForMix has been used to interpret SNP typing results [
33- Bleka Ø.
- Eduardoff M.
- Santos C.
- Phillips C.
- Parson W.
- Gill P.
Open source software EuroForMix can be used to analyse complex SNP mixtures.
,
34Massively parallel sequencing analysis of nondegraded and degraded DNA mixtures using the ForenSeq™ system in combination with EuroForMix software.
] (though Yang et al. [
[35]DNA mixture interpretation using linear regression and neural networks on massively parallel sequencing data of single nucleotide polymorphisms.
] showed better performance with machine learning approaches). Accordingly, MPSproto can also be used to interpret SNP data. For this application, the MPSproto stutter model would be turned off and not defined, and the degradation model would not be used. However, calibration is still needed to define the marker amplification efficiency and noise parameters to be used. Provided STR and SNP markers are neither linked nor in linkage disequilibrium, the LRs for the two systems could be multiplied together. The performance of MPSproto with SNPs will be described in a future paper.
The current version of MPSproto does not have a graphical user interface (GUI). As most forensic practitioners are not used to running programs from the command line, a near-term priority is to develop a user-friendly GUI for both model calibration and profile interpretation. Additionally, as MPSproto requires specific formatting for the input sequence data, we aim to integrate the LUSstrR conversion tool [
[21]Ø. Bleka, LUSstrR. 2022 (Online). https://github.com/oyvble/LUSstrR.
] into MPSproto to ease translation of sequence strings into the proper format. Other MPS-STR assays such as the PowerSeq GY System (Promega Corporation), the Precision ID GlobalFiler NGS STR Panel (Thermo Fisher Scientific), or custom assays, could be used with MPSproto, provided that the sequences are converted to the proper format. However, to unlock the degradation model for these assays, the fragment lengths of common RU alleles must be defined in the MPSproto kit file; at present, only information about the ForenSeq DNA Signature Prep Kit is included.
Finally, there is no relationship testing module included in the current version of MPSproto, but we aim to include this in a future version.
Acknowledgements
This work was funded under Agreement No. HSHQDC-15-C-00064 awarded to Battelle National Biodefense Institute (BNBI) by the Department of Homeland Security (DHS) Science and Technology Directorate (S&T) for the management and operation of the National Biodefense Analysis and Countermeasures Center (NBACC), a Federally Funded Research and Development Center. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of DHS or the U.S. Government. DHS does not endorse any products or commercial services mentioned in this presentation. In no event shall DHS, BNBI or NBACC have any responsibility or liability for any use, misuse, inability to use, or reliance upon the information contained herein. In addition, no warranty of fitness for a particular purpose, merchantability, accuracy or adequacy is provided regarding the contents of this document. All research involving living individuals, their data, or their biospecimens was conducted in compliance with the Federal Policy for the Protection of Human Subjects (The Common Rule, codified for DHS as 6 CFR 46), DHS Management Directive 026-04, and any other applicable statutory requirements. Research involving human subjects has only been initiated after the following has occurred: the need for IRB review has been determined, IRB approval has been obtained as applicable, and DHS Compliance Assurance Program Office certification or concurrence has been issued. Notice: This manuscript has been authored by Battelle National Biodefense Institute, LLC under Contract No. HSHQDC-15-C-00064 with DHS. The US Government (USG) retains and the publisher, by accepting the article for publication, acknowledges that the USG retains a non-exclusive, paid up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for USG purposes.
Appendix A
Mathematical description of the distributions
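Gamma distribution (shape $\alpha$, scale $\beta$):
Continuous outcome: $y > 0$.
Probability density function: $f(y \mid \alpha, \beta) = \dfrac{y^{\alpha-1} e^{-y/\beta}}{\Gamma(\alpha)\,\beta^{\alpha}}$
Cumulative distribution function: $F(y \mid \alpha, \beta) = \dfrac{\gamma(\alpha, y/\beta)}{\Gamma(\alpha)}$, where $\gamma(\cdot,\cdot)$ is the lower incomplete gamma function.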
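Negative binomial distribution (re-parameterized version, with mean $\mu$ and size $\theta$):
Discrete outcome: $y \in \{0, 1, 2, \ldots\}$.
Probability mass function: $P(Y = y \mid \mu, \theta) = \dfrac{\Gamma(y+\theta)}{\Gamma(\theta)\, y!} \left(\dfrac{\theta}{\theta+\mu}\right)^{\theta} \left(\dfrac{\mu}{\theta+\mu}\right)^{y}$
Cumulative distribution function: $F(y \mid \mu, \theta) = \sum_{k=0}^{\lfloor y \rfloor} P(Y = k \mid \mu, \theta)$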
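Deriving the relation between size parameter and the coefficient of variation
We assume the following parameterization of the negative binomial distribution with mean $\mu$ and size $\theta$: $E[Y] = \mu$ and $\mathrm{Var}[Y] = \mu + \mu^2/\theta$.
Then the size parameter can be written as $\theta = \mu^2 / (\mathrm{Var}[Y] - \mu)$.
Since the coefficient of variation is $\omega = \sqrt{\mathrm{Var}[Y]}/\mu$, then $\theta = \mu / (\mu\omega^2 - 1)$.
Computational sidenote: Very large values of $\theta$ can cause computational issues, which can be solved by restricting it below a value $\theta_{\max}$. It can be derived that the restriction $\theta \le \theta_{\max}$ is equivalent to $\omega^2 \ge 1/\mu + 1/\theta_{\max}$. Hence smaller values of $\omega$ would require a larger value of $\theta_{\max}$. We used a fixed value of $\theta_{\max}$ in the C++ function, which prevents the maximum likelihood optimization from crashing when parallelized with OpenMP.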
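The relation between the size parameter and the coefficient of variation, together with the upper bound described in the sidenote, can be sketched as follows. The sketch assumes the mean/size parameterization of the negative binomial, in which the variance equals mu + mu^2/size. The function name `nb_size` and the bound passed in the usage lines are hypothetical, and the bound is not the value used in the MPSproto C++ code.

```python
def nb_size(mu, omega, size_max):
    """Size parameter implied by mean mu and CV omega, capped at size_max."""
    excess = omega ** 2 - 1.0 / mu  # equals 1/size when positive
    if excess <= 0:
        # Variance at or below the Poisson limit: the size would be infinite
        return size_max
    return min(1.0 / excess, size_max)

size = nb_size(40.0, 0.5, size_max=1000.0)
variance = 40.0 + 40.0 ** 2 / size
print(size, variance)  # the variance recovers (omega * mu)^2 = 400

# A small CV relative to 1/mu hits the cap
print(nb_size(40.0, 0.1, size_max=1000.0))
```

Capping the size is equivalent to enforcing a minimum amount of overdispersion relative to the Poisson limit, which keeps the likelihood evaluation numerically stable when the estimated coefficient of variation is small.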
Article info
Publication history
Published online: September 26, 2022
Accepted: September 23, 2022
Received in revised form: September 16, 2022
Received: June 6, 2022
Copyright
© 2022 The Author(s). Published by Elsevier B.V.