- A machine learning pipeline for analyzing UMI-barcoded reads was created.
- On low-template data, single-source allele calling was improved by 13% (PR AUC).
- Allele calling for a 100 pg balanced mixture was improved by 10%.
PCR artifacts are an ever-present challenge in sequencing applications. These artifacts can seriously limit the analysis and interpretation of low-template samples and mixtures, especially with respect to a minor contributor. In medicine, molecular barcoding techniques have been employed to decrease the impact of PCR error and to allow the examination of low-abundance somatic variation. In principle, it should be possible to apply the same techniques to the forensic analysis of mixtures. To that end, several short tandem repeat loci were selected for targeted sequencing, and a bioinformatic pipeline for analyzing the sequence data was developed. The pipeline records the unique molecular identifiers (UMIs) attached to each read and, using machine learning, filters the noise products out of the set of potential alleles. To evaluate this pipeline, DNA from pairs of individuals was mixed at different ratios (1:1, 1:9) and sequenced with different starting amounts of DNA (10, 1 and 0.1 ng). Even naïve use of the information in the molecular barcodes improved performance, and the machine learning step provided an additional benefit. In concrete terms, using the UMI data results in less noise for a given amount of drop-out. For instance, if thresholds are selected that filter out a quarter of the true alleles, using read counts accepts 2381 noise alleles and using raw UMI counts accepts 1726 noise alleles, while the machine learning approach accepts only 307.
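The comparison in the abstract, noise alleles accepted at a fixed level of drop-out, can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names `count_support` and `noise_accepted_at_dropout`, the `(locus, allele, umi)` read representation, and the toy data are all assumptions made for illustration, and the machine learning step is not reproduced. The first function tallies reads and distinct UMIs per candidate allele; the second picks a threshold that discards at most a given fraction of true alleles and counts the noise alleles that still pass.

```python
import math
from collections import defaultdict


def count_support(reads):
    """Tally reads and distinct UMIs per candidate allele.

    `reads` is an iterable of (locus, allele, umi) tuples (a hypothetical
    representation). Returns {(locus, allele): (read_count, umi_count)}.
    """
    read_counts = defaultdict(int)
    umis = defaultdict(set)
    for locus, allele, umi in reads:
        key = (locus, allele)
        read_counts[key] += 1
        umis[key].add(umi)
    return {key: (read_counts[key], len(umis[key])) for key in read_counts}


def noise_accepted_at_dropout(scores, is_true, dropout=0.25):
    """Count noise alleles accepted when at most `dropout` of the true
    alleles are filtered out.

    `scores` can be read counts, UMI counts, or classifier scores;
    `is_true` flags which candidates are true alleles.
    """
    true_scores = sorted((s for s, t in zip(scores, is_true) if t), reverse=True)
    if not true_scores:
        raise ValueError("at least one true allele is required")
    # Number of true alleles that must survive the threshold.
    keep = max(1, math.ceil((1.0 - dropout) * len(true_scores)))
    threshold = true_scores[keep - 1]
    return sum(1 for s, t in zip(scores, is_true) if not t and s >= threshold)


# Toy data (made up): one true allele and one artifact that shares a UMI.
toy_reads = [
    ("LOCUS_A", "allele_15", "AAAC"),
    ("LOCUS_A", "allele_15", "AAAC"),
    ("LOCUS_A", "allele_15", "GGTA"),
    ("LOCUS_A", "allele_15", "CTTG"),
    ("LOCUS_A", "allele_14", "AAAC"),
    ("LOCUS_A", "allele_14", "AAAC"),
    ("LOCUS_A", "allele_14", "AAAC"),
]
support = count_support(toy_reads)

# Toy scores (made up): four true alleles followed by three noise alleles.
toy_scores = [9, 7, 5, 3, 6, 2, 1]
toy_labels = [True, True, True, True, False, False, False]
noise_kept = noise_accepted_at_dropout(toy_scores, toy_labels, dropout=0.25)
```

In the toy data, the artifact has nearly as many reads as the true allele but only one supporting UMI, which is the kind of separation that UMI counts are meant to expose. Running `noise_accepted_at_dropout` on read counts, raw UMI counts, and classifier scores in turn would give the three-way comparison the abstract reports.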
Published online: November 22, 2022
Accepted: November 18, 2022
Received in revised form: October 20, 2022
Received: June 14, 2022
© 2022 Elsevier B.V. All rights reserved.