Nanopore sequencing of a forensic combined STR and SNP multiplex

Open Access. Published: October 28, 2021. DOI: https://doi.org/10.1016/j.fsigen.2021.102621

      Highlights

      • Nanopore sequencing of a ForenSeq DNA Signature Prep library was compared to Illumina sequencing.
      • Nanopore basecalling was performed using two state-of-the-art basecallers, Guppy and Bonito. All autosomal STR loci could be genotyped correctly with at least one of the two basecallers.
      • Alignment score filtering increased the signal-to-noise ratio.
      • STR iso-alleles could reliably be genotyped.
      • SNP genotyping was highly accurate.

      Abstract

      Nanopore sequencing for forensic purposes has gained attention, as it yields added discriminatory power compared to capillary electrophoresis (CE), without the need for a high up-front capital investment. Besides enabling the detection of iso-alleles, Massively Parallel Sequencing (MPS) facilitates the analysis of Short Tandem Repeats (STRs) and Single Nucleotide Polymorphisms (SNPs) in parallel. In this research, six single-contributor samples were amplified with such a combined multiplex of 58 STR and 94 SNP loci, followed by nanopore sequencing using an R10.3 flowcell. Basecalling was performed using two state-of-the-art basecallers, Guppy and Bonito. An advanced alignment-based analysis method was developed, which lowered the noise after alignment of the STR reads to a reference library. Although STR genotyping by nanopore sequencing is more challenging, correct genotyping was obtained for all autosomal and all but two non-autosomal STR loci. Moreover, genotyping of iso-alleles proved to be very accurate. SNP genotyping yielded an accuracy of 99% for both basecallers. The use of novel basecallers, in combination with the newly developed alignment-based analysis method, yields a markedly higher STR genotyping accuracy than previous studies.

      1. Introduction

      Massively Parallel Sequencing (MPS) technologies have become a well-established approach for forensic human identification [
      • Bruijns B.
      • Tiggelaar R.
      • Gardeniers H.
      Massively parallel sequencing techniques for forensics: a review.
      ]. The most commonly used markers for forensic DNA profiling are Short Tandem Repeats (STRs), which are short nucleotide sequences repeated multiple times in a head-to-tail fashion [
      • Butler J.M.
      Forensic DNA Typing: Biology, Technology, and Genetics of STR markers.
      ]. The repeat number for such loci varies between individuals, and can thus be used to generate a unique DNA fingerprint. STR sizing is realized mainly by performing a polymerase chain reaction (PCR) followed by capillary electrophoresis (CE) [
      • Butler J.
      • McCord B.
      • Jung J.
      • Allen R.
      Rapid analysis of the short tandem repeat HUMTH01 by capillary electrophoresis.
      ]. Although this generates highly informative profiles, sequence variants within the amplicons, such as iso-alleles and SNPs, are not detected by CE. MPS-based approaches yield additional discriminatory power for STR analysis, which is particularly useful for low-input samples. Moreover, MPS allows higher-order multiplexing, enabling analysis of STR loci in parallel with Single Nucleotide Polymorphism (SNP) loci. Compared with the STR amplicons, the amplicons targeting the SNP markers are much shorter, which is favorable for the analysis of highly degraded samples [
      • Børsting C.
      • Mogensen H.S.
      • Morling N.
      Forensic genetic SNP typing of low-template DNA and highly degraded DNA from crime case samples.
      ].
      Although validated forensic MPS strategies are commercially available, e.g. the Verogen technology [
      • Churchill J.D.
      • Schmedes S.E.
      • King J.L.
      • Budowle B.
      Evaluation of the Illumina® beta version ForenSeq™ DNA signature prep kit for use in genetic profiling.
      ], their widespread implementation in routine forensic practice is hampered by both the high up-front capital investment and the high reagent costs. The MinION device, an affordable long-read sequencer commercialized by Oxford Nanopore Technologies, has gained importance in the forensic field [
      • Plesivkova D.
      • Richards R.
      • Harbison S.
      A review of the potential of the MinION™ single-molecule sequencing system for forensic applications.
      ]. Moreover, the device is handheld, enabling on-site analysis of samples, which could be of great use for disaster victim identification, or for extremely urgent crime scene samples.
      Although the technology has improved considerably, nanopore sequencing still results in a higher level of sequencing error noise than Illumina sequencing [
      • Wick R.R.
      • Judd L.M.
      • Holt K.E.
      Performance of neural network basecalling tools for Oxford Nanopore sequencing.
      ]. Nevertheless, accurate data for sequencing of forensic bi- and multi-allelic SNPs were obtained by our group [
      • Cornelis S.
      • Gansemans Y.
      • Deleye L.
      • Deforce D.
      • Van Nieuwerburgh F.
      Forensic SNP genotyping using nanopore MinION sequencing.
      ,
      • Cornelis S.
      • Gansemans Y.
      • Vander Plaetsen A.-S.
      • Weymaere J.
      • Willems S.
      • Deforce D.
      • Van Nieuwerburgh F.
      Forensic tri-allelic SNP genotyping using nanopore sequencing.
      ]. Nanopore sequencing of forensic STRs proved to be more cumbersome [
      • Asogawa M.
      • Ohno A.
      • Nakagawa S.
      • Ochiai E.
      • Katahira Y.
      • Sudo M.
      • Osawa M.
      • Sugisawa M.
      • Imanishi T.
      Human short tandem repeat identification using a nanopore-based DNA sequencer: a pilot study.
      ,
      • Cornelis S.
      • Willems S.
      • Van Neste C.
      • Tytgat O.
      • Weymaere J.
      • Vander Plaetsen A.-S.
      • Deforce D.
      • Van Nieuwerburgh F.
      Forensic STR profiling using Oxford Nanopore Technologies’ MinION sequencer.
      ,
      • Tytgat O.
      • Gansemans Y.
      • Weymaere J.
      • Rubben K.
      • Deforce D.
      • Van Nieuwerburgh F.
      Nanopore sequencing of a forensic STR multiplex reveals Loci suitable for single-contributor STR profiling.
      ,
      • Ren Z.-L.
      • Zhang J.-R.
      • Zhang X.-M.
      • Liu X.
      • Lin Y.-F.
      • Bai H.
      • Wang M.-C.
      • Cheng F.
      • Liu J.-D.
      • Li P.
      Forensic nanopore sequencing of STRs and SNPs using Verogen’s ForenSeq DNA signature prep kit and MinION.
      ]. Some specific locus-dependent success-limiting factors hampering accurate STR genotyping could be identified, one of them being the presence of homopolymers in the repeat or flanking region [
      • Tytgat O.
      • Gansemans Y.
      • Weymaere J.
      • Rubben K.
      • Deforce D.
      • Van Nieuwerburgh F.
      Nanopore sequencing of a forensic STR multiplex reveals Loci suitable for single-contributor STR profiling.
      ]. In this research, a forensic STR and SNP multiplex is nanopore sequenced using an R10.3 flowcell. This novel type of flowcell was designed to resolve homopolymers with higher accuracy. Its pores feature two constrictions, both of which modulate the raw signal obtained during sequencing [
      • Van der Verren S.E.
      • Van Gerven N.
      • Jonckheere W.
      • Hambley R.
      • Singh P.
      • Kilgour J.
      • Jordan M.
      • Wallace E.J.
      • Jayasinghe L.
      • Remaut H.
      A dual-constriction biological nanopore resolves homonucleotide sequences with high fidelity.
      ]. Moreover, two state-of-the-art basecallers (Guppy and Bonito) are compared for this purpose, and an improved alignment-based analysis method is demonstrated.

      2. Materials and methods

      2.1 Samples

      The results presented in this research were obtained from six samples: two commercially available reference DNA samples (9947a and 9948) from OriGene (Rockville, Maryland, USA), and four blood samples collected from anonymous donors. The ethical review board of Ghent University Hospital provided ethical approval, and all healthy volunteers signed the informed consent (BC-05557). The blood samples were obtained by a finger puncture using a 21G Minicollect® Lancelino safety lancet with a penetration depth of 2.4 mm (Greiner Bio-One, Kremsmünster, Austria) and collected in a K3E K3EDTA Minicollect® collection tube (Greiner Bio-One, Kremsmünster, Austria). DNA extraction of the blood samples was performed using the DNeasy® Blood and Tissue kit according to the manufacturer’s instructions.

      2.2 PCR amplification

      All six samples were amplified using the ForenSeq DNA Signature Prep Kit (Verogen, San Diego, USA), using Primer Mix A, according to the manufacturer’s instructions. Primer Mix A contains 149 primer pairs, targeting 27 autosomal STRs, 24 Y-STRs, 7 X-STRs, and 94 identity SNPs. It should be noted that primers designed for some other loci (e.g. SE33) are also included in the Primer Mix, but are not analysed by the Universal Analysis Software [
      • Wick R.R.
      • Judd L.M.
      • Holt K.E.
      Performance of neural network basecalling tools for Oxford Nanopore sequencing.
      ]. All primers consist of a target-specific region and a mutual overhang region. A first PCR was performed using a SimpliAmp Thermal Cycler (ThermoFisher Scientific, Waltham, MA, USA), during which the target-specific regions anneal to their complement. In a second PCR step, the targets were enriched, along with the incorporation of indexes for sample de-multiplexing and sequencing adapters. After amplification, samples were purified using the Sample Purification beads provided in the DNA Signature Prep Kit, according to the manufacturer’s instructions. Elution was performed in 52.5 μL Resuspension Buffer, and aliquots of this eluate were subjected to both Verogen and nanopore sequencing.

      2.3 Verogen sequencing

      After performing bead-based normalization, the samples were pooled and denatured as specified in the DNA Signature Prep Kit protocol. Immediately after denaturation, the samples were loaded on the reagent cartridge. Paired-end sequencing was performed on the MiSeq FGx Sequencing System (Illumina, San Diego, USA). Data analysis was done using the ForenSeq Universal Analysis Software v1.3 (Verogen). For all loci, the analytical threshold, which is the lower limit of detection, was set at 1.5% of the reads for the specific locus, and the interpretation threshold at 4.5%. This implies that each allele to which more than 4.5% of the reads are assigned should be interpreted as a true allele or a stutter. The threshold for intra-locus imbalance was set at 60% for STRs and at 50% for SNPs. The stutter filter settings for the STRs were locus dependent; default settings were used.
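      The threshold logic described above can be illustrated with a minimal Python sketch. It is not the Universal Analysis Software implementation: the function names, the label strings, the toy read counts, the choice of an inclusive analytical threshold, and the definition of intra-locus balance as the ratio of the lower to the higher read count are all assumptions made for illustration.

```python
def classify_alleles(read_counts, analytical=0.015, interpretation=0.045):
    """Label each allele of one locus by its share of the locus reads.

    Assumption: the analytical threshold is inclusive (>=), while the
    interpretation threshold is exclusive (>), following the wording
    "more than 4.5% of the reads".
    """
    total = sum(read_counts.values())
    labels = {}
    for allele, count in read_counts.items():
        fraction = count / total if total else 0.0
        if fraction > interpretation:
            labels[allele] = "interpret"          # true allele or stutter
        elif fraction >= analytical:
            labels[allele] = "detected"           # above detection limit only
        else:
            labels[allele] = "below_analytical"   # treated as noise
    return labels


def is_balanced(count_a, count_b, min_ratio=0.60):
    """Assumed balance metric: lower read count divided by higher read count."""
    low, high = sorted((count_a, count_b))
    return high > 0 and low / high >= min_ratio


# Hypothetical read counts for one STR locus (1,000 reads in total).
locus_counts = {"12": 480, "13": 450, "11": 40, "14": 20, "10": 10}
locus_labels = classify_alleles(locus_counts)
print(locus_labels)
print(is_balanced(locus_counts["12"], locus_counts["13"]))
```

      With these toy counts, alleles 12 and 13 exceed the interpretation threshold, alleles 11 and 14 fall between the two thresholds, and allele 10 is below the analytical threshold. For SNPs, the same balance check would be called with min_ratio=0.50.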

      2.4 Nanopore sequencing

      The purified amplicons obtained after the PCR2 amplification step were subjected to library preparation for nanopore sequencing as described in previous work [
      • Tytgat O.
      • Gansemans Y.
      • Weymaere J.
      • Rubben K.
      • Deforce D.
      • Van Nieuwerburgh F.
      Nanopore sequencing of a forensic STR multiplex reveals Loci suitable for single-contributor STR profiling.
      ]. An input of 75 ng was used for each sample. DNA repair and end preparation were performed using NEBNext FFPE DNA Repair Mix and NEBNext End Repair/dA-Tailing Module (NEB, Ipswich, MA, USA). After purification using a 1.8 × volume of AMPure XP beads (Beckman Coulter, High Wycombe, UK), barcode ligation was performed using the Native Barcoding Expansion 1–12 (EXP-NBD104) kit (ONT, Oxford, UK). This was realized by adding 25 μL NEB Blunt/TA Ligase Master Mix and 2.5 μL Native Barcode to the sample, followed by a 10 min incubation step at room temperature. Next, the barcoded amplicons were purified using a 1.8 × volume of AMPure XP beads and quantified using a Qubit dsDNA High Sensitivity Assay Kit (Thermo Fisher, Waltham, MA, USA). An equimolar pool, with a total input of 50 ng, was subjected to adapter ligation. To realize this, 5 μL of Adaptor mix II, 20 μL of NEBNext Quick Ligation Reaction Buffer, and 10 μL of Quick T4 DNA Ligase were added to the library, followed by a 10 min incubation step at room temperature. Again, a purification step was performed using a 1.8 × volume of AMPure XP beads, followed by quantification of the final library using a Qubit fluorimeter. A total of 19 ng of DNA was loaded onto the SpotON flowcell, according to the manufacturer’s instructions, using the SQK-LSK109 kit (ONT, Oxford, UK). Sequencing was performed using a GridION device, which accommodates the same flowcells as the portable MinION device. Sequencing was performed for 48 h to obtain a maximal amount of data. However, as the flowcell quality deteriorates over time, most reads are obtained during the first hours of sequencing.

      2.5 Data analysis

      Basecalling was performed with both the fully supported basecaller Guppy (v.4.3.4, ONT) and the research basecaller Bonito (v.0.3.8, ONT). Sample de-multiplexing based on the barcode sequence was done in real-time by the MinKNOW control software. For autosomal STR genotyping, a reference library (Supplementary File 2) was constructed for all investigated STR-loci, encompassing all alleles occurring within the European population with a frequency > 1%. The population information was obtained using the pop.STR database [
      • Amigo J.
      • Phillips C.
      • Salas T.
      • Formoso L.F.
      • Carracedo Á.
      • Lareu M.
      pop. STR—an online population frequency browser for established and new forensic STRs.
      ], while the sequence data were obtained from STRbase [
      • Ruitberg C.M.
      • Reeder D.J.
      • Butler J.M.
      STRBase: a short tandem repeat DNA database for the human identity testing community.
      ] and STRSeq [
      • Gettings K.B.
      • Borsuk L.A.
      • Ballard D.
      • Bodner M.
      • Budowle B.
      • Devesse L.
      • King J.
      • Parson W.
      • Phillips C.
      • Vallone P.M.
      STRSeq: a catalog of sequence diversity at human identification Short Tandem Repeat loci.
      ]. Moreover, the frequently occurring iso-alleles reported in the STRbase and STRSeq databases were included, as well as iso-alleles detected in the Verogen sequencing results. For most Y-STR loci, sequence information was obtained from STRBase 2.0 [
      • Ruitberg C.M.
      • Reeder D.J.
      • Butler J.M.
      STRBase: a short tandem repeat DNA database for the human identity testing community.
      ]. For Y-STR loci not included in the STRBase 2.0 database, the sequences of the repeat unit and the flanking regions were retrieved from the Verogen reads obtained in this study. Population data were retrieved from the YHRD database [
      • Willuweit S.
      • Roewer L.
      • International Forensic Y Chromosome User Group
      Y chromosome haplotype reference database (YHRD): update.
      ], as well as from data obtained by Kline et al. [
      • Butler J.M.
      • Decker A.E.
      • Vallone P.M.
      • Kline M.C.
      Allele frequencies for 27 Y-STR loci with US Caucasian, African American, and Hispanic samples.
      ]. X-STR sequence information and population data were obtained from Borsuk et al. [
      • Borsuk L.A.
      • Steffen C.R.
      • Kiesler K.M.
      • Vallone P.M.
      • Gettings K.B.
      Sequence-based US population data for 7 X-STR loci.
      ]. Alignment of the obtained reads against this reference library was performed with the Burrows-Wheeler Aligner (BWA, v0.7.17) with the -x ont2d option enabled [
      • Li H.
      Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM.
      ]. To lower the noise in the genotyping data, caused by amplification and sequencing errors, an alignment score (AS) filter was applied. As the AS reflects how well the obtained read resembles the reference it aligned to, higher AS values are expected for true alleles and stutters. Reads affected by sequencing and basecalling errors might be misaligned, which is characterized by a lower AS. The maximal AS is read-specific and equals the read span. Based on the CIGAR string, the read span for each aligned read was calculated. All reads with an AS lower than 90% of the read span were discarded. The resulting read counts per allele were used for genotyping, using the same genotyping rule as used in previous research [
      • Tytgat O.
      • Gansemans Y.
      • Weymaere J.
      • Rubben K.
      • Deforce D.
      • Van Nieuwerburgh F.
      Nanopore sequencing of a forensic STR multiplex reveals Loci suitable for single-contributor STR profiling.
      ]: for each locus, the allele with the highest read count was called as present, as well as the second most abundant allele if the corresponding read count equals at least 50% of the maximal read count. For single-copy Y-STR loci, only the allele with the highest read count was called as present.
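      The AS filter and the genotyping rule described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' pipeline. It assumes that the read span is the number of read bases inside the alignment (CIGAR operations M, I, = and X, with clipped bases excluded), and that each alignment has already been reduced to a hypothetical (allele, AS, CIGAR) tuple; the function names and toy values are likewise assumptions.

```python
import re
from collections import Counter

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
# Assumption: the read span counts read bases inside the alignment;
# soft/hard clips (S, H) and deletions (D) do not contribute.
SPAN_OPS = {"M", "I", "=", "X"}


def read_span(cigar):
    """Sum the lengths of the CIGAR operations that count towards the span."""
    return sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in SPAN_OPS)


def passes_as_filter(alignment_score, cigar, min_fraction=0.90):
    """Keep a read only if its AS is at least 90% of its read span."""
    span = read_span(cigar)
    return span > 0 and alignment_score >= min_fraction * span


def call_str_genotype(allele_counts, second_allele_ratio=0.50, single_copy=False):
    """Call the top allele, plus the runner-up if it reaches 50% of the top count."""
    ranked = sorted(allele_counts.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return []
    called = [ranked[0][0]]
    if (not single_copy and len(ranked) > 1
            and ranked[1][1] >= second_allele_ratio * ranked[0][1]):
        called.append(ranked[1][0])
    return called


# Hypothetical alignments for one locus: (allele, AS, CIGAR).
alignments = [
    ("11", 148, "150M"),
    ("11", 120, "150M"),       # AS below 90% of span: discarded
    ("12", 140, "5S145M"),
    ("11", 146, "100M2I48M"),
    ("10", 100, "130M20S"),    # AS below 90% of span: discarded
    ("12", 139, "148M2D"),
]
filtered_counts = Counter(
    allele for allele, score, cigar in alignments if passes_as_filter(score, cigar)
)
genotype = call_str_genotype(filtered_counts)
print(dict(filtered_counts), genotype)
```

      For a single-copy Y-STR locus, the same function would be called with single_copy=True so that only the most abundant allele is reported.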
      For SNP genotyping, all reads were aligned using the Burrows-Wheeler Aligner (BWA, v0.7.17) to a library of reference sequences, containing one reference per locus. These references are all 51 nucleotides long, with the SNP positioned centrally, and were retrieved from the Single Nucleotide Polymorphism database (dbSNP) [
      • Sherry S.T.
      • Ward M.-H.
      • Kholodov M.
      • Baker J.
      • Phan L.
      • Smigielski E.M.
      • Sirotkin K.
      dbSNP: the NCBI database of genetic variation.
      ]. Based on the obtained alignment data, the nucleotide variations at all positions were extracted. The read count corresponding to each possible allele at the SNP position was obtained by SAMtools (v1.11) [
      • Li H.
      • Handsaker B.
      • Wysoker A.
      • Fennell T.
      • Ruan J.
      • Homer N.
      • Marth G.
      • Abecasis G.
      • Durbin R.
      The sequence alignment/map format and SAMtools.
      ] and BCFtools (v1.6–45-gdb2e2b6) [
      • Danecek P.
      • Bonfield J.K.
      • Liddle J.
      • Marshall J.
      • Ohan V.
      • Pollard M.O.
      • Whitwham A.
      • Keane T.
      • McCarthy S.A.
      • Davies R.M.
      Twelve years of SAMtools and BCFtools.
      ], and was used for SNP genotyping. An arbitrary allelic imbalance cut-off should be set for heterozygous samples, to manage PCR and sequencing bias, as well as biological phenomena such as copy number variation and somatic mutations. All alleles representing more than 20% of the total read count were called as present.
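      The 20% calling rule can be expressed as a short Python sketch. It assumes the per-base read counts at the SNP position have already been extracted from the alignments; the function name and the toy counts are hypothetical, and the strict "more than 20%" comparison follows the wording above.

```python
def call_snp_genotype(base_counts, min_fraction=0.20):
    """Return every base supported by more than 20% of the reads at the SNP."""
    total = sum(base_counts.values())
    if total == 0:
        return []  # no coverage: no call
    return sorted(
        base for base, count in base_counts.items()
        if count / total > min_fraction
    )


# Hypothetical read counts at three SNP positions.
heterozygous = call_snp_genotype({"A": 520, "G": 450, "C": 20, "T": 10})
homozygous = call_snp_genotype({"A": 900, "G": 80, "C": 15, "T": 5})
imbalanced = call_snp_genotype({"A": 850, "G": 150})
print(heterozygous, homozygous, imbalanced)
```

      The third toy locus shows the limitation of a fixed cut-off: a true heterozygote whose minor allele drops to 15% of the reads would be called homozygous.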

      3. Results and discussion

      3.1 Verogen sequencing and genotyping

      Six samples were amplified with the ForenSeq DNA Signature Prep Kit and sequenced with a MiSeq FGx® device. An average of 283,778 ± 90,840 reads were sequenced per sample, of which 62.7% was assigned to an STR locus, and 37.3% to an SNP locus, on average. The genotypes obtained by the Universal Analysis Software are shown in Supplementary Table S1. On average, 93% of the Verogen reads corresponding to one of the targeted STR loci were assigned to a true allele, ranging from 86% to 99%. This is shown in Table 1 and Supplementary Table S2, for the autosomal and the non-autosomal STR loci, respectively. Non-true-allele assignment occurs due to amplification errors and artifacts, e.g. stutter, and sequencing error. SNP genotyping could be performed for all but three loci of sample E. For these three loci, there was insufficient sequencing depth. For all STR loci, a sufficient number of reads was obtained to allow genotyping.
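      Table 1. True-allele alignment after Verogen sequencing and nanopore sequencing combined with both Guppy and Bonito basecalling, for all autosomal STR loci.
      Locus | Verogen (%) | Guppy before AS filter (%) | Guppy after AS filter (%) | Guppy difference (percentage points) | Bonito before AS filter (%) | Bonito after AS filter (%) | Bonito difference (percentage points)
      D2S441 | 97 | 80 | 90 | 10 | 89 | 92 | 3
      PentaD | 98 | 51 | 60 | 9 | 73 | 77 | 4
      D18S51 | 88 | 50 | 58 | 8 | 41 | 41 | 0
      PentaE | 99 | 67 | 75 | 8 | 82 | 83 | 1
      D1S1656 | 91 | 73 | 81 | 8 | 82 | 84 | 2
      D9S1122 | 92 | 73 | 80 | 7 | 83 | 85 | 2
      CSF1PO | 97 | 71 | 77 | 6 | 76 | 77 | 1
      D5S818 | 96 | 71 | 77 | 6 | 81 | 82 | 1
      D3S1358 | 92 | 74 | 79 | 5 | 82 | 84 | 2
      D4S2408 | 97 | 75 | 80 | 5 | 84 | 87 | 3
      D8S1179 | 90 | 71 | 76 | 5 | 79 | 82 | 3
      D13S317 | 96 | 69 | 74 | 5 | 77 | 81 | 4
      FGA | 88 | 45 | 49 | 4 | 34 | 34 | 0
      vWA | 93 | 73 | 77 | 4 | 81 | 82 | 1
      D16S539 | 93 | 73 | 77 | 4 | 78 | 73 | -5
      D17S1301 | 91 | 73 | 77 | 4 | 81 | 83 | 2
      D22S1045 | 92 | 68 | 72 | 4 | 77 | 79 | 2
      D2S1338 | 87 | 72 | 76 | 4 | 78 | 79 | 1
      D6S1043 | 92 | 66 | 70 | 4 | 74 | 76 | 2
      D12S391 | 86 | 58 | 61 | 3 | 67 | 68 | 1
      D19S433 | 95 | 77 | 80 | 3 | 80 | 82 | 2
      TH01 | 95 | 87 | 90 | 3 | 90 | 91 | 1
      TPOX | 97 | 90 | 93 | 3 | 94 | 96 | 2
      D20S482 | 92 | 67 | 70 | 3 | 75 | 77 | 2
      D10S1248 | 91 | 80 | 82 | 2 | 82 | 83 | 1
      D21S11 | 93 | 78 | 71 | -7 | 84 | 85 | 2
      D7S820 | 96 | 79 | N/A | N/A | 85 | N/A | N/A
      Average | 93 | 71 | 75 | 4 | 77 | 79 | 2
      Standard deviation | 3 | 10 | 10 | 3 | 13 | 13 | 2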

      3.2 Nanopore sequencing, basecalling, alignment, and alignment filtering

      Nanopore sequencing using an R10.3 flowcell resulted in 132,268 ± 56,719 Guppy-basecalled reads per sample that aligned to an STR locus. After AS filtering, 54.7% of these reads were retained, on average. After Bonito basecalling, 71,906 ± 38,373 reads per sample aligned to an STR locus, of which 85.4% were retained after AS filtering. SNP variant calling could be performed using 60,049 ± 27,124 Guppy-basecalled reads per sample, whereas 50,441 ± 27,985 Bonito-basecalled reads per sample could be used for SNP genotyping. An overview of the sequencing depth per sample is shown in Table 2. The read counts obtained after alignment and AS filtering are shown in Supplementary Files 3 and 4, for Guppy and Bonito, respectively.
      Table 2. Sequencing depth after nanopore sequencing.
      Metric | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 | Sample 6 | Average | Standard deviation
      Guppy basecalled STR reads | 105,211 | 103,499 | 60,699 | 139,864 | 225,107 | 159,228 | 132,268 | 56,719
      % retained after AS filtering | 52.2 | 56.1 | 56.8 | 53.7 | 53.6 | 57.0 | 54.7 | 2.0
      Bonito basecalled STR reads | 59,645 | 58,127 | 14,051 | 80,653 | 129,164 | 89,796 | 71,906 | 38,373
      % retained after AS filtering | 84.1 | 85.5 | 84.2 | 86.4 | 85.8 | 85.1 | 85.4 | 0.9
      Guppy basecalled SNP reads | 34,713 | 47,612 | 26,038 | 85,818 | 81,930 | 84,181 | 60,049 | 27,124
      Bonito basecalled SNP reads | 28,611 | 42,729 | 9,270 | 77,255 | 71,305 | 73,474 | 50,441 | 27,985
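      As a worked check of the summary statistics, the per-sample Guppy STR read counts in Table 2 can be summarized with Python's standard library. The reported "± 56,719" is reproduced when the sample standard deviation (n − 1 denominator) is used; the variable names are illustrative.

```python
import statistics

# Guppy-basecalled STR reads per sample, taken from Table 2.
guppy_str_reads = [105_211, 103_499, 60_699, 139_864, 225_107, 159_228]

mean_reads = statistics.mean(guppy_str_reads)
sd_reads = statistics.stdev(guppy_str_reads)  # sample SD (n - 1 denominator)

print(f"{mean_reads:,.0f} ± {sd_reads:,.0f} reads per sample")
```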
      Table 1 and Supplementary Table S2 show the percentage of the nanopore reads aligning to the true allele(s) both before and after AS filtering. Discarding the reads with an insufficiently high alignment score increased the true-allele alignment for almost all loci, by up to 10 percentage points. However, for two loci, the true-allele alignment decreased. On average, AS filtering increased the true-allele alignment by about 4 percentage points for Guppy reads, whereas for the Bonito reads this metric increased by only 2 percentage points. A two-tailed, paired t-test indicated no statistically significant difference in true-allele alignment between the two basecallers (p = 0.49).
      A violin plot illustrating the distribution of alignment scores after Guppy basecalling, normalized for read span, is shown in Fig. 1 for all autosomal STR loci, and in Supplementary Fig. S1 for the non-autosomal STR loci. Fig. 1 shows that for locus D7S820, an insufficient number of reads were retained after AS filtering. Therefore, this step was not applied for the D7S820 locus. The cut-off for AS filtering of 90% was chosen arbitrarily, and should preferably be optimized for each locus separately. Nevertheless, these findings clearly show that a substantial part of the obtained noise can be filtered out bioinformatically.
      Fig. 1. Violin plot showing the distribution of alignment scores after Guppy basecalling, normalized by read span, for all autosomal STR loci.

      3.3 SNP genotyping after nanopore sequencing

      A heterozygous sample is theoretically expected to result in a 50:50 ratio of reads for both alleles. However, the amplification, sequencing, basecalling, and variant calling process causes deviations from this theoretical ratio. Therefore, an allele was called as present when more than 20% of the reads were assigned to this allele. Fig. 2 shows an overview of the SNP genotyping results after both Guppy and Bonito basecalling. Variant calling of both datasets resulted in an accuracy of 99%, as 555 out of 561 SNP loci were genotyped correctly, taking all SNP loci of all samples together. Due to low read depth, allelic drop-out was observed for locus rs4606077 in three samples, and locus rs907100 in one sample. Two samples were genotyped incorrectly for locus rs6955448, due to an allelic imbalance which was also present in the Verogen data and thus most probably originated during PCR. Locus rs1031825, which is characterized by the presence of a homopolymer next to the SNP position, was genotyped incorrectly for sample F. Three SNP loci could not be determined by Verogen, due to insufficient read depth. As a consequence, the nanopore sequencing data of these loci could not be compared to the Verogen data.
      Fig. 2. Overview of genotyping results after both Guppy and Bonito basecalling, for all SNP loci. Green indicates correct genotyping; red indicates incorrect genotyping; blue indicates incorrect genotyping due to allelic imbalance which is also present in the Verogen data. For the loci indicated in grey, no Verogen data were obtained due to insufficient read depth. (For interpretation of the references to colour in this figure, the reader is referred to the web version of this article.)

      3.4 STR genotyping after nanopore sequencing

      The AS-filtered alignment results obtained after nanopore sequencing were used for STR genotyping. The concordance between nanopore and Verogen results for all samples was assessed by comparing the obtained length-based genotypes, for both Guppy basecalling and Bonito basecalling. An overview is shown in Fig. 3. In general, for both basecallers, most autosomal STR loci were called correctly for all samples. Nevertheless, some remarkable differences between the two datasets should be pointed out. Genotyping using Guppy basecalled reads was correct for all loci, except for two genotypes for locus PentaD. Interestingly, these genotypes were called correctly using Bonito basecalled reads, implying that all loci were genotyped correctly by at least one basecaller. However, all obtained genotypes for loci D18S51 and FGA were incorrect after Bonito basecalling, as well as locus D16S539 for sample A. Moreover, locus CSF1PO for sample C was also genotyped incorrectly after Bonito basecalling. Although the average true-allele alignment for this locus is relatively high, the insufficient read depth of 24 leads to incorrect genotyping.
      Fig. 3. Overview of genotyping results after both Guppy and Bonito basecalling, for all autosomal STR loci. Green indicates correct genotyping; yellow indicates correct genotyping that was hampered by other highly represented alleles; red indicates incorrect genotyping; and blue indicates incorrect genotyping due to allelic imbalance that originated during PCR and is thus also present in the Verogen data. (For interpretation of the references to colour in this figure, the reader is referred to the web version of this article.)
      Fig. 3 shows that some loci, indicated in blue, were characterized by allelic imbalance which was also observed in the Verogen data. This imbalance thus originates from PCR bias or biological phenomena, such as copy number variations and somatic mutations. Due to slippage of the polymerase during amplification, stutter + 1 and stutter − 1 PCR artifacts are present in the amplified sample. Unfortunately, the signal corresponding to these artifacts is increased due to nanopore sequencing and alignment errors. For some loci, this often results in other highly represented alleles. This makes differentiation between imbalance and sequencing or alignment noise challenging. Currently, a peak should be at least 50% of the highest peak to be called as a true allele, which is an arbitrary cut-off. Making this rule less stringent would allow correct genotyping despite the occurrence of such allelic imbalance, but would lead to drop-ins.
      For the non-autosomal STR-loci, a similar pattern is observed. All samples were genotyped correctly using Guppy, except for locus DXS10135 (three genotypes) and locus DXS10103 (one genotype), as shown in Supplementary Fig. S2. Bonito basecalled reads resulted in incorrect profiles for 7 loci, for most samples. This indicates that, although the average true-allele alignment is slightly higher after Bonito basecalling, genotyping fails consistently for a specific subset of STR loci using this basecaller.
      In general, both the alignment and genotyping data show that homopolymers, high repeat numbers, complex repeat patterns, and a high similarity between the repeat region and the flanking regions hamper the accuracy of STR genotyping after nanopore sequencing. These findings correspond well to our previous study [
      • Tytgat O.
      • Gansemans Y.
      • Weymaere J.
      • Rubben K.
      • Deforce D.
      • Van Nieuwerburgh F.
      Nanopore sequencing of a forensic STR multiplex reveals Loci suitable for single-contributor STR profiling.
      ], where we identified these success-limiting locus-dependent characteristics. Loci that proved to be troublesome for nanopore sequencing were vWA, SE33, FGA, D21S11, D18S51, and D3S1358. Although the genotyping accuracy improved for most of these loci by using improved basecallers and AS filtering, loci FGA and D18S51 remain challenging for this purpose. Nevertheless, the data obtained in this research show a substantial improvement compared to previous studies.

      3.5 Genotyping of STR iso-alleles

      Sequencing-based genotyping of STR alleles has the added advantage of yielding information on iso-alleles, thereby increasing the discriminatory power of the assay. Frequently occurring iso-alleles were included in the reference allele library for alignment. Table 3 shows that nanopore sequencing is capable of accurately genotyping iso-alleles, as on average 96% of the reads aligning to one of the alleles with the correct repeat number, aligned to the truly present iso-allele, whereas only 4% of the reads aligned to non-true iso-alleles with the same repeat number. Moreover, as shown in Fig. 4, four genotypes could correctly be called as heterozygous, although both alleles share the same repeat number. This added discriminatory power is of great value for genotyping of low-input samples, which are prone to allelic drop-out, or mixture samples.
      Table 3. Iso-allele genotyping accuracy: number of Guppy basecalled reads that aligned to a true iso-allele and the number of reads that aligned to the other iso-alleles with the same repeat number.
      Locus | True iso-allele alignment (number of reads) | Non-true iso-allele alignment (number of reads) | True iso-allele alignment (%)
      DYS448 | 441 | 0 | 100
      D3S1358 | 6055 | 130 | 98
      D8S1179 | 1998 | 38 | 98
      vWA | 309 | 7 | 98
      D9S1122 | 3131 | 98 | 97
      D5S818 | 418 | 18 | 96
      D21S11 | 2620 | 179 | 93
      DYS389II | 1604 | 114 | 93
      Total | 16,576 | 584 | 96
      Fig. 4. Iso-allele genotyping reveals four additional heterozygous samples. The Y-axis shows the relative frequency of reads per allele, the X-axis shows the alleles. Sample number and locus are indicated above the bar chart.

      3.6 Data analysis strategy

      Multiple tools have been described to perform STR genotyping using NGS data, e.g. STRinNGS [
      • Jønck C.G.
      • Qian X.
      • Simayijiang H.
      • Børsting C.
      STRinNGS v2.0: improved tool for analysis and reporting of STR sequencing data.
      ], STRait Razor [
      • Warshauer D.H.
      • Lin D.
      • Hari K.
      • Jain R.
      • Davis C.
      • LaRue B.
      • King J.L.
      • Budowle B.
      STRait Razor: a length-based forensic STR allele-calling tool for use with second generation sequencing data.
      ], MyFLq [
      • Van Neste C.
      • Vandewoestyne M.
      • Van Criekinge W.
      • Deforce D.
      • Van Nieuwerburgh F.
      My-Forensic-Loci-queries (MyFLq) framework for analysis of forensic STR data generated by massive parallel sequencing.
      ], RepeatHMM [
      • Liu Q.
      • Zhang P.
      • Wang D.
      • Gu W.
      • Wang K.
      Interrogating the “unsequenceable” genomic trinucleotide repeat disorders by long-read sequencing.
      ], and FDSTools [
      • Hoogenboom J.
      • van der Gaag K.J.
      • de Leeuw R.H.
      • Sijen T.
      • de Knijff P.
      • Laros J.F.
      FDSTools: a software package for analysis of massively parallel sequencing data with the ability to recognise and correct STR stutter and other PCR or sequencing noise.
      ]. Ren and colleagues used RepeatHMM to perform STR genotyping on a dataset obtained by nanopore sequencing of a Verogen ForenSeq DNA Signature Prep Kit library with an R9.4 flowcell and Guppy basecalling (v4.2.2) [
      • Ren Z.-L.
      • Zhang J.-R.
      • Zhang X.-M.
      • Liu X.
      • Lin Y.-F.
      • Bai H.
      • Wang M.-C.
      • Cheng F.
      • Liu J.-D.
      • Li P.
      Forensic nanopore sequencing of STRs and SNPs using Verogen’s ForenSeq DNA signature prep kit and MinION.
      ]. This resulted in a low accuracy overall. Moreover, Ren and colleagues analyzed the same dataset with a custom tool that aims to extract individual repeats from each read. However, this tool is not able to genotype partial repeats, and genotyping failed for about 50% of the autosomal STR loci. The results obtained in our research show that alignment of the reads to a reference allele database yields more accurate results, as all autosomal STR loci were genotyped correctly by at least one of both used basecallers. Moreover, we re-analyzed the reads obtained by Ren et al. [dataset] [
      • Ren Zi-Lin
      • Zhang Jia-Rong
      • Zhang Xiao-Meng
      • Lin Yan-Feng
      • Bai Hua
      • Wang Meng-Chun
      • Cheng Feng
      • Liu Jin-Ding
      • Li Peng
      • Kong Lei
      • Chen Xiao
      • Wang Sheng-Qi
      • Ni Ming
      • Yan Jiang-Wei
      Forensic nanopore sequencing of STRs and SNPs using Verogen’s ForenSeq DNA Signature Prep Kit and MinION.
      ] with our workflow for the 2800 M positive control sample, which was sequenced in triplicate. The read counts obtained after alignment can be found in Supplementary File 5. A genotyping accuracy of 100% was obtained for all three triplicates. These findings suggest that the use of the R10.3 flowcell may not increase the genotyping accuracy. Moreover, this clearly indicates the importance of selecting a suited analysis method. The major drawback to our alignment strategy is the fact that the presence of all possible (iso-) alleles in the library is crucial. Absence of the true allele in the reference library might lead to incorrect genotyping for this specific locus. The library should thus constantly be improved based on population data gathered by sequencing. As sequencing is becoming more important in the field of forensic genotyping, expanding the existing databases (e.g. pop.STR [
      • Amigo J.
      • Phillips C.
      • Salas T.
      • Formoso L.F.
      • Carracedo Á.
      • Lareu M.
      pop. STR—an online population frequency browser for established and new forensic STRs.
      ] and STRSeq [
      • Gettings K.B.
      • Borsuk L.A.
      • Ballard D.
      • Bodner M.
      • Budowle B.
      • Devesse L.
      • King J.
      • Parson W.
      • Phillips C.
      • Vallone P.M.
      STRSeq: a catalog of sequence diversity at human identification Short Tandem Repeat loci.
      ]) with data gathered by the community will be crucial. Nevertheless, nanopore sequencing has become a method capable of accurately genotyping forensic samples, and further improvements might enable implementation of this genotyping method in forensic routine.
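
The general idea of alignment-based genotyping with alignment score filtering can be illustrated with a minimal Python sketch. It is not the workflow used in this research: the function `call_genotype`, the allele labels, and the thresholds `min_score` and `min_fraction` are hypothetical and chosen only for illustration. The input is assumed to be a list of reads that have already been aligned to a reference allele library, each with the best-matching allele and its alignment score.

```python
from collections import Counter


def call_genotype(alignments, min_score=200, min_fraction=0.25):
    """Call a genotype at one locus from (read_id, allele, score) tuples.

    min_score and min_fraction are hypothetical thresholds, not values
    taken from the study.
    """
    # Step 1: alignment score filtering removes poorly aligned (noisy) reads.
    counts = Counter(
        allele for _, allele, score in alignments if score >= min_score
    )
    total = sum(counts.values())
    if total == 0:
        return []  # no usable reads, so no call is made
    # Step 2: keep at most the two best-supported alleles, each of which
    # must account for at least min_fraction of the filtered reads.
    best = counts.most_common(2)
    return sorted(allele for allele, n in best if n / total >= min_fraction)


# Toy example: two iso-alleles with the same repeat number (15) plus noise.
alignments = (
    [(f"a{i}", "15_isoA", 260) for i in range(48)]
    + [(f"b{i}", "15_isoB", 255) for i in range(44)]
    + [(f"c{i}", "14_isoA", 150) for i in range(30)]  # removed by score filter
    + [(f"d{i}", "16_isoA", 240) for i in range(8)]   # below min_fraction
)
genotype = call_genotype(alignments)
print(genotype)
```

In this toy example the sample is called heterozygous for two iso-alleles that share a repeat number, which a length-based method could not distinguish. The sketch also makes the drawback discussed above concrete: a read can only be counted for an allele that is present in the reference library.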

      4. Conclusion

      Nanopore sequencing of a forensic STR and SNP multiplex was compared to Illumina sequencing for six single-contributor samples. Basecalling was performed with two state-of-the-art basecallers, Guppy and Bonito. Both datasets resulted in a 99% SNP genotyping accuracy. All autosomal STR genotypes were accurately called with at least one of the two basecallers. A slightly higher fraction of the reads aligned to a true allele after Bonito basecalling, yet genotyping accuracy was lower for this basecaller, as a specific subset of loci failed consistently. Our analysis method, based on alignment of STR reads to a reference library with subsequent filtering based on the alignment score, was capable of accurately genotyping iso-alleles. The STR profiling after nanopore sequencing presented in this research is considerably more accurate than in previous studies. These findings are an important step towards on-site sequencing of forensic samples using an affordable, handheld MinION device.

      CRediT authorship contribution statement

      Olivier Tytgat: Conceptualization, Investigation, Writing – original draft. Sonja Škevin: Formal analysis, Writing – review & editing. Dieter Deforce: Writing – review & editing, Supervision. Filip Van Nieuwerburgh: Conceptualization, Writing – review & editing, Supervision.

      Funding

      This work was supported by a PhD grant from the Special Research Fund (BOF) of Ghent University [Grant BOF18/DOC/200 to O.T.].

      Competing interest

      None declared.

      Appendix A. Supplementary material


      References

        • Bruijns B.
        • Tiggelaar R.
        • Gardeniers H.
        Massively parallel sequencing techniques for forensics: a review.
        Electrophoresis. 2018; 39: 2642-2654
        • Butler J.M.
        Forensic DNA Typing: Biology, Technology, and Genetics of STR Markers.
        second ed. Elsevier, Burlington, USA 2005
        • Butler J.
        • McCord B.
        • Jung J.
        • Allen R.
        Rapid analysis of the short tandem repeat HUMTH01 by capillary electrophoresis.
        BioTechniques. 1994; 17 (1066, 1068 passim): 1062-1064
        • Børsting C.
        • Mogensen H.S.
        • Morling N.
        Forensic genetic SNP typing of low-template DNA and highly degraded DNA from crime case samples.
        Forensic Sci. Int. Genet. 2013; 7: 345-352
        • Churchill J.D.
        • Schmedes S.E.
        • King J.L.
        • Budowle B.
        Evaluation of the Illumina® beta version ForenSeq™ DNA signature prep kit for use in genetic profiling.
        Forensic Sci. Int. Genet. 2016; 20: 20-29
        • Plesivkova D.
        • Richards R.
        • Harbison S.
        A review of the potential of the MinION™ single‐molecule sequencing system for forensic applications.
        Wiley Interdiscip. Rev. Forensic Sci. 2019; 1: e1323
        • Wick R.R.
        • Judd L.M.
        • Holt K.E.
        Performance of neural network basecalling tools for Oxford Nanopore sequencing.
        Genome Biol. 2019; 20: 1-10
        • Cornelis S.
        • Gansemans Y.
        • Deleye L.
        • Deforce D.
        • Van Nieuwerburgh F.
        Forensic SNP genotyping using nanopore MinION sequencing.
        Sci. Rep. 2017; 7: 41759
        • Cornelis S.
        • Gansemans Y.
        • Vander Plaetsen A.-S.
        • Weymaere J.
        • Willems S.
        • Deforce D.
        • Van Nieuwerburgh F.
        Forensic tri-allelic SNP genotyping using nanopore sequencing.
        Forensic Sci. Int. Genet. 2019; 38: 204-210
        • Asogawa M.
        • Ohno A.
        • Nakagawa S.
        • Ochiai E.
        • Katahira Y.
        • Sudo M.
        • Osawa M.
        • Sugisawa M.
        • Imanishi T.
        Human short tandem repeat identification using a nanopore-based DNA sequencer: a pilot study.
        J. Hum. Genet. 2019; : 1-4
        • Cornelis S.
        • Willems S.
        • Van Neste C.
        • Tytgat O.
        • Weymaere J.
        • Vander Plaetsen A.-S.
        • Deforce D.
        • Van Nieuwerburgh F.
        Forensic STR profiling using Oxford Nanopore Technologies’ MinION sequencer.
        bioRxiv. 2018; 433151
        • Tytgat O.
        • Gansemans Y.
        • Weymaere J.
        • Rubben K.
        • Deforce D.
        • Van Nieuwerburgh F.
        Nanopore sequencing of a forensic STR multiplex reveals loci suitable for single-contributor STR profiling.
        Genes. 2020; 11: 381
        • Ren Z.-L.
        • Zhang J.-R.
        • Zhang X.-M.
        • Liu X.
        • Lin Y.-F.
        • Bai H.
        • Wang M.-C.
        • Cheng F.
        • Liu J.-D.
        • Li P.
        Forensic nanopore sequencing of STRs and SNPs using Verogen’s ForenSeq DNA signature prep kit and MinION.
        Int. J. Leg. Med. 2021; : 1-9
        • Van der Verren S.E.
        • Van Gerven N.
        • Jonckheere W.
        • Hambley R.
        • Singh P.
        • Kilgour J.
        • Jordan M.
        • Wallace E.J.
        • Jayasinghe L.
        • Remaut H.
        A dual-constriction biological nanopore resolves homonucleotide sequences with high fidelity.
        Nat. Biotechnol. 2020; 38: 1415-1420
        • Amigo J.
        • Phillips C.
        • Salas T.
        • Formoso L.F.
        • Carracedo Á.
        • Lareu M.
        pop.STR—an online population frequency browser for established and new forensic STRs.
        Forensic Sci. Int. Genet. Suppl. Ser. 2009; 2: 361-362
        • Ruitberg C.M.
        • Reeder D.J.
        • Butler J.M.
        STRBase: a short tandem repeat DNA database for the human identity testing community.
        Nucleic Acids Res. 2001; 29: 320-322
        • Gettings K.B.
        • Borsuk L.A.
        • Ballard D.
        • Bodner M.
        • Budowle B.
        • Devesse L.
        • King J.
        • Parson W.
        • Phillips C.
        • Vallone P.M.
        STRSeq: a catalog of sequence diversity at human identification Short Tandem Repeat loci.
        Forensic Sci. Int. Genet. 2017; 31: 111-117
        • Willuweit S.
        • Roewer L.
        • International Forensic Y Chromosome User Group
        Y chromosome haplotype reference database (YHRD): update.
        Forensic Sci. Int. Genet. 2007; 1: 83-87
        • Butler J.M.
        • Decker A.E.
        • Vallone P.M.
        • Kline M.C.
        Allele frequencies for 27 Y-STR loci with US Caucasian, African American, and Hispanic samples.
        Forensic Sci. Int. 2006; 156: 250-260
        • Borsuk L.A.
        • Steffen C.R.
        • Kiesler K.M.
        • Vallone P.M.
        • Gettings K.B.
        Sequence-based US population data for 7 X-STR loci.
        Forensic Sci. Int. Rep. 2020; 2: 100160
        • Li H.
        Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM.
        arXiv preprint arXiv:1303.3997. 2013
        • Sherry S.T.
        • Ward M.-H.
        • Kholodov M.
        • Baker J.
        • Phan L.
        • Smigielski E.M.
        • Sirotkin K.
        dbSNP: the NCBI database of genetic variation.
        Nucleic Acids Res. 2001; 29: 308-311
        • Li H.
        • Handsaker B.
        • Wysoker A.
        • Fennell T.
        • Ruan J.
        • Homer N.
        • Marth G.
        • Abecasis G.
        • Durbin R.
        The sequence alignment/map format and SAMtools.
        Bioinformatics. 2009; 25: 2078-2079
        • Danecek P.
        • Bonfield J.K.
        • Liddle J.
        • Marshall J.
        • Ohan V.
        • Pollard M.O.
        • Whitwham A.
        • Keane T.
        • McCarthy S.A.
        • Davies R.M.
        Twelve years of SAMtools and BCFtools.
        Gigascience. 2021; 10: giab008
        • Jønck C.G.
        • Qian X.
        • Simayijiang H.
        • Børsting C.
        STRinNGS v2.0: improved tool for analysis and reporting of STR sequencing data.
        Forensic Sci. Int. Genet. 2020; 48: 102331
        • Warshauer D.H.
        • Lin D.
        • Hari K.
        • Jain R.
        • Davis C.
        • LaRue B.
        • King J.L.
        • Budowle B.
        STRait Razor: a length-based forensic STR allele-calling tool for use with second generation sequencing data.
        Forensic Sci. Int. Genet. 2013; 7: 409-417
        • Van Neste C.
        • Vandewoestyne M.
        • Van Criekinge W.
        • Deforce D.
        • Van Nieuwerburgh F.
        My-Forensic-Loci-queries (MyFLq) framework for analysis of forensic STR data generated by massive parallel sequencing.
        Forensic Sci. Int. Genet. 2014; 9: 1-8
        • Liu Q.
        • Zhang P.
        • Wang D.
        • Gu W.
        • Wang K.
        Interrogating the “unsequenceable” genomic trinucleotide repeat disorders by long-read sequencing.
        Genome Med. 2017; 9: 1-16
        • Hoogenboom J.
        • van der Gaag K.J.
        • de Leeuw R.H.
        • Sijen T.
        • de Knijff P.
        • Laros J.F.
        FDSTools: a software package for analysis of massively parallel sequencing data with the ability to recognise and correct STR stutter and other PCR or sequencing noise.
        Forensic Sci. Int. Genet. 2017; 27: 27-40
        • Ren Zi-Lin
        • Zhang Jia-Rong
        • Zhang Xiao-Meng
        • Lin Yan-Feng
        • Bai Hua
        • Wang Meng-Chun
        • Cheng Feng
        • Liu Jin-Ding
        • Li Peng
        • Kong Lei
        • Chen Xiao
        • Wang Sheng-Qi
        • Ni Ming
        • Yan Jiang-Wei
        Forensic nanopore sequencing of STRs and SNPs using Verogen’s ForenSeq DNA Signature Prep Kit and MinION.
        Int. J. Leg. Med. 2021; 135