Advertisement
Research Article| Volume 15, P2-7, March 2015

Forensic massively parallel sequencing data analysis tool: Implementation of MyFLq as a standalone web- and Illumina BaseSpace®-application

Open AccessPublished:October 14, 2014DOI:https://doi.org/10.1016/j.fsigen.2014.10.006

      Abstract

      Routine use of massively parallel sequencing (MPS) for forensic genomics is on the horizon. The last few years, several algorithms and workflows have been developed to analyze forensic MPS data. However, none have yet been tailored to the needs of the forensic analyst who does not possess an extensive bioinformatics background.
      We developed our previously published forensic MPS data analysis framework MyFLq (My-Forensic-Loci-queries) into an open-source, user-friendly, web-based application. It can be installed as a standalone web application, or run directly from the Illumina BaseSpace environment. In the former, laboratories can keep their data on-site, while in the latter, data from forensic samples that are sequenced on an Illumina sequencer can be uploaded to Basespace during acquisition, and can subsequently be analyzed using the published MyFLq BaseSpace application. Additional features were implemented such as an interactive graphical report of the results, an interactive threshold selection bar, and an allele length-based analysis in addition to the sequenced-based analysis.
      Practical use of the application is demonstrated through the analysis of four 16-plex short tandem repeat (STR) samples, showing the complementarity between the sequence- and length-based analysis of the same MPS data.

      Keywords

      1. Introduction

      Obtaining forensic DNA profiles of polymorphic short tandem repeat (STR) loci using PCR followed by capillary electrophoresis (CE) is still the gold standard. However, routine use of massively parallel sequencing (MPS) for forensic genomics is on the horizon. MPS technologies do not rely on size separation and thus relieve the limitation on locus multiplexing that is present in CE [
      • Fordyce S.L.
      • Avila-Arcos M.C.
      • Rockenbauer E.
      • Borsting C.
      • Frank-Hansen R.
      • Petersen F.T.
      • Willerslev E.
      • Hansen A.J.
      • Morling N.
      • Gilbert M.T.P.
      High-throughput sequencing of core STR loci for forensic genetic investigations using the Roche Genome Sequencer FLX platform.
      ,
      • Kidd K.K.
      • Kidd J.R.
      • Speed W.C.
      • Fang R.
      • Furtado M.R.
      • Hyland F.C.L.
      • Pakstis A.J.
      Expanding data and resources for forensic use of SNPs in individual identification.
      ]. MPS therefore creates enhanced possibilities within forensic genomics for analyzing degraded samples, mixed samples, and in dealing with kinship or population substructure [
      • McIver L.J.
      • Fondon III, J.W.
      • Skinner M.A.
      • Garner H.R.
      Evaluation of microsatellite variation in the 1000 Genomes Project pilot studies is indicative of the quality and utility of the raw data and alignments.
      ,
      • Weber-Lehmann J.
      • Schilling E.
      • Gradl G.
      • Richter D.C.
      • Wiehler J.
      • Rolf B.
      Finding the needle in the haystack: differentiating “identical” twins in paternity testing and forensics by ultra-deep next generation sequencing.
      ].
      Forensic bioinformaticians have been working on several algorithms to process MPS forensic STR data: lobSTR [
      • Gymrek M.
      • Golan D.
      • Rosset S.
      • Erlich Y.
      lobSTR: a short tandem repeat profiler for personal genomes.
      ], RepeatSeq [
      • Highnam G.
      • Franck C.
      • Martin A.
      • Stephens C.
      • Puthige A.
      • Mittelman D.
      Accurate human microsatellite genotypes from high-throughput resequencing data using informed error profiles.
      ], STRait Razor [
      • Warshauer D.H.
      • Lin D.
      • Hari K.
      • Jain R.
      • Davis C.
      • LaRue B.
      • King J.L.
      • Budowle B.
      STRait Razor: a length-based forensic STR allele-calling tool for use with second generation sequencing data.
      ], TSSV [
      • Anvar S.Y.
      • van der Gaag K.J.
      • van der Heijden J.W.F.
      • Veltrop M.H.A.M.
      • Vossen R.H.A.M.
      • de Leeuw R.H.
      • Breukel C.
      • Buermans H.P.J.
      • Verbeek J.S.
      • de Knijff P.
      • den Dunnen J.T.
      • Laros J.F.J.
      TSSV: a tool for characterization of complex allelic variants in pure and mixed genomes.
      ] and the MyFLq-framework [
      • Van Neste C.
      • Vandewoestyne M.
      • Van Criekinge W.
      • Deforce D.
      • Van Nieuwerburgh F.
      My-Forensic-Loci-queries (MyFLq) framework for analysis of forensic STR data generated by massive parallel sequencing.
      ]. LobSTR and RepeatSeq are both genome wide STR aligners, and therefore outside of the scope of forensic analysis in its current legal and technological setting, in which targeted sequencing of a limited number of validated loci are investigated.
      STRait Razor, TSSV and MyFLq are instead locus-centric, and operate on forensical loci. They require configuration information for each locus in the set, generally consisting of the repeat length of the locus, primer and/or flank sequences, and known alleles for the locus. All three programs have a similar approach to processing the STR data, which is represented in a flowchart in Fig. 1. To date, algorithms in these programs process data to the point of presenting allele candidates (step preceding the dashed red arrow in Fig. 1). It is at this point in the pipeline that data interpretation begins for the forensic analyst.
      All current applications, are command-line based and are thus not well suited to be used by forensic analysts that do not have extensive bioinformatics experience. In this report, we present the MyFLq application that we developed into an open-source, web-based application with a user-friendly graphical user interface. Additional features were implemented such as an interactive graphical report of the results, an interactive threshold selection bar, and an allele length-based analysis in addition to the sequenced-based analysis.

      2. Materials and methods

      MyFLq has been implemented both as a Django web application [
      • Forcier J.
      • Bissex P.
      • Chun W.
      Python Web Development with Django.
      ] and an Illumina BaseSpace application. Both implementations run from the same source code and users have access to the latest stable version, no matter the execution preference of the application. The BaseSpace MyFLq application requires no installation from the user. For the Django application, detailed documentation can be found on the MyFLq GitHub repository (https://github.com/beukueb/myflq). A pdf manual can be downloaded from https://gitprint.com/beukueb/myflq, covering both implementations. The development version and previous builds are only available for the Django application.

      2.1 Samples

      The same data were used as in the MyFLq framework paper [
      • Van Neste C.
      • Vandewoestyne M.
      • Van Criekinge W.
      • Deforce D.
      • Van Nieuwerburgh F.
      My-Forensic-Loci-queries (MyFLq) framework for analysis of forensic STR data generated by massive parallel sequencing.
      ]. The results presented in this report were obtained with sample 9947A_S1, which is a single contributor control DNA sample (Promega) [
      • Levin B.
      • Cheng H.
      • Reeder D.
      A human mitochondrial DNA standard reference material for quality control in forensic identification, medical diagnosis, and mutation detection.
      ]. This sample was amplified using a 16-plex PCR, based on the PowerPlex® 16 primers (Promega) [
      • Masibay A.
      • Mozer T.
      • Sprecher C.
      Promega Corporation reveals primer sequences in its testing kits.
      ]. The reference profile for 9947A with the 16-plex is shown in Supplementary Table A.1. The MyFLq framework paper [
      • Van Neste C.
      • Vandewoestyne M.
      • Van Criekinge W.
      • Deforce D.
      • Van Nieuwerburgh F.
      My-Forensic-Loci-queries (MyFLq) framework for analysis of forensic STR data generated by massive parallel sequencing.
      ] also analyzed a second single contributor sample and two multiple person mixtures. Results for these samples are available on BaseSpace, together with the FASTQ data for anyone wishing to experiment with MyFLq.

      2.2 Launching MyFLq

      To produce the results for this report, MyFLq was launched from http://basespace.illumina.com/apps. A threshold of 0.5% was set to filter read groups with a lower abundance for further analysis. The loci set and the allele database were set to the MyFLq framework paper options, as shown in Fig. 2. The database contained all the alleles from the framework paper's four DNA samples, including sample 9947A [
      • Van Neste C.
      • Vandewoestyne M.
      • Van Criekinge W.
      • Deforce D.
      • Van Nieuwerburgh F.
      My-Forensic-Loci-queries (MyFLq) framework for analysis of forensic STR data generated by massive parallel sequencing.
      ]. The database consists of all sequences of the Powerplex® 16 alleles present in these four samples. For the other options the default values were used. Detailed information on these settings can be found in Supplementary Table A.2 or the online documentation. A BaseSpace project “FSIG” was made to which the results could be saved. Finally, the analysis was launched by clicking “Continue”.
      Figure thumbnail gr2
      Fig. 2Primary analysis settings. The setting “Select loci set” controls the configuration of the set of loci that will be analyzed in the dataset. Normally these are the same loci as in the applied STR multiplex PCR. With the “Select allele database” setting, the database with the reference alleles sequences can be configured. Both configurations can be provided as csv-files if they are not available in the dropdown menus. Settings on the right are general and can be used to tweak the results. They are described in Supplementary Table A.2.

      3. Results

      3.1 Initial sequence-based result display

      Fig. 3a shows the analysis result page, that can be found under the project folder where the analysis was saved. The initial display shows an interactive visual representation that should be interpreted as a sequence-based analysis rather than a length-based analysis. The different bars represent grouped allele sequences and are sorted according to length. Spacing is however not proportional and allele candidates of the same length are not stacked on top of each other, but rather side-by-side. A green bar is given to sequences that are present in the database, a red bar when not. The vertically adjustable gray transparent zone determines the threshold for which allele candidate bars with a lower abundance will not be withheld in the final profile. By default, it is set to 10%. Note that sequences with an abundance threshold lower than 0.5% (configurable) are already filtered during the analysis.
      Figure thumbnail gr3
      Fig. 3Initial display and proportionally sorted analysis result.

      3.2 Detailed allele candidate information

      When hovering over a bar, a detailed block of information is displayed for that allele candidate. An example is shown in Fig. 4. This information can be used to examine if the underlying sequence of the bar is either a true allele or erroneous sequence (stutter, sequencing- or PCR error). The title bar of the information block shows the locus name, and the database name of the allele candidate. When the allele is not present in the database, ‘NA’ together with the number of repeats relative to known alleles is shown between brackets. Locus statistics are summarized in the left column:
      • ‘Total reads’ stands for all reads that are classified under the locus.
      • ‘Filtered reads’ stands for the reads that are retained after the 0.5% abundance threshold filter. ‘Filtered reads’ are those that are actually shown in the allele bar.
      • ‘Total unique’ and ‘Filtered unique’ stand for the number of unique sequences, and are a measure of the number of possible allele candidates. In the example in Fig. 4 there are 7 filtered sequences with more than 0.5% of the reads.
      Figure thumbnail gr4
      Fig. 4Information blocks for true D8S1179 alleles. Their one-base sequence difference is indicated by dotted lines.
      Statistics for the current allele candidate are in the right column:
      • ‘Index’ is a unique reference index label assigned to each filtered unique sequence, starting at ‘1’ with the shortest sequence for this locus in the analysis. When two sequences have the same length, the smaller index number is assigned randomly.
      • ‘Abundance’ is the percentage of the filtered reads within the locus that belong to this candidate.
      • ‘Strand distribution’ is the balance between forward and reverse sequences supporting the candidate. Sequences in the report itself are always in forward direction, as determined by the loci set configuration file. In the example in Fig. 4, the strand distribution of the first sequence is 49.17% and of the second 48.75%. They are both close to 50% as is to be expected as a result from normal PCR and sequencing circumstances.
      • ‘Clean flanks’ is the percentage of perfect flanks compared to the allele database.
      The bottom part of the information block shows the region of interest of the allele candidate sequence together with related sequences from the same locus. Related sequences with up to two differences are shown; a difference being either one repeat number difference or one base pair difference. One difference is indicated by a relation degree “Ist” and two differences by “IInd”.
      Fig. 4 shows the two information blocks of the two true alleles from locus D8S1179 in an interesting example that shows the advantage of MPS over CE. For 9947A, CE results show only one peak at locus D8S1179, resulting in a profile with a homozygous allele 13 for D8S1179. Our analysis clearly shows two alleles that have the same length (corresponding to allele 13), but have a different intra-STR sequence when compared to each other. The information blocks support this heterozygous call; only a small portion of the reads are filtered for this locus, the number of unique reads are low and the abundance of the two allele candidates is approximately 50%. The percentage of clean flanks [
      • Van Neste C.
      • Vandewoestyne M.
      • Van Criekinge W.
      • Deforce D.
      • Van Nieuwerburgh F.
      My-Forensic-Loci-queries (MyFLq) framework for analysis of forensic STR data generated by massive parallel sequencing.
      ] in the candidate alleles sequences is also very high. All these parameters indicate that the sequencing and PCR error rate is low. In the part of the information blocks that shows the related sequences, the G ↔ A difference between the two alleles is shown. The two alleles are related to each other by a “Ist” order degree. Both alleles also have relations to erroneous sequences, in fact all other sequences discovered for this locus have a relation to one or both of the true alleles.
      Fig. 5 shows the information block for a candidate allele of locus Penta E. It is the only erroneous sequence that was not automatically filtered by the 10% default threshold. The information supports that this candidate allele should be disregarded. The putative allele length is one STR repeat unit smaller than the high abundant (47.40%) sequence with index 6, indicating that it might be stutter. Apart from this stutter there are no other sequence differences (Ist relation degree). Furthermore, the clean flank percentage is rather low (59.5%), indicating possible low quality sequences. An unexpected strand distribution of 100% implies that there are no complementary reads supporting the presence of this allele candidate. Removing this allele candidate is accomplished by unchecking the “in profile” check-box.
      Figure thumbnail gr5
      Fig. 5Unknown Penta E allele. The true allele with index 6 is indicated by the dotted line.

      3.3 Allele candidates proportionally sorted according to length

      After selecting the “Length-based analysis” check-box, all allele candidates are displayed proportionally, according to their actual length within the locus, as shown in Fig. 3. For each locus, the x-axis is adjusted to show the locus length starting from the shortest allele and ending at the longest allele. The threshold bar is no longer displayed because allele candidates with the same length are now stacked on top of each other, which creates one bar that shows the total abundance of all alleles with the same length within each locus. This representation resembles a CE profile. The example of the allele candidate in Fig. 5 now visually looks like a CE stutter peak based on the relative length and abundance difference as compared to the true allele.

      3.4 Final profile

      After reviewing the profile by setting the threshold to an appropriate value, and removing allele candidates of poor quality, pressing the “Make profile” button yields the final profile. This profile can then be used to query databases or compare to the profile of a sample of interest. Fig. 6 shows the final profile for sample 9947A_S1. Using the threshold of 10%, it has one Penta E allele 13 that is undetected relative to the known genotype (Table A.1). This allele is present in the data at an abundance of 8.85% and its corresponding green bar can be seen clearly in Fig. 3. The sub-optimal results of the pentanucleotide loci, Penta D and Penta E, were previously discussed in detail [
      • Van Neste C.
      • Vandewoestyne M.
      • Van Criekinge W.
      • Deforce D.
      • Van Nieuwerburgh F.
      My-Forensic-Loci-queries (MyFLq) framework for analysis of forensic STR data generated by massive parallel sequencing.
      ].
      Figure thumbnail gr6
      Fig. 6Final profile for sample 9947A_S1 with 10% threshold.

      4. Discussion

      We show how an MPS data-set can be analyzed using an easy-to-use graphical user interface, requiring a limited number of parameters and almost no bioinformatics expertise. The interactive visual representation of the results shows additional information when hovering over the alleles, allowing for in-depth analysis of the underlying sequences and the related statistics. For clarity of explanation we chose to display and discuss the analysis of a single contributor sample, but the MyFLq framework equally works on mixtures because no assumptions on mixture composition are made to perform the analysis. The main added value of MPS over CE indeed lay in the analysis of mixed and degraded samples [
      • Van Neste C.
      • Vandewoestyne M.
      • Van Criekinge W.
      • Deforce D.
      • Van Nieuwerburgh F.
      My-Forensic-Loci-queries (MyFLq) framework for analysis of forensic STR data generated by massive parallel sequencing.
      ]. With MPS, sequences can be analyzed more in depth to determine whether they are genuinely from one of the original contributors of a sample, or instead more likely to be the product of a PCR or sequencing error. Additionally, due to the ability to multiplex more loci than CE affords, broader genetic interrogation can be achieved in a single reaction, thus conserving precious samples.
      The reported results comprise only 16 loci, but MyFLq can run with any number of loci. When running MyFLq with a custom loci set, the primers of these loci can be imported. The allele database is not strictly necessary to run the program. In exploratory studies, for example if building a database of known alleles, MyFLq can be run with an empty allele database. The GitHub repository contains example files for users that need either a custom locus set or custom allele database.
      The used allele database was very small as it only compromised the alleles of the five contributors. Sequences that are currently not in the database are marked as red bars. These bars are very useful to visually monitor the noise level. In the future, with a larger database, it could be that erroneous sequences are nonetheless present in the database, as they could be true alleles for individuals that are not present in the sample. The solution to that problem could be to mark rare alleles (e.g. alleles with a population prevalence <1%) with a different color. The combination of unknown alleles and rare alleles would then indicate the level of noise. A further limitation of the current database is its nomenclature. Currently same-sized alleles get an arbitrary name within the database, which would make it difficult to perform searches in other databases without the original sequence. When an international nomenclature for MPS STR alleles has been established, it will be incorporated in MyFLq.
      When all allele candidates have been reviewed, the “Make profile” action generates a report with only the selected alleles. This is the profile that a forensic analyst can use to either store in a database, to query against a database, or for direct comparison to a known sample of interest. Future versions of the software will include possibilities to interact directly with sample databases. New feature requests can be made through the GitHub website.

      5. Conclusion

      MyFLq is the first open-source, web-based forensic MPS DNA analysis software with an easy-to-use graphical user interface. It can run natively on Illumina BaseSpace, or independently on a forensic laboratory's server. The possibility to run the program directly from the Illumina BaseSpace environment means no extensive bioinformatics skills are required.

      Competing interests

      C.V.N. participated in an internship program at Illumina, Inc. to provide feedback on building a native BaseSpace application to the Illumina developers.

      Acknowledgments

      Funding was provided by the Ghent University Multidisciplinary Research Partnership ‘Bioinformatics: from nucleotides to networks’
      The authors would also like to thank Illumina, Inc. for providing the data and making MyFLq easily accessible on their BaseSpace platform.

      Appendix A. Supplementary data

      The following are the supplementary data to this article:

      References

        • Fordyce S.L.
        • Avila-Arcos M.C.
        • Rockenbauer E.
        • Borsting C.
        • Frank-Hansen R.
        • Petersen F.T.
        • Willerslev E.
        • Hansen A.J.
        • Morling N.
        • Gilbert M.T.P.
        High-throughput sequencing of core STR loci for forensic genetic investigations using the Roche Genome Sequencer FLX platform.
        Biotechniques. 2011; 51: 127https://doi.org/10.2144/000113721
        • Kidd K.K.
        • Kidd J.R.
        • Speed W.C.
        • Fang R.
        • Furtado M.R.
        • Hyland F.C.L.
        • Pakstis A.J.
        Expanding data and resources for forensic use of SNPs in individual identification.
        Forensic Sci. Int. Genet. 2012; 6: 646-652https://doi.org/10.1016/j.fsigen.2012.02.012
        • McIver L.J.
        • Fondon III, J.W.
        • Skinner M.A.
        • Garner H.R.
        Evaluation of microsatellite variation in the 1000 Genomes Project pilot studies is indicative of the quality and utility of the raw data and alignments.
        Genomics. 2011; 97: 193-199https://doi.org/10.1016/j.ygeno.2011.01.001
        • Weber-Lehmann J.
        • Schilling E.
        • Gradl G.
        • Richter D.C.
        • Wiehler J.
        • Rolf B.
        Finding the needle in the haystack: differentiating “identical” twins in paternity testing and forensics by ultra-deep next generation sequencing.
        Forensic Sci. Int. Genet. 2014; 9: 42-46https://doi.org/10.1016/j.fsigen.2013.10.015
        • Gymrek M.
        • Golan D.
        • Rosset S.
        • Erlich Y.
        lobSTR: a short tandem repeat profiler for personal genomes.
        Genome Res. 2012; 22: 1154-1162https://doi.org/10.1101/gr.135780.111
        • Highnam G.
        • Franck C.
        • Martin A.
        • Stephens C.
        • Puthige A.
        • Mittelman D.
        Accurate human microsatellite genotypes from high-throughput resequencing data using informed error profiles.
        Nucleic Acids Res. 2013; 41https://doi.org/10.1093/nar/gks981
        • Warshauer D.H.
        • Lin D.
        • Hari K.
        • Jain R.
        • Davis C.
        • LaRue B.
        • King J.L.
        • Budowle B.
        STRait Razor: a length-based forensic STR allele-calling tool for use with second generation sequencing data.
        Forensic Sci. Int. Genet. 2013; 7: 409-417
        • Anvar S.Y.
        • van der Gaag K.J.
        • van der Heijden J.W.F.
        • Veltrop M.H.A.M.
        • Vossen R.H.A.M.
        • de Leeuw R.H.
        • Breukel C.
        • Buermans H.P.J.
        • Verbeek J.S.
        • de Knijff P.
        • den Dunnen J.T.
        • Laros J.F.J.
        TSSV: a tool for characterization of complex allelic variants in pure and mixed genomes.
        Bioinformatics. 2014; 30: 1651-1659https://doi.org/10.1093/bioinformatics/btu068
        • Van Neste C.
        • Vandewoestyne M.
        • Van Criekinge W.
        • Deforce D.
        • Van Nieuwerburgh F.
        My-Forensic-Loci-queries (MyFLq) framework for analysis of forensic STR data generated by massive parallel sequencing.
        Forensic Sci. Int. Genet. 2014; 9: 1-8https://doi.org/10.1016/j.fsigen.2013.10.012
        • Forcier J.
        • Bissex P.
        • Chun W.
        Python Web Development with Django.
        1st ed. Addison-Wesley Professional, 2008
        • Levin B.
        • Cheng H.
        • Reeder D.
        A human mitochondrial DNA standard reference material for quality control in forensic identification, medical diagnosis, and mutation detection.
        Genomics. 1999; 55: 135-146https://doi.org/10.1006/geno.1998.5513
        • Masibay A.
        • Mozer T.
        • Sprecher C.
        Promega Corporation reveals primer sequences in its testing kits.
        J. Forensic Sci. 2000; 45: 1360-1362