Corresponding author at: American Registry of Pathology, 15245 Shady Grove Rd., Suite 335, Rockville, MD 20850, United States. Tel.: +1 301 257 0794.
Armed Forces DNA Identification Laboratory, 115 Purple Heart Dr., Dover AFB, DE 19902, United States
American Registry of Pathology, 120A Old Camden Rd., Camden, DE 19934, United States
University of Maryland, College Park, 8082 Baltimore Ave., College Park, MD 20740, United States
1 Present address: Michigan State Police, 333 S. Grand Ave., Lansing, MI 48909, United States. 2 Present address: Federal Bureau of Investigation, 2501 Investigation Parkway, Quantico, VA 22135, United States.
Forensic mitochondrial DNA (mtDNA) testing requires appropriate, high quality reference population data for estimating the rarity of questioned haplotypes and, in turn, the strength of the mtDNA evidence. Available reference databases (SWGDAM, EMPOP) currently include information from the mtDNA control region; however, novel methods that quickly and easily recover mtDNA coding region data are becoming increasingly available. Though these assays promise to both facilitate the acquisition of mitochondrial genome (mtGenome) data and maximize the general utility of mtDNA testing in forensics, the appropriate reference data and database tools required for their routine application in forensic casework are lacking. To address this deficiency, we have undertaken an effort to: (1) increase the large-scale availability of high-quality entire mtGenome reference population data, and (2) improve the information technology infrastructure required to access/search mtGenome data and employ them in forensic casework.
Here, we describe the application of a data generation and analysis workflow to the development of more than 400 complete, forensic-quality mtGenomes from low DNA quantity blood serum specimens as part of a U.S. National Institute of Justice funded reference population databasing initiative. We discuss the minor modifications made to a published mtGenome Sanger sequencing protocol to maintain a high rate of throughput while minimizing manual reprocessing with these low template samples. The successful use of this semi-automated strategy on forensic-like samples provides practical insight into the feasibility of producing complete mtGenome data in a routine casework environment, and demonstrates that large (>2 kb) mtDNA fragments can regularly be recovered from high quality but very low DNA quantity specimens. Further, the detailed empirical data we provide on the amplification success rates across a range of DNA input quantities will be useful moving forward as PCR-based strategies for mtDNA enrichment are considered for targeted next-generation sequencing workflows.
] for use by practitioners, these data and indeed all publicly available forensic mtDNA reference data only include information from the mtDNA control region (CR). Emerging technologies such as next-generation sequencing are capable of producing mtDNA coding region data from extremely low DNA quality and quantity forensic specimens [
]. At present, however, no suitable database of complete mitochondrial genomes (mtGenomes) is available for forensic queries. Most of the more than 15,000 entire mtGenome haplotypes available in GenBank have not been developed for forensic purposes or to forensic standards. Some contain errors, associated metadata is often incomplete and/or absent, raw electropherograms are unavailable for review and, in almost all cases, the datasets do not represent randomly sampled populations. Thus, before new methods and applications targeting entire mtGenome data can be implemented in routine forensic practice, high quality reference sequences that adhere to forensic standards are required [
The specific goals and objectives of our current National Institute of Justice (NIJ) funded databasing initiative are the production of 550 complete, high-quality mtGenomes spanning three U.S. population groups, and database structure and query modifications to EMPOP to accommodate entire mtGenome data. Here, we report on aspects of the data generation portion of this project, and the development of 433 forensic-quality mtGenome haplotypes from low template specimens. We describe the application of an automated mtGenome sequencing protocol [
] and multi-step data analysis workflow to samples with a very low quantity of DNA, and the steps taken to maintain high-throughput data production with minimal manual sample reprocessing. We assess the practical success of (a) the mtGenome protocol with those minor modifications, and (b) the overall data production strategy, on these forensic-like samples through evaluation of key processing metrics and results from critical data quality control checks.
2. Materials and methods
The samples used for this databasing effort were anonymized blood serum specimens from the Department of Defense Serum Repository [
Serum specimens from the Department of Defense Serum Repository: The Armed Forces Health Surveillance Center, U.S. Department of Defense, Silver Spring, MD [November 8, 2010; August 1, 2011; and October 20, 2011].
]. Since the DNA-containing blood components have been removed by centrifugation, only a small amount of cell-free DNA typically remains in blood serum. To assure the generation of forensic-quality mtGenome profiles from these high quality but low template specimens, and to avoid the types of errors found in some entire mtGenome data sets [
], and we employed a rigorous data review process. Automated pipetting was performed on a MICROLAB® STARlet for pre-PCR work, and either a Tecan Genesis® 2000 workstation (Tecan Group Ltd., San Jose, CA) or MICROLAB® STARplus instrument (Hamilton Robotics, Reno, NV) for post-PCR reaction set-up, using custom methods developed in-house for this project. An overview of our entire data production and review workflow is shown in Fig. S1.
Blood serum specimens were robotically transferred from tubes to 96-well plates, and DNA was extracted by a combination of robotic pipetting and manual centrifugation using the QIAamp 96 DNA Blood Kit (QIAGEN, Valencia, CA). Some extracts were quantified following extraction using an mtDNA qPCR assay [
], which provides relative quantitation values based upon comparison to a standard curve of mtDNA present in a known genomic DNA concentration. Thus, quantities reported in this paper reflect total genomic DNA quantities, not mtDNA quantities specifically.
Amplification of the complete mtGenome was performed according to the protocol described in Lyons et al. [
], with minor modifications to improve first-pass amplification success with extremely low DNA quantity samples. Extract input quantities for PCR were generally doubled (from 3 to 6 μL per 50 μL reaction) when qPCR results indicated concentrations below 3 pg/μL. In some instances, such as when sample extracts exhibited evidence of inhibition or to improve amplification success for one or two target fragments (amplicons 4 and 6) when extract DNA concentrations were unknown, Taq polymerase concentrations in the PCR reactions were doubled (from 2.5 to 5 units).
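The input adjustment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `pcr_input` and the constants are hypothetical, and the sketch treats the "generally doubled" rule as a strict threshold, which the text does not claim.

```python
# Hypothetical helper for illustration; not part of the published protocol.
LOW_CONC_THRESHOLD_PG_PER_UL = 3.0  # from the text: volume doubled below 3 pg/uL
STANDARD_VOLUME_UL = 3.0            # standard extract volume per 50 uL reaction
DOUBLED_VOLUME_UL = 6.0             # doubled extract volume for low-concentration extracts


def pcr_input(conc_pg_per_ul: float) -> tuple[float, float]:
    """Return (extract volume in uL, total DNA input in pg) for one PCR.

    Assumption: the doubling rule is applied strictly whenever the qPCR
    concentration is below the threshold.
    """
    if conc_pg_per_ul < 0:
        raise ValueError("concentration cannot be negative")
    if conc_pg_per_ul < LOW_CONC_THRESHOLD_PG_PER_UL:
        volume = DOUBLED_VOLUME_UL
    else:
        volume = STANDARD_VOLUME_UL
    return volume, conc_pg_per_ul * volume
```

For example, an extract at 1.0 pg/μL would receive 6 μL of extract (6 pg input), whereas an extract at 5.0 pg/μL would receive 3 μL (15 pg input).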
Amplification success was assessed by automated capillary electrophoresis on a QIAxcel instrument (QIAGEN), successfully amplified products were enzymatically purified, and each mtGenome was subsequently Sanger sequenced in 135 reactions using the protocol described in Lyons et al. [
]. Sequencing products were purified by gel filtration, dehydrated, and resuspended in formamide. Sequence detection was performed on an Applied Biosystems 3730 DNA Analyzer (Life Technologies, Applied Biosystems, Foster City, CA) using a 50 cm capillary array. All post-quantification pipetting steps were performed robotically with the exception of enzymatic purification, where automated pipetting of highly viscous reagents would have resulted in the waste of a large volume of enzyme. Sample placement during any necessary manual reprocessing was always performed with at least one, and sometimes two, witnesses.
The data review process we employed followed a strategy previously and successfully used for the production of high-quality mtDNA CR sequences, which included raw data review by no fewer than three distinct scientists at two laboratories (AFDIL and EMPOP), and electronic data transfer with two additional profile reviews [
] to confirm phylogenetic consistency across the eight amplicons. In addition, all private mutations, heteroplasmies and transversions were re-reviewed in the raw data. Lastly, final profile haplogroups were assigned using an automated, maximum likelihood-based tool, EMMA [
For a set of 242 blood serum extracts quantified prior to amplification, DNA quantities ranged from 0.00 to 777.64 pg/μL with an average of 14.91 pg/μL (s.d. 53.79). Thirty-three of these samples, or 13.6%, exhibited at least one amplification failure during the first-pass automated processing (Fig. 1). The vast majority (86.6%) of the amplification failures, and all but one instance in which multiple regions for the same sample failed to amplify, occurred when DNA input quantities were less than 10 pg. The average DNA quantity for samples with multiple amplification failures was 1.00 pg/μL (s.d. 0.80). At DNA input quantities equal to or greater than 10 pg, 99.4% of amplifications were successful. In terms of sample handling, to maintain a high rate of throughput and minimize manual reprocessing, extracts for which only a single region failed to amplify were re-amplified manually prior to sequencing, whereas samples for which more than one fragment failed to amplify were typically dropped and not processed further. Fig. 2 shows the number of samples dropped by DNA input quantity.
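A success rate of the kind reported above (successful amplifications as a fraction of those attempted at or above a given input quantity) could be tabulated as follows. This is a minimal sketch, not the authors' analysis code: the record layout and the function name are hypothetical, and it assumes one input quantity per sample and eight amplicons attempted per sample.

```python
def amplification_success_rate(records, min_input_pg=10.0, amplicons_per_sample=8):
    """Fraction of successful amplifications among samples at or above min_input_pg.

    records: iterable of (input_pg, failed_amplicons) tuples, one per sample
             (hypothetical layout chosen for this sketch).
    Returns None when no sample meets the input threshold.
    """
    attempted = 0
    failed = 0
    for input_pg, n_failed in records:
        if input_pg >= min_input_pg:
            attempted += amplicons_per_sample
            failed += n_failed
    if attempted == 0:
        return None
    return (attempted - failed) / attempted
```

The same function, called with a different `min_input_pg`, gives the rate for any other input cut-off.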
Manual reprocessing was also performed when the first pass robotic processing did not produce complete sequence coverage (defined as at least two strands of sequence data) across the entire mtGenome. In most instances the reprocessing involved manual sequencing from the original PCR products to fill in small gaps in the sequence coverage. However, when multiple new sequences from the same genome region were required, the sample was sometimes re-amplified to produce a better quality PCR product. For a large majority (70.9%) of a set of 433 low DNA quantity samples, the first pass of automated data generation produced complete sequence coverage across the entire mtGenome and no manual reprocessing was necessary. For 13.2% and 6.2% of the samples, respectively, minimal (defined as one or two additional sequencing reactions) or moderate (three to nine additional sequencing reactions) reprocessing was required to achieve the desired sequence coverage (Fig. 3). For 9.7% of samples more extensive reprocessing (ten or more manual sequencing reactions) was performed, and usually included complete re-amplification of one or more regions of the genome. An example of the typical sequence data quality produced for this project is shown in Fig. S2.
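The reprocessing categories defined above map directly onto a simple classification rule. The sketch below is illustrative only; the function names are hypothetical, and the thresholds are taken from the definitions in the preceding paragraph.

```python
from collections import Counter


def reprocessing_category(manual_reactions: int) -> str:
    """Classify a sample by the number of manual sequencing reactions it required."""
    if manual_reactions < 0:
        raise ValueError("reaction count cannot be negative")
    if manual_reactions == 0:
        return "none"
    if manual_reactions <= 2:
        return "minimal"    # one or two additional reactions
    if manual_reactions <= 9:
        return "moderate"   # three to nine additional reactions
    return "extensive"      # ten or more manual reactions


def summarize(manual_counts):
    """Return the percentage of samples in each category."""
    tally = Counter(reprocessing_category(n) for n in manual_counts)
    total = len(manual_counts)
    return {category: 100.0 * count / total for category, count in tally.items()}
```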
Initial results utilizing an earlier version of the Lyons et al. amplification strategy made clear that some of the exceptionally low template blood serum specimens required extensive reprocessing for amplicons 2 and 6 in particular. For instance, among the forty samples with PCR inputs less than 10 pg processed using the initial amplicon 2 PCR primers, twelve samples (30.0%) required reamplification and resequencing of that amplicon; and among the twenty-nine samples with PCR inputs less than 10 pg processed using the initial amplicon 6 PCR primers, eleven samples (37.9%) required reamplification and resequencing of the fragment. To increase the first pass success rates for these two amplicons, the PCR primer sets were redesigned partway through this databasing project. To assess success rates using the published strategy [
], all blood serum samples amplified prior to the PCR primer redesign were reconsidered without the amplicon 2 or 6 reprocessing requirements. This reduced the number of samples which required moderate or extensive manual sequencing from 15.9% to 10.2%, with only twenty of 433 samples (5.5%) requiring extensive reprocessing (Fig. 3).
The extent of manual sequencing required was also examined in comparison to PCR input DNA quantity for a set of 230 extracts (the 242 quantified extracts referenced above, minus the twelve samples which were not processed beyond amplification due to multiple amplification failures; Fig. 4). All nine samples which required extensive manual reprocessing and nearly all samples which required moderate manual sequencing had PCR input DNA quantities less than 50 pg. For the nine samples with DNA inputs less than 50 pg which required extensive reprocessing, most of the initial sequence data quality issues were caused by a failure of the post-amplification enzymatic purification which necessitated reamplification and complete manual resequencing of the fragment. Among the forty-three samples with input DNA quantities greater than 50 pg, only one required more than two manual reactions to achieve complete mtGenome sequence coverage. For these samples, the average number of additional sequences required was 0.33, which equates to approximately one manual reaction for every three haplotypes.
In addition to the more qualitative assessments of sequencing success described above, we also performed a quantitative evaluation of sequencing failure rates in comparison to input DNA quantity. For a qPCR-quantified set of 185 samples with no amplification failures, Sequence Scanner v 1.0 (Life Technologies, Applied Biosystems) was used to capture the electrophoretic signal intensities for 21,601 sequencing products detected on the 3730 DNA Analyzer (Life Technologies, Applied Biosystems). For these data, we defined a failed sequence as one with at least two of the four signal intensities below 100 relative fluorescence units (RFUs). To reflect the published protocol [
], sequences generated from PCR products developed using the initial amplicon 2 and 6 primer sets (discussed above) were excluded from the analysis.
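The failure criterion stated above (at least two of the four signal intensities below 100 RFU) can be expressed as a short check. This sketch assumes the four per-base intensities are available as a mapping keyed by base; that layout and the function name are hypothetical and do not describe Sequence Scanner output.

```python
RFU_THRESHOLD = 100  # from the text: intensities below 100 RFU count as low


def is_failed_sequence(signals, threshold=RFU_THRESHOLD):
    """Return True if at least two of the four base signal intensities are below threshold.

    signals: mapping of base to signal intensity, e.g. {"A": .., "C": .., "G": .., "T": ..}
             (hypothetical layout for this sketch).
    """
    if len(signals) != 4:
        raise ValueError("expected exactly four signal intensities")
    low_channels = sum(1 for value in signals.values() if value < threshold)
    return low_channels >= 2
```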
A scatter plot of the percentage of failed sequences at each PCR input DNA quantity is displayed in Fig. 5. For samples for which PCR DNA inputs were less than 50 pg, the average sequence failure rate was 2.51% (s.d. of 0.05), which equates to approximately three failed sequences per sample. Among samples for which PCR DNA inputs were greater than 50 pg, the average sequence failure rate was 0.82% (s.d. of 0.02); and only one of these thirty-nine samples had a sequence failure rate greater than 5.0%. The picture provided by these data is highly similar to that developed from the reprocessing data (Fig. 4). These two complementary analyses demonstrate that, using the published protocol [
] with the minor amplification modifications and sample handling strategy described here, sequencing was largely successful but variable when PCR input DNA quantities were less than 50 pg, and nearly always successful when DNA input quantities exceeded 50 pg.
Sequencing success/failure was also investigated in relation to QIAxcel-measured amplification product concentration. For the 2677 sequencing reactions performed from PCR product concentrations below 2.00 ng/μL/1000 bp, a clear relationship between sequencing failure and product concentration only emerged when the data were broadly categorized (Fig. 6). Both the percentage of failed sequences (defined by electrophoretic signal intensities, as described above) and the resequencing rate (calculated by comparing the number of manual sequences required to the number of sequences produced in the initial automated processing) were higher when PCR product concentrations were below 1.00 ng/μL/1000 bp as compared to product concentrations in the 1.01–2.00 ng/μL/1000 bp range. When product concentrations were greater than 1.00 ng/μL/1000 bp, the resequencing rate was only 0.37%. However, the more obvious trend observed across all of these lower amplification product concentrations was that sequencing failure was highly amplicon-specific. More than 90% of the 198 sequences with low signal intensities resulted from just two target regions: amplicon 4, with 68.0% of the sequencing failures, and amplicon 6, with 25.1% of the sequencing failures.
To summarize the performance of the automated protocol with the modifications described here across all 433 low DNA quantity samples, we calculated an overall resequencing rate: the number of manual sequences required in comparison to the 135 sequences generated per sample as part of the initial automated processing. When all manual sequence reprocessing was considered, the resequencing rate was 2.84%. However, when data from amplicons 2 and 6 prior to their redesign were excluded to reflect the published protocol design [
], the resequencing rate was 1.20%. This latter value reflects an average of 1.59 manual sequencing reactions required per sample to develop a complete, forensic-quality mtGenome haplotype from a successfully amplified, low template extract.
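The resequencing rate defined above is a simple ratio of manual reactions to the sequences generated in the initial automated pass. The following sketch shows the calculation; the function name is hypothetical, and the counts in the usage note are illustrative values, not figures from this study.

```python
SEQUENCES_PER_SAMPLE = 135  # from the text: sequencing reactions per mtGenome


def resequencing_rate(manual_reactions_total, n_samples, per_sample=SEQUENCES_PER_SAMPLE):
    """Manual sequencing reactions as a fraction of initial automated reactions."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return manual_reactions_total / (n_samples * per_sample)
```

With illustrative counts of 27 manual reactions across 10 samples, the denominator is 10 × 135 = 1350 automated reactions and the rate is 27/1350, or 2.0%.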
3.2 Data review
The use of a multi-amplicon protocol for mtDNA data generation and manual reprocessing carries some risk of sample swaps and other human errors. Further, amplification of a contaminant or co-amplification of a NUMT may be possible with the low DNA quantity serum specimens used in this project. For these reasons, meticulous, redundant review of the raw electropherogram data (following the strategy described in [
]) and post-review data quality control checks were critical aspects of our workflow.
Subsequent to the AFDIL raw data reviews, phylogenetic checks of the complete mtGenome profiles were performed as a quality control measure. A preliminary haplogroup was assigned to each haplotype on the basis of haplogroup-defining polymorphisms, and the sample haplotype was subsequently compared to a list of expected mutations for the haplogroup using PhyloTree [
]. All missing mutations (those expected based on the haplogroup but not observed in the sample haplotype) and private mutations (differences from the reference sequence that are not a part of the PhyloTree haplogroup definition) were investigated by reviewing the raw sequence data and the sample processing record, and any suspicious amplicon-based patterns were further compared to the complete mtDNA phylogeny. Among the 433 completed mtGenome haplotypes which have undergone phylogenetic evaluation, representing more than 3500 amplifications and nearly 60,000 sequencing reactions, zero instances of sample swaps or other data generation errors were identified.
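The definitions of missing and private mutations given above amount to two set differences between the observed haplotype and the expected haplogroup mutations. The sketch below illustrates that logic only; it does not query PhyloTree, the function name is hypothetical, and the variant labels in the usage note are placeholders rather than a real haplogroup definition.

```python
def phylogenetic_check(observed, expected):
    """Compare a sample haplotype to the mutations expected for its haplogroup.

    observed: variant labels called in the sample (differences from the reference).
    expected: variant labels expected for the assigned haplogroup.
    Returns (missing, private) as two sets.
    """
    observed = set(observed)
    expected = set(expected)
    missing = expected - observed   # expected for the haplogroup but not observed
    private = observed - expected   # observed but not part of the haplogroup definition
    return missing, private
```

For instance, with placeholder labels, `phylogenetic_check({"263G", "750G", "9999C"}, {"263G", "750G", "1438G"})` returns `{"1438G"}` as missing and `{"9999C"}` as private; both would then be checked against the raw sequence data.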
Following EMPOP examination of the raw data for each sample, a comparison of the AFDIL and EMPOP-generated mtGenome haplotypes (both developed by comparison to the rCRS [
]) was performed electronically. In instances of non-concordance the raw data were re-reviewed at both laboratories, and corrections based on mutual agreement were made to the haplotypes as necessary. From the 263 samples compared thus far (more than 4.3 million base pairs of sequence data), a discrepancy between the AFDIL and EMPOP haplotypes was identified in just eight samples. In four instances a point heteroplasmy was missed in the AFDIL data analysis; two cases represented indel alignment disparities between the AFDIL and EMPOP data reviews; and the remaining two discrepancies were due to manual electropherogram editing differences. In one instance resequencing from the original PCR product was performed to confirm a point heteroplasmy. In all cases the mtGenome haplotypes were corrected to result in 100% final concordance.
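An electronic concordance check of this kind can be sketched as a per-sample comparison of two independently generated variant lists. This is an assumption-laden illustration, not the AFDIL or EMPOP software: the data layout, function name and variant labels are hypothetical. The constant reflects the 16,569 bp length of the rCRS.

```python
RCRS_LENGTH_BP = 16569  # length of the rCRS reference sequence


def discordant_samples(lab_a, lab_b):
    """Return {sample_id: differing variant labels} for samples reported by both labs.

    lab_a, lab_b: dicts mapping sample_id to a set of variant labels
                  (hypothetical layout for this sketch).
    Only samples whose two haplotypes differ are included in the result.
    """
    differences = {}
    for sample_id in lab_a.keys() & lab_b.keys():
        diff = lab_a[sample_id] ^ lab_b[sample_id]  # symmetric difference
        if diff:
            differences[sample_id] = diff
    return differences
```

Each flagged sample would then go back to raw data review at both laboratories. As a scale check, 263 genomes × 16,569 bp is 4,357,647 bp, consistent with the "more than 4.3 million base pairs" stated above.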
] Sanger sequencing protocol combined with the sample handling strategy described and applied here reliably produced high quality data from very low DNA quantity specimens in the first pass of automated data generation, and most samples did not require any manual reprocessing to generate complete mtGenome haplotypes. Amplification was successful 99.4% of the time when DNA inputs were greater than 10 pg, and no PCR failures were observed at inputs greater than 50 pg. Sequencing success, assessed both in terms of sequencing failure (determined by electrophoretic signal) and the amount of reprocessing required to generate a complete haplotype, was variable but generally still high when PCR DNA input quantities were less than 50 pg. At PCR inputs exceeding 50 pg, an average of just 0.82% of sequencing reactions failed and only one manual sequencing reaction was required for every three haplotypes. At QIAxcel-measured PCR product concentrations less than 2 ng/μL/1000 bp, more than 90% of the sequencing failures were observed in just two target regions (amplicons 4 and 6). With regard to data review, the efficacy of automated processing combined with a rigorous review strategy in preventing errors with this multi-amplicon protocol was evident from the absence of problems detected at the stage of phylogenetic data evaluation. Further, few discordant profiles were identified upon cross-validation of the AFDIL and EMPOP reviews.
The amplification and sequencing success rates reported here demonstrate that it is feasible to generate forensic-quality complete mtGenome haplotypes in a routine casework environment from forensic-like (low template) specimens. The development of this large, thoroughly evaluated data set from blood serum samples provides clear evidence that amplicons exceeding 2000 base pairs can regularly be recovered from very low DNA quantity specimens; and the data also provide detailed information on both PCR and Sanger sequencing success rates across a range of qPCR-measured mtDNA quantities. The processing metrics detailed here may thus be useful to forensic practitioners when attempting to determine the specific mtDNA amplicons, assays or markers to pursue when DNA quantities are known and case sample extract volumes are limited. Additionally, the data provide an indication of the first-pass amplification success rates that could be expected with low DNA quantity specimens in a high-throughput environment if the PCR strategy were applied as an enrichment method for targeted next-generation sequencing of mtDNA.
In total, our NIJ-funded databasing effort has thus far produced 263 and 170 entire mtGenome haplotypes for the U.S. Caucasian and African-American population groups, respectively. The genomes will be published, and made publicly available in GenBank and searchable in EMPOP, upon completion of the project. Immediately, though, these high-quality data, produced via well-established and validated Sanger sequencing technology, will be used as an etalon dataset for a posteriori quality control of all mtGenome data evaluated by EMPOP prior to their acceptance for publication in Forensic Science International: Genetics and the International Journal of Legal Medicine [
]. Ultimately, the NIJ-funded project will not only yield high quality mtGenome data against which new sequences developed with both current and next-generation sequencing technologies can be measured, but it will also provide reliable, complete mtGenome reference data and associated software tools necessary for implementation of mtGenome testing in routine mtDNA casework.
Role of funding
This project was supported by Award No. 2011-MU-MU-K402 to Jodi A. Irwin, awarded by the National Institute of Justice, Office of Justice Programs, U.S. Department of Justice. The opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect those of the Department of Justice. The National Institute of Justice funding was administered by the American Registry of Pathology. The work leading to these results also received funding from the Austrian Science Fund (FWF) [P22880-B12] and was financially supported from the European Union Seventh Framework Programme (FP7/2007–2013) under grant agreement n° 285487. None of these entities had any role in study design; collection, analysis or interpretation of data; in the writing of this report; or in the decision to submit this paper for publication.
Conflict of interest
Acknowledgements
The authors thank Martin Bodner, Liane Fendt, Petra Kralj and Catarina Gomes Xavier for EMPOP data review, Odile Loreille for discussion, and Lt Col Laura Regan, Timothy McMahon, James Canik, Lanelle Chisholm, Shairose Lalani, Michael Fasano, COL Louis Finelli, Cynthia Thomas, Michael Parry, Richard Scheithauer and Michael Cummings for administrative and logistical support. We also thank two anonymous reviewers whose comments and suggestions for revision improved this paper. The opinions or assertions presented herein are the private views of the authors and should not be construed as official or as reflecting the views of the Department of Defense, its branches, the U.S. Army Medical Research and Materiel Command, the Armed Forces Medical Examiner System, the Federal Bureau of Investigation, the Michigan State Police or the U.S. Government. Commercial equipment, instruments and materials are identified to specify some experimental procedures. In no case does such identification imply a recommendation or endorsement by the U.S. Department of Defense, the U.S. Department of the Army, the Federal Bureau of Investigation, the Michigan State Police or the U.S. Government, nor does it imply that any of the materials, instruments or equipment identified are necessarily the best available for the purpose.
Appendix A. Supplementary data
The following are the supplementary data to this article:
☆This is an open-access article distributed under the terms of the Creative Commons Attribution-NonCommercial-No Derivative Works License, which permits non-commercial use, distribution, and reproduction in any medium, provided the original author and source are credited.