Optimized variant calling for estimating kinship

Published: September 30, 2022
DOI: https://doi.org/10.1016/j.fsigen.2022.102785

      Highlights

      • Two bioinformatic pipelines are described that characterize short variants at the level of the genome.
      • A set of filters is produced for the estimation of kinship.
      • When filtered, both BCFtools and DeepVariant produce accurate assessments of kinship using as little as 500 pg of DNA.
      • Accurate genotyping and accurate whole genome characterization for kinship estimation are related yet distinctly different.

      Abstract

      One of the fundamental goals of forensic genetics is sample attribution, i.e., determining whether an item of evidence can be associated with some person or persons. The most common scenario involves a direct comparison, e.g., between DNA profiles from an evidentiary item and a sample collected from a person of interest. Less common is an indirect comparison in which kinship is used to potentially identify the source of the evidence. Because so much genetic information is lost between generations, sampling a limited set of loci may not provide enough resolution to accurately resolve a relationship. Whole genome techniques instead sample the entirety of the genome, or a sufficiently large portion of it, and as such may effect better relationship determinations. While relatively common in other areas of study, whole genome techniques have only begun to be explored in the forensic sciences. Accordingly, bioinformatic pipelines are introduced for estimating kinship by massively parallel sequencing of whole genomes, using approaches adapted from the medical and population genomic literature. The pipelines are designed to characterize a person’s entire genome, not just some set of targeted markers. Two variant callers are considered, contrasting a classical variant calling algorithm (BCFtools) with a more modern deep convolutional neural network (DeepVariant). A bioinformatic pipeline specific to each variant caller is introduced and evaluated in a titration series. Filters and thresholds are then optimized specifically for estimating kinship as determined by the KING-robust algorithm. With the appropriate filtering and thresholds in place, both tools perform similarly: DeepVariant tends to produce more accurate genotypes, though the types of errors it makes tend to yield slightly less accurate overall estimates of relatedness.
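
To make the kinship step concrete, the following is a minimal sketch of one commonly cited form of the KING-robust estimator (Manichaikul et al., 2010), computed from heterozygote counts: (N_AaAa - 2 N_AAaa) / (N_Aa(i) + N_Aa(j)). It is an illustration only and is not taken from the authors' pipelines. The function name, the encoding of genotypes as alternate-allele counts (0, 1, 2), and the use of None for missing or filtered calls are all assumptions made for this example.

```python
from typing import Optional, Sequence


def king_robust_kinship(
    g1: Sequence[Optional[int]],
    g2: Sequence[Optional[int]],
) -> float:
    """Kinship estimate from biallelic genotypes of two samples.

    Genotypes are alternate-allele counts (0, 1, 2); None marks a
    missing or filtered call (an encoding assumed for this sketch).
    """
    if len(g1) != len(g2):
        raise ValueError("genotype vectors must cover the same sites")

    het_het = 0   # sites where both samples are heterozygous
    opp_hom = 0   # sites where the samples are opposite homozygotes
    het_1 = 0     # heterozygous sites in sample 1
    het_2 = 0     # heterozygous sites in sample 2

    for a, b in zip(g1, g2):
        # Only sites called in both samples contribute.
        if a is None or b is None:
            continue
        if a == 1:
            het_1 += 1
        if b == 1:
            het_2 += 1
        if a == 1 and b == 1:
            het_het += 1
        elif {a, b} == {0, 2}:
            opp_hom += 1

    denom = het_1 + het_2
    if denom == 0:
        raise ValueError("no heterozygous sites shared between samples")
    return (het_het - 2 * opp_hom) / denom


if __name__ == "__main__":
    sample = [0, 1, 2, 1, 1]
    # A sample compared with itself gives the self-kinship value.
    print(king_robust_kinship(sample, sample))
```

Because the estimator depends only on heterozygous calls and opposite-homozygote calls, different kinds of genotyping error can shift it in different directions, which is consistent with the abstract's point that genotype accuracy and kinship accuracy are related but distinct.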
