A genotype likelihood function for DNA mixtures

  • Benjamin Crysup
    Affiliations
    Center for Human Identification, University of North Texas Health Science Center, Fort Worth, TX, USA
  • August E. Woerner
    Correspondence
    Corresponding author at: Center for Human Identification, University of North Texas Health Science Center, Fort Worth, TX, USA.
    Affiliations
    Center for Human Identification, University of North Texas Health Science Center, Fort Worth, TX, USA

    Department of Microbiology, Immunology and Genetics, University of North Texas Health Science, USA
Published: September 15, 2022
DOI: https://doi.org/10.1016/j.fsigen.2022.102776

      Highlights

      • A biallelic likelihood function is presented for balanced and imbalanced mixtures.
      • The function can be used to deconvolve two-person mixtures, including when one of the genotypes is known.
      • The approach is compatible with modern imputation software and may permit kinship estimation on simple mixtures.

      Abstract

      The recent advent of genetic genealogy has renewed interest in genome-scale forensic analyses, of which kinship estimation is a critical component. Most genomic kinship estimators consider single nucleotide polymorphisms (SNPs), often leveraging the co-inheritance of shared alleles to inform their analyses. While current estimators cannot directly evaluate mixed samples, there exist well-established SNP-based kinship estimators tailored to challenged samples, including those from low-pass whole genome sequencing. As an example, several studies have shown remarkable success in imputing genotype posterior probabilities in low-template samples when linked sites are considered. Critical to these approaches is the ability to account for genotype uncertainty; the lack of an expression for a genotype likelihood in imbalanced mixtures has prevented their direct application to mixtures. This work develops such an expression. The formulation is fully compatible with genotype imputation software, suggesting a genomic pipeline that estimates genotype likelihoods, performs imputation, and then estimates kinship when the sample is a mixture. Further, when framed as an imbalanced mixture, the problem of mixture deconvolution reduces to the problem of genotyping mixed samples. Herein, the ability to genotype two-person mixtures is assessed through worked examples and in silico simulations. While certain mixture scenarios and classes of sites are inherently inseparable, simulations of read depths between 60 and 190 appear to produce likelihoods of sufficient magnitude to deconvolve two-person mixtures whenever the mixture fraction is moderately imbalanced. The described approach and results suggest a path forward for estimating the kinship coefficient (and similar inferences on relatedness) when the sample is a mixture.
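
To make the idea of a biallelic genotype likelihood for a two-person mixture concrete, the following is a minimal sketch and not the expression derived in the article. It assumes a simple binomial read-sampling model, a known mixture fraction (`frac_major`, hypothetical name), and a symmetric per-read error rate (`err`, hypothetical default of 0.01). Genotypes are coded as the number of alternate alleles (0, 1, or 2) carried by each contributor; all function names are illustrative.

```python
from itertools import product
from math import comb


def expected_alt_fraction(g_major, g_minor, frac_major, err=0.01):
    """Expected fraction of alternate-allele reads for a genotype pair.

    Assumption: each contributor supplies reads in proportion to its share
    of the mixture, and a read is miscalled with symmetric probability err.
    """
    p = frac_major * g_major / 2 + (1 - frac_major) * g_minor / 2
    return p * (1 - err) + (1 - p) * err


def mixture_genotype_likelihoods(alt_reads, depth, frac_major, err=0.01):
    """Binomial likelihood of the read counts under each of the 9 genotype pairs."""
    likelihoods = {}
    for g_major, g_minor in product(range(3), repeat=2):
        p = expected_alt_fraction(g_major, g_minor, frac_major, err)
        likelihoods[(g_major, g_minor)] = (
            comb(depth, alt_reads) * p**alt_reads * (1 - p) ** (depth - alt_reads)
        )
    return likelihoods


# Imbalanced mixture (70:30): 30 alternate reads out of 100.
imbalanced = mixture_genotype_likelihoods(30, 100, frac_major=0.7)
best_pair = max(imbalanced, key=imbalanced.get)

# Balanced mixture (50:50): several genotype pairs predict the same read fraction.
balanced = mixture_genotype_likelihoods(50, 100, frac_major=0.5)
```

Under these assumptions, the imbalanced case favors the pair in which the minor contributor is homozygous for the alternate allele and the major contributor carries none, because that pair predicts roughly 30% alternate reads. In the balanced case the pairs (0, 2), (1, 1), and (2, 0) all predict 50% alternate reads and receive identical likelihoods, which illustrates why some scenarios are inherently inseparable. When one contributor's genotype is known, the same table can be restricted to the pairs consistent with that genotype.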
