Research Article | Volume 59, 102719, July 2022

Techniques for estimating genetically variable peptides and semi-continuous likelihoods from massively parallel sequencing data

      Highlights

      • A bioinformatic pipeline is described to estimate genetically variable peptide profiles from whole genome sequencing data.
      • The pipeline is designed to consider either short- or long-read massively parallel sequencing data.
      • A semicontinuous likelihood that considers linkage and codon degeneracy is introduced.
      • The likelihood formulation is applied to single-source samples and mixtures.
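
As a rough illustration of why codon degeneracy and peptide boundaries matter when turning sequence variants into peptide profiles, the Python sketch below translates a toy coding sequence, digests it in silico, and lists the peptides that appear only when a SNP is applied. It is not the pipeline described in the article. The toy sequence, the function names, and the simplified trypsin rule (cleave after K or R unless the next residue is P, no missed cleavages) are all assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate an in-frame coding sequence into one-letter amino acids."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds) - 2, 3))


def tryptic_peptides(protein):
    """Simplified in-silico trypsin digest (assumed rule, no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def variant_peptides(ref_cds, pos, alt_base):
    """Peptides produced by the alternate allele but not by the reference."""
    alt_cds = ref_cds[:pos] + alt_base + ref_cds[pos + 1:]
    ref = set(tryptic_peptides(translate(ref_cds)))
    alt = set(tryptic_peptides(translate(alt_cds)))
    return sorted(alt - ref)


# Hypothetical toy coding sequence encoding GAQLEKSVTR.
REF_CDS = "".join(["GGT", "GCT", "CAA", "CTG", "GAA",
                   "AAA", "TCT", "GTT", "ACT", "CGT"])

print(variant_peptides(REF_CDS, 5, "C"))   # synonymous GCT>GCC: []
print(variant_peptides(REF_CDS, 22, "C"))  # missense V>A: ['SATR']
print(variant_peptides(REF_CDS, 6, "A"))   # Q>K creates a cleavage site: ['GAK', 'LEK']
```

The three calls show a SNP yielding zero, one, and two new peptides respectively. The last case arises because the substitution introduces a new cleavage site and splits one reference peptide in two.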

      Abstract

      Forensic genetic investigations typically rely on analysis of DNA for attribution purposes. There are times, however, when the amount and/or the quality of the DNA is limited, and thus little or no information can be obtained regarding the source of the sample. An alternative biochemical target that also contains genetic signatures is protein. One class of genetic signatures is protein polymorphisms that are a direct consequence of simple/single/short nucleotide polymorphisms (SNPs) in DNA. However, to interpret protein polymorphisms in a forensic context, certain complexities must be understood and addressed. These complexities include: 1) SNPs can generate 0, 1, or arbitrarily many polymorphisms in a polypeptide; and 2) as an object of expression that is modulated by alleles, genes and interactions with the environment, proteins may be present or absent in a given sample. To address these issues, a novel approach was taken to generate the expected protein alleles in a reference sample based on whole genome (or exome) sequence data and to assess the significance of the evidence using a haplotype-based semi-continuous likelihood algorithm that leverages whole proteome data. Converting the genomic information into proteomic information allows the zero-to-many relationship between SNPs and genetically variable peptides (GVPs) to be abstracted away. When viewed as a haplotype, many GVPs that correspond to the same SNP are equivalent to many SNPs in perfect linkage disequilibrium (LD). As long as the likelihood formulation correctly accounts for LD, the correspondence between the SNP and the proteome can be safely neglected. Tests were performed on simulated samples, including single-source samples and two-person mixtures, and the power of a classical semi-continuous likelihood was compared with that of one adapted to neglect drop-out. Additionally, summary statistics and a rudimentary set of decision guidelines were introduced to help identify mixtures from protein data.
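
To make the idea of a haplotype-based semi-continuous likelihood more concrete, the Python sketch below computes a likelihood ratio for one locus from peptide presence/absence data. It is a generic textbook-style formulation and not the algorithm from the article. The single-contributor setting, Hardy-Weinberg genotype weights, independent per-peptide drop-out and drop-in probabilities, the haplotype frequencies, and all names are assumptions chosen for illustration. Peptides that arise from the same variant are grouped into one haplotype, so they are never treated as independent markers when genotype frequencies are computed.

```python
from itertools import combinations_with_replacement


def evidence_prob(detected, expected, panel, dropout, dropin):
    """P(detected peptides | expected peptides), one factor per panel peptide."""
    prob = 1.0
    for pep in panel:
        if pep in expected:
            prob *= (1 - dropout) if pep in detected else dropout
        else:
            prob *= dropin if pep in detected else (1 - dropin)
    return prob


def likelihood_ratio(detected, poi, hap_freqs, dropout=0.3, dropin=0.01):
    """LR for 'the person of interest is the source' vs 'a random person is'.

    poi is a pair of haplotypes; each haplotype is a frozenset of peptides
    that are inherited together (perfect LD). hap_freqs maps haplotype to
    its assumed population frequency.
    """
    panel = set().union(*hap_freqs)
    numerator = evidence_prob(detected, poi[0] | poi[1], panel, dropout, dropin)
    denominator = 0.0
    for h1, h2 in combinations_with_replacement(list(hap_freqs), 2):
        # Hardy-Weinberg weight: p^2 for homozygotes, 2pq for heterozygotes.
        weight = hap_freqs[h1] * hap_freqs[h2] * (1 if h1 == h2 else 2)
        denominator += weight * evidence_prob(
            detected, h1 | h2, panel, dropout, dropin)
    return numerator / denominator


# Hypothetical locus: one reference peptide, and two variant peptides that
# stem from the same SNP and therefore form a single haplotype.
H_REF = frozenset({"GAQLEK"})
H_ALT = frozenset({"GAK", "LEK"})
HAP_FREQS = {H_REF: 0.8, H_ALT: 0.2}  # assumed frequencies

detected = {"GAQLEK", "GAK", "LEK"}
lr_het = likelihood_ratio(detected, (H_REF, H_ALT), HAP_FREQS)
lr_hom = likelihood_ratio(detected, (H_REF, H_REF), HAP_FREQS)
print(lr_het, lr_hom)
```

With these toy numbers the heterozygous person of interest is supported (ratio above 1), while a homozygous-reference person of interest is not, because both variant peptides would then have to be explained by drop-in.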

      Published in Forensic Science International: Genetics.
