How to choose sets of ancestry informative markers: A supervised feature selection approach

Published: February 15, 2020. DOI: https://doi.org/10.1016/j.fsigen.2020.102259

      Highlights

      • We provide AIMsetfinder, a tool to systematically select ancestry informative markers (AIMs).
      • Simulations of human population structure can be used to assess the performance of AIM selection procedures.
      • 12 SNPs identified by AIMsetfinder suffice to correctly classify all African, European, East-Asian, and South-Asian individuals in the 1000 Genomes Project.

      Abstract

      Inference of the Biogeographical Ancestry (BGA) of a person or trace relies on three ingredients: (1) a reference database of DNA samples including BGA information; (2) a statistical clustering method; (3) a set of loci whose alleles segregate depending on geographical location, i.e. a set of so-called Ancestry Informative Markers (AIMs). We used the theory of feature selection from statistical learning to obtain AIMsets for BGA inference. Using simulations, we show that this learning procedure works in various cases and outperforms ad hoc methods that choose AIMs based on statistics such as FST or informativeness. Applying our method to data from the 1000 Genomes Project (excluding Admixed Americans), we identified an AIMset of 12 SNPs which gives a vanishing misclassification error on a continental scale, as do other published AIMsets. In fact, cross-validation shows that there exists a multitude of sets with performance comparable to that of the optimal AIMset. On a sub-continental scale, we find a set of 55 SNPs for distinguishing the five European populations. The misclassification error is reduced by a factor of two relative to published AIMsets, but it is still 30% and therefore too large to be useful in forensic applications.
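The idea of supervised feature selection for AIMs can be illustrated with a minimal sketch. It is not the AIMsetfinder implementation: it assumes a generic greedy forward selection wrapped around a naive Bayes classifier, scored by cross-validated misclassification error. All function names, the simulated allele frequencies, and the sample sizes are hypothetical and chosen only for illustration.

```python
import math
import random

def simulate(n_per_pop, freqs, rng):
    """Draw diploid genotypes (0, 1 or 2 allele copies) for each population."""
    X, y = [], []
    for pop, pop_freqs in enumerate(freqs):
        for _ in range(n_per_pop):
            X.append([(rng.random() < p) + (rng.random() < p) for p in pop_freqs])
            y.append(pop)
    return X, y

def allele_freqs(X, y, snps, n_pops):
    """Per-population allele frequencies at the chosen SNPs, with a pseudocount."""
    table = []
    for k in range(n_pops):
        rows = [x for x, label in zip(X, y) if label == k]
        table.append([(sum(r[j] for r in rows) + 1) / (2 * len(rows) + 2) for j in snps])
    return table

def predict(x, snps, table):
    """Naive Bayes: pick the population with the highest genotype log-likelihood."""
    scores = [sum(x[j] * math.log(p) + (2 - x[j]) * math.log(1 - p)
                  for j, p in zip(snps, pop_freqs)) for pop_freqs in table]
    return scores.index(max(scores))

def error_rate(train, test, snps, n_pops):
    """Fraction of test individuals assigned to the wrong population."""
    (X_tr, y_tr), (X_te, y_te) = train, test
    table = allele_freqs(X_tr, y_tr, snps, n_pops)
    return sum(predict(x, snps, table) != lab for x, lab in zip(X_te, y_te)) / len(y_te)

def cv_error(X, y, snps, n_pops, folds=5):
    """Cross-validated misclassification error for one candidate SNP set."""
    wrong = 0.0
    for f in range(folds):
        tr = [i for i in range(len(X)) if i % folds != f]
        te = [i for i in range(len(X)) if i % folds == f]
        wrong += len(te) * error_rate(([X[i] for i in tr], [y[i] for i in tr]),
                                      ([X[i] for i in te], [y[i] for i in te]),
                                      snps, n_pops)
    return wrong / len(X)

def forward_select(X, y, n_pops, max_snps):
    """Greedily add the SNP that lowers the cross-validated error the most."""
    selected, remaining = [], list(range(len(X[0])))
    while remaining and len(selected) < max_snps:
        best = min(remaining, key=lambda j: cv_error(X, y, selected + [j], n_pops))
        selected.append(best)
        remaining.remove(best)
    return selected

rng = random.Random(1)
n_pops, n_noise = 3, 20
# Hypothetical frequencies: SNP k is common only in population k;
# the remaining SNPs have the same frequency everywhere (uninformative).
shared = [rng.uniform(0.1, 0.9) for _ in range(n_noise)]
freqs = [[0.95 if j == k else 0.05 for j in range(n_pops)] + shared
         for k in range(n_pops)]
X_train, y_train = simulate(60, freqs, rng)
X_test, y_test = simulate(60, freqs, rng)

aimset = forward_select(X_train, y_train, n_pops, max_snps=3)
print("selected SNPs:", aimset)
print("held-out error:", error_rate((X_train, y_train), (X_test, y_test), aimset, n_pops))
```

The final error is measured on individuals that were not used during selection, because the cross-validated error that guided the search tends to be optimistic for the set it picked. An ad hoc alternative would rank each SNP by a single statistic such as FST and keep the top few, which ignores how the SNPs complement each other within a set.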
