Corresponding author at: CAS Key Laboratory of Genomic and Precision Medicine, Beijing Institute of Genomics, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Beijing, China.
CAS Key Laboratory of Genomic and Precision Medicine, Beijing Institute of Genomics, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Beijing, China; Department of Genetic Identification, Erasmus MC University Medical Center Rotterdam, Rotterdam, the Netherlands
CAS Key Laboratory of Genomic and Precision Medicine, Beijing Institute of Genomics, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Beijing, China
Department of Internal Medicine, Erasmus MC University Medical Center Rotterdam, Rotterdam, the Netherlands; Department of Epidemiology, Erasmus MC University Medical Center Rotterdam, Rotterdam, the Netherlands
Department of Pediatrics, Division of Endocrinology, Sophia Children’s Hospital, Erasmus MC University Medical Center Rotterdam, Rotterdam, the Netherlands
• Current forensic DNA phenotyping focuses on pigmentation traits.
• Update on DNA predictability of tall stature in Europeans is presented.
• 689 SNPs provided AUC of 0.79, while a subset of 412 SNPs achieved 0.76.
• New models improved prediction accuracy compared to previous ones.
Abstract
Predicting adult height from DNA has important implications in forensic DNA phenotyping. In 2014, we introduced a prediction model consisting of 180 height-associated SNPs based on data from 10,361 Northwestern Europeans enriched with tall individuals (770 individuals > 1.88 standard deviations), which yielded a mid-ranged accuracy (AUC = 0.75 for binary prediction of tall stature and R2 = 0.12 for quantitative prediction of adult height). Here, we provide an update on DNA-based height predictability considering an enlarged list of subsequently published height-associated SNPs using data from the same set of 10,361 Europeans. A prediction model based on the full set of 689 SNPs showed an improved accuracy relative to previous models for both tall stature (AUC = 0.79) and quantitative height (R2 = 0.21). A feature selection analysis revealed a subset of the 412 most informative SNPs, and the corresponding prediction model retained most of the accuracy (AUC = 0.76 and R2 = 0.19) achieved with the full model. Overall, our study empirically exemplifies that the accuracy of predicting human appearance phenotypes with very complex underlying genetic architectures, such as adult height, can be improved by increasing the number of phenotype-associated DNA variants. Our work also demonstrates that a careful sub-selection allows for a considerable reduction in the number of DNA predictors while achieving prediction accuracy similar to that provided by the full set. This is forensically relevant because of restrictions on the number of SNPs that can be analyzed simultaneously with forensically suitable DNA technologies in the current days of targeted massively parallel sequencing in forensic genetics.
] is a fast-developing subfield of forensic genetics aiming at inferring externally visible information (appearance, bio-geographic ancestry, and chronological age) of an unknown crime scene sample donor directly from DNA. Such information can assist police investigations in finding unknown perpetrators of crime in cases where the standard forensic STR-profiling is non-informative due to the lack of known suspects. So far, the most successful FDP examples are restricted to human pigmentation traits, which is explained by the presence of major gene effects, so that statistical models consisting of a limited number of SNPs can provide highly accurate prediction results [
]. Because SNPs with larger effects are easier to identify in gene mapping studies, and prediction tools based on a limited number of SNPs can easily be developed, FDP systems for eye, hair and skin colour prediction from trace DNA have already been developed and forensically validated [
Widening appearance DNA prediction beyond pigmentation traits is generally troubled by the absence of major gene effects in non-pigmentation appearance traits, thus requiring a much larger number of DNA predictors due to their small phenotypic effects [
]. Recent progress illustrated, as is theoretically expected, that when genome-wide association studies (GWASs) based on increased sample size are applied for appearance traits, larger numbers of genome-wide significant SNPs with small effects are found, such as shown for hair structure [
] that when the enlarged numbers of identified SNPs are applied in prediction studies, they provide increased prediction accuracies compared to earlier models based on fewer numbers of SNPs [
]. However, for having sufficient statistical power to identify large numbers of associated SNPs with small effects, GWAS meta-analyses with very large combined sample size are required, which until recently were not available for any appearance traits, with the notable exception of adult height.
Adult height is characterized by a high degree of heritability estimated at about 80% [
]. To date, four sizable GWAS meta-analyses on adult height with increasing sample size have been conducted by the international Genetic Investigation of Anthropometric Traits (GIANT) Consortium [
]. These studies have demonstrated that the genetic complexity of adult height is extremely high, involving many hundreds and expectedly thousands of independently contributing genetic loci characterized by common SNPs with small to very small phenotypic height effects [
] and described the performance of a prediction model consisting of 180 height-associated SNPs previously identified by the first GIANT height GWAS meta-analysis consisting of 183,727 Europeans [
]. This model provided a prediction accuracy expressed by the area under the receiver operating characteristic curve (AUC) of 0.75 (95% CI 0.72-0.79) for binary tall stature prediction and R2 of 0.12 (95% CI 0.10-0.14) for quantitative (full-range) height prediction [
]. In parallel, the second GWAS meta-analysis was published by the GIANT Consortium in 2014 based on a largely increased sample size consisting of 253,228 Europeans, which identified 697 independently contributing genetic loci [
]. In the present study, we use these 697 genome-wide height associated SNPs to update the capacity of DNA-based predictability of adult height employing the same set of 10,361 Dutch Europeans we used earlier [
From the records of the Division of Pediatric Endocrinology at the Erasmus University Medical Center, Sophia Children’s Hospital, we identified former patients who attended this clinic for evaluation of tall stature. Eligible subjects were traced using municipal registries and invited by mail to participate in this study. The height of all participants was above +1.88 standard deviation score (SDS) according to Dutch standards (http://www.tno.nl/groei), which corresponds to the 3% upper tail of the height distribution in Dutch adults, approximately >195 cm in men and >180 cm in women at age 30, after correcting for secular trend [
]. Details regarding the inclusion criteria, height phenotyping, microarray genotyping, SNP imputation and quality controls have been described in our previous study [
]. After all genomic and phenotypic quality controls, the current study included 462 unrelated Dutch tall individuals. The Dutch tall stature study has been approved by the Medical Ethics Committee of the Erasmus MC (registration number MEC-2005-091) and all study participants provided written informed consent.
2.2 The Rotterdam study (RS)
The Rotterdam Study is a prospective cohort study ongoing since 1990 in the city of Rotterdam in The Netherlands [
]. After genomic and phenotypic quality controls, the current study included 9,899 participants from the RS. Tall stature was defined as sex- and age-adjusted height residuals >1.88 standard deviations (308 tall individuals). After merging with the Dutch tall individuals, the current study includes a total of 10,361 individuals, comprising 770 tall (>1.88 standard deviations) and 9,591 normal-height individuals, and 2,530,557 autosomal SNPs.
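As a quick plausibility check on the threshold, the following minimal sketch computes the upper-tail probability above +1.88 standard deviations, assuming a normal distribution, and compares it with the share of tall individuals in the merged sample (770 of 10,361). The helper name `upper_tail` is hypothetical and serves only this illustration.

```python
import math

def upper_tail(z: float) -> float:
    """Upper-tail probability P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Expected share above +1.88 SD if residuals were normally distributed.
tail = upper_tail(1.88)

# Observed share of tall individuals in the merged sample (numbers from the text).
observed = 770 / 10361

print(f"expected tail under normality: {tail:.4f}")
print(f"observed share of tall individuals: {observed:.4f}")
```

Under normality the tail above +1.88 SD is about 3%, whereas tall individuals make up roughly 7.4% of the merged sample, which reflects the enrichment with tall individuals.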
The Rotterdam Study has been approved by the Medical Ethics Committee of the Erasmus MC (registration number MEC 02.1015) and by the Dutch Ministry of Health, Welfare and Sport (Population Screening Act WBO, license number 1071272-159521-PG) and all study participants provided written informed consent. The Rotterdam Study has been entered into the Netherlands National Trial Register (NTR; www.trialregister.nl) and into the WHO International Clinical Trials Registry Platform (ICTRP; www.who.int/ictrp/network/primary/en/) under shared catalogue number NTR6831.
The Rotterdam samples used here represent a small fraction (3.9%) of the samples used for discovery purposes by Wood et al. [
Sex- and age-adjusted height residuals were treated as the quantitative phenotype, and individuals with height residuals greater than 1.88 standard deviations were classified as having tall stature. The height residuals, $\varepsilon$, were derived using linear regression, i.e., $\text{height} = \beta_0 + \beta_1 \cdot \text{sex} + \beta_2 \cdot \text{age} + \varepsilon$. We calculated the polygenic scores (also called weighted allele sums) for all individuals according to their genotypes of the 689 height-associated SNPs that survived quality control and the regression beta values reported by Wood et al. [
]. The polygenic score was used to fit linear models for quantitative height prediction and logistic models for binary tall stature prediction in a randomly selected 80% training sample. We then applied the models to predict quantitative height and binary tall stature in the remaining 20% testing sample, repeated this for 1,000 replicates, and estimated the mean accuracy parameters and the 5%–95% boundary values. For quantitative prediction the accuracy parameters included R2 and mean absolute error (MAE). The R2 value equals the square of the correlation (r) between the predicted and observed values, which corresponds to the proportion of the sex- and age-adjusted height residual variance explained by the SNP predictors. The MAE is defined as $\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - x_i|$, where $y_i$ is the predicted and $x_i$ is the observed value for a given sample, and n is the number of samples. For binary prediction the accuracy parameters included the Area Under the Receiver Operating Characteristic (ROC) Curve, or AUC, Sensitivity, Specificity, Positive Predictive Value (PPV), and Negative Predictive Value (NPV). The AUC is the area under the ROC curve and serves as an overall estimate of the prediction accuracy of a binary classifier, ranging from 0.5, representing a total lack of prediction, to 1.0, representing completely accurate prediction. Sensitivity, specificity, PPV, and NPV were derived from a 2 by 2 confusion table consisting of the numbers of true positives, true negatives, false positives and false negatives according to standard formulas, where a ‘positive’ prediction was defined as a predicted probability > 0.5.
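The evaluation procedure described above can be sketched as follows. This is a minimal illustration on simulated data, not the study data: the genotype matrix, the per-allele weights standing in for published betas, the number of replicates (20 instead of 1,000), and all function names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in data (NOT the study data): genotypes coded 0/1/2 and
# hypothetical per-allele weights playing the role of published GWAS betas.
n, m = 5000, 50
geno = rng.integers(0, 3, size=(n, m)).astype(float)
beta = rng.normal(0.0, 0.3, size=m)
height = geno @ beta + rng.normal(0.0, 2.0, size=n)
resid = (height - height.mean()) / height.std()  # stand-in for adjusted residuals
tall = (resid > 1.88).astype(float)              # binary tall stature
score = geno @ beta                              # polygenic score (weighted allele sum)
score = (score - score.mean()) / score.std()     # rescaling only, for numerical stability

def fit_logistic(x, y, iters=25):
    """Newton-Raphson fit of logit P(y=1) = a + b*x; returns (a, b)."""
    X = np.column_stack([np.ones_like(x), x])
    w = np.zeros(2)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        w += np.linalg.solve(hess, X.T @ (y - p))
    return w

def auc(y, s):
    """AUC via the rank-sum identity; assumes no tied scores."""
    ranks = np.empty(len(s))
    ranks[np.argsort(s)] = np.arange(1, len(s) + 1)
    n_pos = y.sum()
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def safe_div(a, b):
    """Ratio a/b, or NaN when the denominator is zero."""
    return a / b if b > 0 else float("nan")

results = []
for _ in range(20):  # the text describes 1,000 replicates
    idx = rng.permutation(n)
    tr, te = idx[: int(0.8 * n)], idx[int(0.8 * n):]
    # Quantitative prediction: linear model on the polygenic score.
    slope, intercept = np.polyfit(score[tr], resid[tr], 1)
    pred = intercept + slope * score[te]
    r2 = np.corrcoef(pred, resid[te])[0, 1] ** 2
    mae = np.mean(np.abs(pred - resid[te]))
    # Binary prediction: logistic model on the polygenic score.
    a, b = fit_logistic(score[tr], tall[tr])
    logit = a + b * score[te]
    call = 1.0 / (1.0 + np.exp(-logit)) > 0.5
    truth = tall[te] == 1
    tp, fp = np.sum(call & truth), np.sum(call & ~truth)
    tn, fn = np.sum(~call & ~truth), np.sum(~call & truth)
    results.append([auc(tall[te], logit), r2, mae,
                    safe_div(tp, tp + fn), safe_div(tn, tn + fp),
                    safe_div(tp, tp + fp), safe_div(tn, tn + fn)])

res = np.array(results, dtype=float)
mean_auc, mean_r2, mean_mae, sens, spec, ppv, npv = np.nanmean(res, axis=0)
lo_auc, hi_auc = np.percentile(res[:, 0], [5, 95])
print(f"AUC {mean_auc:.2f} ({lo_auc:.2f}-{hi_auc:.2f}), R2 {mean_r2:.2f}, MAE {mean_mae:.2f}")
print(f"Sens {sens:.2f}, Spec {spec:.2f}, PPV {ppv:.2f}, NPV {npv:.2f}")
```

Because tall stature is rare, a probability cut-off of 0.5 tends to produce few positive calls, so sensitivity can be low even when the AUC is high; the MAE here is in standardized units because the simulated residuals are standardized.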
Because forensic genetics has a technical need to keep the number of DNA markers to a minimum, so that ideally they can all be analyzed simultaneously, we conducted a feature selection analysis in the training set to select a subset of the SNPs most informative for tall stature. Backward stepwise logistic regression analysis was conducted according to the rank of the Akaike Information Criterion (AIC), which is computed from the number of SNPs (k), the sample size (n), and the fitted values. The SNP set with the minimal AIC value was considered the theoretically optimal subset for prediction. This SNP subset was then used to construct linear and logistic models and to predict quantitative height and binary tall stature in the testing set.
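To illustrate the idea of AIC-guided backward selection, the sketch below uses the standard definition AIC = 2k − 2 ln L, where L is the maximized likelihood of the logistic model; this may differ in form from the expression used in the study. The data are simulated, the effect sizes are invented, and the function names `fit_loglik` and `aic` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated stand-in data: 8 SNPs, of which only the first 3 affect the outcome.
n, m = 1500, 8
geno = rng.integers(0, 3, size=(n, m)).astype(float)
true_beta = np.array([0.8, 0.6, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0])
lin = -2.0 + (geno - 1.0) @ true_beta
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-lin))).astype(float)

def fit_loglik(X, y, iters=30):
    """Fit logistic regression by Newton-Raphson; return the maximized log-likelihood."""
    X1 = np.column_stack([np.ones(len(y)), X])
    w = np.zeros(X1.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X1 @ w))
        # Tiny ridge term keeps the Hessian invertible.
        hess = (X1 * (p * (1.0 - p))[:, None]).T @ X1 + 1e-8 * np.eye(X1.shape[1])
        w += np.linalg.solve(hess, X1.T @ (y - p))
    p = np.clip(1.0 / (1.0 + np.exp(-X1 @ w)), 1e-12, 1.0 - 1e-12)
    return np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def aic(cols):
    """Standard AIC = 2k - 2 ln L, with k the number of SNPs in the model."""
    return 2 * len(cols) - 2.0 * fit_loglik(geno[:, cols], y)

# Backward elimination: at each step drop the SNP whose removal gives the lowest AIC.
current = list(range(m))
path = [(aic(current), tuple(current))]
while len(current) > 1:
    candidates = [(aic([c for c in current if c != j]), j) for j in current]
    step_aic, drop = min(candidates)
    current.remove(drop)
    path.append((step_aic, tuple(current)))

# The subset with the minimal AIC along the elimination path is retained.
best_aic, best_set = min(path)
print("selected SNP indices:", best_set)
print("AIC of selected subset:", round(best_aic, 1))
```

In this toy setting the three SNPs with simulated effects are expected to survive the elimination, while most of the uninformative SNPs are dropped because removing them lowers the AIC penalty without reducing the likelihood much.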
3. Results
Among the 697 independent genome-wide significant height-associated SNPs reported by Wood et al. [
], 689 passed quality control in our study. Among these, 634 SNPs (92.0%) had allele effects in the same direction as reported by Wood et al., 308 SNPs (44.7%) showed nominally significant association (p < 0.05) with quantitative height and 178 SNPs (25.8%) showed nominally significant association (p < 0.05) with binary tall stature in our dataset (Table S1). The polygenic scores derived using the 689 SNPs and the regression betas from Wood et al. [
] showed significant positive correlations with sex- and age-adjusted height residuals (r = 0.46, p < 1 × 10^−300), and showed a significant increase from non-tall (mean = 0.41) to tall individuals (mean = 0.94, p < 1 × 10^−300, Fig. 1A).
Fig. 1. (A) Density plot of polygenic scores in 770 tall and 9,591 non-tall Dutch Europeans. (B) Relationship between number of SNP predictors and prediction accuracy for binary tall stature prediction in 770 tall and 9,591 non-tall Dutch Europeans. The dashed red line marks the cut-off at 412 SNPs we used as best-fit model, because further increasing the number of SNP predictors in the model did not result in increased AUC values (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article).
To provide the null distribution of the prediction results, we randomly selected 689 SNPs over the genome, estimated their effects (linear regression beta) in the training set, constructed polygenic scores, built linear and logistic models, and predicted binary tall stature and quantitative height in the testing set. This set of random 689 SNPs provided a mean AUC value of 0.53, a mean R2 of 0.001, and a mean MAE of 9.56 cm after 1000 replicates (Table 1). These estimates were close to the expected values under the null hypothesis of no predictive power, and were similar to our previous estimates using the 180 randomly selected SNPs [
] (Table 1). In other words, 689 SNPs randomly selected from the genome provided no predictive value for tall stature and for quantitative height, as may be expected.
Table 1. Accuracy for predicting tall stature and quantitative adult height using different SNP sets in 10,361 Dutch Europeans.
Columns AUC to PPV: binary tall stature prediction. Columns R2 to the final U95: quantitative height prediction.

| SNP list | N | AUC | L95 | U95 | Sens | Spec | NPV | PPV | R2 | L95 | U95 | MAE | L95 | U95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Random set | 180 | 0.52 | 0.48 | 0.55 | 1.00 | 0.00 | 0.00 | 0.93 | 0.00 | 0.00 | 0.01 | 9.54 | 5.45 | 84.20 |
| Random set | 689 | 0.53 | 0.47 | 0.58 | 0.73 | 0.53 | 0.65 | 0.65 | 0.00 | 0.00 | 0.00 | 9.56 | 5.49 | 49.47 |
| Lango Allen et al. 2010 | 180 | 0.75 | 0.72 | 0.99 | 0.01 | 0.76 | 0.33 | 0.93 | 0.12 | 0.10 | 0.14 | 5.25 | 5.07 | 5.41 |
| Wood et al. 2014 | 689 | 0.79 | 0.75 | 0.98 | 0.01 | 0.74 | 0.55 | 0.93 | 0.21 | 0.18 | 0.24 | 5.00 | 4.85 | 5.16 |
| Feature selection from Wood et al. 2014 | 412 | 0.76 | 0.73 | 0.99 | 0.03 | 0.69 | 0.15 | 0.93 | 0.19 | 0.17 | 0.22 | 5.09 | 4.93 | 5.31 |

N: number of SNPs used in the model; AUC: the Area Under the Receiver Operating Characteristic Curve; Sens: Sensitivity; Spec: Specificity; NPV: Negative Predictive Value; PPV: Positive Predictive Value; MAE: mean absolute error; L95 and U95: the 95% lower and upper boundary values from 1000 replicates of 80%-20% cross-validations.
] showed fairly accurate prediction for tall stature (AUC = 0.79, 95% CI: 0.75-0.82, Table S2) and also demonstrated predictive value for quantitative height (R2 = 0.21, 95% CI: 0.18-0.24; MAE = 5.00 cm, 95% CI: 4.85–5.16, Table 1). Notably, these estimates were improved compared to our previous models based on the 180 height-associated SNPs identified by Lango Allen et al. [
Next, we tested for the trade-off between the number of predictive DNA markers and the model performance. To this end, we conducted a feature selection analysis using stepwise logistic regression in the training set, and ranked all SNPs according to their contribution to the model performance (Fig. 1B). From this analysis, a model consisting of 412 SNPs is proposed as the best-fitting model with the minimal AIC value (Fig. 1B). When applied in the testing set, this 412-SNP model achieved slightly lower prediction accuracies than the full 689-SNP model (AUC = 0.76, 95% CI: 0.73-0.80; R2 = 0.19, 95% CI: 0.17-0.22; MAE = 5.09 cm, 95% CI: 4.93–5.31, Table 1, Table S2), while retaining most of the prediction accuracy provided by the full model.
4. Discussion
Although adult height is characterized by high heritability [
]. This biological situation provides major challenges to the DNA-based prediction of adult height as being relevant in FDP. Our study demonstrated that DNA-based height prediction for both tall stature and continuous height can be improved by increasing the number of DNA predictors from 180 SNPs, as we applied previously [
], to 689 SNPs as tested here. The model we introduce here based on 689 SNPs achieves the highest currently available prediction accuracy for tall stature.
However, analyzing the full set of 689 SNPs in the low quality and low quantity DNA samples available from compromised crime scene material, as is often encountered in forensic DNA testing, is technically challenging. Therefore, we attempted to reduce the number of SNP predictors while maximizing the prediction accuracy, which resulted in a best-fitted model based on 412 (60%) SNPs achieving an accuracy similar to that of the full set of 689 SNPs. From a forensic genetics perspective it would be desirable to establish an even smaller set achieving similarly high prediction accuracy; however, our findings demonstrate (see Fig. 1B) that the genetic complexity of adult height does not allow this without losing more prediction accuracy, not even for tall stature prediction. Nevertheless, our study demonstrates that the level of accuracy improvement shown here brings DNA-based prediction of tall stature closer to practical forensic applications. For comparison, AUC values achieved with the HIrisPlex system, which has already been used in anthropological applications and forensic casework, range from 0.74 to 0.95 for eye colour and from 0.75 to 0.92 for hair colour [
] performed an adult height GWAS based on whole exome sequencing (WES) data in 711,428 Europeans, and reported 83 height-associated coding variants with low minor allele frequencies but increased effect size. Of these, only 7 passed the quality control in our microarray-based study, of which one was already included in the 697 SNPs we tested. Of the 6 remaining SNPs, none showed significant association with height in our data (P > .05). With the exception of one, we therefore could not include the SNPs highlighted by Marouli et al. in our study. In principle, the suitability of including rare DNA variants with increased effect size in a prediction model depends not only on the minor allele frequency and thus the chance to observe the predictive minor allele in a newly tested DNA sample, but also the number of such rare DNA predictors being available, which has to be large to increase the chance that at least some will be present with their predictive minor allele in the tested DNA sample. More height GWASs based on WES and whole genome sequencing (WGS) data need to be carried out in the future to identify more rare SNPs with increased height effects, to be tested in prediction modelling studies together with common height SNPs.
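The argument about rare variants can be made concrete with a small calculation. The sketch below assumes Hardy-Weinberg proportions and independence between variants, under which the probability that a diploid sample carries at least one minor allele among a set of variants is 1 − ∏(1 − p_i)², where p_i is the minor allele frequency of variant i. The frequency of 0.005 is a hypothetical value chosen for illustration; only the counts of 6 and 83 variants come from the text.

```python
def prob_any_minor_allele(mafs):
    """P(at least one minor allele across variants), assuming HWE and independence."""
    prob_none = 1.0
    for p in mafs:
        prob_none *= (1.0 - p) ** 2  # both chromosomes carry the major allele
    return 1.0 - prob_none

HYPOTHETICAL_MAF = 0.005  # assumed for illustration, not taken from the study

few = prob_any_minor_allele([HYPOTHETICAL_MAF] * 6)    # 6 rare variants
many = prob_any_minor_allele([HYPOTHETICAL_MAF] * 83)  # 83 rare variants

print(f"6 variants:  {few:.3f}")
print(f"83 variants: {many:.3f}")
```

With these assumed frequencies, a handful of rare variants would be informative in only a small fraction of tested samples, whereas a much larger panel raises the chance that at least one predictive minor allele is present.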
Theoretically, it is expected that using more than the 689 height-informative SNPs we tested here in the prediction modeling will yield higher prediction accuracy. However, because there is no linear relationship between the number of DNA predictors and the prediction accuracy, this needs to be empirically tested. Very recently, in August 2018, Yengo et al. [
] published the third GWAS meta-analysis on height of the GIANT Consortium based on 693,529 Europeans, which highlighted 512 new height-associated loci not previously identified in the second GIANT height GWAS by Wood et al. [
] we considered here. Unfortunately, due to the timing of our study, we were unable to include these new loci in our current prediction analyses, which shall be done in future studies.
Before the advent of targeted MPS technologies in forensic genetics, considering hundreds of SNPs for crime scene DNA analysis would have been technically impossible because of the limited multiplex capacity of the earlier SNP genotyping technologies suitable for low quality and low quantity DNA, which typically does not allow reliable use of SNP microarrays. Recent studies demonstrated that targeted MPS allows the simultaneous analysis of some hundreds of SNPs [
] is currently not possible with targeted MPS and is irrelevant for many forensic cases where the low DNA quality and quantity does not allow the application of SNP microarrays. With the availability of targeted MPS technologies and their demonstrated suitability for forensic DNA analysis, appearance prediction models involving hundreds of SNPs, as demonstrated here for tall stature, are now becoming feasible for forensic genetics in general and Forensic DNA Phenotyping in particular.
Besides the technical problems related to DNA quality and quantity, it is generally preferred in forensic genetics to use targeted DNA over the non-targeted DNA technologies because in many countries forensic DNA analysis is regulated by law and restricted to certain uses of DNA. Non-targeted genomic screening technologies, such as SNP microarray analysis or whole genome sequencing analysis, may deliver far more genetic information than what forensic geneticists are legally allowed to have in many countries. It may be expected that with the current fast-pace developments in targeted MPS technologies, technical solutions to simultaneously genotype thousands of SNPs with high accuracy and sensitivity from DNA with low quality and quantity, thereby fulfilling requirements of forensic DNA analysis, may become available in the future, which will boost the genomic prediction of human complex appearance phenotypes including adult height and more. A promising technological development is targeted capture sequencing [
]. Targeted enrichment using probe capture has been introduced to forensic DNA analysis for whole mitogenome sequencing and sequencing of hundreds of SNPs [
], while forensic applications of targeted capture sequencing of thousands of SNPs are still pending.
5. Conclusion
By providing an update on the DNA predictability of tall stature in Europeans we introduce a prediction model consisting of 689 height-associated SNPs that achieved AUC of 0.79. Given the forensic need to reduce the number of SNP predictors while maximizing the prediction accuracy, we demonstrated that a subset of 412 SNPs achieved similar prediction accuracy with AUC of 0.76. To further increase prediction accuracy of tall stature, and especially for the full range of adult height, many more independently contributing SNP predictors are needed together with forensically suitable DNA technologies for their simultaneous analysis from crime scene stains.
Conflict of interest
SLS Drop has received research grants from Ace, Ferring and Eli Lilly. All other authors declare no conflicts of interest.
Acknowledgements
We thank the many scientists and volunteers involved in the Rotterdam Study for their collaboration and participation as well as the participants in the Dutch tall cohort study for their cooperation. Two anonymous reviewers are acknowledged for their constructive and useful comments on an earlier version of the manuscript.
This study received funding from the European Union’s Horizon 2020 Research and Innovation Programme under grant agreement number 740580 (VISAGE). FL was supported by the National Key R&D Program of China (2017YFC0803501), Ministry of Public Security Technique Research Plan (2016JSYJA04), Basic Research Project Grant (2016JB037, 2017JB025), and National Natural Science Foundation of China (91651507). KZ was supported by the China Scholarship Council (CSC). The generation and management of GWAS genotype data for the Rotterdam Study is supported by the Netherlands Organisation of Scientific Research NWO Investments (nr. 175.010.2005.011, 911-03-012), the Research Institute for Diseases in the Elderly (014-93-015; RIDE2), and the Netherlands Genomics Initiative/Netherlands Organisation for Scientific Research (NWO) project nr. 050-060-810. The Rotterdam Study is funded by the Erasmus MC University Medical Center Rotterdam, the Erasmus University Rotterdam, the Netherlands Organization for the Health Research and Development (ZonMw), the Research Institute for Diseases in the Elderly (RIDE), the Ministry of Education, Culture and Science of the Netherlands, the Ministry for Health, Welfare and Sports of the Netherlands, the European Commission (DG XII), the Municipality of Rotterdam and the Netherlands Genomics Initiative (NGI) / Netherlands Organization for Scientific Research (NWO) within the framework of the Netherlands Consortium on Healthy Ageing (NCHA). None of the funding agencies influenced the design, execution or results of this study.
Appendix A. Supplementary data
The following are Supplementary data to this article: