Recently, the field of predicting phenotypes of externally visible characteristics (EVCs) from DNA genotypes with the final aim of concentrating police investigations to find persons completely unknown to investigating authorities, also referred to as Forensic DNA Phenotyping (FDP), has started to become established in forensic biology. We previously developed and forensically validated the IrisPlex system for accurate prediction of blue and brown eye colour from DNA, and recently showed that all major hair colour categories are predictable from carefully selected DNA markers. Here, we introduce the newly developed HIrisPlex system, which is capable of simultaneously predicting both hair and eye colour from DNA. HIrisPlex consists of a single multiplex assay targeting 24 eye and hair colour predictive DNA variants including all 6 IrisPlex SNPs, as well as two prediction models, a newly developed model for hair colour categories and shade, and the previously developed IrisPlex model for eye colour. The HIrisPlex assay was designed to cope with low amounts of template DNA, as well as degraded DNA, and preliminary sensitivity testing revealed full DNA profiles down to 63 pg input DNA. The power of the HIrisPlex system to predict hair colour was assessed in 1551 individuals from three different parts of Europe showing different hair colour frequencies. Using a 20% subset of individuals, while 80% were used for model building, the individual-based prediction accuracies employing a prediction-guided approach were 69.5% for blond, 78.5% for brown, 80% for red and 87.5% for black hair colour on average. Results from HIrisPlex analysis on worldwide DNA samples imply that HIrisPlex hair colour prediction is reliable independent of bio-geographic ancestry (similar to previous IrisPlex findings for eye colour). 
We furthermore demonstrate that it is possible to infer with a prediction accuracy of >86% if a brown-eyed, black-haired individual is of non-European (excluding regions nearby Europe) versus European (including nearby regions) bio-geographic origin solely from the strength of HIrisPlex eye and hair colour probabilities, which can provide extra intelligence for future forensic applications. The HIrisPlex system introduced here, including a single multiplex test assay, an interactive tool and prediction guide, and recommendations for reporting final outcomes, represents the first tool for simultaneously establishing categorical eye and hair colour of a person from DNA. The practical forensic application of the HIrisPlex system is expected to benefit cases where other avenues of investigation, including STR profiling, provide no leads on who the unknown crime scene sample donor or the unknown missing person might be.
Over the last few years, the prediction of externally visible characteristics (EVCs) from DNA has been an interesting topic of study for many reasons, in particular, its anticipated use within forensic genetics [
] resulting in the chosen term Forensic DNA Phenotyping (FDP). The ability to predict the physical appearance of an individual directly from crime scene material can in principle help police investigations by narrowing down a large number of potential suspects in cases where perpetrators unknown to the investigating authorities are involved. These include cases where conventional STR profiling could not provide a hit within the forensic DNA (profile) database, or could not provide a match with a suspect singled out by police investigation, or cases where an STR profile could simply not be generated due to low quality and/or quantity of DNA available. Using EVC information obtained from the crime scene material via FDP, police would then proceed with more concentrated enquiries, and finally request standard forensic STR profiling only for the reduced number of EVC-matching suspects, aiming at DNA individualisation for courtroom use. Obviously, the more EVCs that are predictable from crime scene material, the better a person's appearance can be described, and in turn the smaller the number of appearance-matching potential suspects for subsequent forensic STR profiling. Also in missing person cases where a body was found decomposed with no EVC information discernible from visual inspection, or where only body parts that do not provide EVC information, such as bones, were found, FDP is expected to provide leads for finding the right antemortem samples or family members for final STR-based identification.
The use of DNA (or other biomarkers) for investigative purposes termed ‘DNA intelligence’, rather than for identification purposes in the court room as currently applied in forensics, marks a completely new application of DNA in forensics and is currently at the early stages of development. At present there is only one FDP tool available that has already been developmentally validated for forensic use and that is the IrisPlex system, capable of predicting eye colour from DNA [
] none of them introduced a tool that had undergone systematic forensic developmental validation testing as of yet. The IrisPlex system allows the prediction of eye colour from minute amounts of DNA (31 pg DNA input full profiles) and has proven to be 94% accurate for predicting blue and brown eye colour when tested on a European set of >3800 individuals [
The previous progress on categorical eye colour DNA predictability together with the strong genetic and phenotypic relationship between eye and hair colour variation, as well as the increased understanding of the genetic basis of hair colour, all suggest that hair colour may represent the next-promising candidate EVC for DNA prediction after eye colour. Hair colour (as well as eye colour), is generally known to be highly variable in people of (at least partial) European descent and those from nearby regions such as the Middle East and parts of Western Asia [
], with individuals displaying numerous variations of hair colour shade that are usually summarised in four main colour categories, namely red, blond, brown and black. In contrast, people from any other parts of the world (and without European/nearby genetic admixture) usually display the ancestral black hair colour (together with the ancestral brown eye colour) phenotype. Variation in hair (and eye) colour is assumed to be of European origin and is thought to have reached its currently observed frequencies via sexual selection (i.e. mate choice preferences) [
]. The genetic basis of human hair colour variation has been studied considerably in the last few years. Recent studies either employing the candidate gene approach or genome-wide association and/or linkage analysis have identified genes and DNA variants likely to be involved in human hair colour variation [
]. Some preliminary attempts have already been made towards the prediction of hair colour from informative DNA variants. In fact, an early red hair prediction protocol based on a combination of non-synonymous single nucleotide polymorphisms (SNPs) in the MC1R gene that incur the red hair phenotype effect was already developed for forensic use more than ten years ago [
] in their genome-wide association study for European pigmentation traits developed a hair colour prediction tool, which was capable of excluding red and either blond or brown hair colour in its prediction for many of their individuals. More recently, Valenzuela et al. [
] assessed 75 SNPs from 24 genes previously implicated in hair, skin and eye colour in samples of various bio-geographic origins (Europe and elsewhere) and found that three of them, i.e. rs12913832 (HERC2), rs16891982 (SLC45A2) and rs1426654 (SLC24A5) combined gave the best prediction for light and dark hair colour.
Armed with previous knowledge on hair colour associated DNA variants and in considering the most up-to-date list of DNA variants related to human hair colour variation available at the time, we recently performed an evaluation of 46 SNPs from 13 genes [
] for model-based population-wise hair colour prediction aiming to find a set of most hair colour predictive DNA variants. In this previous study we identified a set of 13 DNA markers (2 MC1R combined marker sets and 11 single DNA markers) from 11 genes (MC1R, HERC2, OCA2, SLC45A2 (MATP), KITLG, EXOC2, TYR, SLC24A4, IRF4, PIGU/ASIP and TYRP1) containing most hair colour predictive information. This DNA marker set provided a high degree of population-based, prevalence-adjusted overall prediction accuracy as expressed by the area under the curve of a receiver operating characteristic curve (AUC) with estimates at 0.93 for red, 0.87 for black, 0.82 for brown, and 0.81 for blond hair colour, where 1 means completely accurate prediction. However, the genotyping methodology used in this previous screening study did not allow simultaneous genotyping of all 22 identified hair colour predictive DNA markers in a single reaction as would be appreciated in forensic DNA analysis where there can be limited amounts of starting material. Furthermore, in the previous study, only samples with hair colour genotypes and phenotypes from a single country in Eastern Europe, i.e. Poland, were available, whereas the inclusion of individuals from other European regions, such as Western and Southern parts, would be beneficial in order to enrich with individuals displaying hair colours such as brown and black that are more common in these parts of Europe.
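The AUC values quoted above can be read as the probability that a randomly chosen individual with a given hair colour receives a higher predicted probability for that colour than a randomly chosen individual without it. The following is a minimal sketch of that rank-based reading only; the function name and the example probabilities are invented for illustration, and it does not reproduce the prevalence adjustment used in the previous study.

```python
def auc(scores_pos, scores_neg):
    """Rank-based AUC: chance that a positive case outscores a negative case.

    Ties between a positive and a negative score count as half a win.
    """
    if not scores_pos or not scores_neg:
        raise ValueError("both groups need at least one score")
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical predicted red-hair probabilities (not data from the study).
red_haired = [0.9, 0.8, 0.4]
not_red_haired = [0.5, 0.3, 0.1, 0.05]
print(round(auc(red_haired, not_red_haired), 3))
```

An AUC of 1 means every positive case is ranked above every negative case, while 0.5 corresponds to random ranking.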
In the present study, we developed and evaluated the sensitivity of a single-tube multiplex assay targeting the 22 previously recognised hair colour predictive DNA variants as well as the six eye colour predictive SNPs from our previously developed IrisPlex system (four of which are overlapping). We employed the SNaPshot technology because it can be easily implemented in forensic DNA laboratories, as no additional equipment or serious interference with protocols is needed to apply it. Furthermore, we assessed the power of the 22 DNA variants to predict hair colour categories, as well as hair colour shade, via model-based prediction studies using an expanded database of hair colour genotype and phenotype data for >1500 individuals from Eastern, Western and Southern parts of Europe that displayed varying degrees of hair colouration. Moreover, by analysing a worldwide set of individuals from 51 populations (HGDP-CEPH), we investigated whether or not the reliability of hair colour prediction available with these 22 DNA variants depends on knowledge of bio-geographic ancestry. We present and make available for future use the first system for parallel prediction of hair and eye colour from DNA, which we termed HIrisPlex, consisting of a single multiplex assay for 24 eye and/or hair colour predictive DNA variants and two prediction models, i.e. a newly developed model for hair colour and shade prediction and the previously developed IrisPlex model for eye colour prediction. An interactive spreadsheet tool for obtaining individual hair colour, hair colour shade, and eye colour prediction probabilities from HIrisPlex genotypes, as well as a prediction guide for accurate interpretation of individual hair colour and shade probabilities, are made available to enhance the practical use of the HIrisPlex system in future applications such as forensics.
2. Materials and methods
2.1 Subjects, imagery and hair and eye colour classification
DNA samples and hair colour information were collected from 1551 European subjects living in Poland (n = 1093), the Republic of Ireland (n = 339) and Greece (n = 119). All participants gave informed consent. The study was approved in part by the Ethics Committee of the Jagiellonian University, number KBET/17/B/2005 and the Commission on Bioethics of the Regional Board of Medical Doctors in Krakow number 48 KBL/OIL/2008. Hair and eye colour phenotypes were collected by a combination of self-assessment and professional single-observer grading (Polish data). The professional grader (AKK) for the Polish dataset is a medical doctor (dermatologist) who evaluated hair colour by observation, and by questioning individuals whose hair was dyed or grey. For hair colour phenotype self-assessment (Irish collection), individuals were asked to record in the questionnaire the colour of their hair during their 20s and the age at which grey/white hairs started to appear; this avoided the effects of hair greying and whitening on phenotyping. Sample collection in Ireland included high-resolution eye and hair photographic imagery. Briefly, hair and eye images were taken using a Nikon D3100 with an AF-S Micro Nikkor 60 mm macro lens; the aperture, shutter speed and ISO were fixed at f/22, 1/125 and 200, respectively. A ring flash (model Speedlight SB-R200) was used at an average distance of 7 cm from the eye and, for hair imagery, from the back of the head. This ensured consistent sampling and regulated lighting conditions, including lens settings of 0.2 and 0.23 fixed focal length. All individuals were asked to fill in a questionnaire that included basic information, such as gender and age, as well as data concerning eye and hair pigmentation phenotype. However, because many Irish individuals had dyed or grey hair, self-reported hair colour classifications were used for this set in model training.
For the Greek collection, a buccal swab was taken from each individual and a self-reported questionnaire regarding hair and eye colour information was collected. For both the Irish and Greek set, hair colour was classified into 7 categories: blond (5.9%), light-brown (34%), dark-brown (45.2%), auburn (5.7%), blond-red (1.3%), red (2.2%), and black (5.7%). For the Polish dataset, this data was collected as previously reported [
] and hair colour was classified into 7 categories: blond (13.7%), dark-blond (44.2%), brown (22.6%), auburn (1%), blond-red (3.9%), red (3.8%), and black (10.8%). For hair colour prediction analyses, we grouped blond and dark-blond into one blond category (42.6%), light brown and dark brown into one brown category (39.3%) and auburn, blond-red, and red into one red category (8.8%), with black as an additional fourth category (9.3%). Eye colour was classified into 3 categories: blue, brown and intermediate (including green). The term category in this context refers to the grouping of similar phenotypic colours into one group to separate them from another colour group, e.g. blond category, black category. Table 1 displays the numbers of hair and eye colour phenotypes, including sex, within all 3 populations sampled. Notably, red hair in the Polish population and green eye colour in the Irish population were intentionally enriched due to their rare occurrence; therefore, neither phenotype reflects natural population frequencies.
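The grouping of the seven recorded hair colour labels into the four prediction categories can be sketched as a simple lookup. This is only an illustration: the dictionary and function names are invented here, and mapping the Polish "brown" label to the brown category is an assumption, since the text names only light brown and dark brown explicitly.

```python
# Recorded label -> prediction category, following the grouping in the text.
CATEGORY_OF = {
    "blond": "blond",
    "dark-blond": "blond",
    "light-brown": "brown",
    "dark-brown": "brown",
    "brown": "brown",      # assumption: Polish 'brown' joins the brown category
    "auburn": "red",
    "blond-red": "red",
    "red": "red",
    "black": "black",
}


def group_hair_colour(label):
    """Normalise a recorded label and return one of blond/brown/red/black."""
    key = label.strip().lower().replace(" ", "-")
    if key not in CATEGORY_OF:
        raise ValueError(f"unknown hair colour label: {label!r}")
    return CATEGORY_OF[key]


print(group_hair_colour("Dark blond"))
print(group_hair_colour("blond-red"))
```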
Table 1. Phenotype frequencies according to hair and eye colour categories (including sex) for the full combined set of individuals from Poland, Ireland and Greece.
]. Saliva samples collected from individuals in Ireland were extracted using the Puregene DNA isolation kit (Qiagen, Hilden, Germany). Buccal swabs collected from individuals in Greece were extracted using an in-house organic extraction protocol. DNA from the H952 subset of the HGDP-CEPH panel that represents 952 individuals from 51 worldwide populations [
] was purchased from CEPH. Due to lack of DNA in some samples belonging to the HGDP-CEPH 952 set, 7 individuals could not be genotyped by the HIrisPlex assay, and therefore the final number of worldwide samples was 945.
All samples were genotyped using the HIrisPlex assay. The assay includes 23 SNPs and 1 insertion/deletion (INDEL) polymorphism, altogether 24 DNA variants, from 11 genes: MC1R, HERC2, OCA2, SLC24A4, SLC45A2, IRF4, EXOC2, TYRP1, TYR, KITLG, and PIGU/ASIP. Further information on these 24 markers can be found in Table 2, including primer sequences. The 24 PCR primer pairs were designed using the default parameters of the program Primer3Plus [
], which is a free web-based design software. PCR fragments were designed to be as short as possible to cater for degraded DNA, and therefore all are less than 160 bp in length. To reduce the possibility of primer pairs interacting with each other, the program Autodimer [
For the population genotyping, genomic DNA quantities ranging from 300 pg to 3 ng in 1 μl formats were amplified per individual in a 10 μl reaction volume consisting of 1× PCR buffer, 2.5 mM MgCl2, 220 μM of each dNTP, and 1.75 U AmpliTaq Gold DNA polymerase (Applied Biosystems Inc., Foster City, CA), with the PCR primer concentrations given in Table 2. Thermocycling was performed on the 96-well GeneAmp® PCR system 9700 (Applied Biosystems) under the following conditions: (1) 95 °C for 10 min, (2) 33 cycles of 95 °C for 30 s and 61 °C for 30 s, (3) 5 min at 61 °C. PCR products were cleaned with ExoSAP-IT (USB Corp., Cleveland, OH), as recommended by the manufacturer. Following removal of unincorporated dNTPs and primers, the multiplex single base extension (SBE) assay was performed using 2 μl of product with 1 μl of ABI SNaPshot kit (Applied Biosystems, Foster City, CA) reaction mix in a total reaction volume of 5 μl. SBE primer sequences and concentrations used in the assay can be found in Table 2. Thermocycling conditions were as follows: 96 °C for 2 min and 25 cycles of 96 °C for 10 s, 50 °C for 5 s and 60 °C for 30 s. Products were cleaned using SAP (USB Corp.), following the manufacturer's guidelines, and 1 μl of cleaned product was run on the ABI 3130xl Genetic Analyser (Applied Biosystems) with POP-7 on a 36 cm capillary array following the SNaPshot kit sample preparation guidelines; however, run parameters of 2.5 kV for 10 s injection voltage and a run time of 500 s at 60 °C were used for increased sensitivity.
For assay sensitivity studies, genotyping results from two different individuals were assessed from serial dilutions of DNA input samples of 500 pg, 250 pg, 125 pg, 63 pg and 31 pg. Each result was investigated for allelic drop out, which includes peaks below the 50-rfu threshold that cannot be called. The determination of sensitivity was based on the production of a full profile in every replicate at a particular DNA input level.
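The sensitivity criterion described above (a full profile in every replicate at a given input level, with peaks below 50 rfu treated as drop-out) can be sketched as follows. The data layout, function names and peak heights are hypothetical; only the 50-rfu threshold and the input levels come from the text. Walking from the highest input downwards and stopping at the first failing level is one possible reading of the criterion.

```python
RFU_THRESHOLD = 50  # peaks below this height cannot be called (from the text)


def is_full_profile(peaks, threshold=RFU_THRESHOLD):
    """peaks maps marker -> list of peak heights (rfu) for the expected alleles."""
    return all(h >= threshold for heights in peaks.values() for h in heights)


def sensitivity_limit(replicates_by_input):
    """Lowest DNA input (pg), scanning from high to low, at which every
    replicate still gives a full profile. Returns None if none qualifies."""
    limit = None
    for pg in sorted(replicates_by_input, reverse=True):
        if all(is_full_profile(rep) for rep in replicates_by_input[pg]):
            limit = pg
        else:
            break  # first input level with drop-out ends the scan
    return limit


# Hypothetical peak heights for two markers and two replicates per input level.
example = {
    125: [{"rs12913832": [900, 850], "N29insA": [400]},
          {"rs12913832": [880, 820], "N29insA": [380]}],
    63:  [{"rs12913832": [300, 280], "N29insA": [90]},
          {"rs12913832": [310, 260], "N29insA": [75]}],
    31:  [{"rs12913832": [120, 110], "N29insA": [35]},   # drop-out: 35 < 50
          {"rs12913832": [130, 100], "N29insA": [60]}],
}
print(sensitivity_limit(example))
```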
2.3 HIrisPlex DNA variants and their use for eye/hair colour prediction including in a worldwide sample
The HIrisPlex assay consists of 24 DNA variants (23 SNPs and 1 INDEL). Six of these markers, rs12913832 (HERC2), rs1800407 (OCA2), rs12896399 (SLC24A4), rs16891982 (SLC45A2 (MATP)), rs1393350 (TYR) and rs12203592 (IRF4), are taken from the IrisPlex system, which has already been well established [
] and are used for the eye colour prediction part of the HIrisPlex system. The results of these 6 SNPs when their minor allele is input into the HIrisPlex prediction tool are used to predict the eye colour of the individual using the IrisPlex model as previously published [
]. When their minor alleles are input into the HIrisPlex prediction tool, they are used to predict the hair colour of the individual using the HIrisPlex hair prediction model developed in this paper. From the four hair colour categories of blond, brown, red and black, the highest probability value is indicative of the predicted hair colour following guidelines that are published within this paper and described in the next section.
For worldwide hair colour prediction, we assessed the HIrisPlex assay performance on 945 samples from 51 populations of the HGDP-CEPH set. The MapViewer 7 (Golden Software, Inc., Golden, CO, USA) package was used to plot the predicted hair colour categories and the distribution of SNP genotypes on the world map. A non-metric multidimensional scaling (MDS) plot was produced to illustrate the pairwise FST distances [
] of the 24 eye and hair colour SNPs between populations, using SPSS 17.0.2 for Windows (SPSS Inc., Chicago, USA). Analysis of molecular variance (AMOVA) (Excoffier 1992) was performed using Arlequin v3.11 [
]. A threshold assessment of prediction probabilities for each hair colour category was also carried out, including a combined eye and hair colour prediction probability threshold for inferring a non-European individual with black hair and brown eyes. For the assessment of an age-dependent hair colour change, a Pearson correlation was calculated and the graph plotted using SPSS 17.0.2 for Windows (SPSS Inc., Chicago, USA).
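For readers without access to SPSS, the Pearson correlation mentioned above can be computed directly from its definition. The sketch below is generic: the function name is ours, and the variables (age and a numeric hair colour score) and values are invented for illustration. The section does not state which quantities were actually correlated.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: age in years and an invented numeric hair colour score.
ages = [5, 10, 15, 20, 25]
scores = [1.0, 1.5, 2.5, 3.0, 3.5]
print(round(pearson_r(ages, scores), 3))
```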
2.4 Prediction modelling for hair colour
To develop a hair colour prediction model using samples from several sites with differing hair colour frequencies due to their position within Europe (central, western and southern Europe), we took a random subset of 80% of the samples from each site: Poland (n = 875), Ireland (n = 272) and Greece (n = 96). This 80% subset was used to train the model, which was based on multinomial logistic regression (MLR), as previously published by Liu et al. [
]. In brief, individuals were categorised according to their hair phenotypes and were split into 4 categories: blond (n = 529), brown (n = 490), red (n = 109) and black (n = 115). For their genotypes, 22 of the 24 HIrisPlex DNA variants (as described above) were used to test for hair colour differentiation and for use in the prediction model. By inputting the minor allele of each DNA variant together with the individual's phenotype and applying MLR, alpha and beta values are generated that form the core of the prediction model. This model then allows the probabilistic prediction of an individual's hair colour category based solely on the input of the 22 variant minor alleles into the HIrisPlex hair colour prediction tool. To assess the effect of the light and dark shades of hair colour that may be contributed from blond and black respectively, a similar approach was used that contrasted the individuals grouped in the light category (blond, n = 529) with a dark category (black, n = 115). Red hair individuals (n = 109) were omitted from this analysis, as their colour is based upon a cumulative MC1R mutation effect and not on the continuous spectrum of light to dark (i.e. blond to black). Brown hair individuals (n = 490) were omitted, as only the extremes of light and dark were required. Therefore, using this two-pronged model approach, a predicted hair colour is generated with an approximate indication of the colour being light or dark (e.g. light brown, dark brown) due to the influence of the genotypes commonly associated with the light/dark categories of blond and black respectively. The further 20% of the combined dataset (total n = 308), i.e. 
from Poland (n = 218), Ireland (n = 67) and Greece (n = 23), was used to assess the accuracy of the prediction model in terms of the final hair colour prediction being correct or incorrect based on colour category, shade and use of the hair colour prediction guide that is described in detail in Section 3, and an assessment of optimal category thresholds was undertaken. The steps to take when acquiring a prediction based on colour and shade are outlined in a guide provided below.
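To make the role of the alpha and beta values concrete, the sketch below shows how a multinomial logistic regression model of this general form turns minor allele counts into category probabilities. It is an illustration under stated assumptions, not the published HIrisPlex model: the function names, the choice of black as the reference category, the two variants shown and all coefficient values are invented, and the real model uses 22 variants with fitted parameters.

```python
import math


def mlr_probabilities(minor_allele_counts, alphas, betas, reference):
    """Category probabilities from a multinomial logistic regression.

    alphas: category -> intercept (all categories except the reference)
    betas:  category -> {variant: coefficient per minor allele}
    The reference category has a linear predictor of zero.
    """
    linear = {reference: 0.0}
    for category, alpha in alphas.items():
        linear[category] = alpha + sum(
            betas[category].get(variant, 0.0) * count
            for variant, count in minor_allele_counts.items()
        )
    shift = max(linear.values())  # subtract the maximum for numerical stability
    weights = {c: math.exp(z - shift) for c, z in linear.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


def predicted_category(probabilities):
    """Category with the highest probability."""
    return max(probabilities, key=probabilities.get)


# Invented coefficients for two variants only (NOT the fitted model values).
alphas = {"blond": -0.5, "brown": 0.2, "red": -2.0}
betas = {
    "blond": {"rs12913832": 1.1, "rs1805007": 0.3},
    "brown": {"rs12913832": 0.4, "rs1805007": 0.1},
    "red":   {"rs12913832": 0.2, "rs1805007": 2.5},
}
genotype = {"rs12913832": 2, "rs1805007": 0}  # minor allele counts (0, 1 or 2)

probs = mlr_probabilities(genotype, alphas, betas, reference="black")
print(probs, predicted_category(probs))
```

The highest of the four probabilities indicates the predicted category; the interpretation thresholds belong to the prediction guide and are not modelled here.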
3. Results and discussion
3.1 HIrisPlex genotyping assay – design and sensitivity
The HIrisPlex assay was designed to cope with low template and degraded DNA, a standard concern when genotyping forensic casework samples. Therefore, care was taken to ensure small PCR amplicon sizes of <160 bp for all of the 24 DNA variants considered. During optimisation of the single multiplex assay, homozygote allele peak heights and their associated heterozygote allele peak heights were balanced to be as consistent as possible across the combined set. With this we aimed to limit the chances of heterozygote dropout at the lower concentration levels. For the INDEL variant N29insA (first peak in the assay, Fig. 1), however, the peak height is lower, on average by a factor of 2 depending on the sample DNA input, relative to the 23 SNPs in the multiplex. This is due to design difficulties that are known to occur with INDELs. Nevertheless, this does not affect the assay until the very low DNA input levels (<63 pg) for which sensitivity was assessed. Notably, N29insA is extremely rare even among red hair individuals; only 4 out of a total of 137 red hair phenotype individuals had this mutation in our dataset. Hence, in most cases, this technical issue is not likely to affect the practical use of the HIrisPlex assay. If, however, allelic drop-out for N29insA is indeed observed in a case, N29insA should be genotyped using the more sensitive singleplex assay to take full advantage of the red hair colour prediction available with the marker set considered here.
Our population studies revealed that DNA inputs of >500 pg usually yield a balanced profile with high relative fluorescence unit (rfu) levels, especially for homozygote SNP alleles. For a first investigation of the sensitivity threshold of the HIrisPlex assay, two individuals were genotyped in a duplicate dilution series of DNA input at 500 pg, 250 pg, 125 pg, 63 pg, and 31 pg, established after DNA quantification at 500 pg using the Quantifiler Human DNA Quantification kit (Applied Biosystems). These individuals were chosen to maximise, as far as possible, the number of the 24 DNA variants in the heterozygous state, which is important, as signals from heterozygote alleles are not as strong as those from homozygote alleles for the same marker. From Fig. 1 it is evident that at 500 pg and lower, peak height imbalance occurs, and this should be taken into account when assessing genotype calls at these lower DNA levels; however, genotype accuracy is not affected until very low DNA input levels. Peak imbalance can sometimes be confused with the possibility of a DNA mixture from different individuals, but it is important to note here that in most circumstances HIrisPlex will be used after an STR profile has been generated from crime scene material (and found not to be informative); therefore, the presence of a DNA mixture should be evident from the STR profile. The sensitivity of the 24-variant HIrisPlex assay is high, with full profiles observed at DNA input levels down to and including 63 pg, while allele drop-out occurs at the lowest examined level of 31 pg DNA input for some HIrisPlex DNA variants (Fig. 1). In particular, drop-out was observed in 5 instances for this set of profiles, at N29insA, rs1042602, rs4959270, rs1800407 and rs1393350. One drop-in, of a C allele at Y152OCH, occurred at 31 pg starting DNA.
Overall, the HIrisPlex assay's sensitivity, according to the preliminary assessment done here, is comparable to some other complex SNaPshot™ assays such as an 18-plex designed by Freire-Aradas et al. [
] for human individual identification from highly degraded DNA using autosomal SNPs. For that assay, full profiles down to 78 pg/μl DNA input were observed with partial profiles down to 31 pg DNA input, similar to the HIrisPlex assay. These minimal input levels are lower than those reported for other autosomal SNP assays such as the two multiplex assays together covering 44 SNPs for individual identification by Lou et al. [
] where a DNA input of at least 125 pg is needed to receive a full profile. Notably, our previously developed IrisPlex assay that includes the same 6 eye colour predictive SNPs as also included in the HIrisPlex assay gave full profiles down to a level of about 31 pg input DNA [
], which is slightly more sensitive than the HIrisPlex assay presented here. This is at least partly explained by the 4 times larger number of DNA variants included in the HIrisPlex assay relative to the previously developed IrisPlex assay. For practical applications this may mean that if allelic dropout due to low quality/quantity input DNA is indicated by complete locus drop-out at any of the 6 HIrisPlex SNPs for eye colour, the more sensitive IrisPlex assay may be applied subsequently and may provide a full 6-SNP profile for eye colour prediction on critical DNA samples.
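The fallback suggested above can be expressed as a simple check on the six eye colour SNPs within a HIrisPlex profile. This is a minimal sketch: the profile layout (marker name mapped to a genotype string, with `None` or a missing key for a locus with no callable allele) and the function names are assumptions; the six SNP identifiers are those listed for the IrisPlex system in Section 2.3.

```python
# The six IrisPlex eye colour SNPs included in the HIrisPlex assay.
IRISPLEX_SNPS = (
    "rs12913832",  # HERC2
    "rs1800407",   # OCA2
    "rs12896399",  # SLC24A4
    "rs16891982",  # SLC45A2 (MATP)
    "rs1393350",   # TYR
    "rs12203592",  # IRF4
)


def dropped_eye_colour_loci(profile):
    """Return the eye colour SNPs with complete locus drop-out in a profile."""
    return [snp for snp in IRISPLEX_SNPS if not profile.get(snp)]


def needs_irisplex_rerun(profile):
    """True if any eye colour SNP dropped out, so the 6-SNP assay may help."""
    return bool(dropped_eye_colour_loci(profile))


# Hypothetical low-template profile with one eye colour locus missing.
low_template = {
    "rs12913832": "C/T",
    "rs1800407": None,       # complete locus drop-out
    "rs12896399": "G/T",
    "rs16891982": "G/G",
    "rs1393350": "G/A",
    "rs12203592": "C/C",
}
print(dropped_eye_colour_loci(low_template), needs_irisplex_rerun(low_template))
```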
3.2 HIrisPlex model-based hair colour prediction
MC1R polymorphisms are largely recessive when considered individually, but also interact with each other through a genetic mechanism known as “compound heterozygosity” [
], the MC1R variants Y152OCH, N29insA, rs1805006, rs11547464, rs1805007, rs1805008, rs1805009, rs1805005, rs2228479, rs1110400 and rs885479 were all collapsed into two markers, MC1R-R (R/R, R/wt, wt/wt) and MC1R-r (r/r, r/wt, wt/wt), depending on the penetrance of the mutant alleles. Thus, the total 22 hair colour markers were considered as 13 markers in our previous prediction analysis, namely MC1R-R, MC1R-r, rs1042602 (TYR), rs4959270 (EXOC2), rs28777 (SLC45A2 (MATP)), rs683 (TYRP1), rs2402130 (SLC24A4), rs12821256 (KITLG), rs2378249 (PIGU/ASIP), rs12913832 (HERC2), rs1800407 (OCA2), rs16891982 (SLC45A2 (MATP)) and rs12203592 (IRF4). In the current study, we had two main reasons for the development of a new hair colour prediction model utilising a 22 DNA variant set without collapsing into MC1R-R and MC1R-r. First, we were able to produce a larger dataset that provides a broader representation of Europe and its highly variable hair colour regions. Notably, we not only increased the sample size relative to our previous study [
] by 3-fold, but in addition to considering more Eastern Europeans from Poland (also used before) we also added individuals from Western Europe, i.e. Ireland and from Southern Europe, i.e. Greece. These three countries display very different hair colour phenotype frequencies (Table 1), which would also impact on the modelling. The use of samples from three European regions and countries provides an increase in overall sample size and also a better representation of the hair colour phenotype variation across Europe, but this also increases the different genotype combinations observable. Second, some of the MC1R variants also contribute to hair colours other than red [
] and 78% in our own set contain at least one of the MC1R mutations without displaying the red hair phenotype; Europeans from other regions to which our hair colour prediction tool may be applied in the future may also reflect this. Therefore, a new hair colour prediction model was developed to examine the input of each single DNA variant for hair colour categorical prediction, including the individual impact of all MC1R variants separately.
Table 3. Assessment of the contribution of each HIrisPlex DNA variant for hair colour prediction within the model in terms of betas and probability (p) values. The values generated reflect a binary category assessment of colour prediction, i.e. blond versus non-blond, brown versus non-brown, etc. The lowest (and thus most statistically significant) p values for each category are highlighted for the respectively associated DNA variants.
Fig. 2 shows a hypothesised tree model illustrating how each of the 22 DNA variants contributes towards a categorical hair colour prediction as inferred from our current data. This scenario represents the extreme of a 2 minor allele input for each single DNA variant and the largest single hair colour category effect that is seen on the model's prediction, based on that input. However, it is important to note here (and as further outlined below) that it is the combination of all 22 DNA variants together in a single model that finally allows the prediction of hair colours as we suggest with this study.
Table 3 provides a measure of the strength of each DNA variant's contribution towards each hair colour category prediction using beta values and p-values obtained from the MLR model. The analysis is based on the combined 80% model-building subset of 1243 Polish, Irish and Greek individuals assigned into a red versus non-red colour category, which then displays each DNA variant's contribution towards red hair colour within the model. For the other categories (i.e. blond versus non-blond, brown versus non-brown and black versus non-black), we used a total set of 1134 individuals representing the 80% model-building subset but now omitting the red hair individuals from the analyses due to their rare DNA variants and the fact that red hair is not a continuous colour but more a combined MC1R mutation effect on colour change [
] in several DNA variants, i.e. rs12913832 (HERC2) and rs12203592 (IRF4) with high statistical support (p values of 10⁻⁶ to 10⁻¹⁶) in the present enlarged dataset considering Poland, Ireland and Greece. Although less powerful, additional DNA variants also show significant evidence (p < 0.05) for some hair colours, such as rs2402130 (SLC24A4), rs12821256 (KITLG), rs4959270 (EXOC2), rs1805006 (MC1R), rs1805007 (MC1R), rs1805008 (MC1R) for blond, rs1805006 (MC1R) and rs2402130 (SLC24A4) for brown, and rs1805007 (MC1R) for black. Red hair colour prediction is observed with the strongest statistical support (p values of 10⁻⁸ to 10⁻¹⁶) for several of the individually considered MC1R variants as expected, i.e. rs1805008, rs1805007, rs1805009 and rs11547464, and with somewhat less statistical strength (p < 0.05) for other MC1R variants, i.e. rs1805005, rs1805006 and rs1110400. However, because the generally rare MC1R variant alleles at N29insA (INDEL) and Y152OCH occur at very low frequency in our set of individuals, their contribution towards red hair probabilities is particularly high (Table 3; red hair beta values of −22 and −19.4 respectively), i.e. the presence of an A allele at N29insA or Y152OCH produces red hair prediction probabilities of 1. This effect is not mirrored in the other MC1R variants investigated and reflects that these very rare alleles (in heterozygote or homozygote state) were observed at a very low frequency (n = 6) in our model training set and only in individuals displaying a red hair phenotype. Although this does not affect the final prediction of red hair, it is important to note the abnormally high probability values for red when these rare variants are present. Notably, some DNA variants outside the MC1R gene also show significant red hair colour probabilities (p < 0.05), i.e. rs12913832 (HERC2), and rs2378249 (PIGU/ASIP).
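To illustrate why a beta of very large magnitude drives the predicted red hair probability to essentially 1, the following minimal Python sketch converts linear scores into category probabilities in the way a standard multinomial logistic regression does. The coefficients, variant names and the positive sign used here are hypothetical and chosen only for illustration; they are not the values of Table 3, and the sign of a beta depends on how the reference category and alleles are coded.

```python
import math


def category_probabilities(intercepts, betas, minor_allele_counts, reference="non-red"):
    """Convert linear scores into category probabilities (softmax).

    The reference category has a fixed score of 0, as in standard
    multinomial logistic regression.
    """
    scores = {reference: 0.0}
    for category, intercept in intercepts.items():
        scores[category] = intercept + sum(
            betas[category].get(variant, 0.0) * count
            for variant, count in minor_allele_counts.items()
        )
    top = max(scores.values())  # subtract the maximum for numerical stability
    weights = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


# Hypothetical coefficients, NOT the values from Table 3.
intercepts = {"red": -3.0}
betas = {"red": {"rare_variant": 22.0, "common_variant": 0.4}}

without_rare = category_probabilities(
    intercepts, betas, {"rare_variant": 0, "common_variant": 1}
)
with_rare = category_probabilities(
    intercepts, betas, {"rare_variant": 1, "common_variant": 1}
)
print(without_rare["red"], with_rare["red"])
```

With a single copy of the hypothetical rare variant, the red score dominates the reference score so strongly that the red probability is indistinguishable from 1 in practice, regardless of the other inputs.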
Fig. 3 provides the results of HIrisPlex prediction for a subset of 44 Irish individuals where high-resolution non-dyed hair colour imagery was available to illustrate the model's performance. The individuals' natural hair colour images were ordered according to their predicted hair colour category probability values achieved via HIrisPlex analysis while the actual hair colour phenotypes were not considered in the ordering. From left to right, top to bottom, the images are ordered from the highest to lowest HIrisPlex prediction probabilities for black hair and then the lowest to highest prediction probabilities for brown, red and blond hair respectively. As evident, there is a high correlation between the predicted hair colour category from HIrisPlex and the hair colour phenotype observed from visual inspection of these images.
Table 4 shows the accuracy of hair colour prediction in the 20% model-testing subset of the Polish, Irish, and Greek individuals (n = 308). It is important to emphasise here that these individuals were not used for model building. The highest probability category approach (as opposed to the prediction-guide approach explained in the next paragraph) considers the colour category with the highest predicted probability as the final predicted colour and does not take other categories into account for the final prediction. Using this approach, we tested various probability thresholds, from no threshold, to p > 0.7, which we previously recommended for eye colour prediction using the IrisPlex system [
]. As seen in Table 4, using the p > 0.7 threshold (B) increases the percentage of correct calls relative to the value obtained without any threshold (A) for some hair colours: for red hair by ∼10% (i.e. from 89.5% without threshold to 100% with threshold) and for blond hair by ∼6.5% (i.e. from 57.2% to 63.6%), whereas no difference was seen for brown hair at 75%, and for black hair we saw a decrease of ∼8.5% (i.e. from 28.6% to 20%). The low prediction accuracy obtained with this approach for black hair may reflect the difficulty of defining the true black hair colour phenotype relative to the dark brown phenotype within this European dataset, where black hair is rare. Notably, the low correct call rate of 28.6% for black (without using a threshold) is mainly caused by 30 individuals with non-black self-reported phenotypes that were predicted as black by the HIrisPlex model. Of these, almost all (i.e. 90%) had the brown–dark brown phenotype. We could speculate that at least some of them may have been self-categorised as black if black hair colour were more frequent in the sampled populations and therefore easier to differentiate from dark brown in the phenotyping procedure. Although red hair is also rare in the European population (albeit enriched for in our Polish dataset), this problem is less expected for red hair as red is usually well differentiable from other hair colours, perhaps with the exception of the blond-red individuals. The prediction accuracy for blond hair, being lower than those for red and brown hair colour with and without threshold, is partly due to another phenomenon that will be discussed in detail in Section 3.3: age-dependent hair colour changes. As brown hair is the intermediary stage between blond and black, no prediction threshold for this category is required as can be seen in Table 4. Even at the 75% correct call rate, 5 of the 8 incorrectly predicted individuals defined themselves as being dark blond.
Since we know an overlap exists between light-brown and dark-blond in people's perception and definition of colour, it is best to consider dark blond the same colour as light brown. Therefore, brown hair colour may also be seen at black and blond category predictions <0.7 p depending on their light and dark shade predictions and this is where the use of the prediction guide (see next paragraph) is more informative. For the red hair category, as its occurrence is independent of the continuous spectrum of dark to light (black to blond), and mutations in the MC1R gene produce a prediction within the category of red hair, all (with >0.7 p threshold) or nearly all (89.5% without threshold) individuals for which the red hair category was the highest prediction probability were correctly predicted as seen in Table 4. Notably, the two individuals that were incorrectly predicted red without using a threshold defined themselves as blond and brown, respectively; upon inspection of a hair image of the latter individual that was available to us, it did in fact display light red hints of colour. This is another example of how the phenotyping procedure, particularly self-reported hair colour grading as done in our Irish and Greek datasets, influences DNA prediction accuracy. However, it is important to point out here that for 11 (39%) individuals that had defined themselves as having red hair, the red hair probability was not the highest, relative to probabilities for non-red hair colour, and these individuals were therefore missed by HIrisPlex using this highest-probability approach. Furthermore, for 8 (6%) of the phenotypic blond, 96 (80%) of the phenotypic brown, and 17 (59%) of the phenotypic black hair individuals the highest predicted hair colour category did not correspond to the phenotypic hair colour category and hence these individuals were missed using this highest-probability approach.
This illustrates the limitation with the highest-probability approach that we aimed to overcome by developing and applying a prediction-guide approach as discussed next.
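The highest probability category approach, with and without a threshold, can be sketched in a few lines of Python. This is only an illustration of the decision rule as described in the text; the function name and the example probabilities are hypothetical and do not come from the HIrisPlex prediction tool.

```python
def highest_probability_call(probabilities, threshold=None):
    """Return the category with the highest probability.

    If a threshold is given, a call is only made when the highest
    probability exceeds it; otherwise None (no call) is returned.
    """
    category, p = max(probabilities.items(), key=lambda item: item[1])
    if threshold is not None and p <= threshold:
        return None
    return category


# Hypothetical probabilities for one individual (illustration only).
example = {"blond": 0.55, "brown": 0.35, "red": 0.02, "black": 0.08}

call_without_threshold = highest_probability_call(example)
call_with_threshold = highest_probability_call(example, threshold=0.7)
print(call_without_threshold, call_with_threshold)
```

In this example the individual is called blond when no threshold is applied but receives no call at the p > 0.7 threshold, which is the kind of case where the shade probabilities of the prediction guide become informative.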
Table 4. HIrisPlex hair colour prediction accuracies obtained from a separate model-testing set of 308 individuals from Poland, Ireland and Greece (these individuals were not considered for prediction model building, for which a different set of 1243 individuals was used) using two approaches: the highest probability category approach (with and without thresholds) and the prediction guide approach (see Fig. 4 for the prediction guide).
To take full advantage of the genotype–phenotype relationship for hair colour and the 22 hair-colour predictive DNA variants included in the HIrisPlex system we developed a hair colour prediction guide considering categorical hair colour probabilities in combination with light/dark hair colour shade probabilities as obtained from the HIrisPlex genotype data (Fig. 4, see also Section 3.5 for additional practical recommendations). The reason for considering light/dark shade prediction in addition to categorical hair colour prediction in the final approach is that the 22 DNA variants not only impact on the main hair colour categories, but also on more detailed hair colour information, which is difficult to measure; hence, we express in light/dark prediction probabilities. For this, we took the individuals from the black category, now termed dark, and the individuals from the blond category, now termed light, and designed an additional prediction model for light and dark colour shade. Therefore, the HIrisPlex genotype input finally provides the core prediction colour category with an added level or shade, i.e. light or dark. This part of the prediction should be useful as additional information to the initial prediction category, e.g. to differentiate light blond from dark blond (light brown), or light brown from dark brown/black. It becomes particularly beneficial in the lower hair colour category prediction probability levels (i.e. category prediction <0.7 p for non-red) as the categories are closer together and may be more difficult to accurately predict one category over another due to given genotype combinations. A >0.9 threshold is used for light versus dark shade prediction. As seen in Table 4(C), using the prediction guide approach the correct call percentages were for all hair colours considerably higher than using the highest probability category approach, except for red hair. 
In fact, using the prediction guide approach we obtained on average 69.5% correct calls for blond, 78.5% for brown and 87.5% for black. Black hair prediction in particular was strongly improved by the prediction guide approach, with an increase of almost 60% on average relative to the highest probability category approach without a threshold. For an explanation of why blond is the least accurately predictable hair colour with currently available DNA markers, also after applying the prediction guide, see Section 3.3. Although we saw an apparent decrease of accurate prediction for red hair with the prediction guide approach (80% versus 89.5% with the highest-probability approach without threshold), this can be explained by the total number of red predictions made by the models and whether they were correct. In particular, with the highest probability approach the model incorrectly predicted red only 2 times but missed 11 actual reds from our dataset. The prediction guide approach, although inaccurate for red hair prediction in 6 individuals, predicted 24 of the 28 actual red hair phenotypes in our test set. In summary, the numbers of individuals in our 308 model-test set that were missed by HIrisPlex hair colour prediction using the prediction guide approach were 4 (14%) of the phenotypic red, 8 (19.5%) of the phenotypic blond, 7 (6%) of the phenotypic d-blond/l-brown, 28 (31%) of the phenotypic d-brown and 26 (90%) of the phenotypic black, with an overall hair colour prediction accuracy of 76%. All are considerably fewer than were missed when applying the highest probability category approach, apart from black hair, where we believe phenotyping inaccuracy/perception of colour plays a role as discussed above, as 21 of those individuals were predicted as having d-brown hair and may have in fact displayed d-brown hair that was perceived as black within Europe.
We therefore recommend using the prediction guide approach for properly interpreting HIrisPlex genotype data and the probability values derived from our prediction tool to infer the most likely hair colour phenotype in future practical applications.
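The red hair figures quoted above distinguish two quantities: the correct call rate (the fraction of red predictions that were right) and the proportion of actual red-haired individuals that were missed. The short Python sketch below reproduces both from the counts given in the text; the function name is ours, and the number of correct red calls under the highest probability approach (28 − 11 = 17) is derived from the stated counts rather than quoted directly.

```python
def call_rates(correct, incorrect, missed):
    """Return (correct call rate, missed rate) for one hair colour category.

    correct   : individuals predicted as the category who have that phenotype
    incorrect : individuals predicted as the category who do not
    missed    : individuals with the phenotype who were not predicted as such
    """
    correct_call_rate = correct / (correct + incorrect)
    missed_rate = missed / (correct + missed)
    return correct_call_rate, missed_rate


actual_red = 28  # phenotypic red hair individuals in the 308 test set

# Highest probability approach without threshold: 2 wrong red calls, 11 reds missed.
highest = call_rates(correct=actual_red - 11, incorrect=2, missed=11)

# Prediction guide approach: 6 wrong red calls, 24 of 28 reds predicted.
guide = call_rates(correct=24, incorrect=6, missed=actual_red - 24)

for name, (rate, missed) in {"highest probability": highest, "prediction guide": guide}.items():
    print(f"{name}: {rate:.1%} correct calls, {missed:.1%} of actual reds missed")
```

The prediction guide thus trades a lower correct call rate for red (80% versus 89.5%) against a much smaller share of missed red-haired individuals (about 14% versus 39%).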
There are several important differences between eye and hair colour, both on the phenotypic as well as the genotypic levels, that may play a role in why some eye colours (i.e. blue and brown) appear to be currently predictable from DNA with higher accuracy than some hair colours (i.e. all non-red hair colours). Rs12913832 from the HERC2 gene plays a major role in the functional aspects of iris pigmentation [
] and its proposed model of action reflects a type of on/off switch from the absence of the T allele (and the homozygous presence of the C-allele) resulting in blue eye colour, to the presence of one or two T allele(s) reflecting brown eye colour [
] via a series of functional genetic experiments that the rs12913832 T-allele leads to binding of several transcription factors and a chromatin loop with the promoter of the neighbouring pigmentation gene OCA2 leading to elevated OCA2 expression and dark pigmentation. In contrast, when the rs12913832 C-allele is present, transcription factor binding, loop formation and OCA2 expression are all reduced leading to light pigmentation. Because of its strong functional involvement, HERC2 rs12913832 shows the strongest predictive power on categorical eye colour with an AUC of 0.877 for blue and 0.899 for brown alone for this SNP [
]. However, the effect of rs12913832 is considerably smaller on hair colour than on eye colour for reasons yet to be unveiled, and there is no other high-impact hair colour SNP that takes its place. For instance, in our full dataset of 1551 individuals, the correlation of rs12913832 with eye colour is nearly twice as high (Pearson correlation r² = 0.46, p = 2.2e−16) as its correlation with hair colour (Pearson correlation r² = 0.24, p = 2.2e−16). Furthermore, the colour distribution of European hair appears much wider than that of European eyes, requiring the combination of several similar gene effects [
]. Thus, categorical hair colour prediction is expected to be more error-prone, especially when factors such as shade and intensity are involved, at least with the DNA markers known thus far. Additional effects such as environmental contributions, particularly over a lifetime, that are much stronger on certain hair colours than on any eye colour also influence hair colour prediction accuracy more than eye colour prediction accuracy and are discussed in the following section (Section 3.3).
3.3 Age-dependent hair colour changes and consequences for hair colour prediction
Age-dependent changes in hair colour are evident from anecdotal knowledge. The most often observed age-dependent hair colour change occurs from light blond during childhood towards dark blond/light brown in adulthood, but change can also occur from light brown to dark brown/almost black. Suggestions of hormonal changes during adolescence have been advocated as a possible explanation [
], but the molecular basis is yet to be unveiled. In order to study the effect of age-dependent hair colour change on hair colour prediction from child to adulthood we recorded via questionnaires in the Irish sample set hair colour during childhood and adulthood separately, including the approximate age of the hair colour change. Of the 339 Irish individuals, 157 had current images in which the hair was not dyed and not grey, and from these the 8 individuals that were classified as blond in adulthood were 100% correctly predicted by the HIrisPlex system following the prediction guide approach. However, for 14 individuals with light brown to black phenotypes the HIrisPlex model failed and gave a high blond prediction probability (>0.7 p) with high light shade probabilities (>0.9 p). On further examination of these incorrectly predicted individuals, 8 (57%) of them noted that a darkening of hair colour from blond to brown had occurred earlier in their lives, at ages ranging from 9 to 12 years. Furthermore, we found a high and statistically significant correlation (Pearson correlation r² = 0.81, p < 0.01) between the increase in brown (darkening of hair) and the increase in age since the hair colour change occurred for those Irish individuals for whom such data were available to us (Fig. 7), which substantiates that the hair colour change observed is age dependent in these individuals. From these data we can see that our current HIrisPlex system works to a high degree of accuracy for hair colour prediction, but there may be processes that alter the hair colour over an individual's lifetime (possibly molecular processes) without changing the HIrisPlex predicted hair colour of the individual. For instance, an adult that had blond hair as a young child, but now displays light–dark brown/black hair colour is likely to display blond HIrisPlex genotypes and therefore a blond hair colour prediction will be obtained.
This is due to the fact that the hair colour SNPs included in the HIrisPlex system, as well as any additional hair colour associated DNA variant available today, were identified in studies dealing with adults, and not in studies that specifically searched for bio-markers informative for the age-dependent hair colour change, which have yet to be carried out. It is important to note therefore that the HIrisPlex model cannot distinguish between these change-affected individuals and blonds who remain blond from childhood to adulthood, and thus a HIrisPlex prediction of blond hair may be inaccurate to a certain degree (30% (Supplementary Table 3) in our dataset). This limitation in DNA-based hair colour prediction will remain as long as bio-markers informative for indicating age-dependent hair colour changes are not identified. Furthermore, this age-dependent study was conducted using images solely taken from a small Irish set (childhood hair colour was not available for the Polish and the Greek set); it is worth mentioning that this may reflect a trend in other countries within Europe, however we do not have this information at present. Therefore more samples and increased accuracy testing of the HIrisPlex system on a broader collection around Europe would be advantageous to get a better measure of this phenomenon. Furthermore, efforts should be directed towards finding the processes/genes responsible for age-dependent hair colour changes and developing respective bio-markers that may increase hair colour prediction accuracy in the future.
A different aspect of age-dependent hair colour change is the loss of hair colour when turning grey and white at a more or less advanced age, which likely represents a different mechanism of action [
] than changing from one hair colour to another. We examined the Irish sample set of 339 individuals for which we had questionnaire information on the age at which grey or white hairs had started to grow. As shown in Supplementary Fig. 1, after the age of 30 there are more individuals starting to produce grey or white hairs relative to those who do not, confirming anecdotal knowledge. However, we have no data on how long it will take for those individuals who started to have grey hairs to turn grey to a substantially obvious phenotypic degree. For practical considerations, knowing the natural hair colour during youth of an individual who now, at a more advanced age, displays an obvious grey or white hair phenotype will not be directly useful in an investigative search, but this information can still be useful, albeit less strongly, when the natural hair colour prior to greying of such individuals is asked for during a police inquiry. For differentiating whether a crime scene sample donor still had his/her natural hair colour, or perhaps turned grey or white already, a molecular age estimation performed on crime scene samples such as blood would be useful in combination with the HIrisPlex application. Previously, our group developed a DNA test for chronological age, which allows age-group estimation on an accurate level [
]. Obviously, any dyed hair colour, as long as it produces a hair colour different from the natural hair colour category, would not be identifiable with HIrisPlex or any other DNA-based hair colour prediction tool. However, in general it is believed that many people who dye their hair as a result of hair greying, and with the intention of hiding the fact that their hair has greyed, try to achieve their natural hair colour category via dyeing, especially in the case of men, to avoid the stigma associated with hair colouring. In such cases HIrisPlex hair colour prediction can still be useful even though the hair is dyed.
3.4 HIrisPlex analysis on a worldwide scale
Due to the fact that the HIrisPlex hair prediction model was created using individuals solely from Europe, as it should be for a European trait, to verify its use outside of Europe we performed HIrisPlex analysis on worldwide DNA samples from the H952 subset of the HGDP-CEPH panel that represents 952 individuals from 51 populations [
]. Due to lack of DNA in some samples, a final number of 945 worldwide samples was used. Fig. 5 displays the prediction of the four hair-colour categories blond, brown, red and black on a worldwide scale. This figure does not use any threshold parameters; it is therefore worth noting that predictions of blond hair in Europe (especially with probability values <0.7 p) may reflect more of a brown hair colour prediction upon inspection of the probability values and the prediction guide in Fig. 4 that should be used. Although the actual hair colour of the HGDP-CEPH individuals is not known, we follow the general knowledge that individuals distant from Europe and its neighbouring regions (i.e. Middle East and parts of West Asia) display a black hair colour phenotype as illustrated by proposed figures of hair colour distribution [
], (with an image depiction found at http://cogweb.ucla.edu/ep/Frost_06.html). As seen from Fig. 5, for every individual who originates from regions that are distant from Europe and neighbouring regions, namely East Asia, Oceania, Sub-Saharan Africa and the Americas where only black hair is assumed to be present, HIrisPlex indeed predicts black hair as the only hair colour with no exception. Only in Europe, Russia, Israel and parts of Pakistan, the regions covered by HGDP-CEPH samples where hair colour variation is assumed to be present, does HIrisPlex predict individuals with red, blond, brown as well as black hair colour. This mirrors our earlier findings using the IrisPlex system for worldwide eye colour prediction, where only brown eye colour was predicted in East Asia, Oceania, Sub-Saharan Africa and the Americas (with a single exception of an individual who was below the 0.7 p threshold level but still displayed a brown eye colour prediction); i.e. the worldwide regions where only brown eyes are assumed to exist. Also in Europe, Russia, Israel and parts of Pakistan where there is assumed eye colour variation, IrisPlex indeed predicted blue, intermediate and brown eye colour [
]. These results suggest that HIrisPlex hair and eye colour prediction is reliable on the worldwide scale and highlights that HIrisPlex hair and eye colour prediction can be applied independently from bio-geographic ancestry knowledge and without the need for extra DNA ancestry testing in practical applications such as forensics.
Furthermore, we examined the potential ability of the 24 DNA variants included in the HIrisPlex system to infer biogeographic ancestry. It had been advocated before that SNPs from pigmentation genes are useful for genetic ancestry detection [
]. Previously we had shown that the 6 SNPs from the IrisPlex system were able to separate Europeans from Non-Europeans to a certain degree on the population (but not necessarily on the individual) level [
]. Fig. 6 shows a two-dimensional plot from a non-metric multidimensional scaling (MDS) analysis of pairwise FST values estimated between pairs of all the 51 HGDP-CEPH populations using the 24 DNA variants of the HIrisPlex system (S-stress value 0.04030). As evident, the 1st dimension separates the European populations (except Sardinians and Adygei) from all non-European populations with all Middle-Eastern populations and the Kalash from Pakistan. Hence almost all groups with predicted hair colour variation are clustered closer to the European groups, whereas the East Asian groups together with the American groups cluster at the farthest distance from the Europeans. The 2nd dimension separates African groups on one side and Oceanian groups on the other side from all other worldwide groups, which appear in the centre. We then performed an AMOVA test to see how much of the total genetic variation provided by these 24 eye and hair colour predictive DNA variants is explained by geography when assigning the 51 populations into seven continental groups: Europe, Middle East, Africa, Central South Asia, East Asia, Oceania and America. A remarkably high variance proportion of 24.44% was estimated from 1100 permutations, which was highly statistically significant (p < 0.000005). When separating the 51 populations into two groups, i.e. Europeans and non-Europeans, we obtained a very similarly high variance proportion of 24.76% (p < 0.000005) from 1100 permutations. Grouping the 945 individuals according to their predicted hair colour categories (black, brown, red and blond) resulted in an only slightly higher variance proportion of 29.79% (p < 0.000005) as expected for a European trait such as hair colour variation.
Motivated by this finding, we investigated a combined eye and hair prediction threshold to test if it may be possible to find out simply by means of HIrisPlex eye and hair colour probability strength if a brown-eyed and black haired individual originates from Europe or from a region distant to Europe. If successful, this would provide additional information to the sheer eye/hair colour prediction, as it may alleviate the potential need for ancestry testing in finding out more about an unknown crime scene sample donor/missing person. Obviously, a prediction with sufficiently high probability of blue or intermediate eye colour, as well as of brown, blond or red hair colour would already allow a conclusion that the person is of at least partial European descent. However, this is different for brown-eyed, black-haired predicted individuals as this phenotype combination occurs worldwide. The results of this non-European threshold assessment can be seen in Supplementary Fig. 2 with the breakdown of population numbers shown in Supplementary Table 1. Our data demonstrate that it is indeed possible to predict that a brown-eyed, black-haired individual is likely to have non-European ancestry (excluding the nearby regions of Middle East and partly North Asia and America) using a threshold of >0.7 p for black hair and >0.99 p for brown eyes and the respective prediction accuracy based on our dataset is 86.5% (see Supplementary Table 1 for precise numbers).
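The combined threshold rule described above can be written as a simple check on the two HIrisPlex probabilities. The Python sketch below is a minimal illustration using the thresholds stated in the text (>0.7 p for black hair and >0.99 p for brown eyes); the function name and the example values are hypothetical, and a positive result should be read with the reported accuracy of 86.5% and the stated exclusions for nearby regions in mind.

```python
def likely_non_european(black_hair_p, brown_eye_p,
                        hair_threshold=0.7, eye_threshold=0.99):
    """Apply the combined black hair / brown eye probability thresholds.

    Returns True only when both probabilities exceed their thresholds.
    """
    return black_hair_p > hair_threshold and brown_eye_p > eye_threshold


# Hypothetical probability values (illustration only).
print(likely_non_european(black_hair_p=0.85, brown_eye_p=0.995))  # both above threshold
print(likely_non_european(black_hair_p=0.85, brown_eye_p=0.95))   # eye probability too low
```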
We also investigated the worldwide allelic distribution of the 24 HIrisPlex DNA variants in the HGDP-CEPH samples separately for every DNA marker as shown in Supplementary Figs. 3–5 (except for rs12913832 (HERC2), rs1800407 (OCA2), rs12896399 (SLC24A4), rs16891982 (SLC45A2 (MATP)), rs1393350 (TYR) and rs12203592 (IRF4), as they can be found in Fig. 4 of our previous publication on worldwide IrisPlex analysis [
] and also displays high probability values for hair colour prediction, except for red (Table 3). Although the MC1R variants displayed in Supplementary Fig. 2(A) N29insA, (B) rs11547464, (D) rs1805008, (F) rs1805006, and Supplementary Fig. 3(G) rs1805007, (H) rs1805009, (I) Y152OCH, which are all “high penetrance” MC1R variants, as well as (K) rs1110400, a “low penetrance” MC1R variant, all have a distribution restricted to Europe and surrounding areas, as expected given their role in red hair that is normally observed in individuals with European and nearby ancestry, they are all quite rare, especially N29insA and Y152OCH. However, the remaining MC1R variants included in HIrisPlex (rs885479, rs1805005, rs2228479) show a variable distribution within Europe and its proximate areas, as well as outside these regions, which may explain the very rare occurrence of red hair individuals outside of Europe and surrounding areas [
], or that their effect size is rather minor. Notably, both rs1805005 (Supplementary Fig. 2(E), and rs2228479 (Supplementary Fig. 3(J)) were grouped into the MC1R_r low penetrance group for red hair prediction in our previous publication [
] and require a combination of MC1R alleles before the red hair phenotype is displayed due to their minor contributions towards red hair, which would explain their distribution outside Europe as their red hair effect is more minor. Rs885479 (Supplementary Fig. 2(C)) was also deemed a “low penetrance” SNP that is responsible for red hair colour production, but it seems to contribute to other hair colours as well as seen in its effect on the prediction model in Fig. 2, where the largest effect by its minor allele contribution was towards the darkening of hair colour (brown–black) in comparison to its contribution towards red hair colour prediction. This SNP is also noted to have a skin colour contribution, especially related to the evolution of lighter skin colour in East Asians [
], which mirrors its worldwide allelic distribution as shown here. Another HIrisPlex SNP with a peculiar worldwide allele distribution is rs28777 in the SLC45A2 (MATP) gene (Supplementary Fig. 3(L)), which reflects a pattern of European (and surrounding areas) versus Non-European differentiation due to its hair, in particular AA (black) versus CC (red) colour effect, but also due to its assumed skin colour association [
], which reflects this non-synonymous SNP's vital role in pigmentation. Rs683 (TYRP1) (Supplementary Fig. 4(R)) also reflects a slight European versus non-European pattern in terms of its TT genotype, which is present at a higher frequency within Europe and its surrounding areas than outside in which its counterpart allele GG is predominant. For the remaining SNPs, Supplementary Fig. 4(M), rs12821256 (KITLG), (N) rs4959270 (EXOC2), (P) rs2402130 (SLC24A4), (Q) rs2378249 (PIGU/ASIP), although associated with hair colour in Europeans, there is no discernable pattern of allelic distribution worldwide.
3.5 Considerations on the practical use of the HIrisPlex system for hair and eye colour prediction
The HIrisPlex system is capable of simultaneously predicting the hair and eye colour of an individual from DNA. Practical recommendations for eye colour prediction using the HIrisPlex system follow those previously published for the IrisPlex system [
] as the very same 6 SNPs and the very same eye colour prediction model used in IrisPlex are also used in the HIrisPlex system when it comes to eye colour. To allow easy use of the HIrisPlex system in practical applications, and to take full advantage of our eye and hair colour genotype and phenotype database and its relevant parameters for model-based prediction, we provide with the present paper the HIrisPlex hair and eye colour prediction tool (Supplementary Table 2). This tool is a combined Excel macro specifically designed to manage both the eye colour and the hair colour prediction models in an easy-to-use fashion that allows interactive use. Users simply input the number of minor alleles (0, 1 or 2) of each of the 24 DNA variants included in the HIrisPlex assay and a probability value for black, brown, red and blond hair colour is produced based on the underlying hair colour prediction model, as well as separately the probability of light and dark hair colour shade, and separately the eye colour probabilities of blue, intermediate and brown based on the underlying eye colour prediction model. This tool replaces our previously provided [
] Excel spreadsheet for eye colour prediction based on the IrisPlex system as it combines eye and hair colour prediction with the respective underlying database knowledge in one tool. For the most accurate interpretation of the categorical hair colour and hair shade prediction probabilities obtained from the Excel macro prediction tool (Supplementary Table 2), we recommend following the hair prediction guide as shown in Fig. 4 and described above.
As a working example of the tool, upon assessment of the 308 individuals used for model testing based solely on the highest probability category, we also assessed their hair colour prediction following the prediction guidelines set in this paper (Fig. 4) as well as eye colour assessment following the guide set in the pan-European IrisPlex paper we published previously [
]. This reflects how the DNA prediction of both pigmentation traits would be performed in practice, with a final hair colour prediction being made to the case officer, i.e. “the most probable hair colour is light blond”, including the accuracy at which the HIrisPlex system is able to predict the hair colour category based on current research (at present, based on our 308 individual test set), and the eye colour prediction would follow our previously published guidelines [
], i.e. the most probable eye colour result is brown above others (if its probability exceeds the 0.7 threshold), at an accuracy of 94% based on a European dataset of over 3800 individuals. In Fig. 8 we show four illustrative examples including eye and hair colour phenotypes from high-resolution photographs, the categorical eye and hair colour as well as hair shade probabilities as derived from HIrisPlex genotyping, and a summarising statement of the prediction outcomes as may be used for reporting purposes (these individuals were not used in modelling).
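The threshold rule for eye colour can be sketched as follows. The function name `call_eye_colour` and the choice to return `None` when no category exceeds the threshold are assumptions of this illustration; the example probabilities are invented.

```python
def call_eye_colour(probabilities, threshold=0.7):
    """Return the most probable category if its probability exceeds the threshold.

    Returning None for a below-threshold result is an assumption of this sketch,
    standing in for "no call at the chosen threshold".
    """
    category, probability = max(probabilities.items(), key=lambda item: item[1])
    return category if probability > threshold else None


# Invented example probabilities for two hypothetical individuals.
confident = call_eye_colour({"blue": 0.05, "intermediate": 0.10, "brown": 0.85})
uncertain = call_eye_colour({"blue": 0.50, "intermediate": 0.20, "brown": 0.30})
```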
Supplementary Table 3 provides the actual single grader (Polish) and self-reported (Irish and Greek) hair and eye colour phenotypes of the individuals and includes the final prediction that would be produced with the HIrisPlex system for hair and eye colour. An accuracy of 60% correct prediction for both hair and eye colour together (an individual being counted as incorrect if either the hair colour or the eye colour prediction was inaccurate) was achieved in this model testing set of 308 individuals. As expected, an increased number of individuals would be beneficial to test the accuracy of the HIrisPlex hair colour prediction model, especially from European countries other than Poland, Ireland and Greece, which were involved in modelling, to rule out any possible bias that may be present; this should be targeted in future studies. However, the relatively low percentage of correct combined eye and hair colour prediction in this test set is not only influenced by sample size but also by the different accuracies achieved for eye colour on one hand and hair colour on the other. For instance, in only 7% of the test individuals (all with intermediate eye and brown to black hair colours) were both pigmentation traits, eye and hair colour, predicted incorrectly.
When splitting up the accuracies in this test set for the two pigmentation traits separately, hair colour alone was correctly predicted in 76% of individuals using the prediction guide approach. Although different prediction accuracies were obtained for different hair colours as described above, the majority of the error lay in predicting a colour lighter than the physical phenotype, which can be attributed to the darkening of hair colour with age. Without having available biomarkers informative for the age-dependent hair colour change, we believe it will not be possible to dramatically reduce the prediction error currently obtained in such individuals. Consequently, basic research in the molecular biology of age-dependent hair colour changes is required to investigate whether such biomarkers can indeed be developed for future applications such as forensics.
For eye colour categories alone, 76% of individuals gave probabilities that were correctly predicted in this set without using a threshold, or 82% by applying the >0.7 p threshold as we advocated before [
]. This overall estimate of eye colour prediction accuracy is strongly influenced by the intermediate category, which, with the currently available SNPs, is known to be by far the least accurately predictable eye colour category in relation to blue and brown. In fact, the majority (59%) of individuals in this test set that showed inaccurate eye colour prediction belonged to the intermediate category, and only 14% of the intermediate eye-coloured individuals (total n = 50) were predicted correctly. In contrast, and even without considering the previously suggested probability threshold of 0.7 and omitting the phenotypic intermediate individuals, in only 8% of cases did the HIrisPlex system provide an incorrect prediction for individuals who had phenotypic blue eye colour and in only 18% of cases for individuals who had phenotypic brown eye colour in this set. This reflects an accuracy of 88% (n = 258) for blue and brown eye colours alone in this test set, or 94% (n = 194) when applying the >0.7 probability threshold for correctly predicting the phenotypic blue and brown-eyed individuals within this test set. Our previous IrisPlex study on >3800 individuals from seven countries of different parts of Europe also provided an overall eye colour prediction accuracy of 94% for blue and brown using the >0.7 p threshold [
]. This indicates that the eye colour accuracy when just considering blue and brown eye colour predictions is much higher than in the prediction of all three categories, blue, brown and intermediate, mainly due to the fact that currently, DNA markers with the ability to strongly predict non-blue and non-brown eye colours are lacking and need to be established in future basic research.
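As a consistency check, the overall eye colour accuracy without a threshold can be approximately recovered from the category-level figures quoted above. The counts in this sketch are back-calculated from rounded percentages, so they are approximations rather than reported data.

```python
# Figures quoted in the text for the model testing set.
n_total = 308
n_intermediate = 50
n_blue_brown = n_total - n_intermediate  # 258, as stated in the text

# Approximate numbers of correct predictions, derived from rounded percentages.
correct_intermediate = 0.14 * n_intermediate  # 14% of intermediate individuals
correct_blue_brown = 0.88 * n_blue_brown      # 88% of blue and brown individuals

# Weighted overall accuracy across all three eye colour categories.
overall_accuracy = (correct_intermediate + correct_blue_brown) / n_total
print(f"overall eye colour accuracy: {overall_accuracy:.0%}")  # prints 76%
```

The result agrees with the 76% reported for all three eye colour categories without a threshold, which shows how strongly the 50 intermediate individuals pull down the overall figure.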
Some of the individuals categorised here as intermediate eye colour in fact carry green eyes and some DNA variants have been previously suggested to be informative for green eye prediction such as OCA2 rs1800407 [
] stated that green eye prediction with a high degree of accuracy is possible using specific genotype combinations, i.e. A/G at rs12913832 plus T/T at rs12203592, designated combo 1, or G/G at rs12913832 plus C/C at rs16891982, designated combo 2. We were interested to see if we could improve the green eye prediction in our test set, where with HIrisPlex we only achieve 5 correct intermediate/green (19%) predictions from the 27 phenotypic green-eyed individuals considered. Using their guidelines, we found that combo 1 predicted only 2 of the 27 green individuals (8%), which is less than half of the ones correctly predicted by HIrisPlex, and wrongly predicted 3 blues as green. Combo 2 did not exist within this set of 308 individuals; hence, none of the remaining 25 green-eyed individuals could be identified with this combo. Therefore, applying the approach of Pneuman et al. [
], that intensified basic research into the genetics underlying green eye colour is needed before better markers for green eye prediction in practical applications such as forensics can be provided.
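The two genotype combinations tested above can be expressed as a simple lookup. This is a minimal sketch: the function names and the "A/G" string format are assumptions, and allele labels are taken as written in the text, without any strand conversion.

```python
def _normalise(genotype):
    """Turn a genotype string such as 'G/A' into an order-independent tuple."""
    return tuple(sorted(genotype.upper().split("/")))


def green_eye_combo(genotypes):
    """Return 'combo 1', 'combo 2' or None for a dict of SNP id -> genotype."""
    rs12913832 = _normalise(genotypes.get("rs12913832", ""))
    rs12203592 = _normalise(genotypes.get("rs12203592", ""))
    rs16891982 = _normalise(genotypes.get("rs16891982", ""))

    # Combo 1: A/G at rs12913832 plus T/T at rs12203592.
    if rs12913832 == ("A", "G") and rs12203592 == ("T", "T"):
        return "combo 1"
    # Combo 2: G/G at rs12913832 plus C/C at rs16891982.
    if rs12913832 == ("G", "G") and rs16891982 == ("C", "C"):
        return "combo 2"
    return None


# Invented example genotypes for one hypothetical individual.
example = green_eye_combo(
    {"rs12913832": "G/A", "rs12203592": "T/T", "rs16891982": "G/G"}
)
```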
The hereby introduced HIrisPlex system is capable of simultaneously predicting hair and eye colour phenotypes from DNA using a single 24-multiplex assay and a combined eye and hair colour prediction tool. The HIrisPlex genotyping assay is highly sensitive allowing successful genotyping down to at least 63 pg starting DNA, and is capable of successfully coping with degraded DNA due to fragment sizes of <160 bp. An on-going developmental validation study of the HIrisPlex assay will deliver additional characteristics relevant for forensic applications. The HIrisPlex hair colour prediction model and prediction guide revealed on average individual-based hair colour prediction accuracies of 69.5% for blond, 78.5% for brown, 80% for red and 87.5% for black hair. The HIrisPlex system provides reliable hair colour prediction independent from bio-geographic ancestry as we previously also showed for eye colour prediction and the IrisPlex system, which represents the eye colour prediction part of the new HIrisPlex system. HIrisPlex hair and eye colour prediction in practical applications is eased by providing a user-friendly Excel spreadsheet requiring not more than the input of the number of minor alleles of the 24 assay DNA variants. It produces individual probabilities for four hair colour categories (red, blond, brown, and black) and hair colour shade (light and dark) – used together and following the prediction guide approach we provide here, this allows a more specific hair colour estimation than available from the categorical approach alone. This spreadsheet also delivers three eye colour categories (blue, intermediate, and brown) based on the previously developed and validated IrisPlex model. 
As an extra element with investigative value we demonstrate here that it is possible to infer bio-geographic ancestry on the level of European (including nearby regions) versus non-European (excluding nearby regions) origin from the strength of HIrisPlex hair and eye colour probabilities for brown eyed and black haired individuals distributed worldwide (whereas non-brown eye colour and non-black hair colour per se indicate an origin in Europe, including nearby regions).
Current limitations of the HIrisPlex system are in accurately predicting hair colour in those individuals who underwent age-dependent changes that influenced category shifts (such as blond to brown) because of the current unavailability of biomarkers to indicate such a colour change, and in accurately predicting intermediate eye colours such as green because of the current unavailability of good DNA predictors for these non-blue and non-brown eye colours. Basic research for finding more appropriate bio-markers for these aspects is needed to overcome current limitations of DNA-based eye and hair colour prediction in the future. Furthermore, future research is needed on the biology and genetics of hair greying, and the development of informative bio-markers for its molecular prediction. Last but not least, and similar to our previous proclamation on eye colour [
], we would like to emphasise here that only moving DNA-based hair (and eye) colour prediction from the current categorical level to a future continuous level, aiming to accurately predict all shades of hair (and eye) colour including age-dependent changes in early and in advanced ages, will provide the highest level of accuracy, as may be wished by the investigating authorities for forensic applications. Notably, such a continuous prediction approach will also avoid the current uncertainties that come with the varying interpretation of hair and eye colour categories by different receiving investigators, by providing them with actual hair and eye colour charts or printouts to be used for tracing an unknown person, instead of a simplified colour category as is possible for the time being.
We are very grateful to the study participants for providing samples including eye and hair images. We would also like to thank Professor Tommie McCarthy of University College Cork (UCC), Ireland for helping with sample collection. This work was funded in part by the Netherlands Forensic Institute (NFI) and by a grant from the Netherlands Genomics Initiative (NGI) / Netherlands Organization for Scientific Research (NWO) within the framework of the Forensic Genomics Consortium Netherlands (FGCN), and was furthermore supported in part by a grant from the Ministry of Science and Higher Education in Poland no ON301115136 to W.B.
Frequency of individuals called non-European in the 51 populations from the HGDP-CEPH H952 set when using black hair colour probabilities >0.7 on their own and in conjunction with brown eye colour probabilities >0.99. Includes, per population, the percentage ability to differentiate a black-haired, brown-eyed European from a non-European with black hair and brown eyes.
Interactive HIrisPlex prediction tool for hair and eye colour: an easy-to-use Excel macro to input the number of minor alleles generated from the HIrisPlex genotypes. The output of the tool gives the individual probabilities of the four hair colour categories (black, brown, red and blond), two hair colour shade categories (light and dark), and three category probabilities for eye colour (blue, intermediate and brown) given the HIrisPlex genotype and based on a prediction model obtained from 1243 Polish, Irish and Greek individuals. For accurate interpretation of hair colour and shade prediction probabilities and to derive the final most likely individual hair colour category see prediction guide in Fig. 4. For accurate interpretation of eye colour prediction probabilities and to derive the final most likely individual eye colour category see recommendations described in Walsh et al. [
Prediction calls for the test set of 308 individuals, including HIrisPlex probabilities for hair colour categories (including hair shade) and the final prediction call for hair colour (considering colour and shade based on the guide in Fig. 4) as well as eye colour prediction accuracies based on our recommendations described in Walsh et al. [