Department of Forensic Genetics and Forensic Toxicology, National Board of Forensic Medicine, Linköping, Sweden
Department of Biomedical and Clinical Sciences, Faculty of Medicine and Health Sciences, Linköping University, Linköping, Sweden
Comprehensive review of investigative genetic genealogy from a forensic perspective.
Background outlined for the DNA methodology and long-range familial searching process.
Survey of current direct-to-consumer testing companies connected to investigative genetic genealogy.
Overview of DNA technologies focusing on high-density SNP genotyping.
Investigative genetic genealogy (IGG) has emerged as a new, rapidly growing field of forensic science. We describe the process whereby dense SNP data, commonly comprising more than half a million markers, are employed to infer distant relationships. By distant we refer to degrees of relatedness exceeding that of first cousins. We review how methods of relationship matching and SNP analysis on an enlarged scale are used in a forensic setting to identify a suspect in a criminal investigation or a missing person. There is currently a strong need in forensic genetics not only to understand the underlying models to infer relatedness but also to fully explore the DNA technologies and data used in IGG. This review brings together many of the topics and examines their effectiveness and operational limits, while suggesting future directions for their forensic validation. We further investigated the methods used by the major direct-to-consumer (DTC) genetic ancestry testing companies and submitted a questionnaire in which providers of forensic genetic genealogy summarized their operations and services. Although most of the DTC market, and genetic genealogy in general, relies on undisclosed, proprietary algorithms, we review the current knowledge where information has been discussed and published more openly.
It is a fundamental principle of genetics that individuals who are closely related will share DNA from their common ancestors; and the more distant the relationship, the less DNA is shared. Familial searching of national DNA databases [
] using 16–22 autosomal STRs will only provide links through partial matches to immediate relatives such as siblings, parent-offspring (50% of DNA shared) or, at most, avuncular relationships, e.g. uncle-nephew (25% shared); although even half-sibling relationships can be difficult to resolve with limited STR data. Once familial searching is extended over a longer range to pairwise comparisons of first cousins, second cousins, third cousins and beyond (12.5%, 3.13% and 0.78% DNA shared, respectively) there is the requirement for genetic variation at much higher densities than the standard forensic tests have been able to achieve up till now. High-resolution commercial direct-to-consumer tests which include a relative-matching feature have been available for more than a decade [
]. These tests are currently analyzed using high-density microarrays genotyping more than 600,000 SNPs, providing matches with both close and distant relatives. By distant we refer to degrees of relatedness exceeding that of first cousins, in contrast to genealogists who use the term distant for relationships beyond 4th or 5th cousins. Genealogists have used these tests routinely since their inception as a tool to help with their family history research, both to confirm existing relationships and find new relatives [
], with thousands of adoptees, donor-conceived individuals and foundlings successfully using the commercial tests to connect with siblings and identify biological parents. Conversely, tests have revealed unexpected discoveries such as the finding of unknown siblings or the discovery that the social parent is not the biological parent [
]. Therefore, it was only a question of time before the same techniques were applied to forensic DNA from a crime-scene or the remains of missing persons. The barrier hindering the forensic implementation of long-range familial searching was the lack of a method to generate the required high-density SNP data from degraded DNA which would be compatible with the genetic genealogy databases.
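The expected sharing percentages quoted above follow from counting meioses. As a minimal sketch (the function names are ours and purely illustrative), each common ancestor contributes (1/2)^n to the expected shared fraction, where n is the number of meioses separating the two relatives through that ancestor. These are expectations only; actual sharing between distant relatives varies considerably around them.

```python
from fractions import Fraction


def expected_shared_fraction(meioses: int, common_ancestors: int = 2) -> Fraction:
    """Expected fraction of autosomal DNA shared by two relatives.

    meioses: number of meioses on the path linking the two relatives
             through one common ancestor.
    common_ancestors: 2 for relationships through an ancestral couple
             (e.g. full cousins), 1 for half relationships.
    """
    return common_ancestors * Fraction(1, 2) ** meioses


def cousin_sharing(degree: int) -> Fraction:
    """Expected sharing for full n-th cousins: 2 * (degree + 1) meioses."""
    return expected_shared_fraction(2 * (degree + 1), common_ancestors=2)


if __name__ == "__main__":
    print("uncle-nephew:", float(expected_shared_fraction(3)) * 100, "%")
    for degree in (1, 2, 3):
        print(f"cousin degree {degree}:", float(cousin_sharing(degree)) * 100, "%")
```

The same function covers half relationships by setting `common_ancestors=1`, for example half siblings with two meioses.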
Three major factors are necessary to reach the level of effectiveness for relative matching achieved by genetic genealogy: i. large-scale autosomal SNP genotype data with marker numbers in the hundreds of thousands and available at an affordable price; ii. large databases of these SNP genotypes open to public access; and iii. a simple but well-founded system for comparing related pairs using this large-scale SNP data. While the use of dense SNP microarray data had already been studied in forensic contexts [
], such technology became readily available to the public in 2007 through direct-to-consumer testing companies (the ‘DTCs’) with the launch of tests from deCODE Genetics and 23andMe, costing nearly $1000 [
]. Early tests were based on the Illumina OmniExpress microarray, but the field is now dominated by the Illumina Infinium Global Screening Array (GSA), which currently has a core set of 654,027 SNPs and the ability to add up to 50,000 custom markers.
As the cost of testing decreased and more companies entered the market, SNP databases began to grow exponentially. The inflection point was reached in 2018 and in that year more DNA tests were sold than in all previous years combined [
Table 2. Analysis and SNP genotyping details of the four main DTCs and GEDmatch. Information has been compiled from the company websites as well as the scientific publications given in the table. When data were not available, ‘n/a’ is given.
Option 1: 9 cM and at least 700 SNPs for one half-identical region; Option 2: 5 cM and 700 SNPs with at least two half-identical regions being shared
6 cM per segment before the Timber algorithm is applied and a total of at least 8 cM after Timber is applied
Option 1: 9 cM and 500 SNPs for one half-identical region; Option 2: 7.7 cM for the first half-identical region and a total of at least 20 cM (including the shorter matching HIRs between 1 cM and 7 cM); Option 3: 5.5 cM and at least 500 SNPs for the first half-identical region for about 1% of customers who come from specific non-European populations
7 cM. Default SNP count is set to vary dynamically. SNPs down to 3 cM can be seen in the One-to-One tool
8 cM for the first matching segment and at least 6 cM for the 2nd matching segment; 12 cM for the first matching segment in people whose ancestry is at least 50% Ashkenazi
X-DNA match thresholds
For half-IBD segments: Male vs male: 200 SNPs, 1 cM; male vs female: 600 SNPs, 6 cM; female vs female: 1200 SNPs, 6 cM; For full-IBD segments: 500 SNPs, 5 cM
1 cM and 500 SNPs for both males and females; matches must already meet the autosomal DNA matching criteria
7 cM. Default SNP count is set to vary dynamically. SNPs down to 3 cM can be seen in the One-to-One tool
Of the commercial companies, only FTDNA allows law-enforcement matching within the opted in section of its database. GEDmatch, a citizen science website founded in 2010, proved crucial to the initial development of investigative genetic genealogy. GEDmatch allows DNA profiles to be uploaded from a wide variety of sources, including law enforcement samples, so that cross-company comparisons can be performed using an additional range of tools.
The arrest of Joseph DeAngelo as the suspected Golden State Killer in 2018 brought the investigative use of genetic genealogy to the world’s attention [
]. Many of the technical details around the analysis of forensic DNA for long-range familial searching are still not in the public domain, as commercial interests restrict publication of much of the information needed to properly assess how large-scale SNP genotyping techniques are applied to evidential material – typically with DNA limited in quantity and quality. In addition, there is a lack of transparency on the part of law enforcement agencies. IGG is used to generate an investigative lead and the details of the IGG work have not yet been scrutinized in court. Contradictory stories of how the Golden State Killer was caught have been published and further details only became available two years after his arrest from information leaked to the Los Angeles Times.
] (referred to as segments), alternative analyses exist and are being developed which could offer more viable approaches when insufficient SNP genotypes from poor DNA prevent reliable segment matching [
]. In this review, we attempt to fill some of the gaps in knowledge that currently exist, with emphasis on the DNA analysis regimes in use for long-range familial searches. To compensate for the lack of information in the public domain we sent out a questionnaire to some of the forensic science providers in the US. This includes a number of questions relating to the use of technologies and genetic genealogy in their assistance to law enforcement. The answers are submitted from private companies, potentially with conflicts of interests, and we have taken care to peer-review them as far as possible. The responses received to the questionnaire are compiled in Supplementary File S1.
We use the term investigative genetic genealogy (IGG), also known as forensic genetic genealogy, to describe the use of SNP-based relative matching combined with family tree research to produce investigative leads in criminal investigations and missing persons cases. The term forensic genealogy is sometimes used in this context but has a distinct meaning in US genealogical circles and relates to all questions of a legal nature that require genealogical analyses, including disputed inheritance, identification of military personnel and citizenship claims.
] provide informative overviews of genetic genealogy used in forensic investigations. Useful additional information, an updated review of forensic genetic genealogy practice and a list of many successful crime investigations was provided in 2020 by Katsanis [
]. The aim of relationship inference, as defined in this review, is to determine whether regions of DNA are shared identical by descent (IBD), i.e., through common ancestry. Comprehensive summaries of this topic are provided by Weir et al. [
] reviews what we term pedigree-based methods. The following sections provide a brief description of these approaches, summarized in Fig. 1, and an overview of the underlying statistical theory. We do not discuss the number of markers required for each approach in detail and all numbers should be seen as approximate, heavily dependent on the case, the population or other factors. As a rule of thumb, simple versions of exploratory approaches require higher marker numbers evenly distributed across the genome, while pedigree-based methods tend to require fewer markers, but still evenly distributed.
2.1 Exploratory approaches
The exploratory approach benefits from being able to provide a measure of relatedness without any prior information. Briefly, it uses the observed genotype states and summarizes the number of shared alleles or shared stretches of alleles. Manichaikul et al. [
] describe a method to estimate the so-called Cotterman coefficients using dense SNP data. Cotterman coefficients are summarized in the kinship coefficient and the probability of sharing zero alleles IBD. A similar approach is implemented in PLINK [
] outline an alternative model whereby segments of shared DNA are identified, see Fig. 1. The simplest version of this approach utilizes dense SNP data to identify stretches of half-identical genotypes. A half identical stretch is terminated once opposite homozygotes are detected at a certain point. The length of the segment (or haplotype) is recorded as well as the segment’s SNP number. The non-probabilistic version of the segment model requires two parameters, the segment length in centiMorgans (cM) and the number of SNPs in a segment.
The threshold on the number of SNPs in a segment is primarily defined to ensure sufficient marker density in any given region. Further, in a forensic setting, marker density cannot necessarily be ensured, for instance due to low quality DNA samples.
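The non-probabilistic segment model described above can be sketched in a few lines. This is an illustrative simplification, not any company's implementation: genotypes are assumed to be coded as alternate-allele counts (0, 1, 2), SNPs are assumed to be sorted along one chromosome with no missing data or genotyping errors, and the default thresholds (`min_cm`, `min_snps`) are placeholder values chosen only for the example.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Segment(NamedTuple):
    start_cm: float
    end_cm: float
    n_snps: int

    @property
    def length_cm(self) -> float:
        return self.end_cm - self.start_cm


def half_identical_segments(
    snps: Iterable[Tuple[float, int, int]],
    min_cm: float = 7.0,     # illustrative threshold, an assumption
    min_snps: int = 500,     # illustrative threshold, an assumption
) -> List[Segment]:
    """Find half-identical stretches between two individuals.

    snps: (position_cM, genotype_1, genotype_2) tuples in map order.
    A stretch ends when opposite homozygotes (0 vs 2) are observed.
    """
    segments: List[Segment] = []
    run: List[float] = []  # cM positions of SNPs in the current stretch

    def close_run() -> None:
        if run:
            seg = Segment(run[0], run[-1], len(run))
            if seg.length_cm >= min_cm and seg.n_snps >= min_snps:
                segments.append(seg)

    for pos_cm, g1, g2 in snps:
        if {g1, g2} == {0, 2}:   # opposite homozygotes terminate the stretch
            close_run()
            run.clear()
        else:
            run.append(pos_cm)
    close_run()
    return segments


def total_shared_cm(segments: Iterable[Segment]) -> float:
    """Sum the lengths of all segments that passed both thresholds."""
    return sum(seg.length_cm for seg in segments)
```

Both thresholds are applied per segment, so a long stretch covered by very few SNPs is rejected just as a dense but short one is.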
If a segment exceeds a set threshold it is added to the total length of shared segments. Setting the threshold too low can result in more false matches, whereas higher thresholds may eliminate true matches; it should be noted that all likelihood-based forensic measurements must likewise establish a threshold to balance false positive and false negative rates. In relationship tests a false positive result incorrectly includes an unrelated individual, while a false negative result excludes the true relationship and may incorrectly suggest alternative relationships. Finding appropriate likelihood thresholds that optimize this cost/benefit trade-off is a task common to most statistical evaluations in forensic casework. The segment model has been adopted by all the major direct-to-consumer (DTC) genetic testing companies in different versions
]. Variations of the segment model implement a pre-phasing step whereby the paternal/maternal origin of each allele is determined and used to potentially improve the accurate detection of IBD segments [
]. Haplotype frequency is taken into account in the matching algorithms at AncestryDNA where their so-called Timber algorithm compares segments with a reference panel and down-weights the genetic distance for regions which have unusually high levels of matching [
] refer to as probabilistic versions of the segment model, uses a statistical approach (hidden Markov model) to model the IBD states and compute LOD scores determining whether a particular segment is IBD or not. The probabilistic models are likely to perform better for the detection of shorter IBD segments, e.g. below 4–5 cM, but require significantly more computational power [
The likelihood approach has its merits as investigators are presented with a probability stating how likely the genetic data are, assuming hypothesis one (H1): the individuals are related as claimed vs. hypothesis two (H2): that they are unrelated or have an alternative relationship. Likelihood comparisons have long been used to determine relatedness in forensic and medical genetics [
]. This approach requires the formulation of hypotheses to assess, for instance:
H1: Two individuals are full cousins.
H2: Two individuals are unrelated.
The likelihood is then computed by conditioning on each hypothesis separately. A likelihood ratio can be formed stating how much more likely or unlikely the observed genotypes are given hypothesis H1 compared to H2 [
] for dense SNP data and many typed individuals. For pairwise comparisons the algorithms can be condensed, and results obtained with minimum computational effort. Thompson suggested the use of a maximum likelihood approach (MLE) to estimate the relatedness coefficients for pairs of individuals [
]. A maximum likelihood approach accounting for linkage requires the estimation of the relatedness coefficients in combination with inheritance patterns. Genealogical applications normally only provide a range of relationships rather than an exact level of relatedness. Therefore, a discrete grid of relatedness coefficients can be evaluated instead of a continuous optimization, i.e., the MLE approach can compute the likelihood of e.g., the twenty most common degrees of relatedness and then report the highest likelihood, or the top listed likelihoods if these have similar values.
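To make the grid-based likelihood evaluation concrete, the following is a minimal sketch under strong simplifying assumptions that the text above does not make: markers are treated as unlinked and in linkage equilibrium, genotypes follow Hardy-Weinberg proportions, there are no genotyping errors, and only five relationships are evaluated rather than twenty. Genotypes are coded as counts of the allele with frequency `p`; all function names are ours. Each relationship is represented by its probabilities (k0, k1, k2) of sharing zero, one or two alleles IBD.

```python
import math
from itertools import product

# (k0, k1, k2): probabilities of sharing 0, 1 or 2 alleles IBD
RELATIONSHIPS = {
    "unrelated": (1.0, 0.0, 0.0),
    "parent-child": (0.0, 1.0, 0.0),
    "full siblings": (0.25, 0.5, 0.25),
    "half siblings / avuncular": (0.5, 0.5, 0.0),
    "first cousins": (0.75, 0.25, 0.0),
}


def hw(g, p):
    """Hardy-Weinberg probability of genotype g (0, 1 or 2 copies)."""
    q = 1.0 - p
    return (q * q, 2.0 * p * q, p * p)[g]


def pair_prob_given_ibd(g1, g2, p, ibd):
    """P(g1, g2 | number of alleles shared IBD) at one biallelic SNP."""
    if ibd == 0:
        return hw(g1, p) * hw(g2, p)
    if ibd == 2:
        return hw(g1, p) if g1 == g2 else 0.0
    # ibd == 1: enumerate the shared allele a and the private alleles b, c
    freq = {1: p, 0: 1.0 - p}
    total = 0.0
    for a, b, c in product((0, 1), repeat=3):
        if a + b == g1 and a + c == g2:
            total += freq[a] * freq[b] * freq[c]
    return total


def log_likelihood(pairs, freqs, k):
    """Log-likelihood of the genotype pairs, multiplying over independent SNPs."""
    ll = 0.0
    for (g1, g2), p in zip(pairs, freqs):
        lik = sum(k[z] * pair_prob_given_ibd(g1, g2, p, z) for z in range(3))
        if lik == 0.0:
            return float("-inf")   # data impossible under this relationship
        ll += math.log(lik)
    return ll


def rank_relationships(pairs, freqs):
    """Evaluate the discrete grid and sort by log-likelihood, best first."""
    scores = {name: log_likelihood(pairs, freqs, k)
              for name, k in RELATIONSHIPS.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def likelihood_ratio(pairs, freqs, h1, h2="unrelated"):
    """LR of relationship h1 against h2 (both must have non-zero likelihood)."""
    return math.exp(log_likelihood(pairs, freqs, RELATIONSHIPS[h1])
                    - log_likelihood(pairs, freqs, RELATIONSHIPS[h2]))
```

For example, a single SNP at which both individuals are homozygous for an allele of frequency 0.05 gives a likelihood ratio of 5.75 for first cousins against unrelated, showing how a shared rare variant lends support to relatedness. With dense data the independence assumption fails, which is why linkage and linkage disequilibrium must be modelled in practice.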
The likelihood approach further benefits from being able to use reduced genotype data, normally comprising pruned genome-wide SNP data. A naïve approach uses only a minimum distance as the inclusion criterion. Closely located SNPs are expected to contain a high degree of redundant information, mainly through the association of alleles in a population. While a large proportion of SNPs with low minor allele frequencies on average convey little information, when a few rare variants are shared they can provide strong support for relatedness. Maximum information (i.e. heterozygosity) is achieved when the minor allele frequency for a bi-allelic marker is 0.5. Therefore, more intricate thinning procedures would utilize measures of allelic associations and population frequency data to prune SNP data.
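The two ideas in this paragraph can be illustrated briefly. The sketch below is ours and deliberately naive: it computes the expected heterozygosity 2p(1-p) of a biallelic marker under Hardy-Weinberg proportions and thins a sorted list of SNP positions using only a minimum distance, without using allelic association or frequency information as a more refined procedure would.

```python
from typing import List, Sequence


def expected_heterozygosity(maf: float) -> float:
    """Expected heterozygosity 2p(1-p) of a biallelic marker."""
    return 2.0 * maf * (1.0 - maf)


def thin_by_distance(positions: Sequence[float], min_gap: float) -> List[int]:
    """Naive pruning: keep a SNP only if it lies at least `min_gap`
    beyond the previously kept SNP. `positions` must be sorted and in
    the same unit as `min_gap` (base pairs or cM). Returns kept indices.
    """
    kept: List[int] = []
    last = None
    for index, pos in enumerate(positions):
        if last is None or pos - last >= min_gap:
            kept.append(index)
            last = pos
    return kept
```

Heterozygosity peaks at a minor allele frequency of 0.5 and falls towards zero for rare variants, which is the sense in which common markers carry most information on average.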
] compared exploratory and likelihood approaches (including four degrees of relationships) finding that to identify distant relatives, they provide equal power while the likelihood approach tends to falsely include unrelated individuals as distant relatives to a greater extent than exploratory approaches. Note that Kling et al. used a naïve version of the segment approach, mimicking that of GEDmatch, and better performance would be expected for the more evolved versions [
]. As with the likelihood methods, the exploratory approaches do not provide an exact degree of relatedness, but a range of possible relationships which can be investigated through genealogical research. Ultimately, taking a case to court currently requires the formulation of hypotheses and a likelihood ratio which is then converted into a posterior probability stating how likely a certain hypothesis is, given all circumstantial evidence [
]. Exploratory approaches are currently only used in forensic analysis to generate investigative leads and are not presented in court, where STR profiling remains the universally accepted way to establish identity or the link between suspect and crime scene. However, Ge and Budowle [
] have suggested that a shift from STRs to dense SNP data could eventually occur which would require establishing new statistical methods in forensic genetics and acceptance as a secure system of identification by courts of law.
In forensic applications, obtaining data for panels of >500,000 SNPs is not always possible, partly due to the nature of forensic samples but also due to the panels and platforms used in routine work. The exploratory approaches require very dense panels of markers to accurately determine relationships. Fig. S1 illustrates that in a small study performed, at least 56,000 SNPs are needed to determine first cousins, while siblings only require 29,000 SNPs. In contrast, the likelihood approach does not rely on as dense a set of markers as the exploratory approaches. It benefits from using allele frequencies to infer relationships and thus, in theory, a few shared rare variants can indicate strong support for relatedness. This could also represent a drawback if inappropriate frequency databases are used, as demonstrated in Kling [
]. Limitations in the number of genotyped SNPs could potentially be overcome by using imputation, described later. A further drawback of the likelihood approach is the need to account for linkage disequilibrium (LD) when SNP numbers increase. Kling showed that the false positive rate (i.e. false inclusion of true unrelated individuals at various degrees of relationship) is heavily inflated if LD is not accounted for with SNP numbers exceeding 30,000, particularly in some populations [
]. In contrast, LD can be naturally incorporated into the segment approach where SNPs could be in LD (i.e. shared through distant population ancestry) in short segments, but when segments are longer, little LD is detected between their start and stop positions [
] showed that many inferred segments 1–2 cM long actually result from conflation of a number of smaller segments of at least 0.2 cM or longer. AncestryDNA recently illustrated that some longer segments, even up to 50 cM, were identified to be shared by individuals from a common population.
]. Donnelly investigated the theoretical probabilities of two people of different degrees of relatedness sharing a portion of their genome identical by descent. This study found that in theory all second cousins should share some DNA identical by descent, but roughly 2% of all third cousins and 30% of all fourth cousins would share no detectable DNA relationship. This work further highlighted the limits of genetic genealogy and the important principle that not all genealogical relationships will be genetic ones [
] investigated IBD sharing in a much larger dataset of over 20,000 individuals drawn from the 23andMe database and HGDP-CEPH panel. Using unphased data, it was possible to detect ~90% of third cousins and 46% of true fourth cousins. There is a considerable overlap between the distribution of shared DNA for distant relatives (see Table 1 in Balding et al. [
]), which is why DTC reports give ranges of relationships rather than precise inferences. The crowd-sourced initiative “Shared cM Project” (see Section 5) provides a good overview of empirically collected data submitted by DTC customers [
]. The use of whole genome sequence (WGS) data has the potential to further improve relationship estimations. Li et al. estimated that WGS data potentially increases the detection power for distant relationships by 5–15% compared with microarray data [
] described the use of whole genome sequence data where distant relatives (8–9th degree) could be detected using very rare genetic variants. Section 3.2 further explores the expected and reported success rates using current databases.
The inference of relatedness is confounded by pedigree collapse and endogamy. Ralph and Coop [
] provided empirical data of the inter-relatedness of all Europeans within the last 1000 years. They found that two European individuals from neighboring populations share between two and 12 genetic ancestors from the last 1500 years and over 100 genetic ancestors from the 1000 years before that, with substantial regional differences in the level of sharing. They highlighted the difficulties of inferring the age of a single small segment of 10 cM and the impossibility of assigning a genealogical relationship. Gauvin et al. [
] explored the effect of endogamy in HGDP-CEPH populations. Very high levels of segment sharing, and therefore very recent common ancestors, were detected in Surui and Karitiana (Amazonian populations which are essentially extended families). However, high levels of segment sharing were also detected in the much larger Kalash and Yakut populations, indicating that the minimum segment length threshold used to analyze IBD needs careful calibration in populations with endogamy or recent bottlenecks [
Various errors can be introduced to SNP genotypes during the process of parsing variants. Such errors can be broadly divided into two subsets: technological errors and induced errors. Technological errors resulting in erroneously called genotypes can occur during DNA amplification and sequencing, or in the bioinformatics pipeline that performs sequence alignment or variant calling, see Fig. 2. Imputation and phase errors fall into the latter category (induced errors). In the section on imputation we describe a small study where we investigate the errors introduced when inferring missing data. Furthermore, the process of phasing individual chromosomes can introduce errors [
]) of less than 0.2%. AncestryDNA further found a phase error rate using the Underdog algorithm of 0.64% with a training set of 502,212 samples and suggested accuracy would improve with larger phasing panels [
] when these are not accurately modelled. Similarly, de Vries et al. [In submission, 2020] demonstrated that the segment approach is sensitive to wrongly called homozygotes for error rates as low as 0.5% (personal communication). One of the strengths of non-probabilistic versions of segment matching, where phasing is not used, is that it is only sensitive to wrongly called homozygote genotypes, which can prematurely terminate a shared segment. Durand et al. [
] suggest applying a haplotype score incorporating the phase and genotyping error rates. This score could be used as a post-processing step to filter spurious IBD segments. Other researchers have studied and incorporated error rates into their segment models [
From a forensic perspective, many contact trace samples are likely to be of low quality and quantity, analyzed with low-depth whole genome sequencing, whereas database samples, commonly analyzed with SNP microarrays, are expected to have significantly lower error rates [
] and induced errors in one of the genotypes at different levels (2%, 1% and 0.5%), see Supplementary File S2. The results are illustrated in Supplementary File S2, Fig. 1 where no model accounting for errors is used, which show that levels of detectable shared DNA drop rapidly with increasing error rate. At 2% error rates, a pair of full siblings share on average ~500 cM of detectable total segments compared to roughly 2800 cM without errors. Supplementary File S2, Fig. 2 contains an equivalent illustration when a single error per segment is allowed and shows a considerable improvement in terms of detecting broken segments. Furthermore, Supplementary File S2, Fig. 2 demonstrates an implementation of the error model presented in Petter et al. [
]. In our implementation, four homozygote errors per segment are allowed while simultaneously only retaining a match if a segment of above 6 cM without errors is detected. Fig. 3 further illustrates how errors affect the individual segment and indicates that for e.g. full siblings, a few long shared segments are split into multiple shorter segments. Some will disappear, failing to exceed the detection threshold, while others are accumulated into the total length of shared DNA.
2.5 The use of DNA mixtures
In contrast to single-source DNA samples, mixtures of several contributors are common in forensic samples. In terms of using mixtures as court evidence, there are various methods to estimate the evidential weight of a DNA sample [
described the use of WGS analysis of a mixture and subsequent separation through conditioning on the victim’s DNA profile, although also lacking details of the method. State-of-the-art methods in forensic DNA analyses use quantitative models where allele peak heights help infer individual contributor genotypes (termed probabilistic genotyping). In current IGG, the search is conducted with a single source DNA profile (strictly speaking, since biallelic SNPs are used, it can never be perfectly deduced whether a profile is single source or not, although allelic balances can give information on the number of contributors), so a searchable profile must be obtained by deconvolution of the mixture, either by conditioning on known contributors or by combining a statistical model and information about the balance of allelic signals. As a consequence, the resulting profile used in the search has a level of uncertainty and the analysis benefits from estimation of the false/true positive rates affected by this uncertainty. Standard forensic mixture deconvolution incorporates the uncertainty into a statistical model to potentially allow a search. The current version of CODIS [
] demonstrated that linked markers can be used in a qualitative model allowing future expansion of marker panels. Exploratory approaches, on the other hand, rely on large numbers (and segments) of uninterrupted SNPs. IGG relies on the generation of a SNP profile with sufficient genotypes to be accepted into the databases to allow LE matching. The approaches would have to rely on a single deconvolution where the profile of the perpetrator is extracted instead of a more probabilistic approach.
Whole genome sequencing of low-level DNA tends to yield low mean coverage conveying little information on the exact level of individual contributors. However, a statistical model can be developed to extract a contributor in a mixture based on allele dosage (i.e. read counts). Fig. 4 illustrates a two-person mixture and how it is possible to extract the perpetrator based on a known contributor. Without using information on allele dosage, only homozygotes can be called with certainty. If the mixture is a homozygote genotype then the perpetrator must be a homozygote as well, disregarding dropouts, and therefore the second contributor’s genotype is irrelevant. For heterozygote mixture genotypes, the perpetrator can be a heterozygote or homozygote for either of the alleles, potentially inferred using information from the second contributor. Inflating the number of erroneous homozygotes is quickly detrimental to genealogy searches, so potential solutions are to always infer a heterozygote genotype for the perpetrator, or to remove these ambiguous genotypes. The former can lead to an increase in the number of false positives, while the latter can potentially increase false negatives since fewer SNPs are called. If information on allele dosage is available, such information can be used if heterozygote genotypes contain a minimum number of reads. Raw data from microarrays contain intensity levels that potentially allow mixture contributors to be separated, as described by Homer et al. [
]. However, we do not recommend the use of such microarrays for forensic analyses (see Section 7.1).
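The qualitative reasoning for a two-person mixture with one known contributor can be written out as a short sketch. It is an illustration of the logic described above only, under the assumptions that no allele dosage information is used, that there is no drop-out or drop-in, and that genotypes are written as two-letter strings such as "AA" or "AB"; the function names and the `policy` argument are hypothetical.

```python
from itertools import combinations_with_replacement
from typing import Optional, Set


def candidate_genotypes(mixture_alleles: Set[str], known: str) -> Set[str]:
    """Genotypes of the unknown contributor consistent with the alleles
    observed in the mixture and the known contributor's genotype."""
    observed = set(mixture_alleles)
    if not set(known) <= observed:
        raise ValueError("known contributor carries an allele absent from the mixture")
    candidates = set()
    for a, b in combinations_with_replacement(sorted(observed), 2):
        # Together the two contributors must explain every observed allele.
        if {a, b} | set(known) == observed:
            candidates.add(a + b)
    return candidates


def call_genotype(mixture_alleles: Set[str], known: str,
                  policy: str = "drop") -> Optional[str]:
    """Call the unknown contributor's genotype.

    policy "drop": return None when the genotype is ambiguous.
    policy "het":  always report a heterozygote when ambiguous.
    """
    candidates = candidate_genotypes(mixture_alleles, known)
    if len(candidates) == 1:
        return next(iter(candidates))
    if policy == "het":
        return "".join(sorted(mixture_alleles))
    return None
```

A homozygous mixture yields a certain call, whereas a heterozygous mixture leaves two or three candidates, so the choice of policy trades false positives (always calling a heterozygote) against fewer called SNPs (dropping the marker).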
We performed a small study where unrelated individuals from the 1000 Genomes Project were drawn at random in a pairwise approach. The genotypes were mixed (equal proportions) and deconvoluted using three different models, two qualitative and one quantitative, as outlined in Supplementary File S2, section B. Under the assumptions in our study, genotypes could be deduced with 99.9% accuracy when the quantitative model was used, with 4–5% of genotypes dropping out due to uncertainty in the deconvolution process, as shown in Supplementary File S2, Fig. 4. The qualitative models both resulted in an inflation of errors. We did not explore the impact of the deconvolution accuracy on the inference of relatedness but assume that it is minimal for the quantitative model, given the low error rates.
3. Genealogy research
3.1 Genealogical research
Genealogical research is a key component of IGG and generally the most time-consuming part of the process, though time spent on research will vary depending on many factors including closeness of the matches, the supporting network of matches, family size and availability of genealogical records. In a UK pilot study [
] genealogists solved one case which had matches with immediate family members within three hours, while they estimated more complicated cases with matches at third or fourth cousin levels needed 50–100 h of research. Some cases analyzed by the DNA Doe Project required hundreds of hours of research by volunteer teams. IGG is only possible because of the large quantities of genealogical records from around the world which have been digitized and indexed in the last two decades. The Church of Jesus Christ of Latter-day Saints has been at the forefront of this process and provides free access to billions of worldwide records through its FamilySearch website (https://www.familysearch.org). The FamilySearch Wiki allows access to information on the availability of worldwide genealogical records and provides articles on the research process. Users can upload family trees, and the site hosts the FamilySearch Family Tree (claimed to be the largest family tree in the world). Commercial companies, such as Ancestry.com, Findmypast, Geneanet and MyHeritage, have also transcribed and indexed billions of records and provide subscription-based online access. These sites also allow users to upload family trees which can then be searched by other users. Therefore, it is now possible to easily access family trees, birth, marriage and death records, censuses, electoral registers, newspaper articles, wills and a variety of other historical records from many different countries. There are also many national and regional archives around the world with growing collections of digitized records which are freely available online. Research which previously took years and required visits in person to archives and repositories can now be done online in a matter of hours.
IGG involves researching not just historical records but tracing lines forward to the present day in what is termed descendancy research or reverse genealogy. This requires access to records on living people. Some modern records are available on the genealogy sites mentioned above but these records can be supplemented by searches on social media, particularly Facebook, which can offer a lot of information about living people and their family relationships. Online obituaries, particularly in the US, often provide complete lists of descendants and relatives of the deceased. People finder sites like BeenVerified and Intelius are particularly useful for US searches.
Successful genetic genealogy searches require not just easy access to genealogical records and a good understanding of how to evaluate genealogical evidence but also considerable experience of interpreting DNA evidence. There are university courses which provide a route to a career as a professional genealogist. However, many good professional genealogists are not accredited and have learnt through experience rather than a formal education programme. Genetic genealogy is a new discipline where best practice is being developed slowly through the collective experiences of those who are actively working in the field, many of whom are hobbyists. There are no official genetic genealogy qualifications and no organization which can testify to an individual’s ability to work on IGG cases. Many of the leading practitioners in IGG have had no formal genealogy training and have no accreditations. Accreditation with a genealogical organization is no guarantee that an individual has a sufficient level of expertise in genetic genealogy. This lack of professionalization makes it challenging for LE agencies wishing to employ a genetic genealogist to judge whether they have the relevant skills and expertise.
The IGG process starts with the upload of a SNP profile to one or more of the three databases where it is currently permitted: GEDmatch, FTDNA and DNASolves. Each company has different protocols for the use of their database by LE agencies, as described below.
The match lists are assessed by the genealogist who determines whether or not a genetic genealogy search is likely to be productive. If the query profile generates one or more matches at the second or third cousin level or closer, then the case is likely to be worth investigating. Second cousins are considered to be the “sweet spot” where identification should be possible. However, much depends on the quality of the matches and whether or not the individuals can be identified through their username and/or e-mail address and by their family tree, if provided. The search will be more difficult if the query profile has ancestry from a country with limited availability of online genealogical records or where access to records on living people is more restricted.
Once the top matches have been identified, a check is made of the shared matches to identify genetic networks (clusters) of related matches. For example, second cousins share a set of great-grandparents in common and any matches which match both the query profile and a second cousin are likely to be related through a common ancestral couple in one specific quadrant of the family tree. The family trees of the shared matches are searched or built out to identify a common ancestral couple for all the people in the cluster. Descendancy research then traces the lines forward to the present day to identify candidates of interest. If additional clusters of related matches can be identified, then the genealogist will look for intersections (triangulations) between clusters, e.g., a marriage involving surnames from two distinct clusters. All the different genetic networks or clusters must be consistent with the identification with each match sharing the appropriate amount of DNA for the hypothesized relationship. However, because full siblings have identical ancestral family trees, genetic genealogy generally only ever narrows down the search to the offspring of a specific couple. It cannot determine which of a number of siblings is the suspect or the missing person, unless additional data for their descendants are available.
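The grouping of shared matches into genetic networks described above can be illustrated with a minimal sketch. It assumes that the shared-match information is available as pairs of match identifiers (the identifiers `M1` to `M5` and the function name `cluster_shared_matches` are hypothetical), and it treats a cluster simply as a connected component of the shared-match graph. Practitioners typically use more nuanced clustering and always check the shared cM values against the hypothesized relationships, so this is only an illustration of the idea.

```python
from collections import defaultdict


def cluster_shared_matches(shared_pairs):
    """Group matches into clusters (connected components of the shared-match graph).

    shared_pairs: iterable of (match_a, match_b) tuples, meaning that both
    matches appear on the query profile's match list and also match each other.
    """
    graph = defaultdict(set)
    for a, b in shared_pairs:
        graph[a].add(b)
        graph[b].add(a)

    seen = set()
    clusters = []
    for start in sorted(graph):
        if start in seen:
            continue
        # Depth-first walk collecting every match reachable from `start`.
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node] - seen)
        clusters.append(sorted(component))
    return clusters


# Hypothetical shared-match pairs for one query profile.
pairs = [("M1", "M2"), ("M2", "M3"), ("M4", "M5")]
clusters = cluster_shared_matches(pairs)
print(clusters)
```

Each resulting cluster is a candidate for one ancestral couple; the genealogist would then search or build out the trees of the cluster members to find that couple.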
If the matches are all more distant (e.g. at third/fourth cousin level or beyond) the family trees can still be worked on, but it is often necessary to perform targeted testing of people identified through the genealogical research as possible closer relatives of the person of interest (e.g. second cousins). The individual is approached and asked to help with the investigation by taking a commercial genetic ancestry test and uploading the results to one of the databases which participates in law enforcement matching. The genealogist can then check that the individual matches the perpetrator in the expected way. Target testing thus helps to confirm that the correct branch of the family tree is being researched and narrows down the search pool, though the practice does have ethical implications, particularly if the DNA sample is obtained without the appropriate informed consent.
As well as the quality and quantity of forensic DNA in a case, the chances of a successful identification depend on the size of the database plus the number and quality of the cousin matches. Edge and Coop [
] investigated the question of the expected number of genetic cousins at varying degrees in databases of different sizes to assess the chances of success. Using simulations and some simplifying assumptions, their findings indicate that in a database of one million individuals with ancestry from the same population, there is a high probability (>95%) of having at least one genetically detectable third cousin match sharing two or more DNA segments. At that time, the GEDmatch database had nearly one million profiles accessible to LE searches so this study demonstrated that the identification of Joseph DeAngelo as the Golden State Killer was within expectations and that there was a high chance that US individuals with European ancestry could be identified in a database of this size.
Erlich et al., using empirical data from the MyHeritage database (1.28 million SNP profiles at the time of study), found that ~60% of searches for individuals of European ancestry would result in a third-cousin or closer match with a total of 100 cM or more in shared segments. In 15% of the queries at least 300 cM in total was shared, signifying a second cousin or closer relationship which could provide highly informative investigative leads. They corroborated the results by performing similar queries on a smaller scale in the GEDmatch database which led to ~76% of cases with 100 cM or more shared and ~10% of cases with 300 cM or more shared. Erlich’s study estimated that 75% of the MyHeritage database was of Northern European ancestry. The model presented in their study predicted that only 2% of a target population would need to be represented in a DNA database to provide a third cousin match for nearly everyone in that population.
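As a rough illustration of how the two total-cM thresholds quoted above might be used to triage a match list, the following sketch labels each match by the band it falls into. The 100 cM and 300 cM cut-offs are taken from the study figures in the text; the function name, labels and example values are hypothetical, and real relationship prediction relies on probability distributions rather than fixed cut-offs.

```python
def triage_match(total_cm):
    """Label a match by total shared cM using the two thresholds quoted in the text."""
    if total_cm >= 300:
        return "second cousin or closer"
    if total_cm >= 100:
        return "third cousin or closer"
    return "below 100 cM"


# Hypothetical total shared cM values for three matches.
example_totals = [412.0, 135.5, 48.0]
labels = [triage_match(cm) for cm in example_totals]
print(labels)
```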
Two studies have demonstrated the potential utility of IGG in a European setting and have validated the methodology. In a pilot study from the UK of ten volunteers, genetic genealogists were able to re-identify four of the ten individuals in the GEDmatch database (1.2 million SNP profiles at the time of study). One of the identified individuals had Indian heritage via St Vincent and the Grenadines, indicating the methods can potentially work for people of non-European descent if the right matches are available [
]. In a more recent case from Sweden, Daniel Nyqvist was identified as the suspect in a 2004 double murder of a young boy and a woman through matches with fourth cousins and as a result of extensive family tree building.
The searchable portion of GEDmatch which is accessible for investigative purposes changed dramatically in May 2019 following concern amongst some genealogists and users after it was used for a search which was not covered by the existing site policy. GEDmatch set to zero the number of ‘kits’ (herein, a kit refers to an individual’s SNP dataset uploaded to GEDmatch, mainly produced and held by the DTCs) against which LE investigators could query and introduced an opt-in framework, under which users choose whether to allow their SNP kit to be included in the portion that can be compared for investigative segment matching purposes.
Prior to the reset, ~700,000 of the one million or more GEDmatch profiles were available for investigative query. Private profiles, duplicate profiles, those with insufficient SNPs or excessive gaps in SNP coverage and specialized datasets (e.g., surname or ancestry groups) were all excluded from searches. GEDmatch was the subject of a security breach in July 2020, but they have indicated to us that only a minimal number of users have since deleted their accounts, and the database continues to grow. In a presentation at the 31st International Symposium on Human Identification in September 2020, Verogen, who acquired the GEDmatch database in December 2019, said 1.1 million users had uploaded 1.45 million DNA profiles. Over 285,000 users have opted in to LE matching and 83% of new users opt in to LE matching. Verogen have made internal assessments to test the efficiency of the opted-in profiles for investigative searches. When a small cohort of known investigative SNP kits were compared internally against the opt-in portion of the database, and then against the opted-out portion, the opt-in portion provided equivalent potential leads to the opt-out database in ~80% of cases.
The GEDmatch database is dominated by users of European ancestry, particularly from anglophone countries. Table 1 gives the ten countries with the most GEDmatch uploads based on website analytics (data from Verogen, August 2020). The need for European GDPR compliance is also an influencing factor in the potential success rate as the consent process required EU users to opt in to use the database, following its acquisition by Verogen.
GEDmatch is now supplemented by the FTDNA database where the number of profiles available for LE matching is not known. If the FTDNA database has a similar number of profiles accessible to LE, the combined reach of the two databases may be approaching 600,000, though some duplication is likely. In time, critical mass could be reached where nearly any US individual of European descent could potentially be identified through IGG.
In response to our questionnaire, Parabon NanoLabs said they had recorded a significant recovery in the informativeness of GEDmatch since the opt out was implemented in May 2019, but match rates had not quite reached the levels available before. However, they indicated the number of cases where investigative leads and actionable information can be provided has not significantly changed, but this often requires uploading to FTDNA as well as GEDmatch. The segment matching evaluations made by Parabon NanoLabs, before and after the GEDmatch LE access changes, are summarized in Supplementary File S3.
On 11th January 2021 Verogen updated the Terms of Service at GEDmatch.
The individual who makes their DNA available for law enforcement matching shares part of their genome with other close relatives and so their decision essentially affects their wider extended family who could potentially be involved in the investigation even though they have never taken a DNA test. The use of surreptitious DNA testing to obtain confirmatory samples from the suspect also raises ethical issues, especially as in some cases the police have put multiple family members under surveillance to obtain these samples. The international nature of the consumer DNA databases and differing approaches to punishment raise ethical and human rights issues, particularly with regard to the death penalty which is still used in a minority of countries and in some US states.
The use of IGG to identify and prosecute the mothers of abandoned babies has also been cited as a cause for concern, particularly in jurisdictions where there are no infanticide laws allowing for more lenient and compassionate treatment of mothers.
Genealogists are interested in testing the DNA of deceased relatives to help with their family history research, but should they have the ability to make a deceased relative’s DNA profile available for LE use? What happens if the descendants have conflicting views on such sharing? Qualitative research looking at the views of UK stakeholders found that there was considerable support for the use of IGG, but many interviewees commented on a range of social and ethical concerns and expressed the need for independent regulatory oversight. While interviewees all expressed the importance of individual informed consent, it was found that it is not an ethical panacea and there is a need for a more societal approach to consent in consultation with the public. We have highlighted some of the key ethical and social issues discussed in the literature which we feel are important, but it is outside the area of expertise of the authors and beyond the scope of this paper to engage with them in depth. Much more research is needed on all these issues by bioethicists and social scientists in consultation with stakeholders and the general public in order to establish a suitable ethical and regulatory framework for the responsible use of IGG.
4. Official guidelines for use of genealogy data in investigative practice
The US Department of Justice (DoJ) released an Interim Policy on Forensic Genetic Genealogical DNA Analysis and Searching in November 2019. The “scientific community and other interested parties” were encouraged to send comments to the FBI [
]. The policy clarifies that the investigative agency “must have pursued reasonable investigative leads” but it did not make specific recommendations about the need to clear testing backlogs or the need to use familial searching first before resorting to genetic genealogy. The SWGDAM (the Scientific Working Group on DNA Analysis Methods) in the US convened a working group to publish a statement on genetic genealogy and published an Overview of Investigative Genetic Genealogy in February 2020.
Both the DoJ and SWGDAM recommendations emphasize the importance of a ‘CODIS first and last’ approach in investigative practice. The DoJ policy states: “before an investigative agency may attempt to use genetic genealogy, the forensic profile derived from the candidate forensic sample must have been uploaded to CODIS, and subsequent CODIS searches must have failed to produce a probative and confirmed match”. They then emphasize that a CODIS search must complete the investigation, stating: “a suspect shall not be arrested based solely on a genetic association generated by a genealogical service. If a suspect is identified after a genetic association has occurred, STR DNA typing must be performed and the suspect’s STR profile must be directly compared to the forensic profile previously uploaded to CODIS”. As DNA analysis techniques progress there will eventually be situations where SNP data sufficient for a genealogical analysis will be generated from evidential material where an STR profile has not, e.g., where a hair shaft at a crime scene is submitted for specialist analysis outside of routine crime laboratory testing regimes. At this stage, which may have already been reached, the DoJ and SWGDAM guidelines must be reconsidered to address the way identity is established using SNPs in forensic cases without an STR profile from the crime scene.
With regard to what is described as ‘investigative caution’ concerning the behaviour of investigators in being transparent about the purpose of relative searches made by genealogical analyses, they state: “Investigative agencies shall identify themselves as law enforcement to genealogical services and enter and search genetic genealogy profiles only in those service suppliers that provide explicit notice to their service users and the public that the law enforcement may use service sites to investigate crimes or to identify unidentified human remains”. Furthermore, when obtaining new DNA samples they state: “an investigative agency must seek informed consent from third parties before collecting reference samples that will be used for genealogy, unless it concludes that case-specific circumstances provide reasonable grounds to believe that this request would compromise the integrity of the investigation”. The SWGDAM recommendations largely echo those of the DoJ, by saying a CODIS search in state or national databases should be made before instigating genealogical analyses and a CODIS search should conclude the investigation to complete the exclusionary/inclusionary process. On public consent for LE access, SWGDAM state: “policies/procedures should be established which consider applicable privacy policies and the database provider’s terms of service, a level of transparency of techniques employed, and maintenance of the public trust”.
The UK Biometric and Forensics Ethics Group recently published a report on investigative genetic genealogy which covers the feasibility of using the technique in the UK and ethical issues arising from its use.
Following the resolution of a recent double murder in Sweden assisted by IGG (see above), public pressure to use the method in other cases has emerged. The double murder case was selected as a pilot study, initiated by the Legal Affairs Department at the Swedish Police Authority, to evaluate the suitability of IGG from a Swedish perspective and examine its compliance with current Swedish laws. The experiences from this pilot are currently being evaluated, involving technical, legal and ethical aspects.
5. Direct-to-consumer testing
Most current discussions of genetic genealogy describe four main DTC companies: AncestryDNA; 23andMe; MyHeritage; and FTDNA, each offering SNP microarray-based insights into an individual’s health risks and/or ancestral roots, plus the opportunity to find links to previously unknown relatives that match for a pre-set minimum proportion of chromosomal segments. Each company uses a slightly different approach to detect putative IBD segments, commonly without disclosing all details about the exact implementation of their algorithm. They each apply different thresholds for declaring a match, but none report matches that share less than 7 cM. With the limitations of microarray technology, it is estimated that 20% of matches are false positives. Most DTCs’ relative-searching analyses require customers to opt in. AncestryDNA and 23andMe restrict matching to customers who have directly tested with the company. FTDNA and MyHeritage permit the upload of raw SNP data from 23andMe and AncestryDNA to expand the potential number of links to relatives.
The DTCs provide lists of matches and the suggested range in which the relationship might occur. The matches only provide a rough guideline and the genealogist makes further interpretation of the most probable degree of relatedness based on genealogical information and the related genetic network of matches. The analytical tools provided by the DTCs for estimating relationships can be supplemented by additional tools. The Shared cM Tool on the DNA Painter website (https://dnapainter.com/) reports cM value ranges and averages. It allows the user to enter the total cM shared and generate a table of probabilities for the possible range of relationships (probabilities inferred from the AncestryDNA Matching White Paper).
Although participants have the opportunity to state if endogamy is suspected in their own family tree, they may have underestimated the degree of endogamy occurring. Therefore, the average total shared cM and upper range limits collected by the project are likely to be inflated. Some outlying values were removed from undetected misattributed parentage and data entry errors. Nevertheless, the compiled values and their distribution as histograms of average total cM (excluding alleged relationships without shared DNA) provide valuable aids for the interpretation of segment sharing data and a useful point of comparison with the predicted relationships given by the DTCs and GEDmatch. It should be noted that since DTCs use different detection thresholds which change over time, these numbers are only rough estimates reflecting that particular method and parameters.
The four DTCs’ microarray compositions are summarized in Table 2. Note that ISOGG list 32 separate genetic testing companies, but we concentrate on the four with the largest customer databases. The two next largest DTCs are the Genographic Project and Living DNA. Although Genographic had more than one million participants, it ceased making analysis data available to customers in June 2020. However, many participants have transferred Genographic data to FTDNA.
Living DNA has a worldwide customer base but is focused on Britain and Ireland. It introduced a relative-matching feature called Family Networks in February 2018, initially restricted to close matches.
The matching range was expanded in May 2020 to provide matches with more distant cousins, but the number of matches obtained is modest in comparison to those of the four main DTCs. Several companies, such as Dante Laboratories, Full Genomes Corporation, Nebula Genomics and YSEQ, now offer WGS direct to the consumer, and the cost of WGS services continues to fall. However, there is no database which can fully leverage the information contained in WGS data to infer relationships. Advanced users can extract specific SNP profiles for upload to GEDmatch. While FTDNA accepts SNP profiles generated from WGS from LE accounts, such uploads cannot be made by regular customers. In theory, customers could mimic the file formats for DTC microarrays to upload a WGS-generated SNP profile.
Each of the four main DTCs has specific rules of engagement for their interactions with LE investigators seeking to match forensic DNA with their customers’ data. These frameworks are well covered in the ISOGG Investigative Genetic Genealogy FAQs and we summarize their current positions along with the SNP testing features of each DTC below in descending order of SNP profile database size.
AncestryDNA (http://ancestry.com) is by far the biggest DTC in the genealogy field, with nearly 20 million SNP profiles. It provides autosomal SNP testing based on a modified Illumina OmniExpress − changed in 2016 from v1 with 682 K SNPs to v2 (~300 K underperforming or uninformative SNPs swapped, reducing total SNPs to 637 K). Although AncestryDNA’s microarray includes X-chromosomal SNPs, these are not used in their analyses, but are available through raw data download to the customers. AncestryDNA previously offered Y-DNA and mtDNA tests but discontinued them in 2014. Company policy is “not to allow law enforcement to use Ancestry’s services to investigate crimes or to identify human remains”, but when a warrant or subpoena is issued, “data relating to the DNA of an AncestryDNA user will be released only pursuant to a valid search warrant from a government agency with proper jurisdiction”, and “when we receive a request our team reviews it to make sure it satisfies legal requirements and our policies. If we believe the request is overly broad, we will try to narrow it to the extent legally permitted”
Although not permitting LE access to the DNA database, the genealogical records and family trees held by AncestryDNA’s parent company Ancestry.com are used extensively in IGG searches. In some cases, target testing to narrow down the search pool is done first at AncestryDNA before uploading to those databases allowing LE searches.
AncestryDNA has been the most transparent in outlining the process used to identify IBD segments by releasing a white paper which describes the principles and processes well. Fig. S2 summarizes the steps taken to define each segment match. SNP data is phased into sections of sequentially arranged alleles using an adaptation of BEAGLE developed into a more efficient algorithm called Underdog. A separate algorithm, Timber, handles haplotype frequency estimation from the millions of profiles this DTC holds.
23andMe (http://www.23andme.com) has more than 12 million users as of January 2021, over 80% of whom have opted in to participate in research. They currently offer one test, used for multiple analysis options (customer options are ancestry and traits or ancestry and health; the 23andMe health service is only available in the US, Canada, UK, Denmark, Finland, Ireland, Sweden and The Netherlands), using the Illumina GSA with additional customized SNPs providing Y-chromosome DNA and mitochondrial DNA ancestry reports. The GSA was preceded by four different configurations of the Illumina OmniExpress microarray (v1, 2007–8, indeterminate custom SNP set; v2 2008–10, 556 K SNPs; v3, 2010–13, 930 K SNPs; v4, 2013–17, 585 K SNPs). Of the four main DTCs, 23andMe is the only one to fully incorporate X-chromosome data in their relative-matching process. The emphasis of 23andMe is on trait and disease risk associations from the collective SNP data compiled by the company, with customers self-reporting their lifestyle/behaviour, disease histories and known characteristics, which in turn provides some input to the forensic phenotyping knowledge base [
]. 23andMe were the first DTC to introduce segment matching tests to link customers to unknown relatives on the 23andMe database. The initial segment analysis regime to identify so-called ‘cryptic relatives’ was published by Henn et al. in 2011 [
23andMe do not give any access to customer information in response to requests from LE authorities, stating: “use of the 23andMe Personal Genetic Service for casework and other criminal investigations falls outside the scope of our services intended use”. However, 23andMe must respond to, and is expected to comply with, court orders, subpoenas, or search warrants for genetic and personal data. 23andMe state that “they would use every legal remedy possible” to challenge a request for such legally enforced access to 23andMe customer data.
23andMe’s transparency report includes details of LE requests for information and is updated on a quarterly basis. As of May 2020, it had received seven requests, all from US agencies, pertaining to 10 users, and all were refused, with no data passed to LE authorities (https://www.23andme.com/transparency-report/).
MyHeritage (http://www.myheritage.com) is estimated to have over 4.5 million SNP profiles, and accepts free data transfers from 23andMe, AncestryDNA, FTDNA and Living DNA. They originally used the Illumina OmniExpress but moved to the GSA in 2019. They launched a new microarray-based health test in June 2019 and the GSA has been customized to provide ancestry and health informative SNPs.
Although X-chromosome data is available in the raw DNA data download it is not currently incorporated into the relative-matching service.
The MyHeritage database unwittingly provided the breakthrough match in the Golden State Killer case. The terms and conditions have since been changed and the company now specifically “prohibits law enforcement use of its DNA services” and states “we will not provide information to law enforcement unless required by a valid court order or subpoena for genetic information”.
However, because MyHeritage accepts uploads from people who have tested at other companies it is theoretically possible that they are vulnerable to unauthorized LE uploads, though a high-quality and near-complete SNP profile would be needed to pass quality control checks. To prevent such potential breaches, Yaniv Erlich of MyHeritage and colleagues [
] details on the new matching algorithm of MyHeritage which builds on a similar model to AncestryDNA using shorter phased seed segments and extending them using unphased data. The study also contained details about models for errors without disclosing exactly what the standard parameters in the matching algorithms were.
FamilyTreeDNA (FTDNA) launched in 2000 and were the first company in the US to offer direct-to-consumer ancestry testing. The initial focus was on Y-chromosome and mitochondrial DNA testing as a tool for genealogical research, with Y-DNA results focused on surname projects. FTDNA were the second company to offer an autosomal DNA test for finding relative matches with the Family Finder test launched in 2010.
FTDNA moved to a customized Illumina GSA in spring 2019.
FTDNA appears to apply a system of half-identical segment matching with unphased genotypes, although their algorithms are proprietary, and no technical details have been published. The original threshold for a match was set at 20 cM total shared and a minimum longest segment of 7.69 cM for 99% of customers and 5.5 cM for the other 1%. Thresholds were updated in 2016, comprising a reduced minimum shared cM total but at least one segment required to be 9 cM or longer.
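The original match criterion quoted above can be written as a simple check. This sketch encodes only the two figures stated in the text for the majority of customers (a total of at least 20 cM and a longest segment of at least 7.69 cM). Because FTDNA's algorithms are proprietary, the function name, the input format and the assumption that the total is the plain sum of all reported segments are hypothetical.

```python
def meets_original_threshold(segments_cm, min_total=20.0, min_longest=7.69):
    """Check a list of shared segment lengths (in cM) against the stated criterion.

    Assumption: the total is the plain sum of all segment lengths supplied.
    """
    if not segments_cm:
        return False
    return sum(segments_cm) >= min_total and max(segments_cm) >= min_longest


# Hypothetical segment lists for three candidate matches.
print(meets_original_threshold([12.5, 3.2, 1.1, 8.0]))  # long segment, total above 20 cM
print(meets_original_threshold([6.0, 6.0, 6.0, 6.0]))   # total above 20 cM, no long segment
print(meets_original_threshold([8.0, 5.0]))             # long segment, total below 20 cM
```

The 2016 update is not sketched because the revised minimum total is not given in the text.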
Matches are reported in a list with information on total number of shared cM, length of the longest segment and the predicted relationship range. FTDNA provides relationship predictions in four ranges: immediate matches, close matches, distant matches and speculative matches.
Although the minimum segment size for a match is set at 9 cM, once a match has been declared all segments down to 1 cM are included in the cM total. The majority of these small segments are either false matches (pseudo-segments) because of the lack of phasing or they are not genealogically relevant. Genetic genealogists normally recalculate the total cM shared to exclude segments under 7 cM to obtain a more realistic number. Matches for Ashkenazi Jews are down-weighted to account for the underlying endogamy in the population, though the technical details of the algorithms have not been published.
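The recalculation described above, in which segments under 7 cM are excluded from the total, can be sketched as follows. The segment values and the function name are hypothetical, and the 7 cM cut-off is the figure given in the text; individual genealogists may apply a different cut-off.

```python
def recalculate_total_cm(segments_cm, min_segment_cm=7.0):
    """Sum shared segment lengths (in cM), ignoring segments below the cut-off."""
    return sum(length for length in segments_cm if length >= min_segment_cm)


# Hypothetical segment list as it might appear in a downloaded match report.
reported_segments = [12.5, 3.2, 1.1, 8.0]
reported_total = sum(reported_segments)  # total including small segments
adjusted_total = recalculate_total_cm(reported_segments)
print(reported_total, adjusted_total)
```

Here the two small segments (3.2 cM and 1.1 cM) are dropped, so the adjusted total reflects only the segments more likely to be genealogically relevant.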
Users can download a list of their matches and the shared segment data. The problem of small false segments is seen when known relations from different generations are viewed in the chromosome browser, as shown in Fig. 5A and B.
FTDNA do not include X-DNA in total cM shared and an X match is only reported when two individuals have an autosomal DNA match. Once an X-DNA match is declared, FTDNA reports X-DNA matches down to 1 cM. There is a high false positive rate with these small X-DNA matches which is partly explained by the low SNP density on the X-chromosome on current microarrays. The false positives are clearly evident when comparing the low number of male X-DNA matches, which are naturally phased, with the unusually high number of female X-DNA matches. A small study found major discrepancies between the number of male and female X-DNA matches.
In March 2018 FTDNA announced it was collaborating with BC Platforms who would provide a solution for incorporating genotype data from multiple chips into the database and dealing with backwards compatibility of historical data.
No details of the methods used have been published to date.
FTDNA’s ability to accept third-party uploads inevitably made them susceptible to unauthorized uploads from non-genealogical sources. In January 2019 it was revealed that the FBI had infiltrated the FTDNA database and FTDNA had agreed to collaborate and continue to provide FBI access.
However, this meant existing customers not wishing to make their profiles available for LE use were denied access to the matching database for their own genealogical research. Following a backlash, in March 2019 FTDNA allowed customers to opt out of LE matching. EU citizens were opted out to comply with GDPR rules but could choose to opt back in.
New customers worldwide can choose whether to participate in LE matching when they activate their kit. In December 2020 further details of the Golden State Killer case emerged and it transpired that FTDNA had tested the rape kit and allowed the FBI to upload the profile to the FTDNA database as part of a covert operation. The FBI had invoked a legal privilege to prevent the disclosure of this information, thus raising concerns about the transparency and accountability of the FBI.
though the number available for LE matching is unknown.
LE agencies wishing to use the FTDNA database are required “to register all forensic samples and genetic files prior to uploading to the FTDNA database. Permission to use the service is only granted after the required documentation is submitted, reviewed, and approved.” Permission to use the FTDNA database for law enforcement purposes is only granted “to identify the remains of a deceased individual” and “to identify a perpetrator of homicide, sexual assault, or abduction”.
FTDNA works with US LE agencies but will consider working with agencies outside the US on a case-by-case basis. Gene By Gene (https://genebygene.com/forensics/), the parent company of FTDNA, has its own testing laboratory in Houston, Texas, which has established a forensics division performing DNA extraction and testing in house. LE uploads are also accepted for a fee when testing has been done elsewhere.
LE kits are not visible to other FTDNA users regardless of whether they have opted in or opted out, and LE agencies receive a more restricted match list than regular customers. However, and similar to MyHeritage, FTDNA is theoretically susceptible to unauthorized LE uploads seeking to gain access to the entire database rather than the restricted LE matching portion.
6. Third-party services
In addition to the databases provided by the four main DTCs, there are two third-party services – GEDmatch and DNASolves – which do not sell DNA test kits but provide databases that accept uploads and can be used for LE matching. Below we address GEDmatch in particular since this portal has been the main entry point for LE up till now. Three additional third-party databases – DNA.Land (https://dna.land/), Geneanet (https://en.geneanet.org/) and Geni (https://www.geni.com/) – accept autosomal DNA uploads and could be vulnerable to unauthorized uploads, but these databases are all very small and less likely to be the focus of investigations.
GEDmatch (https://www.gedmatch.com/) was founded in 2010 as a hobbyist website by genealogists Curtis Rogers and John Olson to supplement the tools provided by the DTCs and to help in particular with unknown parentage searches.
GEDmatch is a freemium website with both free and paid-for tools that perform a series of comparisons to other uploaded profiles and provide additional functionalities. GEDmatch was acquired in December 2019 by Verogen, a private forensic genomics company.
The portal allows the user to search for matches with people who have tested on different platforms and at different testing companies. GEDmatch now accepts SNP profiles from over 20 DTC providers and is able to accept raw data from both microarrays and whole genome sequencing. It further allows users to upload DNA profiles obtained from ancient DNA (aDNA) samples or from the testing of artefacts of deceased people (e.g. testing the tooth or bone of a deceased parent or the DNA of a deceased relative obtained from a letter) as long as some quality criteria (i.e. number or density of genetic markers) are fulfilled. Artefact testing for genealogical purposes is still in its infancy but is likely to be a growth sector in the future. GEDmatch also has a tool available in the Tier 1 subscription which allows the user to combine SNP sets from multiple testing platforms into a ‘superkit’ to maximize the potential/reach of the search.
GEDmatch has a dedicated law enforcement portal known as GEDmatch Pro (https://pro.gedmatch.com/) which was launched in December 2020. Law enforcement are now charged a fee to upload a SNP profile and LE uploads are no longer accepted on the main GEDmatch website. On 11th January 2021 Verogen subtly updated the site policy to allow unidentified human remains to be compared against the entire database.
Profiles uploaded to identify the perpetrator of a violent crime will continue to be matched only against the opt-in portion of the database. This change was made without the consent of the users and was a reversal of the decision taken in May 2019 to opt out the entire database from law enforcement matching and seek fresh consent from users. It is not clear how the distinction between offenders and unidentified remains will be enforced. It is also not clear how Verogen can effectively identify LE users and prevent unauthorized uploads. Although LE uploads are expected to be declared as such, there is no regulation of this process outside of the guidelines and code of practice issued by the US DoJ and SWGDAM (see Section 4).
6.1.1 GEDmatch SNP uploads and analyses
Uploading a set of SNP genotypes, whether from a DTC raw data file or as compiled variants from a microarray or WGS analysis of forensic DNA, initiates the GEDmatch processing, which begins with the parsing of SNP data to ensure viability, followed by the assignment of a kit number. LE uploads are marked as research kits and so are excluded from comparisons made by individual users with their own kits and are not visible to other users. The SNP data are subjected to a process called tokenization, creating a compressed site-specific binary format which allegedly could not be decoded in a security breach. As part of this process, health-related SNPs, SNPs with low minor allele frequencies and SNPs with no calls are removed. All comparisons in the database are done with the token files. Details of the tokenization process are given in Supplementary File S4. Once tokenized, the original upload is deleted, so it is not possible for SNP allele calls to be accessed directly either by a user or through malicious attacks on the site. However, if the phenotype of the query profile is known, it is still possible to infer that matches have a particular trait of interest, as demonstrated by Leah Larkin in a cystic fibrosis case study.
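The removal rules that accompany tokenization can be pictured with a short Python sketch. The token format itself is not public, so the code covers only the three removal rules named above. The record layout, the `HEALTH_RSIDS` set, the placeholder rs-numbers and the 0.01 frequency cutoff are all illustrative assumptions, not GEDmatch values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnpCall:
    """Hypothetical record layout; GEDmatch's internal format is not published."""
    rsid: str
    genotype: str  # e.g. "AG"; "--" denotes a no-call
    maf: float     # minor allele frequency taken from some reference panel


HEALTH_RSIDS = {"rs0000001"}  # placeholder list of health-related SNPs (assumption)
MIN_MAF = 0.01                # assumed cutoff, chosen only for illustration


def filter_for_tokenization(calls):
    """Return the calls that survive the three removal rules."""
    kept = []
    for call in calls:
        if call.rsid in HEALTH_RSIDS:   # rule 1: health-related SNP
            continue
        if call.maf < MIN_MAF:          # rule 2: low minor allele frequency
            continue
        if "-" in call.genotype:        # rule 3: no-call
            continue
        kept.append(call)
    return kept


calls = [
    SnpCall("rs0000001", "AG", 0.30),   # health-related: removed
    SnpCall("rs0000002", "CC", 0.001),  # low MAF: removed
    SnpCall("rs0000003", "--", 0.25),   # no-call: removed
    SnpCall("rs0000004", "AG", 0.40),   # kept
]
kept = filter_for_tokenization(calls)
```

Only the last record survives in this toy input; the compression into a binary token would follow as a separate step that is not modelled here.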
The DNA File Diagnostic Utility can be used to verify the number of SNPs used in the token files. There are two different versions of the token file. The standard token file is used for all the GEDmatch tools with the exception of the One-to-Many comparisons, which use the slim token file. To save processing time, heterozygous SNPs are removed from the slim token file. Under half-identical matching a heterozygous genotype shares an allele with every possible genotype at a biallelic site, so these SNPs would produce universal matches and do not provide any additional information.
GEDmatch data viability checks reject SNP numbers below 50,000 as insufficient for reliable segment comparisons, so potentially useful datasets from very degraded DNA may require troubleshooting of the DNA extract to increase call rates or a proportion of genotypes may be inferred by SNP imputation (see Section 8). Although imputation can ‘rescue’ scant genotype coverage when analyzing very challenging forensic DNA, in the most extreme cases uploading heavily imputed data leads to excessive numbers of false associations – commonly observed as a high proportion (up to 25%) of the reference profiles associating with the query file.
The One-to-Many tool is used to search for matches in the GEDmatch database. There are two different versions of this tool: the standard version and a beta version which has enhanced functionality and some additional features limited to Tier 1 subscribers. The One-to-Many tools look for all the SNPs in common between any two kits and then use a simple system of half-identical matching to look for matching segments. The basic tools provide a list of 3000 matches, while additional matches can be viewed with the Tier 1 tool. Segments under 7 cM are excluded by default, but there is an option to set more specific thresholds both for length (cM) and number of required SNPs in each segment. The One-to-Many reports also include X-chromosome matches. The match list provides information on the length of the largest segment, the total cM shared, plus the number of generations between the pair suggested by their segment overlaps. The number of overlapping SNPs is also reported. If there is a low overlap, as happens for example when comparing a GSA kit with an OmniExpress kit, the overlap is marked in pink in the beta tool to highlight that caution is needed in the interpretation of results. All GEDmatch kits marked as ‘private’ or ‘research’ are excluded from the matching process. For LE kits the One-to-Many comparison is made with the subset of profiles permitting LE access. More details on the number of matching segments, their individual cM lengths/SNP numbers, and the bounding genome co-ordinates of these segments are given in the follow-up One-to-One searches made for each of the most closely related individuals. The One-to-One X-DNA comparison tool provides additional information on X-chromosome matches. Smaller segments down to 3 cM can be seen in the One-to-One tools.
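Half-identical matching with length and SNP-count thresholds can be illustrated with a minimal Python sketch. This is not the GEDmatch implementation, which is unpublished; the function names, the toy genetic map and the `min_snps=3` value are assumptions chosen to suit the small example, and only the 7 cM default comes from the text above.

```python
def half_identical(g1, g2):
    """True when two unphased genotypes share at least one allele."""
    return bool(set(g1) & set(g2))


def _keep(run, min_cm, min_snps):
    """A run of cM positions qualifies if it is long enough and dense enough."""
    return len(run) >= min_snps and run[-1] - run[0] >= min_cm


def matching_segments(kit_a, kit_b, cm_map, min_cm=7.0, min_snps=3):
    """
    kit_a, kit_b: dict mapping rsid -> unphased genotype string such as "AG".
    cm_map: list of (rsid, cM position) for one chromosome, sorted by position.
    Returns a list of (start_cM, end_cM, n_snps) for qualifying segments.
    """
    segments, run = [], []
    for rsid, cm in cm_map:
        if rsid not in kit_a or rsid not in kit_b:
            continue                      # only SNPs present in both kits are compared
        if half_identical(kit_a[rsid], kit_b[rsid]):
            run.append(cm)
            continue
        # Opposite homozygotes (e.g. AA vs GG) end the current run.
        if _keep(run, min_cm, min_snps):
            segments.append((run[0], run[-1], len(run)))
        run = []
    if _keep(run, min_cm, min_snps):
        segments.append((run[0], run[-1], len(run)))
    return segments


# Toy chromosome: eight SNPs spaced 2 cM apart.
cm_map = [(f"s{i}", 2.0 * i) for i in range(8)]
kit_a = {"s0": "AA", "s1": "AG", "s2": "AA", "s3": "GG",
         "s4": "AG", "s5": "AA", "s6": "AA", "s7": "AG"}
kit_b = {"s0": "AG", "s1": "AA", "s2": "AA", "s3": "AG",
         "s4": "GG", "s5": "GG", "s6": "AA", "s7": "AA"}
segments = matching_segments(kit_a, kit_b, cm_map)
```

In this toy input the opposite homozygotes at `s5` split the chromosome into an 8 cM run of five SNPs, which is reported, and a 2 cM run that falls below the 7 cM threshold and is dropped.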
The listed individuals in GEDmatch are identified in the match list by their kit number (assigned to each member’s uploaded SNP profile), self-designated name or alias, and an e-mail address. It is common for one member with a single e-mail to manage a large number of individual kits. The ‘People who match both, or 1 of 2 kits’ tool can be used to identify the list of matches shared between two kits, which is the foundation of cluster building. It is possible to see who matches the query profile, and clicking on their kit numbers reports the matches of each match. Therefore, it is potentially possible to build up an extended network of matches. An automated clustering tool is available as a Tier 1 tool. Finally, Q-matching is available in Tier 1 tools, in which the Q process considers the individual statistical characteristics of each SNP, gaps in coverage, and several other factors to provide a more evidence-based analysis of segments before they are reported.
As with all entities storing sensitive information, public genetic databases are particularly susceptible to breaches, either as a way to obtain the genetic data itself or to upload forged profiles potentially misleading LE investigators. The security of users’ kit permissions in GEDmatch is now centre stage in discussions about the consequences of the security breach of July 2020.
Malicious attacks on GEDmatch could seek to target information on three types of data: i. SNP genotypes which are not accessible online; ii. kit numbers and the associations produced relative to other kit numbers; and iii. users’ names or aliases and e-mail addresses. The July 2020 breach reset each user’s permissions so that private and research kits, including LE kits, became part of the segment matching comparisons and kit numbers were displayed among putative familial networks. Therefore, if queries were conducted in the 3-hour period of the attack, the ramifications are that putative associations were displayed temporarily to include kit number, name/alias and e-mail of each. Other information potentially accessed included the DTC used to create a kit. It is not clear whether other information was obtained from the GEDmatch attack, as the low numbers of targeted MyHeritage users would suggest the e-mail addresses were obtained by running One-to-Many queries where the source DTC is indicated. Online user forums have recommended GEDmatch members delete their kits and re-upload with new kit numbers for the same SNP data, as a security measure which resets links between kit numbers and personal information. Since their recent acquisition of GEDmatch, Verogen have made repeated assurances that the SNP file reconfiguration process in GEDmatch makes a person’s genetic data secure from data mining attacks. These SNP data include the rs-number; GRCh37 coordinate; and allele calls. Health sensitive and low minor allele frequency sites are stripped from the SNP file that is uploaded. This is an important point, as it has been reported that the One-to-One comparison tool can be used to mine genotypes by using artificial SNP datasets designed to find known relatives and estimate the genotypes when mismatches are found. Recent studies by Edge and Coop and by Ney et al. [Ney, P., L. Ceze, and T. Kohno, Genotype extraction and false relative attacks: security risks to third-party genetic genealogy services beyond identity inference. In: Network and Distributed System Security Symposium (NDSS), 2020] have demonstrated that an attacker could upload artificial files and attempt to extract the large majority of allele calls of other GEDmatch kits. They concluded that the visualizations and other results such as segment boundaries leak enough information for attackers to infer over 90% of the SNPs used in the comparisons. Therefore, this could potentially indicate a significant privacy violation for the targeted individuals. Verogen has stated that a series of measures are now in place, which effectively block such attacks.
The system of using kit numbers at GEDmatch to access DNA profiles and match lists brings a risk of sensitive accounts, e.g. LE accounts, being exposed as a result of user error, for example, if the kit number of a private or research kit is inadvertently shared or published. In the case of James Curtis Clanton, LE published the GEDmatch kit number of the crime scene profile in a publicly available affidavit for his arrest warrant, along with the initials of the people in the match list. As a result, all of the suspect’s family members could easily be identified by anyone with a basic understanding of genealogy.
Although the kit was removed after LE personnel were alerted, many people would have had access to the match list up to this time, and such a breach could potentially compromise an investigation.
Set up in December 2019, DNASolves (https://dnasolves.com) is run by Othram (https://www.othram.com), and is intended to be a dedicated SNP database for LE use. As of March 2020, there were estimated to be several thousand profiles in the database.
DNASolves accepts SNP data from the four main DTCs and sequencing data in other formats (BAM/SAM, FASTQ or VCF). Some of the database plans were revealed in a podcast with David Mittelman, CEO and founder of Othram, on the Genialis website (aired March 2020).
Users contribute data to DNASolves solely to solve crime; there is no public-facing search and users cannot be matched with relatives or access anyone else’s data but their own. People can voluntarily submit their name, date of birth and their parents’ names as data points to help investigators. When LE agencies submit data for a case, their credentials are validated. The matching algorithm is similar to that of AncestryDNA (personal communication, David Mittelman). The database is currently a grassroots effort with a user group on Facebook where new features are discussed.
Although DNASolves is now actively accepting uploads there are no reported cases of it yet being used to produce investigative leads.
7. Technologies that generate a SNP dataset from forensic DNA
There are three ways that SNP genotype datasets suitable for relative searches can be generated from forensic DNA: i. using the same type of SNP microarrays as those adopted by the DTCs; ii. whole-genome sequencing (WGS) to obtain sufficient sequence coverage to reliably call heterozygote variant sites; iii. use of massively parallel sequencing (MPS) to perform targeted sequencing. This latter category can be further divided into amplicon-based methods (amplifying a smaller subset of SNPs with higher overall informativeness than those genotyped by a full SNP microarray) and hybridization capture methods. These different genotyping technologies are described below, and Fig. 6 illustrates the main steps included in the workflows. The genotyping technologies have different characteristics including analysis cost; availability of off-the-shelf assays; instrumentation requirements; data handling capacities required; and protocols available and optimized for the analysis of low quantity/low quality DNA. Note that it is possible to perform relative searches, which are not necessarily based on segment matching, but neither GEDmatch nor the DTCs currently use alternatives to the measurement of shared IBD segments. Therefore, when a subset of a typical DTC SNP dataset is assembled comprising 10,000, 20,000 or 50,000 SNPs, all uploads to GEDmatch are rejected due to insufficient data for IBD segment matching. This SNP density limit will potentially change in the near future as a result of initiatives by Verogen to develop smaller SNP sets for forensic analysis which will be suitable for their ForenSeq MPS platform, but informative enough to make reliable relationship inferences (see Section 7.4).
7.1 SNP microarrays
SNP microarrays have been the system of choice for over 15 years to genotype large numbers of SNP sites in a single workflow [
]. The basis of SNP allele detection with microarrays is to let fragmented sample DNA sequences hybridize to oligonucleotide sequences bound to a surface or to beads. In Illumina’s BeadArray technology these oligo-sequences are designed to end prior to the SNP position, and the variant nucleotide(s) in the test sample are identified by single base extension. The Affymetrix (now part of Thermo Fisher Scientific) microarray technology uses fragmented DNA labelled with fluorescent dyes and then hybridized to a dense panel of allele-specific capture probes on the microarray surface. Hybridization of the DNA fragments containing the target SNP nucleotides, to one or both allele capture probes, produces dye signals detected by microscopic examination of the microarray surface corresponding to each genotype. There are multiple replicated capture probes per allele and SNP to ensure reliable consensus genotype analysis. In early microarray versions there was a degree of non-specific hybridization, but the sensitivity and reliability of the probe designs has improved markedly and, with optimized signal processing pipelines in place, microarrays deliver a very reliable system for SNP genotyping. The two most commonly used microarrays for SNP genotyping are Illumina GSA (654,000 target SNPs) and Illumina CytoSNP (850,000 target SNPs). Prior to the introduction of the GSA in 2016, the Illumina OmniExpress was the most commonly used microarray. Affymetrix provides the most commonly used alternative SNP microarray technology to the Illumina system. All current DTC analyses are based on Illumina OmniExpress or GSA microarrays with the addition of up to 50,000 customized SNPs, with the exception of Living DNA (using Affymetrix). The CytoSNP microarray has more SNP targets, but unfortunately only provides a 104 K marker overlap with the GSA. Therefore, imputation is required to increase the overlap and facilitate relative searches (see Section 8). 
The CytoSNP microarray is used by Parabon NanoLabs both for phenotype predictions and for upload to genealogy databases [
]. Generally, SNP microarrays require 20–100 times more DNA than would be considered a standard input quantity of 1 nanogram (ng) for other forensic DNA tests, and this has hindered the technology’s adoption for forensic analysis since it was first developed. Wendt et al. recently demonstrated in a titration experiment (1–200 ng of input DNA), using the Infinium Omni2.5Exome-8 chip, that high genotype concordance and call rates can be obtained down to 25 ng of input DNA [
]. However, the quality of the DNA used, in terms of degradation, levels of inhibitory substances and ratios of bacterial to human DNA in the sample, has a much greater effect on SNP call rates and their reliability (i.e. the genotype concordance recorded) than pushing DNA input levels below recommended quantities, as shown by the Bode Technology experiments outlined in Section 7.3. Two advantages with SNP microarray typing are the comparatively low cost and the ease of variant calling compared to the much more demanding bioinformatic analysis required by whole-genome sequencing.
7.2 Whole-genome sequencing
The ability to sequence the whole genome of an individual in a viable single workflow has been refined much more recently than SNP microarray technologies. Initially confined to genetic research, WGS is now increasingly applied to clinical studies – where a rapid and practical sequencing system is required. The workflow is quite straightforward and starts by shearing the DNA into smaller fragments, achieved by, e.g. sonication. Sequencing adapters and sample indexes are then ligated onto these fragments. The library is amplified and after one or several purification steps it is ready for sequencing. Illumina and Thermo Fisher Scientific each offer whole-genome-scale sequencing systems with greatly expanded nucleotide reading throughput. Both companies have adapted the chemistry and nucleotide detection of their MPS targeted sequencing solutions now increasingly applied to forensic DNA analysis with the MiSeq-based and Ion S5 systems. Of the two options, Illumina have dominated the approaches to sequencing human whole genomes with the HiSeq X and NovaSeq systems, and each has been successfully applied to sequence-challenging forensic DNA samples ([
] and Section 7.3, respectively). There are kits and protocols available to analyze as little as 50 pg of DNA, e.g. using the ThruPLEX DNA-Seq Kit (Takara), although such a low amount requires pure, good quality DNA. One advantage of the WGS protocol is that the input in the library preparation is fragmented DNA, which increases the probability to obtain results from degraded forensic samples. Brandhagen et al. [
] employed a whole-genome shotgun sequencing approach on rootless human hair shafts and showed that complete mitochondrial genomes (mtGenomes) could be recovered from aged hair shafts in reasonable quantities. However, the sequencing data in their study was not sufficient to provide any reasonable depth of coverage across the nuclear genome. A probable cause was that their libraries were sequenced on a MiSeq with considerably less sequencing capacity than high-throughput NextSeq or NovaSeq platforms. A method has recently been developed to extract DNA from rootless hair shafts to create SNP genotype datasets from WGS for upload to GEDmatch. The methodology has not been published to date but has reportedly been used to identify two murder victims.
An additional advantage with whole-genome sequencing is that, if high coverage data are obtained, it is possible to design and extract genotypes for any custom SNP panel. Thus, there is no need to target specific primers or probes. The disadvantages remain the high cost and computational workload when performing the bioinformatics and genotype calling.
7.2.1 Application of whole-genome sequence SNP genotyping to a real case
The application of HiSeq X WGS analysis of DNA extracted from the femur of a 2003 murder victim by Tillmar et al. [
] was documented in detail. The researchers used carefully constructed validation measures throughout the process to ensure there was sufficient sequence coverage for robust SNP genotype calling from these data. They were able to build a SNP dataset of more than 1.3 million variants that allowed efficient querying of the GEDmatch database. We describe the process used in detail next, since this was achieved by a forensic laboratory that was already investigating the case with conventional DNA analyses, rather than a commercial supplier who may not wish to disclose proprietary methods in such detail.
The sequencing pipeline followed three steps. First, DNA was extracted by standard phenol-chloroform methods from two grams of bone powder [
]. A critical part of this preparatory step was the checks made of the DNA quality prior to WGS. As well as quantitation with NanoDrop and checks of DNA integrity with Agilent TapeStation tests, MPS-based genotyping of an established forensic ID-SNP panel [
] ensured the DNA extracts would be suitable input for WGS. Prior genotyping by MPS also provided a concordance check of the SNP calls made by WGS at lower average sequence coverage. The MPS system tested the Qiagen QIAseq Investigator 140-SNP identification panel (131 SNPs used), set to a minimum coverage threshold of 200X and allele read frequency limits of 0.4–0.6 for heterozygotes and 0.1–0.9 for homozygotes. Additional GlobalFiler STR profiling was run alongside the MPS tests. Second, three library preparations were made in parallel from the single bone powder DNA extract with the ThruPLEX® DNA-seq 48S Kit (R400427, Takara). Each library used 3 ng of fragmented DNA prepared by sonication to produce fragments with a mean size of 400 basepairs (bp). Sequencing was performed with an Illumina HiSeq X instrument and v2.5 sequencing chemistry using paired-end sequencing and read lengths of 150 bp. Third, a bioinformatics pipeline was created to compile the three FASTQ files by alignment to human genome build hg19 and was recalibrated to adjust for potential misalignments caused by flanking indels, etc., with GATK. Duplicated, broken and non-specifically mapped reads were removed using Qiagen Biomedical Genomics Workbench v5.0.1.
For the GEDmatch relative searches 1,378,481 SNPs were selected from the complete WGS variant dataset based on an optimum intersect of SNPs from the DTCs' adapted GSA and Illumina OmniExpress sets, comprising: 23andMe v5; AncestryDNA v2; and FTDNA/MyHeritage v1 microarrays. SNP genotyping quality was checked by applying four QC criteria to the HiSeq X sequence output: sequence coverage; allelic balance; Q-score and forward-reverse read ratio. Threshold values for these were, respectively: 10 or more homozygote reads, 5 or more per allele heterozygote reads; 0.5–0.7 heterozygote allele ratios; Q-scores higher than 25; and a read ratio of at least 0.2. From almost 3 billion reads, 86.7% were successfully aligned to the reference genome with a mean coverage of 32.2X. In the WGS-based SNP genotypes, 122/127 (WGS/MPS QC passed) cross-check genotypes passed their respective thresholds and were concordant. Approximately 75% of targeted SNPs passed the above thresholds, leading to a total of 1,035,274 SNPs which were considered to be reliably called and compiled into the query profile for this case.
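The four QC criteria can be sketched as a per-site filter in Python. This is an illustration under stated assumptions rather than the pipeline used in the case. The `SiteCall` layout is hypothetical, the allelic balance is interpreted here as the fraction of reads carrying the better-covered allele, and the read ratio is interpreted as the smaller strand count divided by the larger; the source text does not define either ratio precisely.

```python
from dataclasses import dataclass


@dataclass
class SiteCall:
    """Hypothetical per-SNP summary extracted from aligned WGS reads."""
    ref_reads: int
    alt_reads: int
    mean_q: float
    fwd_reads: int
    rev_reads: int
    is_het: bool


def passes_qc(site):
    """Apply the four thresholds quoted in the text to one SNP site."""
    if site.mean_q <= 25:                                  # Q-score higher than 25
        return False
    lo, hi = sorted((site.fwd_reads, site.rev_reads))
    if hi == 0 or lo / hi < 0.2:                           # read ratio (assumed: min/max strand)
        return False
    if site.is_het:
        if min(site.ref_reads, site.alt_reads) < 5:        # 5 or more reads per allele
            return False
        total = site.ref_reads + site.alt_reads
        major = max(site.ref_reads, site.alt_reads) / total
        return 0.5 <= major <= 0.7                         # allelic balance (assumed: major fraction)
    return max(site.ref_reads, site.alt_reads) >= 10       # 10 or more homozygote reads


examples = [
    SiteCall(30, 0, 35.0, 14, 16, False),   # well-covered homozygote
    SiteCall(12, 9, 30.0, 10, 11, True),    # balanced heterozygote
    SiteCall(20, 4, 30.0, 12, 12, True),    # minor allele below 5 reads
    SiteCall(30, 8, 30.0, 19, 19, True),    # allelic imbalance (0.79)
    SiteCall(30, 0, 35.0, 27, 3, False),    # strand bias (ratio 0.11)
]
results = [passes_qc(s) for s in examples]

# Pass rate reported for the case: SNPs retained / SNPs targeted.
pass_rate = 1_035_274 / 1_378_481
```

The final line reproduces the arithmetic behind the "approximately 75%" figure: 1,035,274 retained SNPs out of 1,378,481 targeted is a pass rate of about 0.751.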
The GEDmatch relative search was marked for LE purposes and made across the full database of ~1.2 million reference profiles, i.e. before the opt-in setting, which lets members choose whether to permit LE access, was applied. Searches returned several thousand putative relatives, but these were refined by choosing the top 100 matches that had >10 cM matching the query profile and 7 cM in common with others in the match list. This led to 36 putative relatives being analyzed further, which created four clusters of individuals estimated to have been linked by their relationships to common grandparents. Information for some of the matched relatives indicated a Croatian origin and, in fact, could be located more precisely to an area of ~40 km radius in NW Croatia. At this stage, meaningful investigative leads could be given to the police for their enquiries. These analyses are particularly important in establishing a benchmark for the validation of a new approach to forensic SNP genotyping and its application to IGG. They show the value of having the WGS analyses performed by a forensic laboratory that is well versed in applying multiple QC checks to novel techniques and that has the diligence and depth of experience necessary for the handling of limited evidential material.
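The refinement of a long match list into clusters can be sketched as follows. The two thresholds (more than 10 cM shared with the query, 7 cM shared with another match) and the top-100 cut come from the text; grouping by connected components is an assumption made for illustration, since the clustering procedure actually used is not described here, and the kit names and cM values are invented.

```python
def refine_and_cluster(query_cm, pair_cm, top_n=100,
                       min_query_cm=10.0, min_shared_cm=7.0):
    """
    query_cm: dict kit -> total cM shared with the query profile.
    pair_cm:  dict frozenset({kit1, kit2}) -> cM shared between two matches.
    Returns a list of clusters, each a sorted list of kit identifiers.
    """
    # Step 1: keep the strongest matches above the query threshold.
    ranked = sorted((k for k, cm in query_cm.items() if cm > min_query_cm),
                    key=lambda k: query_cm[k], reverse=True)[:top_n]
    kept = set(ranked)

    # Step 2: link matches that also share enough DNA with each other.
    neighbours = {k: set() for k in kept}
    for pair, cm in pair_cm.items():
        a, b = tuple(pair)
        if a in kept and b in kept and cm >= min_shared_cm:
            neighbours[a].add(b)
            neighbours[b].add(a)

    # Step 3: group linked matches into connected components (assumed method).
    clusters, seen = [], set()
    for start in sorted(k for k in kept if neighbours[k]):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            kit = stack.pop()
            if kit in component:
                continue
            component.add(kit)
            stack.extend(neighbours[kit] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters


# Invented example data.
query_cm = {"K1": 85.0, "K2": 60.0, "K3": 42.0, "K4": 30.0, "K5": 12.0, "K6": 8.0}
pair_cm = {
    frozenset({"K1", "K2"}): 120.0,
    frozenset({"K3", "K4"}): 15.0,
    frozenset({"K2", "K5"}): 5.0,    # below 7 cM: no link
    frozenset({"K1", "K6"}): 50.0,   # K6 shares only 8 cM with the query: excluded
}
clusters = refine_and_cluster(query_cm, pair_cm)
```

Here `K5` passes the query threshold but shares too little with any other match, so it drops out, leaving two clusters that would each be investigated for a common ancestral couple.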
7.3 Evaluation of technology for forensic samples
There are few scientific studies on the suitability of each technology applied to forensic casework. Bode Technology published a webinar (“Forensic Genealogy: Unlocking the Science of Genealogy”) outlining an evaluation of Illumina GSA/CytoSNP microarrays and WGS for forensic DNA analysis in a valuable series of comparative tests [
]. Evaluations were based on the traditional measure of forensic sensitivity using dilution series of control DNA and artificial degradation of the same samples by progressively longer periods of sonication. Quality of SNP genotyping was measured by concordance rates (% of concordant genotypes with 250 ng input) and call rates (% of genotypes called), with input DNA quantities of 250; 50; 10; 2; 1; 0.5; and 0.25 ng extracted from blood and sperm. From the results presented in the webinar, microarray technology (1x CytoSNP, duplicated GSA runs from two different analysis laboratories supplying data to Bode) was able to provide high call rates from 100% (250–50 ng) reducing to 95% for 10 ng of blood-based input DNA, which only dropped to 90–95% at 2 ng input (1–0.25 ng not reported or analyzed). The lowest concordance rates were 89% in the 2 ng sample, but this lowest rate was an outlier value for general rates of 100–95% concordant genotypes, and one GSA analysis laboratory was consistently higher for both values, indicating that established expertise and experience with handling low level microarray input affects the quality of results obtained. Apart from these differences, no discernible differences were detected between CytoSNP and GSA results. Sperm fraction DNA gave much lower call rates ranging from 90% down to ~65% (50 ng duplicates to 2 ng, respectively; no 100 ng analyses made), and concordance of 100–92% was similar to that from blood-based DNA. Degraded DNA produced by sonication and measured by Degradation Index (DI) had high concordance in microarray analyses of 100–98%. However, once DI values reached 6.6, 11.1 and 21 (from 1.4) call rates dropped to between 90% and 58%, meaning microarray technology struggled to detect damaged DNA with SNP target fragments of sufficient size to hybridize successfully.
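The two quality measures can be made concrete with a small Python sketch. It assumes, for illustration, that genotypes are stored as unphased two-letter strings with `None` for a no-call, and that concordance is computed only over sites called in both the test and the 250 ng reference run; the webinar does not state how no-calls were handled, and the genotypes below are invented.

```python
def call_rate(genotypes):
    """Percentage of targeted SNPs that received a genotype call."""
    called = sum(1 for g in genotypes.values() if g is not None)
    return 100.0 * called / len(genotypes)


def concordance(test, reference):
    """
    Percentage of genotypes in `test` that agree with `reference`,
    counted over sites called in both (assumption). Allele order is ignored.
    """
    compared = [rsid for rsid, g in test.items()
                if g is not None and reference.get(rsid) is not None]
    same = sum(1 for rsid in compared
               if sorted(test[rsid]) == sorted(reference[rsid]))
    return 100.0 * same / len(compared)


# Invented genotypes: a 250 ng reference run and a 2 ng test run.
reference_250ng = {"s1": "AG", "s2": "CC", "s3": "TT", "s4": "AG", "s5": "CT"}
test_2ng = {"s1": "GA", "s2": "CC", "s3": None, "s4": "AA", "s5": "CT"}

rate = call_rate(test_2ng)
agreement = concordance(test_2ng, reference_250ng)
```

In this toy run four of five sites are called (80% call rate) and three of those four agree with the reference (75% concordance); the discordant site `s4` shows the allele drop-out pattern typical of low input, where a heterozygote is read as a homozygote.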
WGS analysis was provided by a laboratory specializing in this technique using the Illumina NovaSeq 6000 system. Call rates for blood and sperm extracts were slightly lower with WGS than microarrays (97–87% in blood DNA, 93–41% in sperm fraction DNA), but concordance was consistently high at all input levels at: 100% (250 ng) to 98% (2 ng) in blood, and not dropping below 99% in all inputs (no 250 ng input). When very low levels of input DNA were examined this high sensitivity was maintained; duplicates of 2, 1, 0.5 and 0.25 ng of blood DNA had 92–91% call rates and were at or close to 100% concordance. This translated to GSA microarrays reaching > 91% concordance with WGS data in blood and > 81% in semen. Therefore, the overall trend in sensitivity measurements indicated WGS was more sensitive than microarrays and this sensitivity was more consistent—concordance dropped much less markedly as input DNA was reduced despite slightly fewer calls being made.
Following these experiments, an interesting evaluation of the SNP dataset informativeness obtained from each technique was performed using GEDmatch to examine known matches to kits from the control DNA used. The low template DNA SNP dataset obtained from the GSA microarray matched 9/13 kits in GEDmatch, compared with 11/13 with WGS, indicating more extensive SNP genotypes from WGS with minimal input DNA. The difference in the performance of WGS vs microarrays was much more marked when uploading SNP datasets from degraded DNA, with no kit matches amongst the top 18 with GSA-analyzed DNA having DI values of 6, 11 and 21; in contrast to WGS, with matches to all the top 18 with DNA at DI values of 1 and 21. Bode Technology concluded from these studies that WGS is the system of choice for forensic DNA as it is a more accurate and sensitive SNP genotyping system for degraded DNA, matching or exceeding genotype call rates from microarrays.
Massively parallel sequencing (MPS) based methods using hybridization capture can genotype a large number of SNPs. These techniques have many similarities with whole-genome sequencing but, instead of including all fragments in the library to be sequenced, only those from regions of interest are captured and sequenced. The major advantages of this approach are that only relevant sequences are analyzed and deeper coverage is consequently obtained for those targets. However, one disadvantage is that effort is needed to design the probes (or “baits”) used to capture the sequences of interest. Several different hybridization capture methods have been developed. The main steps are similar (see Fig. 6) but variation exists, especially in the way the targets are captured. Some examples of commercial hybridization capture technologies are SureSelect (Agilent), HaloPlex (Agilent), Nextera (Illumina), myBaits (Arbor Biosciences), Twist technology (Twist Bioscience) and SeqCap (Kapa HyperExplore).
In hybridization capture methods the template genomic DNA is first randomly sheared by e.g. sonication or restriction enzymes. Sequencing adapters (which can also include sequences for library amplification, sample barcoding, etc.) can then be ligated to the fragmented DNA. The sequences of interest are captured using oligonucleotide probes. These synthetic probes are hybridized to the regions of interest and these hybridized regions are further captured by, e.g. magnetic beads, enabling non-targeted DNA fragments to be washed out. The probes are then removed from the targets prior to library amplification and sequencing. Sequencing can be performed using standard MPS approaches such as MiSeq/NextSeq/NovaSeq (Illumina) or Ion GeneStudio (Thermo Fisher Scientific). The number of targets, the desired depth of coverage per target, the level of sample multiplexing and other variables determine the level of sequencing capacity needed.
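How the number of targets, depth of coverage and sample multiplexing combine into a sequencing requirement can be sketched with simple arithmetic. This is a rough planning aid under stated assumptions, not a vendor formula: each usable read is assumed to cover exactly one target once, and the on-target fraction, duplicate fraction and per-run read capacity are hypothetical placeholder values.

```python
import math


def reads_required(n_targets: int, depth: int, n_samples: int,
                   on_target_fraction: float = 0.5,
                   duplicate_fraction: float = 0.25) -> int:
    """Total reads needed for a pooled capture run.

    Assumes one usable (on-target, non-duplicate) read gives 1x coverage
    of one target. Default fractions are illustrative placeholders.
    """
    if not 0.0 < on_target_fraction <= 1.0:
        raise ValueError("on_target_fraction must be in (0, 1]")
    if not 0.0 <= duplicate_fraction < 1.0:
        raise ValueError("duplicate_fraction must be in [0, 1)")
    usable_fraction = on_target_fraction * (1.0 - duplicate_fraction)
    return math.ceil(n_targets * depth * n_samples / usable_fraction)


def runs_required(total_reads: int, reads_per_run: int) -> int:
    """Number of sequencing runs needed for a given per-run read capacity."""
    return math.ceil(total_reads / reads_per_run)


# Hypothetical design: 100,000 SNP targets, 30x depth, 8 pooled samples.
total = reads_required(n_targets=100_000, depth=30, n_samples=8)
print(total, runs_required(total, reads_per_run=25_000_000))
```

Halving the on-target fraction doubles the reads required, which is one reason capture efficiency on degraded material matters for cost.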
The main advantage of capture approaches, apart from offering high multiplexing capabilities, is that they are amenable to all sample types, from high-quality genomic DNA to severely degraded DNA. DNA from forensic samples and human remains is often of poor quality and, as a result of degradation, is already broken up into fragments, so such approaches are particularly suitable for forensic analysis. Hybridization capture is more costly than amplicon-based approaches, but has been shown to be superior when testing mtDNA from human remains [Performance evaluation of a mitogenome capture and Illumina sequencing protocol using non-probative, case-type skeletal samples: Implications for the use of a positive control in a next-generation sequencing procedure]. Many of the existing hybridization capture methods were initially optimized for research studies and clinical testing where large quantities of DNA are available. Nevertheless, several protocols have been adjusted for lower DNA input.
One ancient DNA study used a hybridization capture method to target approximately 1,240,000 SNPs in order to analyze historical genetic variation among 230 West Eurasians dating between 6500 and 1000 BCE. Almost 600,000 of these SNPs were included on the Affymetrix Human Origins microarray. The samples in that study comprised teeth, petrous bones, femurs and other sources. Interestingly, the authors compared their data with those of a similar study using whole-genome sequencing and found that, while the mean number of reads generated per sample with the capture approach was ~40 times lower, median coverage per analyzed SNP was ~4 times higher. Feldman et al. used the same capture assay to successfully produce genotype data from Bronze/Iron Age individuals.
Although most of the hybridization capture companies offer custom-made panels, we have not found any large-scale (>100 K SNPs) studies on forensic samples combined with genealogically relevant SNPs. However, Shih et al. analyzed a custom SeqCap assay (Roche) designed to capture the mtDNA genome and a smaller number of autosomal SNPs (~400). They tested their assay on forensic samples (telogen hairs, mock stain samples, etc.) and obtained highly accurate SNP genotype data. We expect more studies and case reports to be published in the near future in which hybridization capture methods are applied to forensic analyses. Lastly, Ancestry.com launched an MPS-based AncestryHealth test in August 2020.
At the time of writing, a targeted SNP genotyping system, using MPS to generate data for SNP sets at a much-reduced scale of approximately ten thousand loci, was being developed by Verogen following their acquisition of GEDmatch. The new assay, named the ForenSeq Kintelligence Kit (https://verogen.com/products/forenseq-kintelligence-kit/), was announced in January 2021 and comprises <10,250 SNPs, which exclude medically important loci or those with low minor allele frequencies. The kit is based on the established ForenSeq library preparation approach using the MiSeq FGx forensic genomics system (validated for forensic use). To develop the ForenSeq Kintelligence Kit, Verogen performed detailed bioinformatic analyses of the relative performance of component SNPs on various Illumina microarrays uploaded to GEDmatch, in order to gain knowledge of optimum candidates for smaller, forensically relevant SNP sets. Verogen will use a new IBS (identical by state)-based analysis tool that supports data from the new assay and is used in the LE portal. The advantage of developing a ‘built for purpose’ SNP set for kinship analysis is not only improved performance with challenging DNA (the kit is expected to work on forensic samples with DNA concentrations at sub-nanogram levels); it is also feasible that SNP details can be encoded at each stage of the genotyping and query processes, enabling better protection of kits in GEDmatch used for investigative purposes.
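To make the notion of identity by state concrete, the sketch below counts, for two genotype profiles, how many SNPs share 0, 1 or 2 alleles by state. It is a generic illustration only: it is not the Verogen tool, whose algorithm is not described in the source, and the genotype coding (number of alternate alleles, 0/1/2, with None for a no-call) and function names are assumptions.

```python
from collections import Counter
from typing import Optional, Sequence


def ibs_counts(a: Sequence[Optional[int]], b: Sequence[Optional[int]]) -> Counter:
    """Count SNPs at which two profiles share 0, 1 or 2 alleles by state.

    Genotypes are coded as alternate-allele counts (0, 1, 2); None marks a
    no-call and the SNP is skipped. For biallelic SNPs the number of
    alleles shared by state is 2 - |g_a - g_b|.
    """
    if len(a) != len(b):
        raise ValueError("profiles must cover the same SNPs in the same order")
    counts = Counter({0: 0, 1: 0, 2: 0})
    for ga, gb in zip(a, b):
        if ga is None or gb is None:
            continue
        counts[2 - abs(ga - gb)] += 1
    return counts


def mean_ibs(counts: Counter) -> float:
    """Average proportion of alleles shared by state over compared SNPs."""
    n = sum(counts.values())
    if n == 0:
        return float("nan")
    return (counts[1] + 2 * counts[2]) / (2 * n)


# Hypothetical seven-SNP profiles.
person_a = [0, 1, 2, 1, 0, None, 2]
person_b = [0, 1, 0, 2, 1, 1, 2]

shared = ibs_counts(person_a, person_b)
print(dict(shared), round(mean_ibs(shared), 3))
```

Unlike segment-based (identical by descent) matching, a summary of this kind needs no phasing or genetic map, which may be one reason IBS-based measures suit smaller SNP sets; that reading is an interpretation, not a statement from the source.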
8. SNP genotype imputation
As discussed previously, DTC companies as well as scientific studies use a variety of different SNP microarrays, and their marker configurations can change over time. Although some microarrays have a large proportion of overlapping SNPs, others have a considerable number of non-overlapping SNPs, which may reduce the power of database searches and segment analyses. One example is the transition from Illumina’s OmniExpress microarray to their GSA microarray: the two have fewer than 200,000 SNPs in common.
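The overlap problem amounts to a set intersection over marker identifiers. The sketch below assumes each array is represented simply as a set of rsIDs; the identifiers and the function name are hypothetical, and real arrays carry hundreds of thousands of markers.

```python
from typing import Dict, Set


def overlap_summary(array_a: Set[str], array_b: Set[str]) -> Dict[str, float]:
    """Summarise how many SNP identifiers two marker sets have in common."""
    if not array_a or not array_b:
        raise ValueError("both marker sets must be non-empty")
    shared = array_a & array_b
    return {
        "shared": len(shared),
        "fraction_of_a": len(shared) / len(array_a),
        "fraction_of_b": len(shared) / len(array_b),
    }


# Hypothetical marker lists standing in for two array versions.
older_array = {"rs1", "rs2", "rs3", "rs4", "rs5", "rs6"}
newer_array = {"rs4", "rs5", "rs6", "rs7", "rs8"}

summary = overlap_summary(older_array, newer_array)
print(summary)
```

Only the shared markers can be compared directly between two kits, so the smaller of the two fractions bounds the data available for matching unless the missing markers are imputed.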
Missing data may also be the result of low-quality or degraded DNA. A way to increase the number of genotypes, and the proportion of overlapping SNPs, is to predict the missing genotypes with a method known as genotype imputation. This is relevant both when harmonizing data from different microarrays and when analysis of DNA of low quantity and/or quality leaves a large proportion of genotypes missing, since a database search may be impossible to conduct if too many SNPs are missing.
The aim of imputation is to predict the genotypes for SNPs not directly genotyped in a sample. One of the first studies using genotype imputation was in connection with the identification of genetic risk variants for type 2 diabetes. The study compared its results with those from similar studies conducted with different genotyping microarrays. Although rare, there are also examples of genotype imputation for forensic STR typing purposes. Edge et al. as well as Kim et al. recently published two studies in which they demonstrated that a standard STR profile can be used to impute genome-wide SNP data (and vice versa).
The underlying principle of genotype imputation is that any two individuals, including those who are apparently unrelated, will share short segments of DNA from a distant common ancestor. Factors like high levels of linkage disequilibrium (LD) and low recombination rates within small stretches of chromosomal segments will conserve haplotype variants through many generations. Shared segments can be identified if the observed genetic variants in the studied individual are compared with variants from a panel of reference individuals. From these shared segments, missing data in the sample can be predicted based on the observed genetic variants in the reference individuals. In practice, the genotype data (for both test individuals and reference individuals) are first converted into haploid format (i.e. haplotypes) by phasing methods. There is a wide range of phasing software, and many programs now combine phasing and imputation. The principle of phasing is illustrated in Fig. 7: population haplotype frequencies are used to probabilistically estimate the most likely haplotype configuration. Many phasing methods use hidden Markov models (HMMs) for this inference.
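The frequency-based principle of phasing can be sketched for a handful of SNPs by enumerating every haplotype pair consistent with an unphased genotype and ranking the pairs by the product of their population haplotype frequencies. This brute-force version is for illustration only and is not how production software works (the enumeration grows exponentially with the number of heterozygous sites, which is why HMMs are used). The haplotype frequencies, the 0/1/2 genotype coding and the function names are hypothetical.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

Haplotype = Tuple[int, ...]


def consistent_pairs(genotype: Sequence[int]) -> List[Tuple[Haplotype, Haplotype]]:
    """All unordered haplotype pairs whose alleles add up to the genotype.

    Genotypes are alternate-allele counts per site (0, 1 or 2).
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    base = tuple(g // 2 for g in genotype)  # homozygous sites are fixed
    pairs = set()
    for choice in product((0, 1), repeat=len(het_sites)):
        h1, h2 = list(base), list(base)
        for site, allele in zip(het_sites, choice):
            h1[site], h2[site] = allele, 1 - allele
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


def most_likely_phase(genotype: Sequence[int],
                      hap_freqs: Dict[Haplotype, float]):
    """Return the best haplotype pair and its share of the total probability."""
    scored = []
    for h1, h2 in consistent_pairs(genotype):
        p = hap_freqs.get(h1, 0.0) * hap_freqs.get(h2, 0.0)
        scored.append((p, (h1, h2)))
    total = sum(p for p, _ in scored)
    if total == 0.0:
        raise ValueError("no consistent haplotype pair has non-zero frequency")
    best_p, best_pair = max(scored)
    return best_pair, best_p / total


# Hypothetical population haplotype frequencies over three SNPs.
freqs = {(0, 0, 0): 0.5, (1, 1, 0): 0.3, (0, 1, 0): 0.1, (1, 0, 0): 0.1}
# Heterozygous at the first two SNPs, homozygous reference at the third.
pair, posterior = most_likely_phase((1, 1, 0), freqs)
print(pair, round(posterior, 4))
```

In this toy case two phase configurations are possible, and the one built from the two common haplotypes receives almost all of the probability.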
Once phasing is completed, the missing variants in the test sample can be predicted from the variants present in the reference individuals with matching haplotypes (see Fig. 8). A studied haplotype will be a mosaic of the reference haplotypes, where changes between reference haplotypes may represent historical recombination events, although differences may also reflect historical mutations, gene conversions and genotyping errors. Most imputation methods also use an HMM framework and differ mainly in the parameters and setup of the HMM. A key goal in the development of new methods and models for phasing and imputation is to decrease the computational burden, so that larger reference data sets can be handled and computations run faster without loss of accuracy. A selection of available software is presented in Table S1. The genotype predictions are not always perfect, and many of the programs provide a prediction probability along with the imputed genotype, which reflects the uncertainty of the imputed variant. Uncertainty tends to be greater for rare alleles, because they are observed less often in the reference data and tend to have lower levels of LD with common variants. An additional important factor is the size and population origin of the reference panel. Larger reference panels increase imputation accuracy and, since genotype imputation depends on finding haplotype segments shared between reference and target haplotypes, it is relevant to use a matching reference population. A reference panel with very little genetic similarity to the test sample can decrease imputation accuracy. At present, several public or partially public reference panels exist, including the HapMap project. The TOPMed project includes more than 100,000 sequenced samples.
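The mosaic idea can be made concrete with a small haploid HMM in the spirit of the Li and Stephens copying model, on which many imputation programs build. The sketch is deliberately simplified and is not the implementation of any particular program: the hidden state is the reference haplotype being copied, the switch probability `rho` is constant between adjacent SNPs (real methods derive it from a genetic map), and `eps` is a fixed mismatch probability. Both parameter values and the toy haplotypes are assumptions.

```python
from typing import Dict, List, Optional, Sequence


def _normalise(values: List[float]) -> List[float]:
    total = sum(values)
    return [v / total for v in values]


def impute_missing(target: Sequence[Optional[int]],
                   ref_haps: Sequence[Sequence[int]],
                   rho: float = 0.05, eps: float = 0.01) -> Dict[int, float]:
    """Posterior probability of the alternate allele at each untyped site.

    target: one phased haplotype, alleles 0/1, None where untyped.
    ref_haps: phased reference haplotypes over the same sites.
    """
    n_ref, n_sites = len(ref_haps), len(target)
    if n_ref == 0 or n_sites == 0:
        raise ValueError("need at least one reference haplotype and one site")
    if any(len(h) != n_sites for h in ref_haps):
        raise ValueError("reference haplotypes must match the target length")

    def emit(site: int, k: int) -> float:
        if target[site] is None:          # untyped site: no information
            return 1.0
        return 1.0 - eps if ref_haps[k][site] == target[site] else eps

    # Forward pass; each column is normalised, so its sum is 1.
    fwd = [_normalise([emit(0, k) for k in range(n_ref)])]
    for site in range(1, n_sites):
        prev = fwd[-1]
        fwd.append(_normalise([
            emit(site, k) * ((1.0 - rho) * prev[k] + rho / n_ref)
            for k in range(n_ref)
        ]))

    # Backward pass.
    bwd = [[1.0] * n_ref for _ in range(n_sites)]
    for site in range(n_sites - 2, -1, -1):
        nxt = [emit(site + 1, k) * bwd[site + 1][k] for k in range(n_ref)]
        jump = rho * sum(nxt) / n_ref
        bwd[site] = _normalise([(1.0 - rho) * nxt[k] + jump for k in range(n_ref)])

    imputed = {}
    for site in range(n_sites):
        if target[site] is not None:
            continue
        post = _normalise([fwd[site][k] * bwd[site][k] for k in range(n_ref)])
        imputed[site] = sum(post[k] for k in range(n_ref) if ref_haps[k][site] == 1)
    return imputed


# Hypothetical panel of three reference haplotypes over six SNPs.
panel = [[0, 0, 0, 0, 0, 0],
         [1, 1, 1, 1, 1, 1],
         [0, 0, 0, 1, 1, 1]]
result = impute_missing([0, 0, None, 1, None, 1], panel)
print({site: round(p, 3) for site, p in result.items()})
```

The returned values play the role of the prediction probabilities mentioned above: a value near 0 or 1 indicates a confident call, whereas a value near 0.5 signals that the reference panel does not resolve the site.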
What level of accuracy can be expected in practice? Given the factors outlined above, it is hard to be certain, but for the microarrays and SNPs included in genealogy testing a reasonable estimate of the error rate may be ~1% or less. However, the commercial companies have databases of millions of customers and will therefore have much larger reference panels for imputation than are currently available to academic researchers. For illustrative purposes we performed a genotype imputation test on a SNP profile from one of the co-authors. The genotypes of the SNPs included in the current version of the microarray used by AncestryDNA were used to estimate the error rate as the number of observed SNPs available for imputation was reduced, thereby increasing the number of masked SNPs requiring imputation. Although no direct conclusions can be drawn from this single experiment, the imputation error rate was around 1% or less, even when the number of SNPs was reduced to fewer than 100,000 (Supplementary File S2, Fig. 5).
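The design of such a masking experiment can be sketched as follows: hide a subset of known genotypes, impute them, and score the imputed values against the hidden truth. This is an outline of the evaluation logic only and does not reproduce the analysis in Supplementary File S2. In particular, `modal_imputer` is a crude stand-in (it assigns the most common genotype in a reference panel and ignores the observed SNPs) where real imputation software would be called, and all names and data are hypothetical.

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Tuple

Profile = Dict[str, int]  # SNP identifier -> alternate-allele count (0, 1, 2)


def mask_profile(profile: Profile, n_keep: int,
                 rng: random.Random) -> Tuple[Profile, List[str]]:
    """Keep n_keep randomly chosen SNPs as observed and mask the rest."""
    if not 0 <= n_keep < len(profile):
        raise ValueError("n_keep must leave at least one SNP masked")
    kept = set(rng.sample(sorted(profile), n_keep))
    observed = {snp: g for snp, g in profile.items() if snp in kept}
    masked = [snp for snp in profile if snp not in kept]
    return observed, masked


def modal_imputer(panel: List[Profile]) -> Callable[[Profile, List[str]], Profile]:
    """Placeholder imputer: most common panel genotype at each masked SNP."""
    def impute(observed: Profile, masked: List[str]) -> Profile:
        return {snp: Counter(p[snp] for p in panel).most_common(1)[0][0]
                for snp in masked}
    return impute


def imputation_error(truth: Profile, n_keep: int,
                     impute: Callable[[Profile, List[str]], Profile],
                     seed: int = 0) -> float:
    """Fraction of masked genotypes that the imputer gets wrong."""
    observed, masked = mask_profile(truth, n_keep, random.Random(seed))
    imputed = impute(observed, masked)
    wrong = sum(1 for snp in masked if imputed[snp] != truth[snp])
    return wrong / len(masked)


# Hypothetical ten-SNP profile and a three-person reference panel.
snps = [f"rs{i}" for i in range(10)]
truth = {snp: 0 for snp in snps}
truth["rs3"] = 1
panel = [{snp: 0 for snp in snps} for _ in range(3)]

for n_keep in (8, 4, 0):
    rate = imputation_error(truth, n_keep, modal_imputer(panel))
    print(n_keep, "SNPs kept, error rate", round(rate, 3))
```

Repeating the loop over several random seeds would give a spread of error rates for each level of masking, rather than a single value.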
It has been noted that several of the companies use genotype imputation, at least to some degree, either to accept transfers from different companies or to ensure backwards compatibility with the OmniExpress microarray, though no published details are available. The MyHeritage imputation process has been described in a company blog post.
However, the extent of imputation amongst DTC and forensic service providers, and its specific application in genealogical analyses beyond MyHeritage are not known.
9. Concluding remarks: proportionality and the CODIS gap
While IGG is an exciting and powerful forensic genetic technique which has led to the successful resolution of many long-standing cold cases, its use has highlighted systemic problems in the US criminal justice system. It has become apparent that many cold cases could have been solved much earlier using existing DNA technologies. It is estimated that there are many thousands of profiles from convicted offenders which are legally mandated but have yet to be collected, a shortfall referred to here as the CODIS gap.
The CODIS gap is further exacerbated by the piecemeal use of familial searching in the US. It is currently confined to 12 of the 50 US states (Arizona, California, Colorado, Florida, Michigan, New York, Ohio, Texas, Utah, Virginia, Wisconsin, and Wyoming), is explicitly prohibited in Maryland and Washington DC, and is not permitted in the federal CODIS database.
Familial searches are restricted to people who are already in the CODIS database and, because they have been convicted of a crime, are considered to have forfeited some rights to privacy. Therefore, familial searching has far fewer privacy implications than IGG, which extends searches to both close and distant relatives who are in a genealogy DNA database, not all of whom have given specific consent for their profiles to be used. IGG can also involve networks of related family members who have not had their DNA tested but are approached for target testing.
It is therefore sobering to find that some cases where IGG was used could have been solved much earlier if familial searching had been implemented. For example, Patrick Leon Nicholas was identified as a suspect in the murder of Sarah Yarborough in Washington State, yet could have been caught through familial searching because his brother’s DNA was entered into CODIS in 2005.
Joseph DeAngelo, the Golden State Killer, had a brother who was convicted of a felony. DeAngelo could have been caught many years earlier if familial searching had been in use at the time and if the law had been in place to allow police to take DNA from arrestees. It is vital to ensure that the least privacy-invasive methods are used first and that IGG is used as a last resort, not to compensate for systemic failures.
We are indebted to Brett Williams (CEO), Cydne Holt and Nicola Oldroyd-Clark of Verogen for sharing information about the GEDmatch database and Verogen’s future plans. We thank Ellen Greytak of Parabon NanoLabs for the detail provided in responses to the questionnaire and for permission to use the data and graphs in Supplementary File S3. We also thank the other service provider representatives who gave helpful information in their answers to the questionnaire. We thank Rock Harmon for informative discussions. Finally, we thank the two reviewers for their helpful and constructive comments.