Volume 5, Issue 4, Pages 308-315, August 2011
Validation of DNA-based identification software by computation of pedigree likelihood ratios
Article Outline
- Abstract
- 1. Introduction
- 2. Bonaparte
- 3. Algebraic preliminaries
- 4. Test cases
- 5. Validation report
- Appendix A.
- References
- Copyright
Abstract
Disaster victim identification (DVI) can be aided by DNA-evidence, by comparing the DNA-profiles of unidentified individuals with those of surviving relatives. The DNA-evidence is used optimally when such a comparison is done by calculating the appropriate likelihood ratios. Though conceptually simple, the calculations can be quite involved, especially with large pedigrees and precise mutation models. In this article we describe a series of test cases designed to check whether software that calculates such likelihood ratios computes them correctly. The cases include both simple and more complicated pedigrees, including inbred ones. We show how to calculate the likelihood ratio numerically and algebraically, including a general mutation model and the possibility of allelic dropout. In Appendix A we show how to derive such algebraic expressions mathematically.
We have set up these cases to validate new software, called Bonaparte, which performs pedigree likelihood ratio calculations in a DVI context. Bonaparte has been developed by SNN Nijmegen (The Netherlands) for the Netherlands Forensic Institute (NFI). It is available free of charge for non-commercial purposes (see www.dnadvi.nl for details). Commercial licenses can also be obtained. The software uses Bayesian networks and the junction tree algorithm to perform its calculations.
Keywords: Disaster victim identification, Likelihood ratio, Kinship, Bayesian networks, Software validation
1. Introduction
In disaster situations, DNA is a useful tool in the identification process of the victims. After such a disaster has occurred, surviving relatives of those presumed to have perished may be typed, thus giving rise to pedigrees which each contain one or more missing persons. Given the DNA-profile of an unidentified individual (UI), a pedigree $\mathcal{P}$ and a missing person MP in $\mathcal{P}$, one wishes to calculate the probability that UI = MP. Without further information and with only DNA-profiles at our disposal, this is of course not possible; however, what we can do is compute the likelihood ratio

$$\mathrm{LR} = \frac{P(E \mid H_r)}{P(E \mid H_u)}, \tag{1.1}$$

where P(X | Y) denotes the conditional probability of the realization of X, given the realization of Y; Hr is the hypothesis that UI = MP; Hu is the hypothesis that UI is unrelated to the members of $\mathcal{P}$; and E is the evidence, consisting of the DNA-profile of UI, the pedigree $\mathcal{P}$ of which MP is a member and at least one DNA-profile in $\mathcal{P}$.
The LR is an interesting quantity since, given a priori odds P(Hr) / P(Hu), it allows one to obtain the a posteriori odds P(Hr | E) / P(Hu | E): these are equal to the LR times the a priori odds. In closed disaster victim identification (DVI) situations (where the list of deceased individuals is known), it is often possible to meaningfully define prior odds. In addition, in many cases Hr and Hu are the only scenarios with non-zero prior probability. This means that the DNA-evidence can be used to obtain the probability, given the DNA-profiles that are available, that UI = MP.
Usually, the DNA-profiles are those used in forensics, consisting of the genotype at a number of (nearly) independent loci. The likelihood ratio (1.1) is therefore computed as the product of the single-locus LRs.
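The two steps just described (multiplying single-locus LRs, then updating prior odds) can be sketched in a few lines of Python. This is only an illustration: the function names and all numbers are invented here and are not taken from Bonaparte, and the posterior probability is only meaningful when Hr and Hu are the only hypotheses with non-zero prior probability.

```python
from math import prod


def combined_lr(locus_lrs):
    """Multiply single-locus LRs; valid only if the loci are independent."""
    return prod(locus_lrs)


def posterior_probability(lr, prior_odds):
    """Return P(Hr | E), assuming Hr and Hu are the only hypotheses."""
    posterior_odds = lr * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


# Invented single-locus LRs for one (UI, MP) pair.
example_lrs = [2.0, 0.5, 10.0, 4.0]
example_lr = combined_lr(example_lrs)

# Illustrative prior: a prior probability of 1/100 corresponds to prior odds 1/99.
example_posterior = posterior_probability(example_lr, 1.0 / 99.0)
print(example_lr, example_posterior)
```

Note that a locus with LR below 1 (here 0.5) lowers the combined LR without excluding the match, which is why all loci are multiplied rather than inspected one at a time.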
In a DVI situation one wishes to compute the LRs (1.1) for every combination of UI and MP. This results, for a closed case with n victims, in n² calculations. Hence, one needs an efficient algorithm in order to get the results in reasonable time. At present, there seems to be no software that can do so in an automated way and also allows for arbitrary pedigrees, including inbred ones. To resolve this, the Netherlands Forensic Institute (NFI¹) has commissioned the development of new software. The resulting program, called Bonaparte, has been developed by the Dutch Foundation of Neural Networks in Nijmegen (SNN Nijmegen²). Bonaparte generates Bayesian networks from the pedigrees and then uses the junction tree algorithm (cf. [1]) to perform calculations therein. Its model is the Mendelian inheritance model with the possibility of mutation, where in addition allele frequencies are derived from reference databases in a way the user can determine. A live demonstration version can be accessed via http://www.dnadvi.nl. A more detailed presentation of the program is given in [2].
To our knowledge, there does not exist a generally accepted validation standard to verify whether or not the computation of (1.1) has been done correctly. Simple pedigrees are easy to test and in addition, one may check if likelihood ratios are identical for certain relationships (such as half-sibling or uncle–nephew without mutation model, or more generally the pedigrees described in [3]). But for those latter cases, one then checks equality (and only in the absence of mutation possibility) rather than correctness.
Recently, Drabek (cf. [4]) has published an overview of software that computes (1.1) and has assessed performance, user-friendliness and documentation for two of these (Familias and Paternity Index). The validation of (1.1) was done by comparing the software’s output to that of DNA-View,³ a commercial program with many functionalities that is broadly used.
In our opinion, this is a somewhat unsatisfactory situation, since if a discrepancy is observed, it may be unclear which output is correct, and moreover only features incorporated into DNA-View can be tested. Therefore, we have defined a set of test cases for which we have calculated the LR numerically, and in many cases, also algebraically. There are several advantages of having the algebraic expressions: it allows for the verification of LR formulas when returned by the software to be tested, their evaluation is computationally much less demanding than performing large summations, and the algebraic expression allows one to see which allele frequencies play which role in the likelihood ratio.
The purposes of this paper (in no particular order) are (i) to present the software Bonaparte and to describe its model, (ii) to contribute to the discussion about the validation of such software, and (iii) to show how likelihood ratios for some more complicated pedigrees can be calculated by hand.
The outline is as follows. In Section 2, we give an overview of Bonaparte’s model and user-defined settings. Section 3 is preliminary and introduces some notation that is used in Section 4, where we describe the test cases, indicating which feature each primarily tests and how we have computed the algebraic expression for the LR. In Section 5 we describe Bonaparte’s performance on these cases. In Appendix A we show how the algebraic expressions for the LR have been obtained.
2. Bonaparte
First, we describe some features that are specific to Bonaparte’s current model. It consists of the standard Mendelian inheritance model, enriched with the possibility of mutation and of allelic dropout in the profile. Subsequent versions are planned in which this model will be refined and enlarged. For example, the version currently under development allows for the association of Y-STR and mitochondrial DNA-profiles to individuals, and contains matching algorithms for these DNA-technologies. That version will also allow direct matching of unidentified individuals against each other, to detect possible family relationships between them. Different mutation models and a correction for subpopulations (θ-correction) are being considered for future versions. This article focuses on the validation of the currently implemented model, which deals with (1.1) for autosomal DNA.
2.1. Pedigrees, founders
A graphical interface allows the user to define pedigrees and associate individuals to them. The pedigrees may be inbred, but must be connected. We call a node a founder if it has no parental nodes in the pedigree. The alleles of the founders are called founder alleles.
2.2. Allele frequencies
Likelihood ratios are functions of the population frequencies of the alleles exhibited by the typed individuals. These allele frequencies are derived from a population sample S. Bonaparte has, for each locus L in S, a set AL of alleles known to exist at the locus for the relevant population. Based on AL and S, there are three types of alleles for locus L: common, rare and new.
When a new allele is encountered in a pedigree or UI, it will be registered in the allelic ladder for all computations involving that pedigree or UI.
When the computation of LRs is requested, the user must specify three positive numbers λc, λr, λn. Suppose that AL contains kc common allele types, and kr rare ones. Furthermore, suppose that $\mathcal{P}$ and UI have kn new allele types. Let $n_a$ be the number of times allele a occurs in S at locus L, and N the total number of alleles in S at that locus. For an allele a of type i ∈ {c, r, n}, let

$$p_a = \frac{n_a + \lambda_i}{N + k_c\lambda_c + k_r\lambda_r + k_n\lambda_n}. \tag{2.1}$$

The expression (2.1) can be interpreted as the expectation of the posterior marginal distribution of the allele frequency of allele a, where posterior is w.r.t. the database, and the prior distribution of the vector of allele frequencies is a Dirichlet one with parameters λi. This is as far as the interpretation goes, however, since the probability of drawing multiple alleles is simply taken to be the product of the appropriate (2.1). The idea of using a Dirichlet distribution to take sampling variation into account is well known and has many incarnations, e.g. as the size-bias correction (cf. [5]), or to deal with population substructure (cf. [6]). Curran et al. have investigated the effect of the prior for match probabilities in [7]. When data from several population groups are available, the λi can be empirically estimated using Newton’s method, as described e.g. in [8, §3.7].
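As a rough illustration of the Dirichlet interpretation above, the posterior mean of an allele frequency is the observed count plus the prior parameter of the allele's type, divided by the sample size plus the sum of all prior parameters. The sketch below assumes exactly that reading; the function name, the type labels `"c"`, `"r"`, `"n"` and the example numbers are hypothetical and not part of Bonaparte.

```python
def posterior_mean_frequencies(counts, allele_type, lam):
    """Dirichlet posterior-mean allele frequencies for one locus.

    counts:      allele -> number of times it was observed in the sample S
    allele_type: allele -> "c" (common), "r" (rare) or "n" (new)
    lam:         type   -> positive prior parameter (lambda_c, lambda_r, lambda_n)
    """
    sample_size = sum(counts.values())
    # The sum of prior parameters equals kc*lam_c + kr*lam_r + kn*lam_n.
    prior_mass = sum(lam[t] for t in allele_type.values())
    denominator = sample_size + prior_mass
    return {
        allele: (counts.get(allele, 0) + lam[t]) / denominator
        for allele, t in allele_type.items()
    }


# Invented example: two common alleles, one rare (unobserved) and one new allele.
example_counts = {"12": 30, "13": 70}
example_types = {"12": "c", "13": "c", "14": "r", "15.2": "n"}
example_lam = {"c": 1.0, "r": 0.5, "n": 0.25}
example_freqs = posterior_mean_frequencies(example_counts, example_types, example_lam)
print(example_freqs)
```

By construction the resulting frequencies sum to one, and alleles that were never observed in the sample still receive a small positive frequency.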
2.3. Mutation model
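When a parent passes an allele to an offspring, there is a possibility that it mutates. Let $M = (M_{ij})$ be the mutation matrix, where $M_{ij}$ denotes the probability that a parental allele i is passed on as allele j.
Bonaparte uses a gender-independent uniform mutation model. According to this model,

$$M_{ij} = \begin{cases} 1 - \mu & \text{if } i = j, \\ \mu/(k-1) & \text{if } i \neq j, \end{cases} \tag{2.2}$$

where k = kc + kr + kn is the number of allele types at the locus, including the new ones found in $\mathcal{P}$ and UI. The user may choose between different pre-set values of μ. The chosen value of μ is used for all loci and genders, except in the special case when μ = 0 and $\mathcal{P}$ contains a mutation (see Example 4.2).
This uniform mutation model, although not a realistic one, has the advantage that it does not seriously underestimate the probability of any specific mutation, which would hinder identification in case such a mutation has occurred. It is also computationally attractive.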
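A uniform mutation model is easy to write down as a matrix. The sketch below assumes the usual form, in which an allele is transmitted unchanged with probability 1 − μ and mutates to each of the other k − 1 allele types with equal probability μ/(k − 1); the function names are illustrative and not Bonaparte's.

```python
def uniform_mutation_matrix(k, mu):
    """Return a k x k matrix M with M[i][j] = P(parental allele i is passed on as j)."""
    if k < 2:
        raise ValueError("a uniform mutation model needs at least two allele types")
    off_diagonal = mu / (k - 1)
    return [[1.0 - mu if i == j else off_diagonal for j in range(k)] for i in range(k)]


def transmitted_frequencies(p, M):
    """Frequencies of alleles after one transmission from a random founder allele."""
    k = len(p)
    return [sum(p[i] * M[i][j] for i in range(k)) for j in range(k)]


example_M = uniform_mutation_matrix(4, 0.003)
example_p = [0.1, 0.2, 0.3, 0.4]
example_q = transmitted_frequencies(example_p, example_M)
print(example_q)
```

Every row of the matrix sums to one, and the transmitted frequencies differ slightly from the population frequencies whenever μ > 0 and the population frequencies are not all equal.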
2.4. F-allele
Suppose that it is known that only one allele on a locus has been typed, and that nothing is known concerning the second allele. The missing allele is then considered to be a random allele, and denoted by F. Hence, if the typed allele is x, the genotype is denoted (x, F).
3. Algebraic preliminaries
For many test cases, we will derive the algebraic form of the LR. In order to do so, we establish some notation in this section. We assume that we are working on one locus, which has k = kc + kr + kn different allele types. Further, we have chosen one UI and one MP, who belongs to pedigree $\mathcal{P}$.
Let GUI be the genotype of the unidentified individual, and let $G_{\mathcal{P}}$ be the genotypes observed in pedigree $\mathcal{P}$. Then $P(E \mid H_u) = P(G_{UI})\,P(G_{\mathcal{P}})$ and hence (1.1) can be computed as

$$\mathrm{LR} = \frac{P(G_{MP} = G_{UI} \mid G_{\mathcal{P}})}{P(G_{UI})}. \tag{3.1}$$

For the computation of (3.1), we consider a genotype to be an ordered pair of alleles (g1, g2), i.e., (a, b) ≠ (b, a) unless a = b. We do so since it is more convenient in computations and it does not alter the likelihood ratio (3.1), since the probability of obtaining (a, b) as the genotype of the MP is the same as that of obtaining (b, a). Hence, to evaluate (3.1) we note that

$$P(G_{UI} = (a, b)) = p_a p_b. \tag{3.2}$$

3.1. Allele transmissions
Let $p = (p_1, \ldots, p_k)^t$ be the (column-)vector of allele frequencies. Let

$$q = M^t p, \qquad F = \mathrm{diag}(p_1, \ldots, p_k), \qquad Q = M^t F M. \tag{3.3}$$

3.2. Inheritance functions
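Next, we define some functions that compute the probability of having an offspring of a given genotype, given the genotypes of one or of both parents. For the case with two parents, we define

$$I_2(a_1, a_2, b_1, b_2, c_1, c_2) = \frac{1}{8} \sum_{x=1}^{2} \left(M_{a_1, c_x} + M_{a_2, c_x}\right)\left(M_{b_1, c_{\bar{x}}} + M_{b_2, c_{\bar{x}}}\right), \tag{3.4}$$

where $\bar{x}$ denotes the complement of x in {1, 2}. Then I2(a1, a2, b1, b2, c1, c2) is the probability that parents with (ordered) genotypes (a1, a2) and (b1, b2) have a child with (ordered) genotype (c1, c2).
For the case where one parent has been typed and the other is a founder, we define

$$I_1(a_1, a_2, c_1, c_2) = \frac{1}{4} \sum_{x=1}^{2} \left(M_{a_1, c_x} + M_{a_2, c_x}\right) q_{c_{\bar{x}}}. \tag{3.5}$$

Notice that I1 and I2 generalize immediately to the situation where the mutation probabilities depend on the gender.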
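The inheritance functions can be sketched in Python under explicit assumptions: genotypes are ordered pairs of allele indices, each parent passes on either allele with probability 1/2, the transmitted allele mutates according to a matrix `M` with `M[i][j]` the probability that allele i is passed on as j, and both orderings of the child's genotype are equally likely. The names `I2`, `I1` and the helper functions are illustrative; the one-parent version assumes the untyped parent is a founder whose alleles are drawn with frequencies `p` and then transmitted with mutation.

```python
from itertools import product


def uniform_mutation_matrix(k, mu):
    """Uniform model (an assumption of this sketch): mutate to any other allele equally."""
    return [[1.0 - mu if i == j else mu / (k - 1) for j in range(k)] for i in range(k)]


def transmitted_frequencies(p, M):
    """Frequency of each allele after one transmission from a random founder allele."""
    return [sum(p[i] * M[i][j] for i in range(len(p))) for j in range(len(p))]


def I2(a, b, c, M):
    """P(child has ordered genotype c | parents have ordered genotypes a and b)."""
    total = 0.0
    for x in (0, 1):
        from_a = M[a[0]][c[x]] + M[a[1]][c[x]]          # parent a supplies c[x]
        from_b = M[b[0]][c[1 - x]] + M[b[1]][c[1 - x]]  # parent b supplies the other
        total += from_a * from_b
    return total / 8.0


def I1(a, c, M, q):
    """Same, but the second parent is an untyped founder (transmits allele j w.p. q[j])."""
    total = 0.0
    for x in (0, 1):
        total += (M[a[0]][c[x]] + M[a[1]][c[x]]) * q[c[1 - x]]
    return total / 4.0


example_p = [0.2, 0.3, 0.5]
example_M = uniform_mutation_matrix(3, 0.01)
example_q = transmitted_frequencies(example_p, example_M)
ordered_genotypes = list(product(range(3), repeat=2))

# Both functions define probability distributions over ordered child genotypes.
total_two_parents = sum(I2((0, 1), (1, 2), c, example_M) for c in ordered_genotypes)
total_one_parent = sum(I1((0, 1), c, example_M, example_q) for c in ordered_genotypes)
print(total_two_parents, total_one_parent)
```

A useful sanity check is that summing the two-parent function over all genotypes of the second parent, weighted by their population probabilities, reproduces the one-parent function.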
4. Test cases
In this section we describe the test cases that we have defined, and how the gold standard has been established. As mentioned above, we consider a one-locus LR and consider all genotypes to be ordered; this does not affect the LR. In the examples that illustrate the computations we will use the uniform mutation model with mutation probability μ, but the expressions themselves are valid for an arbitrary mutation matrix M.
4.1. Standard trios
In this case, we consider a pedigree consisting of father F, mother M and child C. Either a parent is missing and the child and possibly the other parent have been typed, or the child is missing and one or both parents have been typed. The LR for such cases can be computed using (3.2), (3.3), (3.4), (3.5).
4.1.1. Purpose
The main purpose of these cases is to verify that paternity indices are computed correctly, including the motherless case.
Example 4.1.
Let μ = 0 (no mutation), MP = F, and suppose that C and M are typed. We obtain the classical paternity case. If GUI = GC = GM = (aa), then LR = 1 / pa. This allows one to check if (2.1) has been calculated correctly.
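The value LR = 1 / pa in Example 4.1 can be checked numerically by brute force, summing over all ordered genotypes of the unknown father under Hu. The sketch below is an independent illustration, not Bonaparte's algorithm; the function names and the allele frequencies are invented, and the transmission model (each parent passes either allele with probability 1/2, ordered child genotypes) is an assumption of the sketch.

```python
from itertools import product


def identity_matrix(k):
    """Mutation matrix for mu = 0: every allele is passed on unchanged."""
    return [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]


def child_prob(father, mother, child, M):
    """P(ordered child genotype | ordered parental genotypes) with mutation matrix M."""
    total = 0.0
    for x in (0, 1):
        from_father = M[father[0]][child[x]] + M[father[1]][child[x]]
        from_mother = M[mother[0]][child[1 - x]] + M[mother[1]][child[1 - x]]
        total += from_father * from_mother
    return total / 8.0


def paternity_lr(g_ui, g_mother, g_child, p, M):
    """LR for 'UI is the missing father' versus 'UI is unrelated'."""
    numerator = child_prob(g_ui, g_mother, g_child, M)
    # Under Hu the true father is a random, untyped member of the population.
    denominator = sum(
        p[f1] * p[f2] * child_prob((f1, f2), g_mother, g_child, M)
        for f1, f2 in product(range(len(p)), repeat=2)
    )
    return numerator / denominator


example_p = [0.1, 0.3, 0.6]   # invented frequencies; allele a has index 0
example_lr = paternity_lr((0, 0), (0, 0), (0, 0), example_p, identity_matrix(3))
print(example_lr, 1.0 / example_p[0])
```

Passing a mutation matrix other than the identity gives the corresponding LR with mutation, at the cost of the same k² summation.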
Example 4.2.
Let μ = 0 (no mutation), MP = F, GM = (aa), GC = (bb), GUI = (bc). Then

We have also defined input for these standard trios where profiles contained F-alleles (see Section 2.4); notice that the allele is denoted F and the father F. In such cases, we can still use (3.4), (3.5) provided that we set
(4.1)
In general, for the uniform mutation model, (2.2), (3.4), (3.5), (4.1) yield

4.2. Incestuous case
In this case we consider the pedigree described in Table 1.
Table 1. Pedigree for test-case incest.
| Individual | Father | Mother | Typed |
|---|---|---|---|
| F | – | – | No |
| M | – | – | No |
| D | F | M | Yes |
| MP | F | D | No |
That is, a child MP is missing. Its father F is also its mother D’s father, and the mother D is the only typed relative.
4.2.1. Calculation
Under Hr,
(4.2)
As in the previous case, the only typed relative of the MP is a parent. The main purpose of this case is to check whether it is possible to define an incestuous pedigree, and whether the system is able to take this into account properly, including mutation.
4.3. Three generations
In this case, we consider a pedigree in which a child C and a parent F of the missing person MP have been typed. Suppose that F, UI, C have genotypes (ab), (cd), (ef) respectively.
4.3.1. Full (unpruned) pedigree
If we add the untyped parents of MP and C to the pedigree, then the LR equals
(4.3)
If we consider the pedigree to consist of F, MP, C only, then the alleles that MP and C did not inherit within the pedigree are founder alleles. Then the LR is equal to
(4.4)
This pedigree can be described in two seemingly equivalent ways, by choosing to add or not to add the untyped parents of MP and C. Depending on whether or not this is done, one obtains (a numerical specialization of) (4.3) or (4.4). It is also conceivable for the software to automatically ‘prune’ the pedigree by recursively removing untyped founders with at most one child. One can then use these formulas to test if pruning has been done correctly. The modification of these formulas to the case where only one untyped parent is included in the pedigree is straightforward.
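The pruning rule mentioned above (recursively removing untyped founders with at most one child) can be sketched as follows. The data layout is an assumption made for this illustration: a pedigree is a dictionary mapping each individual to a `(father, mother)` pair, with `None` for a parent that is not in the pedigree. The `protected` argument, which keeps the missing person in the pedigree even if untyped, is also an addition of this sketch.

```python
def prune(pedigree, typed, protected=()):
    """Recursively remove untyped founders that have at most one child."""
    ped = dict(pedigree)
    keep = set(typed) | set(protected)
    changed = True
    while changed:
        changed = False
        for person in list(ped):
            if person in keep or ped[person] != (None, None):
                continue  # typed, protected, or not a founder
            children = [c for c, parents in ped.items() if person in parents]
            if len(children) > 1:
                continue
            del ped[person]
            for child in children:
                father, mother = ped[child]
                ped[child] = (
                    None if father == person else father,
                    None if mother == person else mother,
                )
            changed = True
    return ped


# Three-generation case: F is typed, MP is missing, C is the typed child of MP.
# X and Y are the untyped second parents of MP and C.
example_pedigree = {
    "F": (None, None),
    "X": (None, None),
    "MP": ("F", "X"),
    "Y": (None, None),
    "C": ("MP", "Y"),
}
example_pruned = prune(example_pedigree, typed={"F", "C"}, protected={"MP"})
print(example_pruned)
```

For the three-generation pedigree this removes both untyped parents, so comparing the software's LR on the full and on the pruned input against the two formulas shows which form the software evaluates.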
4.4. Missing third sibling
Let S1, S2 be siblings, and let MP be a sibling of S1, S2. The genotypes of S1, S2, UI are, respectively, (ab), (cd), (ef).
4.4.1. Gold standard
The most straightforward way to compute the LR is by evaluating


Given two siblings, many identical-by-descent configurations of the observed alleles are possible: when mutation is allowed, eight configurations are possible (see Table A.1 in Appendix A). This test case determines if the software can handle this correctly.
4.5. Two typed same-sided aunts
In this test case, MP has two typed aunts S1, S2, both sisters of the same parent S3 of MP. Suppose that S1, S2 and UI have genotypes (ab), (cd) and (gh), respectively. The pedigree consists of the parents of the three siblings, the siblings, the other parent of MP and MP.
The LR can be computed by brute force. The numerator, P(E | Hr), is given by


In addition to a purpose similar to the one for the previous case, one can use this case to see if pruning of the pedigree has been done correctly (there is one untyped founder that can be removed from the pedigree: the parent of MP that is unrelated to the typed siblings). Also, by comparing the computation time of the answer to that of the algebraic LR below, one can assess the efficiency of the algorithm in this case.
4.6. Complicated inbred pedigree
The pedigree for this case is given in the picture below. In short, there is a marriage between persons 7 and MP, who share one grand-parent. The barred individuals in the pedigree are those who do not have a DNA-profile.
In this case, it is too difficult and no longer very informative to write down the generic LR algebraically. Instead we have computed the gold standard by calculating P(E | Hr) and P(E | Hu) separately, summing over all possible allelic configurations of the untyped individuals. There are 9, resp. 10 untyped individuals in this pedigree under Hr, resp. Hu. A straightforward attempt to calculate P(E | Hr) and P(E | Hu), using ordered genotypes, would result in a summation involving k^18, resp. k^20 terms (k being the number of alleles on the locus under consideration), which cannot be done on a standard computer in reasonable time. Therefore we have done some preliminary calculations to reduce the number of variables that need to be summed over: we use value abstraction (cf. [9]), switch to unordered genotype computation, and we remove individuals from the pedigree by performing the necessary summations.
If the typed relatives and the UI have t different alleles on the locus we consider, then we can replace the original set of k alleles by a new set of t + 1 alleles, consisting of all typed alleles and an auxiliary allele X, whose frequency is the sum of the frequencies of all the unseen alleles. We denote unordered genotypes by {a, b}, i.e., {a, b} = {b, a} is the unordered genotype corresponding to genotypes (a, b) and (b, a). The number of unordered genotypes for the reduced allele set is (t + 1)(t + 2) / 2. For genotype i = {i1, i2} we let

To do so, we define the function

Furthermore, we define

is the probability that parents with genotypes {a1, a2} and {b1, b2} have a child with genotype {c1, c2}.
Then P(E | Hr) is given by
(4.5)
The expression for P(E | Hu) is similar: we need to replace gtui by gtmp and sum over that genotype as well. This yields
(4.6)
The pedigree in this case is inbred, which presents a computational obstacle to some algorithms, e.g. the Elston–Stewart algorithm. In addition, there are many untyped relatives, including all founders. This case therefore serves well to check not only correctness of the computed likelihood ratio, but also performance.
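The allele-lumping step used in this section (value abstraction) can be illustrated with a short sketch: all alleles not seen in the typed relatives or the UI are merged into one auxiliary allele, and the unordered genotypes over the reduced set are enumerated. The function names, the label `"X"` and the frequencies are hypothetical; the sketch covers only the frequencies and does not show how a mutation model would be carried over to the reduced allele set.

```python
from itertools import combinations_with_replacement


def lump_unseen_alleles(freqs, observed, other="X"):
    """Keep the observed alleles and merge all others into one auxiliary allele."""
    observed = set(observed)
    if other in freqs:
        raise ValueError("auxiliary allele label clashes with an existing allele")
    reduced = {a: f for a, f in freqs.items() if a in observed}
    reduced[other] = sum(f for a, f in freqs.items() if a not in observed)
    return reduced


def unordered_genotypes(alleles):
    """All unordered genotypes {a, b} over the given alleles, homozygotes included."""
    return list(combinations_with_replacement(sorted(alleles), 2))


# Invented locus with k = 6 alleles, of which t = 3 are seen in the typed profiles.
example_freqs = {"10": 0.05, "11": 0.20, "12": 0.30, "13": 0.25, "14": 0.15, "15": 0.05}
example_reduced = lump_unseen_alleles(example_freqs, observed={"11", "12", "13"})
example_genotypes = unordered_genotypes(example_reduced)
print(example_reduced, len(example_genotypes))
```

With n alleles there are n(n + 1)/2 unordered genotypes, so reducing 6 alleles to 4 cuts the number of genotype states per untyped individual from 21 to 10; with 9 or 10 untyped individuals the saving compounds quickly.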
4.7. Algebraically
The generic LR for this pedigree is too complicated to be of any use, even with μ = 0. However, for specific choices of allelic configurations one can use (4.5), (4.6) to obtain the algebraic expressions for P(E | Hr), P(E | Hu) and LR. A few examples (with μ = 0) are listed in Table 2.
Table 2. Allele configurations.
| Individual | LR1 | LR2 | LR3 |
|---|---|---|---|
| FM4 | (3,3) | (2,6) | (4,5) |
| FM5 | (1,3) | (2,5) | (1,5) |
| UI | (3,3) | (3,4) | (2,4) |
| FM9 | (2,3) | (3,3) | (4,4) |
| FM10 | (3,4) | (1,4) | (2,3) |
The corresponding LRs are:
(4.7)
(4.8)
(4.9)
5. Validation report
Based on the scenarios described above, we have defined test cases for Bonaparte by choosing specific DNA-profiles for the pedigrees mentioned in the previous section. The gold standard LR has been computed with the software Mathematica 6.0 up to machine precision (about 16 significant digits). No discrepancies at all were observed with Bonaparte’s output. We used a standard desktop computer throughout (Windows XP, Intel Core Duo, 2.33 GHz, 2 GB RAM).
In addition, we have performed some of the same computations with the freely available program Familias (cf. [10]). This program cannot automatically compute all LRs between a list of MPs and a list of UIs, but it has many features incorporated into it, such as various mutation models and subpopulation correction. We have only tested the uniform mutation model for a few choices of profiles per test case. Familias performed well on these cases (the reported LR equals the LR given by our formulas up to at least the first seven decimal places) but was sometimes slower than Bonaparte, especially for test case 4.6.
Appendix A.
In this section we derive the algebraic form of the likelihood ratios for some of the test cases.
Terminology. We say that two alleles are identical-by-descent (ibd) if they are descendants of the same ancestral allele. This terminology is somewhat abusive, since the alleles need not be identical in state (due to mutation). We write a ≡ b to denote that alleles a and b are ibd and a ≢ b to denote that they are not.
A.1. Incest
Proof of (4.2): If UI = MP then with probability 1/2, there is one pair of ibd alleles between GUI and GM and with probability 1/2, there are two. If there is one pair, then the situation is genetically non-incestuous so we get (3.5), which corresponds to the term of (4.2) containing the
. If there are two ibd-pairs, then there are two equally likely possibilities: (i) both u1 and u2 are descendants of the same allele f of F (of which either m1 or m2 is also a descendant), so u1 and u2 are ibd with the same maternal allele, say m1; (ii) the alleles u1, resp. u2 are ibd with m1, resp. m2 or the other way around. In both cases we can use that if allele y is a descendant of allele x, then P(y = i | x = j) = Mi,ifi / qj. This gives the following probabilities: in case (i) suppose (without loss of generality) that u1 is inherited from paternal allele f and u2 from m1. Then
. Summing over all the possibilities gives the four middle terms in (4.2). In case (ii), suppose that u1 is ibd with m1 and u2 with m2. One of these ibd relations is through F, say this is the case for (u1, m1). Then
. Summing over all possible configurations gives the final four terms in (4.2).
A.2. Three generations
Proof of (4.3), (4.4): The numerator is the expansion of I1(a, b, c, d) · I1(c, d, e, f), which is the probability that GUI = (cd) and GC = (ef) given that GF = (ab) and Hr. In the denominator, the term pcpd corresponds to the probability of observing GUI under Hu, and the remaining term is the probability of observing the genotype GC = (ef), given GF = (ab), under Hu.
A.3. Three siblings
We start with the probability Ps((ab), (cd)) of having two siblings with genotypes (a, b), (c, d). This probability can be computed from the a priori probabilities of all possible ibd configurations. These are given in Table A.1. We define an indicator variable I specifying the ibd-pairs, which we will use in the sequel.
Table A.1. Prior ibd probabilities for siblings with genotypes (ab), (cd).
| I | ibd-pairs | Probability |
|---|---|---|
| 1 | a ≡ c, b ≢ d | 1/8 |
| 2 | a ≡ d, b ≢ c | 1/8 |
| 3 | a ≢ d, b ≡ c | 1/8 |
| 4 | a ≢ c, b ≡ d | 1/8 |
| 5 | a ≡ c, b ≡ d | 1/8 |
| 6 | a ≡ d, b ≡ c | 1/8 |
| 7 | None | 1/4 |
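The prior ibd probabilities for two full siblings with ordered genotypes (ab) and (cd) can be verified by enumerating all Mendelian transmissions. The sketch below assumes unrelated parents, that each child receives one of the two paternal and one of the two maternal alleles with equal probability, and that the order of the two alleles in an ordered genotype is random; all names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def ordered_genotype(paternal, maternal, flip):
    """Ordered genotype built from the indices of the parental alleles passed on."""
    alleles = (("father", paternal), ("mother", maternal))
    return alleles[::-1] if flip else alleles


def sibling_ibd_distribution():
    """Exact probability of each set of ibd pairs between siblings (ab) and (cd)."""
    choices = list(product((0, 1), (0, 1), (False, True)))
    weight = Fraction(1, len(choices) ** 2)
    distribution = Counter()
    for first, second in product(choices, repeat=2):
        g1 = ordered_genotype(*first)    # positions named a, b
        g2 = ordered_genotype(*second)   # positions named c, d
        pairs = frozenset(
            (name1, name2)
            for name1, allele1 in zip("ab", g1)
            for name2, allele2 in zip("cd", g2)
            if allele1 == allele2        # copies of the same parental allele
        )
        distribution[pairs] += weight
    return distribution


example_distribution = sibling_ibd_distribution()
for pairs, probability in sorted(example_distribution.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(pairs), probability)
```

The enumeration yields seven distinct configurations: four with a single ibd pair, two with two ibd pairs, and the configuration without any ibd pair.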
It is then easy to see that, with Q as in (3.3),
(A.1)
(A.2)
Let
. We will calculate the Pi(a, b, c, d, e, f).
We denote

With this notation the first term is equal to


(A.3)
(A.4)
or
, where
is the parental allele of which x is a (possibly mutated) copy. Both of these parental configurations have equal probability. For the first configuration, we get probability

can be computed. The denominator is straightforward (given by (3.2)), hence we can now compute the LR.
Example A.1.
Let μ = 0 and suppose that S1, S2 and UI have genotypes (xy), (xz) and (yz), respectively. Then we can apply the above with a = x, b = y, c = x, d = z, e = y, f = z. This yields



Example A.2.
The above formulas apply directly in the case where the profiles contain F-alleles (i.e., in case of allele dropout). One needs to substitute (cf. (4.1))



and μ
A.4. Two aunts
We will encounter the situation where an unobserved founder allele has been passed on to an offspring as allele a and to a grandchild (through another offspring) as allele b. The probability of this happening can be summarized in matrix form: we define the matrix $R = M^t F M^2$, which has entries


Analogous to (A.2), we have

Analogous to the analysis in A.3, we define




Remark A.1.
The LR for the pruned pedigree, where MP’s second parent is left out of the pedigree, is obtained by replacing qg, qh with pg, ph in the Pi(a, b, c, d, g, h).
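The quantities q, Q and R used in this appendix can be computed numerically with a few matrix products. The sketch below assumes that `M[i][j]` is the probability that allele i is passed on as allele j, that F is the diagonal matrix of allele frequencies, and that q = Mᵗp and Q = MᵗFM; only R = MᵗFM² is stated explicitly in the text, so the other two definitions should be read as assumptions. The helper names are illustrative.

```python
def transpose(A):
    return [list(row) for row in zip(*A)]


def matmul(A, B):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*B)] for row in A]


def transmission_matrices(p, M):
    """Return (q, Q, R) for allele frequencies p and mutation matrix M."""
    k = len(p)
    F = [[p[i] if i == j else 0.0 for j in range(k)] for i in range(k)]
    MtF = matmul(transpose(M), F)
    q = [sum(p[i] * M[i][j] for i in range(k)) for j in range(k)]
    Q = matmul(MtF, M)             # founder allele seen as a and b in two children
    R = matmul(MtF, matmul(M, M))  # seen as a in a child and as b in a grandchild
    return q, Q, R


# Invented example: three alleles and a uniform mutation matrix with mu = 0.02.
example_p = [0.2, 0.3, 0.5]
example_M = [
    [0.98, 0.01, 0.01],
    [0.01, 0.98, 0.01],
    [0.01, 0.01, 0.98],
]
example_q, example_Q, example_R = transmission_matrices(example_p, example_M)
print(example_q)
```

Summing a row of Q or R over the second index gives back the corresponding entry of q, and without mutation q reduces to p, which mirrors the substitution described in Remark A.1.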
Example A.3.
Let μ = 0 and take S1, S2 and UI to have genotypes (xy), (xy) and (xz), respectively. Then we can compute the LR by substituting a = x, b = y, c = x, d = y, g = x, h = z into the above formulas. This yields






References
- [1] Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B (Methodological). 1988;50(2):157–224.
- [2] Bayesian networks for victim identification on the basis of DNA profiles. Forensic Science International: Genetics Supplement Series. 2009;2:466–468.
- [3] Identification of distant family relationships. Bioinformatics. 2009;25:2376–2382.
- [4] Validation of software for calculating the likelihood ratio for parentage and kinship. Forensic Science International: Genetics. 2009;3(2):112–118.
- [5] Estimating products in forensic identification. Journal of the American Statistical Association. 1995;90(431):839–844.
- [6] DNA profile match probability calculation: how to allow for population stratification, relatedness, database selection and single bands. Forensic Science International. 1994;64:125–140.
- [7] The sensitivity of the Bayesian HPD method to the choice of prior. Science & Justice. 2006;46(3):169–178.
- [8] Statistical Methods for Genetic Analysis. 2nd ed. New York: Springer-Verlag; 2002.
- [9] Likelihood computations using value abstraction. In: Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence. 2000. p. 192–200.
- [10] Beyond traditional paternity and identification cases. Selecting the most probable pedigree. Forensic Science International. 2000;110(1):47–59.
- ¹ http://english.forensischinstituut.nl.
- ² http://www.snn.ru.nl.
- ³ http://dna-view.com.
- ⁴ Note that the right hand side is symmetric in a1, a2 and in b1, b2, hence defines a function on unordered genotypes. The analogous remark applies to the definition of OPC and PC as well.
PII: S1872-4973(10)00109-2
doi:10.1016/j.fsigen.2010.06.005
© 2010 Elsevier Ireland Ltd. All rights reserved.