Volume 5, Issue 4, Pages 281-284, August 2011
The predictive value of the maximum likelihood estimator of the number of contributors to a DNA mixture
Article Outline
- Abstract
- 1. Introduction
- 2. Methods
- 3. Results and discussion
- 4. Conclusion
- Acknowledgments
- References
- Copyright
Abstract
We propose to quantify the accuracy of a likelihood-based estimator that was recently proposed for the determination of the number of contributors to a DNA mixture, when genetic data alone is considered [H. Haned, L. Pène, J.R. Lobry, A.B. Dufour, D. Pontier, Estimating the number of contributors to forensic DNA mixtures: does maximum likelihood perform better than maximum allele count? J. Forensic Sci., in press]. Using Bayes’ theorem, we derive a formula for the calculation of the predictive value (PV) of the likelihood-based estimator. The PV gives the probability that a DNA stain contains the DNAs of i people given that the maximum likelihood estimator gave an estimate of i contributors for this stain. We illustrate the PV calculations for two different types of DNA evidence: traces and body fluids.
The PV varied according to the number of contributors involved in the DNA stain. Setting the maximum number of possible contributors to five, the lowest predictive values were scored for five-person mixtures with a minimum value of 0.26 for traces, but values were always above 0.94 for stains comprising one, two or three contributors, for both traces and body fluids. Values remained relatively high for four-person mixtures with a minimum value of 0.69. These findings confirm that likelihood-maximization is a powerful approach for the determination of the number of contributors to forensic DNA mixtures.
Keywords: DNA mixtures, Likelihood estimator, Traces, Body fluids, Predictive value, Bayes’ theorem
1. Introduction
As the sensitivity of typing methods is constantly increasing, forensic experts deal with more and more complex cases of evidence containing the DNA of several individuals. Though numerous statistical methods exist to calculate the strength of DNA evidence, the most challenging step in the interpretation of such mixed stains is still the determination of the number of contributors involved [1]. Usually, the circumstances of the investigated crime, combined with genetic and non-genetic evidence, provide good grounds for determining this number. But the task is seriously complicated when scarce data are available about the origin of the stain. This is common in DNA casework, where often no suspect or known contributor is available. A common laboratory practice consists in bounding the number of contributors by the minimum required to explain the observed DNA profiles, using no information other than the number of alleles per locus [2]. Recently, an alternative approach based on the maximum likelihood principle was proposed to overcome this issue [3]. Using qualitative information on which alleles are present in the mixture, this maximum likelihood estimator searches for the number of contributors that maximizes the likelihood of the observed DNA profiles. Using computer-simulated DNA mixtures, the authors of that study showed that maximizing the likelihood of the data to find the most likely number of contributors gives more accurate estimates than using a lower bound when dealing with mixtures of more than three contributors. However, before this estimator is used in practical cases, a method is needed to quantify the level of confidence that can be placed in its results.
In this paper, we propose to globally quantify the accuracy of the maximum likelihood estimator. Relying on Bayes’ theorem, we derive a formula for the calculation of the predictive value (PV) of the estimator. The PV aims to give a global appreciation of the confidence that can be placed in the estimates while taking into account prior information about the occurrence of mixed DNA stains in forensic casework. We explain the method and illustrate its potential use in forensic studies.
2. Methods
2.1. Theoretical background
The maximum likelihood estimator takes into account genetic data, namely the frequencies of the alleles present at each locus characterizing the analyzed DNA stain, and searches for the number of contributors that maximizes the likelihood of the observed profiles [3]. We define the predictive value of this estimator as the probability of having i contributor(s) to the tested DNA stain, knowing that the likelihood estimator gave an estimate of i contributor(s) for this stain. The PV is data-independent, which means that the observed data, namely the DNA profiles in the stain, are not involved in the calculations. The PV can thus be regarded as a precision rate of the estimator, specific to each mixture type.
2.2. Formulation of the predictive value of the likelihood estimator
Denoting by $x$ the true number of contributors to the mixture and by $\hat{x}$ its estimate, the predictive value of the estimator can be written as the conditional probability $P(x = i \mid \hat{x} = i)$. A simple way to estimate this unknown probability is to rewrite it using its inverse, $P(\hat{x} = i \mid x = i)$. The transformation is done using Bayes’ formula:
$$P(x = i \mid \hat{x} = i) = \frac{P(\hat{x} = i \mid x = i)\, P(x = i)}{P(\hat{x} = i)} \qquad (1)$$
$P(\hat{x} = i \mid x = i)$ is the probability that the estimator classifies the considered stain as a mixture of i contributor(s), given that there are actually i contributor(s). Haned et al. [3] used a simulation procedure to estimate these conditional probabilities: a thousand mixtures comprising two to five contributors were simulated by combining alleles at random, with respect to their allele frequencies. The efficiency of the estimator was estimated as the proportion of correctly identified mixtures. Here, we follow a similar procedure: we simulated 1000 DNA stains containing one to five individuals, using the US African American allele frequencies published in [4]. The conditional probabilities of success of the estimator were then estimated for each simulated number of contributors. Hereafter, we will refer to the probability $P(x = i)$ as the prior probability of encountering a mixture of i contributors.
$P(\hat{x} = i)$ is the probability of the estimator giving i as an estimate for the number of contributors to the stain, regardless of the mixture type concerned. Using the law of total probability, we rewrite $P(\hat{x} = i)$ as a sum of products of conditional and prior probabilities:
$$P(\hat{x} = i) = \sum_{k=1}^{K} P(\hat{x} = i \mid x = k)\, P(x = k) \qquad (2)$$
$P(\hat{x} = i \mid x = k)$ is the probability that the estimator classifies the considered stain as a mixture of i contributor(s) knowing that there are actually k contributor(s), where k can be equal to or different from i. Values of k range from 1 to K, where K is a biologically meaningful threshold for the number of contributors. For illustrative purposes, we set K to 5 and search for the maximum likelihood estimates in the discrete interval [1, 6]. As we discuss later, this threshold can be extended.
2.3. Constructing the prior distribution of mixed DNA stains
Thanks to Eq. (2), the only term left to determine is the prior probability $P(x = k)$. In order to construct this prior distribution we used a survey of the crime scene profiles analyzed at the Institut National de Police Scientifique (INPS), the national forensic laboratory in Lyon, France (data communicated by Laurent Pène). For the year 2008, 8479 crime scene profiles were analyzed at the INPS using the Applied Biosystems AmpFlSTR® Identifiler™ kit [5]. These samples were classified either as traces when they came from contact traces, for instance epithelial cells on a given object or tool, or as body fluids when they came from biological fluids, namely blood, saliva and semen. The number of individuals involved in the stain was also indicated. Samples comprising one contributor were classified as “single-source” stains, samples comprising two contributors were classified as “resolvable mixtures” and stains comprising more than two contributors were classified as “unresolvable mixtures”. This restricted classification is explained by the difficulty of determining the real number of individuals involved [6].
Two-person mixtures are believed to account for the majority of mixtures encountered in casework [7]. Three-, four- and five-person mixtures are believed to be rarer. But, as a consequence of the restricted classification, very scarce data is available in the literature about the occurrence of these complex mixtures in forensic casework. The construction of a prior distribution of mixtures occurrences in forensic casework was thus necessary for mixtures comprising more than two contributors.
The prior probabilities for stains comprising one or two contributors were set using the available data (survey of the INPS casework for year 2008). We chose to set the remaining probabilities for mixtures comprising more than two contributors using experts’ prior beliefs. We asked three experienced forensic experts at the INPS to set the proportions of mixed stains comprising three, four or five contributors. We focused on two key issues in setting up this prior distribution:
These requirements are meant to help the forensic experts to set the prior distribution but they are not compulsory to the method, and they can of course be modified or dropped.
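The simulation step of Section 2.2 can be sketched in Python as follows. This is only an illustration under stated assumptions: the three loci and their allele frequencies are hypothetical, alleles are drawn independently with no drop-out or population substructure, and the per-locus likelihood is written with an inclusion-exclusion formula for the probability of seeing exactly a given set of alleles among 2n draws. The likelihood used in [3] and in forensim may differ, and all function names are made up for this sketch.

```python
import itertools
import random
from collections import Counter

# Hypothetical allele frequencies for three loci (illustration only; the
# study itself used published US African American frequencies [4]).
LOCI = [
    {"a": 0.05, "b": 0.10, "c": 0.15, "d": 0.20, "e": 0.25, "f": 0.25},
    {"a": 0.10, "b": 0.10, "c": 0.10, "d": 0.20, "e": 0.20, "f": 0.30},
    {"a": 0.05, "b": 0.05, "c": 0.15, "d": 0.15, "e": 0.30, "f": 0.30},
]


def set_probability(freqs, observed, n):
    """Probability that 2n independent allele draws show exactly `observed`."""
    observed = sorted(observed)
    if not observed or len(observed) > 2 * n:
        return 0.0  # n contributors cannot carry more than 2n distinct alleles
    total = 0.0
    for size in range(1, len(observed) + 1):
        sign = (-1) ** (len(observed) - size)  # inclusion-exclusion sign
        for subset in itertools.combinations(observed, size):
            total += sign * sum(freqs[a] for a in subset) ** (2 * n)
    return total


def ml_estimate(loci, profile, max_n=6):
    """Number of contributors in 1..max_n with the highest likelihood."""
    best_n, best_lik = None, -1.0
    for n in range(1, max_n + 1):
        lik = 1.0
        for freqs, observed in zip(loci, profile):
            lik *= set_probability(freqs, observed, n)
        if lik > best_lik:  # ties keep the smaller n
            best_n, best_lik = n, lik
    return best_n


def simulate_mixture(loci, n, rng):
    """Allele sets of a mixture of n people, drawn from the frequencies."""
    profile = []
    for freqs in loci:
        alleles = list(freqs)
        drawn = rng.choices(alleles, weights=[freqs[a] for a in alleles], k=2 * n)
        profile.append(frozenset(drawn))
    return profile


def conditional_probabilities(loci, max_true=5, max_n=6, n_sim=200, seed=1):
    """Estimate P(estimate = i | true = k) as simulated proportions."""
    rng = random.Random(seed)
    table = {}
    for k in range(1, max_true + 1):
        counts = Counter(
            ml_estimate(loci, simulate_mixture(loci, k, rng), max_n)
            for _ in range(n_sim)
        )
        table[k] = [counts[i] / n_sim for i in range(1, max_n + 1)]
    return table


for k, row in conditional_probabilities(LOCI).items():
    print(k, row)
```

Each printed row plays the role of one row of conditional probabilities for a given true number of contributors. With only three made-up loci the proportions are not expected to match those reported in the study.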
3. Results and discussion
3.1. Crime scene profiles survey
Among the 8479 casework stains, 5169 were body fluids and 3310 were traces. The majority of stains, 71%, comprised one contributor and were classified as “single-source” stains. The remaining 29% consisted of resolvable mixtures classified as two-person mixtures (6% of all stains) and unresolvable mixtures (23% of all stains). There were more mixed DNA stains among traces than among body fluids (Table 1). This finding agrees with our expectations and can be explained by the fact that, in body fluids, the major contributor drowns out the signal of the other contributors to the mixture, whereas in traces, the low quantities of DNA contributed by each individual make it difficult to detect a single-source DNA contributor.
Table 1. Percentages of crime scene profiles comprising one, two or more than two individuals.
| | One individual | Two individuals | More than two individuals |
|---|---|---|---|
| Traces | 45% | 4% | 51% |
| Body fluids | 87% | 7% | 6% |
3.2. Predictive value of the likelihood estimator
The conditional probabilities of success were estimated from simulated data (Table 2). We obtained similar results to those of Haned et al. [3]. Different prior values were chosen for traces and body fluids (Table 3).
Table 2. Estimates of the conditional probabilities $P(\hat{x} = i \mid x = k)$. Each row corresponds to a true number of contributors $x = k$ and each column to an estimate $\hat{x} = i$. For example, the probability of having an estimate of 5, knowing that there are actually 4 people in the DNA stain, is 0.127.
| | $\hat{x} = 1$ | $\hat{x} = 2$ | $\hat{x} = 3$ | $\hat{x} = 4$ | $\hat{x} = 5$ | $\hat{x} = 6$ |
|---|---|---|---|---|---|---|
| $x = 1$ | 1 | 0 | 0 | 0 | 0 | 0 |
| $x = 2$ | 0 | 0.998 | 0.002 | 0 | 0 | 0 |
| $x = 3$ | 0 | 0.005 | 0.937 | 0.058 | 0 | 0 |
| $x = 4$ | 0 | 0 | 0.067 | 0.805 | 0.127 | 0.001 |
| $x = 5$ | 0 | 0 | 0 | 0.131 | 0.662 | 0.207 |
Table 3. Prior distribution probabilities, for traces and body fluids, set by three forensic DNA experts: Expert 1, Expert 2 and Expert 3. Values for $P(x = 1)$ and $P(x = 2)$ were set using the data survey shown in Table 1. Values for $P(x = 3)$, $P(x = 4)$ and $P(x = 5)$ were given by the interviewed forensic experts.
| | $P(x = 1)$ | $P(x = 2)$ | $P(x = 3)$ | $P(x = 4)$ | $P(x = 5)$ |
|---|---|---|---|---|---|
| Expert 1 | | | | | |
| Traces | 0.45 | 0.04 | 0.30 | 0.15 | 0.06 |
| Body fluids | 0.87 | 0.07 | 0.04 | 0.01 | 0.01 |
| Expert 2 | | | | | |
| Traces | 0.45 | 0.04 | 0.35 | 0.15 | 0.01 |
| Body fluids | 0.87 | 0.07 | 0.05 | 0.01 | 0 |
| Expert 3 | | | | | |
| Traces | 0.45 | 0.04 | 0.25 | 0.20 | 0.06 |
| Body fluids | 0.87 | 0.07 | 0.05 | 0.01 | 0 |
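A quick check on the tabulated values (an observation on the rounded entries, not a constraint stated in the text) suggests that the three expert-set probabilities in each row add up to the share of stains with more than two contributors in Table 1. For Expert 1, for example:

$$\text{Traces: } 0.30 + 0.15 + 0.06 = 0.51, \qquad \text{Body fluids: } 0.04 + 0.01 + 0.01 = 0.06$$

The same holds for Experts 2 and 3, so every row of Table 3 sums to one.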
The predictive values varied according to the prior probabilities used. Where non-null priors were used, the predictive values were relatively high for both traces and body fluids, ranging from 0.69 to 1 for stains containing one, two, three or four contributors. The lowest values were scored for five-person mixtures (0.26 for traces). When similar priors were used, the PVs differed only slightly; in this case, distinguishing between the types of DNA stain under analysis does not appear to be necessary.
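As a worked illustration of how the prior drives the lowest value, take the traces prior of Expert 2 in Table 3 together with the Table 2 entries for an estimate of five contributors. Writing $\hat{x}$ for the estimate and $x$ for the true number of contributors, only $x = 4$ and $x = 5$ contribute to the denominator of Eq. (1), because the other entries are zero:

$$P(x = 5 \mid \hat{x} = 5) = \frac{0.662 \times 0.01}{0.127 \times 0.15 + 0.662 \times 0.01} = \frac{0.00662}{0.02567} \approx 0.26$$

The estimator is correct for about 66% of the simulated five-person mixtures, but under this prior four-person traces are fifteen times more frequent than five-person traces, so most estimates of five come from misclassified four-person mixtures.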
The priors used in this study are not arbitrary as they are defined by experts’ prior belief. The use of such priors in likelihood ratios is controversial as discussed in Buckleton et al. [9], but in this study, the focus is on methods evaluation and these priors are not related to the prior knowledge about the number of contributors before the DNA evidence is analyzed.
We set the threshold for the number of contributors to five (Table 3, Table 4), which led to searching for the maximum likelihood estimates in the discrete interval [1, 6]. We believe that this is a biologically meaningful threshold for searching for the most plausible number of contributors. However, this threshold can be extended, depending on the crime scene context and the type of evidence being analyzed. For instance, traces are likely to contain more contributors than stains from body fluids. Once the prior distributions of the mixed stains are set, the results are straightforward to obtain.
Table 4. Predictive values of the maximum likelihood estimator according to the prior distributions defined by Experts 1–3 and shown in Table 3. Predictive values are given for traces and body fluids, according to the number of individuals contributing to the stain ($i$).
| | $i = 1$ | $i = 2$ | $i = 3$ | $i = 4$ | $i = 5$ |
|---|---|---|---|---|---|
| Expert 1 | | | | | |
| Traces | 1 | 0.96 | 0.96 | 0.83 | 0.67 |
| Body fluids | 1 | 0.99 | 0.98 | 0.69 | 0.84 |
| Expert 2 | | | | | |
| Traces | 1 | 0.96 | 0.97 | 0.85 | 0.26 |
| Body fluids | 1 | 0.99 | 0.98 | 0.73 | 0 |
| Expert 3 | | | | | |
| Traces | 1 | 0.96 | 0.94 | 0.88 | 0.61 |
| Body fluids | 1 | 0.99 | 0.98 | 0.73 | 0 |
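The predictive values in Table 4 can be recomputed from Tables 2 and 3 with Eqs. (1) and (2). The following is a minimal Python sketch, assuming the rounded table entries as inputs. The names `COND`, `PRIORS` and `predictive_values` are illustrative and this is not the forensim implementation.

```python
# Table 2: rows are the true number of contributors k = 1..5,
# columns are the estimates i = 1..6.
COND = [
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.998, 0.002, 0.0, 0.0, 0.0],
    [0.0, 0.005, 0.937, 0.058, 0.0, 0.0],
    [0.0, 0.0, 0.067, 0.805, 0.127, 0.001],
    [0.0, 0.0, 0.0, 0.131, 0.662, 0.207],
]

# Table 3: prior probabilities P(x = 1), ..., P(x = 5).
PRIORS = {
    ("Expert 1", "Traces"): [0.45, 0.04, 0.30, 0.15, 0.06],
    ("Expert 1", "Body fluids"): [0.87, 0.07, 0.04, 0.01, 0.01],
    ("Expert 2", "Traces"): [0.45, 0.04, 0.35, 0.15, 0.01],
    ("Expert 2", "Body fluids"): [0.87, 0.07, 0.05, 0.01, 0.0],
    ("Expert 3", "Traces"): [0.45, 0.04, 0.25, 0.20, 0.06],
    ("Expert 3", "Body fluids"): [0.87, 0.07, 0.05, 0.01, 0.0],
}


def predictive_values(cond, prior):
    """Return P(x = i | estimate = i) for i = 1..len(prior)."""
    values = []
    for i in range(len(prior)):
        # Eq. (2): probability that the estimator returns i + 1.
        marginal = sum(cond[k][i] * prior[k] for k in range(len(prior)))
        # Numerator of Eq. (1).
        joint = cond[i][i] * prior[i]
        # The PV is undefined when the estimate i + 1 can never occur.
        values.append(joint / marginal if marginal > 0 else float("nan"))
    return values


for (expert, stain), prior in PRIORS.items():
    pv = predictive_values(COND, prior)
    print(expert, stain, [round(v, 2) for v in pv])
```

Because the inputs are rounded to two or three decimals, the recomputed values can differ from Table 4 in the second decimal place. Replacing an entry of `PRIORS` shows how a different prior belief changes the predictive values.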
4. Conclusion
In this paper, we propose the predictive value to be considered as a global measure of the likelihood-based estimator efficiency. It is notable that the PV is not meant to be a measure of the uncertainty related to the estimates.
The values presented in this study depend on the simulated data and the priors we defined. These can be adapted with respect to the context where the DNA evidence is analyzed. PV calculations using priors different from those we propose here can be carried out using the R package forensim, available from http://forensim.r-forge.r-project.org/.
The maximum likelihood estimator of the number of contributors to forensic DNA mixtures can be powerful in critical cases, for instance when dealing with DNA casework. Very often in such cases, scarce data is available about the origin of the stain and only genetic data are available. These data consist of qualitative information about which alleles are present in the stain and quantitative information about the alleles’ peak heights and areas. The maximum likelihood estimator only considers qualitative data. Quantitative information might not always help to separate the DNA profiles into individual components. Moreover, there is no consensus in the literature about how peak heights or areas should be taken into account, and the developments in the literature dealing with quantitative data [10], [11], [12], [13], [14], [15] have not encountered the expected success in the forensic community.
The fact that genetic data support a certain number of contributors to the evidentiary stain can be of significant help for the investigators, before any suspect or comparison between profiles can be processed. When no other information is available, this estimate can guide investigators in their search for potential suspects. To conclude, even if the maximum likelihood approach might seem too complex for presentation in court, it must not be neglected as a valuable tool to determine the number of contributors to DNA stains and forensic experts should be aware that an alternative method to maximum allele count exists.
Acknowledgments
We thank two referees for a thorough review and constructive comments. We are grateful to Anne Viallefont and David Fouchet for their helpful comments.
References
- Mixtures. In: Buckleton J, Triggs CM, Walsh SJ editor. Forensic DNA Evidence Interpretation. CRC Press; 2005;p. 217–274
- Empirical analysis of the STR profiles resulting from conceptual mixtures. J. Forensic Sci. 2005;50:1361–1366
- H. Haned, L. Pène, J.R. Lobry, A.B. Dufour, D. Pontier, Estimating the number of contributors to forensic DNA mixtures: does maximum likelihood perform better than maximum allele count? J. Forensic Sci., 2011, in press.
- Allele frequencies for 15 Autosomal STR loci on U.S. Caucasian, African American, and Hispanic populations. J. Forensic Sci. 2003;8:908–911
- Applied Biosystems (2001) AmpFlSTR® Identifiler™ PCR Amplification Kit User’s Manual, Foster City, CA, P/N 4323291.
- Mixture interpretation: Defining the relevant features for guidelines for the assessment of mixed DNA profiles in forensic casework. J. Forensic Sci. 2009;54:810–821
- DNA mixtures in forensic casework: a 4-year retrospective study. Forensic Sci. Int. 2003;134:180–186
- Low Copy Number. In: Buckleton J, Triggs CM, Walsh SJ editor. Forensic DNA Evidence Interpretation. CRC Press; 2005;p. 275–297
- Towards understanding the effect of uncertainty in the number of contributors to DNA stains. Forensic Sci. Int. Genet. 2007;1:20–28
- Interpreting simple STR mixtures using allele peak areas. Forensic Sci. Int. 1998;91:41–53
- Taking account of peak areas when interpreting mixed DNA profiles. J. Forensic Sci. 1998;43:62–69
- Linear mixture analysis: a mathematical approach to resolving mixed DNA samples. J. Forensic Sci. 2001;46:1372–1378
- Mixtures. In: Buckleton J, Triggs CM, Walsh SJ editor. Forensic DNA Evidence Interpretation. CRC Press; 2005;p. 217–274
- Least-square deconvolution: a framework for interpreting short tandem repeat mixtures. J. Forensic Sci. 2006;51:1284–1297
- Identification and separation of DNA mixtures using peak area information. Forensic Sci. Int. 2007;166:28–34
PII: S1872-4973(10)00079-7
doi:10.1016/j.fsigen.2010.04.005
© 2010 Elsevier Ireland Ltd. All rights reserved.