Research Article | Volume 7, Issue 1, P52-54, January 2013

Decision-making in familial database searching: KI alone or not alone?

      Abstract

      We consider the comparison of hypotheses “parent–child” or “full siblings” against the alternative of “unrelated” for pairs of individuals for whom DNA profiles are available. This is a situation that occurs repeatedly in familial database searching. A decision rule that uses both the kinship index (KI), also known as the likelihood ratio, and the identity-by-state statistic (IBS) was advocated in a recent report as superior to the use of KI alone. Such a proposal appears to conflict with the Neyman–Pearson Lemma of statistics, which states that the likelihood ratio alone provides the most powerful criterion for distinguishing between any two simple hypotheses. We therefore performed a simulation study that was two orders of magnitude larger than in the previous report, and our results corroborate the theoretical expectation that KI alone provides a better decision rule than KI combined with IBS.

      1. Introduction

      In a recent comparison of statistical methods for familial database searches [Ge et al., 2011], Ge et al. advocated decision-making based upon the combined use of IBS and KI as superior to the use of KI alone. Here, IBS (‘identity-by-state’) denotes the number of alleles shared by two given individuals, and KI (‘kinship index’) is the likelihood ratio in favor of a specified relationship over the alternative that they are unrelated. In familial database searching, the relationships of interest are “parent–child” and “full siblings”, and we will write KI_PC and KI_FS respectively for the corresponding KIs. All humans are related to one another, though relationships may be remote and unknown. “Unrelated” usually refers to the simple, but artificial, model of independent sampling of alleles from a certain gene pool. A more realistic model would use the population genetic parameter F_ST to account for the co-ancestry of “unrelated” individuals. We advocate using an F_ST adjustment in real applications, but for the purposes of this short note on the relative merits of different decision rules, we follow [Ge et al., 2011] and assume F_ST = 0.
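The two statistics can be illustrated at a single locus with a minimal Python sketch. It assumes Hardy–Weinberg proportions with no co-ancestry adjustment, no mutation, and independent loci (so per-locus KIs multiply). The function names, the identity-by-descent weights `PARENT_CHILD` and `FULL_SIBS`, and the allele frequencies in the usage note are illustrative choices, not taken from either report or from any package.

```python
from collections import Counter

# Probabilities of sharing 0, 1 or 2 alleles identical by descent (IBD).
PARENT_CHILD = (0.0, 1.0, 0.0)
FULL_SIBS = (0.25, 0.5, 0.25)

def ibs_count(g1, g2):
    """Number of alleles two genotypes share identical by state (0, 1 or 2)."""
    shared = Counter(g1) & Counter(g2)  # multiset intersection
    return sum(shared.values())

def genotype_prob(g, freq):
    """Hardy-Weinberg probability of an unordered genotype (no co-ancestry)."""
    a, b = g
    return freq[a] ** 2 if a == b else 2 * freq[a] * freq[b]

def prob_given_one_ibd(g1, g2, freq):
    """P(g2 | g1) when exactly one allele of g2 is a copy of an allele of g1."""
    c, d = g2
    total = 0.0
    for s in g1:  # each allele of g1 is the copied one with probability 1/2
        if c == s and d == s:
            p = freq[s]
        elif c == s:
            p = freq[d]
        elif d == s:
            p = freq[c]
        else:
            p = 0.0
        total += 0.5 * p
    return total

def kinship_index(g1, g2, freq, k):
    """Single-locus likelihood ratio of a relationship with IBD weights k vs unrelated."""
    k0, k1, k2 = k
    p2 = genotype_prob(g2, freq)
    x1 = prob_given_one_ibd(g1, g2, freq) / p2
    x2 = 1.0 / p2 if sorted(g1) == sorted(g2) else 0.0
    return k0 + k1 * x1 + k2 * x2

def profile_statistics(profile1, profile2, freqs, k):
    """Total IBS and multi-locus KI, assuming the loci are independent."""
    ibs, ki = 0, 1.0
    for g1, g2, freq in zip(profile1, profile2, freqs):
        ibs += ibs_count(g1, g2)
        ki *= kinship_index(g1, g2, freq, k)
    return ibs, ki
```

For example, with hypothetical frequencies `{"a": 0.1, "b": 0.1, "c": 0.8}` and two individuals who are both `("a", "b")`, the single-locus parent–child index reduces to the familiar (p_a + p_b) / (4 p_a p_b), and the locus contributes 2 to IBS.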
      The assertion that the pair of statistics (IBS,KI) gives a better decision rule than KI alone appears implausible because it conflicts with the Neyman–Pearson Lemma [Neyman and Pearson, 1933], which states that the likelihood ratio is the most powerful statistic for distinguishing between two simple hypotheses. Intuitively, KI includes all the information provided by the genotype data for distinguishing a specified relationship from unrelated, and its efficiency cannot be enhanced by the concurrent consideration of any other statistic. The fact that (IBS,KI) is a pair of numbers does not affect the logic of the Neyman–Pearson Lemma: in statistics, a ‘statistic’ is any function of the data, be it univariate, bivariate or multivariate.
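The lemma can be stated compactly in the notation of this note. The symbols below ($G$ for the pair of DNA profiles, $\alpha$ for the permitted false positive rate, $x_\alpha$ for the matching KI threshold, and $R$ for an arbitrary decision region) are introduced here for illustration only. The statement also assumes that a threshold attaining exactly $\alpha$ exists, which need not hold for a discrete statistic without randomization.

$$
\Pr(\mathrm{KI}(G) > x_\alpha \mid \text{unrelated}) = \alpha
\;\Longrightarrow\;
\Pr(\mathrm{KI}(G) > x_\alpha \mid \text{related}) \;\ge\; \Pr(G \in R \mid \text{related})
\quad \text{for every } R \text{ with } \Pr(G \in R \mid \text{unrelated}) \le \alpha .
$$

Here "related" stands for the single specified relationship. Taking $R = \{\mathrm{IBS} > ibs_0,\ \mathrm{KI} > ki_0\}$ gives the bivariate rule, so at a matched false positive rate its false negative rate cannot be lower than that of KI alone.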
      There are superficial attractions to using IBS rather than KI. The former may be easier for non-experts to understand, and does not require knowledge of allele frequencies to compute. Moreover, use of KI entails deciding in advance the alternative hypothesis of interest (e.g. “parent–child” or “full siblings”), which IBS does not, and KI may fail to reject the null hypothesis of “unrelated” when the individuals are in fact related but the specified alternative hypothesis is wrong. These perceived advantages of IBS over KI are illusory, however. Firstly, although allele frequencies are not required to compute IBS, they must be invoked to evaluate IBS. Secondly, to choose an appropriate IBS threshold, a power or similar analysis is required, and this in turn requires specifying the alternative relationship. In any case, combining IBS with KI would lose any such seeming advantage of using IBS alone.
      Ge et al. [Ge et al., 2011] provided simulation results that they interpreted as proving the superiority of (IBS,KI) over KI alone. For example, for “parent–child” vs. “unrelated”, their false positive rate (FPR) of 0.0014 for KI > 10,000 drops to 0.0010 and 0.0007 when adding the requirement that IBS > 15 and IBS > 16, respectively. At the same time, the false negative rate (FNR) increases from 0.494 to 0.558 (incorrectly reported as 0.218) and 0.659. In their comments, the authors expressed the view that the gain in FPR would be worth the consequent loss in FNR. However, trading off FPR against FNR requires subjective judgments as to the relative harm of each type of error, and the conventional approach is to avoid these judgments by comparing the rates of one type of error when the other error rate is fixed. Here, we made these comparisons in the context of a much larger simulation study than was undertaken by Ge et al. [Ge et al., 2011].

      2. Methods

      One hundred million (10^8) each of full-sibling and parent–child pairs, and one billion (10^9) unrelated pairs, were simulated according to Mendelian principles, using allele probabilities obtained from Caucasian population data for the 13 CODIS Short Tandem Repeat loci [Budowle et al., 2001]. This is greatly in excess of the one million (10^6) simulated pairs previously employed [Ge et al., 2011], because estimating the difference in power between two methods requires larger sample sizes than estimating the power of a single method.
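Simulation "according to Mendelian principles" can be sketched for one locus as follows. This is a minimal illustration in Python and not the interface of the software used for the study. The function names and the allele frequencies in the usage note are hypothetical, and alleles are drawn independently from the population frequencies, with no mutation.

```python
import random

def draw_allele(freq, rng):
    """Draw one allele according to the population allele frequencies."""
    alleles = list(freq)
    weights = [freq[a] for a in alleles]
    return rng.choices(alleles, weights=weights, k=1)[0]

def random_genotype(freq, rng):
    """Genotype of a random individual: two independent allele draws."""
    return (draw_allele(freq, rng), draw_allele(freq, rng))

def unrelated_pair(freq, rng):
    """Two individuals sampled independently from the gene pool."""
    return random_genotype(freq, rng), random_genotype(freq, rng)

def parent_child_pair(freq, rng):
    """Child receives one allele from the parent and one from the population."""
    parent = random_genotype(freq, rng)
    child = (rng.choice(parent), draw_allele(freq, rng))
    return parent, child

def full_sibling_pair(freq, rng):
    """Each sibling receives one random allele from each of two shared parents."""
    mother = random_genotype(freq, rng)
    father = random_genotype(freq, rng)

    def child():
        return (rng.choice(mother), rng.choice(father))

    return child(), child()
```

A call such as `parent_child_pair({"a": 0.2, "b": 0.3, "c": 0.5}, random.Random(1))` returns one simulated pair at one locus. A full profile would repeat the draw for each of the 13 loci with that locus's own frequencies.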
      Following [Ge et al., 2011], we computed KI_PC for the parent–child and unrelated pairs, KI_FS for the full-sibling and unrelated pairs, and IBS for all pairs. To avoid specifying the relative costs/benefits of the two types of errors, we compared the false negative rate (FNR) when the false positive rate (FPR) was equalized for the two decision rules, and vice versa. For example, we compared the FNR of the decision rule that declares a pair to be full siblings when both IBS ≥ 15 and KI_FS ≥ 1000, with the FNR of declaring “full sibling” when KI_FS ≥ x, where x is chosen so that both decision rules have the same FPR. Similarly, we compared the FPR of the two rules when x was chosen to equalize the two FNRs.
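The comparison at an equalized FPR can be sketched in a few lines of Python. This is a small in-memory illustration and not the procedure used for one billion pairs. Each simulated pair is assumed to be summarized as an `(ibs, ki)` tuple. The function names and the toy data in the usage note are hypothetical, and ties in KI at the chosen threshold are ignored.

```python
import math

def bivariate_error_rates(unrelated, related, ibs0, ki0):
    """False positive count, FPR and FNR of 'IBS >= ibs0 and KI >= ki0'."""
    def positive(pair):
        ibs, ki = pair
        return ibs >= ibs0 and ki >= ki0

    n_fp = sum(1 for pair in unrelated if positive(pair))
    n_tp = sum(1 for pair in related if positive(pair))
    return n_fp, n_fp / len(unrelated), 1 - n_tp / len(related)

def ki_threshold_for_fpr(unrelated, n_fp):
    """KI value x such that n_fp unrelated pairs satisfy KI >= x (no ties assumed)."""
    if n_fp == 0:
        return math.inf
    kis = sorted((ki for _, ki in unrelated), reverse=True)
    return kis[n_fp - 1]

def compare_at_equal_fpr(unrelated, related, ibs0, ki0):
    """FNR of the bivariate rule and of KI alone at the same FPR."""
    n_fp, fpr, fnr_bivariate = bivariate_error_rates(unrelated, related, ibs0, ki0)
    x = ki_threshold_for_fpr(unrelated, n_fp)
    fnr_univariate = sum(1 for _, ki in related if ki < x) / len(related)
    return {
        "fpr": fpr,
        "x": x,
        "fnr_bivariate": fnr_bivariate,
        "fnr_univariate": fnr_univariate,
    }
```

With the made-up inputs `unrelated = [(10, 0.1), (14, 50.0), (16, 2000.0), (15, 1500.0), (12, 3000.0)]` and `related = [(18, 5000.0), (14, 2500.0), (16, 900.0), (20, 4000.0)]`, calling `compare_at_equal_fpr(unrelated, related, 15, 1000)` matches both rules at an FPR of 0.4 with x = 2000, and the FNR is 0.5 for the bivariate rule against 0.25 for KI alone.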
      Additional computational effort was applied to the selection of x when IBS ≥ 16 and KI ≥ 100,000. Ge et al. report the false positive rates as 1 in a million for parent–child and 3 in a million for siblings. However, the sampling variation associated with estimating such small proportions from only one million simulated pairs is high. Therefore, we invested more computational effort, firstly in estimating the false positive rates more accurately, and secondly in determining appropriate thresholds for declaring kinship. The numbers of false positives in one billion unrelated pairs were 493 classified as parent–child and 478 classified as full siblings. Estimating the value of x that gives equivalent FPRs by naively storing all the observed values would impose excessive memory requirements. To overcome this hurdle, we employed the method of Woodruff [Woodruff, 1952], which we explain by means of the following example:
      Assume that we seek to accurately estimate the 90th percentile of a certain distribution. A straightforward way to do this would be to draw a sample of size one billion from this distribution, and use the 900,000,000th value in ascending order as the sought-after estimate. However, one billion double precision numbers require about 7.5 GB of RAM to store, and disk-based sorting of the sample values, while using less RAM, would be extremely slow. The alternative [Woodruff, 1952] is to take a smaller sample from the distribution first and to use it to calculate a confidence interval for the desired percentile in the full sample. Subsequently, only values from the full sample that fall within the confidence limits need to be stored. For example, with an initial sample of size n = 100 drawn from a standard normal distribution, a 99.7% (±3 standard deviations) confidence interval for the 90th percentile of the full sample would be obtained by first calculating a 99.7% confidence interval for binomial proportion p = 0.9 from the small sample. Using the normal approximation to the binomial, the required interval is
      $$p \pm 3 \times \sqrt{\frac{p(1-p)}{n}}$$
      which gives a confidence interval for the binomial proportion of [0.81, 0.99] if p = 0.9 and n = 100. In our example, the 0.81 and 0.99 sample quantiles of the small sample were 0.891 and 2.354, and these demarcated the confidence interval for the 90th percentile in the full sample. Now, when considering the full sample, only values between 0.891 and 2.354 had to be stored, whereas values below this range were counted but not stored. As a smaller illustration, we took a sample of size n = 100,000 (rather than one billion) and found 81,565 values below 0.891, and 17,492 values between 0.891 and 2.354. The sought-after 90,000th value of the sample in ascending order was the 8435th (= 90,000 − 81,565) smallest of the stored values, which equaled 1.272. This compares favorably to the “true” value of 1.282, and we only had to store (and sort) fewer than one fifth of the simulated values.
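The two-pass procedure described above can be sketched in Python as follows. This is an illustrative reading of the example and not the code used for the study. The function name `two_pass_quantile`, its parameters, and the rounding guard are assumptions made here. The sketch raises an error in the rare case that the target rank falls outside the stored window.

```python
import math
import random

def two_pass_quantile(draw, n_full, p, n_pilot=100, z=3.0, seed=0):
    """Estimate the p-quantile of n_full draws while storing only a window of values.

    `draw(rng)` must return one random value using the supplied generator.
    Returns the estimate and the number of values that had to be stored.
    """
    rng = random.Random(seed)

    def rank(q, n):
        # 1-based rank of the q-quantile; rounding guards against float noise
        return max(1, min(n, math.ceil(round(q * n, 9))))

    # Pass 1: a small pilot sample brackets the quantile.
    pilot = sorted(draw(rng) for _ in range(n_pilot))
    half_width = z * math.sqrt(p * (1 - p) / n_pilot)
    lower = pilot[rank(max(p - half_width, 0.0), n_pilot) - 1]
    upper = pilot[rank(min(p + half_width, 1.0), n_pilot) - 1]

    # Pass 2: count values below the window, store only those inside it.
    n_below = 0
    kept = []
    for _ in range(n_full):
        x = draw(rng)
        if x < lower:
            n_below += 1
        elif x <= upper:
            kept.append(x)
    kept.sort()

    index = rank(p, n_full) - n_below - 1
    if not 0 <= index < len(kept):
        raise ValueError("target rank fell outside the stored window")
    return kept[index], len(kept)
```

For instance, `two_pass_quantile(lambda rng: rng.gauss(0.0, 1.0), n_full=100_000, p=0.9)` mirrors the smaller illustration in the text. The exact estimate and the number of stored values depend on the seed.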

      3. Results

      The results of our simulations are summarized in Table 1. Columns 4 and 7 of Table 1 correspond to Table 6(c) of Ge et al. [Ge et al., 2011]. The differences between the two tables reflect a higher precision resulting from the larger number of simulations performed in our study, and from the correction of one gross error in the earlier report, mentioned above. Although the values in the two tables are generally similar, they frequently differ by >10% and occasionally by >50%.
      Table 1. Comparison of the percentage false positive rates (FPR) and false negative rates (FNR) of two decision rules for familial database searching. Columns 4 and 7 contain the FPR and FNR for the bivariate decision rule that declares a pair of individuals to be related if both IBS > ibs0 and KI > ki0, where the values of ibs0 and ki0 are specified in columns 1 and 2. Columns 3 and 6 give the values of kiN and kiP such that the univariate decision rule based upon KI > kiN has the same FNR, and that based upon KI > kiP has the same FPR, as the bivariate rule. Columns 5 and 8 give the FPR and FNR for the univariate decision rules using KI > kiN and KI > kiP, respectively. The purpose of the table is to allow comparison of columns 4 and 5, and of columns 7 and 8, and since these columns contain error rates, smaller values indicate the better decision rule.

      | (1) ibs0 | (2) ki0 | (3) kiN | (4) FPR (ibs0,ki0) | (5) FPR (kiN) | (6) kiP | (7) FNR (ibs0,ki0) | (8) FNR (kiP) |
      |---|---|---|---|---|---|---|---|
      | Parent–child vs. unrelated | | | | | | | |
      | 14 | 100 | 291 | 4.77 × 10^-4 | 3.23 × 10^-4 | 126 | 0.048 | 0.018 |
      | 14 | 1000 | 1228 | 1.36 × 10^-4 | 1.20 × 10^-4 | 1054 | 0.171 | 0.153 |
      | 14 | 10,000 | 10,608 | 1.41 × 10^-5 | 1.35 × 10^-5 | 10,153 | 0.510 | 0.503 |
      | 15 | 1000 | 2595 | 1.12 × 10^-4 | 6.22 × 10^-5 | 1319 | 0.275 | 0.180 |
      | 15 | 10,000 | 14,109 | 1.25 × 10^-5 | 9.76 × 10^-6 | 10,994 | 0.558 | 0.516 |
      | 16 | 1000 | 8610 | 7.22 × 10^-5 | 1.77 × 10^-5 | 2213 | 0.475 | 0.251 |
      | 16 | 10,000 | 27,088 | 9.23 × 10^-6 | 4.18 × 10^-6 | 14,335 | 0.661 | 0.561 |
      | 16 | 100,000 | 174,786 | 4.78 × 10^-7 | 2.75 × 10^-7 | 128,235 | 0.871 | 0.845 |
      | Full siblings vs. unrelated | | | | | | | |
      | 14 | 100 | 108 | 7.89 × 10^-4 | 7.63 × 10^-4 | 103 | 0.238 | 0.235 |
      | 14 | 1000 | 1009 | 8.93 × 10^-5 | 8.79 × 10^-5 | 1004 | 0.426 | 0.426 |
      | 14 | 10,000 | 10,011 | 7.56 × 10^-6 | 7.46 × 10^-6 | 9962 | 0.634 | 0.633 |
      | 15 | 1000 | 1072 | 8.69 × 10^-5 | 8.25 × 10^-5 | 1034 | 0.432 | 0.428 |
      | 15 | 10,000 | 10,113 | 7.52 × 10^-6 | 7.38 × 10^-6 | 10,089 | 0.634 | 0.634 |
      | 16 | 1000 | 1410 | 7.72 × 10^-5 | 6.22 × 10^-5 | 1157 | 0.457 | 0.439 |
      | 16 | 10,000 | 10,774 | 7.28 × 10^-6 | 6.80 × 10^-6 | 10,370 | 0.640 | 0.637 |
      | 16 | 100,000 | 101,560 | 4.93 × 10^-7 | 4.73 × 10^-7 | 99,330 | 0.806 | 0.804 |

      In no scenario considered was the univariate decision rule based upon KI alone inferior to the bivariate rule based upon (IBS,KI). Specifically, no entry of columns 4 and 7 of Table 1 is smaller than the corresponding entry of columns 5 or 8, respectively. In many settings the two estimated error rates are similar, but their ratio can range up to four in the scenarios considered.
      In the final row of Table 1, the FPR estimates correspond respectively to 493 and 473 observed false positives in one billion trials. The standard deviations of these counts were both approximately equal to 22, and so their difference was not statistically significant. However, 10 of the 16 FPR comparisons and all 16 FNR comparisons were significant at α = 0.05 in favor of the univariate decision rule, and none yielded even nominal evidence against it.
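The quoted standard deviation of about 22 follows from the binomial variance of a count in one billion trials, as the short Python check below shows. The z-score at the end is only a rough illustration: it treats the two counts as independent, which is an assumption made here and is not stated in the text, and the variable names are likewise illustrative.

```python
import math

N_TRIALS = 1_000_000_000   # unrelated pairs simulated
count_bivariate = 493      # false positives, (IBS, KI) rule
count_univariate = 473     # false positives, KI-alone rule

def binomial_sd(count, n):
    """Standard deviation of a binomial count, estimated with p = count / n."""
    p = count / n
    return math.sqrt(n * p * (1 - p))

sd_bivariate = binomial_sd(count_bivariate, N_TRIALS)
sd_univariate = binomial_sd(count_univariate, N_TRIALS)

# Rough z-score for the difference, ASSUMING the two counts are independent.
z = (count_bivariate - count_univariate) / math.sqrt(
    sd_bivariate ** 2 + sd_univariate ** 2
)

print(f"sd (bivariate)  = {sd_bivariate:.1f}")
print(f"sd (univariate) = {sd_univariate:.1f}")
print(f"z (independence assumed) = {z:.2f}")
```

Under that independence assumption the z-score is well below 1.96, in line with the statement that the difference in this row is not significant.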
      The R package used to generate the data in Table 1 has been made available in the Comprehensive R Archive Network (http://cran.r-project.org/web/packages/relSim/index.html).

      4. Conclusion

      In a simulation-based assessment of two decision rules for familial database searching, namely (IBS,KI) and KI alone, we found highly significant support for the latter. This result was to be expected because it is implied by the Neyman–Pearson Lemma [Neyman and Pearson, 1933]. The differences in error rates between the two approaches were small in many comparisons, but moderately large in others. Even where the gain in power of KI alone is small, the compound decision rule adds complexity to the decision process and conveys no advantage. We therefore recommend that KI alone be used to test any two competing hypotheses for the relationship between a pair of individuals, as occurs in familial database searching.

      Acknowledgements

      We gratefully acknowledge the comments of Lisa Melia, Michael Taylor, and two anonymous referees, which greatly improved this paper. We would also like to acknowledge the assistance of Torben Tvedebrink, whose help was invaluable in the validation and verification of the software. This work was supported in part by grant 2011-DN-BX-K541 from the US National Institute of Justice.

      References

        • Ge J., Chakraborty R., Eisenberg A., Budowle B. Comparison of familial DNA database searching strategies. J. Forensic Sci. 2011; 56: 1448-1456
        • Neyman J., Pearson E. On the problem of the most efficient tests of statistical hypotheses. Philos. Trans. R. Soc. Lond. A. 1933; 231: 289-337
        • Budowle B., Shea B., Niezgoda S.J., Chakraborty R. CODIS STR loci data from 41 sample populations. J. Forensic Sci. 2001; 46: 453-489
        • Woodruff R.S. Confidence intervals for medians and other position measures. J. Am. Stat. Assoc. 1952; 47: 635-646