
Center for Demographic and Population Genetics, University of Texas at Houston, Texas 77025

Manuscript received November 1, 1977
Revised copy received February 13, 1978

ABSTRACT

The magnitudes of the systematic biases involved in sample heterozygosity and sample genetic distances are evaluated, and formulae for obtaining unbiased estimates of average heterozygosity and genetic distance are developed. It is also shown that the number of individuals to be used for estimating average heterozygosity can be very small if a large number of loci are studied and the average heterozygosity is low. The number of individuals to be used for estimating genetic distance can also be very small if the genetic distance is large and the average heterozygosity of the two species compared is low.

M. NEI. Genetics 89: 583-590, July 1978.

Studying the sampling variance of heterozygosity and genetic distance, NEI and ROYCHOUDHURY (1974) concluded that for estimating average heterozygosity and genetic distance a large number of loci rather than a large number of individuals per locus should be used when the total number of genes to be examined is fixed. Recently, GORMAN and RENZI (unpublished) have shown that in lizards even a single individual from each species provides genetic distance estimates that are quite useful for constructing dendrograms, provided that the genetic distances between species are sufficiently large. They also confirmed NEI and ROYCHOUDHURY’S (1974) theoretical conclusion that a relatively reliable estimate of average heterozygosity can be obtained from a small number of individuals if a large number of loci are examined. In this note I shall extend NEI and ROYCHOUDHURY’S (1974) study and present a further theoretical basis for GORMAN and RENZI’S unpublished observations. I will also present statistical methods for obtaining unbiased estimates of average heterozygosity and genetic distance.

The first problem I would like to discuss is the magnitude of systematic bias introduced by a small sample size when the ordinary method of estimating average heterozygosity and genetic distance is used. Let $p_i$ be the frequency of the $i$th allele at a locus in a population and $x_i$ be the corresponding allele frequency in a sample from the population. The population heterozygosity at this locus is $h = 1 - \sum p_i^2$, where $\sum$ stands for summation over all alleles. The average heterozygosity per locus ($H$) is defined as the mean of $h$ over all structural loci in the genome. Theoretically, we assume that there are an infinite number of structural loci. We are then interested in estimating $H$ by surveying $r$ loci and $n$ (diploid) individuals per locus. Thus, there are two sampling processes involved, i.e., sampling of loci from the genome and sampling of genes ($2n$ genes) from the population at each locus. We assume that each of these samplings is conducted at random.

Usually, $H$ is estimated by a sample average heterozygosity, $\hat{H}_1$, which is the average of $1 - \sum x_i^2$ over the $r$ loci studied. Under the assumption of multinomial sampling of genes, the expectation of $\sum x_i^2$ for a particular locus is given by $\sum p_i^2 + (1 - \sum p_i^2)/2n$ (e.g., CROW and KIMURA 1970). Therefore, the expectation of $\hat{H}_1$ is

$$E(\hat{H}_1) = E_g E_s(\hat{H}_1) = \left(1 - \frac{1}{2n}\right) H, \tag{1}$$

where $E_g$ and $E_s$ are the expectation operators with respect to the distribution of $h$ among loci and the multinomial distribution of $x_i$, respectively. For a single locus, an unbiased estimate of $h$ is given by

$$\hat{h} = 2n\left(1 - \sum x_i^2\right)/(2n - 1), \tag{2}$$

whereas the corresponding unbiased estimate of $H$ is

$$\hat{H} = \sum_{k=1}^{r} \hat{h}_k / r, \tag{3}$$

where $\hat{h}_k$ is the value of $\hat{h}$ for the $k$th locus. Here $n$ may vary from locus to locus. The estimate (3) generally has a larger expected squared deviation from $H$ than $\hat{H}_1$ (NEI and ROYCHOUDHURY 1974; MITRA 1976), but if a few individuals are studied for a large number of loci, the systematic bias in (1) seems to be much more serious. NEI and ROYCHOUDHURY (1974) were aware of this bias, but did not particularly recommend formula (3), since the sample size employed at that time was generally large.

The ordinary estimate of genetic distance also has a systematic bias. Let $p_i$ and $q_i$ be the frequencies of the $i$th allele in populations $X$ and $Y$, respectively, and $x_i$ and $y_i$ be the corresponding sample allele frequencies. NEI'S (1972) genetic (standard) distance is defined as

$$D = -\ln\left[G_{XY}/\sqrt{G_X G_Y}\right], \tag{4}$$

where $G_X$, $G_Y$, and $G_{XY}$ are the means of $\sum p_i^2$, $\sum q_i^2$, and $\sum p_i q_i$ over all loci in the genome, respectively. The usual method of estimating $D$ is to replace population gene identities, $G_X$, $G_Y$, and $G_{XY}$, by sample gene identities, $J_X$, $J_Y$, and $J_{XY}$, which are the averages of $\sum x_i^2$, $\sum y_i^2$, and $\sum x_i y_i$ over the $r$ loci studied, respectively. Namely, it is estimated by $\hat{D}_1 = -\ln[J_{XY}/\sqrt{J_X J_Y}]$. When $r$ is sufficiently large, the expectation of $\hat{D}_1$ is given by

$$\begin{aligned} E_g E_s(\hat{D}_1) &\approx -\ln\left[\frac{E_g E_s(J_{XY})}{\sqrt{E_g E_s(J_X)\,E_g E_s(J_Y)}}\right] \quad \text{(LI and NEI 1975)} \\ &\approx D + \frac{1 - G_X}{4 n_X G_X} + \frac{1 - G_Y}{4 n_Y G_Y}, \end{aligned} \tag{5}$$

where $n_X$ and $n_Y$ are the numbers of individuals sampled from populations $X$ and $Y$, respectively, and $(1 - G_X)/(2n_X G_X)$ and $(1 - G_Y)/(2n_Y G_Y)$ are assumed to be small compared with unity, which is true in almost all cases. Here $E_s(\hat{D}_1)$ denotes the expectation of $\hat{D}_1$ for $r$ (given) loci with respect to the multinomial samplings of genes, whereas $E_g$ refers to taking the expectation of $E_s(\hat{D}_1)$ with respect to sampling of $r$ loci from the genome. Since average heterozygosity ($H = 1 - G$) is generally 0.2 or less, the bias introduced by a small sample size in $\hat{D}_1$ is of the same order of magnitude as that for $\hat{H}_1$. However, $\hat{D}_1$ tends to give an overestimate of $D$, rather than an underestimate. It is noted that when $D = 0$, $G_X = G_Y = G$, and $n_X = n_Y = n$, $E(\hat{D}_1)$ is approximately $(1 - G)/(2nG)$. Namely, even if the two populations are genetically identical with each other, the sample genetic distance can be larger than 0 when the sample size is small. NEI (1973) has called this spurious distance. In many lizard species, the average heterozygosity is of the order of 0.06 (GORMAN and RENZI, unpublished). Therefore, the expected magnitude of the bias when a single individual is sampled from each of the two species to be compared is about 0.03. This magnitude of bias is not important if $D$ is large, say more than 0.15, but becomes serious when $D$ is very small. On the other hand, if $n_X$ and $n_Y$ are 100, the expected bias is about 0.0003, which is generally negligible.

An unbiased estimate of $D$ may be obtained by substituting the unbiased estimates of $G_X$ and $G_Y$ for $J_X$ and $J_Y$. Namely,

$$\hat{D} = -\ln\left[\hat{G}_{XY}/\sqrt{\hat{G}_X \hat{G}_Y}\right], \tag{6}$$

where $\hat{G}_X$ and $\hat{G}_Y$ are the averages of $(2n_X J_X - 1)/(2n_X - 1)$ and $(2n_Y J_Y - 1)/(2n_Y - 1)$ over the $r$ loci studied, respectively (with $J_X$ and $J_Y$ here denoting the single-locus values $\sum x_i^2$ and $\sum y_i^2$), and $\hat{G}_{XY} = J_{XY}$. It is noted that, unlike $\hat{D}_1$, $\hat{D}$ can be negative, though its absolute value should not be large. This negative value is caused by sampling error and will occur only very rarely if $n_X$ and $n_Y$ are large. A negative value of $\hat{D}$ creates a problem in constructing a dendrogram. I suggest that all negative values of $\hat{D}$ should be replaced by 0 in this case.

Let us now consider the sampling variance of the unbiased estimate of average heterozygosity. It should be noted that this variance consists of two components, i.e., interlocus variance and intralocus variance (NEI and ROYCHOUDHURY 1974). The former arises because population heterozygosity varies greatly from locus to locus. This is caused by evolutionary forces such as mutation, selection, and random genetic drift. The intralocus variance is generated primarily by the process of sampling a finite number of genes from the population. The underlying statistical model for the decomposition of the total sampling variance is as follows: For the $k$th locus, the observed heterozygosity (the unbiased estimate: $\hat{h}_k = 2n(1 - \sum x_i^2)/(2n - 1)$) may be written as

$$\hat{h}_k = h_k + s_k, \tag{7}$$

where $h_k$ is the population heterozygosity ($1 - \sum p_i^2$), and $s_k$ is the sampling error with mean 0 and variance $V_s(h_k)$. Therefore, the variance, $V(h)$, of $\hat{h}_k$ over all loci (the entire genome) is

$$V(h) = V_g(h) + V_s(h), \tag{8}$$

where $V_g(h)$ is the variance of $h_k$ and $V_s(h)$ is the expectation of $V_s(h_k)$ over all loci. Here we have assumed that there are linkage equilibria among different loci and genes are sampled independently at each locus. Note that the variance components in (8) are slightly different from those of NEI and ROYCHOUDHURY (1974), since they considered the sample heterozygosity, $1 - \sum x_i^2$. If we note that the unbiased estimate ($\hat{H}$) of $H$ is a simple average of heterozygosities for all individual loci, its variance is given by

$$V(\hat{H}) = V(h)/r. \tag{9}$$
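As an illustration of how (2), (3), and (9) might be applied to data, the following Python sketch computes the unbiased single-locus estimates, their average, and a standard error. It is a minimal sketch rather than the author's program: the function names and the example allele counts are hypothetical, and $V(h)$ is assumed to be estimated by the sample variance of the single-locus values among the $r$ loci.

```python
from math import sqrt


def unbiased_locus_heterozygosity(allele_counts):
    """Equation (2): 2n(1 - sum x_i^2) / (2n - 1) for one locus.

    allele_counts lists the number of sampled genes of each allele,
    so the counts sum to 2n for n diploid individuals.
    """
    genes = sum(allele_counts)  # 2n
    if genes < 2:
        raise ValueError("need at least one diploid individual (two genes)")
    sum_x2 = sum((c / genes) ** 2 for c in allele_counts)
    return genes * (1.0 - sum_x2) / (genes - 1)


def average_heterozygosity(loci):
    """Equation (3) and the square root of equation (9)."""
    h = [unbiased_locus_heterozygosity(counts) for counts in loci]
    r = len(h)
    if r < 2:
        raise ValueError("need at least two loci to estimate a variance")
    h_bar = sum(h) / r
    # Assumption: V(h) is estimated by the sample variance among loci.
    var_h = sum((hk - h_bar) ** 2 for hk in h) / (r - 1)
    return h_bar, sqrt(var_h / r)


# Hypothetical data: four loci, one diploid individual (two genes) per locus.
# Three loci are homozygous and one is heterozygous.
loci = [[2], [1, 1], [2], [2]]
h_bar, se = average_heterozygosity(loci)
print(h_bar, se)
```

With a single individual per locus each single-locus estimate is either 0 (homozygote) or 1 (heterozygote), which illustrates why many loci are needed when $n$ is very small.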

To evaluate the effect of the number of individuals on the accuracy of the estimate of average heterozygosity, we have to know the relative magnitudes of $V_g(h)$ and $V_s(h)$ in (8). To get a rough idea, I consider an equilibrium population in which the effects of mutation and random genetic drift are balanced with the same mutation rate for all loci, assuming no selection. If we use the infinite-allele model, the interlocus variance is given by $V_g(h) = 2M/[(M + 1)^2(M + 2)(M + 3)]$, where $M = 4Nv$, in which $N$ and $v$ are the effective population size and the mutation rate per locus per generation, respectively (WATTERSON 1974; STEWART 1976; LI and NEI 1975). In practice, the value of $V_g(h)$ seems to be slightly larger than that given by the above formula, presumably because of interlocus variation in mutation rate and some other effects (NEI et al. 1976), but for our purpose it does not matter. (The stepwise mutation model gives a smaller interlocus variance than the infinite-allele model.)

The intralocus variance of the unbiased estimate of heterozygosity for a locus may be obtained by modifying NEI and ROYCHOUDHURY'S (1974) formula for the variance of $1 - \sum x_i^2$. It becomes

$$V_s(h_k) = \frac{2}{2n(2n - 1)}\left\{(3 - 4n)\left(\sum p_i^2\right)^2 + 2(2n - 2)\sum p_i^3 + \sum p_i^2\right\}. \tag{10}$$
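The single-locus intralocus variance can be checked numerically. The Python sketch below evaluates it in the form $\frac{2}{2n(2n-1)}\{(3-4n)(\sum p_i^2)^2 + 2(2n-2)\sum p_i^3 + \sum p_i^2\}$ and compares the result with a brute-force enumeration over all ordered samples of $2n$ genes. The function names and the example allele frequencies are hypothetical, and the enumeration is practical only for very small $n$.

```python
from itertools import product


def intralocus_variance(p, n):
    """Sampling variance of the unbiased heterozygosity estimate at one locus."""
    s2 = sum(q ** 2 for q in p)
    s3 = sum(q ** 3 for q in p)
    m = 2 * n  # number of genes sampled
    bracket = (3 - 2 * m) * s2 ** 2 + 2 * (m - 2) * s3 + s2
    return 2.0 / (m * (m - 1)) * bracket


def exact_variance(p, n):
    """Brute-force variance over every ordered sample of 2n genes."""
    m = 2 * n
    alleles = range(len(p))
    mean = 0.0
    mean_sq = 0.0
    for sample in product(alleles, repeat=m):
        prob = 1.0
        for a in sample:
            prob *= p[a]
        sum_x2 = sum((sample.count(a) / m) ** 2 for a in alleles)
        h = m * (1.0 - sum_x2) / (m - 1)
        mean += prob * h
        mean_sq += prob * h * h
    return mean_sq - mean ** 2


# Hypothetical allele frequencies at a three-allele locus, n = 2 individuals.
p = [0.7, 0.2, 0.1]
print(intralocus_variance(p, 2), exact_variance(p, 2))
```

For $n = 1$ the expression reduces to $\sum p_i^2 (1 - \sum p_i^2)$, the variance of an estimate that takes only the values 0 and 1.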

The expectation of this variance over all loci can be obtained by evaluating the expectations of $\sum p_i^2$, $(\sum p_i^2)^2$, and $\sum p_i^3$ with respect to the allele frequency distributions. These expectations have been evaluated by LI and NEI (1975). Using their results, we have

$$V_s(h) = \frac{2M(M + 4) + 8(n - 1)M}{2n(2n - 1)(M + 1)(M + 2)(M + 3)}. \tag{11}$$
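To see how quickly the intralocus component shrinks as $n$ grows, formula (11) can be evaluated directly. The Python sketch below does this together with the interlocus variance, taken here as $2M/[(M+1)^2(M+2)(M+3)]$ with $M = 4Nv$, and the expected heterozygosity $M/(1+M)$. The function names are hypothetical, and the last lines assume, as a rough guide only, that the standard error of the average over $r$ loci is $[V_s(h)/r]^{1/2}$ when interlocus variation is neglected.

```python
from math import sqrt


def expected_heterozygosity(M):
    """Expected heterozygosity under the infinite-allele model, M/(1 + M)."""
    return M / (1.0 + M)


def interlocus_variance(M):
    """Interlocus variance V_g(h) under the infinite-allele model."""
    return 2.0 * M / ((M + 1) ** 2 * (M + 2) * (M + 3))


def mean_intralocus_variance(M, n):
    """Equation (11): expectation of the intralocus variance over loci."""
    numerator = 2.0 * M * (M + 4) + 8.0 * (n - 1) * M
    denominator = 2 * n * (2 * n - 1) * (M + 1) * (M + 2) * (M + 3)
    return numerator / denominator


sample_sizes = (1, 2, 10, 20, 50)
print("M     H      V_g(h)   " + "  ".join(f"n={n:<5d}" for n in sample_sizes))
for M in (0.02, 0.06, 0.1, 0.2, 0.4):
    row = "  ".join(f"{mean_intralocus_variance(M, n):.5f}" for n in sample_sizes)
    print(f"{M:<5} {expected_heterozygosity(M):.3f}  {interlocus_variance(M):.5f}  {row}")

# Rough standard error for r = 25 loci, ignoring interlocus variation.
r = 25
se_one_individual = sqrt(mean_intralocus_variance(0.2, 1) / r)
se_fifty_individuals = sqrt(mean_intralocus_variance(0.2, 50) / r)
print(se_one_individual, se_fifty_individuals)
```

For $M = 0.2$ and $n = 1$ the intralocus variance evaluates to about 0.0994, so 25 loci give a standard error of about 0.06 when interlocus variation is ignored.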

The values of $V_g(h)$ and $V_s(h)$ for various values of $M$ and $n$ are given in Table 1. It is clear that for $n = 1$, $V_s(h)$ is larger than $V_g(h)$, but $V_s(h)$ rapidly decreases as $n$ increases. With $n = 10$, $V_s(h)$ is nearly one-tenth of $V_g(h)$. This clearly indicates that in order to reduce the sampling error of average heterozygosity we must examine a large number of loci rather than a large number of individuals per locus. Of course, if one wants to study not only the average heterozygosity but also the allele frequency distribution for each locus, he must examine a large number of individuals.

However, some warning against using an extremely small number of individuals should be mentioned. The above argument assumes that a large number of loci are available for study. In practice, technical difficulties often limit the number of loci studied. In fact, fewer than 30 loci were studied in most recent protein surveys. This number is small; ideally, more than 50 loci should be used to obtain a reliable estimate of average heterozygosity for the total genome. If this cannot be done technically, a large number of individuals studied per locus still helps to reduce the standard error of average heterozygosity. In Table 1, for example, if $H$ is 0.167, the expected intralocus variance ($V_s(h)$) is 0.09943 for $n = 1$. Thus, if one individual is examined for 25 loci, the expected standard error of the average heterozygosity estimate becomes $(0.09943/25)^{1/2} = 0.06$, neglecting the effect of interlocus variation. This is more than a third of $H$. On the other hand, if 50 individuals are studied for 25 loci, it becomes 0.0062, which is 1/27 of $H$.

TABLE 1

Effects of sample size (n = number of individuals) on the intralocus variance [$V_s(h)$] of heterozygosity

                                         V_s(h)
M      H      V_g(h)    n=1      n=2      n=10     n=20     n=50
0.02   0.020  0.00630   0.01292  0.00430  0.00068  0.00033  0.00013
0.06   0.057  0.01694   0.03646  0.01206  0.00189  0.00092  0.00036
0.1    0.091  0.02539   0.05725  0.01885  0.00295  0.00143  0.00056
0.2    0.167  0.03946   0.09943  0.03235  0.00501  0.00243  0.00095
0.4    0.286  0.05002   0.15406  0.04902  0.00744  0.00361  0.00142

$M = 4Nv$. $H = M/(1 + M)$ = the expected heterozygosity. $V_g(h)$ = interlocus variance.

A similar study can be made about the effect of the number of individuals on the estimate of genetic distance. For this purpose, however, it is simpler to work with the minimum distance rather than the standard distance (see NEI and ROYCHOUDHURY 1974). The minimum distance for the $k$th locus is defined as $d_k = (\sum p_i^2 + \sum q_i^2)/2 - \sum p_i q_i$, and the distance for all loci ($D_m$) is the arithmetic mean of this quantity. An unbiased estimate of single locus genetic distance is given by

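$$\hat{d}_k = \frac{2n_X \sum x_i^2 - 1}{2(2n_X - 1)} + \frac{2n_Y \sum y_i^2 - 1}{2(2n_Y - 1)} - \sum x_i y_i, \tag{12}$$

whereas the unbiased estimate of $D_m$ is given by

$$\hat{D}_m = \sum_{k=1}^{r} \hat{d}_k / r. \tag{13}$$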
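The estimators (12) and (13) translate directly into code. The following Python sketch computes the unbiased minimum distance from allele counts and, for comparison, the unbiased standard distance in the form $-\ln[\hat{G}_{XY}/\sqrt{\hat{G}_X \hat{G}_Y}]$. It is only a sketch: the function names and the example counts are hypothetical, the two count lists for a locus are assumed to list the alleles in the same order, and the averaged gene identities are assumed to be positive so that the logarithm is defined.

```python
from math import log, sqrt


def locus_identities(counts_x, counts_y):
    """Unbiased identities for X and Y and the sample identity J_XY at one locus."""
    if len(counts_x) != len(counts_y):
        raise ValueError("allele lists must be aligned and of equal length")
    mx, my = sum(counts_x), sum(counts_y)  # 2n_X and 2n_Y genes
    x = [c / mx for c in counts_x]
    y = [c / my for c in counts_y]
    g_x = (mx * sum(v * v for v in x) - 1) / (mx - 1)
    g_y = (my * sum(v * v for v in y) - 1) / (my - 1)
    j_xy = sum(a * b for a, b in zip(x, y))
    return g_x, g_y, j_xy


def unbiased_distances(loci):
    """Return (minimum distance, standard distance) averaged over loci."""
    terms = [locus_identities(cx, cy) for cx, cy in loci]
    r = len(terms)
    G_x = sum(t[0] for t in terms) / r
    G_y = sum(t[1] for t in terms) / r
    G_xy = sum(t[2] for t in terms) / r
    # Mean over loci of (g_x + g_y)/2 - j_xy, i.e. equations (12) and (13).
    D_m = (G_x + G_y) / 2.0 - G_xy
    if G_x <= 0 or G_y <= 0 or G_xy <= 0:
        raise ValueError("standard distance undefined for non-positive identities")
    D = -log(G_xy / sqrt(G_x * G_y))
    return D_m, D


# Hypothetical data: two loci, two diploid individuals per population.
# Locus 1 is fixed for the same allele in X and Y, locus 2 for different alleles.
loci = [([4, 0], [4, 0]), ([4, 0], [0, 4])]
D_m, D = unbiased_distances(loci)

# Negative estimates are set to 0 before building a dendrogram.
D_for_tree = max(D, 0.0)
print(D_m, D, D_for_tree)
```

Both estimates can come out negative when the two samples are nearly identical and the sample sizes are small. For example, a single locus with counts `[2, 2]` in both samples gives a negative value for each.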

As with $\hat{h}_k$, $\hat{d}_k$ may be written as $\hat{d}_k = d_k + s_k$, where $s_k$ is the sampling error with mean 0 and variance $V_s(d_k)$. Again modifying NEI and ROYCHOUDHURY'S (1974) formula, the intralocus variance, $V_s(d_k)$, becomes

[equation (14) not recoverable from the source]

The variance of $\hat{d}_k$ over all loci is

$$V(d) = V_g(d) + V_s(d), \tag{15}$$

where $V_g(d)$ and $V_s(d)$ are the variance of $d_k$ and the mean of $V_s(d_k)$ over loci, respectively. Evaluation of the exact value of $V_g(d)$ is complicated, but it can be shown that it increases with increase of the mean distance, $D_m$ (LI and NEI 1975). If the mutation-drift balance is maintained in each of the two populations throughout the evolutionary process with $4Nv = 0.1$, then $V_g(d)$ is 0.00410 for $D_m = 0.018$ and 0.11156 for $D_m = 0.168$. On the other hand, $V_s(d)$ is of the same order of magnitude as $V_s(h)$ when $D_m$ is small but decreases slowly as $D_m$ increases. (The property of $V_s(d)$ is virtually the same as that of the intralocus variance of sample minimum distance, which was studied by NEI and ROYCHOUDHURY 1974.) Therefore, it is clear that if $D_m$ is as large as 0.168 and a large number of loci are examined, the number of individuals per locus can be very small. On the other hand, if $D_m$ is as small as 0.018, a considerable number of individuals must be examined. Needless to say, the variance of $\hat{D}_m$ is given by $V(d)/r$.

The sampling variance of the unbiased estimate of standard genetic distance ($\hat{D}$) and its components can be obtained again by modifying NEI and ROYCHOUDHURY’S (1974) formulae. That is, if we replace $J_X$, $J_Y$, and $J_{XY}$ in their formulae (22) and (23) by $\hat{G}_X$, $\hat{G}_Y$, and $\hat{G}_{XY}$, respectively, they are immediately obtained. However, I shall not present the results here, since they are too complicated. (They are incorporated into our new computer program.) On the other hand, the relative values of the components corresponding to $V_g(d)$ and $V_s(d)$ in (15) can be evaluated by LI and NEI’S (1975) method. The results obtained are virtually the same as those for $D_m$, and thus support GORMAN and RENZI’S (unpublished) empirical finding. It should be noted, however, that the number of individuals to be examined depends also on the level of heterozygosity (Table 1). More individuals should be examined when heterozygosity is high than when it is low.

When a dendrogram for a group of species is constructed from genetic distance estimates, the reliability of the topology of the dendrogram depends on the differences in genetic distance among different pairs of species. If these differences are small, the genetic distances must be estimated accurately. Namely, a considerable number of individuals should be examined for each locus. On the other hand, if the differences are large, even a single individual may be sufficient for obtaining the correct topology of a dendrogram.
In fact, this is exactly what GORMAN and RENZI (unpublished) observed with the Anolis roquet and A. bimaculatus group species. Another factor that affects the dendrogram is the level of heterozygosity. As discussed above, the standard error of genetic distance is large when average heterozygosity is high. Thus, in organisms with average heterozygosity higher than 0.1 a relatively large number of individuals should be examined to construct a reliable dendrogram.

Our formulae for obtaining unbiased estimates of average heterozygosity and genetic distance apply to any sample size and are superior to sample average heterozygosity and genetic distance, as long as many loci are used. However, the difference between the biased and unbiased estimators is very small when the number of individuals used is large, say more than 50.

A computer program for obtaining the unbiased estimates of average heterozygosity and (standard) genetic distance and their standard errors is available by writing to the author.

I would like to thank GEORGE C. GORMAN for showing me his unpublished manuscript. This work was supported by grants from the National Science Foundation and the Public Health Service.

LITERATURE CITED

CROW, J. F. and M. KIMURA, 1970 An Introduction to Population Genetics Theory. Harper and Row, New York.
LI, W. H. and M. NEI, 1975 Drift variances of heterozygosity and genetic distance in transient states. Genet. Res. 25: 229-248.
MITRA, S., 1976 More on Nei and Roychoudhury’s sampling variances of heterozygosity and genetic distance. Genetics 82: 543-545.
NEI, M., 1972 Genetic distance between populations. American Naturalist 106: 283-292.
NEI, M., 1973 The theory and estimation of genetic distance. pp. 45-54. In: Genetic Structure of Populations. Edited by N. E. MORTON, University Hawaii Press, Honolulu.
NEI, M. and A. K. ROYCHOUDHURY, 1974 Sampling variances of heterozygosity and genetic distance. Genetics 76: 379-390.
NEI, M., P. A. FUERST and R. CHAKRABORTY, 1976 Testing the neutral mutation hypothesis by distribution of single locus heterozygosity. Nature 262: 491-493.
STEWART, F. M., 1976 Variability in the amount of heterozygosity maintained by neutral mutations. Theoret. Popul. Biol. 9: 188-201.
WATTERSON, G. A., 1974 Models for the logarithmic species abundance distributions. Theoret. Popul. Biol. 6: 217-250.

Corresponding editor: B. S. WEIR
