ESTIMATION OF AVERAGE HETEROZYGOSITY AND GENETIC DISTANCE FROM A SMALL NUMBER OF INDIVIDUALS

MASATOSHI NEI

Center for Demographic and Population Genetics, University of Texas at Houston, Texas 77025

Manuscript received November 1, 1977
Revised copy received February 13, 1978

ABSTRACT
The magnitudes of the systematic biases involved in sample heterozygosity and sample genetic distances are evaluated, and formulae for obtaining unbiased estimates of average heterozygosity and genetic distance are developed. It is also shown that the number of individuals to be used for estimating average heterozygosity can be very small if a large number of loci are studied and the average heterozygosity is low. The number of individuals to be used for estimating genetic distance can also be very small if the genetic distance is large and the average heterozygosity of the two species compared is low.
Genetics 89: 583-590, July 1978.

Studying the sampling variance of heterozygosity and genetic distance, NEI and ROYCHOUDHURY (1974) concluded that for estimating average heterozygosity and genetic distance a large number of loci rather than a large number of individuals per locus should be used when the total number of genes to be examined is fixed. Recently, GORMAN and RENZI (unpublished) have shown that in lizards even a single individual from each species provides genetic distance estimates that are quite useful for constructing dendrograms, provided that the genetic distances between species are sufficiently large. They also confirmed NEI and ROYCHOUDHURY'S (1974) theoretical conclusion that a relatively reliable estimate of average heterozygosity can be obtained from a small number of individuals if a large number of loci are examined. In this note I shall extend NEI and ROYCHOUDHURY'S (1974) study and present a further theoretical basis for GORMAN and RENZI'S unpublished observations. I will also present statistical methods for obtaining unbiased estimates of average heterozygosity and genetic distance.

The first problem I would like to discuss is the magnitude of systematic bias introduced by a small sample size when the ordinary method of estimating average heterozygosity and genetic distance is used. Let $p_i$ be the frequency of the $i$th allele at a locus in a population and $x_i$ be the corresponding allele frequency in a sample from the population. The population heterozygosity at this locus is $h = 1 - \sum p_i^2$, where $\sum$ stands for summation over all alleles. The average heterozygosity per locus ($H$) is defined as the mean of $h$ over all structural loci in the genome. Theoretically, we assume that there are an infinite number of structural loci. We are then interested in estimating $H$ by surveying $r$ loci and $n$ (diploid) individuals per locus. Thus, there are two sampling processes involved, i.e., sampling of loci from the genome and sampling of genes ($2n$ genes) from the population at each locus. We assume that each of these samplings is conducted at random.

Usually, $H$ is estimated by a sample average heterozygosity, $\hat H_1$, which is the average of $1 - \sum x_i^2$ over the $r$ loci studied. Under the assumption of multinomial sampling of genes, the expectation of $\sum x_i^2$ for a particular locus is given by $\sum p_i^2 + (1 - \sum p_i^2)/2n$ (e.g., CROW and KIMURA 1970). Therefore, the expectation of $\hat H_1$ is
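$$E(\hat H_1) = E_g E_s\left(1 - \sum x_i^2\right) = \left(1 - \frac{1}{2n}\right) H , \qquad (1)$$

where $E_g$ and $E_s$ are the expectation operators with respect to the distribution of $h$ among loci and the multinomial distribution of $x_i$, respectively. For a single locus, an unbiased estimate of $h$ is given by

$$\hat h = 2n\left(1 - \sum x_i^2\right)/(2n - 1) , \qquad (2)$$

whereas the corresponding unbiased estimate of $H$ is

$$\hat H = \sum_{k=1}^{r} \hat h_k / r , \qquad (3)$$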
where $\hat h_k$ is the value of $\hat h$ for the $k$th locus. Here $n$ may vary from locus to locus. The estimate (3) generally has a larger expected squared deviation from $H$ than $\hat H_1$ (NEI and ROYCHOUDHURY 1974; MITRA 1976), but if a few individuals are studied for a large number of loci, the systematic bias in (1) seems to be much more serious. NEI and ROYCHOUDHURY (1974) were aware of this bias, but did not particularly recommend formula (3), since the sample size employed at that time was generally large.

The ordinary estimate of genetic distance also has a systematic bias. Let $p_i$ and $q_i$ be the frequencies of the $i$th allele in populations $X$ and $Y$, respectively, and $x_i$ and $y_i$ be the corresponding sample allele frequencies. NEI'S (1972) genetic (standard) distance is defined as

$$D = -\ln\left[G_{XY}/\sqrt{G_X G_Y}\right] , \qquad (4)$$

where $G_X$, $G_Y$, and $G_{XY}$ are the means of $\sum p_i^2$, $\sum q_i^2$, and $\sum p_i q_i$ over all loci in the genome, respectively. The usual method of estimating $D$ is to replace the population gene identities, $G_X$, $G_Y$, and $G_{XY}$, by the sample gene identities, $J_X$, $J_Y$, and $J_{XY}$, which are the averages of $\sum x_i^2$, $\sum y_i^2$, and $\sum x_i y_i$ over the $r$ loci studied, respectively. Namely, it is estimated by $\hat D_1 = -\ln\left[J_{XY}/\sqrt{J_X J_Y}\right]$. When $r$ is sufficiently large, the expectation of $\hat D_1$ is given by
$$E_g E_s(\hat D_1) \approx -\ln\left[E_s(J_{XY})/\sqrt{E_s(J_X)\,E_s(J_Y)}\right] \quad \text{(LI and NEI 1975)}$$

$$\approx D + \frac{1 - G_X}{4 n_X G_X} + \frac{1 - G_Y}{4 n_Y G_Y} , \qquad (5)$$

where $n_X$ and $n_Y$ are the numbers of individuals sampled from populations $X$ and $Y$, respectively, and $(1 - G_X)/(2n_X G_X)$ and $(1 - G_Y)/(2n_Y G_Y)$ are assumed to be small compared with unity, which is true in almost all cases. Here $E_s(\hat D_1)$ is the operator of taking the expectation of $\hat D_1$ for $r$ (given) loci with respect to the multinomial samplings of genes, whereas $E_g$ refers to taking the expectation of $E_s(\hat D_1)$ with respect to sampling of $r$ loci from the genome. Since average heterozygosity ($H = 1 - G$) is generally 0.2 or less, the bias introduced by a small sample size in $\hat D_1$ is of the same order of magnitude as that for $\hat H_1$. However, $\hat D_1$ tends to give an overestimate of $D$, rather than an underestimate. It is noted that when $D = 0$, $G_X = G_Y = G$, and $n_X = n_Y = n$, $E(\hat D_1)$ is approximately $(1 - G)/(2nG)$. Namely, even if the two populations are genetically identical with each other, the sample genetic distance can be larger than 0 when the sample size is small. NEI (1973) has called this spurious distance.

In many lizard species, the average heterozygosity is of the order of 0.06 (GORMAN and RENZI, unpublished). Therefore, the expected magnitude of the bias when a single individual is sampled from each of the two species to be compared is about 0.03. This magnitude of bias is not important if $D$ is large, say more than 0.15, but becomes serious when $D$ is very small. On the other hand, if $n_X$ and $n_Y$ are 100, the expected bias is about 0.0003, which is generally negligible.

An unbiased estimate of $D$ may be obtained by substituting the unbiased estimates of $G_X$ and $G_Y$ for $J_X$ and $J_Y$. Namely,
$$\hat D = -\ln\left[\hat G_{XY}/\sqrt{\hat G_X \hat G_Y}\right] , \qquad (6)$$

where $\hat G_X$ and $\hat G_Y$ are the averages of $(2n_X J_X - 1)/(2n_X - 1)$ and $(2n_Y J_Y - 1)/(2n_Y - 1)$ over the $r$ loci studied, respectively, and $\hat G_{XY} = J_{XY}$. It is noted that, unlike $\hat D_1$, $\hat D$ can be negative, though its absolute value should not be large. This negative value is caused by sampling error and will occur only very rarely if $n_X$ and $n_Y$ are large. A negative value of $\hat D$ creates a problem in constructing a dendrogram. I suggest that all negative values of $\hat D$ should be replaced by 0 in this case.

Let us now consider the sampling variance of the unbiased estimate of average heterozygosity. It should be noted that this variance consists of two components, i.e., interlocus variance and intralocus variance (NEI and ROYCHOUDHURY 1974). The former arises because population heterozygosity varies greatly from locus to locus. This is caused by evolutionary forces such as mutation, selection, and random genetic drift. The intralocus variance is generated primarily by the process of sampling a finite number of genes from the population. The underlying statistical model for the decomposition of the total sampling variance is as follows: For the $k$th locus, the observed heterozygosity (the unbiased estimate: $\hat h_k = 2n(1 - \sum x_i^2)/(2n - 1)$) may be written as
$$\hat h_k = h_k + s_k , \qquad (7)$$

where $h_k$ is the population heterozygosity ($1 - \sum p_i^2$), and $s_k$ is the sampling error with mean 0 and variance $V_s(h_k)$. Therefore, the variance, $V(h)$, of $\hat h_k$ over all loci (the entire genome) is

$$V(h) = V_g(h) + V_s(h) , \qquad (8)$$

where $V_g(h)$ is the variance of $h_k$ (the interlocus variance) and $V_s(h)$ is the expectation of $V_s(h_k)$ over all loci (the intralocus variance). Here we have assumed that there are linkage equilibria among different loci and genes are sampled independently at each locus. Note that the variance components in (8) are slightly different from those of NEI and ROYCHOUDHURY (1974), since they considered the sample heterozygosity, $1 - \sum x_i^2$. If we note that the unbiased estimate ($\hat H$) of $H$ is a simple average of heterozygosities for all individual loci, its variance is given by

$$V(\hat H) = V(h)/r . \qquad (9)$$
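The estimators in equations (2), (3), and (9) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the input layout (one list of allele gene counts per locus) are assumptions made here, and V(h) is estimated by the ordinary sample variance of the per-locus values, which the text does not prescribe.

```python
from math import sqrt


def unbiased_locus_heterozygosity(allele_counts):
    """Equation (2): h = 2n(1 - sum x_i^2) / (2n - 1) for a single locus.

    allele_counts holds the number of genes observed for each allele,
    so the counts sum to 2n for n diploid individuals.
    """
    genes = sum(allele_counts)  # 2n
    if genes < 2:
        raise ValueError("need at least one diploid individual (2 genes)")
    sum_sq = sum((c / genes) ** 2 for c in allele_counts)
    return genes * (1.0 - sum_sq) / (genes - 1)


def average_heterozygosity(loci):
    """Equations (3) and (9): mean of h_k over r loci and its standard error.

    V(h) is estimated here by the sample variance of h_k over loci
    (an assumption of this sketch); n may differ from locus to locus.
    """
    h = [unbiased_locus_heterozygosity(counts) for counts in loci]
    r = len(h)
    H = sum(h) / r
    if r < 2:
        return H, float("nan")  # a variance needs at least two loci
    var_h = sum((hk - H) ** 2 for hk in h) / (r - 1)
    return H, sqrt(var_h / r)


# One individual typed at four loci, heterozygous at the second locus only.
H, se = average_heterozygosity([[2, 0], [1, 1], [2, 0], [2, 0]])
print(H, se)
```

For this input the unbiased estimate is 0.25, whereas the uncorrected average of 1 - Σx² over the same four loci would be 0.125, which is the factor (1 - 1/2n) of equation (1) with n = 1.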
To evaluate the effect of the number of individuals on the accuracy of the estimate of average heterozygosity, we have to know the relative magnitudes of $V_g(h)$ and $V_s(h)$ in (8). To get a rough idea, I consider an equilibrium population in which the effects of mutation and random genetic drift are balanced with the same mutation rate for all loci, assuming no selection. If we use the infinite-allele model, the interlocus variance is given by $V_g(h) = 2M/[(M + 1)^2 (M + 2)(M + 3)]$, where $M = 4Nv$, in which $N$ and $v$ are the effective population size and the mutation rate per locus per generation, respectively (WATTERSON 1974; STEWART 1976; LI and NEI 1975). In practice, the value of $V_g(h)$ seems to be slightly larger than that given by the above formula, presumably because of interlocus variation in mutation rate and some other effects (NEI et al. 1976), but for our purpose it does not matter. (The stepwise mutation model gives a smaller interlocus variance than the infinite-allele model.)

The intralocus variance of the unbiased estimate of heterozygosity for a locus may be obtained by modifying NEI and ROYCHOUDHURY'S (1974) formula for the variance of $1 - \sum x_i^2$. It becomes
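$$V_s(h_k) = \frac{2}{2n(2n - 1)} \left\{ 2(2n - 2)\left[\sum p_i^3 - \left(\sum p_i^2\right)^2\right] + \sum p_i^2 - \left(\sum p_i^2\right)^2 \right\} . \qquad (10)$$

The expectation of this variance over all loci can be obtained by evaluating the expectations of $\sum p_i^2$, $(\sum p_i^2)^2$, and $\sum p_i^3$ with respect to the allele frequency distributions. These expectations have been evaluated by LI and NEI (1975). Using their results, we have

$$V_s(h) = \frac{2M(M + 4) + 8(n - 1)M}{2n(2n - 1)(M + 1)(M + 2)(M + 3)} . \qquad (11)$$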
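As a numerical check, the interlocus variance formula and equation (11) can be evaluated directly. The sketch below is illustrative only; the function names are chosen here and are not taken from the author's program.

```python
def interlocus_variance(M):
    """V_g(h) = 2M / [(M+1)^2 (M+2)(M+3)] under the infinite-allele model."""
    return 2.0 * M / ((M + 1) ** 2 * (M + 2) * (M + 3))


def intralocus_variance(M, n):
    """Equation (11): expected sampling variance of the unbiased
    single-locus heterozygosity for n diploid individuals."""
    numerator = 2.0 * M * (M + 4) + 8.0 * (n - 1) * M
    denominator = 2 * n * (2 * n - 1) * (M + 1) * (M + 2) * (M + 3)
    return numerator / denominator


# Compare the two components for M = 0.2 (H = M/(1+M), about 0.167).
for n in (1, 10, 50):
    print(n, interlocus_variance(0.2), intralocus_variance(0.2, n))
```

With M = 0.2 and n = 1 the intralocus variance evaluates to about 0.09943, and dividing it by 25 loci and taking the square root gives a standard error of about 0.063.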
The values of $V_g(h)$ and $V_s(h)$ for various values of $M$ and $n$ are given in Table 1. It is clear that for $n = 1$, $V_s(h)$ is larger than $V_g(h)$, but $V_s(h)$ rapidly decreases as $n$ increases. With $n = 10$, $V_s(h)$ is nearly one-tenth of $V_g(h)$. This clearly indicates that in order to reduce the sampling error of average heterozygosity we must examine a large number of loci rather than a large number of individuals per locus. Of course, if one wants to study not only the average heterozygosity but also the allele frequency distribution for each locus, he must examine a large number of individuals.

However, some warning against using an extremely small number of individuals should be mentioned. The above argument assumes that a large number of loci are available for study. In practice, technical difficulties often limit the number of loci studied. In fact, fewer than 30 loci were studied in most recent protein surveys. This number is small; ideally, more than 50 loci should be used to obtain a reliable estimate of average heterozygosity for the total genome. If this cannot be done technically, a large number of individuals studied per locus still helps to reduce the standard error of average heterozygosity. In Table 1, for example, if $H$ is 0.167, the expected intralocus variance ($V_s(h)$) is 0.09943 for $n = 1$. Thus, if one individual is examined for 25 loci, the expected standard error of the average heterozygosity estimate becomes $(0.09943/25)^{1/2} = 0.06$, neglecting the effect of interlocus variation. This is more than a third of $H$. On the other hand, if 50 individuals are studied for 25 loci, it becomes 0.0062, which is 1/27 of $H$.

TABLE 1
Effects of sample size (n = number of individuals) on the intralocus variance [V_s(h)] of heterozygosity

                                  V_s(h)
  M      H      V_g(h)    n = 1     n = 2     n = 10    n = 20    n = 50
 0.02   0.020   0.00630   0.01292     -       0.00068   0.00033   0.00013
 0.06   0.057   0.01694   0.03646     -       0.00189   0.00092   0.00036
 0.1    0.091   0.02539   0.05725   0.01885   0.00295   0.00143   0.00056
 0.2    0.167   0.03946   0.09943   0.03235   0.00501     -       0.00095
 0.4    0.286   0.05002   0.15406   0.04902   0.00744     -       0.00142

M = 4Nv. H = M/(1 + M) = the expected heterozygosity. V_g(h) = interlocus variance. A dash marks an entry that is illegible in this copy.

A similar study can be made about the effect of the number of individuals on the estimate of genetic distance. For this purpose, however, it is simpler to work with the minimum distance rather than the standard distance (see NEI and ROYCHOUDHURY 1974). The minimum distance for the $k$th locus is defined as $\delta_k = (\sum p_i^2 + \sum q_i^2)/2 - \sum p_i q_i$, and the distance for all loci ($D_m$) is the arithmetic mean of this quantity. An unbiased estimate of single locus genetic distance is given by
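$$\hat d_k = \frac{2n_X \sum x_i^2 - 1}{2(2n_X - 1)} + \frac{2n_Y \sum y_i^2 - 1}{2(2n_Y - 1)} - \sum x_i y_i , \qquad (12)$$

whereas the unbiased estimate of $D_m$ is given by

$$\hat D_m = \sum_{k=1}^{r} \hat d_k / r . \qquad (13)$$

As with $\hat h_k$, $\hat d_k$ may be written as $\hat d_k = \delta_k + s_k$, where $s_k$ is the sampling error with mean 0 and variance $V_s(d_k)$. Again modifying NEI and ROYCHOUDHURY'S (1974) formula, the intralocus variance, $V_s(d_k)$, becomes

[equation (14) is missing from this copy]

The variance of $\hat d_k$ over all loci is

$$V(d) = V_g(d) + V_s(d) , \qquad (15)$$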
where $V_g(d)$ and $V_s(d)$ are the variance of $\delta_k$ and the mean of $V_s(d_k)$ over loci, respectively. Evaluation of the exact value of $V_g(d)$ is complicated, but it can be shown that it increases with increase of the mean distance, $D_m$ (LI and NEI 1975). If the mutation-drift balance is maintained in each of the two populations throughout the evolutionary process with $4Nv = 0.1$, then $V_g(d)$ is 0.00410 for $D_m = 0.018$ and 0.11156 for $D_m = 0.168$. On the other hand, $V_s(d)$ is of the same order of magnitude as $V_s(h)$ when $D_m$ is small but decreases slowly as $D_m$ increases. (The property of $V_s(d)$ is virtually the same as that of the intralocus variance of sample minimum distance, which was studied by NEI and ROYCHOUDHURY 1974.) Therefore, it is clear that if $D_m$ is as large as 0.168 and a large number of loci are examined, the number of individuals per locus can be very small. On the other hand, if $D_m$ is as small as 0.018, a considerable number of individuals must be examined. Needless to say, the variance of $\hat D_m$ is given by $V(d)/r$.

The sampling variance of the unbiased estimate of standard genetic distance ($\hat D$) and its components can be obtained again by modifying NEI and ROYCHOUDHURY'S (1974) formulae. That is, if we replace $J_X$, $J_Y$, and $J_{XY}$ in their formulae (22) and (23) by $\hat G_X$, $\hat G_Y$, and $\hat G_{XY}$, respectively, they are immediately obtained. However, I shall not present the results here, since they are too complicated. (They are incorporated into our new computer program.) On the other hand, the relative values of the components corresponding to $V_g(d)$ and $V_s(d)$ in (15) can be evaluated by LI and NEI'S (1975) method. The results obtained are virtually the same as those for $D_m$, and thus support GORMAN and RENZI'S (unpublished) empirical finding. It should be noted, however, that the number of individuals to be examined depends also on the level of heterozygosity (Table 1). More individuals should be examined when heterozygosity is high than when it is low.

When a dendrogram for a group of species is constructed from genetic distance estimates, the reliability of the topology of the dendrogram depends on the differences in genetic distance among different pairs of species. If these differences are small, the genetic distances must be estimated accurately. Namely, a considerable number of individuals should be examined for each locus. On the other hand, if the differences are large, even a single individual may be sufficient for obtaining the correct topology of a dendrogram.
In fact, this is exactly what GORMAN and RENZI (unpublished) observed with the Anolis roquet and A. bimaculatus group species. Another factor that affects the dendrogram is the level of heterozygosity. As discussed above, the standard error of genetic distance is large when average heterozygosity is high. Thus, in organisms with average heterozygosity higher than 0.1 a relatively large number of individuals should be examined to construct a reliable dendrogram.

Our formulae for obtaining unbiased estimates of average heterozygosity and genetic distance apply to any sample size and are superior to sample average heterozygosity and genetic distance, as long as many loci are used. However, the difference between the biased and unbiased estimators is very small when the number of individuals used is large, say more than 50. A computer program for obtaining the unbiased estimates of average heterozygosity and (standard) genetic distance and their standard errors is available by writing to the author.

I would like to thank GEORGE C. GORMAN for showing me his unpublished manuscript. This work was supported by grants from the National Science Foundation and the Public Health Service.
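For readers who want to try the distance estimators, the biased estimate $\hat D_1$, the unbiased estimate of equation (6), and the approximate spurious distance $(1 - G)/(2nG)$ can be sketched as follows. This is not the author's program: the function names and the input layout (one list of sample allele frequencies per locus, alleles in the same order in both populations) are assumptions made here, and the sketch further assumes the same number of individuals at every locus.

```python
from math import log, sqrt


def gene_identities(x_loci, y_loci):
    """Return (J_X, J_Y, J_XY): averages over loci of
    sum x_i^2, sum y_i^2, and sum x_i*y_i."""
    r = len(x_loci)
    jx = sum(sum(x * x for x in xs) for xs in x_loci) / r
    jy = sum(sum(y * y for y in ys) for ys in y_loci) / r
    jxy = sum(sum(x * y for x, y in zip(xs, ys))
              for xs, ys in zip(x_loci, y_loci)) / r
    return jx, jy, jxy


def standard_distance(x_loci, y_loci, n_x, n_y, unbiased=True):
    """Standard genetic distance from sample allele frequencies.

    unbiased=False gives D_1 = -ln[J_XY / sqrt(J_X J_Y)].
    unbiased=True replaces J_X and J_Y by (2nJ - 1)/(2n - 1) as in
    equation (6) and sets negative values to 0, as the text suggests.
    """
    jx, jy, jxy = gene_identities(x_loci, y_loci)
    if unbiased:
        jx = (2 * n_x * jx - 1) / (2 * n_x - 1)
        jy = (2 * n_y * jy - 1) / (2 * n_y - 1)
    if jx <= 0 or jy <= 0 or jxy <= 0:
        raise ValueError("gene identity is not positive; distance undefined")
    d = -log(jxy / sqrt(jx * jy))
    return max(d, 0.0) if unbiased else d


def spurious_distance(G, n):
    """Approximate E(D_1) when D = 0, G_X = G_Y = G, n_X = n_Y = n."""
    return (1.0 - G) / (2.0 * n * G)


# Two loci, one individual per population; X is heterozygous at locus 2.
x = [[1.0, 0.0], [0.5, 0.5]]
y = [[1.0, 0.0], [1.0, 0.0]]
print(standard_distance(x, y, 1, 1, unbiased=False))
print(standard_distance(x, y, 1, 1))
print(spurious_distance(0.94, 1), spurious_distance(0.94, 100))
```

In this toy example the biased estimate is positive, while the corrected estimate is slightly negative before truncation and is therefore reported as 0. The last line evaluates the spurious distance for an average heterozygosity of 0.06 with 1 and with 100 individuals per population.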
LITERATURE CITED
CROW, J. F. and M. KIMURA, 1970 An Introduction to Population Genetics Theory. Harper and Row, New York.
LI, W. H. and M. NEI, 1975 Drift variances of heterozygosity and genetic distance in transient states. Genet. Res. 25: 229-248.
MITRA, S., 1976 More on Nei and Roychoudhury's sampling variances of heterozygosity and genetic distance. Genetics 82: 543-545.
NEI, M., 1972 Genetic distance between populations. American Naturalist 106: 283-292.
NEI, M., 1973 The theory and estimation of genetic distance. pp. 45-54. In: Genetic Structure of Populations. Edited by N. E. MORTON. University Hawaii Press, Honolulu.
NEI, M. and A. K. ROYCHOUDHURY, 1974 Sampling variances of heterozygosity and genetic distance. Genetics 76: 379-390.
NEI, M., P. A. FUERST and R. CHAKRABORTY, 1976 Testing the neutral mutation hypothesis by distribution of single locus heterozygosity. Nature 262: 491-493.
STEWART, F. M., 1976 Variability in the amount of heterozygosity maintained by neutral mutations. Theoret. Popul. Biol. 9: 188-201.
WATTERSON, G. A., 1974 Models for the logarithmic species abundance distributions. Theoret. Popul. Biol. 6: 217-250.

Corresponding editor: B. S. WEIR