Reparametrize the model by defining \(\pi \equiv P(Y =0) = p+(1-p)e^{-\lambda}\). Solve for \(p\) in terms of \(\pi\) and \(\lambda\), then substitute so that the density only depends on them.
Let \(n_0\) represent the number of samples in an iid sample of size \(n\). Assuming the complete data is available, show that the likelihood factors into two pieces and that \(\hat{\pi} = n_0/n\). Also show that the MLE for \(\lambda\) is the solution to a simple nonlinear equation involving \(\bar{Y}_+\).
Thus, \(\hat{\lambda}_{MLE}\) will be the solution to a simple non-linear equation involving \(\bar{y}_+\), sketched below.
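For reference, a sketch of that equation (my own derivation from the factored complete-data likelihood, with \(\bar{y}_+\) denoting the mean of the \(n-n_0\) nonzero observations):
\[
\bar{y}_+ = \frac{\lambda}{1-e^{-\lambda}},
\]
which has no closed form and must be solved numerically for \(\hat{\lambda}_{MLE}\).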
c)
Now consider the truncated or conditional sample consisting of the \(n- n_0\) nonzero values. Write down the conditional likelihood for these values and obtain the same equation for \(\hat{\lambda}_{MLE}\) as in b).
Thus, we need to maximize the same function we found in part b), so \(\hat{\lambda}_{MLE}\) is again the solution to the same simple non-linear equation involving \(\bar{y}_+\).
2.4
In sampling land areas for counts of an animal species, we obtain an iid sample of counts \(Y_1, \dots, Y_n\), where each \(Y_i\) has a Poisson distribution with parameter \(\lambda\).
a) Derive the maximum likelihood estimator of \(\lambda\). Call it \(\hat{\lambda}_{MLE_1}\)
For simplicity in quadrat sampling, sometimes only the presence or absence of a species is recorded. Let \(n_0\) be the number of \(Y_i\)’s that are zero. Write down the binomial likelihood based only on \(n_0\). Show that the maximum likelihood estimator of \(\lambda\) based only on \(n_0\) is \(\hat{\lambda}_{MLE_2}= -\log(n_0/n)\).
Use the delta theorem from Ch. 1 to show that the asymptotic relative efficiency of \(\hat{\lambda}_{MLE_1}\) to \(\hat{\lambda}_{MLE_2}\) is \(\frac{e^{\lambda}-1}{\lambda}\).
The overall goal of the sampling is to estimate the mean number of the species per unit land area. Comment on the use of \(\hat{\lambda}_{MLE_2}\) in place of \(\hat{\lambda}_{MLE_1}\). That is, explain to a researcher for what values and under what distributional assumptions is it reasonable?
Solution:
Since \(e^{\lambda}-1 > \lambda\) for all \(\lambda>0\), the ratio always exceeds 1, so \(\hat{\lambda}_{MLE_1}\) is always asymptotically more efficient than \(\hat{\lambda}_{MLE_2}\). However, the distributional assumptions behind the two estimators are inherently different. \(\hat{\lambda}_{MLE_2}\) requires only presence-absence data and uses the Poisson assumption only through \(P(Y=0)=e^{-\lambda}\), so it is reasonable when full counts are unavailable, or when zero-inflated Poisson data are suspected; the efficiency loss is small when \(\lambda\) is small. If we are confident the counts truly follow a Poisson distribution and full counts are recorded, \(\hat{\lambda}_{MLE_1}\) uses all of the data and better matches the distributional assumptions.
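As a quick check of this conclusion, a small simulation sketch (my own, not part of the assignment; the choices of `lambda`, `n`, and `n_sim` are arbitrary) compares the empirical variances of the two estimators with the theoretical ratio \((e^{\lambda}-1)/\lambda\):

```r
# Compare the two Poisson estimators by simulation (sketch, not the graded solution)
set.seed(1)
lambda <- 1.5
n      <- 200
n_sim  <- 5000
mle1 <- mle2 <- numeric(n_sim)
for (i in 1:n_sim) {
  y <- rpois(n, lambda)
  mle1[i] <- mean(y)             # full-data MLE: sample mean
  mle2[i] <- -log(mean(y == 0))  # presence-absence MLE: -log(n0/n)
}
var(mle2) / var(mle1)            # empirical variance ratio
(exp(lambda) - 1) / lambda       # theoretical ARE of MLE1 to MLE2
```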
2.9
The sample \(Y_1,\dots, Y_n\) is iid with distribution \(F_Y(y;p_0,p_1,\alpha, \beta)= p_0I(0 \leq y) + (1-p_0-p_1)F(y;\alpha,\beta) + p_1I(y\geq 1)\), where \(F(y;\alpha,\beta)\) is the beta distribution, and \(n_0\) and \(n_1\) denote the numbers of \(Y_i\) equal to 0 and 1, respectively. Use the \(2h\) method to show that the likelihood is \(p_0^{n_0}p_1^{n_1}(1-p_0-p_1)^{n-n_0-n_1} \prod_{0<Y_i<1} f(Y_i;\alpha,\beta)\).
For an iid sample \(Y_1,\dots, Y_n\), Type II censoring occurs when we observe only the smallest \(r\) values. For example, in a study of light bulb lifetimes, we might stop the study after the first \(r=10\) bulbs have failed. Assuming a continuous distribution with density \(f(y;\mathbf{\theta})\), the likelihood is just the joint density of the smallest \(r\) order statistics evaluated at those order statistics: \[\mathcal L\left(\mathbf \theta; Y_{(1)}, \dots, Y_{(r)}\right) = \frac {n!}{(n-r)!}\left[\prod_{i=1}^r f\left(Y_{(i)}; \mathbf \theta \right) \right]\left[1-F\left(Y_{(r)}; \mathbf \theta \right)\right]^{n-r}\] For this situation, let \(f(y;\sigma) = e^{-y/\sigma}/\sigma\) and find the MLE of \(\sigma\).
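A sketch of the solution (my own derivation, giving the standard "total time on test" estimator): substituting the exponential density and CDF into the censored likelihood gives
\[
\log \mathcal L(\sigma) = \text{const} - r\log\sigma - \frac{1}{\sigma}\left[\sum_{i=1}^r Y_{(i)} + (n-r)Y_{(r)}\right],
\]
and setting the derivative with respect to \(\sigma\) equal to zero yields
\[
\hat\sigma = \frac{\sum_{i=1}^r Y_{(i)} + (n-r)Y_{(r)}}{r}.
\]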
The standard Box-Cox regression model (Box and Cox 1964) assumes that after transformation of the observed \(Y_i\) to \(Y_i^{(\lambda)}\), we have the linear model \[ Y_i^{(\lambda)} = \mathbf{x_i^T \beta} + e_i,~~~~~ i =1,\dots,n \] where \(Y_i\) is assumed positive and the \(\mathbf x_i\), \(i= 1, \dots, n\), are known covariate vectors. In addition assume that \(e_1,\dots,e_n\) are iid normal\((0, \sigma^2)\) errors. Recall that the Box-Cox transformation is defined in Problem 2.1 (p. 107) and is strictly increasing for all \(\lambda\). Show that the likelihood is \[\mathcal L \left(\beta,\sigma,\lambda| \{Y_i, \mathbf x_i\}_{i=1}^n \right) = \left(\sqrt{2\pi}\sigma\right)^{-n} \exp\left[-\sum_{i=1}^n \frac{ \left(Y_i^{(\lambda)} - \mathbf{x}_i^T\mathbf \beta \right)^2}{2 \sigma^2} \right] \prod_{i=1}^n \left|\frac{\partial t^{(\lambda)}}{\partial t} \bigg\rvert_{t= Y_i} \right|\]
Solution:
We know from the Jacobian method of transformations that \[f_Y(y;\theta) = f_{Y^{(\lambda)}} (y^{(\lambda)};\theta)\left|J\right|=f_{Y^{(\lambda)}} (y^{(\lambda)};\theta)\left|\frac{\partial t^{(\lambda)}}{\partial t} \bigg\rvert_{t= y}\right|.\]
Thus, since the \(Y_i^{(\lambda)}\) are independent with \(Y_i^{(\lambda)} \sim N(\mathbf{x}_i^T \boldsymbol\beta, \sigma^2)\), multiplying the normal densities of the \(Y_i^{(\lambda)}\) together with the Jacobian factors gives the likelihood displayed above.
One version of the negative binomial probability mass function is given by \[f(y;\mu,k) = \frac{\Gamma(y+k)}{\Gamma(k)\Gamma(y+1)} \left( \frac{k}{\mu+k}\right)^k \left( 1- \frac{k}{\mu+k}\right)^y I(y \in \mathbb N)\] where \(\mu\) and \(k\) are parameters. Assume that \(k\) is known and put \(f(y; \mu,k)\) in the GLM form (2.14, p. 53), identifying \(b(\theta)\), etc., and derive the mean and variance of \(Y\): \(E(Y) = \mu\), \(Var(Y)= \mu + \mu^2/k\).
\[\begin{aligned}
c(y)&=\log \left({\Gamma(y+k)} \right)- \log \left( \Gamma(k)\right)- \log \left({\Gamma(y+1)} \right) + k \log \left({k}\right) \\
b(\theta) &= k \log \left({\mu+k}\right) \\
&= k \log \left((\mu + k)\frac{k}{k}\right) \\
&= k \log \left(\frac{k}{\frac{k}{(\mu + k)}}\right) \\
&= k \log \left(\frac{k}{\frac{\mu+k - \mu}{(\mu + k)}}\right) \\
&= k \log \left(\frac{k}{1-\frac{\mu}{(\mu + k)}}\right) \\
&= k \log \left(\frac{k}{1-e^{\theta}}\right)
\end{aligned}\]
With \(\theta = \log\left(\frac{\mu}{\mu+k}\right)\) (so that \(e^{\theta} = \mu/(\mu+k)\)) and \(a(\phi)=1\), and because \(k\) is known, I have shown that this parametrization of the negative binomial distribution can be manipulated into generalized linear model form.
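To verify the mean and variance asked for in the problem (a sketch of my own, using the identifications above): since \(b(\theta) = k\log\left(\frac{k}{1-e^{\theta}}\right) = k\log k - k\log(1-e^{\theta})\),
\[
b'(\theta) = \frac{k e^{\theta}}{1-e^{\theta}} = \mu, \qquad
b''(\theta) = \frac{k e^{\theta}}{(1-e^{\theta})^2} = \frac{\mu(\mu+k)}{k} = \mu + \frac{\mu^2}{k},
\]
so \(E(Y) = b'(\theta) = \mu\) and \(Var(Y) = a(\phi)\,b''(\theta) = \mu + \mu^2/k\), as claimed.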
The usual gamma density is given by \[f(y;\alpha,\beta) = \frac{1}{\Gamma (\alpha)\beta^\alpha}y^{\alpha-1}e^{-y/\beta} I(0 \leq y < \infty) ~~~ \alpha,\beta>0\] and has mean \(\alpha \beta\) and variance \(\alpha \beta^2\). First reparameterize by letting \(\mu= \alpha \beta\) so that the parameter vector is now \((\mu, \alpha)\). Now put this gamma family in the form of a generalized linear model, identifying \(\theta, b(\theta), a_i(\phi)\), and \(c(y,\phi)\). Note that \(\alpha\) is unknown here and should be related to \(\phi\). Make sure that \(b(\theta)\) is actually a function of \(\theta\) and take the first derivative to verify that \(b(\theta)\) is correct.
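A sketch of one way to do this (my own outline, not a graded solution): writing the density with \(\beta = \mu/\alpha\) gives
\[
f(y;\mu,\alpha) = \exp\left\{\frac{-y/\mu - \log\mu}{1/\alpha} + \alpha\log\alpha + (\alpha-1)\log y - \log\Gamma(\alpha)\right\},
\]
so that \(\theta = -1/\mu\), \(b(\theta) = \log\mu = -\log(-\theta)\), \(\phi = 1/\alpha\) with \(a_i(\phi) = \phi\), and \(c(y,\phi)\) collects the remaining terms. Then \(b'(\theta) = -1/\theta = \mu\), confirming that \(b(\theta)\) is correct.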
Consider the standard one-way ANOVA situation with \(Y_{ij}\) distributed as \(N(\mu_i,\sigma^2)\), \(i=1,\dots,k\), \(j=1,\dots,n_i\) and all the random variables are independent.
a.
Form the log likelihood, take derivatives, and show that the MLEs are \(\hat \mu_i = \bar{Y_i}\), \(i=1,\dots,k\), \(\hat \sigma^2 = SSE/N\), where \(SSE= \sum_{i=1}^k \sum_{j=1}^{n_i} (Y_{ij}-\bar{Y_i})^2\) and \(N=\sum_{i=1}^k n_i\).
Now define \(\mathbf V_i^T = (Y_{i1} -\bar {Y_i}, \dots, Y_{i,n_i-1}- \bar{Y_i})\). Using standard matrix manipulations with the multivariate normal distribution, the density of \(\mathbf V_i\) is given by \[(2 \pi)^{-(n_i-1)/2}n_i^{1/2}\sigma^{-(n_i-1)} \exp \left(- \frac{1}{2 \sigma ^2} \mathbf v_i^T [I_{n_i-1} + J_{n_i -1}] \mathbf v_i \right) \] where \(I_{n_i-1}\) is the \(n_i-1\) by \(n_i-1\) identity matrix and \(J_{n_i-1}\) is an \(n_i-1\) by \(n_i-1\) matrix of 1’s. Now form the (marginal) likelihood based on \(\mathbf V_1, \dots, \mathbf V_k\) and show that the MLE for \(\sigma^2\) is now \(\hat\sigma^2 = SSE/(N-k)\).
Finally, let us take a more general approach and assume that \(Y\) has an \(N\) dimensional multivariate normal distribution with mean \(X \mathbf{\beta}\) and covariance matrix \(\Sigma = \sigma^2 Q(\boldsymbol{\theta})\), where X is an \(N\times p\) full rank matrix of known constants, \(\boldsymbol \beta\) is a \(p\)-vector of regression parameters, and \(Q(\boldsymbol{\theta})\) is an \(N\times N\) standardized covariance matrix depending on the unknown parameter \(\boldsymbol{\theta}\). Typically \(\boldsymbol{\theta}\) would consist of variance component and/or spatial correlation parameters. We can concentrate the likelihood by noting that if \(Q(\boldsymbol{\theta})\) were known, then the generalized least squares estimator would be \(\boldsymbol{\hat \beta(\theta)} =(X^T Q(\boldsymbol{\theta})^{-1}X)^{-1}X^TQ(\boldsymbol{\theta})^{-1} \mathbf Y\). Substituting for \(\boldsymbol \beta\) yields the profile log likelihood \[-\frac{N}{2} \log(2\pi) - N \log \sigma- \frac{1}{2} \log |Q(\boldsymbol{\theta})| - \frac{GSSE(\boldsymbol{\theta})}{2\sigma^2},\] where \(GSSE(\boldsymbol{\theta}) =(\mathbf Y - X \boldsymbol{\hat{\beta}}(\boldsymbol{\theta}))^T Q(\boldsymbol{\theta})^{-1}(\mathbf Y -X\boldsymbol{\hat \beta}(\boldsymbol{\theta}))\). To connect with part a), let \(Q(\boldsymbol{\theta})\) be the identity matrix (so that \(GSSE(\boldsymbol{\theta})\) is just \(SSE\)) and find the maximum likelihood estimator of \(\sigma^2\).
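A sketch of the requested answer (my own computation): with \(Q(\boldsymbol\theta) = I\), the determinant term vanishes and \(GSSE = SSE\), so differentiating the profile log likelihood with respect to \(\sigma\) gives
\[
-\frac{N}{\sigma} + \frac{SSE}{\sigma^3} = 0 \quad\Longrightarrow\quad \hat\sigma^2 = \frac{SSE}{N},
\]
matching the MLE from part a).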
Continuing part c), the REML approach is to transform to \(\mathbf V = A^T \mathbf Y\), where the \(N-p\) columns of the \(N \times (N-p)\) matrix \(A\) are linearly independent and \(A^T X =0\) so that \(\mathbf V\) is \(MN(0;A^T \Sigma A)\). A special choice of \(A\) leads to the REML log likelihood \[-\frac{N-p}{2}\log(2\pi) -(N-p)\log \sigma - \frac{1}{2} \log |X^T Q(\boldsymbol{\theta} )^{-1}X|-\frac{1}{2} \log |Q(\boldsymbol{\theta})| - \frac{GSSE(\boldsymbol{\theta})}{2 \sigma^2}.\] To connect with part b), let \(Q(\boldsymbol{\theta})\) be the identity matrix and find the maximum likelihood estimator of \(\sigma^2\).
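Similarly (again a sketch of mine), with \(Q(\boldsymbol\theta) = I\) the determinant terms do not involve \(\sigma\), so differentiating the REML log likelihood with respect to \(\sigma\) gives
\[
-\frac{N-p}{\sigma} + \frac{SSE}{\sigma^3} = 0 \quad\Longrightarrow\quad \hat\sigma^2 = \frac{SSE}{N-p},
\]
the degrees-of-freedom-corrected estimator from part b) (with \(p = k\) in the one-way ANOVA setting).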
If \(\mathbf Y\) is from an exponential family where \((\mathbf W, \mathbf V)\) are jointly sufficient for \((\boldsymbol \theta_1, \boldsymbol \theta_2)\), then the conditional density of \(\mathbf W | \mathbf V\) is free of the nuisance parameter \(\boldsymbol \theta_2\) and can be used as a conditional likelihood for estimating \(\boldsymbol \theta_1\). In some cases it may be difficult to find the conditional density. However, from (2.19, p. 57) we have \[\frac{f_{\mathbf Y}(\mathbf y; \boldsymbol \theta_1, \boldsymbol \theta_2)}{f_{\mathbf V}(\mathbf v; \boldsymbol \theta_1, \boldsymbol \theta_2)}= f_{\mathbf{W|V}}(\mathbf{w|v}; \boldsymbol \theta_1).\] Thus, if you know the density of \(\mathbf Y\) and of \(\mathbf V\), then you can get a conditional likelihood equation.
a.
Now let \(Y_1, \dots, Y_n\) be iid \(N(\mu,\sigma^2)\), \(V=\bar Y\), and \(\boldsymbol \theta = (\sigma, \mu)^T\). Form the ratio above and note that it is free of \(\mu\). (It helps to remember that \(\sum (Y_i-\mu)^2= \sum (Y_i- \bar Y)^2 + n(\bar Y - \mu)^2\).)
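A sketch of the ratio (my own algebra, using the hint and \(V = \bar Y \sim N(\mu,\sigma^2/n)\)):
\[
\frac{f_{\mathbf Y}(\mathbf y;\sigma,\mu)}{f_{V}(\bar y;\sigma,\mu)}
= \frac{(2\pi\sigma^2)^{-n/2}\exp\left\{-\sum_{i=1}^n (y_i-\mu)^2/(2\sigma^2)\right\}}
       {(2\pi\sigma^2/n)^{-1/2}\exp\left\{-n(\bar y-\mu)^2/(2\sigma^2)\right\}}
= n^{-1/2}(2\pi\sigma^2)^{-(n-1)/2}\exp\left\{-\frac{\sum_{i=1}^n (y_i-\bar y)^2}{2\sigma^2}\right\},
\]
which is free of \(\mu\), as claimed.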
Consider the normal theory linear measurement error model \[\begin{aligned} Y_i = \alpha + \beta U_i +\sigma_e e_i, & X_i =U_i+ \sigma Z_i, & i = 1,\dots,n \end{aligned}\] where \(e_1, \dots, e_n\), \(Z_1,\dots, Z_n\) are iid \(N(0,1)\) random variables, \(\sigma^2\) is known, and \(\alpha,\beta, \sigma_e\), and \(U_1, \dots, U_n\) are unknown parameters.
a.
Let \(s_X^2\) denote the sample variance of \(X_1,\dots, X_n\). Show that \(E(s_X^2)=s_U^2+\sigma^2\) where \(s_U^2\) is the sample variance of \(\{U_i\}_1^n\).
Solution:
\[
\bar X = \sum_{i=1}^n (U_i+\sigma Z_i)/n = \bar U + \sigma \bar Z, \bar Z \sim N(0,1/n)
\]\[\begin{aligned}
E(s_X^2) &= E \left(\frac{\sum_{i=1}^n (X_i -\bar X)^2}{n-1} \right) = E \left(\frac{\sum_{i=1}^n \left(X_i^2 -2X_i\bar X + \bar X^2\right)}{n-1} \right)\\
&= E \left(\frac{\sum_{i=1}^n (U_i + \sigma Z_i)^2 -2(U_i + \sigma Z_i)(\bar U + \sigma \bar Z) +(\bar U + \sigma \bar Z)^2}{n-1} \right)\\
&= E \left(\frac{\sum_{i=1}^n U_i^2 + 2\sigma U_i Z_i+ \sigma^2 Z_i^2 -2(U_i \bar U + \sigma U_i \bar Z +\sigma Z_i \bar U + \sigma^2Z_i \bar Z) +\bar U^2 + 2\sigma \bar U \bar Z + \sigma^2\bar Z^2}{n-1} \right)\\
&= E \left(\frac{\sum_{i=1}^n U_i^2-2U_i \bar U+\bar U^2 + \sigma^2 Z_i^2 -2 \sigma^2Z_i \bar Z + \sigma^2\bar Z^2}{n-1} \right)\\
&= E \left(\frac{\sum_{i=1}^n(U_i- \bar U)^2}{n-1} \right) + \sigma^2 E \left(\frac{\sum_{i=1}^n(Z_i- \bar Z)^2}{n-1} \right)\\
&= s_U^2 + \sigma^2 E \left(\frac{\sum_{i=1}^n(Z_i- \bar Z)^2}{n-1} \right)
\end{aligned}
\] Note that, \[ \begin{aligned}
(n-1) \left(\frac{\sum_{i=1}^n(Z_i- \bar Z)^2}{n-1} \right)\sim \chi^2(n-1) \\
\implies E \left[(n-1) \left(\frac{\sum_{i=1}^n(Z_i- \bar Z)^2}{n-1} \right)\right] &= n-1 \\
\implies E \left(\frac{\sum_{i=1}^n(Z_i- \bar Z)^2}{n-1} \right) &= 1
\end{aligned}
\] Therefore, \(E(s_X^2)= s_U^2 + \sigma^2 E \left(\frac{\sum_{i=1}^n(Z_i- \bar Z)^2}{n-1} \right) =s_U^2 + \sigma^2\).
For the remainder of this problem assume that \(s_U^2 \to \sigma_U^2\) as \(n \to \infty\) and that \(s_X^2\) converges in probability to \(\sigma_U^2 + \sigma^2\). (Note that even though \(\{U_i\}_1^n\) are parameters it still makes sense to talk about their sample variance, and denoting the limit of \(s_U^2\) as \(n \to \infty\) by \(\sigma_U^2\) is simply a matter of convenience).
b.
Show that the estimate of slope from the least squares regression of \(\{Y_i\}_1^n\) on \(\{X_i\}_1^n\) (call it \(\hat{\beta}_{Y|X}\)) is not consistent for \(\beta\) as \(n \to \infty\). This shows that it is not OK to simply ignore the measurement error in the predictor variable.
Solution:
It is a well-known result that in SLR \(\hat{\beta}_{Y|X}= \frac{\widehat{Cov(X,Y)}}{s^2_X}\). Thus, \[ \begin{aligned}
\lim_{n \to \infty}\hat{\beta}_{Y|X} &= \lim_{n \to \infty} \frac{Cov(X,Y)}{s^2_X} \\
&= \frac{1}{\sigma_U^2+ \sigma^2} \lim_{n \to \infty} \frac{\sum_{i=1}^n(X_i-\bar X)(Y_i-\bar Y)}{n-1} \\
&= \frac{1}{\sigma_U^2+ \sigma^2} \lim_{n \to \infty} \frac{\sum_{i=1}^nX_iY_i-X_i\bar Y-\bar XY_i+\bar X\bar Y}{n-1} \\
X_iY_i &= (U_i + \sigma Z_i)(\alpha + \beta U_i +\sigma_e e_i) \\
&= \alpha U_i + \beta U_i^2 +\sigma_e U_ie_i +\alpha \sigma Z_i+ \beta \sigma U_i Z_i +\sigma_e\sigma e_i Z_i \\
X_i\bar Y &= (U_i + \sigma Z_i)(\alpha + \beta \bar U +\sigma_e \bar e) \\
&= \alpha U_i + \beta U_i \bar U +\sigma_e U_i \bar e +\alpha \sigma Z_i+ \beta \sigma \bar U Z_i +\sigma_e\sigma \bar e Z_i \\
\bar XY_i &= (\bar U + \sigma \bar Z)(\alpha + \beta U_i +\sigma_e e_i) \\
&= \alpha \bar U + \beta U_i \bar U +\sigma_e \bar Ue_i +\alpha \sigma \bar Z+ \beta \sigma U_i \bar Z +\sigma_e\sigma e_i \bar Z \\
\bar X \bar Y &= (\bar U + \sigma \bar Z)(\alpha + \beta \bar U +\sigma_e \bar e) \\
&= \alpha \bar U + \beta \bar U^2 +\sigma_e \bar U \bar e +\alpha \sigma \bar Z+ \beta \sigma \bar U \bar Z +\sigma_e\sigma \bar e \bar Z \\
\implies \lim_{n \to \infty}\hat{\beta}_{Y|X} &= \frac{1}{\sigma_U^2+ \sigma^2} \lim_{n \to \infty} \frac{\sum_{i=1}^n \beta (U_i^2-2 U_i\bar U +\bar U^2)+\sigma_e(U_i e_i -U_i \bar e -\bar U e_i + \bar U \bar e)}{n-1} \\
&~~~~~+\frac{\sum_{i=1}^n \beta \sigma (U_iZ_i- \bar U Z_i - U_i \bar Z +\bar U \bar Z)+\sigma_e \sigma(e_i Z_i -\bar e Z_i - e_i\bar Z + \bar e\bar Z )}{n-1} \\
&= \frac{1}{\sigma_U^2+ \sigma^2} \lim_{n \to \infty} \left[ \beta s_U^2+\frac{\sum_{i=1}^n \sigma_e \sigma(e_i Z_i -\bar e Z_i - e_i\bar Z + \bar e\bar Z )}{n-1} \right] ~~~~~~ \left(\text{the } U_ie_i \text{ and } U_iZ_i \text{ cross terms vanish in probability}\right) \\
&= \frac{1}{\sigma_U^2+ \sigma^2} \lim_{n \to \infty} \left[ \beta s_U^2+\frac{\sigma_e \sigma \left[\sum_{i=1}^n (e_i - \bar e) (Z_i- \bar Z)\right]}{n-1} \right]\\
&= \frac{1}{\sigma_U^2+ \sigma^2} \lim_{n \to \infty} \left[ \beta s_U^2+\sigma_e \sigma Cov(e_i,Z_i) \right] \\
&= \frac{1}{\sigma_U^2+ \sigma^2}(\beta \sigma_U^2)~~~~~~~~~~~~~~~~~\bigg(e_i,Z_i \overset{iid}{\sim} N(0,1) \implies Cov(e_i,Z_i) =0, ~\text{and } s_U^2 \to \sigma_U^2 \bigg) \\
&= \frac{\beta \sigma_U^2}{\sigma_U^2+ \sigma^2}
\end{aligned}
\] Therefore \(\hat{\beta}_{Y|X}\) converges in probability to \(\beta\sigma_U^2/(\sigma_U^2+\sigma^2) \neq \beta\) (when \(\beta \neq 0\)), so it is not a consistent estimator of \(\beta\); ignoring the measurement error attenuates the slope toward zero.
c.
Now construct the full likelihood for \(\alpha,\beta,\sigma_{\epsilon}^2, U_1,\dots,U_n\) and show that it has no sensible maximum. Do this by showing that the full likelihood diverges to \(\infty\) when \(U_i =(Y_i-\alpha)/\beta\) for all \(i\) and \(\sigma_{\epsilon}^2 \to 0\). This is another well-known example of the failure of maximum likelihood to produce meaningful estimates.
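A brief sketch of why the likelihood diverges (my own outline): the full likelihood is
\[
\mathcal L = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma_e}\exp\left\{-\frac{(Y_i-\alpha-\beta U_i)^2}{2\sigma_e^2}\right\}
\cdot \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{(X_i-U_i)^2}{2\sigma^2}\right\},
\]
and at \(U_i = (Y_i-\alpha)/\beta\) every exponent in the first factor is zero, so that factor equals \((2\pi\sigma_e^2)^{-n/2} \to \infty\) as \(\sigma_e^2 \to 0\), while the second factor stays bounded away from zero.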
d.
Consider the simple estimator (of \(\beta\)) \[\hat \beta_{MOM} = \frac{s_X^2}{s_X^2-\sigma^2} \hat \beta _{Y|X},\] and show that it is a consistent estimator of \(\beta\). This shows that consistent estimators exist, and thus the problem with maximum likelihood is not intrinsic to the model.
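A short sketch of the consistency argument, using the limits already established in part b):
\[
\hat\beta_{MOM} = \frac{s_X^2}{s_X^2-\sigma^2}\,\hat\beta_{Y|X}
\;\overset{p}{\longrightarrow}\;
\frac{\sigma_U^2+\sigma^2}{(\sigma_U^2+\sigma^2)-\sigma^2}\cdot\frac{\beta\sigma_U^2}{\sigma_U^2+\sigma^2}
= \beta.
\]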
e.
Assuming that all other parameters are known, show that \(T_i = Y_i \beta/\sigma_{e}^2+X_i/\sigma^2\) is a sufficient statistic for \(U_i\), \(i=1,\dots,n\).
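A sketch of the factorization argument (my own, treating \(\alpha,\beta,\sigma_e,\sigma\) as known): the joint density of \((Y_i, X_i)\) given \(U_i\) is proportional to
\[
\exp\left\{-\frac{(Y_i-\alpha-\beta U_i)^2}{2\sigma_e^2} - \frac{(X_i-U_i)^2}{2\sigma^2}\right\}
= \exp\left\{U_i\left(\frac{\beta (Y_i-\alpha)}{\sigma_e^2} + \frac{X_i}{\sigma^2}\right) - \frac{U_i^2}{2}\left(\frac{\beta^2}{\sigma_e^2}+\frac{1}{\sigma^2}\right)\right\} h(Y_i, X_i),
\]
where \(h\) does not involve \(U_i\). Since \(\alpha\), \(\beta\), \(\sigma_e\), and \(\sigma\) are known, the data enter the \(U_i\)-dependent factor only through \(Y_i\beta/\sigma_e^2 + X_i/\sigma^2\), so the factorization theorem applies.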
Thus, \(\left(\frac{y_i\beta}{\sigma_e^2}+\frac{ x_i}{\sigma^2}\right)\) is sufficient for \(U_i\).
f.
Find the conditional distribution of \(Y_i|T_i\) and use it to construct a conditional likelihood for \(\alpha, \beta\), and \(\sigma_{\epsilon}^2\) in a manner similar to that for the logistic regression measurement error model.
I ran out of time to finish this problem, but from here one would take derivatives of the conditional likelihood to find the MLEs.
2.35.
Derive the Fisher information matrix \(\mathbf I(\boldsymbol \theta)\) for the reparameterized ZIP model:\[\begin{aligned} P(Y =0) &= \pi\\ P(Y=y) &= \left(\frac{1-\pi}{1-e^{-\lambda}}\right)\frac{\lambda^ye^{-\lambda}}{y!}, ~~~ y = 1,2,\dots \end{aligned}\]
Because the observations are iid, the Fisher information for the full sample is \(n\) times the single-observation information matrix \(\mathbf I(\boldsymbol\theta)\), so it is enough to compute the expected value of the negative second-derivative matrix of the log likelihood for one observation.
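For reference, my own calculation (to be checked against the text) gives a diagonal information matrix in this parameterization:
\[
\mathbf I(\pi,\lambda) =
\begin{pmatrix}
\dfrac{1}{\pi(1-\pi)} & 0\\[6pt]
0 & (1-\pi)\left[\dfrac{1}{\lambda\left(1-e^{-\lambda}\right)} - \dfrac{e^{-\lambda}}{\left(1-e^{-\lambda}\right)^2}\right]
\end{pmatrix},
\]
where the zero off-diagonal follows because \(\partial^2 \log f/\partial\pi\,\partial\lambda = 0\).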
b.
Now for general differentiable \(g(\theta)\), find \(I(\theta)\)
Use simulation to verify that the information matrix for Example 2.21 (p. 78) is correct when \(\mu =1\) and \(\sigma =1\). One approach is to generate samples of size \(n=100\) (or larger) from a normal(1,1) distribution and exponentiate to get lognormal data. Then form the log likelihood and use a numerical derivative routine to find the second derivative matrix for each sample. Then average over 1000 replications and compare to the given information matrix.
Solution:
library(numDeriv)
set.seed(11062023)
n = 100
n_sim = 1000
params = c(mu = 1, sigma = 1, lambda = 0)
fisher_info = list()

## Using Sima's code as a base
# Define log-likelihood function of box-cox transformation
# Source: Scott Hyde master's thesis, Montana State University
# https://math.montana.edu/grad_students/writing-projects/Pre-2000/99hyde.pdf
log_likelihood <- function(params, y) {
  mu = params[1]
  sigma = params[2]
  lambda = params[3]
  if (lambda == 0) {
    z = log(y)
  } else {
    z = (y^(lambda) - 1) / lambda
  }
  log_lik <- sum(-1 / (2 * sigma^2) * (z - mu)^2 - 0.5 * log(2 * pi * sigma^2) + (lambda - 1) * log(y))
  return(log_lik)  # Return log-likelihood
}

for (i in 1:n_sim) {
  y = exp(rnorm(n, mean = params["mu"], sd = params["sigma"]))
  fisher_info[[i]] = -hessian(log_likelihood, params, y = y) / n
}
mean_fisher_info = Reduce("+", fisher_info) / length(fisher_info)

## Get matrix as defined in book
tau_1 = (params["sigma"]^2 + params["mu"]^2) / 2
tau_2 = (7 * params["sigma"]^4 + 10 * params["sigma"]^2 * params["mu"]^2 + params["mu"]^4) / 4
book = matrix(c(1, 0, -tau_1,
                0, 2, -2 * params["sigma"] * params["mu"],
                -tau_1, -2 * params["sigma"] * params["mu"], tau_2),
              nrow = 3, ncol = 3)
mean_fisher_info
The average of the simulated Fisher information matrices is printed above, with the Fisher information matrix described in the book right below it. These two matrices are very close, which means the simulation supports the book's example.
2.43
For the generalized linear model with link function \(g\), not necessarily the canonical link, write down the likelihood function and show how to obtain the likelihood score equation, \[S(\boldsymbol \beta, \phi) = \sum_{i=1}^n \mathbf D_i \frac{(Y_i-\mu_i)}{Var(Y_i)}=0, \text{ where } \mathbf D_i =\mathbf D_i(\boldsymbol \beta) = \frac{\partial \mu_i(\boldsymbol \beta)}{\partial \boldsymbol \beta^T}\] (In the above expression we have suppressed the dependence of \(D_i\) and \(\mu_i\) on \(\boldsymbol \beta\).) The key idea used is the chain rule and the fact that the derivative of \(\theta_i= b'^{-1}(\mu_i)\) with respect to \(\mu_i\) is \(1/b''(\theta_i)\).
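A sketch of the key chain-rule step (my own outline, writing \(\ell_i = [Y_i\theta_i - b(\theta_i)]/a_i(\phi) + c(Y_i,\phi)\) for the \(i\)th log-likelihood contribution and using the problem's notation for \(\mathbf D_i\)):
\[
\frac{\partial \ell_i}{\partial \boldsymbol\beta}
= \frac{\partial \ell_i}{\partial \theta_i}\cdot\frac{\partial \theta_i}{\partial \mu_i}\cdot\frac{\partial \mu_i}{\partial \boldsymbol\beta}
= \frac{Y_i-b'(\theta_i)}{a_i(\phi)}\cdot\frac{1}{b''(\theta_i)}\cdot \mathbf D_i
= \mathbf D_i\,\frac{Y_i-\mu_i}{Var(Y_i)},
\]
using \(\mu_i = b'(\theta_i)\) and \(Var(Y_i) = a_i(\phi)b''(\theta_i)\); summing over \(i\) and setting the result to zero gives the score equation.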
Continuing the last problem, show that the Fisher information matrix for the \(\boldsymbol \beta\) part of the generalized linear model is given by \[\bar{\mathbf I} ( \boldsymbol \beta ) = \frac{1}{n} \sum_{i=1}^n \frac{\mathbf{D_i D_i}^T}{Var(Y_i)}\] Here you can use either of two methods: a) take the expectation of the negative of the derivative of \(S(\boldsymbol \beta, \phi)\), and noting that all the ugly derivatives drop out because \(E(Y_i-\mu_i)=0\); or b) the individual summed components of \(\mathbf{\bar I}(\boldsymbol \beta)\) can also be found using the cross-product definition of information in (2.39, p. 67).
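Using method b) (a sketch of mine): by the cross-product definition, each summand of the information is
\[
E\left[\mathbf D_i\,\frac{(Y_i-\mu_i)}{Var(Y_i)}\cdot\frac{(Y_i-\mu_i)}{Var(Y_i)}\,\mathbf D_i^T\right]
= \frac{\mathbf D_i \mathbf D_i^T}{Var(Y_i)^2}\,E\left[(Y_i-\mu_i)^2\right]
= \frac{\mathbf D_i \mathbf D_i^T}{Var(Y_i)},
\]
and averaging over \(i\) gives \(\bar{\mathbf I}(\boldsymbol\beta)\).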
Suppose that \(X_1\) and \(X_2\) are independent and continuous random variables with densities \(f_1\) and \(f_2\), respectively. \(Z\) is a Bernoulli(\(p\)) random variable and independent of \(X_1\) and \(X_2\). Define \(Y= ZX_1 +(1-Z)X_2\).
a)
Use the \(2h\) method to show that the joint density of \((Y,Z)\) is given by \[f_{Y,Z}(y,z) = [pf_1(y)]^z[(1-p)f_2(y)]^{1-z}\]
b)
Use the \(2h\) method to show that \[P(Z=1|Y=y) = \frac{pf_1(y)}{pf_1(y) + (1-p)f_2(y)}\]
Solution:
I have already used the \(2h\) method in part a) to find the joint pdf \(f_{Y,Z}(y,z)\); now simple probability rules give the rest of the work:
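Specifically, conditioning on \(Y=y\) and using the joint density from part a),
\[
P(Z=1|Y=y) = \frac{f_{Y,Z}(y,1)}{f_Y(y)} = \frac{f_{Y,Z}(y,1)}{f_{Y,Z}(y,0)+f_{Y,Z}(y,1)} = \frac{pf_1(y)}{pf_1(y)+(1-p)f_2(y)}.
\]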
A mixture of three component densities has the form \[f(y;\boldsymbol \theta, \mathbf p) = p_1f_1(y;\boldsymbol \theta) + p_2f_2(y;\boldsymbol \theta) + p_3f_3(y;\boldsymbol \theta),\] where \(p_1 + p_2 + p_3 = 1\). We observe an iid sample \(Y_1,\dots, Y_n\) from \(f(y;\boldsymbol \theta, \mathbf p )\).
a.
Show how to define multinomial(\(1;p_1,p_2,p_3\)) vectors \((Z_{i1},Z_{i2},Z_{i3})\) to get a representation for the \(Y_i\) from \(f(y;\boldsymbol \theta, \mathbf p )\) based on independent random variables \((X_{i1},X_{i2},X_{i3})\) from the individual components.
Solution:
Let \((Z_{i1},Z_{i2},Z_{i3}) \sim multinomial(1;p_1,p_2,p_3)\), so that marginally \(Z_{ij} \sim Bernoulli(p_j)\) and exactly one of the three indicators equals 1. Also, let \(X_{i1},X_{i2},X_{i3}\) be independent of the \(Z_{ij}\) with pdfs \(f_1(y;\boldsymbol \theta), f_2(y;\boldsymbol \theta), f_3(y;\boldsymbol \theta)\) respectively. Then \(Y_i = Z_{i1}X_{i1}+Z_{i2}X_{i2}+Z_{i3}X_{i3}\) has density \(f(y;\boldsymbol \theta, \mathbf p)\).
Where \(\begin{aligned} w_{ij}^v= E_{(\boldsymbol \theta^v,p_1^v,p_2^v,p_3^v)}[Z_{ij}|Y_i]&= \frac{p_j^{v}f_j(Y_i;\boldsymbol \theta^v)}{\sum_{k=1}^3 p_k^vf_k(Y_i;\boldsymbol \theta^v)}\end{aligned}\)
2.48
Suppose that the data \(Y_1, \dots, Y_n\) are assumed to come from a mixture of two binomial distributions. Thus \[f(y;p,\theta_1,\theta_2) = p {n \choose y} \theta_1^y(1-\theta_1)^{n-y} +(1-p) {n \choose y} \theta_2^y(1-\theta_2)^{n-y}.\] Find \(Q(p,\theta_1,\theta_2,p^v,\theta_1^v,\theta_2^v)\) and the updating formulas.
Solution:
Define \(Z_i \sim Bernoulli(p)\) and \(Y_i = Z_iX_{1i}+ (1-Z_i)X_{2i}\), where \(X_{1i} \sim Binom(n,\theta_1)\), \(X_{2i} \sim Binom(n,\theta_2)\), and \(X_{1i},X_{2i}\) are independent of \(Z_i\). Thus, \(Y_i\) will have a mixture density equal to \(f(y;p,\theta_1,\theta_2)\).
Therefore, \[\mathcal L_C(p,\theta_1,\theta_2;y_i,Z) = \left[p f_{X_1}(y_i;n,\theta_1)\right]^{z_i} \left[(1-p) f_{X_2}(y_i;n,\theta_2)\right]^{1-{z_i}}\] And, following steps outlined on page 85 of Boos, \[\begin{aligned}
Q(p,\theta_1,\theta_2,p^v,\theta_1^v,\theta_2^v) &= \sum_{i=1}^n \left[ w_i^v \log f_{X_1}(y_i;n,\theta_1)+(1-w^v_i) \log f_{X_2}(y_i;n,\theta_2) +w^v_i \log p + (1-w^v_i)\log(1-p) \right]
\end{aligned}
\]
Where \(w^v_i = \frac{p^vf_{X_1}(y_i;n,\theta_1^v)}{p^vf_{X_1}(y_i;n,\theta_1^v)+(1-p^v)f_{X_2}(y_i;n,\theta_2^v)}\).
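The problem also asks for the updating formulas; maximizing \(Q\) (my own sketch, with \(n\) denoting both the sample size and the binomial index, as in the problem statement) gives
\[
p^{v+1} = \frac{1}{n}\sum_{i=1}^n w_i^v,\qquad
\theta_1^{v+1} = \frac{\sum_{i=1}^n w_i^v\,y_i}{n\sum_{i=1}^n w_i^v},\qquad
\theta_2^{v+1} = \frac{\sum_{i=1}^n (1-w_i^v)\,y_i}{n\sum_{i=1}^n (1-w_i^v)}.
\]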
Recall that the ZIP model is just a mixture of densities \(f(y;\lambda, p) =pf_1(y)+(1-p)f_2(y;\lambda)\) where \[\begin{aligned}f_1(y)=I(y=0)~&~~~~~~~~~f_2(y;\lambda)= \frac{\lambda^ye^{-\lambda}}{y!}~&~~y=0,1,2,\dots\end{aligned}\] Lambert (1992) used it to model product defects as a function of covariates. In the “perfect” state, no defects occur (\(P(Y_i= 0)= 1\)), whereas in the “imperfect” state, the number of defects \(Y_i\) follows a Poisson\((\lambda)\) distribution. The author used the EM Algorithm as follows (except we won’t do the more complicated modeling with covariates.) Let \(Z_i = 1\) if the product is in the perfect state and \(Z_i = 0\) for the imperfect state. Recall that the contribution to the complete data likelihood for a pair \((Y_i,Z_i)\) is \([pf_1(Y_i)]^{Z_i}[(1-p)f_2(Y_i;\lambda)]^{1-Z_i}\) (and here note that the first part reduces to \(p^{Z_i}\) because \(f_1\) is a point mass at 0).
a. E step.
Write down the complete data log likelihood and find \(Q(\lambda,p,\lambda^v,p^v)\) in terms of \(w_i^v = E(Z_i|Y_i,\lambda^v,p^v)\). (You do not need to give an expression for \(w_i^v\).)
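A sketch of the E step (my own, following the complete-data likelihood given above): the complete-data log likelihood is \(\sum_{i=1}^n \left\{Z_i\log p + (1-Z_i)\left[\log(1-p) + Y_i\log\lambda - \lambda - \log Y_i!\right]\right\}\), so taking conditional expectations given the \(Y_i\) and the current parameter values,
\[
Q(\lambda,p,\lambda^v,p^v) = \sum_{i=1}^n \left\{ w_i^v \log p + (1-w_i^v)\left[\log(1-p) + Y_i\log\lambda - \lambda - \log Y_i!\right]\right\}.
\]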