The Profile Likelihood


Chapter 6

The Profile Likelihood

6.1 The Profile Likelihood

See also Section 4.5 of Davison.

6.1.1 The method of profiling

Let us suppose that the unknown parameters $\theta$ can be partitioned as $\theta = (\psi, \lambda)$, where $\psi$ are the $p$-dimensional parameters of interest (e.g. the mean) and $\lambda$ are the $q$-dimensional nuisance parameters (e.g. the variance). We will need to estimate both $\psi$ and $\lambda$, but our interest is in testing only the parameter $\psi$ (without any information on $\lambda$) and in constructing confidence intervals for $\psi$ (without constructing unnecessary confidence intervals for $\lambda$; confidence intervals for a large number of parameters are wider than those for a few parameters). To achieve this one often uses the profile likelihood. To motivate the profile likelihood, we first describe a method for estimating the parameters $(\psi, \lambda)$ in two stages and consider some examples.

Let us suppose that $\{X_t\}$ are iid random variables with density $f(x; \psi, \lambda)$, where our objective is to estimate $\psi$ and $\lambda$. In this case the log-likelihood is

L_T(\psi, \lambda) = \sum_{t=1}^{T} \log f(X_t; \psi, \lambda).

To estimate $\psi$ and $\lambda$ one can use $(\hat{\psi}_T, \hat{\lambda}_T) = \arg\max_{\psi, \lambda} L_T(\psi, \lambda)$. However, this can be quite difficult and can lead to expressions which are hard to maximise. Instead let us consider a different method, which may sometimes be easier to evaluate. Suppose, for now, that $\psi$ is known; then we write the likelihood as $L_T(\psi, \lambda) = L_\psi(\lambda)$ (to emphasise that $\psi$ is fixed but $\lambda$ varies). To estimate $\lambda$ we maximise $L_\psi(\lambda)$ with respect to $\lambda$, i.e.

\hat{\lambda}_\psi = \arg\max_{\lambda} L_\psi(\lambda).

In reality $\psi$ is unknown, hence for each $\psi$ we can evaluate $\hat{\lambda}_\psi$. Note that for each $\psi$ we obtain a new curve $L_\psi(\lambda)$ over $\lambda$. Now, to estimate $\psi$, we evaluate the maximum of $L_\psi(\lambda)$ over $\lambda$, and choose the $\psi$ which gives the maximum over all these curves. In other words, we evaluate

\hat{\psi}_T = \arg\max_{\psi} L_\psi(\hat{\lambda}_\psi) = \arg\max_{\psi} L_T(\psi, \hat{\lambda}_\psi).

A bit of logical deduction shows that $\hat{\psi}_T$ and $\hat{\lambda}_{\hat{\psi}_T}$ are the maximum likelihood estimators, i.e. $(\hat{\psi}_T, \hat{\lambda}_T) = \arg\max_{\psi, \lambda} L_T(\psi, \lambda)$. We note that we have "profiled out" the nuisance parameter $\lambda$, and the profile likelihood $L_\psi(\hat{\lambda}_\psi) = L_T(\psi, \hat{\lambda}_\psi)$ is completely in terms of the parameter of interest $\psi$.
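As a concrete numerical illustration of the two-stage idea (this sketch is not part of the notes; the Gaussian model, sample size, grid and optimiser settings are illustrative choices matching the "mean / variance" partition mentioned above):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative model: X_t ~ N(psi, lambda) with psi the mean (parameter of
# interest) and lambda the variance (nuisance parameter).
rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=200)

def loglik(psi, lam, x):
    # L_T(psi, lambda) = sum_t log f(X_t; psi, lambda)
    return -0.5 * np.sum(np.log(2 * np.pi * lam) + (x - psi) ** 2 / lam)

def profile_loglik(psi, x):
    # Stage 1: for fixed psi, maximise over the nuisance parameter lambda.
    res = minimize_scalar(lambda lam: -loglik(psi, lam, x),
                          bounds=(1e-6, 100.0), method="bounded")
    return -res.fun, res.x   # L_T(psi, hat(lambda)_psi) and hat(lambda)_psi

# Stage 2: maximise the profile likelihood over psi (here on a grid).
psi_grid = np.linspace(0.0, 2.0, 401)
prof = np.array([profile_loglik(p, x)[0] for p in psi_grid])
psi_hat = psi_grid[np.argmax(prof)]
print("profile MLE of psi:", psi_hat, "   sample mean:", x.mean())
```

In this Gaussian illustration the two stages reproduce the joint MLE: $\hat{\psi}_T$ coincides (up to grid resolution) with the sample mean.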

The advantage of this is best illustrated through some examples.

Example 6.1.1 Let us suppose that $\{Y_t\}$ are iid random variables from a Weibull distribution with density $f(y; \alpha, \theta) = \frac{\alpha y^{\alpha-1}}{\theta^{\alpha}} \exp\big(-(y/\theta)^{\alpha}\big)$. We know from Example 4.1 that if $\alpha$ were known, an explicit expression for the MLE of $\theta$ can be derived:

\hat{\theta}_\alpha = \arg\max_{\theta} L_\alpha(\theta) = \arg\max_{\theta} \sum_{t=1}^{T}\Big( \log\alpha + (\alpha-1)\log Y_t - \alpha\log\theta - \big(\tfrac{Y_t}{\theta}\big)^{\alpha} \Big) = \Big( \frac{1}{T}\sum_{t=1}^{T} Y_t^{\alpha} \Big)^{1/\alpha},

where $L_\alpha(\theta) = \sum_{t=1}^{T}\big( \log\alpha + (\alpha-1)\log Y_t - \alpha\log\theta - (Y_t/\theta)^{\alpha} \big)$. Thus for a given $\alpha$, the maximum likelihood estimator of $\theta$ can be derived explicitly. The maximum likelihood estimator of $\alpha$ is then obtained by maximising the profile likelihood,

\hat{\alpha}_T = \arg\max_{\alpha} \sum_{t=1}^{T}\Big( \log\alpha + (\alpha-1)\log Y_t - \alpha\log\Big(\frac{1}{T}\sum_{s=1}^{T} Y_s^{\alpha}\Big)^{1/\alpha} - \frac{Y_t^{\alpha}}{\frac{1}{T}\sum_{s=1}^{T} Y_s^{\alpha}} \Big).

Therefore, the maximum likelihood estimator of $\theta$ is $\big(\frac{1}{T}\sum_t Y_t^{\hat{\alpha}_T}\big)^{1/\hat{\alpha}_T}$. We observe that evaluating $\hat{\alpha}_T$ can be tricky, but it is no worse than maximising the likelihood $L_T(\alpha, \theta)$ over $\alpha$ and $\theta$ jointly.
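A numerical sketch of this example (not part of the notes; the simulated data, the true shape and scale values and the optimiser bounds are illustrative choices):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Profile out theta in the Weibull likelihood, then maximise over alpha.
rng = np.random.default_rng(1)
y = rng.weibull(a=1.5, size=300) * 2.0      # shape alpha = 1.5, scale theta = 2

def theta_hat(alpha, y):
    # Explicit MLE of theta for fixed alpha: ((1/T) sum_t Y_t^alpha)^(1/alpha)
    return np.mean(y ** alpha) ** (1.0 / alpha)

def profile_loglik(alpha, y):
    # L_T(alpha, hat(theta)_alpha): the profile log-likelihood of alpha
    th = theta_hat(alpha, y)
    return np.sum(np.log(alpha) + (alpha - 1) * np.log(y)
                  - alpha * np.log(th) - (y / th) ** alpha)

res = minimize_scalar(lambda a: -profile_loglik(a, y),
                      bounds=(0.1, 10.0), method="bounded")
alpha_hat = res.x
print("alpha_hat:", alpha_hat, "   theta_hat:", theta_hat(alpha_hat, y))
```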

As we mentioned above, we often have no interest in the nuisance parameter $\lambda$ and only wish to test and construct confidence intervals for the parameter of interest (in the example above, $\alpha$). In this case, we are interested in the limiting distribution of the MLE $\hat{\psi}_T$. This can easily be derived by observing that

\sqrt{T}\begin{pmatrix} \hat{\psi}_T - \psi \\ \hat{\lambda}_T - \lambda \end{pmatrix} \xrightarrow{D} N\left( 0, \begin{pmatrix} I_{\psi\psi} & I_{\psi\lambda} \\ I_{\lambda\psi} & I_{\lambda\lambda} \end{pmatrix}^{-1} \right),

where

\begin{pmatrix} I_{\psi\psi} & I_{\psi\lambda} \\ I_{\lambda\psi} & I_{\lambda\lambda} \end{pmatrix} = \mathrm{E}\begin{pmatrix} -\frac{\partial^2 \log f(X_t;\psi,\lambda)}{\partial\psi^2} & -\frac{\partial^2 \log f(X_t;\psi,\lambda)}{\partial\psi\,\partial\lambda} \\ -\frac{\partial^2 \log f(X_t;\psi,\lambda)}{\partial\lambda\,\partial\psi} & -\frac{\partial^2 \log f(X_t;\psi,\lambda)}{\partial\lambda^2} \end{pmatrix}.   (6.1)

To derive an exact expression for the limiting variance of $\sqrt{T}(\hat{\psi}_T - \psi)$, we note that the inverse of a block matrix is

\begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1} = \begin{pmatrix} (A - BD^{-1}C)^{-1} & -A^{-1}B(D - CA^{-1}B)^{-1} \\ -D^{-1}C(A - BD^{-1}C)^{-1} & (D - CA^{-1}B)^{-1} \end{pmatrix}.

Thus the above implies that

\sqrt{T}(\hat{\psi}_T - \psi) \xrightarrow{D} N\big( 0, (I_{\psi\psi} - I_{\psi\lambda} I_{\lambda\lambda}^{-1} I_{\lambda\psi})^{-1} \big).

Thus if $\psi$ is a scalar we can easily use the above to construct confidence intervals for $\psi$.

Exercise: How would you estimate $I_{\psi\psi} - I_{\psi\lambda} I_{\lambda\lambda}^{-1} I_{\lambda\psi}$?
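One natural answer to the exercise (a sketch, not part of the notes): replace the expectation in (6.1) by the observed information, i.e. minus $1/T$ times the Hessian of the log-likelihood evaluated at the MLE (or plug the MLE into the expected information when it is available in closed form), and then take the Schur complement of the $\lambda$-block.

```python
import numpy as np

def schur_complement(info, p):
    """Given an estimated (p+q)x(p+q) information matrix, partitioned with the
    psi-block first, return I_psipsi - I_psilam I_lamlam^{-1} I_lampsi."""
    I_pp = info[:p, :p]
    I_pl = info[:p, p:]
    I_lp = info[p:, :p]
    I_ll = info[p:, p:]
    return I_pp - I_pl @ np.linalg.solve(I_ll, I_lp)

# Toy usage with a hypothetical 3x3 estimated information matrix (p = 1, q = 2):
I_hat = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.5, 0.2],
                  [0.3, 0.2, 1.0]])
print(schur_complement(I_hat, p=1))   # estimated I_psipsi - I_psilam I_lamlam^{-1} I_lampsi
```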

6.1.2 The score and the log-likelihood ratio for the profile likelihood

To ease notation, let us suppose that $\psi_0$ and $\lambda_0$ are the true parameters in the distribution. The above gives us the limiting distribution of $\sqrt{T}(\hat{\psi}_T - \psi_0)$; this allows us to test $\psi$, however the test ignores any dependence that may exist with the nuisance parameter estimator $\hat{\lambda}_T$. An alternative test, which circumvents this issue, is a log-likelihood ratio test of the type

2\Big( \max_{\psi,\lambda} L_T(\psi, \lambda) - \max_{\lambda} L_T(\psi_0, \lambda) \Big).   (6.2)

However, deriving the limiting distribution of this statistic is a little more complicated than for the log-likelihood ratio test without nuisance parameters, because a direct Taylor expansion does not work. However, we observe that

\max_{\psi,\lambda} L_T(\psi, \lambda) - \max_{\lambda} L_T(\psi_0, \lambda) = \Big( \max_{\psi,\lambda} L_T(\psi, \lambda) - L_T(\psi_0, \lambda_0) \Big) - \Big( \max_{\lambda} L_T(\psi_0, \lambda) - L_T(\psi_0, \lambda_0) \Big),

and we show below that by using a few Taylor expansions we can derive the limiting distribution of (6.2). In the theorem below we derive the distribution of the score and the nested log-likelihood ratio. Please note you do not have to learn this proof.

Theorem 6.1.1 Suppose Assumption 4.1.1 holds. Suppose that $(\psi_0, \lambda_0)$ are the true parameters. Then we have

\frac{\partial L_T(\psi,\lambda)}{\partial\psi}\Big|_{(\psi_0, \hat{\lambda}_{\psi_0})} \approx \Big( \frac{\partial L_T(\psi,\lambda)}{\partial\psi} - I_{\psi\lambda} I_{\lambda\lambda}^{-1} \frac{\partial L_T(\psi,\lambda)}{\partial\lambda} \Big)\Big|_{(\psi_0, \lambda_0)},   (6.3)

\frac{1}{\sqrt{T}} \frac{\partial L_T(\psi,\lambda)}{\partial\psi}\Big|_{(\psi_0, \hat{\lambda}_{\psi_0})} \xrightarrow{D} N\big( 0, I_{\psi\psi} - I_{\psi\lambda} I_{\lambda\lambda}^{-1} I_{\lambda\psi} \big)   (6.4)

and

2\big( L_T(\hat{\psi}_T, \hat{\lambda}_T) - L_T(\psi_0, \hat{\lambda}_{\psi_0}) \big) \xrightarrow{D} \chi^2_p,   (6.5)

where $I$ is defined as in (6.1).

PROOF. We first prove (6.3), which is the basis of the proofs of (6.4) and (6.5); in the remark below we try to interpret (6.3). To avoid notational difficulties in handling the elements of the vectors $\frac{\partial L_T(\psi,\lambda)}{\partial\psi}|_{(\psi_0, \hat{\lambda}_{\psi_0})}$ and $\frac{\partial L_T(\psi,\lambda)}{\partial\lambda}|_{(\psi_0, \lambda_0)}$ (as discussed in Section 4.1.3), we will suppose that these are univariate random variables.

Our objective is to find an expression for $\frac{\partial L_T(\psi,\lambda)}{\partial\psi}|_{(\psi_0, \hat{\lambda}_{\psi_0})}$ in terms of $\frac{\partial L_T(\psi,\lambda)}{\partial\psi}|_{(\psi_0, \lambda_0)}$ and $\frac{\partial L_T(\psi,\lambda)}{\partial\lambda}|_{(\psi_0, \lambda_0)}$, which will allow us to obtain its variance and asymptotic distribution easily.

Making a Taylor expansion of $\frac{\partial L_T(\psi,\lambda)}{\partial\psi}|_{(\psi_0, \hat{\lambda}_{\psi_0})}$ about $\frac{\partial L_T(\psi,\lambda)}{\partial\psi}|_{(\psi_0, \lambda_0)}$ gives

\frac{\partial L_T(\psi,\lambda)}{\partial\psi}\Big|_{(\psi_0, \hat{\lambda}_{\psi_0})} \approx \frac{\partial L_T(\psi,\lambda)}{\partial\psi}\Big|_{(\psi_0, \lambda_0)} + \frac{\partial^2 L_T(\psi,\lambda)}{\partial\lambda\,\partial\psi}\Big|_{(\psi_0, \lambda_0)} (\hat{\lambda}_{\psi_0} - \lambda_0).

Notice that we have used $\approx$ rather than $=$ because we have replaced the second derivative (which should be evaluated at an intermediate point) by its value at the true parameters. Now, if the sample size $T$ is large enough, we can say that $\frac{1}{T}\frac{\partial^2 L_T(\psi,\lambda)}{\partial\lambda\,\partial\psi}|_{(\psi_0,\lambda_0)} \approx -I_{\lambda\psi}$. To see why this is true, consider the case of iid random variables; then

\frac{1}{T}\frac{\partial^2 L_T(\psi,\lambda)}{\partial\lambda\,\partial\psi}\Big|_{(\psi_0,\lambda_0)} = \frac{1}{T}\sum_{t=1}^{T} \frac{\partial^2 \log f(X_t;\psi,\lambda)}{\partial\lambda\,\partial\psi}\Big|_{(\psi_0,\lambda_0)} \approx \mathrm{E}\Big( \frac{\partial^2 \log f(X_t;\psi,\lambda)}{\partial\lambda\,\partial\psi} \Big) = -I_{\lambda\psi}.

Therefore we have

\frac{\partial L_T(\psi,\lambda)}{\partial\psi}\Big|_{(\psi_0, \hat{\lambda}_{\psi_0})} \approx \frac{\partial L_T(\psi,\lambda)}{\partial\psi}\Big|_{(\psi_0, \lambda_0)} - T I_{\psi\lambda} (\hat{\lambda}_{\psi_0} - \lambda_0).   (6.6)

Hence we have the first part of the decomposition of $\frac{\partial L_T(\psi,\lambda)}{\partial\psi}|_{(\psi_0,\hat{\lambda}_{\psi_0})}$ in terms of a quantity whose distribution is known; now we need to find a decomposition of $(\hat{\lambda}_{\psi_0} - \lambda_0)$ into known distributions. We first recall that since $\hat{\lambda}_{\psi_0} = \arg\max_{\lambda} L_T(\psi_0, \lambda)$, we have $\frac{\partial L_T(\psi_0,\lambda)}{\partial\lambda}|_{\hat{\lambda}_{\psi_0}} = 0$ (as long as the parameter space is large enough and the maximum is not on the boundary). Therefore, making a Taylor expansion of $\frac{\partial L_T(\psi_0,\lambda)}{\partial\lambda}|_{\hat{\lambda}_{\psi_0}}$ about $\frac{\partial L_T(\psi_0,\lambda)}{\partial\lambda}|_{\lambda_0}$ gives

0 = \frac{\partial L_T(\psi_0,\lambda)}{\partial\lambda}\Big|_{\hat{\lambda}_{\psi_0}} \approx \frac{\partial L_T(\psi_0,\lambda)}{\partial\lambda}\Big|_{\lambda_0} + \frac{\partial^2 L_T(\psi_0,\lambda)}{\partial\lambda^2}\Big|_{\lambda_0} (\hat{\lambda}_{\psi_0} - \lambda_0).

Again using the same trick as in (6.6) we have

0 \approx \frac{\partial L_T(\psi_0,\lambda)}{\partial\lambda}\Big|_{\lambda_0} - T I_{\lambda\lambda} (\hat{\lambda}_{\psi_0} - \lambda_0),

therefore

(\hat{\lambda}_{\psi_0} - \lambda_0) \approx \frac{1}{T} I_{\lambda\lambda}^{-1} \frac{\partial L_T(\psi_0,\lambda)}{\partial\lambda}\Big|_{\lambda_0}.   (6.7)

Substituting (6.7) into (6.6) gives

\frac{\partial L_T(\psi,\lambda)}{\partial\psi}\Big|_{(\psi_0, \hat{\lambda}_{\psi_0})} \approx \Big( \frac{\partial L_T(\psi,\lambda)}{\partial\psi} - I_{\psi\lambda} I_{\lambda\lambda}^{-1} \frac{\partial L_T(\psi,\lambda)}{\partial\lambda} \Big)\Big|_{(\psi_0, \lambda_0)},

which is (6.3).

To prove (6.4) (i.e. to obtain the asymptotic distribution and limiting variance of $\frac{\partial L_T(\psi,\lambda)}{\partial\psi}|_{(\psi_0,\hat{\lambda}_{\psi_0})}$), we recall that the regular score function satisfies

\frac{1}{\sqrt{T}} \begin{pmatrix} \frac{\partial L_T(\psi,\lambda)}{\partial\psi} \\ \frac{\partial L_T(\psi,\lambda)}{\partial\lambda} \end{pmatrix}\Bigg|_{(\psi_0,\lambda_0)} \xrightarrow{D} N\big( 0, I(\psi_0, \lambda_0) \big).

Now, by substituting the above into (6.3) we immediately obtain (6.4).

Finally, to prove (6.5) we use the following decomposition, Taylor expansions and the trick in (6.6) to obtain

2\big( L_T(\hat{\psi}_T, \hat{\lambda}_T) - L_T(\psi_0, \hat{\lambda}_{\psi_0}) \big) = 2\big( L_T(\hat{\psi}_T, \hat{\lambda}_T) - L_T(\psi_0, \lambda_0) \big) - 2\big( L_T(\psi_0, \hat{\lambda}_{\psi_0}) - L_T(\psi_0, \lambda_0) \big)
 \approx T(\hat{\theta}_T - \theta_0)' I(\theta_0) (\hat{\theta}_T - \theta_0) - T(\hat{\lambda}_{\psi_0} - \lambda_0)' I_{\lambda\lambda} (\hat{\lambda}_{\psi_0} - \lambda_0),   (6.8)

where $\hat{\theta}_T = (\hat{\psi}_T, \hat{\lambda}_T)$ (the MLE) and $\theta_0 = (\psi_0, \lambda_0)$. Now we want to rewrite $(\hat{\lambda}_{\psi_0} - \lambda_0)$ in terms of $(\hat{\theta}_T - \theta_0)$. We start by recalling that from (6.7) we have

(\hat{\lambda}_{\psi_0} - \lambda_0) \approx \frac{1}{T} I_{\lambda\lambda}^{-1} \frac{\partial L_T(\psi_0,\lambda)}{\partial\lambda}\Big|_{\lambda_0}.

We now rewrite $\frac{\partial L_T(\psi_0,\lambda)}{\partial\lambda}|_{\lambda_0}$ in terms of $(\hat{\theta}_T - \theta_0)$ by using

\frac{\partial L_T(\theta)}{\partial\theta}\Big|_{\theta_0} \approx \frac{\partial L_T(\theta)}{\partial\theta}\Big|_{\hat{\theta}_T} + T I(\theta_0)(\hat{\theta}_T - \theta_0) = T I(\theta_0)(\hat{\theta}_T - \theta_0).

Therefore, concentrating on the subvector $\frac{\partial L_T(\theta)}{\partial\lambda}|_{\theta_0}$, we see that

\frac{\partial L_T(\theta)}{\partial\lambda}\Big|_{\theta_0} \approx T I_{\lambda\psi}(\hat{\psi}_T - \psi_0) + T I_{\lambda\lambda}(\hat{\lambda}_T - \lambda_0).   (6.9)

Substituting (6.9) into (6.7) gives

(\hat{\lambda}_{\psi_0} - \lambda_0) \approx I_{\lambda\lambda}^{-1} I_{\lambda\psi}(\hat{\psi}_T - \psi_0) + (\hat{\lambda}_T - \lambda_0).

Finally, substituting the above into (6.8) and making lots of cancellations, we have

2\big( L_T(\hat{\psi}_T, \hat{\lambda}_T) - L_T(\psi_0, \hat{\lambda}_{\psi_0}) \big) \approx T(\hat{\psi}_T - \psi_0)' (I_{\psi\psi} - I_{\psi\lambda} I_{\lambda\lambda}^{-1} I_{\lambda\psi}) (\hat{\psi}_T - \psi_0).

Finally, since $\sqrt{T}(\hat{\theta}_T - \theta_0) \xrightarrow{D} N(0, I(\theta_0)^{-1})$, by using the inversion formula for block matrices we have that $\sqrt{T}(\hat{\psi}_T - \psi_0) \xrightarrow{D} N\big(0, (I_{\psi\psi} - I_{\psi\lambda} I_{\lambda\lambda}^{-1} I_{\lambda\psi})^{-1}\big)$, which gives the desired result.

Remark 6.1.1 (i) We first make a rather interesting observation. The limiting variance of $\frac{1}{\sqrt{T}}\frac{\partial L_T(\psi,\lambda)}{\partial\psi}|_{(\psi_0,\lambda_0)}$ is $I_{\psi\psi}$, whereas the limiting variance of $\frac{1}{\sqrt{T}}\frac{\partial L_T(\psi,\lambda)}{\partial\psi}|_{(\psi_0,\hat{\lambda}_{\psi_0})}$ is $(I_{\psi\psi} - I_{\psi\lambda} I_{\lambda\lambda}^{-1} I_{\lambda\psi})$, and the limiting variance of $\sqrt{T}(\hat{\psi}_T - \psi_0)$ is $(I_{\psi\psi} - I_{\psi\lambda} I_{\lambda\lambda}^{-1} I_{\lambda\psi})^{-1}$.

(ii) Look again at the expression

\frac{\partial L_T(\psi,\lambda)}{\partial\psi}\Big|_{(\psi_0, \hat{\lambda}_{\psi_0})} \approx \Big( \frac{\partial L_T(\psi,\lambda)}{\partial\psi} - I_{\psi\lambda} I_{\lambda\lambda}^{-1} \frac{\partial L_T(\psi,\lambda)}{\partial\lambda} \Big)\Big|_{(\psi_0, \lambda_0)}.   (6.10)

It is useful to understand where it came from. Consider the problem of linear regression. Suppose $X$ and $Y$ are zero-mean random variables and we want to construct the best linear predictor of $Y$ given $X$. We know that the best linear predictor is $\hat{Y}(X) = \mathrm{E}(XY)\mathrm{E}(X^2)^{-1}X$, the residual is

Y - \hat{Y}(X) = Y - \mathrm{E}(XY)\mathrm{E}(X^2)^{-1}X,

and the mean squared error is

\mathrm{E}\big( Y - \hat{Y}(X) \big)^2 = \mathrm{E}(Y^2) - \mathrm{E}(YX)\mathrm{E}(X^2)^{-1}\mathrm{E}(XY).

Compare these expressions with (6.10) and (6.4). We see that, in some sense, $\frac{\partial L_T(\psi,\lambda)}{\partial\psi}|_{(\psi_0,\hat{\lambda}_{\psi_0})}$ can be treated as the residual (error) of the projection of $\frac{\partial L_T(\psi,\lambda)}{\partial\psi}|_{(\psi_0,\lambda_0)}$ onto $\frac{\partial L_T(\psi,\lambda)}{\partial\lambda}|_{(\psi_0,\lambda_0)}$. This is quite a surprising interpretation.

We now aim to use the above result. It is immediately clear that (6.5) can be used both for constructing confidence intervals and for testing. For example, to construct a 95% CI for $\psi$ we can use the MLE $\hat{\theta}_T = (\hat{\psi}_T, \hat{\lambda}_T)$ and the profile likelihood to obtain the 95% CI

\Big\{ \psi :\; 2\big( L_T(\hat{\psi}_T, \hat{\lambda}_T) - L_T(\psi, \hat{\lambda}_{\psi}) \big) \le \chi^2_p(0.95) \Big\}.

As you can see, by profiling out the parameter $\lambda$ we have avoided the need to also construct a CI for $\lambda$. This has many advantages; from a practical perspective it reduces the dimension of the parameter space.
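A small numerical sketch of this confidence interval (not part of the notes; it reuses the illustrative Gaussian model with $\psi$ the mean and $\lambda$ the variance, for which $\hat{\lambda}_{\psi} = \frac{1}{T}\sum_t (X_t - \psi)^2$ is available in closed form; the data and grid are hypothetical):

```python
import numpy as np
from scipy.stats import chi2

# Profile-likelihood 95% CI: { psi : 2(L(psi_hat,lam_hat) - L(psi, lam_hat_psi)) <= chi2_1(0.95) }
rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=200)
T = len(x)

def profile_loglik(psi, x):
    lam_psi = np.mean((x - psi) ** 2)           # hat(lambda)_psi in closed form
    return -0.5 * T * (np.log(2 * np.pi * lam_psi) + 1.0)

psi_grid = np.linspace(0.0, 2.0, 2001)
prof = np.array([profile_loglik(p, x) for p in psi_grid])
lr = 2 * (prof.max() - prof)                    # profile likelihood ratio statistic
inside = psi_grid[lr <= chi2.ppf(0.95, df=1)]   # p = dim(psi) = 1
print("95% profile CI for psi: [%.3f, %.3f]" % (inside.min(), inside.max()))
```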

The log-likelihood ratio test in the presence of nuisance parameters

An application of Theorem 6.1.1 is to nested hypothesis testing, as stated at the beginning of this section. (6.5) can be used to test $H_0: \psi = \psi_0$ against $H_A: \psi \neq \psi_0$, since

2\Big( \max_{\psi,\lambda} L_T(\psi,\lambda) - \max_{\lambda} L_T(\psi_0,\lambda) \Big) \xrightarrow{D} \chi^2_p.

Example 6.1.2 ($\chi^2$-test for independence) It is worth noting that using the profile likelihood one can derive the chi-squared test for independence (in much the same way that the Pearson goodness-of-fit test was derived using the log-likelihood ratio test). Do this as an exercise (see Davison, Example 4.37, page 135).

The score test in the presence of nuisance parameters

We recall that we used Theorem 6.1.1 to obtain the distribution of $\max_{\psi,\lambda} L_T(\psi,\lambda) - \max_{\lambda} L_T(\psi_0,\lambda)$ under the null; we now motivate an alternative test of the same hypothesis (which uses the same theorem). We recall that under the null $H_0: \psi = \psi_0$ the derivative $\frac{\partial L_T(\psi,\lambda)}{\partial\lambda}|_{(\psi_0,\hat{\lambda}_{\psi_0})} = 0$ by construction, but the same is not true of $\frac{\partial L_T(\psi,\lambda)}{\partial\psi}|_{(\psi_0,\hat{\lambda}_{\psi_0})}$. However, if the null is true we would expect $\hat{\lambda}_{\psi_0}$ to be close to the true $\lambda_0$ and $\frac{\partial L_T(\psi,\lambda)}{\partial\psi}|_{(\psi_0,\hat{\lambda}_{\psi_0})}$ to be close to zero. Indeed this is what we showed in (6.4), where we showed that under the null

\frac{1}{\sqrt{T}} \frac{\partial L_T(\psi,\lambda)}{\partial\psi}\Big|_{(\psi_0,\hat{\lambda}_{\psi_0})} \xrightarrow{D} N\big( 0, I_{\psi\psi} - I_{\psi\lambda} I_{\lambda\lambda}^{-1} I_{\lambda\psi} \big),   (6.11)

where $\hat{\lambda}_{\psi_0} = \arg\max_{\lambda} L_T(\psi_0,\lambda)$. Therefore (6.11) suggests an alternative test of $H_0: \psi = \psi_0$ against $H_A: \psi \neq \psi_0$: we can use $\frac{1}{\sqrt{T}}\frac{\partial L_T(\psi,\lambda)}{\partial\psi}|_{(\psi_0,\hat{\lambda}_{\psi_0})}$, suitably standardised, as the test statistic. This is called the score or LM test (a small numerical sketch is given after the comparison below).

The log-likelihood ratio test and the score test are asymptotically equivalent. There are advantages and disadvantages to both:

(i) An advantage of the log-likelihood ratio test is that we do not need to calculate the information matrix.

(ii) An advantage of the score test is that we do not have to evaluate the maximum likelihood estimates under the alternative model.
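The following sketch of the score test is not taken from the notes: it uses a Gamma model with $\psi$ the shape parameter of interest and $\lambda$ the scale nuisance parameter, chosen only because $I_{\psi\lambda} \neq 0$ there, so the variance correction in (6.11) actually matters. The data, null value and seed are arbitrary illustrative choices.

```python
import numpy as np
from scipy.special import digamma, polygamma
from scipy.stats import chi2

# X_t ~ Gamma(shape = psi, scale = lambda); test H_0: psi = psi0.
rng = np.random.default_rng(2)
x = rng.gamma(shape=2.0, scale=1.5, size=500)
T = len(x)

psi0 = 2.0
lam_psi0 = x.mean() / psi0                    # hat(lambda)_{psi0} = xbar / psi0

# (1/sqrt(T)) dL_T/dpsi evaluated at (psi0, hat(lambda)_{psi0})
score = np.sum(np.log(x) - np.log(lam_psi0) - digamma(psi0)) / np.sqrt(T)

# I_psipsi - I_psilam I_lamlam^{-1} I_lampsi = trigamma(psi0) - 1/psi0 for this model
var_score = polygamma(1, psi0) - 1.0 / psi0

lm_stat = score ** 2 / var_score              # compare with chi^2_1
print("LM statistic:", lm_stat, "   5% critical value:", chi2.ppf(0.95, df=1))
```

Note that neither the LR statistic nor this LM statistic requires the unrestricted MLE of $\psi$; the LM version additionally avoids maximising over $\psi$ altogether, at the cost of computing the information matrix.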

6.1.3 Examples

Example: An application of profiling to frequency estimation

Question

Suppose that the observations $\{X_t;\ t = 1, \ldots, T\}$ satisfy the following nonlinear regression model

X_t = A\cos(\omega t) + B\sin(\omega t) + \varepsilon_t,

where $\{\varepsilon_t\}$ are iid standard normal random variables and $0 < \omega < \pi$. The parameters $A$, $B$ and $\omega$ are real and unknown. Some useful identities are given at the end of the question.

(i) Ignoring constants, obtain the log-likelihood of $\{X_t\}$. Denote this likelihood as $L_T(A, B, \omega)$.

(ii) Let

S_T(A, B, \omega) = \sum_{t=1}^{T} X_t^2 - 2\sum_{t=1}^{T} X_t\big( A\cos(\omega t) + B\sin(\omega t) \big) + \frac{T}{2}(A^2 + B^2).

Show that

2 L_T(A, B, \omega) + S_T(A, B, \omega) = -\frac{(A^2 - B^2)}{2}\sum_{t=1}^{T}\cos(2\omega t) - AB\sum_{t=1}^{T}\sin(2\omega t).

Thus show that $L_T(A, B, \omega) + \frac{1}{2} S_T(A, B, \omega) = O(1)$ (i.e. the difference does not grow with $T$). Since $L_T(A, B, \omega)$ and $-\frac{1}{2} S_T(A, B, \omega)$ are asymptotically equivalent, for the rest of this question use $-\frac{1}{2} S_T(A, B, \omega)$ instead of the likelihood $L_T(A, B, \omega)$.

(iii) Obtain the profile likelihood of $\omega$. (Hint: profile out the parameters $A$ and $B$ to show that $\hat{\omega}_T = \arg\max_{\omega} \big| \sum_t X_t \exp(it\omega) \big|^2$.) Suggest a graphical method for evaluating $\hat{\omega}_T$.

(iv) By using the identity

\sum_{t=1}^{T} \exp(i\Omega t) = \begin{cases} \exp\big(\tfrac{1}{2} i(T+1)\Omega\big)\,\dfrac{\sin(\tfrac{1}{2}T\Omega)}{\sin(\tfrac{1}{2}\Omega)} & 0 < \Omega < 2\pi, \\ T & \Omega = 0 \text{ or } 2\pi, \end{cases}   (6.12)

show that for $0 < \Omega < 2\pi$ we have

\sum_{t=1}^{T} t\cos(\Omega t) = O(T), \quad \sum_{t=1}^{T} t\sin(\Omega t) = O(T), \quad \sum_{t=1}^{T} t^2\cos(\Omega t) = O(T^2), \quad \sum_{t=1}^{T} t^2\sin(\Omega t) = O(T^2).

(v) By using the results in part (iv), show that the Fisher information of $L_T(A, B, \omega)$ (denoted $I(A, B, \omega)$) is asymptotically equivalent to

I(A, B, \omega) = \frac{1}{2}\,\mathrm{E}\Big( \frac{\partial^2 S_T(A,B,\omega)}{\partial(A,B,\omega)\,\partial(A,B,\omega)'} \Big) = \frac{1}{2}\begin{pmatrix} T & 0 & \frac{T^2}{2}B + O(T) \\ 0 & T & -\frac{T^2}{2}A + O(T) \\ \frac{T^2}{2}B + O(T) & -\frac{T^2}{2}A + O(T) & \frac{T^3}{3}(A^2+B^2) + O(T^2) \end{pmatrix}.

(vi) Derive the asymptotic variance of the maximum likelihood estimator $\hat{\omega}_T$ derived in part (iii). Comment on the rate of convergence of $\hat{\omega}_T$.

Useful information: In this question the following quantities may be useful:

\sum_{t=1}^{T} \exp(i\Omega t) = \begin{cases} \exp\big(\tfrac{1}{2} i(T+1)\Omega\big)\,\dfrac{\sin(\tfrac{1}{2}T\Omega)}{\sin(\tfrac{1}{2}\Omega)} & 0 < \Omega < 2\pi, \\ T & \Omega = 0 \text{ or } 2\pi, \end{cases}   (6.13)

the trigonometric identities $\sin(2\Omega) = 2\sin\Omega\cos\Omega$, $\cos(2\Omega) = 2\cos^2\Omega - 1 = 1 - 2\sin^2\Omega$, $\exp(i\Omega) = \cos(\Omega) + i\sin(\Omega)$, and

\sum_{t=1}^{T} t = \frac{T(T+1)}{2}, \qquad \sum_{t=1}^{T} t^2 = \frac{T(T+1)(2T+1)}{6}.

Solution

(i) Since $\{\varepsilon_t\}$ are standard normal iid random variables, the log-likelihood is (up to a constant)

L_T(A, B, \omega) = -\frac{1}{2}\sum_{t=1}^{T}\big( X_t - A\cos(\omega t) - B\sin(\omega t) \big)^2.

(ii) It is straightforward to show that

-2 L_T(A,B,\omega) = \sum_t X_t^2 - 2\sum_t X_t\big( A\cos(\omega t) + B\sin(\omega t) \big) + A^2\sum_t \cos^2(\omega t) + B^2\sum_t \sin^2(\omega t) + 2AB\sum_t \sin(\omega t)\cos(\omega t)
 = \sum_t X_t^2 - 2\sum_t X_t\big( A\cos(\omega t) + B\sin(\omega t) \big) + \frac{A^2}{2}\sum_t\big(1 + \cos(2\omega t)\big) + \frac{B^2}{2}\sum_t\big(1 - \cos(2\omega t)\big) + AB\sum_t \sin(2\omega t)
 = \sum_t X_t^2 - 2\sum_t X_t\big( A\cos(\omega t) + B\sin(\omega t) \big) + \frac{T}{2}(A^2 + B^2) + \frac{(A^2 - B^2)}{2}\sum_t \cos(2\omega t) + AB\sum_t \sin(2\omega t)
 = S_T(A,B,\omega) + \frac{(A^2 - B^2)}{2}\sum_t \cos(2\omega t) + AB\sum_t \sin(2\omega t).

Now, by using (6.13), $\sum_t \cos(2\omega t) = O(1)$ and $\sum_t \sin(2\omega t) = O(1)$, hence $L_T(A,B,\omega) = -\frac{1}{2}S_T(A,B,\omega) + O(1)$ as required.

(iii) To obtain the profile likelihood, let us suppose that $\omega$ is known. Then the MLE of $A$ and $B$ (using $-\frac{1}{2}S_T$) is

\hat{A}_T(\omega) = \frac{2}{T}\sum_t X_t\cos(\omega t), \qquad \hat{B}_T(\omega) = \frac{2}{T}\sum_t X_t\sin(\omega t).

Thus the profile likelihood (using the approximation $-\frac{1}{2}S_T$) is

-\frac{1}{2}S_{P}(\omega) = -\frac{1}{2}\Big( \sum_t X_t^2 - 2\sum_t X_t\big( \hat{A}_T(\omega)\cos(\omega t) + \hat{B}_T(\omega)\sin(\omega t) \big) + \frac{T}{2}\big( \hat{A}_T(\omega)^2 + \hat{B}_T(\omega)^2 \big) \Big)
 = -\frac{1}{2}\sum_t X_t^2 + \frac{T}{4}\big( \hat{A}_T(\omega)^2 + \hat{B}_T(\omega)^2 \big).

Thus the $\omega$ which maximises $-\frac{1}{2}S_{P}(\omega)$ is the $\omega$ which maximises $\hat{A}_T(\omega)^2 + \hat{B}_T(\omega)^2$. Since

\hat{A}_T(\omega)^2 + \hat{B}_T(\omega)^2 = \frac{4}{T^2}\Big| \sum_t X_t \exp(it\omega) \Big|^2,

we have

\hat{\omega}_T = \arg\max_{\omega}\, \big(-\tfrac{1}{2}S_{P}(\omega)\big) = \arg\max_{\omega} \big( \hat{A}_T(\omega)^2 + \hat{B}_T(\omega)^2 \big) = \arg\max_{\omega} \Big| \sum_t X_t \exp(it\omega) \Big|^2,

as required. Graphically, one can plot $|\sum_t X_t \exp(it\omega)|^2$ against $\omega$ on a fine grid (up to a factor $1/T$ this is the periodogram of the data) and read off the location of the peak.
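A numerical sketch of the estimator in part (iii) (not part of the notes; the simulated data, true parameter values and grid resolution are illustrative choices):

```python
import numpy as np

# Evaluate |sum_t X_t exp(i t omega)|^2 on a fine grid and locate its peak.
rng = np.random.default_rng(3)
T = 300
A, B, omega0 = 1.0, 0.5, 1.3
t = np.arange(1, T + 1)
x = A * np.cos(omega0 * t) + B * np.sin(omega0 * t) + rng.normal(size=T)

omegas = np.linspace(0.01, np.pi - 0.01, 6000)
P = np.abs(np.exp(1j * np.outer(omegas, t)) @ x) ** 2   # profile objective
omega_hat = omegas[np.argmax(P)]
print("omega_hat:", omega_hat, "   true omega:", omega0)
```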

(iv) Differentiating both sides of (6.12) with respect to $\Omega$ and considering the real and imaginary parts gives

\sum_{t=1}^{T} t\cos(\Omega t) = O(T), \qquad \sum_{t=1}^{T} t\sin(\Omega t) = O(T).

Differentiating both sides of (6.12) twice with respect to $\Omega$ gives the $O(T^2)$ bounds for the second pair of sums.

(v) Differentiating $S_T(A,B,\omega) = \sum_t X_t^2 - 2\sum_t X_t\big( A\cos(\omega t) + B\sin(\omega t) \big) + \frac{T}{2}(A^2 + B^2)$ twice with respect to $A$, $B$ and $\omega$ gives

\frac{\partial S_T}{\partial A} = -2\sum_t X_t\cos(\omega t) + TA, \qquad \frac{\partial S_T}{\partial B} = -2\sum_t X_t\sin(\omega t) + TB,
\frac{\partial S_T}{\partial \omega} = 2A\sum_t t X_t\sin(\omega t) - 2B\sum_t t X_t\cos(\omega t),

and

\frac{\partial^2 S_T}{\partial A^2} = T, \qquad \frac{\partial^2 S_T}{\partial B^2} = T, \qquad \frac{\partial^2 S_T}{\partial A\,\partial B} = 0,
\frac{\partial^2 S_T}{\partial \omega\,\partial A} = 2\sum_t t X_t\sin(\omega t), \qquad \frac{\partial^2 S_T}{\partial \omega\,\partial B} = -2\sum_t t X_t\cos(\omega t),
\frac{\partial^2 S_T}{\partial \omega^2} = 2\sum_t t^2 X_t\big( A\cos(\omega t) + B\sin(\omega t) \big).

Now, taking expectations of the above and using (iv), we have

\mathrm{E}\Big( \frac{\partial^2 S_T}{\partial \omega\,\partial A} \Big) = 2\sum_t t\sin(\omega t)\big( A\cos(\omega t) + B\sin(\omega t) \big) = 2B\sum_t t\sin^2(\omega t) + 2A\sum_t t\sin(\omega t)\cos(\omega t)
 = B\sum_t t\big(1 - \cos(2\omega t)\big) + A\sum_t t\sin(2\omega t) = B\,\frac{T(T+1)}{2} + O(T) = \frac{T^2}{2}B + O(T).

Using a similar argument we can show that $\mathrm{E}\big( \frac{\partial^2 S_T}{\partial \omega\,\partial B} \big) = -\frac{T^2}{2}A + O(T)$ and

\mathrm{E}\Big( \frac{\partial^2 S_T}{\partial \omega^2} \Big) = 2\sum_t t^2\big( A\cos(\omega t) + B\sin(\omega t) \big)^2 = (A^2 + B^2)\,\frac{T(T+1)(2T+1)}{6} + O(T^2) = \frac{T^3}{3}(A^2 + B^2) + O(T^2).

Since $\mathrm{E}\big( \frac{\partial^2 L_T}{\partial(A,B,\omega)\,\partial(A,B,\omega)'} \big) = -\frac{1}{2}\mathrm{E}\big( \frac{\partial^2 S_T}{\partial(A,B,\omega)\,\partial(A,B,\omega)'} \big)$, this gives the required result.

(vi) By the block inversion formula, the asymptotic variance of the profile likelihood estimator $\hat{\omega}_T$ is the reciprocal of $I_{\omega\omega} - I_{\omega(A,B)} I_{(A,B)(A,B)}^{-1} I_{(A,B)\omega}$. Using the entries from part (v), $I_{(A,B)(A,B)} = \mathrm{diag}(T/2, T/2)$, $I_{\omega(A,B)} = (\frac{T^2}{4}B, -\frac{T^2}{4}A)$ and $I_{\omega\omega} = \frac{T^3}{6}(A^2+B^2)$, up to lower-order terms, so

I_{\omega\omega} - I_{\omega(A,B)} I_{(A,B)(A,B)}^{-1} I_{(A,B)\omega} = \frac{(A^2+B^2)T^3}{6} - \frac{(A^2+B^2)T^3}{8} + O(T^2) = \frac{(A^2+B^2)T^3}{24} + O(T^2),

and the asymptotic variance of $\hat{\omega}_T$ is approximately $24/\big((A^2+B^2)T^3\big)$. Thus we observe that the asymptotic variance of $\hat{\omega}_T$ is $O(T^{-3})$. Typically estimators have a variance of order $O(T^{-1})$, so the variance of $\hat{\omega}_T$ converges to zero much faster than that of most parameter estimators. Thus the estimator is extremely good compared with the majority of parameter estimators.
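A quick simulation check of this rate (a sketch, not from the notes; $T$, $A$, $B$, $\omega$, the grid and the number of replications are arbitrary illustrative choices):

```python
import numpy as np

# Compare the empirical variance of omega_hat (the periodogram maximiser) with
# the asymptotic value 24 / ((A^2 + B^2) T^3) derived in part (vi).
rng = np.random.default_rng(6)
T, A, B, omega0 = 200, 1.5, 1.0, 1.3
t = np.arange(1, T + 1)
omegas = np.linspace(0.01, np.pi - 0.01, 20001)
E = np.exp(1j * np.outer(omegas, t))      # precompute e^{i t omega} on the grid

estimates = []
for _ in range(100):
    x = A * np.cos(omega0 * t) + B * np.sin(omega0 * t) + rng.normal(size=T)
    estimates.append(omegas[np.argmax(np.abs(E @ x) ** 2)])

print("empirical variance           :", np.var(estimates))
print("asymptotic 24/((A^2+B^2)T^3) :", 24 / ((A**2 + B**2) * T**3))
```

The two numbers should be of the same order, illustrating the unusually fast $O(T^{-3})$ rate.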

Example: An application of profiling in survival analysis

Question

(This question also uses some methods from survival analysis, which is covered later in this course; see Sections 13.1 and 19.1.)

Let $T_i$ denote the survival time of an electrical component. It is known that the regressors $x_i$ influence the survival time $T_i$. To model the influence the regressors have on the survival time, the Cox proportional hazards model is used with the exponential distribution as the baseline distribution and $\psi(x_i; \beta) = \exp(\beta x_i)$ as the link function. More precisely, the survival function of $T_i$ is

\mathcal{F}_i(t) = \mathcal{F}_0(t)^{\psi(x_i;\beta)}, \qquad \text{where } \mathcal{F}_0(t) = \exp(-t/\theta).

Not all the survival times of the electrical components are observed; censoring can arise. Hence we observe $Y_i = \min(T_i, c_i)$, where $c_i$ is the censoring time, together with the indicator variable $\delta_i$, where $\delta_i = 0$ denotes that the $i$th component is censored and $\delta_i = 1$ denotes that it is not censored. The parameters $\beta$ and $\theta$ are unknown.

(i) Derive the log-likelihood of $\{(Y_i, \delta_i)\}$.

(ii) Compute the profile likelihood of the regression parameter $\beta$, profiling out the baseline parameter $\theta$.

Solution

(i) The density and survival function are

f_i(t) = \psi(x_i;\beta)\, \mathcal{F}_0(t)^{\psi(x_i;\beta) - 1} f_0(t) \qquad \text{and} \qquad \mathcal{F}_i(t) = \mathcal{F}_0(t)^{\psi(x_i;\beta)},

where $f_0(t) = \theta^{-1}\exp(-t/\theta)$ is the baseline density. Hence for this example we have

\log f_i(t) = \log\psi(x_i;\beta) - \big(\psi(x_i;\beta) - 1\big)\frac{t}{\theta} - \log\theta - \frac{t}{\theta} = \log\psi(x_i;\beta) - \psi(x_i;\beta)\frac{t}{\theta} - \log\theta, \qquad \log \mathcal{F}_i(t) = -\psi(x_i;\beta)\frac{t}{\theta}.

Therefore, the log-likelihood is

L_n(\beta, \theta) = \sum_{i=1}^{n} \delta_i\Big( \log\psi(x_i;\beta) + \big(\psi(x_i;\beta) - 1\big)\log\mathcal{F}_0(Y_i) + \log f_0(Y_i) \Big) + \sum_{i=1}^{n} (1 - \delta_i)\,\psi(x_i;\beta)\log\mathcal{F}_0(Y_i)
 = \sum_{i=1}^{n} \delta_i\big( \log\psi(x_i;\beta) - \log\theta \big) - \sum_{i=1}^{n} \psi(x_i;\beta)\frac{Y_i}{\theta}.

(ii) Keeping $\beta$ fixed, differentiating the above with respect to $\theta$ and equating to zero gives

\frac{\partial L_n}{\partial\theta} = -\sum_{i=1}^{n}\frac{\delta_i}{\theta} + \sum_{i=1}^{n}\psi(x_i;\beta)\frac{Y_i}{\theta^2} = 0, \qquad \text{hence} \qquad \hat{\theta}(\beta) = \frac{\sum_{i=1}^{n}\psi(x_i;\beta)Y_i}{\sum_{i=1}^{n}\delta_i}.

Hence the profile likelihood of $\beta$ is

\ell_P(\beta) = \sum_{i=1}^{n} \delta_i\big( \log\psi(x_i;\beta) - \log\hat{\theta}(\beta) \big) - \sum_{i=1}^{n}\psi(x_i;\beta)\frac{Y_i}{\hat{\theta}(\beta)}.

Hence, to obtain an estimator of $\beta$, we maximise the above with respect to $\beta$.
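A numerical sketch of this profile likelihood (not part of the notes; the censored data are simulated under an assumed data-generating mechanism consistent with the model, and the parameter values, censoring distribution and optimiser bounds are illustrative choices):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Profile out the baseline parameter theta, then maximise over beta.
rng = np.random.default_rng(4)
n = 400
x = rng.normal(size=n)
beta_true, theta_true = 0.7, 2.0
surv = rng.exponential(scale=theta_true / np.exp(beta_true * x))   # T_i
c = rng.exponential(scale=3.0, size=n)                             # censoring times
y = np.minimum(surv, c)                                            # Y_i = min(T_i, c_i)
delta = (surv <= c).astype(float)                                  # 1 = not censored

def theta_hat(beta):
    # hat(theta)(beta) = sum_i psi(x_i;beta) Y_i / sum_i delta_i
    return np.sum(np.exp(beta * x) * y) / np.sum(delta)

def profile_loglik(beta):
    # ell_P(beta) with psi(x;beta) = exp(beta x), so log psi(x_i;beta) = beta x_i
    th = theta_hat(beta)
    return np.sum(delta * (beta * x - np.log(th))) - np.sum(np.exp(beta * x) * y) / th

res = minimize_scalar(lambda b: -profile_loglik(b), bounds=(-3.0, 3.0), method="bounded")
print("beta_hat:", res.x, "   theta_hat:", theta_hat(res.x))
```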

An application of profiling in semi-parametric regression

We now consider how the profile "likelihood" (we use inverted commas here because we do not use the likelihood, but least squares instead) can be used in semi-parametric regression. Recently this type of method has been used widely in various semi-parametric models. This section needs a little knowledge of nonparametric regression, which is covered later in this course.

Suppose we observe $(Y_t, U_t, X_t)$, where

Y_t = \beta X_t + \phi(U_t) + \varepsilon_t,

$(Y_t, X_t, U_t)$ are iid random variables and $\phi$ is an unknown function. To estimate $\beta$, we first profile out $\phi(\cdot)$, which we estimate as if $\beta$ were known. In other words, we suppose that $\beta$ is known and let $Y_t(\beta) = Y_t - \beta X_t$. We then estimate $\phi(\cdot)$ using the classical local least squares estimator, in other words the $\phi(\cdot)$ which minimises the criterion

\hat{\phi}_\beta(u) = \arg\min_{a} \sum_t W_b(u - U_t)\big( Y_t(\beta) - a \big)^2 = \frac{\sum_t W_b(u - U_t)Y_t(\beta)}{\sum_t W_b(u - U_t)}
 = \frac{\sum_t W_b(u - U_t)Y_t}{\sum_t W_b(u - U_t)} - \beta\,\frac{\sum_t W_b(u - U_t)X_t}{\sum_t W_b(u - U_t)} =: G_b(u) - \beta H_b(u),   (6.14)

where $W_b(\cdot)$ is a kernel with bandwidth $b$, $G_b(u) = \sum_t W_b(u - U_t)Y_t / \sum_t W_b(u - U_t)$ and $H_b(u) = \sum_t W_b(u - U_t)X_t / \sum_t W_b(u - U_t)$. Thus, given $\beta$, the estimator of $\phi$ and the residuals are

\hat{\phi}_\beta(u) = G_b(u) - \beta H_b(u) \qquad \text{and} \qquad Y_t - \beta X_t - \hat{\phi}_\beta(U_t).

Given the estimated residuals $Y_t - \beta X_t - \hat{\phi}_\beta(U_t)$, we can now use least squares to estimate the coefficient $\beta$, with criterion

L_T(\beta) = \sum_t \big( Y_t - \beta X_t - \hat{\phi}_\beta(U_t) \big)^2 = \sum_t \big( Y_t - \beta X_t - G_b(U_t) + \beta H_b(U_t) \big)^2 = \sum_t \big( [Y_t - G_b(U_t)] - \beta[X_t - H_b(U_t)] \big)^2.

Therefore, the least squares estimator of $\beta$ is

\hat{\beta}_b = \frac{\sum_t [Y_t - G_b(U_t)][X_t - H_b(U_t)]}{\sum_t [X_t - H_b(U_t)]^2}.

Using $\hat{\beta}_b$ we can then estimate $\phi(\cdot)$ via (6.14). We observe how we have used the principle of profiling to estimate the unknown parameters. There is a large literature on this, including

Wahba, Speckman, Carroll, Fan, etc. In particular, it has been shown that under some conditions on $b$ (as $T \to \infty$), the estimator $\hat{\beta}_b$ attains the usual parametric rate of convergence.

It should be mentioned that using random regressors $U_t$ is not necessary. It could be that $U_t = t/T$ (on a grid). In this case

\hat{\phi}_\beta(u) = \arg\min_{a} \sum_t W_b\big(u - \tfrac{t}{T}\big)\big( Y_t(\beta) - a \big)^2 = \frac{\sum_t W_b(u - \tfrac{t}{T})Y_t(\beta)}{\sum_t W_b(u - \tfrac{t}{T})}
 = \frac{\sum_t W_b(u - \tfrac{t}{T})Y_t}{\sum_t W_b(u - \tfrac{t}{T})} - \beta\,\frac{\sum_t W_b(u - \tfrac{t}{T})X_t}{\sum_t W_b(u - \tfrac{t}{T})} =: G_b(u) - \beta H_b(u),   (6.15)

where now $G_b(u) = \sum_t W_b(u - t/T)Y_t / \sum_t W_b(u - t/T)$ and $H_b(u) = \sum_t W_b(u - t/T)X_t / \sum_t W_b(u - t/T)$. Using the above estimator of $\phi(\cdot)$ we continue as before.
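A numerical sketch of the profiling estimator (6.14) (not part of the notes; the Gaussian kernel, bandwidth, sample size and the particular $\phi$ are illustrative choices):

```python
import numpy as np

# Smooth Y and X on U with a kernel, then regress the Y-residual on the X-residual.
rng = np.random.default_rng(5)
T, beta_true, b = 500, 1.5, 0.1
U = rng.uniform(size=T)
X = rng.normal(size=T)
phi = np.sin(2 * np.pi * U)                   # the unknown function phi(u)
Y = beta_true * X + phi + 0.3 * rng.normal(size=T)

def nw_smooth(u_eval, U, Z, b):
    # Local least squares / Nadaraya-Watson: sum_t W_b(u - U_t) Z_t / sum_t W_b(u - U_t)
    W = np.exp(-0.5 * ((u_eval[:, None] - U[None, :]) / b) ** 2)
    return (W @ Z) / W.sum(axis=1)

G = nw_smooth(U, U, Y, b)                     # G_b(U_t)
H = nw_smooth(U, U, X, b)                     # H_b(U_t)
beta_hat = np.sum((Y - G) * (X - H)) / np.sum((X - H) ** 2)
phi_hat = G - beta_hat * H                    # hat(phi)_{beta_hat}(U_t), as in (6.14)
print("beta_hat:", beta_hat)
```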