Section 4: Conditional Likelihood: Sufficiency and Ancillarity in the Presence of Nuisance Parameters

In this section we will: (i) explore decomposing information into conditional and marginal likelihood parts; (ii) give conditions (ancillarities) under which the conditional likelihood is efficient.
Section 4.1 Partial sufficiency and ancillarity

Suppose that $X \sim p(x;\theta_0) \in \mathcal{P} = \{p(x;\theta) : \theta \in \Theta\}$. We assume $\theta = (\gamma,\lambda)$ and $\Theta = \Gamma \times \Lambda$, where $\Gamma$ is an open interval of the real line and $\Lambda$ is an open set in $\mathbb{R}^k$. The parameter of interest is $\gamma$ and the nuisance parameter is $\lambda$.

Let $T(X)$ be a statistic. Then we know that
\[ p(x;\theta) = h(x \mid T(x);\theta)\, f(T(x);\theta), \]
where $h$ is the conditional density of $X$ given $T(X)$ and $f$ is the marginal density of $T(X)$.
Partial Sufficiency

Definition: If $h(x \mid T(x);\theta) = h(x \mid T(x);\lambda)$ and $f(T(x);\theta) = f(T(x);\gamma)$, i.e.,
\[ p(x;\theta) = h(x \mid T(x);\lambda)\, f(T(x);\gamma), \]
then we say that $T(X)$ is partially sufficient for $\gamma$. The term "partial" is used because sufficiency must be established for each fixed $\lambda$.

Inference about $\gamma$ can be made using only the marginal distribution of $T(X)$; there will be no loss of information. If the score for $\gamma$ is in the class of unbiased estimating functions, then the score is the most efficient member of the class. See Basu (JASA, 1977, pp. 355-366).
Example 4.1: Partial Sufficiency

Suppose that $X$ has density
\[
p(x;\theta) =
\begin{cases}
(1-\gamma)\,\lambda \exp(\lambda x), & x \le 0, \\
\gamma\,\lambda \exp(-\lambda x), & x > 0.
\end{cases}
\]
Here $\Gamma = (0,1)$ and $\Lambda = \mathbb{R}^+$. Show that $T(X) = I(X > 0)$ is partially sufficient for $\gamma$. To see this, note that
\[
h(x \mid T(x);\theta) = (\lambda \exp(-\lambda x))^{T(x)}\, (\lambda \exp(\lambda x))^{1-T(x)}, \qquad
f(T(x);\theta) = \gamma^{T(x)}\, (1-\gamma)^{1-T(x)}.
\]
The conditional part depends only on $\lambda$ and the marginal part only on $\gamma$, as required.
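A quick way to see the factorization at work is by simulation. Below is a minimal sketch (the parameter values and sample size are illustrative assumptions): the marginal part makes $T(X)$ a Bernoulli($\gamma$) indicator, so the MLE of $\gamma$ from $T$ alone is the sample proportion of positive observations, and since $h$ does not involve $\gamma$, nothing more about $\gamma$ can be extracted from the data.

```python
import numpy as np

# A minimal simulation sketch of Example 4.1; the parameter values and
# sample size below are illustrative assumptions.
rng = np.random.default_rng(0)
gamma, lam, n = 0.3, 2.0, 100_000

# Draw X from the two-sided exponential density:
# with prob. gamma, X = +Exp(lam); with prob. 1 - gamma, X = -Exp(lam).
sign = rng.random(n) < gamma
mag = rng.exponential(scale=1.0 / lam, size=n)
x = np.where(sign, mag, -mag)

t = (x > 0).astype(float)  # T(X) = I(X > 0) ~ Bernoulli(gamma)

# The marginal part f makes the MLE of gamma from T alone the sample
# proportion; since h does not involve gamma, this is also the full MLE.
print("gamma-hat from T(X):", t.mean())

# The conditional part: given T(X) = 1, X should be Exp(lam) whatever gamma is.
print("mean of X given X > 0:", x[x > 0].mean(), "vs 1/lam =", 1.0 / lam)
```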
Partial Ancillarity

Suppose that $h(x \mid T(x);\theta) = h(x \mid T(x);\gamma)$, while the marginal $f(T(x);\theta)$ may depend on all of $\theta$; i.e.,
\[ p(x;\theta) = h(x \mid T(x);\gamma)\, f(T(x);\theta). \]
Here we make inference about $\gamma$ via the conditional distribution. We will place assumptions on the marginal distribution of $T(X)$ in order to guarantee that there will not be information loss.
Section 4.2 Important types of partial ancillarity

S-Ancillarity

Definition: $T(X)$ is said to be S-ancillary for $\gamma$ if $f(T(x);\theta)$ depends only on $\lambda$, i.e.,
\[ p(x;\theta) = h(x \mid T(x);\gamma)\, f(T(x);\lambda). \]
This definition is equivalent to saying that $T(X)$ is partially sufficient for $\lambda$. See Basu (JASA, 1977, pp. 355-366).
R-Ancillarity

Definition: $T(X)$ is said to be R-ancillary for $\gamma$ if there exists a reparameterization between $\theta = (\gamma,\lambda)$ and $(\gamma,\phi)$ such that $f(T(x);\theta)$ depends on $\theta$ only through $\phi$. That is,
\[ p(x;\theta) = h(x \mid T(x);\gamma)\, f(T(x);\phi). \]
See Basawa (Biometrika, 1981, pp. 153-164).
C-Ancillarity

Definition: $T(X)$ is said to be C-ancillary for $\gamma$ if, for all $\gamma \in \Gamma$, the class $\{f(T(x);\gamma,\lambda) : \lambda \in \Lambda\}$ is complete.

Completeness: $E_\theta[m(T(X);\gamma)] = 0$ for all $\lambda \in \Lambda$ implies $P_\theta[m(T(X);\gamma) = 0] = 1$ for all $\lambda \in \Lambda$.

See Godambe (Biometrika, 1976, pp. 277-284).
A (Weak)-Ancillarity

Definition: $T(X)$ is said to be A-ancillary if for any given $\theta_0 \in \Theta$ and any other $\gamma \in \Gamma$, there exists a $\lambda = \lambda(\gamma,\theta_0)$ such that
\[ f(t;\theta_0) = f(t;\gamma,\lambda) \quad \text{for all } t. \]
If this condition holds, then $\{f(T(x);\gamma,\lambda) : \lambda \in \Lambda\}$ is the same whatever the value of $\gamma$. Intuitively, observation of $T(X)$ cannot give us any information about $\gamma$ when $\lambda$ is unknown. See Andersen (JRSS-B, 1970, pp. 283-301).
Example 4.2: Partial Ancillarity

Let $X = (Y,Z)$, where $Y$ and $Z$ are independent normal random variables with variance 1 and means $\gamma$ and $\gamma+\lambda$, respectively. Here $\Gamma = \Lambda = \mathbb{R}$. Let $T(X) = Z$. Then we know that $Y \mid T(X) \sim N(\gamma,1)$ and $T(X) \sim N(\gamma+\lambda,1)$.

$T(X)$ is not S-ancillary for $\gamma$, since the marginal mean $\gamma+\lambda$ involves $\gamma$. Let $\phi = \gamma+\lambda$, so that $T(X) \sim N(\phi,1)$. Then $T(X)$ is R-ancillary for $\gamma$.

For fixed $\gamma$, we see that
\[
\begin{aligned}
f(t;\theta) &= (2\pi)^{-1/2} \exp\!\left(-\tfrac{1}{2}\left(t^2 - 2\gamma t - 2\lambda t + (\gamma+\lambda)^2\right)\right) \\
&= (2\pi)^{-1/2} \exp\!\left(-\tfrac{1}{2}(t^2 - 2\gamma t)\right) \exp\!\left(-\tfrac{1}{2}(\gamma+\lambda)^2\right) \exp(\lambda t).
\end{aligned}
\]
By exponential family results, we know that, for fixed $\gamma$,
\[ \{f(t;\gamma,\lambda) : \lambda \in \mathbb{R}\} \]
is complete. Thus, we know that $T(X)$ is C-ancillary.

For fixed $\theta_0$ and given $\gamma$, we define $\lambda = \lambda(\gamma,\theta_0) = -\gamma + \gamma_0 + \lambda_0$. Then for all $t$, we know that
\[ f(t;\theta_0) = f(t;\gamma,\lambda). \]
So, $T(X)$ is A-ancillary.
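These claims are easy to check numerically. The sketch below (the parameter values are illustrative assumptions) confirms that the marginal density of $T(X)$ depends on $(\gamma,\lambda)$ only through $\phi = \gamma+\lambda$, and that the conditional score $\psi^C_\gamma = Y - \gamma$ carries information $I^C_\gamma = 1$.

```python
import numpy as np
from scipy.stats import norm

# Numerical check for Example 4.2 (parameter values are illustrative
# assumptions).  Two parameter points with the same phi = gamma + lambda
# give the identical marginal density for T(X) = Z: T alone cannot
# separate gamma from lambda.
t_grid = np.linspace(-4.0, 6.0, 101)
f1 = norm.pdf(t_grid, loc=1.0 + 0.5)    # (gamma, lambda) = (1.0, 0.5)
f2 = norm.pdf(t_grid, loc=2.0 - 0.5)    # (gamma, lambda) = (2.0, -0.5)
print("marginals agree:", np.allclose(f1, f2))

# The conditional score from h(y | z; gamma) = N(gamma, 1) is Y - gamma,
# with information I^C = 1.
rng = np.random.default_rng(1)
gamma = 1.0
y = rng.normal(gamma, 1.0, size=200_000)
print("I^C estimate:", np.mean((y - gamma) ** 2))
```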
Section 4.3 Information decomposition

Information Decomposition

Suppose $h(x \mid T(x);\theta) = h(x \mid T(x);\gamma)$ while $f(T(x);\theta)$ may depend on all of $\theta$, i.e.,
\[ p(x;\theta) = h(x \mid T(x);\gamma)\, f(T(x);\theta). \]

Recall: The Fisher information for $\gamma$ was defined as
\[
I_\gamma(\theta) = I_{\gamma\gamma}(\theta) - I_{\gamma\lambda}(\theta)\, I_{\lambda\lambda}(\theta)^{-1} I_{\lambda\gamma}(\theta)
= E_\theta\!\left[\left(\psi_\gamma(X;\theta) - \Pi[\psi_\gamma(X;\theta) \mid \Lambda_\theta]\right)^2\right].
\]
The generalized Fisher information for $\gamma$ from a class of unbiased estimating functions $\mathcal{G}$ was defined as
\[
I^*_\gamma(\theta) = E_\theta\!\left[\left(\psi_\gamma(X;\theta) - \Pi[\psi_\gamma(X;\theta) \mid \mathcal{G}_\gamma]\right)^2\right].
\]
Let $\psi^C_\gamma(X;\gamma)$ and $\psi^M_\gamma(X;\theta)$ be the scores for $\gamma$ based on the conditional and marginal parts of the factorization of the density of $X$, so that $\psi_\gamma(X;\theta) = \psi^C_\gamma(X;\gamma) + \psi^M_\gamma(X;\theta)$. Let $\psi_\lambda(X;\theta) = \psi^M_\lambda(X;\theta)$ be the score for $\lambda$ from the marginal part of the factorization. Note that the nuisance tangent space $\Lambda_\theta$ is the same as the nuisance tangent space from the marginal density of $T(X)$.

The Fisher information for $\gamma$ contained in the conditional distribution $h$ is
\[ I^C_\gamma(\theta) = E_\theta[\psi^C_\gamma(X;\gamma)^2]. \]
The generalized Fisher information for $\gamma$ from a class of unbiased estimating functions $\mathcal{G}$ and based on the conditional distribution $h$ is
\[ I^{*C}_\gamma(\theta) = E_\theta\!\left[\left(\psi^C_\gamma(X;\gamma) - \Pi[\psi^C_\gamma(X;\gamma) \mid \mathcal{G}_\gamma]\right)^2\right]. \]
If $\psi^C_\gamma(X;\gamma) \perp \mathcal{G}_\gamma$, then $I^{*C}_\gamma(\theta) = I^C_\gamma(\theta)$.

The Fisher information for $\gamma$ contained in the marginal distribution $f$ is
\[ I^M_\gamma(\theta) = E_\theta\!\left[\left(\psi^M_\gamma(X;\theta) - \Pi[\psi^M_\gamma(X;\theta) \mid \Lambda_\theta]\right)^2\right]. \]
The generalized Fisher information for $\gamma$ from a class of unbiased estimating functions $\mathcal{G}$ based on the marginal distribution $f$ is
\[ I^{*M}_\gamma(\theta) = E_\theta\!\left[\left(\psi^M_\gamma(X;\theta) - \Pi[\psi^M_\gamma(X;\theta) \mid \mathcal{G}_\gamma]\right)^2\right]. \]
Theorem 4.1: If $\psi^C_\gamma(X;\gamma) \perp \mathcal{G}_\gamma$, then $I^*_\gamma(\theta) = I^C_\gamma(\theta) + I^{*M}_\gamma(\theta)$.

Proof:
\[
\begin{aligned}
I^*_\gamma(\theta) &= E_\theta\!\left[\left(\psi_\gamma(X;\theta) - \Pi[\psi_\gamma(X;\theta) \mid \mathcal{G}_\gamma]\right)^2\right] \\
&= E_\theta\!\left[\left(\psi^C_\gamma(X;\gamma) + \psi^M_\gamma(X;\theta) - \Pi[\psi^M_\gamma(X;\theta) \mid \mathcal{G}_\gamma]\right)^2\right] \\
&= I^C_\gamma(\theta) + I^{*M}_\gamma(\theta) + 2 E_\theta\!\left[\psi^C_\gamma(X;\gamma)\left(\psi^M_\gamma(X;\theta) - \Pi[\psi^M_\gamma(X;\theta) \mid \mathcal{G}_\gamma]\right)\right] \\
&= I^C_\gamma(\theta) + I^{*M}_\gamma(\theta) + 2 E_\theta[\psi^C_\gamma(X;\gamma)\, \psi^M_\gamma(X;\theta)] \\
&= I^C_\gamma(\theta) + I^{*M}_\gamma(\theta).
\end{aligned}
\]
The second and fourth equalities use $\psi^C_\gamma \perp \mathcal{G}_\gamma$ (so that $\Pi[\psi^C_\gamma \mid \mathcal{G}_\gamma] = 0$ and $\psi^C_\gamma \perp \Pi[\psi^M_\gamma \mid \mathcal{G}_\gamma]$); the last holds because $\psi^M_\gamma$ is a function of $T(X)$ and $E_\theta[\psi^C_\gamma(X;\gamma) \mid T(X)] = 0$.
Theorem 4.2: $I_\gamma(\theta) = I^C_\gamma(\theta) + I^M_\gamma(\theta)$.

Proof: Note that $\psi^C_\gamma(X;\gamma) \perp \Lambda_\theta$.
\[
\begin{aligned}
I_\gamma(\theta) &= E_\theta\!\left[\left(\psi_\gamma(X;\theta) - \Pi[\psi_\gamma(X;\theta) \mid \Lambda_\theta]\right)^2\right] \\
&= E_\theta\!\left[\left(\psi^C_\gamma(X;\gamma) + \psi^M_\gamma(X;\theta) - \Pi[\psi^M_\gamma(X;\theta) \mid \Lambda_\theta]\right)^2\right] \\
&= I^C_\gamma(\theta) + I^M_\gamma(\theta) + 2 E_\theta\!\left[\psi^C_\gamma(X;\gamma)\left(\psi^M_\gamma(X;\theta) - \Pi[\psi^M_\gamma(X;\theta) \mid \Lambda_\theta]\right)\right] \\
&= I^C_\gamma(\theta) + I^M_\gamma(\theta) + 2 E_\theta[\psi^C_\gamma(X;\gamma)\, \psi^M_\gamma(X;\theta)] \\
&= I^C_\gamma(\theta) + I^M_\gamma(\theta).
\end{aligned}
\]
Here the cross terms vanish because $\psi^M_\gamma$ and $\Pi[\psi^M_\gamma \mid \Lambda_\theta]$ are functions of $T(X)$ while $E_\theta[\psi^C_\gamma(X;\gamma) \mid T(X)] = 0$. The information decompositions clearly spell out how the information about $\gamma$ is partitioned between the conditional and marginal parts of the factorization.
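To make the decomposition concrete, here is a Monte Carlo sketch in the setting of Example 4.2 (the parameter values and sample size are illustrative assumptions). There $\Lambda_\theta$ is spanned by $\psi_\lambda = Z-\gamma-\lambda$, the efficient score is $Y-\gamma$, and $\psi^M_\gamma = Z-\gamma-\lambda \in \Lambda_\theta$, so $I_\gamma = I^C_\gamma + I^M_\gamma$ reads $1 = 1 + 0$.

```python
import numpy as np

# Monte Carlo sketch of Theorem 4.2 in the setting of Example 4.2
# (parameter values and sample size are illustrative assumptions).
rng = np.random.default_rng(2)
gamma, lam, n = 0.7, -0.4, 500_000
y = rng.normal(gamma, 1.0, n)
z = rng.normal(gamma + lam, 1.0, n)

psi_gamma = (y - gamma) + (z - gamma - lam)  # full-data score for gamma
psi_lam = z - gamma - lam                    # nuisance score, spans Lambda_theta

# Pi[psi_gamma | Lambda_theta] = psi_lam here, so the efficient score is
# psi_gamma - psi_lam = y - gamma and I_gamma = 1.
print("I_gamma   ~", np.mean((psi_gamma - psi_lam) ** 2))

psi_C = y - gamma                            # conditional score: h(y|z) = N(gamma, 1)
print("I^C_gamma ~", np.mean(psi_C ** 2))

# psi^M_gamma = z - gamma - lam lies in Lambda_theta, so I^M_gamma = 0 and
# the decomposition I_gamma = I^C_gamma + I^M_gamma reads 1 = 1 + 0.
psi_M = z - gamma - lam
print("I^M_gamma ~", np.mean((psi_M - psi_lam) ** 2))
```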
Section 4.4 Optimality of the conditional score under ancillarities

What is the efficiency of $\psi^C_\gamma(X;\gamma)$? Assume that $\psi^C_\gamma(X;\gamma) \perp \mathcal{G}_\gamma$. Then
\[
\mathrm{Eff}_\theta[\psi^C_\gamma(X;\gamma)]
= \frac{E_\theta[\partial \psi^C_\gamma(X;\gamma)/\partial\gamma]^2}{E_\theta[\psi^C_\gamma(X;\gamma)^2]}
= \frac{\{E_\theta[\partial^2 \log h(X \mid T(X);\gamma)/\partial\gamma^2]\}^2}{E_\theta[(\partial \log h(X \mid T(X);\gamma)/\partial\gamma)^2]}
= E_\theta[\psi^C_\gamma(X;\gamma)^2] = I^C_\gamma(\theta) = I^{*C}_\gamma(\theta).
\]
Now, we say a UEF $g_0$ is optimal for $\gamma$ if $\mathrm{Eff}_\theta[g_0] = I^*_\gamma(\theta)$. By Theorem 4.1, the conditional score function will be optimal if $I^{*M}_\gamma(\theta) = 0$. We will now give conditions under which this latter condition holds.
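Before turning to those conditions, here is a quick numerical illustration (the Bernoulli model is an assumption chosen for concreteness) of the identity used above: for a score function, the Godambe efficiency $E[\partial\psi/\partial\gamma]^2 / E[\psi^2]$ collapses to $E[\psi^2]$.

```python
import numpy as np

# Numerical sketch of the identity Eff[psi] = E[psi^2] for a score, using a
# Bernoulli(g) model as an illustrative assumption.  With
# psi(X; g) = (X - g) / (g(1 - g)), we have E[d psi/d g] = -1/(g(1-g)) and
# E[psi^2] = 1/(g(1-g)), so the Godambe ratio collapses to E[psi^2].
rng = np.random.default_rng(6)
g, n = 0.3, 1_000_000
x = (rng.random(n) < g).astype(float)

psi = (x - g) / (g * (1 - g))
dpsi = (-(g * (1 - g)) - (x - g) * (1 - 2 * g)) / (g * (1 - g)) ** 2

eff = np.mean(dpsi) ** 2 / np.mean(psi ** 2)
print("Eff ~", eff, " E[psi^2] ~", np.mean(psi ** 2),
      " 1/(g(1-g)) =", 1 / (g * (1 - g)))
```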
Lemma 4.3: If $T(X)$ is S-ancillary, then the conditional score function is optimal.

Proof:
\[ I^{*M}_\gamma(\theta) = E_\theta\!\left[\left(\psi^M_\gamma(X;\theta) - \Pi[\psi^M_\gamma(X;\theta) \mid \mathcal{G}_\gamma]\right)^2\right] = 0, \]
since $\psi^M_\gamma(X;\theta) = 0$: under S-ancillarity the marginal density of $T(X)$ does not involve $\gamma$.
Lemma 4.4: If $T(X)$ is R-ancillary, then the conditional score function is optimal.

Proof: It is sufficient to show that $I^M_\gamma(\theta) = 0$, since we know that $0 \le I^{*M}_\gamma(\theta) \le I^M_\gamma(\theta)$. Since $T(X)$ is R-ancillary, we know that there exists a one-to-one reparameterization between $\theta = (\gamma,\lambda)$ and $(\gamma,\phi)$ such that $f(T(x);\theta)$ depends on $\theta$ only through $\phi$. That is,
\[ f(t;\theta) = f^*(t;\phi(\theta)). \]
Under suitable regularity conditions, we know that
\[
\psi^M_\gamma(\theta) = [\partial\phi(\theta)/\partial\gamma]^\top\, \partial \log f^*(T(X);\phi)/\partial\phi \big|_{\phi=\phi(\theta)}, \qquad
\psi^M_\lambda(\theta) = [\partial\phi(\theta)/\partial\lambda]^\top\, \partial \log f^*(T(X);\phi)/\partial\phi \big|_{\phi=\phi(\theta)}.
\]
Assuming that $\partial\phi(\theta)/\partial\lambda$ is nonsingular, we know that
\[ \partial \log f^*(T(X);\phi)/\partial\phi \big|_{\phi=\phi(\theta)} = \left([\partial\phi(\theta)/\partial\lambda]^\top\right)^{-1} \psi^M_\lambda(\theta). \]

This implies that
\[ \psi^M_\gamma(\theta) = [\partial\phi(\theta)/\partial\gamma]^\top \left([\partial\phi(\theta)/\partial\lambda]^\top\right)^{-1} \psi^M_\lambda(\theta), \]
a linear combination of the components of $\psi^M_\lambda(\theta)$. Hence $\psi^M_\gamma(\theta) \in \Lambda_\theta$, which implies that $\Pi[\psi^M_\gamma(\theta) \mid \Lambda_\theta] = \psi^M_\gamma(\theta)$. So,
\[ I^M_\gamma(\theta) = E_\theta\!\left[\left(\psi^M_\gamma(\theta) - \Pi[\psi^M_\gamma(\theta) \mid \Lambda_\theta]\right)^2\right] = 0. \]
Lemma 4.5: If $T(X)$ is C-ancillary, then the conditional score function is optimal.

Proof: It suffices to show that $\psi^M_\gamma(X;\theta) \in \mathcal{G}_\gamma$, i.e., that $\psi^M_\gamma$ is orthogonal to every UEF in the class. Since $T(X)$ is C-ancillary, we know that for all $\gamma \in \Gamma$, the family $\{f(t;\gamma,\lambda) : \lambda \in \Lambda\}$ is complete. That is,
\[ E_\theta[m(T(X);\gamma)] = 0 \ \text{for all } \lambda \in \Lambda \implies P_\theta[m(T(X);\gamma) = 0] = 1 \ \text{for all } \lambda \in \Lambda. \]
Now, for a UEF $g(X;\gamma)$, since $\psi^M_\gamma(X;\theta)$ is a function of $T(X)$,
\[ E_\theta[\psi^M_\gamma(X;\theta)\, g(X;\gamma)] = E_\theta\!\left[\psi^M_\gamma(X;\theta)\, E_\theta[g(X;\gamma) \mid T(X)]\right]. \]
Note that $E_\theta[g(X;\gamma) \mid T(X)]$ is a function of $T(X)$ and $\gamma$ with mean zero for all $\lambda \in \Lambda$. By completeness, this implies that $E_\theta[g(X;\gamma) \mid T(X)] = 0$ a.e. So,
\[ E_\theta[\psi^M_\gamma(X;\theta)\, g(X;\gamma)] = 0, \]
which implies that $\psi^M_\gamma(X;\theta) \in \mathcal{G}_\gamma$ and hence $I^{*M}_\gamma(\theta) = 0$.
Lemma 4.6: If $T(X)$ is A-ancillary, then it is R-ancillary.

Proof: If $T(X)$ is A-ancillary, then for any given $\theta_0 \in \Theta$ and any other $\gamma \in \Gamma$, there exists a $\lambda = \lambda(\gamma,\theta_0)$ such that
\[ f(t;\theta_0) = f(t;\gamma,\lambda(\gamma,\theta_0)) \quad \text{for all } t. \]
Fix $\gamma$. For any $\theta_0$, the distribution of $T(X)$ under $\theta_0$ then depends on $\theta_0$ only through $\phi = \lambda(\gamma,\theta_0)$. So there is a transformation between $\theta_0$ and $(\gamma,\phi)$ such that $f(T(x);\theta)$ depends only on $\phi$. That is, $T(X)$ is R-ancillary.

Corollary 4.7: If $T(X)$ is A-ancillary, then the conditional score function is optimal.

Proof: If $T(X)$ is A-ancillary, then it is R-ancillary, and R-ancillarity implies that the conditional score is optimal (Lemma 4.4).
Four Examples of Ancillarity

Example 4.3: Let $X = (Y_1,Y_2)$, where $Y_1$ and $Y_2$ are independent Poisson random variables with means $\mu_1$ and $\mu_2$. Let $\gamma = \frac{\mu_1}{\mu_1+\mu_2}$ and $\lambda = \mu_1+\mu_2$. Let $T(X) = Y_1+Y_2$. Show that $T(X)$ is S-ancillary for $\gamma$.

We know that $T(X) \sim \mathrm{Poisson}(\lambda)$. The conditional distribution of $X$ given $T(X)$ is equal to
\[
\begin{aligned}
h(x \mid T(x);\theta) &= \frac{P[Y_1 = y_1, Y_2 = y_2, T(X) = t]}{P[T(X) = t]}
= \frac{P[Y_1 = y_1, Y_2 = t-y_1]\, I(y_1 \le t)}{P[Y_1+Y_2 = t]} \\
&= \frac{\exp(-\mu_1)\mu_1^{y_1}\, \exp(-\mu_2)\mu_2^{t-y_1}\, I(y_1 \le t)\, /\, \{y_1!\,(t-y_1)!\}}{\exp(-\mu_1-\mu_2)(\mu_1+\mu_2)^t\, /\, t!} \\
&= \frac{t!}{y_1!\,(t-y_1)!} \left(\frac{\mu_1}{\mu_1+\mu_2}\right)^{y_1} \left(\frac{\mu_2}{\mu_1+\mu_2}\right)^{t-y_1} I(y_1 \le t) \\
&= \frac{t!}{y_1!\,(t-y_1)!}\, \gamma^{y_1} (1-\gamma)^{t-y_1}\, I(y_1 \le t).
\end{aligned}
\]
So, $p(x;\theta) = h(x \mid T(x);\gamma)\, f(T(x);\lambda)$ and $T(X)$ is S-ancillary.
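A short simulation sketch of the S-ancillarity (the means and sample size are illustrative assumptions): conditional on $T(X) = t$, $Y_1$ behaves like a Binomial$(t,\gamma)$ draw, no matter what $\lambda$ is.

```python
import numpy as np

# Simulation sketch of Example 4.3 (means and sample size are illustrative
# assumptions): conditional on T(X) = t, Y1 should be Binomial(t, gamma)
# with gamma = mu1/(mu1 + mu2), free of lambda = mu1 + mu2.
rng = np.random.default_rng(3)
mu1, mu2, n = 2.0, 3.0, 400_000
gamma = mu1 / (mu1 + mu2)

y1 = rng.poisson(mu1, n)
t = y1 + rng.poisson(mu2, n)

keep = t == 5                       # condition on one value of T(X)
print("mean of Y1/t given T = 5:", (y1[keep] / 5.0).mean(), "vs gamma =", gamma)
# Doubling (mu1, mu2) to (4.0, 6.0) changes lambda but not gamma, and the
# conditional behavior of Y1 given T is unchanged: S-ancillarity in action.
```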
Example 4.4: Let $X = (X_1, X_2, \dots, X_n)$ be i.i.d. normal random variables with mean $\mu$ and variance $\sigma^2$. Let $\gamma = \sigma^2$ and $\lambda = \mu$. Let $T(X) = \bar{X}$. We know that $\bar{X} \sim N(\lambda, \gamma/n)$. By exponential family results, we know that for fixed $\gamma$, $T(X)$ is complete for $\lambda$. The conditional distribution of $X$ given $T(X)$ is equal to
\[
\begin{aligned}
h(x \mid T(X) = t;\theta) &= \frac{(2\pi\gamma)^{-n/2} \exp\!\left(-\sum_{i=1}^n (x_i-\lambda)^2/(2\gamma)\right) I\!\left(\sum_{i=1}^n x_i = nt\right)}{\left(\frac{n}{2\pi\gamma}\right)^{1/2} \exp\!\left(-n(t-\lambda)^2/(2\gamma)\right)} \\
&= (2\pi\gamma)^{-(n-1)/2}\, n^{-1/2} \exp\!\left(-\Big(\sum_{i=1}^n x_i^2 - nt^2\Big)\Big/(2\gamma)\right).
\end{aligned}
\]
So, $p(x;\theta) = h(x \mid T(x);\gamma)\, f(T(x);\theta)$ and $T(X)$ is C-ancillary.
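A simulation sketch (the parameter values are illustrative assumptions): since the conditional law of $X$ given $\bar X$ involves $\gamma$ but not $\lambda = \mu$, any feature of it, such as $\sum_i (X_i - \bar X)^2$, has the same distribution whatever $\mu$ is.

```python
import numpy as np

# Simulation sketch of Example 4.4 (n, sigma^2, and the two mu values are
# illustrative assumptions): the conditional law of X given Xbar involves
# gamma = sigma^2 but not lambda = mu, so a feature of it such as
# sum_i (X_i - Xbar)^2 has the same distribution whatever mu is.
rng = np.random.default_rng(4)
n, sigma2, reps = 10, 2.0, 200_000

for mu in (0.0, 5.0):
    x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
    ss = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
    # E[sum of squared deviations] = (n - 1) * sigma^2 for every mu
    print(f"mu = {mu}: mean SS = {ss.mean():.3f} vs (n-1)*sigma^2 = {(n - 1) * sigma2}")
```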
Example 4.5: Suppose that $X_1 \sim \mathrm{Binomial}(n_1,p_1)$ and $X_2 \sim \mathrm{Binomial}(n_2,p_2)$, where $n_1$ and $n_2$ are fixed sample sizes. We assume $X_1$ is independent of $X_2$. Let $X = (X_1,X_2)$, $q_1 = 1-p_1$, $q_2 = 1-p_2$, $\gamma = \log\!\left(\frac{p_2/q_2}{p_1/q_1}\right)$, and $\lambda = \log(p_1/q_1)$. There is a one-to-one mapping between $(p_1,p_2)$ and $(\gamma,\lambda)$. Let $T(X) = X_1+X_2$. What is the distribution of $T(X)$?

\[
\begin{aligned}
f(t;\theta) &= \sum_{u=0}^{\min(t,n_2)} P[X_1+X_2 = t \mid X_2 = u]\, P[X_2 = u]
= \sum_{u=\max(0,t-n_1)}^{\min(t,n_2)} P[X_1 = t-u]\, P[X_2 = u] \\
&= \sum_{u=\max(0,t-n_1)}^{\min(t,n_2)} \binom{n_1}{t-u} p_1^{t-u} q_1^{n_1-t+u} \binom{n_2}{u} p_2^{u} q_2^{n_2-u} \\
&= \sum_{u=\max(0,t-n_1)}^{\min(t,n_2)} \binom{n_1}{t-u} \binom{n_2}{u} \exp(\gamma u + \lambda t)\, q_1^{n_1} q_2^{n_2}.
\end{aligned}
\]
For fixed $\gamma$, this is an exponential family in $\lambda$ with sufficient statistic $t$, so $T(X)$ is C-ancillary and the conditional score function is optimal. $T(X)$ is not S-ancillary (obvious), nor R- or A-ancillary.
To be thorough, we compute the conditional density of $X$ given $T(X)$. This is given by
\[
\begin{aligned}
h(x \mid T(X) = t;\theta) &= \frac{P[X_1 = x_1, X_2 = x_2, T(X) = t]}{P[T(X) = t]} \\
&= \frac{\binom{n_1}{x_1} \binom{n_2}{x_2} \exp(\gamma x_2 + \lambda t)\, q_1^{n_1} q_2^{n_2}}{\sum_{u=\max(0,t-n_1)}^{\min(t,n_2)} \binom{n_1}{t-u} \binom{n_2}{u} \exp(\gamma u + \lambda t)\, q_1^{n_1} q_2^{n_2}}
= \frac{\binom{n_1}{x_1} \binom{n_2}{x_2} \exp(\gamma x_2)}{\sum_{u=\max(0,t-n_1)}^{\min(t,n_2)} \binom{n_1}{t-u} \binom{n_2}{u} \exp(\gamma u)}
\end{aligned}
\]
for $x_1+x_2 = t$; the factor $\exp(\lambda t)$ cancels, so the conditional density depends only on $\gamma$.
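A small numerical sketch of this conditional distribution (the values of $n_1$, $n_2$, $t$, and $\gamma$ are illustrative assumptions); note that $\lambda$ never enters:

```python
import numpy as np
from math import comb

# Numerical sketch of the conditional density in Example 4.5; n1, n2, t,
# and gamma below are illustrative assumptions.  Note that lambda never
# appears: exp(lambda * t) cancels between numerator and denominator.
n1, n2, t, gamma = 10, 12, 11, 0.8

u = np.arange(max(0, t - n1), min(t, n2) + 1)       # support of X2 given T = t
w = np.array([comb(n1, t - v) * comb(n2, v) for v in u]) * np.exp(gamma * u)
h = w / w.sum()                                     # h(x2 | T = t; gamma)

print("support of X2:", u)
print("conditional pmf sums to:", h.sum())
print("E[X2 | T = t] =", (u * h).sum())             # increases with gamma
```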
One advantage of using the conditional score function is that we may consider situations where there is an infinite-dimensional nuisance parameter (i.e., a semiparametric model). Suppose that there are $n$ independent tables. Let $\lambda_i$ be the baseline log odds for the $i$th table, and let $\gamma$ be the common log odds ratio across the $n$ tables. Assume that $\lambda_i \sim L$. In this case, the parameters are $(\gamma, L)$. Let $X = (X_1,\dots,X_n)$, $T(X) = (T_1(X_1),\dots,T_n(X_n))$, $T_i(X_i) = X_{1i}+X_{2i}$, $X_i = (X_{1i},X_{2i})$, where $X_{1i}$ and $X_{2i}$ are independent binomial random variables with fixed sample sizes $n_{1i}$ and $n_{2i}$ and random success probabilities $p_{1i}$ and $p_{2i}$, respectively. Let $q_{1i} = 1-p_{1i}$ and $q_{2i} = 1-p_{2i}$. So, we know that $\lambda_i = \log(p_{1i}/q_{1i})$ and $\gamma = \log\!\left(\frac{p_{2i}/q_{2i}}{p_{1i}/q_{1i}}\right)$ for all $i$. The conditional distribution of $X$ given $T(X)$ does not depend on $L$.
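A minimal sketch of conditional inference here (the table counts are hypothetical): each table contributes the conditional term from Example 4.5, so the table-specific $\lambda_i$, and hence $L$, cancel out of the criterion entirely.

```python
import numpy as np
from math import comb
from scipy.optimize import minimize_scalar

# Minimal sketch of conditional inference across several 2x2 tables; the
# table counts below are hypothetical.  Each table contributes the
# conditional term from Example 4.5, so the table-specific lambda_i (and
# hence L) drop out of the criterion entirely.

def cond_loglik(gamma, x2, t, n1, n2):
    # log h(x2 | T = t; gamma) for one table, as computed in Example 4.5
    u = np.arange(max(0, t - n1), min(t, n2) + 1)
    logc = np.array([np.log(comb(n1, t - v) * comb(n2, v)) for v in u])
    return (np.log(comb(n1, t - x2) * comb(n2, x2)) + gamma * x2
            - np.logaddexp.reduce(logc + gamma * u))

tables = [(8, 3, 9, 5), (12, 2, 10, 4), (6, 1, 7, 3)]   # (n1, x1, n2, x2)

def total_loglik(gamma):
    return sum(cond_loglik(gamma, x2, x1 + x2, n1, n2)
               for n1, x1, n2, x2 in tables)

fit = minimize_scalar(lambda g: -total_loglik(g), bounds=(-5.0, 5.0), method="bounded")
print("conditional MLE of the common log odds ratio:", fit.x)
```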
Example 4.6: Consider the semiparametric truncation model. Let $Y$ and $T$ be independent non-negative random variables with densities $k(\cdot)$ and $l(\cdot)$, respectively. In the right truncation problem, $(Y,T)$ is only observed if $Y \le T$.

Lagakos (Biometrika, 1988) describes a study population of subjects infected with HIV from contaminated blood transfusions. The date of infection was known for all subjects, and only those subjects who contracted AIDS by a fixed date were included in the analysis. Interest is in the incubation time of AIDS, i.e., the time from infection to AIDS. Let $Y$ denote the incubation time and $T$ the time from infection to the fixed cutoff date. Note that $Y$ and $T$ are only observed if $Y \le T$.
Let $X = (Y,T)$ be the observed $(Y,T)$. Then,
\[ p(x) = P[Y = y, T = t \mid Y \le T] = \frac{k(y)\, l(t)\, I(y \le t)}{\beta}, \]
where, writing $K$ and $L$ for the distribution functions of $Y$ and $T$,
\[ \beta = P[Y \le T] = \int_0^\infty K(t)\, l(t)\, dt = \int_0^\infty (1-L(y))\, k(y)\, dy. \]
Let $T(X) = T$ be the conditioning statistic. Then
\[ h(x \mid T(x) = t) = P[Y = y \mid T = t, Y \le T] = \frac{k(y)\, I(y \le t)}{K(t)} \]
and
\[ f(t) = P[T = t \mid Y \le T] = \frac{K(t)\, l(t)}{\beta}. \]
Suppose that we parameterize the law of $Y$ via a parameter $\gamma$ and leave the distribution of $T$ unspecified. So, we have a semiparametric model with parameters $(\gamma, l)$:
\[ p(x;\gamma,l) = \frac{k_\gamma(y)\, I(y \le t)}{K_\gamma(t)} \cdot \frac{K_\gamma(t)\, l(t)}{\beta_{\gamma,l}}, \]
where
\[ \beta_{\gamma,l} = \int_0^\infty K_\gamma(t)\, l(t)\, dt = \int_0^\infty (1-L(y))\, k_\gamma(y)\, dy. \]
Claim 4.8: $T(X)$ is A-ancillary for $\gamma$.

Proof: We must show that for any given $(\gamma_0, l_0)$ and any given $\gamma$, there exists $l = l(\cdot\,;\gamma,\gamma_0,l_0)$ such that $f(t;\gamma_0,l_0) = f(t;\gamma,l)$ for all $t$, where
\[ f(t;\gamma,l) = \frac{K_\gamma(t)\, l(t)}{\beta_{\gamma,l}}. \]
If we take
\[ l(t;\gamma,\gamma_0,l_0) = \frac{K_{\gamma_0}(t)\, l_0(t)}{K_\gamma(t)\, c}, \quad \text{where } c = \int_0^\infty \frac{K_{\gamma_0}(t)\, l_0(t)}{K_\gamma(t)}\, dt, \]
then the above equality holds: with this choice, $K_\gamma(t)\, l(t) = K_{\gamma_0}(t)\, l_0(t)/c$ and $\beta_{\gamma,l} = \beta_{\gamma_0,l_0}/c$, so the ratio is unchanged. By Corollary 4.7, this implies that the conditional score function is optimal.
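To illustrate, here is a simulation sketch of conditional inference in this model, taking $k_\gamma$ to be an Exponential(rate $\gamma$) density purely as an illustrative assumption; the truncation-time distribution is arbitrary and never enters the conditional likelihood $k_\gamma(y)/K_\gamma(t)$ on $\{y \le t\}$.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Simulation sketch of conditional inference in the right-truncation model,
# taking k_gamma to be an Exponential(rate gamma) density as an illustrative
# assumption.  The truncation-time density l is arbitrary and never enters
# the conditional likelihood k_gamma(y) / K_gamma(t) on {y <= t}.
rng = np.random.default_rng(5)
gamma_true, n_raw = 1.5, 200_000

y = rng.exponential(1.0 / gamma_true, n_raw)    # incubation times
t = rng.uniform(0.0, 2.0, n_raw)                # truncation times
obs = y <= t                                    # only truncated pairs are observed
y_obs, t_obs = y[obs], t[obs]

def neg_cond_loglik(g):
    # -sum of log h(y | t; g) = -(log k_g(y) - log K_g(t)) over observed pairs
    return -np.sum(np.log(g) - g * y_obs - np.log1p(-np.exp(-g * t_obs)))

fit = minimize_scalar(neg_cond_loglik, bounds=(0.01, 10.0), method="bounded")
print("conditional MLE of gamma:", fit.x, "(truth:", gamma_true, ")")
```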