Various types of likelihood

1. likelihood, marginal likelihood, conditional likelihood, profile likelihood, adjusted profile likelihood, Bayesian asymptotics
2. quasi-likelihood, composite likelihood
3. semi-parametric likelihood, partial likelihood
4. empirical likelihood, penalized likelihood
5. bootstrap likelihood, h-likelihood, weighted likelihood, pseudo-likelihood, local likelihood, sieve likelihood, simulated likelihood

STA 4508: Topics in Likelihood Inference, January 14, 2014
Nuisance parameters: notation

$\theta = (\psi, \lambda) = (\psi_1, \dots, \psi_q, \lambda_1, \dots, \lambda_{d-q})$

$U(\theta) = \begin{pmatrix} U_\psi(\theta) \\ U_\lambda(\theta) \end{pmatrix}, \qquad U_\lambda(\psi, \hat\lambda_\psi) = 0$

$i(\theta) = \begin{pmatrix} i_{\psi\psi} & i_{\psi\lambda} \\ i_{\lambda\psi} & i_{\lambda\lambda} \end{pmatrix}, \qquad j(\theta) = \begin{pmatrix} j_{\psi\psi} & j_{\psi\lambda} \\ j_{\lambda\psi} & j_{\lambda\lambda} \end{pmatrix}$

$i^{-1}(\theta) = \begin{pmatrix} i^{\psi\psi} & i^{\psi\lambda} \\ i^{\lambda\psi} & i^{\lambda\lambda} \end{pmatrix}, \qquad j^{-1}(\theta) = \begin{pmatrix} j^{\psi\psi} & j^{\psi\lambda} \\ j^{\lambda\psi} & j^{\lambda\lambda} \end{pmatrix}$

$i^{\psi\psi}(\theta) = \{\, i_{\psi\psi}(\theta) - i_{\psi\lambda}(\theta)\, i_{\lambda\lambda}^{-1}(\theta)\, i_{\lambda\psi}(\theta) \,\}^{-1}$

$l_p(\psi) = l(\psi, \hat\lambda_\psi), \qquad j_p(\psi) = -l_p''(\psi)$
Nuisance parameters: approximate pivots

$w_u(\psi) = U_\psi(\psi, \hat\lambda_\psi)^T \{ i^{\psi\psi}(\psi, \hat\lambda_\psi) \}\, U_\psi(\psi, \hat\lambda_\psi) \;\dot\sim\; \chi^2_q$

$w_e(\psi) = (\hat\psi - \psi)^T \{ i^{\psi\psi}(\hat\psi, \hat\lambda) \}^{-1} (\hat\psi - \psi) \;\dot\sim\; \chi^2_q$

$w(\psi) = 2\{ l(\hat\psi, \hat\lambda) - l(\psi, \hat\lambda_\psi) \} = 2\{ l_p(\hat\psi) - l_p(\psi) \} \;\dot\sim\; \chi^2_q$

for scalar $\psi$:

$r_u(\psi) = l_p'(\psi)\, j_p^{-1/2}(\hat\psi) \;\dot\sim\; N(0, 1)$

$r_e(\psi) = (\hat\psi - \psi)\, j_p^{1/2}(\hat\psi) \;\dot\sim\; N(0, 1)$

$r(\psi) = \mathrm{sign}(\hat\psi - \psi)\, [\, 2\{ l_p(\hat\psi) - l_p(\psi) \} \,]^{1/2} \;\dot\sim\; N(0, 1)$
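These pivots are straightforward to compute numerically. The following is a minimal sketch (not from the slides), assuming the model $y_i \sim N(\psi, \lambda)$ with the variance $\lambda$ as nuisance parameter, so that $\hat\lambda_\psi$ and the profile log-likelihood have closed forms; the function names are illustrative.

```python
import numpy as np

def profile_loglik(psi, y):
    # y_i ~ N(psi, lambda); lambda_hat_psi = mean((y - psi)^2), so
    # l_p(psi) = -(n/2) log lambda_hat_psi - n/2
    n = len(y)
    return -0.5 * n * np.log(np.mean((y - psi) ** 2)) - 0.5 * n

def pivots(psi, y):
    n = len(y)
    psi_hat = y.mean()
    lp_hat = profile_loglik(psi_hat, y)
    lp = profile_loglik(psi, y)
    j_p = n / np.mean((y - psi_hat) ** 2)   # profile information at psi_hat
    r_e = (psi_hat - psi) * np.sqrt(j_p)    # Wald pivot
    r = np.sign(psi_hat - psi) * np.sqrt(2.0 * (lp_hat - lp))  # likelihood root
    return r_e, r

rng = np.random.default_rng(1)
y = rng.normal(0.0, 2.0, size=50)
r_e, r = pivots(0.0, y)   # both approximately N(0, 1) under psi = 0
```

Since the profile log-likelihood here is close to quadratic, $r_e$ and $r$ agree closely; they separate when the likelihood is asymmetric in $\psi$.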
Nuisance parameters: properties of likelihood

- maximum likelihood estimates are equivariant: $\widehat{h(\theta)} = h(\hat\theta)$ for one-to-one $h(\cdot)$
- question: which of $w_e, w_u, w$ are invariant under reparametrization of the full parameter, $\varphi(\theta)$?
- question: which of $r_e, r_u, r$ are invariant under interest-respecting reparametrizations $(\psi, \lambda) \to \{\psi, \eta(\psi, \lambda)\}$?
- consistency of the maximum likelihood estimate
- equivalence of the maximum likelihood estimate and the root of the score equation
- observed vs. expected information
Various types of likelihood

1. likelihood, marginal likelihood, conditional likelihood, profile likelihood, adjusted profile likelihood
2. quasi-likelihood, composite likelihood
3. semi-parametric likelihood, partial likelihood
4. empirical likelihood, penalized likelihood
5. bootstrap likelihood, h-likelihood, weighted likelihood, pseudo-likelihood, local likelihood, sieve likelihood, simulated likelihood
Marginal and conditional likelihoods

Example: $Y \sim N(X\beta, \sigma^2)$, $Y \in \mathbb{R}^n$
Example: $Y_{ij} \sim N(\mu_i, \sigma^2)$, $j = 1, \dots, k$; $i = 1, \dots, m$
Example: $Y_{ij} \sim N(\mu, \sigma_i^2)$, $j = 1, \dots, k_i$; $i = 1, \dots, m$
Example: $Y_{i1}, Y_{i2} \sim \mathrm{Bernoulli}(p_{i1}, p_{i2})$, $i = 1, \dots, n$
Example: $Y_{i1}, Y_{i2} \sim \mathrm{Exponential}$ with rates $(\lambda_i \psi,\ \lambda_i/\psi)$ or $(\psi\lambda_i,\ \psi/\lambda_i)$
Frequentist inference, nuisance parameters

first-order pivotal quantities:

$r_u(\psi) = l_P'(\psi)\, j_P(\hat\psi)^{-1/2} \;\dot\sim\; N(0, 1)$

$r_e(\psi) = (\hat\psi - \psi)\, j_P(\hat\psi)^{1/2} \;\dot\sim\; N(0, 1)$

$r(\psi) = \mathrm{sign}(\hat\psi - \psi)\, [\, 2\{ l_P(\hat\psi) - l_P(\psi) \} \,]^{1/2} \;\dot\sim\; N(0, 1)$

all based on treating the profile log-likelihood as a one-parameter log-likelihood

example: $y = X\beta + \epsilon$, $\epsilon \sim N(0, \sigma^2)$; $\hat\sigma^2 = (y - X\hat\beta)^T (y - X\hat\beta)/n$
[Figure: log-likelihood plotted against ψ]
Eliminating nuisance parameters

by using a marginal density:
$f(y; \psi, \lambda) = f_m(t_1; \psi)\, f_c(t_2 \mid t_1; \psi, \lambda)$
Example, $N(X\beta, \sigma^2 I)$: $f(y; \beta, \sigma^2) = f_m(\mathrm{RSS}; \sigma^2)\, f_c(\hat\beta \mid \mathrm{RSS}; \beta, \sigma^2)$

by using a conditional density:
$f(y; \psi, \lambda) = f_c(t_1 \mid t_2; \psi)\, f_m(t_2; \psi, \lambda)$
Example, $N(X\beta, \sigma^2 I)$: $f(y; \beta, \sigma^2) = f_c(\mathrm{RSS} \mid \hat\beta; \sigma^2)\, f_m(\hat\beta; \beta, \sigma^2)$
Linear exponential families

conditional density free of the nuisance parameter:

$f(y_i; \psi, \lambda) = \exp\{\psi^T s(y_i) + \lambda^T t(y_i) - k(\psi, \lambda)\}\, h(y_i)$

$f(y; \psi, \lambda) = \exp\{\psi^T s + \lambda^T t - n k(\psi, \lambda)\} \prod_{i=1}^n h(y_i)$

$s = \sum_{i=1}^n s(y_i), \qquad t = \sum_{i=1}^n t(y_i)$

$f(s, t; \psi, \lambda) = \exp\{\psi^T s + \lambda^T t - n k(\psi, \lambda)\}\, \tilde h(s, t)$

$f(s \mid t; \psi) = \exp(\psi^T s)\, \tilde h(s, t) \Big/ \int \exp(\psi^T u)\, \tilde h(u, t)\, du$ — free of $\lambda$, since the factor $\exp\{\lambda^T t - n k(\psi, \lambda)\}$ cancels in the ratio
Adjusted profile log-likelihood

$l_A(\psi) = l_p(\psi) + A(\psi) = l(\psi, \hat\lambda_\psi) + A(\psi)$, with $A(\psi)$ assumed to be $O_p(1)$

generic form (Fraser, 2003):
$A_{FR}(\psi) = +\tfrac12 \log |j_{\lambda\lambda}(\psi, \hat\lambda_\psi)| - \log |\, d(\lambda)/d\hat\lambda_\psi \,|$

closely related (SM 12.4.1; Barndorff-Nielsen, 1983):
$A_{BN}(\psi) = -\tfrac12 \log |j_{\lambda\lambda}(\psi, \hat\lambda_\psi)| + \log |\, d\hat\lambda/d\hat\lambda_\psi \,|$

- if $i_{\psi\lambda}(\theta) = 0$, then $\hat\lambda_\psi = \hat\lambda + O_p(n^{-1})$, suggesting we ignore the last term
- if $\psi$ is scalar, then in principle we can find a parametrization $(\psi, \lambda)$ in which $i_{\psi\lambda}(\theta) = 0$ (SM 12.4.2)
Asymptotics for Bayesian inference

$\pi(\theta \mid y) = \dfrac{\exp\{l(\theta; y)\}\, \pi(\theta)}{\int \exp\{l(\theta; y)\}\, \pi(\theta)\, d\theta}$

expand numerator and denominator about $\hat\theta$, using $l'(\hat\theta) = 0$:

$\pi(\theta \mid y) \;\dot\sim\; N\{\hat\theta, j^{-1}(\hat\theta)\}$

expand the denominator only about $\hat\theta$; result:

$\pi(\theta \mid y) \;\dot=\; \dfrac{1}{(2\pi)^{d/2}}\, |j(\hat\theta)|^{1/2} \exp\{l(\theta; y) - l(\hat\theta; y)\}\, \dfrac{\pi(\theta)}{\pi(\hat\theta)}$
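The second expansion can be checked numerically. A hedged sketch, assuming a Poisson($\theta$) sample with an Exp(1) prior so the exact posterior is Gamma($s+1$, $n+1$); the model choice and function names are illustrative, not from the slides.

```python
import math

def laplace_posterior(theta, s, n):
    # Poisson(theta) sample with sum s: l(theta) = s log(theta) - n theta,
    # prior pi(theta) = exp(-theta)  (Exp(1), an illustrative choice)
    theta_hat = s / n
    j_hat = s / theta_hat ** 2                 # observed information -l''(theta_hat)
    log_ratio = s * math.log(theta / theta_hat) - n * (theta - theta_hat)
    prior_ratio = math.exp(-(theta - theta_hat))
    return math.sqrt(j_hat / (2.0 * math.pi)) * math.exp(log_ratio) * prior_ratio

def exact_posterior(theta, s, n):
    # conjugacy: the posterior is Gamma(s + 1, n + 1)
    a, b = s + 1, n + 1
    return math.exp(a * math.log(b) + (a - 1) * math.log(theta)
                    - b * theta - math.lgamma(a))

lap = laplace_posterior(2.0, 40, 20)
ex = exact_posterior(2.0, 40, 20)
```

With $s = 40$, $n = 20$ the Laplace form agrees with the exact Gamma density to well under one percent near the mode, consistent with the $O_p(n^{-1})$ relative error stated on the later slide.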
Posterior is asymptotically normal

$\pi(\theta \mid y) \;\dot\sim\; N\{\hat\theta, j^{-1}(\hat\theta)\}$, $\theta \in \mathbb{R}$, $y = (y_1, \dots, y_n)$

careful statement:
... posterior is asymptotically normal

$\pi(\theta \mid y) \;\dot\sim\; N\{\hat\theta, j^{-1}(\hat\theta)\}$, $\theta \in \mathbb{R}$, $y = (y_1, \dots, y_n)$

equivalently, in terms of $l_\pi(\theta) = l(\theta; y) + \log \pi(\theta)$
... posterior is asymptotically normal

In fact, if $\pi(\theta_0) > 0$ and $\pi'(\theta)$ is continuous in a neighbourhood of $\theta_0$, there exist constants $D$ and $n_y$ such that

$|F_n(\xi) - \Phi(\xi)| < D\, n^{-1/2}$, for all $n > n_y$,

on an almost-sure set with respect to $\pi(\theta_0) f(y; \theta_0)$, where $y = (y_1, \dots, y_n)$ is a sample from $f(y; \theta_0)$, and $\theta_0$ is an observation from the prior density $\pi(\theta)$. Here

$F_n(\xi) = \Pr\{ (\theta - \hat\theta)\, j^{1/2}(\hat\theta) \le \xi \mid y \}$

Johnson (1970); Datta & Mukerjee (2004)
Laplace approximation

$\pi(\theta \mid y) \;\dot=\; \dfrac{1}{(2\pi)^{1/2}}\, |j(\hat\theta)|^{1/2} \exp\{l(\theta; y) - l(\hat\theta; y)\}\, \dfrac{\pi(\theta)}{\pi(\hat\theta)}$, $\quad y = (y_1, \dots, y_n)$, $\theta \in \mathbb{R}$

$\pi(\theta \mid y) = \dfrac{1}{(2\pi)^{1/2}}\, |j(\hat\theta)|^{1/2} \exp\{l(\theta; y) - l(\hat\theta; y)\}\, \dfrac{\pi(\theta)}{\pi(\hat\theta)}\, \{1 + O_p(n^{-1})\}$

$\pi(\theta \mid y) = \dfrac{1}{(2\pi)^{1/2}}\, |j_\pi(\hat\theta_\pi)|^{1/2} \exp\{l_\pi(\theta; y) - l_\pi(\hat\theta_\pi; y)\}\, \{1 + O_p(n^{-1})\}$
Posterior tail area

$\displaystyle \int_\theta^\infty \pi(\vartheta \mid y)\, d\vartheta \;\dot=\; \int_\theta^\infty \frac{1}{(2\pi)^{1/2}}\, e^{\,l(\vartheta; y) - l(\hat\vartheta; y)}\, |j(\hat\vartheta)|^{1/2}\, \frac{\pi(\vartheta)}{\pi(\hat\vartheta)}\, d\vartheta$
Posterior cdf

$\displaystyle \int_{-\infty}^\theta \pi(\vartheta \mid y)\, d\vartheta \;\dot=\; \int_{-\infty}^\theta \frac{1}{(2\pi)^{1/2}}\, e^{\,l(\vartheta; y) - l(\hat\vartheta; y)}\, |j(\hat\vartheta)|^{1/2}\, \frac{\pi(\vartheta)}{\pi(\hat\vartheta)}\, d\vartheta$

SM, 11.3
BDR, Ch.3, Cauchy with flat prior
Nuisance parameters

$y = (y_1, \dots, y_n) \sim f(y; \theta)$, $\theta = (\psi, \lambda)$

$\pi_m(\psi \mid y) = \int \pi(\psi, \lambda \mid y)\, d\lambda = \dfrac{\int \exp\{l(\psi, \lambda; y)\}\, \pi(\psi, \lambda)\, d\lambda}{\int\!\!\int \exp\{l(\psi, \lambda; y)\}\, \pi(\psi, \lambda)\, d\psi\, d\lambda}$
... nuisance parameters

$y = (y_1, \dots, y_n) \sim f(y; \theta)$, $\theta = (\psi, \lambda)$

$\pi_m(\psi \mid y) = \int \pi(\psi, \lambda \mid y)\, d\lambda = \dfrac{\int \exp\{l(\psi, \lambda; y)\}\, \pi(\psi, \lambda)\, d\lambda}{\int\!\!\int \exp\{l(\psi, \lambda; y)\}\, \pi(\psi, \lambda)\, d\psi\, d\lambda}$

$|j(\hat\theta)| = |j_{\psi\psi\cdot\lambda}(\hat\theta)|\; |j_{\lambda\lambda}(\hat\theta)|$
Posterior marginal cdf, d = 1

$\Pi_m(\psi \mid y) = \displaystyle\int_{-\infty}^{\psi} \pi_m(\xi \mid y)\, d\xi \;\dot=\; \int_{-\infty}^{\psi} \frac{1}{(2\pi)^{1/2}}\, e^{\,l_p(\xi) - l_p(\hat\xi)}\, j_p^{1/2}(\hat\xi)\, \frac{\pi(\xi, \hat\lambda_\xi)}{\pi(\hat\xi, \hat\lambda)}\, \frac{|j_{\lambda\lambda}(\hat\xi, \hat\lambda)|^{1/2}}{|j_{\lambda\lambda}(\xi, \hat\lambda_\xi)|^{1/2}}\, d\xi$
... posterior marginal cdf, d = 1

$\Pi_m(\psi \mid y) \;\dot=\; \Phi(r_B) = \Phi\{ r + \tfrac{1}{r} \log(q_B/r) \}$

$r = r(\psi)$

$q_B = q_B(\psi) = $
[Figures: p-value functions for the normal circle example, plotted against ψ — first for k = 2, then for k = 2, 5, 10]
Link to adjusted log-likelihoods

$\pi_m(\psi \mid y) \;\dot=\; \dfrac{1}{(2\pi)^{d/2}}\, e^{\,l_p(\psi) - l_p(\hat\psi)}\, j_p^{1/2}(\hat\psi)\, \dfrac{\pi(\psi, \hat\lambda_\psi)}{\pi(\hat\psi, \hat\lambda)}\, \dfrac{|j_{\lambda\lambda}(\hat\psi, \hat\lambda)|^{1/2}}{|j_{\lambda\lambda}(\psi, \hat\lambda_\psi)|^{1/2}}$

$\pi_m(\psi \mid y) \;\dot=\; c\, \exp\{\, l_p(\psi) - \tfrac12 \log |j_{\lambda\lambda}(\psi, \hat\lambda_\psi)| + \log \pi(\psi, \hat\lambda_\psi) \,\}$

$l_A(\psi) = l_p(\psi) - \tfrac12 \log |j_{\lambda\lambda}(\psi, \hat\lambda_\psi)| + \log |\, d\hat\lambda/d\hat\lambda_\psi \,|$

if $i_{\psi\lambda}(\theta) = 0$, then $\hat\lambda_\psi = \hat\lambda + O_p(n^{-1})$
Composite likelihood

vector observation: $Y \sim f(y; \theta)$, $y \in \mathcal{Y} \subseteq \mathbb{R}^m$, $\theta \in \mathbb{R}^d$
set of events: $\{A_k,\ k \in K\}$

composite likelihood (Lindsay, 1988):

$CL(\theta; y) = \prod_{k \in K} L_k(\theta; y)^{w_k}$

$L_k(\theta; y) = f(\{y \in A_k\}; \theta)$, the likelihood for the event $A_k$; $\{w_k,\ k \in K\}$ a set of weights
Examples

composite conditional likelihood (Besag, 1974):
$L_C(\theta; y) = \prod_{s \in S} f(y_s \mid y_{s^c}; \theta)^{w_s}$, and variants obtained by modifying the events

composite marginal likelihood:
$CML(\theta; y) = \prod_{s \in S} f_s(y_s; \theta)^{w_s}$, where $f_s(y_s; \theta)$ is the marginal density of the subvector $y_s$ induced by $f$

independence likelihood: $L_{ind}(\theta; y) = \prod_{r=1}^{m} f(y_r; \theta)^{w_r}$

pairwise likelihood: $L_{pair}(\theta; y) = \prod_{r<s} f(y_r, y_s; \theta)^{w_{rs}}$
Derived quantities

log composite likelihood: $cl(\theta; y) = \log CL(\theta; y)$
score function: $U(\theta; y) = \nabla_\theta\, cl(\theta; y) = \sum_{s \in S} w_s U_s(\theta; y)$, with $U_s(\theta; y) = \nabla_\theta \log f_s(y_s; \theta)$
variability matrix: $J(\theta) = \mathrm{var}_\theta\{ U(\theta; Y) \}$
sensitivity matrix: $H(\theta) = E_\theta\{ -\nabla_\theta U(\theta; Y) \}$
Godambe information (or sandwich information): $G(\theta) = H(\theta)\, J^{-1}(\theta)\, H(\theta)$
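These matrices can be estimated by simulation. A sketch, assuming a pairwise likelihood for an m-variate equicorrelated standard normal with the correlation ρ as the single parameter; scores are differentiated numerically, and all names are illustrative.

```python
import numpy as np

def pairwise_cl(rho, y):
    # pairwise log-likelihood of one m-vector from an equicorrelated
    # standard normal, constants dropped
    m = len(y)
    tot = 0.0
    for a in range(m):
        for b in range(a + 1, m):
            q = (y[a] ** 2 - 2.0 * rho * y[a] * y[b] + y[b] ** 2) / (1.0 - rho ** 2)
            tot += -0.5 * np.log(1.0 - rho ** 2) - 0.5 * q
    return tot

def pair_score(rho, y, eps=1e-5):
    # numerical derivative of the pairwise log-likelihood in rho
    return (pairwise_cl(rho + eps, y) - pairwise_cl(rho - eps, y)) / (2.0 * eps)

rng = np.random.default_rng(0)
m, n, rho = 4, 4000, 0.5
R = (1.0 - rho) * np.eye(m) + rho * np.ones((m, m))
Y = rng.multivariate_normal(np.zeros(m), R, size=n)

scores = np.array([pair_score(rho, y) for y in Y])
J = scores.var()                                   # variability (scalar here)
eps = 1e-4
H = -np.mean([(pair_score(rho + eps, y) - pair_score(rho - eps, y)) / (2.0 * eps)
              for y in Y])                         # sensitivity
G = H * (1.0 / J) * H                              # Godambe information
```

With unit weights, H should be near the number of pairs, m(m−1)/2 = 6, times the per-pair Fisher information (1+ρ²)/(1−ρ²)², about 13.3 here, while J differs from H because the pairs overlap; the sandwich variance of ρ̂_CL is then G⁻¹/n.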
Inference

sample $Y_1, \dots, Y_n$ i.i.d.; $CL(\theta; y) = \prod_{i=1}^n CL(\theta; y_i)$

$\sqrt{n}\, (\hat\theta_{CL} - \theta) \;\dot\sim\; N\{ 0, G^{-1}(\theta) \}$, $\qquad G(\theta) = H(\theta)\, J^{-1}(\theta)\, H(\theta)$
... inference

$w(\theta) = 2\{ cl(\hat\theta_{CL}) - cl(\theta) \} \;\dot\sim\; \sum_{a=1}^{d} \mu_a Z_a^2$, $\quad Z_a \sim N(0, 1)$ independent

$\mu_1, \dots, \mu_d$: eigenvalues of $J(\theta) H^{-1}(\theta)$
... inference

$w(\theta) = 2\{ cl(\hat\theta_{CL}) - cl(\theta) \} \;\dot\sim\; \sum_{a=1}^{d} \mu_a Z_a^2$, $\quad Z_a \sim N(0, 1)$ independent

$\mu_1, \dots, \mu_d$: eigenvalues of $J(\theta) H^{-1}(\theta)$

$w(\theta) \;\dot=\; (\hat\theta_{CL} - \theta)^T \{ n H(\theta) \} (\hat\theta_{CL} - \theta)$, $\qquad \hat\theta_{CL} \;\dot\sim\; N\{ \theta, G^{-1}(\theta) \}$
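The limiting distribution $\sum_a \mu_a Z_a^2$ has no simple closed-form cdf in general, but its quantiles are easy to obtain by Monte Carlo. A small sketch (illustrative, not from the slides):

```python
import numpy as np

def chibar_quantile(mu, q=0.95, nsim=200_000, seed=0):
    # Monte Carlo quantile of sum_a mu_a Z_a^2 with Z_a iid N(0, 1)
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((nsim, len(mu)))
    w = (np.asarray(mu) * Z ** 2).sum(axis=1)
    return np.quantile(w, q)

# sanity check: with all mu_a = 1 this is an ordinary chi-squared quantile
q95 = chibar_quantile([1.0, 1.0], 0.95)
```

In practice the $\mu_a$ are replaced by the eigenvalues of $\hat J \hat H^{-1}$ evaluated at $\hat\theta_{CL}$, so the calibration of $w(\theta)$ is itself estimated.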
Nuisance parameters

$\theta = (\psi, \lambda)$; constrained estimator: $\tilde\theta_\psi = \arg\max_{\theta = \theta(\psi)} cl(\theta; y)$

$\sqrt{n}\, (\hat\psi_{CL} - \psi) \;\dot\sim\; N\{ 0, G^{\psi\psi}(\theta) \}$, $\qquad G(\theta) = H(\theta)\, J^{-1}(\theta)\, H(\theta)$

$w(\psi) = 2\{ cl(\hat\theta_{CL}) - cl(\tilde\theta_\psi) \} \;\dot\sim\; \sum_{a=1}^{d_0} \mu_a Z_a^2$

$\mu_1, \dots, \mu_{d_0}$: eigenvalues of $(H^{\psi\psi})^{-1} G^{\psi\psi}$ (Kent, 1982)
Model selection

Akaike information criterion (Varin and Vidoni, 2005):
$AIC = 2\, cl(\hat\theta_{CL}; y) - 2\, \mathrm{dim}(\theta)$

Bayesian information criterion (Gao and Song, 2009):
$BIC = 2\, cl(\hat\theta_{CL}; y) - \log n\; \mathrm{dim}(\theta)$

effective number of parameters (?): $\mathrm{dim}(\theta) = \mathrm{tr}\{ H(\theta)\, G^{-1}(\theta) \}$

these criteria are used for model averaging (Hjort and Claeskens, 2008) or for selection of tuning parameters (Gao and Song, 2009)
Example: symmetric normal

$Y_i \sim N(0, R)$, $\mathrm{var}(Y_{ir}) = 1$, $\mathrm{corr}(Y_{ir}, Y_{is}) = \rho$

compound bivariate normal densities to form the pairwise likelihood:

$cl(\rho; y_1, \dots, y_n) = -\dfrac{nm(m-1)}{4} \log(1 - \rho^2) - \dfrac{m - 1 + \rho}{2(1 - \rho^2)}\, SS_w - \dfrac{(m-1)(1-\rho)}{2(1 - \rho^2)}\, SS_b$

$SS_w = \sum_{i=1}^n \sum_{s=1}^m (y_{is} - \bar y_{i\cdot})^2, \qquad SS_b = m \sum_{i=1}^n \bar y_{i\cdot}^2$

$l(\rho; y_1, \dots, y_n) = -\dfrac{n(m-1)}{2} \log(1 - \rho) - \dfrac{n}{2} \log\{1 + (m-1)\rho\} - \dfrac{1}{2(1-\rho)}\, SS_w - \dfrac{1}{2\{1 + (m-1)\rho\}}\, SS_b$
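That the pairwise estimator tracks the full MLE can be checked directly. A sketch, assuming the pairwise and full log-likelihoods are evaluated from the bivariate and m-variate normal densities (rather than the closed-form expressions above) and maximized on a grid; names and settings are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def pairwise_cl(rho, Y):
    # sum of bivariate normal log-densities over all pairs r < s
    n, m = Y.shape
    bvn = multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]])
    tot = 0.0
    for r in range(m):
        for s in range(r + 1, m):
            tot += bvn.logpdf(Y[:, [r, s]]).sum()
    return tot

def full_loglik(rho, Y):
    n, m = Y.shape
    R = (1.0 - rho) * np.eye(m) + rho * np.ones((m, m))
    return multivariate_normal(np.zeros(m), R).logpdf(Y).sum()

rng = np.random.default_rng(3)
m, n, rho0 = 5, 400, 0.4
R0 = (1.0 - rho0) * np.eye(m) + rho0 * np.ones((m, m))
Y = rng.multivariate_normal(np.zeros(m), R0, size=n)

grid = np.linspace(0.05, 0.8, 151)
rho_cl = grid[np.argmax([pairwise_cl(r, Y) for r in grid])]
rho_ml = grid[np.argmax([full_loglik(r, Y) for r in grid])]
```

Both maximizers land near the true ρ, and close to each other, reflecting the high efficiency of the pairwise likelihood in this model.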
... symmetric normal

$\mathrm{a.var}(\hat\rho) = \dfrac{2}{nm(m-1)}\, \dfrac{\{1 + (m-1)\rho\}^2 (1-\rho)^2}{1 + (m-1)\rho^2}$

$\mathrm{a.var}(\hat\rho_{CL}) = \dfrac{2}{nm(m-1)}\, \dfrac{(1-\rho)^2}{(1+\rho^2)^2}\, c(m, \rho)$

$c(m, \rho) = (1-\rho)^2 (3\rho^2 + 1) + m\rho(-3\rho^3 + 8\rho^2 - 3\rho + 2) + m^2 \rho^2 (1-\rho)^2$

$\mathrm{a.var}(\hat\rho_{CL})$ is $O(n^{-1})$ as $n \to \infty$ with $m$ fixed, but $O(1)$ as $m \to \infty$ with $n$ fixed
... symmetric normal

[Figure: efficiency $\mathrm{a.var}(\hat\rho)/\mathrm{a.var}(\hat\rho_{CL})$ plotted against ρ for m = 3, 5, 8, 10 (Cox & Reid, 2004); efficiencies lie between roughly 0.85 and 1.00]
Likelihood ratio test

[Figure: log-likelihoods plotted against ρ in four panels — ρ = 0.5, n = 10, q = 5; ρ = 0.8, n = 10, q = 5; ρ = 0.2, n = 10, q = 5; ρ = 0.2, n = 7, q = 5]
... symmetric normal, with mean and variance

$Y_i \sim N(\mu \mathbf{1}, \sigma^2 R)$, $R_{st} = \rho$

$\hat\mu = \hat\mu_{CL}$, $\hat\sigma^2 = \hat\sigma^2_{CL}$, $\hat\rho = \hat\rho_{CL}$

$G(\theta) = H(\theta)\, J^{-1}(\theta)\, H(\theta) = i(\theta)$, the expected Fisher information: the pairwise likelihood is fully efficient

also true for $Y_i \sim N(\mu, \Sigma)$ (Mardia, Hughes, Taylor 2007; Jin 2009), because

$U_{CL}(\theta) = J(\theta)\, H^{-1}(\theta)\, U_{full}(\theta)$  (Pagui; Pace et al., 2011)
Example: dichotomized MV normal

$Y_{ir} = 1\{Z_{ir} > 0\}$, $Z_i \sim N(0, R)$, $r = 1, \dots, m$; $i = 1, \dots, n$

$l_2(\rho) = \sum_{i=1}^n \sum_{s<r} \{\, y_{ir} y_{is} \log P_{11} + y_{ir}(1 - y_{is}) \log P_{10} + (1 - y_{ir}) y_{is} \log P_{01} + (1 - y_{ir})(1 - y_{is}) \log P_{00} \,\}$

$\mathrm{a.var}(\hat\rho_{CL}) = \dfrac{1}{n}\, \dfrac{4\pi^2 (1 - \rho^2)}{m^2 (m-1)^2}\, \mathrm{var}(T), \qquad T = \sum_i \sum_{s<r} (2 y_{ir} y_{is} - y_{ir} - y_{is})$

$\mathrm{var}(T) = n\{\, m^4 (p_{1111} - 2 p_{111} + 2 p_{11} - p_{11}^2 + \tfrac14) + m^3 (-6 p_{1111} \dots) + m^2 (\dots) + m (\dots) \,\}$
[Figure: asymptotic variance plotted against ρ — pairwise vs full]

ρ:    0.02   0.05   0.12   0.20   0.40   0.50
ARE:  0.998  0.995  0.992  0.968  0.953  0.968

ρ:    0.60   0.70   0.80   0.90   0.95   0.98
ARE:  0.953  0.903  0.900  0.874  0.869  0.850
Example: multi-level probit model

latent variable: $z_{ir} = x_{ir}^T \beta + b_i + \epsilon_{ir}$, $\epsilon_{ir} \sim N(0, 1)$
binary observations: $y_{ir} = 1(z_{ir} > 0)$; $r = 1, \dots, m_i$; $i = 1, \dots, n$
probit model: $\Pr(y_{ir} = 1 \mid b_i) = \Phi(x_{ir}^T \beta + b_i)$; $b_i \sim N(0, \sigma_b^2)$

likelihood:
$L(\beta, \sigma_b) = \prod_{i=1}^n \int \prod_{r=1}^{m_i} \Phi(x_{ir}^T \beta + b_i)^{y_{ir}} \{1 - \Phi(x_{ir}^T \beta + b_i)\}^{1 - y_{ir}}\, \phi(b_i; \sigma_b^2)\, db_i$

pairwise likelihood:
$CL(\beta, \sigma_b) = \prod_{i=1}^n \prod_{r<s} P_{11}^{y_{ir} y_{is}}\, P_{10}^{y_{ir}(1 - y_{is})}\, P_{01}^{(1 - y_{ir}) y_{is}}\, P_{00}^{(1 - y_{ir})(1 - y_{is})}$

each $\Pr(y_{ir} = j, y_{is} = k)$ evaluated using $\Phi_2(\cdot, \cdot; \rho_{irs})$ (Renard et al., 2004)
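For the random-intercept probit, the pairwise probabilities reduce to bivariate normal orthant probabilities: marginally $(z_{ir}, z_{is})$ is bivariate normal with variances $1 + \sigma_b^2$ and latent correlation $\sigma_b^2/(1 + \sigma_b^2)$. A sketch of one pair's cell probabilities (the function name, its arguments, and the correlation formula are assumptions spelled out in the comments, not taken from Renard et al.):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def pair_probs(eta_r, eta_s, sigma_b):
    # latent z_r = eta_r + b + eps_r with b ~ N(0, sigma_b^2), eps_r ~ N(0, 1):
    # marginally (z_r, z_s) is bivariate normal with variance 1 + sigma_b^2
    # and correlation rho = sigma_b^2 / (1 + sigma_b^2)
    s2 = 1.0 + sigma_b ** 2
    rho = sigma_b ** 2 / s2
    a_r = eta_r / np.sqrt(s2)
    a_s = eta_s / np.sqrt(s2)
    bvn = multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]])
    p11 = bvn.cdf([a_r, a_s])       # Pr(y_r = 1, y_s = 1)
    p10 = norm.cdf(a_r) - p11       # Pr(y_r = 1, y_s = 0)
    p01 = norm.cdf(a_s) - p11
    p00 = 1.0 - p11 - p10 - p01
    return p11, p10, p01, p00
```

Setting $\sigma_b = 0$ recovers independent margins, a useful consistency check.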
... multi-level probit (Renard et al. 2004)

- computational effort doesn't increase with the number of random effects
- the pairwise likelihood is numerically stable
- efficiency losses, relative to maximum likelihood, of about 20% for estimation of $\beta$; somewhat larger for estimation of $\sigma_b^2$
... Example
Markov chains (Hjort and Varin, 2008)

comparison of:

likelihood: $L(\theta; y) = \prod_r \Pr(Y_r = y_r \mid Y_{r-1} = y_{r-1}; \theta)$

adjoining-pairs CML: $CML(\theta; y) = \prod_r \Pr(Y_r = y_r, Y_{r-1} = y_{r-1}; \theta)$

composite conditional likelihood (= Besag's PL): $CCL(\theta; y) = \prod_r \Pr(Y_r = y_r \mid \text{neighbours}; \theta)$
... Markov chain example

random walk with $p$ states and two reflecting barriers; transition matrix

$P = \begin{pmatrix}
0 & 1 & 0 & 0 & \cdots & 0 \\
1-\rho & 0 & \rho & 0 & \cdots & 0 \\
0 & 1-\rho & 0 & \rho & \cdots & 0 \\
\vdots & & \ddots & \ddots & \ddots & \vdots \\
0 & \cdots & & 0 & 1 & 0
\end{pmatrix}$
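The transition matrix is easy to build and check numerically. A sketch, assuming states are labelled 0, …, p−1; the function name is illustrative.

```python
import numpy as np

def reflecting_walk(p, rho):
    # p-state random walk on {0, ..., p-1}: step up with probability rho,
    # down with probability 1 - rho, reflecting barriers at both ends
    P = np.zeros((p, p))
    P[0, 1] = 1.0
    P[p - 1, p - 2] = 1.0
    for i in range(1, p - 1):
        P[i, i - 1] = 1.0 - rho
        P[i, i + 1] = rho
    return P

P = reflecting_walk(5, 0.3)

# stationary distribution: left eigenvector of P for eigenvalue 1
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi = pi / pi.sum()
```

The stationary distribution is what the pairwise terms $\Pr(Y_{r-1} = y_{r-1}, Y_r = y_r; \theta)$ require, since each pair probability is the stationary probability of the first state times the transition probability.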
... Markov chain example Reflecting barrier with five states: efficiency of pairwise likelihood (dashed line) and Besag s pseudolikelihood (solid line) STA 4508: Topics in Likelihood Inference January 14, 2014 56/57
Example: longitudinal count data

subjects $i = 1, \dots, n$; observed counts $y_{ir}$, $r = 1, \dots, m_i$

model: $y_{ir} \mid u_{i1}, \dots, u_{im_i} \sim \mathrm{Poisson}(u_{ir}\, x_{ir}^T \beta)$

gamma-distributed random effects, serially correlated: $\mathrm{corr}(u_{ir}, u_{is}) = \rho^{|r-s|}$

the joint density has a combinatorial number of terms in $m_i$: impractical

weighted pairwise composite likelihood:
$L_{pair}(\beta) = \prod_{i=1}^n \Big\{ \prod_{r=1}^{m_i - 1} \prod_{s=r+1}^{m_i} f(y_{ir}, y_{is}; \beta) \Big\}^{1/(m_i - 1)}$

weights chosen so that $L_{pair}$ equals the full likelihood when $\rho = 0$ (Henderson & Shimakura, 2003)
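The weight $1/(m_i - 1)$ can be verified directly: under independence each index $r$ appears in $m_i - 1$ pairs, so the weighted pairwise log-likelihood collapses to the full one. A numerical check, assuming independent Poisson margins purely for illustration:

```python
import numpy as np
from scipy.stats import poisson

def full_loglik(y, mu):
    # ordinary log-likelihood under independent Poisson(mu) margins
    return poisson(mu).logpmf(y).sum()

def weighted_pairwise(y, mu):
    # pairwise log-likelihood with weight 1/(m - 1); with independent
    # margins, log f(y_r, y_s) = log f(y_r) + log f(y_s)
    m = len(y)
    tot = 0.0
    for r in range(m):
        for s in range(r + 1, m):
            tot += poisson(mu).logpmf(y[r]) + poisson(mu).logpmf(y[s])
    return tot / (m - 1)

y = np.array([2, 0, 3, 1, 4])
mu = 2.0
# each index appears in m - 1 pairs, so the two agree exactly when rho = 0
```

With dependent pairs the two objectives differ, but the weighting still matches their information content subject-by-subject.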