Non-informative prior distributions

Angelika van der Linde
University of Bremen
March 2004

Outline
1. Introduction
2. Standard non-informative priors
3. Reference priors (univariate parameters)
4. Discussion
Preliminaries

- A random vector Y generates data y; single observations correspond to $Y_n \in \mathbb{R}^q$, $Y = (Y_1, \ldots, Y_N)^T$, sample size N.
- Model $M_i$: $p_i(y \mid \theta)$ and prior $p_i(\theta)$, $\theta \in \Theta \subseteq \mathbb{R}^{s_i}$; $\theta$ varies with $M_i$, but subscripts are most often omitted.
  - $s_i = 1$: $\theta$ univariate
  - $s_i > 1$: $\theta$ multivariate
- Special case of iid data: model $M_i$: $p_i(y \mid \theta)$ and prior $p_i(\theta)$, $\theta \in \Theta \subseteq \mathbb{R}^{s_i}$, with $p_i(y \mid \theta) = \prod_n p_i(y_n \mid \theta)$.
1. Introduction

1.1 Existence of prior distributions

de Finetti's representation theorem: Let $Y_1, Y_2, \ldots, Y_N$ be real valued and exchangeable w.r.t. P (i.e. permutations of finite subsets have the same distribution). Then there is a measure Q on the set of distribution functions $\mathcal{F}$ such that
$$P(Y_1 \le y_1, \ldots, Y_N \le y_N) = \int_{\mathcal{F}} \prod_{n=1}^{N} F(y_n) \, dQ(F).$$

1.2 Bayesian inference

Update of the prior density by Bayes' theorem:
$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)} \propto p(y \mid \theta)\, p(\theta),$$
with posterior $p(\theta \mid y)$, likelihood $p(y \mid \theta)$ and prior $p(\theta)$.
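As a toy illustration of the proportionality $p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta)$, here is a minimal sketch of a grid-based posterior computation; the binomial likelihood, the flat prior and the data (9 successes in 12 trials) are purely illustrative choices.

```python
import numpy as np

# Bayes' theorem on a discrete grid: posterior ∝ likelihood × prior
theta = np.linspace(0.005, 0.995, 199)          # grid over a univariate parameter
prior = np.ones_like(theta)                     # illustrative flat prior
y, N = 9, 12                                    # illustrative binomial data
likelihood = theta**y * (1 - theta)**(N - y)    # p(y | theta), constants dropped

posterior = likelihood * prior
posterior /= posterior.sum() * (theta[1] - theta[0])   # normalize by a Riemann sum for p(y)
print(theta[np.argmax(posterior)])                      # posterior mode ≈ y/N = 0.75
```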
1.3 Specification of priors

- Subjective prior knowledge: informative prior. Ignorance: non-informative / objective / neutral prior.
- Non-informative priors do not exist (cp. Bernardo, 1997).
- Needed: non-subjective priors inducing dominance of the data in the posterior; these may depend on the sampling model and on the quantity of interest.
- Priors may be improper (yielding proper posteriors); improper priors are merely a technical device, not interpretable in terms of probability/beliefs.
- Yardstick for sensitivity analyses.
2. Conventional objective priors

2.1 Uniform/flat priors ($\theta$ univariate)

(i) Uniform priors

Definition:
- $\Theta = \{\theta_1, \ldots, \theta_L\}$: $p(\theta) = 1/L$
- $\Theta \subseteq \mathbb{R}$ continuous: $p(\theta) \propto 1$

Interpretation: equal weight to all parameter values (principle of insufficient reason).

Properties / problems:
- appropriate for finite $\Theta$
- if $\Theta$ is not compact, the posterior $p(\theta \mid y)$ may be improper (useless)
- lack of invariance w.r.t. 1:1 transformations
- may induce inadmissible estimators of $\theta$

(ii) Limits of flat (proper) priors
- e.g. conjugate priors, interpreted like information from a former experiment with sample size m; consider $m \to 0$
- do not solve the problems
Example

Sampling: $Y \sim B(\theta, N)$, $p(y \mid \theta) = \binom{N}{y} \theta^y (1-\theta)^{N-y}$, $\theta \in (0,1)$.
Prior: $\vartheta \sim \mathrm{Beta}(\alpha, \beta)$, $p(\theta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}$, $\alpha, \beta > 0$, $m = \alpha + \beta$.
Posterior: $\vartheta \mid y \sim \mathrm{Beta}(\alpha + y, \beta + N - y)$.

(i) Proper prior: $p(\theta) \propto 1$, i.e. $\vartheta \sim \mathrm{Beta}(1,1)$, yields a proper posterior $p(\theta \mid y)$.

(ii) Lack of invariance, improper posterior: for $\varphi = \mathrm{logit}\,\theta = \log\frac{\theta}{1-\theta} \in \mathbb{R}$, the flat prior $p(\varphi) \propto 1$ corresponds to
$$p(\theta) = p(\mathrm{logit}\,\theta) \left|\frac{d\,\mathrm{logit}\,\theta}{d\theta}\right| \propto \theta^{-1}(1-\theta)^{-1},$$
i.e. $\vartheta \sim \mathrm{Beta}(0,0)$ (improper); $\vartheta \mid y \sim \mathrm{Beta}(y, N-y)$ is improper if $y \in \{0, N\}$.

(iii) Limit: $m = \alpha + \beta \to 0$ with $\alpha, \beta > 0$ yields the improper posterior of (ii).
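A small numerical illustration of (i) and (ii), a sketch assuming hypothetical data y and N and using scipy's Beta distribution for the posteriors:

```python
from scipy.stats import beta

N, y = 12, 9                       # hypothetical data

# (i) flat prior Beta(1,1) on theta: proper posterior Beta(1+y, 1+N-y)
post_flat = beta(1 + y, 1 + N - y)
print(post_flat.mean())            # (y+1)/(N+2) ≈ 0.714

# (ii) a flat prior on logit(theta) corresponds to the improper Beta(0,0) kernel;
#      the formal posterior Beta(y, N-y) is a proper distribution only if 0 < y < N
post_logit = beta(y, N - y)
print(post_logit.mean())           # y/N = 0.75
# for y = 0 or y = N, Beta(y, N-y) has a zero parameter and is no longer a proper density
```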
$\theta$ multivariate: example (Stein's paradox)

Sampling: $Y_n \sim N(\theta, I_q)$ iid; $\bar{Y}$ sufficient, $\bar{Y} \sim N(\theta, N^{-1} I_q)$.
Prior: $p(\theta) \propto 1$, $\theta \in \mathbb{R}^q$.
Posterior: $\theta \mid y \sim N(\bar{y}, N^{-1} I_q)$ yields a bad estimate of $\varphi = \|\theta\|^2$ (if q is large and N small):
$$E(\varphi \mid y) = \|\bar{y}\|^2 + \frac{q}{N}, \qquad \text{whereas } \hat{\varphi} = \|\bar{y}\|^2 - \frac{q}{N} \text{ is the best estimate.}$$
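A quick Monte Carlo sketch of the phenomenon, with an arbitrarily chosen true $\theta$ (all numbers illustrative): for large q and small N the flat-prior posterior mean of $\varphi = \|\theta\|^2$ overshoots by about $2q/N$, while $\|\bar{y}\|^2 - q/N$ is unbiased.

```python
import numpy as np

rng = np.random.default_rng(0)
q, N, reps = 100, 1, 20000                 # high dimension, small sample (illustrative)

theta = rng.normal(size=q)                 # arbitrary "true" parameter
phi_true = np.sum(theta**2)

ybar = theta + rng.normal(size=(reps, q)) / np.sqrt(N)   # sampling distribution of the sample mean
norm2 = np.sum(ybar**2, axis=1)

print(phi_true)                            # true phi = ||theta||^2
print(np.mean(norm2 + q / N))              # flat-prior posterior mean E(phi | y): biased upward by ~2q/N
print(np.mean(norm2 - q / N))              # ||ybar||^2 - q/N: unbiased for phi
```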
2.2 Jeffreys prior ($\theta$ univariate)

Definition:
$$p_J(\theta) \propto \left[ -\int p(y \mid \theta) \, \frac{d^2 \log p(y \mid \theta)}{d\theta^2} \, dy \right]^{1/2} = \left[ -E_{y \mid \theta}\!\left( \frac{d^2 \log p(y \mid \theta)}{d\theta^2} \right) \right]^{1/2} =: J(\theta)^{1/2}$$

Interpretation:
- root of the expected Fisher information
- $KL(p(y \mid \theta), p(y \mid \theta + \Delta\theta)) \approx \tfrac{1}{2} J(\theta) (\Delta\theta)^2$
- favouring $\theta$ with large $J(\theta)$: enhancing the discriminatory potential of $p(y \mid \theta)$, minimizing the influence of the prior

Example (continued): $p_J(\theta) \propto \theta^{-1/2}(1-\theta)^{-1/2}$, i.e. $\vartheta \sim \mathrm{Beta}(1/2, 1/2)$.
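A numerical check of the binomial example, a sketch assuming the model $Y \sim B(\theta, N)$ above: the expected Fisher information is computed by brute force and compared with the $\mathrm{Beta}(1/2, 1/2)$ kernel.

```python
import numpy as np
from scipy.stats import binom

def fisher_info_binomial(theta, N):
    """Expected Fisher information J(theta) for Y ~ Binomial(N, theta),
    obtained by averaging -d^2 log p(y|theta)/d theta^2 over y."""
    y = np.arange(N + 1)
    d2 = -y / theta**2 - (N - y) / (1 - theta)**2      # second derivative of the log-likelihood
    return -np.sum(binom.pmf(y, N, theta) * d2)

N = 12
thetas = np.linspace(0.05, 0.95, 19)
J = np.array([fisher_info_binomial(t, N) for t in thetas])

# Jeffreys prior ∝ J(theta)^{1/2}; the ratio to theta^{-1/2}(1-theta)^{-1/2} should be constant
ratio = np.sqrt(J) / (thetas**-0.5 * (1 - thetas)**-0.5)
print(np.allclose(ratio, ratio[0]))        # True: p_J is the Beta(1/2, 1/2) kernel
```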
Properties / problems:
- Jeffreys priors may be improper
- invariance w.r.t. 1:1 transformations $\varphi = g(\theta)$:
$$J(\varphi) = J(\theta) \left( \frac{dg^{-1}}{d\varphi} \right)^2, \qquad p_J(\varphi) = p_J(g^{-1}(\varphi)) \left| \frac{dg^{-1}}{d\varphi} \right|$$
Special cases

Location parameters: $\mathcal{P}_{loc} = \{p(y \mid \theta) = p_0(y - \theta) : \theta \in \Theta\}$ is translation invariant, i.e. $Y' = Y - \theta'$ and $p(y) \in \mathcal{P}_{loc}$ imply $p(y') = p_0(y' - (\theta - \theta')) \in \mathcal{P}_{loc}$.
The Jeffreys prior $p_J(\theta) \propto 1$ is translation invariant, i.e. $p_J(\theta) = p_J(\theta - \theta')$.
Example: $Y \sim N(\mu, \sigma_0^2)$, $p_J(\mu) \propto 1$.

Scale parameters: $\mathcal{P}_{scale} = \{p(y \mid \theta) = \frac{1}{\theta} p_0(\frac{y}{\theta}) : \theta > 0\}$ is scale invariant, i.e. $Y' = Y/\theta$ and $p(y) \in \mathcal{P}_{scale}$ imply $p(y') = p_0(y')$.
The Jeffreys prior $p_J(\theta) \propto \frac{1}{\theta}$ is scale invariant, i.e. $p_J(\theta) = \frac{1}{c} p_J(\frac{\theta}{c})$, $c > 0$.
Example: $Y \sim N(\theta_0, \sigma^2)$, $p_J(\sigma) \propto \sigma^{-1}$.
Invariance w.r.t. sufficiency: if $t(y) = t$ is sufficient for $\theta$,
- sampling: $p(y \mid \theta)$ vs. $p(t \mid \theta)$
- prior: $p_{J,y}(\theta) \propto p_{J,t}(\theta)$
- posterior: $p(\theta \mid y) = p(\theta \mid t)$

Violation of the likelihood principle:
"For inferences or decisions about $\theta$ having observed y, all relevant information is contained in the likelihood function. Proportional likelihood functions contain the same information about $\theta$."
The expectation w.r.t. y (in $J(\theta)$) is problematic. But: lack of knowledge relative to that provided by the experiment changes with the experiment.
Example: flipping a coin in a series of trials yielding 9 heads and 3 tails

(i) Y = number of heads, number of trials N = 12 predetermined:
$Y \sim B(\theta, 12)$, $p_J(\theta) \propto \theta^{-1/2}(1-\theta)^{-1/2}$ (proper).

(ii) Coin flipped until 3 tails were observed, N random:
$N \sim \mathrm{NegBin}(1-\theta, 3)$, $p_J(\theta) \propto \theta^{-1/2}(1-\theta)^{-1}$ (improper).

The likelihood in both set-ups is $\propto \theta^9 (1-\theta)^3$.
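A numerical sketch of set-up (ii), assuming scipy's negative binomial parametrization (number of heads before the third tail, "success" probability $1-\theta$ for a tail); it confirms that the Jeffreys kernel differs from the $\mathrm{Beta}(1/2,1/2)$ kernel of set-up (i) even though the likelihoods are proportional.

```python
import numpy as np
from scipy.stats import nbinom

def jeffreys_kernel_negbin(theta, r=3, y_max=5000):
    """sqrt of the expected Fisher information when flipping until r tails are seen;
    y = number of heads, P(head) = theta, so y ~ NegBin(r, 1 - theta) in scipy's convention."""
    y = np.arange(y_max)
    pmf = nbinom.pmf(y, r, 1 - theta)                  # "success" = tail, probability 1 - theta
    d2 = -y / theta**2 - r / (1 - theta)**2            # d^2 log p(y | theta) / d theta^2
    return np.sqrt(-np.sum(pmf * d2))

thetas = np.linspace(0.1, 0.9, 9)
kernel = np.array([jeffreys_kernel_negbin(t) for t in thetas])

# compare with theta^{-1/2}(1 - theta)^{-1}: the ratio is constant,
# unlike a comparison with the binomial kernel theta^{-1/2}(1 - theta)^{-1/2}
ratio = kernel / (thetas**-0.5 * (1 - thetas)**-1.0)
print(np.round(ratio / ratio[0], 3))
```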
$\theta$ multivariate

Definition: expected Fisher information matrix
$$J(\theta) = \left( \left( -E_{y \mid \theta} \, \frac{\partial^2 \log p(y \mid \theta)}{\partial \theta_i \, \partial \theta_j} \right) \right),$$
Jeffreys prior $p_J(\theta) \propto \det(J(\theta))^{1/2}$.

Example: $Y \sim N(\mu, \sigma^2)$, $\theta = (\mu, \sigma)$: $p_J(\theta) \propto 1/\sigma^2$; if prior independence is assumed, $p_J(\theta) \propto 1/\sigma$.
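A symbolic check of the Normal example, a sketch using sympy: the expectations in the Fisher information matrix are computed by integrating against the density, and the square root of the determinant reproduces the $1/\sigma^2$ kernel.

```python
import sympy as sp

y, mu = sp.symbols('y mu', real=True)
sigma = sp.symbols('sigma', positive=True)

# density of N(mu, sigma^2) and its log-density (written out for clean differentiation)
p = sp.exp(-(y - mu)**2 / (2 * sigma**2)) / (sigma * sp.sqrt(2 * sp.pi))
logp = -sp.log(sigma) - sp.log(2 * sp.pi) / 2 - (y - mu)**2 / (2 * sigma**2)

params = (mu, sigma)
J = sp.zeros(2, 2)
for i in range(2):
    for j in range(2):
        d2 = sp.diff(logp, params[i], params[j])
        # expected Fisher information: -E_y[ d^2 log p / d theta_i d theta_j ]
        J[i, j] = sp.simplify(-sp.integrate(d2 * p, (y, -sp.oo, sp.oo)))

print(J)                  # Matrix([[1/sigma**2, 0], [0, 2/sigma**2]])
print(sp.sqrt(J.det()))   # sqrt(2)/sigma**2, i.e. p_J(mu, sigma) ∝ 1/sigma^2
```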
Problem: marginalization paradoxes: the marginal of the joint posterior may differ from the posterior based on the marginal (sampling) distribution.

Example:
$$Y \sim N\!\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix} \right), \qquad \theta = (\mu_1, \mu_2, \sigma_1, \sigma_2, \rho).$$
Jeffreys prior: $p_{J,y}(\theta) \propto (1-\rho^2)^{-3/2} \sigma_1^{-2} \sigma_2^{-2}$.

The empirical correlation coefficient r has a sampling distribution $q(r \mid \rho)$ depending only on $\rho$, but the Jeffreys prior based on it is $p_{J,r}(\rho) \propto (1-\rho^2)^{-1}$. Then $p_{J,y}(\rho \mid y) \neq p_{J,r}(\rho \mid r)$; for $p_{J,y}(\rho \mid y) = p_{J,r}(\rho \mid r)$ one needs $p(\theta) \propto (1-\rho^2)^{-1} \sigma_1^{-2} \sigma_2^{-2} \neq p_{J,y}(\theta)$.
3. Reference priors ($\theta$ univariate)

3.1 Idea and definition

Idea: information about $\theta$ is given in the prior $p(\theta)$ and by the experiment e. Maximize the effect of the data (i.e. minimize the effect of the prior) on the posterior $p(\theta \mid y)$ by maximizing the amount of information about $\theta$ that the experiment e is expected to provide,
$$I(e, p(\theta)) = E_y\big[KL(p(\theta \mid y), p(\theta))\big] = \int p(y) \int p(\theta \mid y) \log \frac{p(\theta \mid y)}{p(\theta)} \, d\theta \, dy.$$
Direct maximization w.r.t. $p(\theta)$ gives unappealing results (discrete support).
Asymptotic approach: consider k independent repetitions of the experiment, yielding $I(e(k), p(\theta))$, and maximize the missing information about $\theta$ (note $H(\vartheta) = H(\vartheta \mid Z) + I(Z, \vartheta)$),
$$I(e(\infty), p(\theta)) := \lim_{k \to \infty} I(e(k), p(\theta)).$$
Problem: possibly $I(e(\infty), p(\theta)) = \infty$.
Solution: find $\pi_k(\theta) = \arg\max I(e(k), p(\theta))$ and take the limit $\pi_k(\theta) \xrightarrow{k \to \infty} \pi(\theta)$.
Definition: Let $\pi_k(\theta) = \arg\max I(e(k), p(\theta))$ and $\pi_k(\theta \mid y)$ the corresponding posterior density. The reference posterior density $\pi(\theta \mid y)$ is defined to be the intrinsic limit of $\pi_k(\theta \mid y)$, i.e.
$$KL(\pi_k(\theta \mid y), \pi(\theta \mid y)) \xrightarrow{k \to \infty} 0.$$
A reference prior function $\pi(\theta)$ is any positive function generating the reference posterior density, i.e. $\pi(\theta \mid y) \propto p(y \mid \theta)\, \pi(\theta)$.
3.2 Explicit form

k independent repetitions of the experiment e yield $z_k = (y^{(1)}, \ldots, y^{(k)})$, $y^{(l)} = (y^{(l)}_1, \ldots, y^{(l)}_N)$.

Re-expression of $I(e(k), p(\theta))$:
$$I(e(k), p(\theta)) = \int p(\theta) \log \frac{f_k(\theta)}{p(\theta)} \, d\theta, \qquad (1)$$
where
$$f_k(\theta) = \exp\!\left( \int p(z_k \mid \theta) \log p(\theta \mid z_k) \, dz_k \right).$$

- Maximization w.r.t. $p(\theta)$ for given $f_k(\theta)$ yields $\pi_k(\theta) \propto f_k(\theta)$, but $f_k(\theta)$ implicitly depends on $p(\theta)$ through $p(\theta \mid y)$.
- An asymptotic approximation $p^*(\theta \mid y)$ yields $f_k^*(\theta)$ and $\pi_k^*(\theta) \propto f_k^*(\theta)$.
- Pragmatic (algorithmic) determination of the reference prior $\pi(\theta)$:
$$\pi(\theta) \propto \lim_{k \to \infty} \frac{f_k^*(\theta)}{f_k^*(\theta_0)};$$
division by $f_k^*(\theta_0)$ eliminates constants; the intrinsic limit is only checked if problems become apparent.
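A numerical sketch of this algorithmic recipe for binomial sampling, assuming a uniform working prior so that $p^*(\theta \mid z_k)$ is a Beta density; for moderate k the ratio $f_k^*(\theta)/f_k^*(\theta_0)$ already tracks the $\mathrm{Beta}(1/2,1/2)$ kernel, anticipating example (ii) of Section 3.6.

```python
import numpy as np
from scipy.stats import binom, beta

def f_k(theta, k, N=12):
    """f_k(theta) = exp( E_{z_k|theta}[ log p*(theta | z_k) ] ) for Y ~ Binomial(N, theta),
    with a uniform working prior, so p*(theta | z_k) = Beta(s + 1, kN - s + 1),
    s = total number of successes in the k repetitions."""
    n_tot = k * N
    s = np.arange(n_tot + 1)
    log_post = beta.logpdf(theta, s + 1, n_tot - s + 1)
    return np.exp(np.sum(binom.pmf(s, n_tot, theta) * log_post))

k, N, theta0 = 50, 12, 0.5
thetas = np.linspace(0.05, 0.95, 19)
pi = np.array([f_k(t, k, N) for t in thetas]) / f_k(theta0, k, N)   # divide out constants

jeffreys = (thetas * (1 - thetas))**-0.5 / (theta0 * (1 - theta0))**-0.5
print(np.round(pi / jeffreys, 3))     # ratios ≈ 1: the limit is the Beta(1/2, 1/2) kernel
```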
Proof of (1):
$$I(e(k), p(\theta)) = \int\!\!\int p(z_k)\, p(\theta \mid z_k) \log \frac{p(\theta \mid z_k)}{p(\theta)} \, d\theta \, dz_k$$
$$= \int p(\theta) \int p(z_k \mid \theta) \log \frac{p(\theta \mid z_k)}{p(\theta)} \, dz_k \, d\theta$$
$$= \int p(\theta) \int p(z_k \mid \theta) \log p(\theta \mid z_k) \, dz_k \, d\theta - \int p(\theta) \int p(z_k \mid \theta) \log p(\theta) \, dz_k \, d\theta$$
$$= \int p(\theta) \log \underbrace{\exp\!\Big( \int p(z_k \mid \theta) \log p(\theta \mid z_k) \, dz_k \Big)}_{f_k(\theta)} \, d\theta - \int p(\theta) \log p(\theta) \, d\theta$$
$$= \int p(\theta) \log f_k(\theta) \, d\theta - \int p(\theta) \log p(\theta) \, d\theta = \int p(\theta) \log \frac{f_k(\theta)}{p(\theta)} \, d\theta = -KL(p(\theta), f_k(\theta)).$$
3.3 Special case: $\Theta$ finite

$\Theta = \{\theta_1, \ldots, \theta_L\}$ and $\lim_{k \to \infty} p(\theta_i \mid z_k) = 1$ if $\theta_i$ is true, $0$ if $\theta_i$ is not true. Then
$$I(e(k), p(\theta)) = -E_{z_k} H(\vartheta \mid z_k) + H(\vartheta) \xrightarrow{k \to \infty} H(\vartheta).$$
Hence $\pi_k(\theta)$ is the maximum entropy prior on $\Theta$, i.e. $\pi(\theta)$ is uniform on $\Theta$.
3.4 Special case: $\Theta$ continuous

Starting point: $\pi_k^*(\theta) \propto f_k^*(\theta) = \exp\big( E_{z_k \mid \theta}[\log p^*(\theta \mid z_k)] \big)$.

- With a sufficient estimate $\hat{\theta}_k$: replace $z_k$ by $\hat{\theta}_k$,
$$f_k^*(\theta) \propto \exp\big( E_{\hat{\theta}_k \mid \theta}[\log p^*(\theta \mid \hat{\theta}_k)] \big).$$
- With a consistent estimate $\hat{\theta}_k$: $\hat{\theta}_k \to \theta$ and
$$f_k^*(\theta) \xrightarrow{k \to \infty} p^*(\theta \mid \hat{\theta}_k)\big|_{\hat{\theta}_k = \theta}.$$
- Often $\hat{\theta}_k$ is the mle with asymptotically Normal posterior distribution $\vartheta \mid z_k \sim N\big(\hat{\theta}_k, (kJ(\hat{\theta}_k))^{-1}\big)$,
$$p^*(\theta \mid z_k) = \frac{1}{\sqrt{2\pi}} \, k^{1/2} J(\hat{\theta}_k)^{1/2} \exp\!\left( -\frac{1}{2} \left( \frac{\theta - \hat{\theta}_k}{(kJ(\hat{\theta}_k))^{-1/2}} \right)^{\!2} \right)$$
and
$$p^*(\theta \mid \hat{\theta}_k)\big|_{\hat{\theta}_k = \theta} = \frac{1}{\sqrt{2\pi}} \, k^{1/2} J(\hat{\theta}_k)^{1/2}\big|_{\hat{\theta}_k = \theta} = \frac{1}{\sqrt{2\pi}} \, k^{1/2} J(\theta)^{1/2}.$$
- Hence (under regularity conditions) the reference prior is the Jeffreys prior:
$$\pi(\theta) \propto \lim_{k \to \infty} \frac{f_k^*(\theta)}{f_k^*(\theta_0)} \propto J(\theta)^{1/2}.$$
3.5 Restricted reference priors

Restrictions $E(g_i(\vartheta)) = \beta_i$: with Lagrange multipliers $\lambda_i$,
$$\pi_r(\theta) \propto \pi(\theta) \exp\Big( \sum_i \lambda_i g_i(\theta) \Big).$$
3.6 Examples

(i) $\Theta = \{\theta_1, \ldots, \theta_L\}$, no restriction: $\pi(\theta) = 1/L$.
$\Theta = \{\theta_1, \ldots, \theta_4\}$, restriction $p(\theta_1) = 2 p(\theta_2)$: $\pi_r(\theta) = \{0.324, 0.162, 0.257, 0.257\}$ (reproduced numerically below).

(ii) $Y \mid \theta \sim B(\theta, N)$: $\pi(\theta) = \mathrm{Beta}(1/2, 1/2)$.
Memo: $\mathrm{Beta}(1,1)$ = uniform for $\theta$; $\mathrm{Beta}(0,0)$ = uniform for $\mathrm{logit}\,\theta$.
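Example (i) with the restriction can be checked numerically: the sketch below maximizes the entropy over the four-point space subject to $p(\theta_1) = 2 p(\theta_2)$, using scipy's SLSQP solver for the equality constraints.

```python
import numpy as np
from scipy.optimize import minimize

# restricted reference prior on Theta = {theta_1, ..., theta_4}:
# maximize entropy subject to p1 = 2*p2 (and sum p_i = 1)
def neg_entropy(p):
    return np.sum(p * np.log(p))

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},
    {"type": "eq", "fun": lambda p: p[0] - 2.0 * p[1]},
]
p0 = np.full(4, 0.25)                                  # start from the uniform prior
res = minimize(neg_entropy, p0, method="SLSQP",
               bounds=[(1e-9, 1.0)] * 4, constraints=constraints)
print(np.round(res.x, 3))                              # ≈ [0.324, 0.162, 0.257, 0.257]
```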
(iii)
(a) No restriction: $Y \mid \theta \sim N(\theta, \sigma_0^2)$, $\pi(\theta) \propto 1$.
(b) No restriction: $Y \mid \sigma \sim N(0, \sigma^2)$, $\pi(\sigma) \propto 1/\sigma$ (equivalently $\pi(\sigma^2) \propto 1/\sigma^2$).
(c) With restrictions: $Y \mid \theta \sim N(\theta, \sigma^2)$,
$$g_1(\theta) = \theta, \quad E(\vartheta) = \mu_0; \qquad g_2(\theta) = (\theta - \mu_0)^2, \quad \mathrm{var}(\vartheta) = \tau_0^2;$$
$$\pi_r(\theta) \propto 1 \cdot \exp\big(\lambda_1 \theta + \lambda_2 (\theta - \mu_0)^2\big),$$
which is the $N(\mu_0, \tau_0^2)$ density.
4. Discussion

- Principled objection: priors should represent subjective knowledge.
- Violation of the likelihood principle.
- Crucial model dependence.
- Involved asymptotic definition versus a default/automated procedure.
- By and large heuristic; the formal elaboration is still work in progress.
- General criterion for the derivation of default priors.
- Claim: represent lack of prior knowledge about the quantity of interest relative to that provided by the data.
- Matching frequentist coverage probabilities.
- Quantity of interest: the parameter $\theta$ or a future observation $\tilde{y}$; reference priors for prediction (Kuboki, 1998; work in progress by Sweeting/Datta/Ghosh).
References

[1] Berger, J.O. (1985). Statistical Decision Theory and Bayesian Analysis (2nd ed.). Springer: New York. (Chapter 3.3)
[2] Bernardo, J.M. and Smith, A.F.M. (1994). Bayesian Theory. Wiley: New York. (Chapter 5)
[3] Bernardo, J.M. (1997). Noninformative Priors Do Not Exist: A Discussion. J. Statist. Pl. Inf. 65, 159-189 (with discussion).
[4] Bernardo, J.M. (1998). Bayesian Reference Analysis. A Postgraduate Tutorial Course. Available from: www.uv.es/~bernardo
[5] Bernardo, J.M. and Berger, J.O. (1992). On the Development of Reference Priors. In: Bernardo et al. (Eds.), Bayesian Statistics 4. Oxford University Press: London, 35-60.
[6] Kass, R.E. and Wasserman, L. (1996). The Selection of Prior Distributions by Formal Rules. J. Amer. Statist. Ass. 91, 1343-1370.
[7] Kuboki, H. (1998). Reference Priors for Prediction. J. Statist. Pl. Inf. 69, 295-317.
[8] Robert, C.P. (1994). The Bayesian Choice. Springer: New York. (Chapter 3.4)