Non-informative prior distributions

ANGELIKA VAN DER LINDE
University of Bremen
March 2004

1. Introduction
2. Standard non-informative priors
3. Reference priors (univariate parameters)
4. Discussion

Preliminaries

- random vector $Y$ generates data $y$
- single observations correspond to $Y_n \in \mathbb{R}^q$; $Y = (Y_1, \ldots, Y_N)^T$, sample size $N$
- model $M_i$: $p_i(y \mid \theta)$ and prior $p_i(\theta)$, $\theta \in \Theta \subseteq \mathbb{R}^{s_i}$
- $\theta$ varies with $M_i$, but subscripts are most often omitted
  - $s_i = 1$: $\theta$ univariate
  - $s_i > 1$: $\theta$ multivariate
- special case, iid data: model $M_i$: $p_i(y \mid \theta)$ and prior $p_i(\theta)$, $\theta \in \Theta \subseteq \mathbb{R}^{s_i}$, with $p_i(y \mid \theta) = \prod_n p_i(y_n \mid \theta)$

1. Introduction

1.1 Existence of prior distributions

de Finetti's representation theorem: let $Y_1, Y_2, \ldots$ be real valued and exchangeable w.r.t. $P$ (i.e. permutations of finite subsets have the same distribution). Then there is a measure $Q$ on the set $\mathcal{F}$ of distribution functions such that
$$P(Y_1 \leq y_1, \ldots, Y_N \leq y_N) = \int_{\mathcal{F}} \prod_{n=1}^{N} F(y_n)\, dQ(F).$$

1.2 Bayesian inference

Update of the prior density by Bayes' theorem:
$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)} \propto p(y \mid \theta)\, p(\theta),$$
i.e. posterior $\propto$ likelihood $\times$ prior.

1.3 Specification of priors

- subjective prior knowledge: informative prior
- ignorance: non-informative / objective / neutral prior
- non-informative priors do not exist (cf. Bernardo, 1997)
- needed: non-subjective priors inducing dominance of the data in the posterior
  - may depend on the sampling model and on the quantity of interest
  - priors may be improper (yielding proper posteriors)
  - improper priors are merely a technical device, not interpretable in terms of probability/beliefs
  - yardstick for sensitivity analyses

2. Conventional objective priors

2.1 Uniform/flat priors ($\theta$ univariate)

(i) uniform priors

definition:
- $\Theta = \{\theta_1, \ldots, \theta_L\}$: $p(\theta) = 1/L$
- $\Theta \subseteq \mathbb{R}$ continuous: $p(\theta) \propto 1$

interpretation:
- equal weight to all parameter values
- principle of insufficient reason

properties / problems:
- appropriate for finite $\Theta$
- if $\Theta$ is not compact, the posterior $p(\theta \mid y)$ may be improper (useless)
- lack of invariance w.r.t. 1:1 transformations
- may induce inadmissible estimators of $\theta$

(ii) limits of flat (proper) priors
- e.g. conjugate priors, interpreted as information from a former experiment with sample size $m$; consider $m \to 0$
- do not solve these problems

example

- sampling: $Y \sim B(\theta, N)$, $p(y \mid \theta) = \binom{N}{y}\, \theta^y (1-\theta)^{N-y}$, $\theta \in (0,1)$
- prior: $\vartheta \sim \mathrm{Beta}(\alpha, \beta)$, $p(\theta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}$, $\alpha, \beta > 0$, $m = \alpha + \beta$
- posterior: $\vartheta \mid y \sim \mathrm{Beta}(\alpha + y,\, \beta + N - y)$

(i) proper prior: $p(\theta) \propto 1$, i.e. $\vartheta \sim \mathrm{Beta}(1,1)$; $p(\theta \mid y)$ proper

(ii) lack of invariance, improper posterior (checked numerically below): for $\varphi = \mathrm{logit}\,\theta = \log\frac{\theta}{1-\theta} \in \mathbb{R}$ and $p(\varphi) \propto 1$,
$$p(\theta) = p(\mathrm{logit}\,\theta)\left|\frac{d\,\mathrm{logit}\,\theta}{d\theta}\right| = \theta^{-1}(1-\theta)^{-1},$$
i.e. $\vartheta \sim \mathrm{Beta}(0,0)$, improper; $\vartheta \mid y \sim \mathrm{Beta}(y, N-y)$, improper if $y \in \{0, N\}$

(iii) limit $m = \alpha + \beta \to 0$ with $\alpha, \beta > 0$: improper posterior as in (ii)
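The lack of invariance in (ii) can be made concrete numerically. A minimal sketch (assuming numpy/scipy are available; variable names are illustrative) contrasts the posterior from a flat prior on $\theta$ with the one induced by a flat prior on $\mathrm{logit}\,\theta$:

```python
import numpy as np
from scipy import stats

N, y = 12, 9  # binomial experiment: y successes in N trials

# (i) flat prior on theta: Beta(1, 1) -> posterior Beta(1 + y, 1 + N - y)
post_flat_theta = stats.beta(1 + y, 1 + N - y)

# (ii) flat prior on logit(theta): corresponds to the improper Beta(0, 0),
# formally giving the posterior Beta(y, N - y) (proper only if 0 < y < N)
post_flat_logit = stats.beta(y, N - y)

# the two posteriors differ, e.g. in their means:
print(post_flat_theta.mean())  # (y + 1) / (N + 2), approx. 0.714
print(post_flat_logit.mean())  # y / N = 0.75
```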

$\theta$ multivariate: example (Stein's paradox)

- sampling: $Y_n \sim N(\theta, I_q)$ iid; $\bar{Y}$ sufficient, $\bar{Y} \sim N(\theta, N^{-1} I_q)$
- prior: $p(\theta) \propto 1$, $\theta \in \mathbb{R}^q$
- posterior yields a bad estimate of $\varphi = \|\theta\|^2$ (if $q$ is large and $N$ small):
$$E(\varphi \mid y) = \|\bar{y}\|^2 + \frac{q}{N}, \qquad \text{whereas } \hat{\varphi} = \|\bar{y}\|^2 - \frac{q}{N} \text{ is the best estimate.}$$
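The posterior expectation follows in one line: under the flat prior the posterior is $\vartheta \mid y \sim N(\bar{y}, N^{-1} I_q)$, so
$$E(\|\vartheta\|^2 \mid y) = \|E(\vartheta \mid y)\|^2 + \mathrm{tr}\,\mathrm{Cov}(\vartheta \mid y) = \|\bar{y}\|^2 + \mathrm{tr}(N^{-1} I_q) = \|\bar{y}\|^2 + \frac{q}{N},$$
i.e. the flat prior pushes the estimate upwards by $q/N$ instead of correcting $\|\bar{y}\|^2$ downwards.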

2.2 Jeffreys prior ($\theta$ univariate)

definition:
$$p_J(\theta) \propto \left[ -\int p(y \mid \theta)\, \frac{d^2 \log p(y \mid \theta)}{d\theta^2}\, dy \right]^{1/2} = \left[ -E_{y \mid \theta}\!\left( \frac{d^2 \log p(y \mid \theta)}{d\theta^2} \right)\right]^{1/2} =: J(\theta)^{1/2}$$

interpretation:
- root of the expected Fisher information
- $KL\big(p(y \mid \theta),\, p(y \mid \theta + \Delta\theta)\big) \approx \tfrac{1}{2} J(\theta)\, (\Delta\theta)^2$
- favours values of $\theta$ with large $J(\theta)$, enhancing the discriminatory potential of $p(y \mid \theta)$ and minimizing the influence of the prior

example (continued): $p_J(\theta) \propto \theta^{-1/2}(1-\theta)^{-1/2}$, i.e. $\vartheta \sim \mathrm{Beta}(1/2, 1/2)$
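The binomial example can be reproduced symbolically. A minimal sketch (assuming sympy; names are illustrative):

```python
import sympy as sp

theta, y, N = sp.symbols('theta y N', positive=True)

# binomial log-likelihood up to additive constants
loglik = y * sp.log(theta) + (N - y) * sp.log(1 - theta)

# expected Fisher information: J(theta) = -E[d^2 loglik/dtheta^2], with E[y] = N*theta
d2 = sp.diff(loglik, theta, 2)
J = sp.simplify(-d2.subs(y, N * theta))
print(J)           # N/(theta*(1 - theta))
print(sp.sqrt(J))  # proportional to theta^(-1/2) * (1 - theta)^(-1/2)
```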

properties / problems:
- Jeffreys priors may be improper
- invariance w.r.t. 1:1 transformations $\varphi = g(\theta)$:
$$J(\varphi) = J(\theta)\left(\frac{dg^{-1}}{d\varphi}\right)^{2}, \qquad p_J(\varphi) = p_J\big(g^{-1}(\varphi)\big)\left|\frac{dg^{-1}}{d\varphi}\right|$$
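The second identity is exactly the change-of-variables rule for densities, which is why Jeffreys priors are called invariant: taking the square root of the first identity gives
$$p_J(\varphi) \propto J(\varphi)^{1/2} = J(\theta)^{1/2}\left|\frac{dg^{-1}}{d\varphi}\right| \propto p_J\big(g^{-1}(\varphi)\big)\left|\frac{dg^{-1}}{d\varphi}\right|.$$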

special cases

location parameters: $\mathcal{P}_{\mathrm{loc}} = \{p(y \mid \theta) = p_0(y - \theta),\ \theta \in \Theta\}$ is translation invariant, i.e. for $Y' = Y - \theta'$ and $p(y) \in \mathcal{P}_{\mathrm{loc}}$, $p(y') = p_0(y' - (\theta - \theta')) \in \mathcal{P}_{\mathrm{loc}}$. The Jeffreys prior $p_J(\theta) \propto 1$ is translation invariant, i.e. $p_J(\theta) = p_J(\theta - \theta')$.
example: $Y \sim N(\mu, \sigma_0^2)$, $p_J(\mu) \propto 1$

scale parameters: $\mathcal{P}_{\mathrm{scale}} = \{p(y \mid \theta) = \frac{1}{\theta}\, p_0(\frac{y}{\theta}),\ \theta > 0\}$ is scale invariant, i.e. for $Y' = Y/\theta$ and $p(y) \in \mathcal{P}_{\mathrm{scale}}$, $p(y') = p_0(y')$. The Jeffreys prior $p_J(\theta) \propto \frac{1}{\theta}$ is scale invariant, i.e. $p_J(\theta) = \frac{1}{c}\, p_J(\frac{\theta}{c})$, $c > 0$.
example: $Y \sim N(\theta_0, \sigma^2)$, $p_J(\sigma) \propto \sigma^{-1}$
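Both special cases can be read off the definition of $J(\theta)$. For a location family, substituting $u = y - \theta$ gives
$$J(\theta) = \int \frac{p_0'(y - \theta)^2}{p_0(y - \theta)}\, dy = \int \frac{p_0'(u)^2}{p_0(u)}\, du = \text{const.}, \qquad \text{hence } p_J(\theta) \propto 1;$$
for a scale family, the substitution $u = y/\theta$ gives $J(\theta) = c/\theta^2$ for a constant $c$ not depending on $\theta$, hence $p_J(\theta) \propto 1/\theta$.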

invariance w.r.t. sufficiency: if $t(y) = t$ is sufficient for $\theta$, then

- sampling: $p(y \mid \theta)$ or $p(t \mid \theta)$
- prior: $p_{J,y}(\theta) \propto p_{J,t}(\theta)$
- posterior: $p(\theta \mid y) = p(\theta \mid t)$

violation of the likelihood principle: "For inferences or decisions about $\theta$, having observed $y$, all relevant information is contained in the likelihood function. Proportional likelihood functions contain the same information about $\theta$." The expectation w.r.t. $y$ (in $J(\theta)$) is therefore problematic. But: lack of knowledge relative to that provided by the experiment changes with the experiment.

example: flipping a coin in a series of trials yields 9 heads and 3 tails

(i) $Y$ = number of heads, number of trials $N = 12$ predetermined:
$Y \sim B(\theta, 12)$, $p_J(\theta) \propto \theta^{-1/2}(1-\theta)^{-1/2}$ (proper)

(ii) coin flipped until 3 tails are observed, $N$ random:
$N \sim \mathrm{NegBin}(1-\theta, 3)$, $p_J(\theta) \propto \theta^{-1/2}(1-\theta)^{-1}$ (improper)

The likelihood is $\propto \theta^9 (1-\theta)^3$ in both set-ups, yet the Jeffreys priors (and hence the posteriors) differ.
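The negative-binomial prior can be checked the same way as the binomial one above. A minimal sketch (assuming sympy; here $y$ is the number of heads before the 3rd tail, so $E[y] = 3\theta/(1-\theta)$ under the stated stopping rule):

```python
import sympy as sp

theta, y = sp.symbols('theta y', positive=True)
r = 3  # stop at the 3rd tail

# negative binomial log-likelihood up to constants: y heads, r tails
loglik = y * sp.log(theta) + r * sp.log(1 - theta)

# J(theta) = -E[d^2 loglik/dtheta^2], with E[y] = r*theta/(1 - theta)
J = sp.simplify(-sp.diff(loglik, theta, 2).subs(y, r * theta / (1 - theta)))
print(J)           # r/(theta*(1 - theta)**2)
print(sp.sqrt(J))  # proportional to theta^(-1/2) * (1 - theta)^(-1)
```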

$\theta$ multivariate

definition: expected Fisher information matrix
$$J(\theta) = \left( -E_{y \mid \theta}\, \frac{\partial^2 \log p(y \mid \theta)}{\partial \theta_i\, \partial \theta_j} \right)_{i,j}$$
Jeffreys prior: $p_J(\theta) \propto \det(J(\theta))^{1/2}$

example: $Y \sim N(\mu, \sigma^2)$, $\theta = (\mu, \sigma)$: $p_J(\theta) \propto 1/\sigma^2$; if prior independence is assumed, $p_J(\theta) \propto 1/\sigma$
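For the normal example the information matrix is diagonal, which makes both results immediate:
$$J(\mu, \sigma) = \begin{pmatrix} 1/\sigma^2 & 0 \\ 0 & 2/\sigma^2 \end{pmatrix}, \qquad \det(J)^{1/2} = \frac{\sqrt{2}}{\sigma^2} \propto \frac{1}{\sigma^2},$$
while treating $\mu$ and $\sigma$ separately (prior independence) multiplies the one-dimensional Jeffreys priors $p_J(\mu) \propto 1$ and $p_J(\sigma) \propto 1/\sigma$.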

problem: marginalization paradoxes: the marginal of the joint posterior need not equal the posterior based on the marginal (sampling) distribution.

example:
$$Y \sim N\!\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix},\ \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix} \right), \qquad \theta = (\mu_1, \mu_2, \sigma_1, \sigma_2, \rho)$$

Jeffreys prior: $p_{J,y}(\theta) \propto (1-\rho^2)^{-3/2}\, \sigma_1^{-2}\, \sigma_2^{-2}$

Let $r$ be the empirical correlation coefficient, with sampling density $q(r \mid \rho)$ depending only on $\rho$; the Jeffreys prior based on $q$ is $p_{J,r}(\rho) \propto (1-\rho^2)^{-1}$. Then
$$p_{J,y}(\rho \mid y) \neq p_{J,r}(\rho \mid r);$$
for $p_{J,y}(\rho \mid y) = p_{J,r}(\rho \mid r)$ one would need $p(\theta) \propto (1-\rho^2)^{-1} \sigma_1^{-2} \sigma_2^{-2} \neq p_{J,y}(\theta)$.

3. Reference priors ($\theta$ univariate)

3.1 Idea and definition

idea: information about $\theta$ is given by
- the prior $p(\theta)$
- the experiment $e$

maximize the effect of the data (i.e. minimize the effect of the prior) on the posterior $p(\theta \mid y)$: maximize the amount of information about $\theta$ that experiment $e$ is expected to provide,
$$I(e, p(\theta)) = E_y\big[KL(p(\theta \mid y), p(\theta))\big] = \int p(y) \int p(\theta \mid y) \log \frac{p(\theta \mid y)}{p(\theta)}\, d\theta\, dy.$$

Direct maximization w.r.t. $p(\theta)$ gives unappealing results (discrete support).
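To make $I(e, p(\theta))$ concrete, it can be evaluated for the beta-binomial set-up of Section 2.1, where both the marginal $p(y)$ and the KL divergence between posterior and prior have closed forms. A minimal sketch (assuming scipy; the KL formula between two Beta densities is the standard one; names are illustrative):

```python
import numpy as np
from scipy.special import betaln, digamma
from scipy.stats import betabinom

def kl_beta(a1, b1, a2, b2):
    """KL(Beta(a1, b1) || Beta(a2, b2))."""
    return (betaln(a2, b2) - betaln(a1, b1)
            + (a1 - a2) * digamma(a1) + (b1 - b2) * digamma(b1)
            + (a2 - a1 + b2 - b1) * digamma(a1 + b1))

def expected_information(a, b, N):
    """I(e, p) = E_y[ KL(posterior, prior) ] for Y ~ BetaBinom(N, a, b)."""
    y = np.arange(N + 1)
    p_y = betabinom.pmf(y, N, a, b)        # marginal p(y)
    kl = kl_beta(a + y, b + N - y, a, b)   # KL(p(theta|y), p(theta))
    return np.sum(p_y * kl)

print(expected_information(0.5, 0.5, 12))  # Jeffreys prior Beta(1/2, 1/2)
print(expected_information(1.0, 1.0, 12))  # uniform prior Beta(1, 1)
```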

asymptotic approach: consider $k$ independent repetitions of the experiment, yielding $I(e(k), p(\theta))$, and maximize the missing information about $\theta$ (using $H(\vartheta) = H(\vartheta \mid Z) + I(Z, \vartheta)$):
$$I(e(\infty), p(\theta)) := \lim_{k \to \infty} I(e(k), p(\theta))$$

problem: possibly $I(e(\infty), p(\theta)) = \infty$

solution: find
$$\pi_k(\theta) = \arg\max_{p(\theta)} I(e(k), p(\theta))$$
and take the limit $\pi_k(\theta) \xrightarrow{k \to \infty} \pi(\theta)$.

definition: Let $\pi_k(\theta) = \arg\max I(e(k), p(\theta))$ and let $\pi_k(\theta \mid y)$ be the corresponding posterior density. The reference posterior density $\pi(\theta \mid y)$ is defined to be the intrinsic limit of $\pi_k(\theta \mid y)$, i.e.
$$KL(\pi_k(\theta \mid y), \pi(\theta \mid y)) \xrightarrow{k \to \infty} 0.$$
A reference prior function $\pi(\theta)$ is any positive function generating the reference posterior density, i.e. $\pi(\theta \mid y) \propto p(y \mid \theta)\, \pi(\theta)$.

3.2 Explicit form

$k$ independent repetitions of the experiment $e$ yield $z_k = (y^{(1)}, \ldots, y^{(k)})$, $y^{(l)} = (y_1^{(l)}, \ldots, y_N^{(l)})$.

re-expression of $I(e(k), p(\theta))$:
$$I(e(k), p(\theta)) = \int p(\theta) \log \frac{f_k(\theta)}{p(\theta)}\, d\theta \qquad (1)$$
where
$$f_k(\theta) = \exp\left( \int p(z_k \mid \theta) \log p(\theta \mid z_k)\, dz_k \right).$$

Maximization w.r.t. $p(\theta)$ given $f_k(\theta)$ yields $\pi_k(\theta) \propto f_k(\theta)$, but $f_k(\theta)$ implicitly depends on $p(\theta)$ through $p(\theta \mid z_k)$.

An asymptotic approximation $p^*(\theta \mid z_k)$ yields $f_k^*(\theta)$ and $\pi_k^*(\theta) \propto f_k^*(\theta)$.

pragmatic (algorithmic) determination of the reference prior $\pi(\theta)$:
$$\pi(\theta) \propto \lim_{k \to \infty} \frac{f_k^*(\theta)}{f_k^*(\theta_0)}$$
- division by $f_k^*(\theta_0)$ eliminates constants
- the intrinsic limit is only checked if problems become apparent

proof of (1):
$$\begin{aligned}
I(e(k), p(\theta)) &= \int p(z_k) \int p(\theta \mid z_k) \log \frac{p(\theta \mid z_k)}{p(\theta)}\, d\theta\, dz_k \\
&= \int p(\theta) \int p(z_k \mid \theta) \log \frac{p(\theta \mid z_k)}{p(\theta)}\, dz_k\, d\theta \\
&= \int p(\theta) \int p(z_k \mid \theta) \log p(\theta \mid z_k)\, dz_k\, d\theta - \int p(\theta) \int p(z_k \mid \theta) \log p(\theta)\, dz_k\, d\theta \\
&= \int p(\theta) \log \underbrace{\exp\left( \int p(z_k \mid \theta) \log p(\theta \mid z_k)\, dz_k \right)}_{f_k(\theta)}\, d\theta - \int p(\theta) \log p(\theta)\, d\theta \\
&= \int p(\theta) \log f_k(\theta)\, d\theta - \int p(\theta) \log p(\theta)\, d\theta \\
&= \int p(\theta) \log \frac{f_k(\theta)}{p(\theta)}\, d\theta = -KL(p(\theta), f_k(\theta)).
\end{aligned}$$

3.3 Special case: $\Theta$ finite

$\Theta = \{\theta_1, \ldots, \theta_L\}$ and
$$\lim_{k \to \infty} p(\theta_i \mid z_k) = \begin{cases} 1 & \text{if } \theta_i \text{ true} \\ 0 & \text{if } \theta_i \text{ not true} \end{cases}$$
$$I(e(k), p(\theta)) = -E_{z_k} H(\vartheta \mid z_k) + H(\vartheta) \xrightarrow{k \to \infty} H(\vartheta),$$
hence $\pi_k(\theta)$ tends to the maximum entropy prior on $\Theta$, i.e. $\pi(\theta)$ is uniform on $\Theta$.

3.4 Special case: $\Theta$ continuous

starting point:
$$\pi_k^*(\theta) \propto f_k^*(\theta) = \exp\big( E_{z_k \mid \theta}[\log p^*(\theta \mid z_k)] \big)$$

with a sufficient estimate $\hat\theta_k$: replace $z_k$ by $\hat\theta_k$,
$$f_k^*(\theta) = \exp\big( E_{\hat\theta_k \mid \theta}[\log p^*(\theta \mid \hat\theta_k)] \big);$$

with a consistent estimate $\hat\theta_k$: $\hat\theta_k \to \theta$,
$$f_k^*(\theta) \xrightarrow{k \to \infty} p^*(\theta \mid \hat\theta_k)\big|_{\hat\theta_k = \theta}.$$

Often $\hat\theta_k$ is the MLE with asymptotically Normal posterior distribution $\vartheta \mid z_k \sim N(\hat\theta_k, (k J(\hat\theta_k))^{-1})$:
$$p^*(\theta \mid z_k) = \frac{1}{\sqrt{2\pi}}\, k^{1/2} J(\hat\theta_k)^{1/2} \exp\!\left( -\frac{1}{2}\, \frac{(\theta - \hat\theta_k)^2}{(k J(\hat\theta_k))^{-1}} \right)$$
and
$$p^*(\theta \mid \hat\theta_k)\big|_{\hat\theta_k = \theta} = \frac{1}{\sqrt{2\pi}}\, k^{1/2} J(\theta)^{1/2};$$
hence (under regularity conditions) the reference prior is Jeffreys prior:
$$\pi(\theta) \propto \lim_{k \to \infty} \frac{f_k^*(\theta)}{f_k^*(\theta_0)} \propto J(\theta)^{1/2}.$$

3.5 Restricted reference priors

restrictions $E(g_i(\vartheta)) = \beta_i$; with Lagrange multipliers $\lambda_i$,
$$\pi_r(\theta) \propto \pi(\theta) \exp\Big( \sum_i \lambda_i\, g_i(\theta) \Big).$$

3.6 Examples

(i) $\Theta = \{\theta_1, \ldots, \theta_L\}$, no restriction: $\pi(\theta) = 1/L$.
$\Theta = \{\theta_1, \ldots, \theta_4\}$, restriction $p(\theta_1) = 2\, p(\theta_2)$: $\pi_r(\theta) = \{0.324, 0.162, 0.257, 0.257\}$ (verified numerically below).

(ii) $Y \mid \theta \sim B(\theta, N)$: $\pi(\theta) = \mathrm{Beta}(1/2, 1/2)$.
memo: $\mathrm{Beta}(1,1)$ = uniform for $\theta$; $\mathrm{Beta}(0,0)$ = uniform for $\mathrm{logit}\,\theta$.
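The restricted prior in (i) maximizes entropy subject to the given constraint, which can be verified numerically. A minimal sketch (assuming scipy; the set-up is the four-point example above):

```python
import numpy as np
from scipy.optimize import minimize

# maximize entropy over p = (p1, p2, p3, p4),
# subject to sum(p) = 1 and the restriction p1 = 2*p2
def neg_entropy(p):
    return np.sum(p * np.log(p))

cons = ({'type': 'eq', 'fun': lambda p: np.sum(p) - 1},
        {'type': 'eq', 'fun': lambda p: p[0] - 2 * p[1]})

res = minimize(neg_entropy, x0=np.full(4, 0.25),
               bounds=[(1e-9, 1)] * 4, constraints=cons)
print(res.x.round(3))  # approx. [0.324, 0.162, 0.257, 0.257]
```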

(iii)
(a) no restriction: $Y \mid \theta \sim N(\theta, \sigma_0^2)$, $\pi(\theta) \propto 1$
(b) no restriction: $Y \mid \sigma \sim N(0, \sigma^2)$, $\pi(\sigma) \propto 1/\sigma$ (equivalently $\pi(\sigma^2) \propto 1/\sigma^2$)
(c) with restrictions: $Y \mid \theta \sim N(\theta, \sigma^2)$; $g_1(\theta) = \theta$ with $E(\vartheta) = \mu_0$, and $g_2(\theta) = (\theta - \mu_0)^2$ with $\mathrm{var}(\vartheta) = \tau_0^2$:
$$\pi_r(\theta) \propto 1 \cdot \exp\big( \lambda_1 \theta + \lambda_2 (\theta - \mu_0)^2 \big) = N(\mu_0, \tau_0^2)$$
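In (c) the multipliers can be solved explicitly: the kernel $\exp(\lambda_1 \theta + \lambda_2 (\theta - \mu_0)^2)$ is Gaussian, and matching the two moment constraints gives
$$\lambda_1 = 0, \qquad \lambda_2 = -\frac{1}{2\tau_0^2}, \qquad \pi_r(\theta) \propto \exp\!\left( -\frac{(\theta - \mu_0)^2}{2\tau_0^2} \right),$$
i.e. the restricted reference prior is the $N(\mu_0, \tau_0^2)$ density.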

4. Discussion

- principled objection: priors should represent subjective knowledge
- violation of the likelihood principle
- model dependence is crucial
- involved asymptotic definition versus a default/automated procedure
- by and large heuristic; formal elaboration still under way
- but: a general criterion for the derivation of default priors
- claim: they represent lack of prior knowledge about the quantity of interest relative to that provided by the data
- matching frequentist coverage probabilities
- quantity of interest: parameter $\theta$, or future observation $\tilde{y}$; reference priors for prediction (Kuboki, 1998; work in progress by Sweeting/Datta/Ghosh)

References

[1] Berger, J.O. (1985). Statistical Decision Theory and Bayesian Analysis (2nd ed.). Springer: New York. (Chapter 3.3)
[2] Bernardo, J.M. and Smith, A.F.M. (1994). Bayesian Theory. Wiley: New York. (Chapter 5)
[3] Bernardo, J.M. (1997). Noninformative Priors Do Not Exist: A Discussion. J. Statist. Pl. Inf. 65, 159-189 (with discussion).
[4] Bernardo, J.M. (1998). Bayesian Reference Analysis. A Postgraduate Tutorial Course. Available from: www.uv.es/~bernardo
[5] Berger, J.O. and Bernardo, J.M. (1992). On the Development of Reference Priors. In: Bernardo et al. (Eds.), Bayesian Statistics 4. Oxford University Press: London, 35-60.
[6] Kass, R.E. and Wasserman, L. (1996). The Selection of Prior Distributions by Formal Rules. J. Amer. Statist. Ass. 91, 1343-1370.
[7] Kuboki, H. (1998). Reference Priors for Prediction. J. Statist. Pl. Inf. 69, 295-317.
[8] Robert, C.P. (1994). The Bayesian Choice. Springer: New York. (Chapter 3.4)