Introduction to Bayesian Statistics

Lecture 9: Hierarchical Models
Rung-Ching Tsai
Department of Mathematics, National Taiwan Normal University
May 6, 2015

Example

Data: Weekly weights of 30 young rats (Gelfand, Hills, Racine-Poon, & Smith, 1990).

Day       8    15    22    29    36
Rat 1   151   199   246   283   320
Rat 2   145   199   249   293   354
...
Rat 30  153   200   244   286   324

Model: $Y_{ij} = \alpha + \beta x_j + \epsilon_{ij}$, where $Y_{ij}$ is the weight of the $i$-th rat on day $x_j$ and $\epsilon_{ij} \sim \text{Normal}(0, \sigma^2)$.

What is the assumption on the growth of the 30 rats in this model?
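To make the pooled assumption concrete, here is a minimal sketch that fits one common line by least squares to the three rats whose rows appear on the slide (the full data set has 30 rats). The variable names are illustrative, and ordinary least squares is used here only as a stand-in for the pooled model fit.

```python
import numpy as np

# Days of measurement and the three rats shown on the slide (rats 1, 2, 30).
days = np.array([8, 15, 22, 29, 36], dtype=float)
weights = np.array([
    [151, 199, 246, 283, 320],   # rat 1
    [145, 199, 249, 293, 354],   # rat 2
    [153, 200, 244, 286, 324],   # rat 30
], dtype=float)

# Pooled model: every rat shares the same intercept alpha and slope beta.
x = np.tile(days, weights.shape[0])     # day of each observation, rat by rat
y = weights.ravel()                     # matching weights, same ordering
beta, alpha = np.polyfit(x, y, deg=1)   # polyfit returns the slope first

residuals = y - (alpha + beta * x)
print(f"alpha = {alpha:.1f}, beta = {beta:.2f}")
print("mean residual per rat:", residuals.reshape(weights.shape).mean(axis=1))
```

A rat that grows consistently faster or slower than the common line shows up as a nonzero mean residual, which is exactly what a single $(\alpha, \beta)$ cannot absorb.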

Example

Data: Number of failures and length of operation time of 10 power plant pumps (George, Makov, & Smith, 1993).

Pump        1     2     3     4     5     6     7     8     9    10
Time     94.5  15.7  62.9   126  5.24  31.4  1.05  1.05   2.1  10.5
Failures    5     1     5    14     3    19     1     1     4    22

Model: $X_i \sim \text{Poisson}(\lambda t_i)$, where $X_i$ is the number of failures of pump $i$, $\lambda$ is the failure rate, and $t_i$ is the length of operation time of pump $i$ (in 1000s of hours).

What is the assumption on the failure rates of the 10 power plant pumps in this model?

Possible problems with the above approaches

- A single $(\alpha, \beta)$ may be inadequate to fit all the rats. Likewise, a common failure rate for all the power plant pumps may not be suitable.
- Separate, unrelated $(\alpha_i, \beta_i)$ for each rat, or $\lambda_i$ for each pump, are likely to overfit the data.
- Some information about the parameters of one rat or one pump can be obtained from the data of the others.
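The two extremes can be compared directly on the pump data. The sketch below computes the Poisson maximum likelihood estimate under complete pooling (one $\lambda$) and under no pooling (one $\lambda_i$ per pump); the variable names are illustrative.

```python
import numpy as np

# Pump data from the slide: operation time (1000s of hours) and failures.
time = np.array([94.5, 15.7, 62.9, 126, 5.24, 31.4, 1.05, 1.05, 2.1, 10.5])
failures = np.array([5, 1, 5, 14, 3, 19, 1, 1, 4, 22])

# Complete pooling: a single rate lambda shared by all pumps (Poisson MLE).
pooled_rate = failures.sum() / time.sum()

# No pooling: a separate MLE lambda_i = x_i / t_i for each pump.
separate_rates = failures / time

print(f"pooled rate: {pooled_rate:.3f} failures per 1000 hours")
for i, r in enumerate(separate_rates, start=1):
    print(f"pump {i:2d}: {r:.3f}")
```

The pooled estimate is about 0.21, while the separate estimates range from about 0.05 to about 2.1, several of them based on a single failure. A hierarchical model sits between these two extremes.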

Motivation for hierarchical models

- A natural idea is to assume that the $(\alpha_i, \beta_i)$'s or $\lambda_i$'s are samples from a common population distribution.
- The distribution of the observed outcomes is conditional on parameters which themselves have a probability specification. Such a model is known as a hierarchical or multilevel model.
- The new parameters introduced to govern the population distribution of the parameters are called hyperparameters.
- Thus, we need to estimate the parameters governing the population distribution of $(\alpha_i, \beta_i)$ rather than each $(\alpha_i, \beta_i)$ separately.

Bayesian approach to hierarchical models

Model specification
- Specify the sampling distribution of the data: $p(y \mid \theta)$.
- Specify the population distribution of $\theta$: $p(\theta \mid \phi)$, where $\phi$ is the hyperparameter.

Bayesian estimation
- Specify the prior for the hyperparameter: $p(\phi)$. Many levels are possible. The hyperprior distribution at the highest level is often chosen to be non-informative.
- Consider the above model specification: $p(y \mid \theta)$ and $p(\theta \mid \phi)$.
- Find the joint posterior distribution of the parameter $\theta$ and the hyperparameter $\phi$:
  \[ p(\theta, \phi \mid y) \propto p(\theta, \phi)\, p(y \mid \theta, \phi) = p(\theta, \phi)\, p(y \mid \theta) = p(\phi)\, p(\theta \mid \phi)\, p(y \mid \theta) \]
- Point and credible interval estimates for $\phi$ and $\theta$.
- Predictive distribution for $\tilde{y}$.

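Analytical derivation of conditional/marginal distributions

- Write out the joint posterior distribution:
  \[ p(\theta, \phi \mid y) \propto p(\phi)\, p(\theta \mid \phi)\, p(y \mid \theta) \]
- Determine analytically the conditional posterior density of $\theta$ given $\phi$: $p(\theta \mid \phi, y)$.
- Obtain the marginal posterior distribution of $\phi$:
  \[ p(\phi \mid y) = \int p(\theta, \phi \mid y)\, d\theta \quad \text{or} \quad p(\phi \mid y) = \frac{p(\theta, \phi \mid y)}{p(\theta \mid \phi, y)} \]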

Simulations from the posterior distributions

1. Two steps to simulate a random draw from the joint posterior distribution of $\theta$ and $\phi$, $p(\theta, \phi \mid y)$:
   - Draw $\phi$ from its marginal posterior distribution, $p(\phi \mid y)$.
   - Draw the parameter $\theta$ from its conditional posterior, $p(\theta \mid \phi, y)$.
2. If desired, draw predictive values $\tilde{y}$ from the posterior predictive distribution given the drawn $\theta$.

Example: Rat tumors

Goal: Estimating the risk of tumor in a group of rats.

Data (number of rats that developed some kind of tumor, out of the number of rats in each experiment):

1. 70 historical experiments:

   0/20   0/20   0/20   0/20   0/20   0/20   0/20   0/19   0/19   0/19
   0/19   0/18   0/18   0/17   1/20   1/20   1/20   1/20   1/19   1/19
   1/18   1/18   2/25   2/24   2/23   2/20   2/20   2/20   2/20   2/20
   2/20   1/10   5/49   2/19   5/46   3/27   2/17   7/49   7/47   3/20
   3/20   2/13   9/48   10/50  4/20   4/20   4/20   4/20   4/20   4/20
   4/20   10/48  4/19   4/19   4/19   5/22   11/46  12/49  5/20   5/20
   6/23   5/19   6/22   6/20   6/20   6/20   16/52  15/47  15/46  9/24

2. Current experiment: 4/14

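Bayesian approach to hierarchical models

Model specification
- Sampling distribution of the data: $y_j \sim \text{Binomial}(n_j, \theta_j)$, $j = 1, 2, \ldots, 71$, where $n_j$ is the number of rats in experiment $j$.
- Population distribution of $\theta$: $\theta_j \sim \text{Beta}(\alpha, \beta)$, where $\alpha$ and $\beta$ are the hyperparameters.

Bayesian estimation
- Non-informative prior for the hyperparameters: $p(\alpha, \beta)$.
- Consider the above model specification: $p(\theta \mid \alpha, \beta)$.
- Find the joint posterior distribution of the parameter $\theta$ and the hyperparameters $\alpha$ and $\beta$:
  \[ p(\theta, \alpha, \beta \mid y) \propto p(\alpha, \beta)\, p(\theta \mid \alpha, \beta)\, p(y \mid \theta, \alpha, \beta) \]
  \[ \propto p(\alpha, \beta) \prod_{j=1}^{J} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \theta_j^{\alpha - 1} (1 - \theta_j)^{\beta - 1} \prod_{j=1}^{J} \theta_j^{y_j} (1 - \theta_j)^{n_j - y_j} \]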

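Analytical derivation of conditional/marginal distributions

- The joint posterior distribution:
  \[ p(\theta, \alpha, \beta \mid y) \propto p(\alpha, \beta) \prod_{j=1}^{J} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \theta_j^{\alpha - 1} (1 - \theta_j)^{\beta - 1} \prod_{j=1}^{J} \theta_j^{y_j} (1 - \theta_j)^{n_j - y_j} \]
- The conditional posterior density of $\theta$ given $\alpha$ and $\beta$:
  \[ p(\theta \mid \alpha, \beta, y) = \prod_{j=1}^{J} \frac{\Gamma(\alpha + \beta + n_j)}{\Gamma(\alpha + y_j)\Gamma(\beta + n_j - y_j)}\, \theta_j^{\alpha + y_j - 1} (1 - \theta_j)^{\beta + n_j - y_j - 1} \]
- The marginal posterior distribution of $\alpha$ and $\beta$:
  \[ p(\alpha, \beta \mid y) = \frac{p(\theta, \alpha, \beta \mid y)}{p(\theta \mid \alpha, \beta, y)} \propto p(\alpha, \beta) \prod_{j=1}^{J} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \frac{\Gamma(\alpha + y_j)\Gamma(\beta + n_j - y_j)}{\Gamma(\alpha + \beta + n_j)} \]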
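The identity $p(\alpha, \beta \mid y) = p(\theta, \alpha, \beta \mid y) / p(\theta \mid \alpha, \beta, y)$ can be checked numerically: the ratio must give the same value for any $\theta$. The sketch below does this for five experiments taken from the rat tumor data. The function names are illustrative, the test values of $(\alpha, \beta)$ are arbitrary, and the hyperprior $p(\alpha, \beta)$ is left out because it appears unchanged on both sides.

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import beta as beta_dist

# Five experiments from the rat tumor data: y_j tumors out of n_j rats.
y = np.array([0, 1, 2, 5, 4])
n = np.array([20, 20, 25, 49, 14])

def log_joint(theta, a, b):
    """log of prod_j Beta(theta_j | a, b) theta_j^y_j (1 - theta_j)^(n_j - y_j)."""
    log_pop = beta_dist.logpdf(theta, a, b).sum()
    log_lik = (y * np.log(theta) + (n - y) * np.log1p(-theta)).sum()
    return log_pop + log_lik

def log_conditional(theta, a, b):
    """log p(theta | a, b, y): independent Beta(a + y_j, b + n_j - y_j)."""
    return beta_dist.logpdf(theta, a + y, b + n - y).sum()

def log_marginal(a, b):
    """Closed form of log p(a, b | y), up to the hyperprior and a constant."""
    return (gammaln(a + b) - gammaln(a) - gammaln(b)
            + gammaln(a + y) + gammaln(b + n - y)
            - gammaln(a + b + n)).sum()

rng = np.random.default_rng(0)
a, b = 2.0, 10.0                     # arbitrary test values of (alpha, beta)
ratios = []
for _ in range(3):
    theta = rng.uniform(0.05, 0.5, size=y.size)   # arbitrary theta in (0, 1)
    ratios.append(log_joint(theta, a, b) - log_conditional(theta, a, b))
    print(f"{ratios[-1]:.6f}  vs  {log_marginal(a, b):.6f}")
```

Each printed pair should agree whatever $\theta$ is drawn, which is why the marginal posterior can be read off without integrating over $\theta$.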

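Choice of hyperprior distribution

Idea: set up a non-informative hyperprior distribution.

- $p\left(\text{logit}\left(\frac{\alpha}{\alpha+\beta}\right), \log(\alpha + \beta)\right) \propto 1$, where $\text{logit}\left(\frac{\alpha}{\alpha+\beta}\right) = \log\left(\frac{\alpha}{\beta}\right)$: NO GOOD because it leads to an improper posterior.
- $p\left(\frac{\alpha}{\alpha+\beta}, \alpha + \beta\right) \propto 1$ or $p(\alpha, \beta) \propto 1$: NO GOOD because the posterior density is not integrable in the limit (as $\alpha + \beta \to \infty$).
- $p\left(\frac{\alpha}{\alpha+\beta}, (\alpha + \beta)^{-1/2}\right) \propto 1 \iff p(\alpha, \beta) \propto (\alpha + \beta)^{-5/2} \iff p\left(\log\left(\frac{\alpha}{\beta}\right), \log(\alpha + \beta)\right) \propto \alpha\beta(\alpha + \beta)^{-5/2}$: OK because it leads to a proper posterior.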
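The equivalence between the three forms of the acceptable hyperprior follows from two changes of variables. A short derivation, writing $s = \alpha + \beta$, $m = \alpha/(\alpha+\beta)$, $w = (\alpha+\beta)^{-1/2}$, $u = \log(\alpha/\beta)$ and $v = \log(\alpha+\beta)$ (shorthand introduced here, not on the slide):

```latex
\begin{align*}
\left|\det\frac{\partial(m, w)}{\partial(\alpha, \beta)}\right|
  &= \left|\det\begin{pmatrix}
       \beta/s^{2} & -\alpha/s^{2} \\
       -\tfrac12 s^{-3/2} & -\tfrac12 s^{-3/2}
     \end{pmatrix}\right|
   = \tfrac12\, s^{-3/2}\,\frac{\alpha+\beta}{s^{2}}
   = \tfrac12\,(\alpha+\beta)^{-5/2}, \\
\text{so}\quad p(m, w) \propto 1
  &\iff p(\alpha, \beta) \propto (\alpha+\beta)^{-5/2}. \\[6pt]
\left|\det\frac{\partial(u, v)}{\partial(\alpha, \beta)}\right|
  &= \left|\det\begin{pmatrix}
       1/\alpha & -1/\beta \\
       1/s & 1/s
     \end{pmatrix}\right|
   = \frac{1}{s}\left(\frac{1}{\alpha}+\frac{1}{\beta}\right)
   = \frac{1}{\alpha\beta}, \\
\text{so}\quad p(u, v)
  &= p(\alpha, \beta)\,\alpha\beta
   \propto \alpha\beta\,(\alpha+\beta)^{-5/2}.
\end{align*}
```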

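Computing the marginal posterior of the hyperparameters

- Compute the relative (unnormalized) posterior density on a grid of values that covers the effective range of $(\alpha, \beta)$:
  $\left(\log\left(\frac{\alpha}{\beta}\right), \log(\alpha + \beta)\right) \in [-2.5, -1] \times [1.5, 3]$
  $\left(\log\left(\frac{\alpha}{\beta}\right), \log(\alpha + \beta)\right) \in [-2.3, -1.3] \times [1, 5]$
- Draw a contour plot of the marginal density of $\left(\log\left(\frac{\alpha}{\beta}\right), \log(\alpha + \beta)\right)$; contour lines are at $0.05, 0.15, \ldots, 0.95$ times the density at the mode.
- Normalize by approximating the posterior distribution as a step function over the grid and setting the total probability in the grid to 1.
- Compute the posterior moments based on the grid of $\left(\log\left(\frac{\alpha}{\beta}\right), \log(\alpha + \beta)\right)$. For example, $E(\alpha \mid y)$ is estimated by
  \[ \sum_{\log(\alpha/\beta),\ \log(\alpha+\beta)} \alpha\; p\left(\log\left(\tfrac{\alpha}{\beta}\right), \log(\alpha + \beta) \,\Big|\, y\right) \]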
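The grid computation can be sketched as follows, using the rat tumor data and the hyperprior $p(\alpha, \beta) \propto (\alpha+\beta)^{-5/2}$. The grid resolution (100 by 100), the function name `log_post` and the compact `(tumors, rats, repeats)` encoding of the data are choices made for this sketch, not part of the lecture.

```python
import numpy as np
from scipy.special import gammaln

# Rat tumor data as (tumors, rats, number of experiments with that result).
counts = [
    (0, 20, 7), (0, 19, 4), (0, 18, 2), (0, 17, 1), (1, 20, 4), (1, 19, 2),
    (1, 18, 2), (2, 25, 1), (2, 24, 1), (2, 23, 1), (2, 20, 6), (1, 10, 1),
    (5, 49, 1), (2, 19, 1), (5, 46, 1), (3, 27, 1), (2, 17, 1), (7, 49, 1),
    (7, 47, 1), (3, 20, 2), (2, 13, 1), (9, 48, 1), (10, 50, 1), (4, 20, 7),
    (10, 48, 1), (4, 19, 3), (5, 22, 1), (11, 46, 1), (12, 49, 1), (5, 20, 2),
    (6, 23, 1), (5, 19, 1), (6, 22, 1), (6, 20, 3), (16, 52, 1), (15, 47, 1),
    (15, 46, 1), (9, 24, 1),
    (4, 14, 1),   # current experiment
]
reps = [c[2] for c in counts]
y = np.repeat([c[0] for c in counts], reps)
n = np.repeat([c[1] for c in counts], reps)

def log_post(u, v):
    """Unnormalized log density of (u, v) = (log(a/b), log(a+b)) given y."""
    a = np.exp(u + v) / (1 + np.exp(u))
    b = np.exp(v) / (1 + np.exp(u))
    # Hyperprior on the (u, v) scale: a * b * (a + b)^(-5/2).
    log_prior = np.log(a) + np.log(b) - 2.5 * np.log(a + b)
    a_, b_ = a[..., None], b[..., None]   # trailing axis runs over experiments
    log_lik = (gammaln(a_ + b_) - gammaln(a_) - gammaln(b_)
               + gammaln(a_ + y) + gammaln(b_ + n - y)
               - gammaln(a_ + b_ + n)).sum(axis=-1)
    return log_prior + log_lik

u_grid = np.linspace(-2.3, -1.3, 100)
v_grid = np.linspace(1.0, 5.0, 100)
U, V = np.meshgrid(u_grid, v_grid, indexing="ij")

lp = log_post(U, V)
prob = np.exp(lp - lp.max())      # subtract the max to avoid underflow
prob /= prob.sum()                # step-function normalization over the grid

# Contour levels at 0.05, 0.15, ..., 0.95 times the density at the mode.
levels = np.arange(0.05, 1.0, 0.1) * prob.max()

A = np.exp(U + V) / (1 + np.exp(U))
B = np.exp(V) / (1 + np.exp(U))
E_alpha, E_beta = (A * prob).sum(), (B * prob).sum()
print(f"E(alpha | y) ~ {E_alpha:.2f}, E(beta | y) ~ {E_beta:.2f}")
```

The `levels` array can be passed to a contour-plotting routine together with `prob` to reproduce the contour plot described above.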

Sampling from the joint posterior

1. Simulate 1000 draws of $\left(\log\left(\frac{\alpha}{\beta}\right), \log(\alpha + \beta)\right)$ from their posterior distribution using the discrete-grid sampling procedure.
2. For $l = 1, \ldots, 1000$:
   - Transform the $l$-th draw of $\left(\log\left(\frac{\alpha}{\beta}\right), \log(\alpha + \beta)\right)$ to the scale of $(\alpha, \beta)$ to yield a draw of the hyperparameters from their marginal posterior distribution.
   - For each $j = 1, \ldots, J$, sample $\theta_j$ from its conditional posterior distribution $\theta_j \mid \alpha, \beta, y \sim \text{Beta}(\alpha + y_j, \beta + n_j - y_j)$.
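The two steps can be sketched as a small function that takes any normalized grid of probabilities over $(\log(\alpha/\beta), \log(\alpha+\beta))$. The function name `sample_joint`, its arguments, and the uniform jitter within each grid cell are assumptions of this sketch; the toy 3 by 3 grid in the usage part contains made-up probabilities and is not the rat tumor posterior.

```python
import numpy as np

def sample_joint(u_grid, v_grid, prob, y, n, n_draws=1000, seed=0):
    """Draw (alpha, beta, theta) from a grid approximation.

    prob[i, k] is the normalized probability of the grid cell centred at
    u_grid[i] = log(alpha/beta) and v_grid[k] = log(alpha+beta).
    """
    rng = np.random.default_rng(seed)
    # Step 1: discrete-grid sampling of (u, v) with probabilities prob.
    flat = rng.choice(prob.size, size=n_draws, p=prob.ravel())
    i, k = np.unravel_index(flat, prob.shape)
    # Optional jitter: spread each draw uniformly over its grid cell.
    du, dv = u_grid[1] - u_grid[0], v_grid[1] - v_grid[0]
    u = u_grid[i] + rng.uniform(-du / 2, du / 2, n_draws)
    v = v_grid[k] + rng.uniform(-dv / 2, dv / 2, n_draws)
    # Step 2a: transform back to the (alpha, beta) scale.
    alpha = np.exp(u + v) / (1 + np.exp(u))
    beta = np.exp(v) / (1 + np.exp(u))
    # Step 2b: theta_j | alpha, beta, y ~ Beta(alpha + y_j, beta + n_j - y_j).
    theta = rng.beta(alpha[:, None] + y, beta[:, None] + n - y)
    return alpha, beta, theta

# Toy usage with hypothetical grid probabilities and two experiments
# from the slide (0/20 and the current experiment 4/14).
u_grid = np.array([-2.0, -1.8, -1.6])
v_grid = np.array([2.5, 2.8, 3.1])
prob = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]], dtype=float)
prob /= prob.sum()
y = np.array([0, 4])
n = np.array([20, 14])

alpha, beta, theta = sample_joint(u_grid, v_grid, prob, y, n)
print("posterior means of theta:", theta.mean(axis=0))
```

Each row of `theta` is one joint draw; column-wise means and quantiles give point and interval estimates for each $\theta_j$.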

Displaying the results

- Plot the posterior means and 95% intervals for the $\theta_j$'s (Figure 5.4 on page 131).
- The rates $\theta_j$ are shrunk from their sample point estimates, $y_j / n_j$, towards the population distribution, whose mean is $\alpha / (\alpha + \beta)$.
- Experiments with fewer observations are shrunk more and have higher posterior variances.
- Note that posterior variability is higher in the full Bayesian analysis, reflecting posterior uncertainty in the hyperparameters.
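For fixed hyperparameters the shrinkage has a closed form: the posterior mean of $\theta_j$ is a weighted average of $y_j/n_j$ and $\alpha/(\alpha+\beta)$, with weight $n_j/(\alpha+\beta+n_j)$ on the data. The sketch below illustrates this for four experiments from the data. The values of `alpha` and `beta` are assumed purely for illustration, and fixing them ignores the extra uncertainty that the full Bayesian analysis accounts for.

```python
import numpy as np

alpha, beta = 2.4, 14.3           # assumed hyperparameter values (illustrative)
y = np.array([0, 2, 4, 16])       # tumors in experiments 0/20, 2/13, 4/14, 16/52
n = np.array([20, 13, 14, 52])    # rats in each experiment

raw = y / n                                    # sample estimates y_j / n_j
prior_mean = alpha / (alpha + beta)            # mean of Beta(alpha, beta)
post_mean = (alpha + y) / (alpha + beta + n)   # mean of Beta(alpha+y, beta+n-y)

# Weight on the sample estimate: small n_j means more shrinkage.
weight = n / (alpha + beta + n)

for r, p, w in zip(raw, post_mean, weight):
    print(f"raw {r:.3f} -> posterior mean {p:.3f} (weight on data {w:.2f})")
```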

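Hierarchical normal models (I)

Model specification
- Sampling distribution of the data: $y_{ij} \mid \theta_j \sim \text{Normal}(\theta_j, \sigma^2)$, $i = 1, \ldots, n_j$, $j = 1, 2, \ldots, J$, with $\sigma^2$ known.
- Population distribution of $\theta$: $\theta_j \sim \text{Normal}(\mu, \tau^2)$, where $\mu$ and $\tau$ are the hyperparameters. That is,
  \[ p(\theta_1, \ldots, \theta_J \mid \mu, \tau) = \prod_{j=1}^{J} N(\theta_j \mid \mu, \tau^2) \]
  \[ p(\theta_1, \ldots, \theta_J) = \int \prod_{j=1}^{J} \left[ N(\theta_j \mid \mu, \tau^2) \right] p(\mu, \tau)\, d(\mu, \tau) \]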

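Hierarchical normal models (II)

Bayesian estimation
- Non-informative prior for the hyperparameters: $p(\mu, \tau) = p(\mu \mid \tau)\, p(\tau) \propto p(\tau)$.
- Consider the above model specification: $p(\theta \mid \mu, \tau)$.
- Find the joint posterior distribution of the parameter $\theta$ and the hyperparameters $\mu$ and $\tau$:
  \[ p(\theta, \mu, \tau \mid y) \propto p(\mu, \tau)\, p(\theta \mid \mu, \tau)\, p(y \mid \theta) \propto p(\mu, \tau) \prod_{j=1}^{J} N(\theta_j \mid \mu, \tau^2) \prod_{j=1}^{J} N(\bar{y}_{.j} \mid \theta_j, \sigma^2 / n_j) \]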

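Conditional posterior of $\theta$ given $(\mu, \tau)$, $p(\theta \mid \mu, \tau, y)$

With the population distribution $\theta_j \mid \mu, \tau \sim \text{Normal}(\mu, \tau^2)$, the conditional posterior is

\[ \theta_j \mid \mu, \tau, y \sim \text{Normal}(\hat{\theta}_j, V_j), \]

where

\[ \hat{\theta}_j = \frac{\frac{n_j}{\sigma^2}\, \bar{y}_{.j} + \frac{1}{\tau^2}\, \mu}{\frac{n_j}{\sigma^2} + \frac{1}{\tau^2}}, \qquad V_j = \frac{1}{\frac{n_j}{\sigma^2} + \frac{1}{\tau^2}}. \]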
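The conditional posterior mean is a precision-weighted average of the group mean and the population mean. A minimal sketch follows; the function name `conditional_theta` is illustrative, and the group means, group sizes and parameter values are hypothetical numbers chosen only to show the behaviour.

```python
import numpy as np

def conditional_theta(ybar, n, sigma2, mu, tau):
    """Mean and variance of theta_j | mu, tau, y for every group j."""
    prec_data = n / sigma2         # precision of the group mean ybar_j
    prec_pop = 1.0 / tau**2        # precision of the population distribution
    v = 1.0 / (prec_data + prec_pop)
    theta_hat = v * (prec_data * ybar + prec_pop * mu)
    return theta_hat, v

# Hypothetical group means and group sizes, for illustration only.
ybar = np.array([10.0, 14.0, 20.0])
n = np.array([5, 20, 2])
mu, tau, sigma2 = 15.0, 2.0, 16.0

theta_hat, v = conditional_theta(ybar, n, sigma2, mu, tau)
for j in range(ybar.size):
    print(f"group {j + 1}: ybar = {ybar[j]:.1f}, "
          f"theta_hat = {theta_hat[j]:.2f}, V = {v[j]:.2f}")
```

Groups with small $n_j$ are pulled further towards $\mu$, and as $\tau$ grows the estimates approach the group means $\bar{y}_{.j}$.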

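Marginal posterior of $\mu$ and $\tau$, $p(\mu, \tau \mid y)$

\[ p(\mu, \tau \mid y) \propto p(\mu, \tau)\, p(y \mid \mu, \tau) \]

\[ \bar{y}_{.j} \mid \mu, \tau \sim \text{Normal}\left(\mu, \frac{\sigma^2}{n_j} + \tau^2\right) \]

Therefore,

\[ p(\mu, \tau \mid y) \propto p(\mu, \tau) \prod_{j=1}^{J} N\left(\bar{y}_{.j} \,\Big|\, \mu, \frac{\sigma^2}{n_j} + \tau^2\right) \]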
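The marginal distribution of $\bar{y}_{.j}$ comes from integrating $\theta_j$ out of the model. A short derivation, writing the two levels as sums of independent normal terms ($e_j$ and $\eta_j$ are notation introduced here):

```latex
\begin{align*}
\bar{y}_{.j} &= \theta_j + e_j, & e_j &\sim \text{Normal}\!\left(0, \tfrac{\sigma^2}{n_j}\right), \\
\theta_j &= \mu + \eta_j, & \eta_j &\sim \text{Normal}(0, \tau^2), \quad \eta_j \text{ independent of } e_j, \\
\bar{y}_{.j} &= \mu + \eta_j + e_j
  &\Longrightarrow\quad
  \bar{y}_{.j} \mid \mu, \tau &\sim \text{Normal}\!\left(\mu, \tfrac{\sigma^2}{n_j} + \tau^2\right).
\end{align*}
```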

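Posterior of $\mu$ given $\tau$, $p(\mu \mid \tau, y)$

\[ p(\mu, \tau \mid y) = p(\mu \mid \tau, y)\, p(\tau \mid y) \quad \Longrightarrow \quad p(\mu \mid \tau, y) = \frac{p(\mu, \tau \mid y)}{p(\tau \mid y)} \]

Therefore, $\mu \mid \tau, y \sim \text{Normal}(\hat{\mu}, V_\mu)$, where

\[ \hat{\mu} = \frac{\sum_{j=1}^{J} \frac{1}{\sigma^2 / n_j + \tau^2}\, \bar{y}_{.j}}{\sum_{j=1}^{J} \frac{1}{\sigma^2 / n_j + \tau^2}} \qquad \text{and} \qquad V_\mu^{-1} = \sum_{j=1}^{J} \frac{1}{\sigma^2 / n_j + \tau^2}. \]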

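Posterior distribution of $\tau$, $p(\tau \mid y)$

\[ p(\tau \mid y) = \frac{p(\mu, \tau \mid y)}{p(\mu \mid \tau, y)} \propto \frac{p(\tau) \prod_{j=1}^{J} N\left(\bar{y}_{.j} \mid \mu, \frac{\sigma^2}{n_j} + \tau^2\right)}{N(\mu \mid \hat{\mu}, V_\mu)} \]

This holds for any value of $\mu$; setting $\mu = \hat{\mu}$ gives

\[ p(\tau \mid y) \propto \frac{p(\tau) \prod_{j=1}^{J} N\left(\bar{y}_{.j} \mid \hat{\mu}, \frac{\sigma^2}{n_j} + \tau^2\right)}{N(\hat{\mu} \mid \hat{\mu}, V_\mu)} \propto p(\tau)\, V_\mu^{1/2} \prod_{j=1}^{J} \left(\frac{\sigma^2}{n_j} + \tau^2\right)^{-1/2} \exp\left(-\frac{(\bar{y}_{.j} - \hat{\mu})^2}{2\left(\frac{\sigma^2}{n_j} + \tau^2\right)}\right) \]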
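Putting the pieces together, one joint draw is obtained by sampling $\tau$ from $p(\tau \mid y)$ on a grid, then $\mu$ from $p(\mu \mid \tau, y)$, then each $\theta_j$ from $p(\theta_j \mid \mu, \tau, y)$. The sketch below assumes a flat hyperprior $p(\tau) \propto 1$, a grid for $\tau$ truncated at 20, and hypothetical group means and sizes; none of these choices come from the lecture, and the function names are illustrative.

```python
import numpy as np

def mu_given_tau(ybar, n, sigma2, tau):
    """mu_hat and V_mu of the normal posterior mu | tau, y."""
    w = 1.0 / (sigma2 / n + tau**2)
    v_mu = 1.0 / w.sum()
    return v_mu * (w * ybar).sum(), v_mu

def log_p_tau(ybar, n, sigma2, tau):
    """Unnormalized log p(tau | y), assuming the flat hyperprior p(tau) = const."""
    mu_hat, v_mu = mu_given_tau(ybar, n, sigma2, tau)
    s2 = sigma2 / n + tau**2
    return 0.5 * np.log(v_mu) - 0.5 * np.sum(np.log(s2) + (ybar - mu_hat)**2 / s2)

# Hypothetical group means, group sizes and known variance (illustration only).
ybar = np.array([10.0, 14.0, 20.0, 12.0, 17.0])
n = np.array([5, 20, 2, 10, 8])
sigma2 = 16.0

# Grid approximation to p(tau | y); the upper limit 20 truncates the tail.
tau_grid = np.linspace(0.01, 20.0, 2000)
log_p = np.array([log_p_tau(ybar, n, sigma2, t) for t in tau_grid])
p = np.exp(log_p - log_p.max())
p /= p.sum()

rng = np.random.default_rng(1)
n_draws = 1000
tau = rng.choice(tau_grid, size=n_draws, p=p)       # tau ~ p(tau | y)
mu = np.empty(n_draws)
theta = np.empty((n_draws, ybar.size))
for l in range(n_draws):
    mu_hat, v_mu = mu_given_tau(ybar, n, sigma2, tau[l])
    mu[l] = rng.normal(mu_hat, np.sqrt(v_mu))        # mu ~ p(mu | tau, y)
    v = 1.0 / (n / sigma2 + 1.0 / tau[l]**2)
    theta_hat = v * (n / sigma2 * ybar + mu[l] / tau[l]**2)
    theta[l] = rng.normal(theta_hat, np.sqrt(v))     # theta ~ p(theta | mu, tau, y)

print("posterior mean of mu:", mu.mean())
print("posterior means of theta:", theta.mean(axis=0))
```

A different hyperprior only changes `log_p_tau`, where $\log p(\tau)$ would be added to the returned value.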

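Prior distribution of $\tau$, $p(\tau)$

The hyperprior $p(\tau)$ enters the marginal posterior of $\tau$ as a multiplicative factor:

\[ p(\tau \mid y) = \frac{p(\mu, \tau \mid y)}{p(\mu \mid \tau, y)} \propto p(\tau)\, V_\mu^{1/2} \prod_{j=1}^{J} \left(\frac{\sigma^2}{n_j} + \tau^2\right)^{-1/2} \exp\left(-\frac{(\bar{y}_{.j} - \hat{\mu})^2}{2\left(\frac{\sigma^2}{n_j} + \tau^2\right)}\right) \]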