Modern Bayesian Statistics
Part III: High-dimensional modeling

Example 3: Sparse and time-varying covariance modeling

Hedibert Freitas Lopes¹
hedibert.org

13ª aMostra de Estatística, IME-USP, October 2018

¹Professor of Statistics and Econometrics at Insper, São Paulo.
Example 3: Sparse and time-varying covariance modeling

Consider the Gaussian linear model

y = Xβ + ε,  ε ∼ N(0, σ²I),

where y is an n-vector and X is an n × q design matrix.

Ridge regression (ℓ₂ penalty on β):

β̂_ridge = arg min_β { ‖y − Xβ‖² + λ‖β‖₂² },  λ ≥ 0,

leading to β̂_ridge = (X′X + λI)⁻¹X′y.

LASSO (ℓ₁ penalty on β):

β̂_lasso = arg min_β { ‖y − Xβ‖² + λ‖β‖₁ },  λ ≥ 0,

which can be solved by quadratic programming techniques such as coordinate gradient descent.
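A minimal NumPy sketch of both estimators, just to make the formulas concrete (the function names and the cyclic coordinate-descent update are illustrative, not the implementation used in the lectures):

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate (X'X + lambda*I)^{-1} X'y."""
    q = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(q), X.T @ y)

def soft_threshold(z, gamma):
    """Soft-thresholding operator arising in the lasso coordinate update."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Lasso via cyclic coordinate descent on ||y - X beta||^2 + lam*||beta||_1."""
    n, q = X.shape
    beta = np.zeros(q)
    col_ss = (X ** 2).sum(axis=0)                     # per-column sums of squares
    for _ in range(n_sweeps):
        for j in range(q):
            r_j = y - X @ beta + X[:, j] * beta[j]    # partial residual excluding x_j
            beta[j] = soft_threshold(X[:, j] @ r_j, lam / 2.0) / col_ss[j]
    return beta
```

With λ = 0 both reduce to ordinary least squares; increasing λ shrinks the ridge solution toward zero and sets more and more lasso coefficients exactly to zero.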
Bayesian regularization in linear regression problems

Hierarchical scale mixture of normals:

β | ψ ∼ N(0, ψ)  and  ψ | θ ∼ p(ψ).

Maximum a posteriori (MAP):

arg max_β { p(y | β, σ²) p(β | ψ) }.

A few cases:

Prior           p(ψ)                  p(β)
Bayesian Lasso  ψ ∼ E(λ²/2)           Laplace
Ridge           ψ ∼ IG(a, b)          Scaled-t
Normal-Gamma    ψ ∼ G(λ, 1/(2γ²))     below

For the Normal-Gamma prior,

p(β | λ, γ²) = |β|^{λ−1/2} K_{λ−1/2}(|β|/γ) / (√π 2^{λ−1/2} γ^{λ+1/2} Γ(λ)),

where Var(β | λ, γ²) = 2λγ² and the excess kurtosis is 3/λ.
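As a hedged illustration, the Normal-Gamma marginal density above can be evaluated with SciPy's modified Bessel function K; the quadrature checks below only use the formulas quoted on this slide (a sketch, not the authors' code):

```python
import numpy as np
from scipy.special import kv, gammaln

def normal_gamma_pdf(beta, lam, gamma):
    """p(beta | lam, gamma^2) = |beta|^{lam-1/2} K_{lam-1/2}(|beta|/gamma)
       / (sqrt(pi) * 2^{lam-1/2} * gamma^{lam+1/2} * Gamma(lam))."""
    b = np.maximum(np.abs(np.asarray(beta, dtype=float)), 1e-12)  # avoid log(0)
    log_const = -(0.5 * np.log(np.pi) + (lam - 0.5) * np.log(2.0)
                  + (lam + 0.5) * np.log(gamma) + gammaln(lam))
    return np.exp(log_const + (lam - 0.5) * np.log(b)) * kv(lam - 0.5, b / gamma)

# Crude quadrature checks: integrates to ~1 and Var(beta) ~ 2*lam*gamma^2
lam, gamma = 1.0, 1.0
grid = np.linspace(-15, 15, 60001)
dx = grid[1] - grid[0]
dens = normal_gamma_pdf(grid, lam, gamma)
print((dens * dx).sum(), ((grid ** 2) * dens * dx).sum())   # ~1.0 and ~2.0
```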
The Normal-Gamma prior
High mass close to zero and heavy tails

Figure: Normal-Gamma prior densities for λ = 0.1 (dotted), λ = 0.33 (dot-dashed) and λ = 1 (solid).
Spike-and-slab priors
Stochastic search variable selection (SSVS)

SSVS places independent mixture priors directly on the coefficients:

β | J ∼ (1 − J) N(0, τ²) + J N(0, c²τ²),

where N(0, τ²) is the spike and N(0, c²τ²) is the slab, with c > 1 large, τ > 0 small and J ∼ Ber(ω).

SMN representation:

β | ψ ∼ N(0, ψ)
ψ | J ∼ (1 − J) δ_{τ²}(·) + J δ_{c²τ²}(·).
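A small Monte Carlo sketch of the SSVS prior and its scale-mixture-of-normals (SMN) representation, under hypothetical values of ω, τ and c, just to show that the two formulations draw the same β:

```python
import numpy as np

rng = np.random.default_rng(1)
omega, tau, c = 0.3, 0.05, 10.0      # hypothetical: inclusion prob., spike sd, slab inflation
n = 100_000

# Direct mixture: beta | J ~ (1-J) N(0, tau^2) + J N(0, c^2 tau^2), with J ~ Ber(omega)
J = rng.binomial(1, omega, n)
beta_direct = rng.normal(0.0, np.where(J == 1, c * tau, tau))

# SMN representation: psi | J is a point mass at tau^2 (spike) or c^2 tau^2 (slab),
# and beta | psi ~ N(0, psi)
psi = np.where(J == 1, (c * tau) ** 2, tau ** 2)
beta_smn = rng.normal(0.0, np.sqrt(psi))

# Both have marginal variance (1-omega)*tau^2 + omega*c^2*tau^2
print(beta_direct.var(), beta_smn.var(), (1 - omega) * tau**2 + omega * (c * tau)**2)
```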
Spike-and-slab priors on the scale parameter ψ
Normal mixture of Inverse-Gammas (NMIG)

β | K ∼ N(0, Kτ²),
K | ω ∼ (1 − ω) δ_{υ₀}(·) + ω δ_{υ₁}(·),  υ₀/υ₁ ≪ 1,
τ² ∼ IG(a_τ, b_τ).   (1)

SMN representation:

β | ψ ∼ N(0, ψ)  and  ψ ∼ (1 − ω) IG(a_τ, υ₀b_τ) + ω IG(a_τ, υ₁b_τ).

The resulting marginal distribution of β is a two-component mixture of scaled Student's t distributions.
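A quick simulation check of the SMN representation of the NMIG prior, using the hyperparameters from the next figure; it compares the Monte Carlo distribution of β with the two-component mixture of scaled Student's t distributions implied above (a sketch under standard normal/inverse-gamma mixing results, not the authors' code):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
a_tau, b_tau, ups0, ups1, omega = 5.0, 50.0, 0.005, 1.0, 0.5
n = 200_000

# psi ~ (1-omega) IG(a_tau, ups0*b_tau) + omega IG(a_tau, ups1*b_tau), beta | psi ~ N(0, psi)
K = np.where(rng.random(n) < omega, ups1, ups0)
psi = stats.invgamma.rvs(a_tau, scale=K * b_tau, random_state=rng)
beta = rng.normal(0.0, np.sqrt(psi))

# Marginally each component is a scaled t with 2*a_tau degrees of freedom and
# scale sqrt(upsilon_i * b_tau / a_tau); compare CDFs at an arbitrary point.
x = 1.0
cdf_theory = (omega * stats.t.cdf(x / np.sqrt(ups1 * b_tau / a_tau), df=2 * a_tau)
              + (1 - omega) * stats.t.cdf(x / np.sqrt(ups0 * b_tau / a_tau), df=2 * a_tau))
print((beta <= x).mean(), cdf_theory)   # should be close
```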
Spike-and-slab priors on the scale parameter ψ

Figure: Conditional density of the hypervariance ψ for the NMIG mixture prior with υ₀ = 0.005, υ₁ = 1, a_τ = 5, b_τ = 50 and (a) ω = 0.5 (black line), (b) ω = 0.95 (red line). Since ω has a Uniform prior, (a) also corresponds to the marginal density of ψ. Observe that only the height of the density changes as ω is varied.
Spike-and-slab priors: summary

Each prior is characterized by its spike p(ψ | J = 0), its slab p(ψ | J = 1), the implied marginal p(β | ω), and a constant c:

SSVS: spike δ_{rQ}(·); slab δ_Q(·); marginal ω N(0, Q) + (1 − ω) N(0, rQ); c = 1.
NMIG: spike IG(ν, rQ); slab IG(ν, Q); marginal ω t_{2ν}(0, Q/ν) + (1 − ω) t_{2ν}(0, rQ/ν); c = 1/(ν − 1).
Mixture of Laplaces: spike E(1/(2rQ)); slab E(1/(2Q)); marginal ω Lap(√Q) + (1 − ω) Lap(√(rQ)); c = 2.
Mixture of Normal-Gammas: spike G(a, 1/(2rQ)); slab G(a, 1/(2Q)); marginal ω NG(β_j | a, Q) + (1 − ω) NG(β_j | a, rQ); c = 2a.
Laplace-t: spike E(1/(2rQ)); slab IG(ν, Q); marginal ω t_{2ν}(0, Q/ν) + (1 − ω) Lap(√(rQ)); c₁ = 2, c₂ = 1/(ν − 1).
Gaussian dynamic regression problems

Consider the univariate Gaussian dynamic linear model (DLM) expressed by

y_t = F_t′β_t + ν_t,  ν_t ∼ N(0, V_t),   (2)
β_t = G_t β_{t−1} + ω_t,  ω_t ∼ N(0, W_t),   (3)

where β_t is of length q and β₀ ∼ N(m₀, C₀).

Dynamic regression model: F_t = X_t and G_t = I_q.
Static regression model: W_t = 0 for all t.
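To make the recursions behind (2)-(3) concrete, here is a minimal Kalman filter sketch for the case of known variances and constant G and W (the variable names are mine; this is not the implementation used in the course):

```python
import numpy as np

def kalman_filter(y, F, G, V, W, m0, C0):
    """Forward filtering for y_t = F_t' beta_t + nu_t, beta_t = G beta_{t-1} + omega_t.
    y: (T,) observations; F: (T, q) with rows F_t; G, W: (q, q), held constant here;
    V: (T,) observation variances; m0, C0: prior mean and covariance of beta_0."""
    T, q = F.shape
    m, C = np.asarray(m0, float), np.asarray(C0, float)
    ms, Cs = np.zeros((T, q)), np.zeros((T, q, q))
    for t in range(T):
        a = G @ m                           # prior state mean at time t
        R = G @ C @ G.T + W                 # prior state covariance at time t
        f = F[t] @ a                        # one-step-ahead forecast mean
        Q = F[t] @ R @ F[t] + V[t]          # one-step-ahead forecast variance
        A = R @ F[t] / Q                    # Kalman gain
        m = a + A * (y[t] - f)              # filtered mean given y_{1:t}
        C = R - np.outer(A, F[t]) @ R       # filtered covariance given y_{1:t}
        ms[t], Cs[t] = m, C
    return ms, Cs
```

Setting F_t = X_t (the t-th row of the design matrix), G = I_q and W = 0 recovers recursive Bayesian updating of the static regression model.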
Shrinkage in dynamic regression problems

Two main obstacles:
1. Time-varying parameters (states), and
2. A large number of predictors q.

Two sources of sparsity:
1. Horizontal sparsity: β_{j,t} = 0 for all t, for some coefficients j.
2. Vertical sparsity: β_{j,t} = 0 for several j's at a given time t.

Illustration with q = 5 and T = 12:

     jan      feb      mar      apr      may      jun      jul      aug      sep      oct       nov
x_0  β_{1,1}  β_{1,2}  β_{1,3}  β_{1,4}  β_{1,5}  β_{1,6}  β_{1,7}  β_{1,8}  β_{1,9}  β_{1,10}  β_{1,11}
x_1  0        0        0        0        0        0        0        0        0        0         0
x_2  β_{3,1}  β_{3,2}  β_{3,3}  β_{3,4}  β_{3,5}  0        0        0        β_{3,9}  β_{3,10}  β_{3,11}
x_3  0        0        β_{4,3}  β_{4,4}  β_{4,5}  β_{4,6}  β_{4,7}  β_{4,8}  β_{4,9}  β_{4,10}  β_{4,11}
x_4  β_{5,1}  β_{5,2}  β_{5,3}  β_{5,4}  β_{5,5}  0        0        0        β_{5,9}  β_{5,10}  β_{5,11}
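A toy simulation in the spirit of the illustration above (all numerical values are hypothetical): predictor x_1 never enters (horizontal sparsity) and predictors x_2 and x_4 drop out for a few months (vertical sparsity):

```python
import numpy as np

rng = np.random.default_rng(3)
T, q = 12, 5
X = rng.normal(size=(T, q))

# Random-walk coefficients, then impose the two kinds of sparsity
beta = np.cumsum(rng.normal(scale=0.3, size=(T, q)), axis=0)
beta[:, 1] = 0.0                  # horizontal sparsity: x_1 never enters
beta[5:8, [2, 4]] = 0.0           # vertical sparsity: x_2 and x_4 drop out in jun-aug
y = np.einsum('tq,tq->t', X, beta) + rng.normal(scale=0.5, size=T)
```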
Our contribution: a dynamic spike-and-slab model

Our contribution is to define a spike-and-slab prior that not only shrinks time-varying coefficients in dynamic regression problems but also allows for dynamic variable selection.

We use a non-centered parametrization:

y_t = F_t′β_t + ν_t,  ν_t ∼ N(0, σ_t²),
β_t = G_t β_{t−1} + ω_t,  ω_t ∼ N(0, W_t),

where

β_t = (β_{1,t}/√ψ_{1,t}, ..., β_{q,t}/√ψ_{q,t})′,
G_t = diag(φ₁, ..., φ_q),
W_t = diag(1 − φ₁², ..., 1 − φ_q²),
F_t = (X_{1,t}√ψ_{1,t}, ..., X_{q,t}√ψ_{q,t}).
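A sketch of the non-centered construction under hypothetical values of φ_j and ψ_{j,t}: the standardized states follow stationary AR(1) dynamics with unit variance, and multiplying by √ψ_{j,t} inside F_t recovers the regression mean Σ_j X_{j,t} β_{j,t} with β_{j,t} = √ψ_{j,t} times the standardized state:

```python
import numpy as np

rng = np.random.default_rng(4)
T, q = 100, 3
X = rng.normal(size=(T, q))
phi = np.array([0.95, 0.90, 0.99])           # hypothetical AR(1) persistences
psi = rng.gamma(2.0, 0.5, size=(T, q))       # placeholder hypervariances psi_{j,t}

# Standardized (non-centered) states: stationary AR(1) with unit variance
beta_nc = np.zeros((T, q))
beta_nc[0] = rng.normal(size=q)
for t in range(1, T):
    beta_nc[t] = phi * beta_nc[t - 1] + rng.normal(scale=np.sqrt(1.0 - phi**2))

# F_t = (X_{1,t} sqrt(psi_{1,t}), ..., X_{q,t} sqrt(psi_{q,t})), so that
# F_t' beta_nc_t = sum_j X_{j,t} * (sqrt(psi_{j,t}) * beta_nc_{j,t}) = sum_j X_{j,t} beta_{j,t}
F = X * np.sqrt(psi)
y = np.einsum('tq,tq->t', F, beta_nc) + rng.normal(scale=0.5, size=T)
beta_implied = np.sqrt(psi) * beta_nc        # the coefficients on the original X
```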
Our contribution: a dynamic spike-and-slab model

To shrink the states β₁, ..., β_T, for each coefficient j we place independent priors on ψ_t = τ²K_t as

τ² ∼ p(τ² | θ)  (iid),
(K_t | K_{t−1} = υ_i) ∼ ω_{1,i} δ₁(·) + (1 − ω_{1,i}) δ_r(·)  (independently),
ω_{1,i} = Pr(K_t = 1 | K_{t−1} = υ_i),

where υ_i ∈ {r, 1}, Pr(K₁ = r) = Pr(K₁ = 1) = 1/2, r = Var_spike(β | θ)/Var_slab(β | θ) ≪ 1, and p(τ² | θ) is one of the priors from the previous table.

Markov switching process for K_t: K_t is a latent random variable that assumes one of two values (regimes), υ₀ = r or υ₁ = 1, depending only on the previous value K_{t−1} and the prior transition probabilities {ω_{0,0}, ω_{0,1}, ω_{1,0}, ω_{1,1}}.
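A short simulation of the two-state Markov switching process for K_t under hypothetical transition probabilities, showing how ψ_t = τ²K_t alternates between spike and slab regimes:

```python
import numpy as np

rng = np.random.default_rng(5)
T = 200
r = 0.005                  # spike/slab variance ratio, much smaller than 1
w11, w10 = 0.95, 0.05      # hypothetical P(K_t=1 | K_{t-1}=1) and P(K_t=1 | K_{t-1}=r)
tau2 = 1.0                 # a single draw of tau^2 from p(tau^2 | theta), held fixed here

K = np.empty(T)
K[0] = 1.0 if rng.random() < 0.5 else r            # P(K_1 = r) = P(K_1 = 1) = 1/2
for t in range(1, T):
    w1 = w11 if K[t - 1] == 1.0 else w10           # prob. of moving into the slab
    K[t] = 1.0 if rng.random() < w1 else r

psi = tau2 * K                                     # psi_t = tau^2 * K_t
print("fraction of time in the slab:", (K == 1.0).mean())
```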
Our contribution: directed acyclic graph

[Figure: directed acyclic graph of the dynamic spike-and-slab model.]