On the general understanding of the empirical Bayes method
Judith Rousseau (Paris Dauphine, Paris, France) and Botond Szabó (Budapest University of Technology and Economics, Budapest, Hungary)
ERCIM 2014, Pisa, 8 December 2014.
Table of contents
1 Introduction
2 General Theorem on EB
3 Examples: Gaussian white noise model; Nonparametric regression; Density function problem
4 Epilogue
Motivation
Applications: genetics, Clark & Swanson (2005); contextual region classification, Lazebnik et al. (2009); high-dimensional classification, Chen et al. (2008); robotics, Schauerte et al. (2013).
Although the empirical Bayes method is widely used in practice, it does not have a full theoretical underpinning.
Bayes vs Frequentist approach
Statistical model: consider a collection of distributions $P = \{P_\theta : \theta \in \Theta\}$.
Frequentist school: model $X^{(n)} \sim P_{\theta_0}$, $\theta_0 \in \Theta$; goal: try to recover $\theta_0$ via an estimator $\hat\theta(X^{(n)})$.
Bayesian school: model $\theta \sim \Pi$ (prior), $X^{(n)} \mid \theta \sim P_\theta$; goal: update our belief about $\theta$ via the posterior $\theta \mid X^{(n)}$.
Frequentist Bayes: investigate Bayesian techniques from a frequentist perspective, i.e. assume that there exists a true $\theta_0$ and investigate the behaviour of the posterior $\theta \mid X^{(n)}$.
Adaptive Bayes
Assume that we have a family of prior distributions indexed by a hyper-parameter $\lambda$: $\{\Pi_\lambda : \lambda \in \Lambda\}$.
Problem: in nonparametric models the posterior depends crucially on the prior, hence on the hyper-parameter.
Question: how to choose $\lambda$?
Fixed $\lambda$: without a strong prior belief this can be misleading.
Use the data to find $\lambda$, i.e. adaptive techniques:
- Hierarchical Bayes: endow $\lambda$ with a hyper-prior $\pi(\lambda)$.
- Empirical Bayes: estimate $\lambda$ from the data $X^{(n)}$.
Empirical Bayes method
EB method: use a frequentist estimator for the hyper-parameter $\lambda$.
Marginal likelihood empirical Bayes: plug the marginal maximum likelihood estimator
$\hat\lambda_n = \arg\max_{\lambda \in \Lambda} \int_\Theta e^{\ell_n(\theta)}\, \Pi_\lambda(d\theta)$,
where $\ell_n(\theta)$ is the log-likelihood, into the posterior:
$\Pi_{\hat\lambda_n}(\cdot \mid X^{(n)}) = \Pi_\lambda(\cdot \mid X^{(n)})\big|_{\lambda = \hat\lambda_n}$.
Mimics the HB method.
Other frequentist estimators for $\hat\lambda_n$: MM, MRE, ...
Widely used in the literature, BUT a full theoretical justification is missing.
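A toy illustration of the plug-in principle (not from the talk; all names and numbers below are hypothetical choices): for $X_1,\ldots,X_n$ iid $N(\theta,1)$ with prior $\theta \sim N(0,\lambda)$, the marginal likelihood depends on the data only through $\bar X \sim N(0, \lambda + 1/n)$, so the MMLE has the closed form $\hat\lambda_n = \max(\bar X^2 - 1/n, 0)$. The sketch compares a numerical maximization with this closed form and builds the plug-in posterior.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
n, theta_true = 200, 0.7
x = theta_true + rng.standard_normal(n)          # X_i ~ N(theta, 1)
xbar = x.mean()

def neg_log_marginal(lam):
    """-log marginal likelihood as a function of the prior variance lam.
    Marginally xbar ~ N(0, lam + 1/n); the remaining factors do not involve lam."""
    v = lam + 1.0 / n
    return 0.5 * (np.log(v) + xbar**2 / v)

# Marginal maximum likelihood estimator of the hyper-parameter (numeric vs closed form).
lam_hat = minimize_scalar(neg_log_marginal, bounds=(0.0, 10.0), method="bounded").x
lam_closed = max(xbar**2 - 1.0 / n, 0.0)

# Empirical Bayes plug-in posterior for theta: N(post_mean, post_var) with lam = lam_hat.
post_var = 1.0 / (n + 1.0 / lam_hat) if lam_hat > 0 else 0.0
post_mean = post_var * n * xbar

print(f"lambda_hat (numeric) = {lam_hat:.4f}, (closed form) = {lam_closed:.4f}")
print(f"EB plug-in posterior: N({post_mean:.3f}, {post_var:.5f})")
```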
Theoretical investigation
Frequentist analysis: consider a loss $L$ and a collection of nested sub-classes $\{\Theta_\beta : \beta \in B\}$.
Minimax risk: $r_{n,\beta} = \inf_{\hat\theta_n \in T_n} \sup_{\theta \in \Theta_\beta} E_\theta L(\hat\theta_n, \theta)$.
Do we have an adaptive contraction rate, i.e.
$\inf_{\theta_0 \in \Theta_\beta} E_{\theta_0} \Pi_{\hat\lambda_n}\big(\theta : L(\theta, \theta_0) \le M r_{n,\beta} \mid X^{(n)}\big) \to 1$
for all $\beta \in B$ and a large enough constant $M > 0$?
Literature:
Specific models: Florens & Simoni (2012), Knapik et al. (2012), Sz. et al. (2013), Serra & Krivobokova (2014).
Comparing EB and HB in parametric models: Petrone et al. (2014).
General nonparametric models, BUT only for well-behaved estimators $\hat\lambda_n$: Donnet et al. (2014).
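As a concrete point of reference (standard minimax theory, recalled here rather than stated on the slide): in the Gaussian white noise example below, with the $\ell_2$-norm loss $L(\theta', \theta) = \|\theta' - \theta\|_2$ and the classes $\Theta_\beta(M)$ defined there, the minimax rate is
$r_{n,\beta} \asymp n^{-\beta/(1+2\beta)}$,
so adaptation means that the EB posterior contracts at this rate simultaneously for every $\beta \in B$, without knowledge of $\beta$.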
The set of possible hyper-parameters $\lambda$
Determine the location of $\hat\lambda_n$: define $\varepsilon_n(\lambda) = \varepsilon_n(\lambda, \theta_0)$ such that
$\Pi_\lambda(\theta : \|\theta - \theta_0\|_2 \le K \varepsilon_n(\lambda)) = e^{-n \varepsilon_n^2(\lambda)}$,
for some $K > 0$ (specified later).
Let us denote by $m_n = \inf_{\beta \in B} r_{n,\beta}$ and assume that $m_n \ge 1/\sqrt{n}$. Define the set $\Lambda_n = \{\lambda : \varepsilon_n(\lambda) \ge m_n\}$ and let $\varepsilon_{n,0} = \min_\lambda\{\varepsilon_n(\lambda) : \lambda \in \Lambda_n\}$.
Finally, define the set of probable hyper-parameters
$\Lambda_0 = \{\lambda : \varepsilon_n(\lambda) \le M_n \varepsilon_{n,0}\} \cup \Lambda_n^c$,
for some $M_n$ tending to infinity. Our first goal is to show that $\hat\lambda_n \in \Lambda_0$.
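A worked toy computation (added for illustration, under an assumed small-ball behaviour rather than anything stated on the slide): suppose that for a given $\lambda$ the prior satisfies $-\log \Pi_\lambda(\theta : \|\theta - \theta_0\|_2 \le K\varepsilon) \asymp C_\lambda\, \varepsilon^{-1/\lambda}$, as happens for Gaussian priors whose regularity does not exceed that of $\theta_0$. Then $\varepsilon_n(\lambda)$ solves
$n \varepsilon_n^2(\lambda) \asymp C_\lambda\, \varepsilon_n(\lambda)^{-1/\lambda}$, hence $\varepsilon_n(\lambda) \asymp (C_\lambda / n)^{\lambda/(1+2\lambda)}$:
the less prior mass $\Pi_\lambda$ puts near $\theta_0$, the larger $\varepsilon_n(\lambda)$, and the good hyper-parameters collected in $\Lambda_0$ are essentially those that nearly minimise this quantity.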
Conditions
Following Donnet et al. (2014) we introduce some assumptions.
Entropy (hyper): discretize $\Lambda_0^c$ into $N_n$ balls $B(\lambda_i, u_n, \|\cdot\|)$ and assume that for some $w_n \le M_n$:
$\log N_n = o(w_n^2\, n \varepsilon_{n,0}^2)$.
Transformation: let $\psi_{\lambda,\lambda'} : \Theta \to \Theta$ be such that if $\theta \sim \Pi_\lambda(\cdot)$ then $\psi_{\lambda,\lambda'}(\theta) \sim \Pi_{\lambda'}(\cdot)$ for $\lambda, \lambda' \in \Lambda$, and introduce the notation
$dQ^\theta_{\lambda,n}(X^{(n)}) = \sup_{\|\lambda' - \lambda\| \le u_n} e^{\ell_n(\psi_{\lambda,\lambda'}(\theta))(X^{(n)})}\, d\mu(X^{(n)})$.
Boundedness: for all $\theta \in B(\theta_0, \varepsilon_n(\lambda), \|\cdot\|_2)$,
$Q^\theta_{\lambda,n}(X^n) \le e^{c\, n \varepsilon_n^2(\lambda)}$, with $c < 1$.
Sieve: for all $\lambda \in \Lambda_0^c$ assume that there exists $\Theta_n(\lambda)$ such that
$\int_{\Theta_n(\lambda)^c} Q^\theta_{\lambda,n}(X^n)\, \Pi_\lambda(d\theta) \le e^{-w_n^2 n \varepsilon_{n,0}^2}$.
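A concrete instance of the transformation condition (added for illustration, using the Gaussian white noise priors introduced later): between $\Pi_\alpha$ and $\Pi_{\alpha'}$ one can take the coordinatewise rescaling
$(\psi_{\alpha,\alpha'}(\theta))_i = i^{\alpha - \alpha'}\, \theta_i$,
since if $\theta_i \sim N(0, i^{-1-2\alpha})$ independently then $i^{\alpha-\alpha'}\theta_i \sim N(0, i^{-1-2\alpha'})$, i.e. $\psi_{\alpha,\alpha'}$ maps $\Pi_\alpha$ into $\Pi_{\alpha'}$ as required.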
Conditions II
Tests: for all $\lambda_i \in \Lambda_0^c$ and all $\theta \in \Theta_n(\lambda_i)$ there exist tests $\varphi_{n,i}(\theta)$ with
$E_{\theta_0}(\varphi_{n,i}(\theta)) \le e^{-c_1 n d^2(\theta, \theta_0)}$, $\sup_{d(\theta, \theta') \le \zeta d(\theta, \theta_0)} Q^{\theta'}_{\lambda_i,n}(1 - \varphi_{n,i}(\theta)) \le e^{-c_1 n d^2(\theta, \theta_0)}$,
with $0 < \zeta < 1$, and for some large enough $C > 0$
$\{\|\theta - \theta_0\|_2 > C \varepsilon_n(\lambda),\ \theta \in \Theta_n(\lambda)\} \subset \{d(\theta, \theta_0) \ge \varepsilon_n(\lambda),\ \theta \in \Theta_n(\lambda)\}$.
Entropy: for all $u \ge C \varepsilon_n(\lambda)$:
$\log N(\zeta u, \{u \le d(\theta, \theta_0) \le 2u\} \cap \Theta_n(\lambda), d(\cdot,\cdot)) \le c_1 n u^2 / 2$.
Local metric exchange: there exist $M_1, M_2 > 0$ and $\lambda_n \in \Lambda_0$ satisfying $\varepsilon_n(\lambda_n) \le M_1 \varepsilon_{n,0}$ such that
$\{\|\theta - \theta_0\|_2 \le \varepsilon_n(\lambda_n)\} \subset B_n(\theta_0, M_2 \varepsilon_n(\lambda_n), \{KL(p_\theta, p_{\theta_0}), V_{0,k}(p_\theta, p_{\theta_0})\})$.
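For orientation (an observation added here, not on the slide): in the Gaussian white noise model treated below the local metric exchange is essentially automatic, because the Kullback-Leibler divergence between the laws of the data is an exact multiple of the squared $\ell_2$ distance,
$KL\big(P^{(n)}_{\theta_0}, P^{(n)}_{\theta}\big) = \tfrac{n}{2}\, \|\theta - \theta_0\|_2^2$,
so $\ell_2$ balls around $\theta_0$ translate directly into Kullback-Leibler type neighbourhoods $B_n$.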
Main theorem
Theorem: assume that all the above conditions hold. Then for all $\theta_0 \in \Theta$, with $P_{\theta_0}$-probability tending to one, we have $\hat\lambda_n \in \Lambda_0$.
Our next goal is to give upper bounds for the EB contraction rates. Following Donnet et al. (2014), assume:
Uniform likelihood ratio:
$\sup_{\lambda \in \Lambda_0} \sup_{\theta \in B(\theta_0, \lambda)} P_{\theta_0}\big\{ \inf_{\|\lambda' - \lambda\| \le u_n} \ell_n(\psi_{\lambda,\lambda'}(\theta)) - \ell_n(\theta_0) \le -K_5\, n \varepsilon_{n,0}^2 \big\} = o\big(1 / N_n(u_n)\big)$.
Stronger entropy (hyper): $N_n(u_n) = o\big((n \varepsilon_{n,0}^2)^{k/2}\big)$.
Theorem: under the preceding conditions the empirical Bayes posterior distribution contracts around the truth at the rate $M_n \varepsilon_{n,0}$:
$\Pi_{\hat\lambda_n}\big(\theta : \|\theta - \theta_0\|_2 \le M_n \varepsilon_{n,0} \mid X^{(n)}\big) \xrightarrow{P_{\theta_0}} 1$.
Gaussian white noise model
Model: let us observe the sequence $X^{(n)} = (X_1, X_2, \ldots)$ satisfying
$X_i = \theta_{0,i} + (1/\sqrt{n})\, Z_i$, $i = 1, 2, \ldots$,
where $\theta_0 = (\theta_{0,1}, \theta_{0,2}, \ldots)$ is the unknown infinite-dimensional parameter and the $Z_i$ are iid standard normal random variables.
Sub-classes: $\Theta_\beta(M) = \{\theta \in \ell_2 : \theta_i^2 \le M i^{-1-2\beta}\}$.
Priors:
$\Pi_\alpha(\cdot) = \bigotimes_{i=1}^\infty N(0, i^{-1-2\alpha})$, see Knapik et al. (2012).
$\Pi_\tau(\cdot) = \bigotimes_{i=1}^\infty N(0, \tau^2 i^{-1-2\alpha})$, see Sz. et al. (2013).
$\Pi_N(\cdot) = \bigotimes_{i=1}^N g(\cdot)$, where $G_1 e^{-G_2 |t|^\alpha} \le g(t) \le G_3 e^{-G_4 |t|^\alpha}$, see Arbel et al. (2012) for HB.
$\Pi_\gamma(\cdot) = \bigotimes_{i=1}^\infty N(0, e^{-\gamma i})$, see Castillo et al. (2014) for HB.
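In this model the marginal likelihood is available in closed form (a standard conjugate computation, added here for concreteness): under $\Pi_\alpha$ the coordinates are independent with prior variances $i^{-1-2\alpha}$, and the log marginal likelihood of the data relative to the law at $\theta = 0$ equals
$\ell_n^{\mathrm{marg}}(\alpha) = \sum_{i=1}^\infty \Big( -\tfrac12 \log\big(1 + n\, i^{-1-2\alpha}\big) + \frac{n^2\, i^{-1-2\alpha}}{2\,(1 + n\, i^{-1-2\alpha})}\, X_i^2 \Big)$,
so the MMLE $\hat\alpha_n$ maximizes an explicit function of $\alpha$; the same computation with variances $\tau^2 i^{-1-2\alpha}$ gives the criterion for $\hat\tau_n$. (The series converges; in computations it is truncated at a high level.)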
GWN model with regularity hyper-parameter
Prior: $\Pi_\alpha(\cdot) = \bigotimes_{i=1}^\infty N(0, i^{-1-2\alpha})$.
All the conditions of our theorems are met.
Upper bound on $\varepsilon_n(\alpha)$ (for $\theta_0 \in \Theta_\beta$):
Concentration inequality, vd Vaart & v Zanten (2008):
$n \varepsilon_n^2(\alpha) = -\log \Pi_\alpha(\theta : \|\theta - \theta_0\|_2 \le K \varepsilon_n(\alpha)) \le \varphi_{\alpha,\theta_0}(K \varepsilon_n(\alpha)/2)$.
Centered small ball, Li & Shao (2001):
$-\log \Pi_\alpha(\theta : \|\theta\|_2 \le K \varepsilon_n(\alpha)/2) \lesssim (K/2)^{-1/\alpha}\, \varepsilon_n(\alpha)^{-1/\alpha}$.
RKHS term:
$\inf_{h \in \mathbb{H}^\alpha :\ \|h - \theta_0\|_2 \le \varepsilon_n(\alpha)} \|h\|_{\mathbb{H}^\alpha}^2 \le \sum_{i=1}^{C \varepsilon_n(\alpha)^{-1/\beta}} i^{1+2\alpha}\, \theta_{0,i}^2 \lesssim \varepsilon_n(\alpha)^{-\frac{1+2\alpha-2\beta}{\beta}}$.
Solution: $\varepsilon_n(\alpha) \asymp n^{-\frac{\alpha \wedge \beta}{1+2\alpha}}$.
EB rate: $\alpha_0 = \beta$, hence $\varepsilon_{n,0} = \varepsilon_n(\alpha_0) \asymp n^{-\frac{\beta}{1+2\beta}}$.
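A small simulation sketch of the marginal likelihood EB procedure in this model (the truth, grid and truncation level below are hypothetical choices, not from the talk): for a truth $\theta_{0,i} = i^{-1/2-\beta}$, which lies in $\Theta_\beta(M)$ for a suitable $M$, the estimator $\hat\alpha_n$ obtained by maximizing the explicit marginal log-likelihood given after the model description should settle near $\beta$, in line with $\alpha_0 = \beta$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta, I = 10_000, 1.0, 2_000          # sample size, true regularity, truncation level
i = np.arange(1, I + 1)
theta0 = i ** (-0.5 - beta)              # truth theta_{0,i} = i^(-1/2-beta), lies in Theta_beta(M)
x = theta0 + rng.standard_normal(I) / np.sqrt(n)   # sequence model X_i = theta_{0,i} + Z_i/sqrt(n)

def log_marginal(alpha):
    """Log marginal likelihood (relative to theta = 0) under Pi_alpha,
    i.e. prior variances s_i = i^(-1-2*alpha); see the display above."""
    s = i ** (-1.0 - 2.0 * alpha)
    return np.sum(-0.5 * np.log1p(n * s) + 0.5 * n**2 * s * x**2 / (1.0 + n * s))

alphas = np.linspace(0.1, 3.0, 300)                 # grid search over the hyper-parameter
alpha_hat = alphas[np.argmax([log_marginal(a) for a in alphas])]
print(f"beta = {beta}, alpha_hat = {alpha_hat:.2f}")  # alpha_hat is expected to be close to beta

# Plug-in EB posterior mean (conjugate): E[theta_i | X, alpha_hat] = n s_i X_i / (1 + n s_i).
s_hat = i ** (-1.0 - 2.0 * alpha_hat)
post_mean = n * s_hat * x / (1.0 + n * s_hat)
print(f"l2 error of EB posterior mean: {np.linalg.norm(post_mean - theta0):.4f}")
```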
GWN model with scaling hyper-parameter
Prior: $\Pi_\tau(\cdot) = \bigotimes_{i=1}^\infty N(0, \tau^2 i^{-1-2\alpha})$, with fixed $\alpha > 0$.
All the conditions of our theorems are met.
Upper bound on $\varepsilon_n(\tau)$ (for $\theta_0 \in \Theta_\beta$): similarly to the regularity hyper-parameter case,
$n \varepsilon_n^2(\tau) \lesssim -\log \Pi_\tau(\theta : \|\theta\|_2 \le K \varepsilon_n(\tau)/2) + \inf_{h \in \mathbb{H}^\tau :\ \|h - \theta_0\|_2 \le \varepsilon_n(\tau)} \|h\|_{\mathbb{H}^\tau}^2 \lesssim \tau^{1/\alpha}\, \varepsilon_n(\tau)^{-1/\alpha} + \tau^{-2} \sum_{i=1}^{\varepsilon_n(\tau)^{-1/\beta}} i^{2(\alpha-\beta)}$.
EB posterior rate:
$\varepsilon_n(\tau_0) \asymp n^{-\frac{\beta}{1+2\beta}}$ if $\beta < \alpha + 1/2$;
$\varepsilon_n(\tau_0) \asymp n^{-\frac{\beta}{1+2\beta}} (\log n)^{1/(1+2\beta)}$ if $\beta = \alpha + 1/2$;
$\varepsilon_n(\tau_0) \asymp n^{-\frac{1/2+\alpha}{2+2\alpha}}$ if $\beta > \alpha + 1/2$.
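A sketch of where the first and third regimes come from (added here; constants are ignored): for $\beta < \alpha + 1/2$ the sum grows like a power, $\sum_{i \le \varepsilon^{-1/\beta}} i^{2(\alpha-\beta)} \asymp \varepsilon^{-(1+2\alpha-2\beta)/\beta}$, and equating the two terms of the bound over $\tau$ gives
$\tau^{1/\alpha} \varepsilon^{-1/\alpha} \asymp \tau^{-2} \varepsilon^{-(1+2\alpha-2\beta)/\beta} \ \Rightarrow\ n\varepsilon^2 \asymp \varepsilon^{-1/\beta} \ \Rightarrow\ \varepsilon \asymp n^{-\beta/(1+2\beta)}$;
for $\beta > \alpha + 1/2$ the sum stays bounded, the same balancing gives $n\varepsilon^2 \asymp \varepsilon^{-2/(1+2\alpha)}$, and the rate saturates at $n^{-(1/2+\alpha)/(2+2\alpha)}$ no matter how smooth the truth is.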
Nonparametric regression model
Fixed design: assume that we observe $Y_1, Y_2, \ldots, Y_n$ satisfying
$Y_i = f_0(x_i) + Z_i$, $i = 1, 2, \ldots, n$,
where the $Z_i$ are iid standard Gaussian random variables and $x_i = i/n$.
Series decomposition: let us denote by $\theta_0 = (\theta_{0,1}, \theta_{0,2}, \ldots)$ the Fourier coefficients of the regression function $f_0 \in L_2(M)$:
$f_0(t) = \sum_{j=1}^\infty \theta_{0,j} \psi_j(t)$.
Prior: $\Pi_\alpha(\cdot) = \bigotimes_{i=1}^\infty N(0, i^{-1-2\alpha})$ on $\theta_0$.
EB posterior rate: for $\theta_0 \in S^\beta(M)$ we have $\varepsilon_n(\alpha_0) \asymp n^{-\beta/(1+2\beta)}$.
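A sketch of why this example behaves like the white noise model (a standard reduction, added for orientation, assuming the basis $(\psi_j)$ is orthonormal with respect to the empirical distribution of the design points): the empirical coefficients
$\tilde X_j = \frac{1}{n} \sum_{i=1}^n Y_i\, \psi_j(x_i) \approx \theta_{0,j} + \frac{1}{\sqrt{n}}\, \eta_j$, with $\eta_j$ approximately standard normal,
so the regression data carry essentially the same information about $\theta_0$ as the Gaussian sequence model, and the rate calculation for $\Pi_\alpha$ carries over.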
Density function
Model: let $X_1, X_2, \ldots, X_n$ be an iid sample from the density function $f_0$ on $[0,1]$. Assume that the density takes the form
$f_0(x) = \exp\big( \sum_{j=1}^\infty \theta_{0,j} \varphi_j(x) - c(\theta_0) \big)$, $\theta_0 \in \ell_2$,
where $(\varphi_j)$ is an orthonormal basis of $L_2([0,1])$.
Prior: log-linear priors, Rivoirard & Rousseau (2012):
$f_\theta(x) = \exp\big( \sum_{j=1}^\infty \theta_j \varphi_j(x) - c(\theta) \big)$, $\theta \in \ell_2$,
where the parameter $\theta \in \ell_2$ follows $\Pi_\tau(\cdot) = \bigotimes_{i=1}^\infty N(0, \tau^2 i^{-1-2\alpha})$ or $\Pi_N(\cdot) = \bigotimes_{i=1}^N g(\cdot)$.
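A minimal numerical sketch of the log-linear form (illustrative choices: a cosine basis, a finite truncation $J$, and quadrature for the normalizing constant $c(\theta)$; none of this is prescribed by the talk):

```python
import numpy as np
from scipy.integrate import quad

J = 5
theta = np.array([0.8, -0.3, 0.2, 0.1, -0.05])      # a truncated parameter theta in l2

def phi(j, x):
    """Orthonormal cosine basis of L2([0,1]) (one possible choice)."""
    return np.sqrt(2.0) * np.cos(np.pi * j * x)

def log_f_unnorm(x, th):
    return sum(th[j - 1] * phi(j, x) for j in range(1, J + 1))

# Normalizing constant c(theta) = log of the integral of exp(sum_j theta_j phi_j(x)) over [0,1].
c_theta = np.log(quad(lambda x: np.exp(log_f_unnorm(x, theta)), 0.0, 1.0)[0])

def f_theta(x):
    """Log-linear density f_theta(x) = exp(sum_j theta_j phi_j(x) - c(theta))."""
    return np.exp(log_f_unnorm(x, theta) - c_theta)

# Sanity check: the density integrates to one on [0, 1].
print("integral of f_theta:", quad(f_theta, 0.0, 1.0)[0])
```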
Summary
We characterized the set $\Lambda_0$ to which the marginal likelihood estimator $\hat\lambda_n$ belongs (with probability tending to one).
We gave an upper bound on the EB contraction rate.
We investigated various examples: the Gaussian white noise model (reproducing multiple specific results from the literature), nonparametric regression, and the density function problem.
Future/Ongoing work
Extensions:
Consider other, more complex models.
Consider other metrics (at the moment we work with the $L_2$-norm).
Lower bounds on the contraction rates of the EB posterior.
Inverse problems.
Investigate the coverage properties of EB credible sets (under the polished tail assumption, see Sz. et al. (2014)).
Use the EB results on coverage to derive general theorems on hierarchical Bayes credible sets.