Tests and model choice: asymptotics
J. Rousseau — CEREMADE, Université Paris-Dauphine & Oxford University (soon)
Greek Stochastics, Milos
Outline
1 Bayesian testing: the Bayes factor — the Bayes factor; consistency of the Bayes factor and of model selection
2 Nonparametric tests — parametric vs nonparametric; construction of nonparametric priors; a special case: point null hypothesis; parametric null hypothesis
3 General conditions for consistency of $B_{0/1}$ in the non-iid case, for goodness-of-fit-type problems
4 The nonparametric/nonparametric case
5 Conclusion
Bayes factor (if $BF_{0/1}$ is large choose $H_0$, else $H_1$)
Model: $X \sim f(\cdot\mid\theta)$, $\theta\in\Theta$; $\theta\sim\Pi$.
Testing problem: $H_0:\ \theta\in\Theta_0$ vs $H_1:\ \theta\in\Theta_1$, with $\Theta_0\cap\Theta_1=\emptyset$ (or $\Pi[\Theta_0\cap\Theta_1]=0$).
Under the 0–1 loss function, the Bayes estimate is $\delta^\pi = 1$ iff $\Pi[\Theta_1\mid X] > \Pi[\Theta_0\mid X]$.
Posterior odds ratio:
$$\frac{\Pi[\Theta_0\mid X]}{\Pi[\Theta_1\mid X]} = \underbrace{\frac{\int_{\Theta_0} f(X\mid\theta)\,d\Pi_0(\theta)}{\int_{\Theta_1} f(X\mid\theta)\,d\Pi_1(\theta)}}_{m_0(X)/m_1(X)}\cdot\frac{\Pi[\Theta_0]}{\Pi[\Theta_1]},\qquad BF_{0/1} := \frac{m_0(X)}{m_1(X)},\quad d\Pi_j(\theta) = \frac{1\!{\rm I}_{\Theta_j}(\theta)\,d\Pi(\theta)}{\Pi(\Theta_j)}.$$
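To fix ideas, here is a minimal numerical illustration of the Bayes factor as a ratio of marginal likelihoods, in a toy conjugate setup chosen for this note (the model, prior and variance values are assumptions, not from the slides):

```python
import numpy as np

def log_bf01(x, tau2=1.0):
    """log BF_{0/1} for the toy problem X_i ~ N(theta, 1) with point null
    H_0: theta = 0 vs H_1: theta ~ N(0, tau2). Both marginals reduce to
    Gaussian densities in xbar -- N(0, 1/n) under H_0, N(0, tau2 + 1/n)
    under H_1 -- and the common factor of the likelihood cancels."""
    n, xbar = len(x), np.mean(x)
    v0, v1 = 1.0 / n, tau2 + 1.0 / n
    return (-0.5 * np.log(v0) - xbar**2 / (2 * v0)) \
         - (-0.5 * np.log(v1) - xbar**2 / (2 * v1))

rng = np.random.default_rng(0)
print(log_bf01(rng.normal(0.0, 1.0, 200)))  # ~ +(1/2) log n under H_0
print(log_bf01(rng.normal(0.5, 1.0, 200)))  # large negative under H_1
```

Note how the two consistency regimes of the next slides already show up: logarithmic growth under $H_0$, linear decay under $H_1$.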
Model choice
Multiple models $M_j$: $f_{j,\theta_j}$, $\theta_j\in\Theta_j$.
Hierarchical prior: $d\pi(j,\theta_j) = \Pi(j)\,d\Pi_j(\theta_j)$.
Posterior:
$$d\pi(j,\theta_j\mid X) \propto \Pi(j)\,d\Pi_j(\theta_j)\,f_{j,\theta_j}(X),\qquad d\pi(j\mid X) \propto \Pi(j)\underbrace{\int_{\Theta_j} f_{j,\theta_j}(X)\,d\Pi_j(\theta_j)}_{m_j(X)}.$$
Under the 0–1 loss function, the Bayes estimate is $\delta^\pi(X) = \arg\max_j\{\Pi(M_j\mid X)\propto m_j(X)\,\Pi(M_j),\ j\in J\}$.
Consistency of the Bayes factor $BF_{0/1} = m_0(X)/m_1(X)$
Definition 1. The Bayes factor is said to be consistent iff
$$BF_{0/1}\to +\infty \ \text{in probability when } X\sim f_\theta,\ \theta\in H_0,\qquad BF_{0/1}\to 0 \ \text{in probability when } X\sim f_\theta,\ \theta\in H_1.$$
Interpretation: as $n$ goes to infinity you are bound to make the correct decision.
Consistency of the model selection
True model: for a true parameter $\theta^*\in\cup_j\Theta_j$, set $M^* = \min\{j:\ \theta^*\in\Theta_j\}$ (models possibly nested, $\Theta_j\subset\Theta_{j'}$ for $j\le j'$); e.g. mixtures, variable selection, graph selection, etc.
Consistency: $\delta^\pi$ is consistent iff $P_{\theta^*}(\delta^\pi\ne M^*)\to 0$.
Whether for the BF or for a model selection procedure, consistency rests on the asymptotic behaviour of the marginal likelihoods $m_j(X)$.
BIC approximation of the marginal likelihood
The Bayes factor is a ratio of marginal likelihoods: it acts like a penalized likelihood ratio.
BIC approximation, in finite-dimensional and regular models, with $X = X^n = (X_1,\dots,X_n)$:
$$\log m_j(X^n) = \log f(X^n\mid\hat\theta_j) - \frac{d_j}{2}\log n + O_p(1),\qquad d_j = \dim(\Theta_j),\quad \hat\theta_j = \text{MLE under }\Theta_j.$$
The penalization is different in nonregular or infinite-dimensional models.
How does BIC work?
Write $\ell_n(\theta; M) = \log f(X^n\mid\theta; M)$. Then
$$m_j(X^n) = e^{\ell_n(\hat\theta_j; M_j)}\int_{\Theta_j} e^{\ell_n(\theta; M_j)-\ell_n(\hat\theta_j; M_j)}\,\pi_j(\theta)\,d\theta,\qquad \hat\theta_j = \text{MLE in }\Theta_j.$$
Local asymptotic normality + consistency:
$$\ell_n(\theta)-\ell_n(\hat\theta_j) = -\frac{n(\theta-\hat\theta_j)^t I(\hat\theta_j)(\theta-\hat\theta_j)}{2}(1+o_p(1)),\qquad \Pi_j\big(\|\theta-\hat\theta_j\|>\epsilon\mid X^n\big) = o_p(1),$$
together with a positive, continuous and finite prior density: $0 < \inf_{\|\theta-\hat\theta_j\|<\epsilon}\pi_j(\theta) \le \sup_{\|\theta-\hat\theta_j\|<\epsilon}\pi_j(\theta) < +\infty$.
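A quick numerical sanity check of the approximation in a regular one-dimensional model (the model and prior below are illustrative assumptions): the exact log marginal likelihood, computed by quadrature, stays within $O_p(1)$ of $\ell_n(\hat\theta)-\tfrac12\log n$.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Numerical check of log m(X^n) = l_n(hat theta) - (d/2) log n + O_p(1)
# in a regular 1-d model: X_i ~ N(theta, 1), prior theta ~ Cauchy(0, 1),
# so d = 1 and hat theta = xbar.
rng = np.random.default_rng(0)
for n in [50, 500, 5000]:
    x = rng.normal(0.3, 1.0, n)
    loglik = lambda t: np.sum(stats.norm.logpdf(x, loc=t, scale=1.0))
    that = x.mean()
    lmax = loglik(that)
    # marginal likelihood by quadrature; rescale by e^{-lmax} for stability
    m, _ = quad(lambda t: np.exp(loglik(t) - lmax) * stats.cauchy.pdf(t),
                that - 1.0, that + 1.0)
    print(n, np.log(m) + lmax, lmax - 0.5 * np.log(n))  # differ by O_p(1)
```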
Extension to irregular models: a more general approach
True parameter $\theta^*$, $\Theta_j\subset\mathbb R^{d_j}$:
$$m_j(X^n) = e^{\ell_n(\theta_j^*)}\int_{\Theta_j} e^{\ell_n(\theta)-\ell_n(\theta_j^*)}\,\pi_j(\theta)\,d\theta,\qquad \theta_j^* = \text{KL-projection of }\theta^*\text{ onto }\Theta_j.$$
Deterministic approximation: $\ell_n(\theta)-\ell_n(\theta_j^*)\approx -n\,H(\theta_j^*,\theta)$, with $H(\theta_j^*,\theta) = E_{\theta^*}\big(\ell_n(\theta_j^*)-\ell_n(\theta)\big)/n$, so that
$$\log m_j(X^n) = \underbrace{\ell_n(\theta_j^*)}_{\text{model bias}} + \underbrace{\log\Pi_j\big(H(\theta_j^*,\theta)\le 1/n\big)}_{\text{prior penalty}} + o_p(\log n).$$
Understanding the prior penalty $\Pi_j(H(\theta_j^*,\theta)\le 1/n)$
Regular models: $H(\theta_j^*,\theta) = \frac{(\theta-\theta_j^*)^t I(\theta_j^*)(\theta-\theta_j^*)}{2}(1+o(1))$, so $\Pi_j(H(\theta_j^*,\theta)\le 1/n) \approx \pi_j(\theta_j^*)(C/n)^{d_j/2}$, i.e. a $-\frac{d_j}{2}\log n + O_p(1)$ penalty.
Irregular models: $H(\theta_j^*,\theta)$ is no longer the quadratic form $\frac{(\theta-\theta_j^*)^t I(\theta_j^*)(\theta-\theta_j^*)}{2}(1+o(1))$.
[We need to understand the geometry associated with $H(\theta_j^*,\theta)$.]
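For the regular case, the rate follows from a one-line volume computation — a sketch assuming the prior density is continuous and positive at $\theta_j^*$:
$$\Pi_j\big(H(\theta_j^*,\theta)\le 1/n\big) \approx \pi_j(\theta_j^*)\,\mathrm{Leb}\Big\{\theta:\ (\theta-\theta_j^*)^t I(\theta_j^*)(\theta-\theta_j^*)\le 2/n\Big\} = \pi_j(\theta_j^*)\,\frac{\pi^{d_j/2}}{\Gamma(d_j/2+1)}\,\frac{(2/n)^{d_j/2}}{|I(\theta_j^*)|^{1/2}},$$
so that $\log\Pi_j(H(\theta_j^*,\theta)\le 1/n) = -\frac{d_j}{2}\log n + O(1)$, recovering the BIC penalty of the previous slides.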
Example of mixture models: the overspecified case
$$f_\theta(x) = \sum_{j=1}^k p_j\,g_{\gamma_j}(x),\qquad \gamma_j\in\Gamma,\quad \sum_{j=1}^k p_j = 1.$$
If the true distribution is $f_{\theta^*}(x) = \sum_{j=1}^{k^*} p_j^*\,g_{\gamma_j^*}(x)$ with $k^* < k$, then the model is non-identifiable. E.g. for $k = 2$ and $k^* = 1$:
$$f_\theta(x) = p_1 g_{\gamma_1} + (1-p_1)g_{\gamma_2},\qquad f^*(x) = g_{\gamma^*},$$
and $f_\theta = f_{\theta^*}$ iff ($p_1 = 1$ and $\gamma_1 = \gamma^*$) or $\gamma_1 = \gamma_2 = \gamma^*$:
$$\Theta^* = \{\theta:\ f_\theta = f_{\theta^*}\}\supsetneq\{\theta^*\},\quad\text{even up to a permutation.}$$
Understanding $H(\theta^*,\theta)$ in mixtures
$f_{\theta^*}(x) = \sum_{j=1}^{k^*} p_j^* g_{\gamma_j^*}(x)$, $M_k = \{f_\theta(x) = \sum_{j=1}^k p_j g_{\gamma_j}(x),\ \gamma_j\in\Gamma\}$, $k^* < k$, $\theta_k^* = \theta^*$.
$H(\theta^*,\theta)$ varies with the configuration $t = (t_1,\dots,t_{k^*},t_{k^*+1})$, a partition of $\{1,\dots,k\}$ with
$$t_j = \{l:\ \|\gamma_l-\gamma_j^*\|\le\epsilon\},\ j\le k^*;\qquad t_{k^*+1} = \{l:\ \min_{j\le k^*}\|\gamma_l-\gamma_j^*\|>\epsilon\}.$$
On $t$, with $p(j) = \sum_{l\in t_j} p_l$ and $q_l = p_l/p(j)$ for $l\in t_j$,
$$H(\theta^*,\theta)\asymp \sum_{j=1}^{k^*}\Big[\Big\|\sum_{l\in t_j} q_l(\gamma_l-\gamma_j^*)\Big\| + |p(j)-p_j^*| + \sum_{l\in t_j} p_l\|\gamma_l-\gamma_j^*\|^2\Big] + \sum_{l\in t_{k^*+1}} p_l.$$
$\Pi_k(H(\theta^*,\theta)\le 1/n)$
Prior: $(p_1,\dots,p_k)\sim\mathcal D(\alpha,\dots,\alpha)$, $\gamma_j$ iid $\sim\pi_\gamma$, $d = \dim(\gamma)$.
If $\alpha < d/2$: $\Pi_k(H(\theta^*,\theta)\le 1/n)\asymp n^{-(k^*d + k^* - 1 + (k-k^*)\alpha)/2}$ — the (prior-)sparsest configuration empties the extra states.
If $\alpha > d/2$: $\Pi_k(H(\theta^*,\theta)\le 1/n)\asymp n^{-(k^*d + k^* - 1 + (k-k^*)d/2)/2}$ — the (prior-)sparsest configuration merges the extra states.
In both cases the prior penalty exponent satisfies $D^* < \dim(\Theta_k)$.
Impact on consistency
$$BF_{k/k+1} = \frac{\int_{\Theta_k} f_\theta(X^n)\,\pi_k(\theta)\,d\theta}{\int_{\Theta_{k+1}} f_\theta(X^n)\,\pi_{k+1}(\theta)\,d\theta}.$$
If $\theta^*\in\Theta_{k+1}\setminus\Theta_k$ (regular models): $\log m_k(X^n) - \log m_{k+1}(X^n) = \ell_n(\theta_k^*)-\ell_n(\theta^*) + O_p(\log n)\approx -n\,KL(\theta^*,\theta_k^*)\to-\infty$.
If $\theta^*\in\Theta_k\subset\Theta_{k+1}$ (irregular model): with $D^* = \dim(\Theta_k)+\alpha$ if $\alpha < d/2$ and $D^* = \dim(\Theta_k)+d/2$ if $\alpha > d/2$,
$$\log BF_{k/k+1} = \frac{D^*-\dim(\Theta_k)}{2}\log n + o_p(\log n)\to +\infty.$$
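The $\alpha < d/2$ exponent is driven by how much Dirichlet prior mass sits near the face where an extra weight vanishes: $P(p_k\le\epsilon)\asymp\epsilon^\alpha$. A short Monte Carlo check of this tail behaviour (the values of $k$ and $\alpha$ are illustrative assumptions):

```python
import numpy as np

# Monte Carlo check of the Dirichlet tail behaviour behind the alpha < d/2
# case: under (p_1,...,p_k) ~ D(alpha,...,alpha), P(p_k <= eps) ~ C eps^alpha,
# so each emptied extra state costs a factor n^{-alpha/2} when eps ~ n^{-1/2}.
rng = np.random.default_rng(0)
k, alpha = 3, 0.3                          # illustrative values
p = rng.dirichlet([alpha] * k, size=1_000_000)
for eps in [1e-1, 1e-2, 1e-3]:
    mass = np.mean(p[:, -1] <= eps)
    # the ratio below should be roughly constant across eps
    print(f"eps={eps:.0e}: P(p_k <= eps) = {mass:.3e}, "
          f"mass / eps^alpha = {mass / eps**alpha:.3f}")
```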
Nonparametric cases: at least one of the models is nonparametric
Case 1: parametric vs nonparametric. $M_0:\ \theta_0\in\Theta\subset\mathbb R^d$; $M_1:\ \theta_1\in\Theta_1$, $\dim(\Theta_1) = +\infty$. E.g. goodness-of-fit tests, testing for linearity in regression, etc.
Case 2: nonparametric vs nonparametric. $M_0:\ \theta_0\in\Theta_0$, $M_1:\ \theta_1\in\Theta_1$, $\dim(\Theta_0) = \dim(\Theta_1) = +\infty$. E.g. testing for monotonicity, independence, two-sample tests, etc.
Case 3: high-dimensional model selection. $M_k:\ \theta\in\Theta_k\subset\mathbb R^{d_k}$, $k\in K$, $\mathrm{card}(K)$ large. E.g. regression models (variable selection).
Parametric vs NP
$X^n = (X_1,\dots,X_n)\sim f^n_\theta$. Test problem, with $\Theta_0\subset\mathbb R^d$ and $\dim(\Theta_1) = +\infty$:
$$H_0:\ f^n_\theta\in\mathcal F_0 = \{f^n_\theta,\ \theta\in\Theta_0\}\qquad\text{vs}\qquad H_1:\ f^n_\theta\in\mathcal F_1 = \{f^n_\theta,\ \theta\in\Theta_1\}.$$
Examples:
Goodness of fit in iid cases: $X_i$ iid $\sim f$, $\mathcal F_0 = \{f_\theta,\ \theta\in\Theta_0\}$, $\mathcal F_1 = \mathcal F_0^c$.
Linear/partially linear regression: $y_i = \alpha + \beta^t w_i + f(x_i) + \epsilon_i$, $\epsilon_i\sim N(0,\sigma^2)$; $H_0:\ f = 0$ vs $H_1:\ f\ne 0$.
Nonparametric priors on $\mathcal F_1$
$H_0:\ f\in\mathcal F_0 = \{f_\theta,\ \theta\in\Theta\}$, $\Theta\subset\mathbb R^d$, vs $H_1:\ f\in\mathcal F_1$; parameter space $\mathcal F = \mathcal F_0\cup\mathcal F_1$.
Nonparametric prior $\Pi_1$ on $\mathcal F_1$ (the prior under $H_1$): your favourite nonparametric prior, e.g. Gaussian processes, Lévy processes, mixtures. Check that $\Pi_1[\mathcal F_0] = 0$; e.g.
$$\mathcal F_0 = \{N(\mu,\sigma^2),\ (\mu,\sigma)\in\mathbb R\times\mathbb R^+\},\qquad \Pi_1:\ \sum_{j=1}^k p_j N(\mu_j,\sigma_j^2),\ k > 1.$$
Should we care for more? Yes: consistency.
Case of a point null hypothesis (iid observations)
$H_0:\ f = f_0$ vs $H_1:\ f\ne f_0$; prior on $\mathcal F_1 = \mathcal F\setminus\{f_0\}$. This corresponds to a prior on $\mathcal F$:
$$\Pi(f) = p_0\,\delta_{f_0}(f) + (1-p_0)\,\Pi_1(f),\qquad \Pi_1 \text{ a probability on }\mathcal F \text{ with } \Pi_1(\{f_0\}) = 0.$$
Dass & Lee: $BF_{0/1}$ is always consistent under $f_0$; and if $d\Pi_1(\cdot\mid X^n)$ is consistent at $f\in\mathcal F\setminus\{f_0\}$, then $BF_{0/1}$ is consistent under $f$.
Remark on Dass & Lee
Convergence under $f_0$ comes from Doob's theorem, because $\Pi[\{f_0\}] = p_0 > 0$ in $\Pi(f) = p_0\delta_{f_0} + (1-p_0)\Pi_1(f)$.
It is harder to detect $H_0$ than $H_1$: $f_0\in\bar{\mathcal F}_1$ can be approximated under $\mathcal F_1$, so typically
under $H_1$: $BF_{0/1}\asymp e^{-nc(f)}$ (exponentially small), while under $H_0$: $BF_{0/1}\asymp n^{q_0}$ (only polynomially large).
On posterior concentration
Posterior concentration at rate $\epsilon_n = o(1)$: prove that $E_{\theta^*}\big[\Pi(d(\theta,\theta^*) > M\epsilon_n\mid Y^n)\big] = o(1)$, with $\ell_n(\theta) = \log f_\theta(Y^n)$.
Conditions:
1. KL condition: $\pi\big[KL_n(\theta^*,\theta)\le n\epsilon_n^2,\ V_n(\theta^*,\theta)\le\kappa n\epsilon_n^2\big]\ge e^{-cn\epsilon_n^2}$, where $KL_n(\theta^*,\theta) = E_{\theta^*}(\ell_n(\theta^*)-\ell_n(\theta))$ and $V_n(\theta^*,\theta) = V_{\theta^*}(\ell_n(\theta^*)-\ell_n(\theta))$.
2. Testing condition: there exist $\Theta_n\subset\Theta$ with $\pi(\Theta_n^c)\le e^{-n(c+2)\epsilon_n^2}$ and tests $\phi_n\in[0,1]$ such that
$$E_{\theta^*}(\phi_n) = o(1),\qquad \sup_{d(\theta,\theta^*)>M\epsilon_n,\ \theta\in\Theta_n} E_\theta(1-\phi_n)\le e^{-n(c+2)\epsilon_n^2}.$$
Proof
$$\Pi(d(\theta,\theta^*)>\epsilon_n\mid Y^n) = \frac{\int_{B_n} e^{\ell_n(\theta)-\ell_n(\theta^*)}\,d\pi(\theta)}{\int_\Theta e^{\ell_n(\theta)-\ell_n(\theta^*)}\,d\pi(\theta)} := \frac{N_n}{D_n},\qquad B_n = \{\theta:\ d(\theta,\theta^*)>\epsilon_n\}.$$
$$E_{\theta^*}\big[\Pi(d(\theta,\theta^*)>\epsilon_n\mid Y^n)\big] = E_{\theta^*}\Big[\phi_n\frac{N_n}{D_n}\Big] + E_{\theta^*}\Big[(1-\phi_n)\frac{N_n}{D_n}\Big]$$
$$\le E_{\theta^*}[\phi_n] + P_{\theta^*}\big[D_n < e^{-n(c+3/2)\epsilon_n^2}\big] + e^{n(c+3/2)\epsilon_n^2}\,E_{\theta^*}\big(N_n(1-\phi_n)\big)$$
$$= E_{\theta^*}[\phi_n] + P_{\theta^*}\big[D_n < e^{-n(c+3/2)\epsilon_n^2}\big] + e^{n(c+3/2)\epsilon_n^2}\int_{B_n} E_\theta(1-\phi_n)\,d\pi(\theta)\quad\text{(Fubini)}$$
$$\le o(1) + P_{\theta^*}\big[D_n < e^{-n(c+3/2)\epsilon_n^2}\big] + e^{n(c+3/2)\epsilon_n^2}\big(e^{-n(c+2)\epsilon_n^2} + \Pi(\Theta_n^c)\big) = o(1) + P_{\theta^*}\big[D_n < e^{-n(c+3/2)\epsilon_n^2}\big].$$
Control of $P_{\theta^*}\big[D_n < e^{-n(c+3/2)\epsilon_n^2}\big]$: the KL condition
With $S_n = \{\theta:\ KL_n(\theta^*,\theta)\le n\epsilon_n^2,\ V_n(\theta^*,\theta)\le\kappa n\epsilon_n^2\}$,
$$D_n = \int_\Theta e^{\ell_n(\theta)-\ell_n(\theta^*)}\,d\pi(\theta) \ge \int_{S_n} 1\!{\rm I}_{\ell_n(\theta)-\ell_n(\theta^*)>-3n\epsilon_n^2/2}\,e^{\ell_n(\theta)-\ell_n(\theta^*)}\,d\pi(\theta)$$
$$\ge e^{-3n\epsilon_n^2/2}\,\pi\big(S_n\cap\{\ell_n(\theta)-\ell_n(\theta^*)>-3n\epsilon_n^2/2\}\big) = e^{-3n\epsilon_n^2/2}\Big(\pi(S_n) - \pi\big(S_n\cap\{\ell_n(\theta)-\ell_n(\theta^*)\le -3n\epsilon_n^2/2\}\big)\Big),$$
and, by Markov's inequality followed by Chebyshev's (on $S_n$, $E_{\theta^*}(\ell_n(\theta^*)-\ell_n(\theta))\le n\epsilon_n^2$),
$$P_{\theta^*}\Big[\pi\big(S_n\cap\{\ell_n(\theta)-\ell_n(\theta^*)\le -3n\epsilon_n^2/2\}\big) > \pi(S_n)/2\Big] \le \frac{2}{\pi(S_n)}\int_{S_n} P_{\theta^*}\big(\ell_n(\theta)-\ell_n(\theta^*)\le -3n\epsilon_n^2/2\big)\,d\pi(\theta) \lesssim \frac{1}{n\epsilon_n^2},$$
the last bound using the variance part of the KL condition.
What are these tests?
The tests $\phi_n$ depend on the metric $d(\theta,\theta^*)$.
Case of iid data with the weak topology, $U = \{f:\ |\int\varphi_j f(x)dx - \int\varphi_j f_0(x)dx|\le\epsilon,\ j = 1,\dots,J\}$: tests always exist.
$L_1$ metric, $d(f,f') = \int|f(x)-f'(x)|\,dx$:
$$\phi_n = \max_l \phi_{n,f_l},\qquad \phi_{n,f_l} = 1\!{\rm I}\Big\{\sum_{i=1}^n\big[1\!{\rm I}_{y_i\in A(f_l,f_0)} - P_{f_0}(A(f_l,f_0))\big] > n\delta\|f_l-f_0\|_1\Big\},\quad A(f_l,f_0) = \{y:\ f_l(y)>f_0(y)\}.$$
[Existence of such tests requires a bound on the $L_1$ covering number: $N(\cdot)\le e^{cn\epsilon_n^2}$.]
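A minimal sketch of one such test $\phi_{n,f_l}$, assuming univariate densities given as functions on a fixed grid; the tuning constant $\delta$ and the grid bounds are assumptions:

```python
import numpy as np

def l1_test(y, f0, fl, delta=0.25, lo=-10.0, hi=10.0, m=10_001):
    """Test phi_{n,f_l} from the slide: reject f0 in favour of f_l when the
    empirical frequency of A = {f_l > f0} exceeds its f0-expectation by
    delta * ||f_l - f0||_1. Integrals are Riemann sums on [lo, hi]."""
    grid = np.linspace(lo, hi, m)
    dx = grid[1] - grid[0]
    A = fl(grid) > f0(grid)
    l1 = np.sum(np.abs(fl(grid) - f0(grid))) * dx   # ||f_l - f0||_1
    p0A = np.sum(np.where(A, f0(grid), 0.0)) * dx   # P_{f0}(A)
    emp = np.mean(fl(y) > f0(y))                    # frequency of y_i in A
    return emp - p0A > delta * l1

# usage sketch: H_0 density N(0,1) against one alternative f_l = N(1,1)
rng = np.random.default_rng(1)
f0 = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
fl = lambda x: np.exp(-(x - 1)**2 / 2) / np.sqrt(2 * np.pi)
print(l1_test(rng.normal(0, 1, 500), f0, fl))  # False w.h.p. under f_0
print(l1_test(rng.normal(1, 1, 500), f0, fl))  # True  w.h.p. under f_l
```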
Summary of the first day: $BF_{0/1} = m_0(Y^n)/m_1(Y^n)$
Parametric case:
$$\log m_j(Y^n) = \ell_n(\theta_j^*) - \underbrace{\frac{D_j}{2}\log n}_{-\log\pi_j(d(\theta_j^*,\theta)\le 1/n)} + o_p(\log n).$$
Regular model: $D_j = \dim(\Theta_j)$. Irregular model: $D_j = D_j(\theta^*)\le\dim(\Theta_j)$.
Important for consistency: if $\theta^*\in\Theta_j\subset\Theta_{j+1}$, then $D_j(\theta^*) < D_{j+1}(\theta^*)$.
Summary 2: towards nonparametrics
Posterior concentration rates:
$$\pi(B_n\mid Y^n) = \frac{\int_{B_n} e^{\ell_n(\theta)-\ell_n(\theta^*)}\,d\pi(\theta)}{\int_\Theta e^{\ell_n(\theta)-\ell_n(\theta^*)}\,d\pi(\theta)} := \frac{N_n}{D_n}.$$
Lower bound on $D_n$: $P_{\theta^*}\big[D_n > e^{-n\epsilon_n^2(c+3/2)}\big] = 1+o(1)$ if $\pi\big(\{\theta:\ KL_n(\theta^*,\theta)\le n\epsilon_n^2,\ V_n(\theta^*,\theta)\le n\epsilon_n^2\}\big)\ge e^{-cn\epsilon_n^2}$.
Upper bound on $N_n = \int_{B_n} e^{\ell_n(\theta)-\ell_n(\theta^*)}\,d\pi(\theta)$: $P_{\theta^*}\big[N_n = o(e^{-n\epsilon_n^2(c+3/2)})\big] = 1+o(1)$ if either tests $\phi_n$ of $H_0:\ \theta = \theta^*$ vs $H_1:\ \theta\in B_n$ exist, or $\pi(B_n) = o(e^{-n(c+3/2)\epsilon_n^2})$.
General result under non-iid observations: sufficient conditions
Context: under $M_0$, $Y\sim f^n_{0,\theta_0}$, $\theta_0\in\Theta_0\subset\mathbb R^d$; under $M_1$, $Y\sim f^n_{1,\theta_1}$, $\theta_1\in\Theta_1$ (infinite-dimensional); $d_n(\cdot,\cdot)$ a (semi)metric on the distributions (e.g. total variation, KL, etc.).
$$\liminf_n\ \inf_{\theta_0\in\Theta_0} d_n(f_{1,\theta_1},f_{0,\theta_0}) := d(\theta_1,M_0),\qquad \liminf_n\ \inf_{\theta_1\in\Theta_1} d_n(f_{0,\theta_0},f_{1,\theta_1}) := d(\theta_0,M_1).$$
[Typically $M_0\subset M_1$.]
Conditions under $M_1$ (the bigger model), $\theta^* = \theta_1$: often easier
Bound $m_0(Y^n)$ from above and $m_1(Y^n)$ from below, where $m_j(Y^n) = \int_{\Theta_j} e^{\ell_n(j,\theta)-\ell_n(\theta^*)}\,d\pi_j(\theta)$; assume $d(\theta^*,M_0) > 0$ (the separated case).
If for all $\epsilon > 0$, $P_{\theta^*}\big[m_1(Y^n) < e^{-n\epsilon^2}\big] = o(1)$ [consistency under $M_1$], and if there exist $\epsilon > 0$, $a > 0$, tests $\phi_n$ and $\Theta_{0,n}\subset\Theta_0$ such that
$$E_{\theta^*}[\phi_n] = o(1),\qquad \sup_{\theta_0\in\Theta_{0,n}} E_{\theta_0}[1-\phi_n]\le e^{-an},\qquad \pi_0(\Theta_{0,n}^c) < e^{-n\epsilon}$$
[statistical separation between $\theta^*$ and $M_0$], then $B_{0/1} < e^{-an/2}$ with probability going to 1 under $P_{\theta^*}$.
Conditions under $M_0$ (the smaller model), $\theta^*\in M_0\cap M_1$
Case $d(\theta^*,M_1) = 0$, with $\Theta_0\subset\mathbb R^d$ (parametric).
If there exists $k_0 > 0$ such that
$$n^{k_0/2}\,\pi_0\big[\{\theta\in\Theta_0:\ KL(f^n_{0,\theta^*},f^n_{0,\theta})\le 1,\ V(f^n_{0,\theta^*},f^n_{0,\theta})\le 1\}\big]\ge C$$
for some positive constant $C$, and if (on the NP side, with $f_{\theta^*} = f_{0,\theta^*}$) there exists $\epsilon_n\to 0$ with $A_{\epsilon_n}(\theta^*) = \{\theta\in\Theta_1:\ d_n(f^n_{\theta^*},f^n_{1,\theta}) < \epsilon_n\}$ such that
1. $P^n_{\theta^*}\big[\pi_1\big(A^c_{\epsilon_n}(\theta^*)\mid Y\big)\big] = o(1)$, and 2. $n^{k_0/2}\,\pi_1\big[A_{\epsilon_n}(\theta^*)\big] = o(1)$,
then $B_{0/1}\to+\infty$ under $P_{\theta^*}$.
Application to partially linear regressions
$$y_i = \alpha + \beta^t w_i + f(x_i) + \epsilon_i,\quad \epsilon_i\sim N(0,\sigma^2),\qquad (w_i,x_i)\sim G\ \text{iid},\quad E[w] = 0,\ E[ww^t] > 0.$$
$M_0$: $(\alpha,\beta,f,\sigma)$ such that $H(f) := \inf_{(\alpha',\beta')\in\mathbb R^{d+1}} E\big[\big(\alpha' + (\beta')^t w - f(x)\big)^2\big] = 0$, i.e. $y_i = \alpha_1 + \beta_1^t w_i + \epsilon_i$;
$M_1$: $(\alpha,\beta,f,\sigma)$ such that $H(f) > 0$.
Prior
Hierarchical (adaptive) Gaussian process prior on $f$:
$$f = \tau\sum_{k=1}^K Z_k\,\xi_k,\qquad (\xi_k) \text{ an orthonormal basis},\quad K\sim P,\quad Z_k\sim N(0,1),\qquad (\alpha,\beta,\sigma)\sim\pi_0.$$
Let $h_j(x) = E[w_j\mid X = x]$, $w = (w_1,\dots,w_d)$. If there exists $j$ such that $\langle h_j,\xi_1\rangle\ne 0$, then $B_{0/1}$ is consistent under $M_0$, growing like $n^\rho$.
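A sketch of one draw from this hierarchical prior; the cosine basis on $[0,1]$, the Poisson prior on $K$ and the fixed $\tau$ are assumptions chosen for illustration:

```python
import numpy as np

def sample_prior_f(rng, lam=5, tau=1.0):
    """One draw of f = tau * sum_{k=1}^K Z_k xi_k from the hierarchical prior.
    Assumptions not fixed by the slide: K ~ 1 + Poisson(lam), (xi_k) the
    cosine orthonormal basis of L2[0,1], tau held fixed."""
    K = 1 + rng.poisson(lam)
    Z = rng.normal(size=K)
    def f(x):
        x = np.asarray(x, dtype=float)
        # xi_1 = 1 and xi_k(x) = sqrt(2) cos((k-1) pi x) for k >= 2
        basis = np.vstack([np.ones_like(x)] +
                          [np.sqrt(2.0) * np.cos(k * np.pi * x)
                           for k in range(1, K)])
        return tau * (Z @ basis)
    return f

rng = np.random.default_rng(2)
f = sample_prior_f(rng)
print(f(np.linspace(0.0, 1.0, 5)))  # one prior draw evaluated on a grid
```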
Nonparametric/nonparametric testing problem, $B_{0/1} = m_0(Y)/m_1(Y)$
Both $M_0$ and $M_1$ are nonparametric, and $M_0\subset M_1$.
Lower bound: not too hard, but is it sharp? $m_j(Y)\ge e^{-(c_j+2)n\epsilon_{n,j}^2}$ if $\pi_j\big(B_{KL}(\theta^*,n\epsilon_{n,j}^2)\big)\ge e^{-nc_j\epsilon_{n,j}^2}$.
Upper bound: harder. The tests part is fine, there are tools for it:
$$\int_{d(\theta^*,\theta)>\epsilon_{n,j}} e^{\ell_n(\theta)-\ell_n(\theta^*)}\,d\pi_j(\theta)\le e^{-c_jn\epsilon_{n,j}^2}.$$
Controlling $\int_{d(\theta^*,\theta)\le\epsilon_{n,j}} e^{\ell_n(\theta)-\ell_n(\theta^*)}\,d\pi_j(\theta)$ is the painful part: if $d(\theta^*,M_j) = 0$,
$$E_{\theta^*}[N_{n,j}] := E_{\theta^*}\Big[\int_{d(\theta^*,\theta)\le\epsilon_{n,j}} e^{\ell_n(\theta)-\ell_n(\theta^*)}\,d\pi_j(\theta)\Big] \stackrel{\text{Fubini}}{=} \pi_j\big(d(\theta^*,\theta)\le\epsilon_{n,j}\big)\ \stackrel{?}{\asymp}\ e^{-c_jn\epsilon_{n,j}^2}.$$
Setups where this approach works [Ghosal, Lember & van der Vaart; Rousseau & Szabó]
Multiple models $\Theta_\lambda$, $\lambda\in\Lambda$ (possibly continuous). Aim: $\hat\lambda = \arg\max\{\pi(\lambda\mid Y^n),\ \lambda\in\Lambda\}$, with $\pi(\lambda\mid Y^n)\propto\pi(\lambda)\,m_\lambda(Y^n)$. What is $\lambda^*$? Two contexts:
Almost-finite case: $\Lambda\subset\mathbb N$, $\Theta_\lambda\subset\Theta_{\lambda+1}$, and there exists $\lambda_0$ with $\theta^*\in\Theta_{\lambda_0}$; then $\lambda^* = \min\{\lambda:\ \theta^*\in\Theta_\lambda\}$.
Boundary case: $\theta^*\in\mathrm{Cl}(\cup_\lambda\Theta_\lambda)$ but $\theta^*\notin\Theta_\lambda$ for every $\lambda$.
Claim: $\lambda^*$ = oracle, i.e. $\lambda^* = \arg\min_\lambda\{\epsilon_{n,\lambda},\ \lambda\in\Lambda\}$, where $\epsilon_{n,\lambda}$ is defined by $\pi_\lambda\big(d(\theta^*,\theta)\le\epsilon_{n,\lambda}\big) = e^{-cn\epsilon_{n,\lambda}^2}$. [Requires sharp, BIC-like tools.]
Where does the oracle come from?
$$m_\lambda(Y^n) = \int_{\Theta_\lambda}\frac{f_{\theta,\lambda}(Y^n)}{f_{\theta^*}(Y^n)}\,d\pi_\lambda(\theta) \ge \int_{d(\theta,\theta^*)\le\epsilon_{n,\lambda}} e^{\ell_n(\theta)-\ell_n(\theta^*)}\,d\pi_\lambda(\theta) \ge e^{-c_1n\epsilon_{n,\lambda}^2}\,\pi_\lambda\big(d(\theta^*,\theta)\le\epsilon_{n,\lambda}\big),$$
and correspondingly $m_\lambda(Y^n)\le e^{c_2n\epsilon_{n,\lambda}^2}\,\pi_\lambda\big(d(\theta^*,\theta)\le\epsilon_{n,\lambda}\big) + e^{-nc_2\epsilon_{n,\lambda}^2}$.
Under stringent conditions, $\pi\big(\lambda:\ \epsilon_{n,\lambda}\le M\epsilon_{n,\lambda^*}\mid Y^n\big)\to 1$, so $\hat\lambda\in\{\lambda:\ \epsilon_{n,\lambda}\le M\epsilon_{n,\lambda^*}\}$.
[Only useful if $\lambda\ne\lambda^*$ implies $\epsilon_{n,\lambda}\gg\epsilon_{n,\lambda^*}$.]
NP priors on $M_0$ and $M_1$: testing for monotonicity
$Y = (y_1,\dots,y_n)$ iid $\sim f$, or $y_i = f(x_i)+\epsilon_i$.
Model $M_0$: $f$ decreasing. Model $M_1$: $f$ with no shape constraint.
First idea. Prior on $M_0$: mixtures of uniforms, $f_P(y) = \int_0^\infty \theta^{-1}\,1\!{\rm I}_{y\le\theta}\,dP(\theta)$, $P\sim DP(AG)$; prior on smooth densities, e.g. mixtures of betas, $f_P(y) = \int_0^\infty g_{z,z/\epsilon}(y)\,dP(\epsilon)$, $(z,P)\sim\pi_z\otimes DP(AG)$.
[Bad idea]
First solution
$y_i\in[0,1]$, and
$$f(x) = f_{\omega,k} := \sum_{j=1}^k k\,\omega_j\,1\!{\rm I}_{[(j-1)/k,\,j/k)}(x),\qquad \sum_{j=1}^k\omega_j = 1.$$
Prior: $k\sim P$; $(\omega_1,\dots,\omega_k)\sim\pi_k$, e.g. $\mathcal D(\alpha_1,\dots,\alpha_k)$.
Posterior probabilities: $\pi[M_0\mid Y] = \pi\big[\max_j(\omega_j-\omega_{j-1})\le 0\mid Y\big]$.
Problem: what if $f\in M_0$ but has flat parts?
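For iid data this prior is conjugate given $k$: the bin counts are multinomial, so the posterior of $\omega$ is Dirichlet and $\pi[M_0\mid Y]$ can be estimated by simulation. A sketch with $k$ fixed for simplicity (the slides put a prior on $k$):

```python
import numpy as np

def post_prob_decreasing(y, k=10, alpha=1.0, n_draws=20_000, rng=None):
    """Sketch of pi[M_0 | Y] for the histogram prior of this slide: given
    bin counts N_j, the posterior of (omega_1,...,omega_k) is
    Dirichlet(alpha + N_1, ..., alpha + N_k), and pi[M_0 | Y] is the
    posterior probability that the weights are non-increasing."""
    if rng is None:
        rng = np.random.default_rng(0)
    counts, _ = np.histogram(y, bins=k, range=(0.0, 1.0))
    draws = rng.dirichlet(alpha + counts, size=n_draws)
    return np.mean(np.all(np.diff(draws, axis=1) <= 0, axis=1))

rng = np.random.default_rng(3)
print(post_prob_decreasing(rng.beta(1.0, 3.0, 500)))  # decreasing f: high
print(post_prob_decreasing(rng.beta(3.0, 1.0, 500)))  # increasing f: near 0
```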
Result
Question: is $\pi[M_0\mid Y] = 1 + o_{P_f}(1)$ whenever $f\in M_0$? Answer: not always. Let
$$g(\delta) := \mathrm{Leb}\Big(\Big\{x:\ \inf_{y>x}\frac{f(x)-f(y)}{y-x}\le\delta\Big\}\Big)$$
measure the near-flat part of $f$.
For all $f_0$ such that $g(\delta) = 0$: if $P^\pi\big[k\le(n/\log n)^{1/5}\mid Y\big] = 1+o_p(1)$, then $P^\pi[M_0\mid Y] = 1+o_p(1)$; this yields a consistent test, e.g. if $k-3\sim\mathcal P(\lambda)$.
For all $f_0$ such that $g(\delta)\le\delta$: the same holds with $k-7\sim\mathcal P(\lambda)$.
Inconsistencies
If $f_0$ is constant on some $[a,b]\subset(0,1)$ and if $\pi\big(3/(b-a) < k < n/\log n\mid Y^n\big)\to 1$, then, asymptotically, with probability greater than $1/2$, $\pi(M_0\mid Y^n) < 1/2$.
[Inconsistency of the testing procedure]
Comments
Not satisfying: the procedure cannot be consistent over the whole set of decreasing functions, and enlarging that set by lowering $k$ leads to poor power.
Change of loss function: a new test based on $P^\pi\big[d(f,M_0)\le\epsilon\mid Y\big]$. But how to choose $\epsilon$?
Moving away (slightly) from the 0–1 loss function, for non-separated hypotheses [J.-B. Salomond]
New loss function:
$$L_\tau(\theta,\delta=0) = 1\!{\rm I}_{d(\theta,M_0)>\tau},\qquad L_\tau(\theta,\delta=1) = 1\!{\rm I}_{d(\theta,M_0)\le\tau},$$
equivalent to testing $H_0:\ d(\theta,M_0)\le\tau$ vs $H_1:\ d(\theta,M_0)>\tau$.
Bayes estimator:
$$\delta^\pi = 0\ \text{ if }\ \pi\big(d(\theta,M_0)\le\tau\mid Y^n\big) > 1/2,\qquad \delta^\pi = 1\ \text{ otherwise}.$$
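In the histogram setting of the previous slides, this estimator is one line once posterior draws of $\omega$ are available. A sketch, where the monotonicity-violation statistic used as $d(\theta,M_0)$ is an assumption (the slides leave the metric generic):

```python
import numpy as np

def salomond_test(draws, tau):
    """Sketch of the Bayes estimator delta^pi: draws is an (S, k) array of
    posterior samples of the histogram weights (omega_1,...,omega_k).
    As a surrogate for d(theta, M_0) we use the largest upward violation of
    monotonicity of the bin heights h_j = k * omega_j,
        d(omega) = max_{j < l} (h_l - h_j)_+ ,
    an assumption, not the slides' definition. Returns 1 (reject H_0)
    iff pi(d <= tau | Y) <= 1/2."""
    S, k = draws.shape
    heights = k * draws
    run_min = np.minimum.accumulate(heights, axis=1)  # min over j <= l
    d = np.max(heights - run_min, axis=1)             # per-draw violation
    return int(np.mean(d <= tau) <= 0.5)
```

This composes directly with the Dirichlet posterior draws from the monotonicity sketch above.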
Choice of $\tau$
If there is prior information on a tolerance level: $\tau$ chosen a priori. Otherwise, calibrate $\tau_n$ as the smallest value such that the type I error vanishes,
$$\sup_{\theta\in M_0} P_\theta\big[\pi\big(d(\theta,M_0)\le\tau_n\mid Y^n\big)\le 1/2\big] = o(1),$$
and study the separation rate $\rho_n$ such that
$$\sup_{\theta:\ d(\theta,M_0)>\rho_n} P_\theta\big[\pi\big(d(\theta,M_0)\le\tau_n\mid Y^n\big) > 1/2\big] = o(1)$$
[control of the type II error over $\rho_n$-separated hypotheses].
In quite a few examples, $\rho_n$ achieves the minimax separation rate.
Difficulties: $\tau_n$ is chosen asymptotically
The analysis is difficult in some cases: it is not always possible to determine $\tau_n$; e.g. for testing monotonicity this has only been done for simple families of priors, and the optimal $\tau_n = \tau_0 v_n$ with $v_n$ known but $\tau_0$ arbitrary.
How should $\tau_0$ be calibrated for finite $n$? Can we trust purely asymptotic results to calibrate a procedure?
Semiparametric case: $\theta = (\psi,\eta)$
$M_1:\ \psi = \psi_0$ vs $M_2:\ \psi\ne\psi_0$.
Cox hazard: $h_\theta(x\mid z) = e^{\psi^t z}\eta(x)$, $M_1:\ \psi = 0$.
Long-memory parameter: $M_1:\ h = 0$, with $Y\sim N(0,T_n(f_\theta))$, $f_\theta(x) = (1-\cos x)^{-h}\eta(x)$, $\eta > 0$ in $C^b$.
Saved by BvM: if $\pi\big(v_n(\psi-\hat\psi)\in A\mid Y\big)\to\Pr(N_d(0,1)\in A)$, then $B_{1/2}\asymp v_n^{d/2}\to+\infty$.
The two-sample test
$X^n = (X_1,\dots,X_n)$ iid $\sim F_X$, $Y^n = (Y_1,\dots,Y_n)$ iid $\sim F_Y$.
$$H_0:\ F_X = F_Y = F\qquad\text{vs}\qquad H_1:\ F_X\ne F_Y.$$
Prior: under $H_0$, $F\sim\pi$; under $H_1$, $F_X,F_Y\sim\pi$ independently.
$$B_{1/2} = \frac{m_{2n}(X^n,Y^n)}{m_n(X^n)\,m_n(Y^n)}\,;\qquad\text{consistency?}$$
There is hardly any Bayesian literature on the two-sample test: Holmes et al. [Pólya trees], Dunson & Lock [parametric], Chen & Hansen [censored data, Pólya trees].
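The structure of $B_{1/2}$ is easy to see in a toy version where the three marginal likelihoods are in closed form: bin both samples on a common grid and put a Dirichlet prior on the cell probabilities (a crude stand-in for the Pólya-tree priors cited above; $k$, $\alpha$ and the quantile binning are assumptions):

```python
import numpy as np
from scipy.special import gammaln

def log_marginal(counts, alpha=1.0):
    """Closed-form log marginal of an iid sample with given cell counts
    under a Dirichlet(alpha,...,alpha) prior on the k cell probabilities."""
    counts = np.asarray(counts, dtype=float)
    k, n = counts.size, counts.sum()
    return (gammaln(k * alpha) - gammaln(k * alpha + n)
            + np.sum(gammaln(alpha + counts) - gammaln(alpha)))

def log_B12(x, y, k=10, alpha=1.0):
    """log B_{1/2} = log m_{2n}(X,Y) - log m_n(X) - log m_n(Y), on a common
    quantile binning of the pooled sample."""
    edges = np.quantile(np.concatenate([x, y]), np.linspace(0, 1, k + 1))
    cx, _ = np.histogram(x, bins=edges)
    cy, _ = np.histogram(y, bins=edges)
    return (log_marginal(cx + cy, alpha)
            - log_marginal(cx, alpha) - log_marginal(cy, alpha))

rng = np.random.default_rng(4)
x = rng.normal(size=500)
print(log_B12(x, rng.normal(size=500)))       # > 0: evidence for H_0
print(log_B12(x, rng.normal(1.0, 1.0, 500)))  # < 0: evidence for H_1
```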
Other ways of getting away from 0–1 loss functions: distance-penalized losses [Robert & Rousseau; Rousseau]
Goodness-of-fit test: $Y^n = (y_1,\dots,y_n)$, $y_i$ iid $\sim f$; $H_0:\ f\in\mathcal F_0 = \{f_\theta,\ \theta\in\Theta\subset\mathbb R^d\}$ vs $H_1:\ f\notin\mathcal F_0$.
$$L(\theta,\delta) = \begin{cases}\big(d(\theta,M_0)-\epsilon\big)\,1\!{\rm I}_{d(\theta,M_0)>\epsilon} & \text{when }\delta = 0,\\[2pt] \big(\epsilon-d(\theta,M_0)\big)\,1\!{\rm I}_{d(\theta,M_0)\le\epsilon} & \text{when }\delta = 1.\end{cases}$$
Bayes test: $\delta^\pi = 1\iff E^\pi\big[d(\theta,M_0)\mid Y^n\big] > \epsilon$.
Calibration of $\epsilon$ using a conditional predictive p-value: $p(y^n) = \int_\Theta P_\theta\big(T(Y^n) > T(y^n)\mid\hat\theta,y^n\big)\,\pi_0(\theta)\,d\theta$.
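The Bayes test itself is just a thresholded posterior mean; a sketch, where the posterior density draws, the plug-in null fit and the $L_1$ distance on a grid are all illustrative assumptions:

```python
import numpy as np

def bayes_test_distance(post_dens, f0_hat, grid, eps):
    """Sketch of the distance-penalized Bayes test: post_dens is an (S, G)
    array of posterior density draws on `grid`, f0_hat a (G,) array with a
    plug-in fit f_{hat theta} from F_0 (used as a surrogate for the distance
    to M_0, an assumption). Returns delta^pi = 1 iff
    E^pi[d(theta, M_0) | Y^n] > eps, with d the L1 distance on the grid."""
    dx = grid[1] - grid[0]                     # uniform grid assumed
    d_draws = np.sum(np.abs(post_dens - f0_hat), axis=1) * dx
    return int(d_draws.mean() > eps)
```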
Other ways
There are many other ways...
Conclusion
Point null hypothesis: consistency of the BF follows from consistency of the posterior under $H_1$.
Parametric null: more involved; $\Pi_1$ can be more favourable to $f_{\theta_0}$ than $\Pi_0$.
Nonparametric null: one needs to be even more careful.
Are BIC or BIC-type formulas better than the BF?