Tests and model choice: asymptotics

J. Rousseau, CEREMADE, Université Paris-Dauphine & Oxford University (soon)
Greek Stochastics, Milos

Outline
1. Bayesian testing: the Bayes factor — Bayes factor; consistency of the Bayes factor or of the model selection
2. Nonparametric tests — parametric vs nonparametric; construction of nonparametric priors; a special case: point null hypothesis; parametric null hypothesis
3. General conditions for consistency of $B_{1/2}$ in the non-iid case, for goodness-of-fit type problems
4. Nonparametric/nonparametric case
5. Conclusion

Bayes factor
Model: $X \sim f(\cdot\mid\theta)$, $\theta\in\Theta$; prior $\theta\sim\Pi$.
Testing problem: $H_0:\theta\in\Theta_0$ vs $H_1:\theta\in\Theta_1$, with $\Theta_0\cap\Theta_1=\emptyset$ (or $\Pi[\Theta_0\cap\Theta_1]=0$).
Under the 0-1 loss function, the Bayes estimate is $\delta^\pi = 1$ iff $\Pi[\Theta_1\mid X] > \Pi[\Theta_0\mid X]$.
Posterior odds ratio:
$$\frac{\Pi[\Theta_0\mid X]}{\Pi[\Theta_1\mid X]} = \underbrace{\frac{\int_{\Theta_0} f(X\mid\theta)\,d\Pi_0(\theta)}{\int_{\Theta_1} f(X\mid\theta)\,d\Pi_1(\theta)}}_{m_0(X)/m_1(X)} \times \frac{\pi[\Theta_0]}{\pi[\Theta_1]},$$
$$BF_{0/1} := \frac{m_0(X)}{m_1(X)}, \qquad d\Pi_j(\theta) = \frac{\mathbf{1}_{\Theta_j}(\theta)\,d\Pi(\theta)}{\Pi(\Theta_j)}.$$
Decision rule: if $BF_{0/1}$ is large, choose $H_0$; else $H_1$.
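To fix ideas, here is a minimal numerical sketch (ours, not from the slides) of $BF_{0/1}$ for a point null on a normal mean, $H_0:\theta=0$ vs $H_1:\theta\sim N(0,1)$, with $m_1$ computed by quadrature; the model, prior, and all function names are assumptions for illustration.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

rng = np.random.default_rng(0)
n = 200
x = rng.normal(0.0, 1.0, size=n)      # data generated under H0: theta = 0

def log_marginal(x, prior_pdf):
    """log m(X) = log integral of f(X|theta) pi(theta), stabilised at the MLE."""
    m = x.mean()                       # MLE; the likelihood mass sits near it
    c = norm.logpdf(x, loc=m, scale=1.0).sum()
    integrand = lambda t: np.exp(norm.logpdf(x, loc=t, scale=1.0).sum() - c) * prior_pdf(t)
    val, _ = quad(integrand, m - 0.5, m + 0.5)   # +/- 0.5 covers ~7 posterior sds
    return np.log(val) + c

log_m0 = norm.logpdf(x, loc=0.0, scale=1.0).sum()          # point null: no integral
log_m1 = log_marginal(x, lambda t: norm.pdf(t, 0.0, 1.0))  # N(0,1) prior under H1

print(f"log BF_0/1 = {log_m0 - log_m1:.2f}")   # positive under H0, of order (1/2) log n
```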

Model choice
Multiple models: $M_j: f_{j,\theta_j}$, $\theta_j\in\Theta_j$. Hierarchical prior: $d\Pi(j,\theta_j) = \Pi(j)\,d\Pi_j(\theta_j)$.
Posterior:
$$d\Pi(j,\theta_j\mid X) \propto \Pi(j)\,d\Pi_j(\theta_j)\,f_{j,\theta_j}(X), \qquad d\Pi(j\mid X) \propto \Pi(j)\underbrace{\int_{\Theta_j} f_{j,\theta_j}(X)\,d\Pi_j(\theta_j)}_{m_j(X)}.$$
Under the 0-1 loss function, the Bayes estimate is $\delta^\pi(X) = \mathrm{argmax}\{\Pi(M_j\mid X) \propto m_j(X)\,\Pi(M_j),\ j\in J\}$.

Consistency of the Bayes factor, $BF_{0/1} = m_0(X)/m_1(X)$
Definition 1. The Bayes factor is said to be consistent iff $BF_{0/1}\to+\infty$ in probability when $X\sim f_\theta$ with $\theta\in H_0$, and $BF_{0/1}\to 0$ in probability when $X\sim f_\theta$ with $\theta\in H_1$.
Interpretation: as $n$ goes to infinity, you are bound to make the correct decision.

Consistency of the model selection
True parameter $\theta^*$; the true model is $M^* = \min\{j:\ \theta^*\in\Theta_j\}$, where the models $\Theta_j$ may be nested or overlapping — e.g. mixtures, variable selection, graph selection, etc.
Consistency: $\delta^\pi$ is consistent iff $P_{\theta^*}(\delta^\pi \neq M^*)\to 0$.
Whether for the Bayes factor or a model selection procedure, consistency rests on the asymptotic behaviour of the marginal likelihoods $m_j(X)$.

BIC approximation of the marginal likelihood
The Bayes factor is a ratio of marginal likelihoods: it acts like a penalized likelihood ratio. BIC approximation in finite-dimensional, regular models, with $X = X^n = (X_1,\dots,X_n)$:
$$\log m_j(X^n) = \log f(X^n\mid\hat\theta_j) - \frac{d_j}{2}\log n + O_p(1),$$
where $d_j = \dim(\Theta_j)$ and $\hat\theta_j$ is the MLE under $\Theta_j$. The penalization is different if the model is non-regular or infinite-dimensional.

How does BIC work?
Write $\ell_n(\theta; M) = \log f(X^n\mid\theta; M)$ and
$$m_j(X^n) = e^{\ell_n(\hat\theta_j; M_j)} \int_{\Theta_j} e^{\ell_n(\theta; M_j) - \ell_n(\hat\theta_j; M_j)}\,\pi_j(\theta)\,d\theta, \qquad \hat\theta_j = \text{MLE in }\Theta_j.$$
Local asymptotic normality + consistency:
$$\ell_n(\theta) - \ell_n(\hat\theta_j) = -\frac{n(\theta-\hat\theta_j)^t I(\hat\theta_j)(\theta-\hat\theta_j)}{2}(1+o_p(1)), \qquad \Pi_j\big(\|\theta-\hat\theta_j\|>\epsilon\mid X^n\big) = o_p(1),$$
together with a positive, continuous and finite prior density:
$$0 < \inf_{\|\theta-\hat\theta_j\|<\epsilon}\pi_j(\theta) \le \sup_{\|\theta-\hat\theta_j\|<\epsilon}\pi_j(\theta) < +\infty.$$
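As a quick sanity check of this expansion (our example, not from the slides): for a $N(\theta,1)$ model with a $N(0,10^2)$ prior on $\theta$, the exact log marginal likelihood, computed by quadrature, matches $\ell_n(\hat\theta) - \tfrac{d}{2}\log n$ up to an $O_p(1)$ term.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

rng = np.random.default_rng(1)
n = 500
x = rng.normal(0.3, 1.0, size=n)

theta_hat = x.mean()                                   # MLE, d = 1
loglik_hat = norm.logpdf(x, loc=theta_hat, scale=1.0).sum()

# Exact log m(X^n) under the N(0, 10^2) prior, by quadrature near the MLE.
integrand = lambda t: (np.exp(norm.logpdf(x, loc=t, scale=1.0).sum() - loglik_hat)
                       * norm.pdf(t, 0.0, 10.0))
val, _ = quad(integrand, theta_hat - 0.5, theta_hat + 0.5)
log_m_exact = np.log(val) + loglik_hat

# BIC-style approximation: log m ~ l_n(theta_hat) - (d/2) log n.
log_m_bic = loglik_hat - 0.5 * np.log(n)

print(f"exact: {log_m_exact:.2f}   BIC: {log_m_bic:.2f}")  # differ by an O_p(1) term
```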

Extension to irregular models: a more general approach
True parameter $\theta^*$, $\Theta_j\subset\mathbb{R}^{d_j}$,
$$m_j(X^n) = e^{\ell_n(\theta_j^*)}\int_{\Theta_j} e^{\ell_n(\theta)-\ell_n(\theta_j^*)}\,\pi_j(\theta)\,d\theta, \qquad \theta_j^* = \text{KL-projection of }\theta^*\text{ onto }\Theta_j.$$
Deterministic approximation: $\ell_n(\theta)-\ell_n(\theta_j^*)\approx -nH(\theta_j^*,\theta)$, with $H(\theta_j^*,\theta) = E_{\theta^*}\big(\ell_n(\theta_j^*)-\ell_n(\theta)\big)/n$, so that
$$\log m_j(X^n) = \underbrace{\ell_n(\theta_j^*)}_{\text{model bias}} + \underbrace{\log \Pi_j\big(H(\theta_j^*,\theta)\le 1/n\big)}_{\text{prior penalty}} + o_p(\log n).$$

Understanding the prior penalty $\Pi_j(H(\theta_j^*,\theta)\le 1/n)$
Regular models:
$$H(\theta_j^*,\theta) = \frac{(\theta-\theta_j^*)^t I(\theta_j^*)(\theta-\theta_j^*)}{2}(1+o(1)), \qquad \Pi_j\big(H(\theta_j^*,\theta)\le 1/n\big) \approx \pi_j(\theta_j^*)(C/n)^{d_j/2},$$
i.e. a $\frac{d_j}{2}\log n + O_p(1)$ penalty.
Irregular models: $H(\theta_j^*,\theta) \neq \frac{(\theta-\theta_j^*)^t I(\theta_j^*)(\theta-\theta_j^*)}{2}(1+o(1))$. [We need to understand the geometry associated with $H(\theta_j^*,\theta)$.]

Example of mixture models: the overspecified case
$$f_\theta(x) = \sum_{j=1}^k p_j g_{\gamma_j}(x), \qquad \gamma_j\in\Gamma,\quad \sum_{j=1}^k p_j = 1.$$
If the true distribution is $f_{\theta^*}(x) = \sum_{j=1}^{k^*} p_j^* g_{\gamma_j^*}(x)$ with $k^* < k$, then the model is non-identifiable. Take $k=2$ and $k^*=1$:
$$f_\theta(x) = p_1 g_{\gamma_1} + (1-p_1)g_{\gamma_2}, \qquad f^*(x) = g_{\gamma^*}.$$
Then $f_\theta = f_{\theta^*}$ iff ($p_1 = 1$, $\gamma_1=\gamma^*$) or $\gamma_1=\gamma_2=\gamma^*$, so $\Theta^* = \{\theta: f_\theta = f_{\theta^*}\} \supsetneq \{\theta^*\}$, even up to a permutation.

Understanding $H(\theta^*,\theta)$ in mixtures
$f_{\theta^*}(x) = \sum_{j=1}^{k^*} p_j^* g_{\gamma_j^*}(x)$, $M_k = \{f_\theta(x) = \sum_{j=1}^k p_j g_{\gamma_j}(x),\ \gamma_j\in\Gamma\}$, $k^* < k$, $\theta_k^* = \theta^*$.
How does $H(\theta^*,\theta)$ behave? It varies with the configuration $t = (t_1,\dots,t_{k^*},t_{k^*+1})$, a partition of $\{1,\dots,k\}$:
$$t_j = \{l:\ \|\gamma_l-\gamma_j^*\|\le\epsilon\},\ j\le k^*; \qquad t_{k^*+1} = \{l:\ \min_{j\le k^*}\|\gamma_l-\gamma_j^*\| > \epsilon\}.$$
On $t$, set $p_{(j)} = \sum_{l\in t_j} p_l$ and $q_l = p_l/p_{(j)}$ for $l\in t_j$; then
$$H(\theta^*,\theta) \asymp \sum_{j=1}^{k^*}\Big(\big\|\sum_{l\in t_j} q_l(\gamma_l-\gamma_j^*)\big\| + |p_{(j)}-p_j^*| + \sum_{l\in t_j} p_l\|\gamma_l-\gamma_j^*\|^2\Big) + \sum_{l\in t_{k^*+1}} p_l.$$

$\Pi_k(H(\theta^*,\theta)\le 1/n)$
Prior: $(p_1,\dots,p_k)\sim\mathcal{D}(\alpha,\dots,\alpha)$, $\gamma_j \overset{iid}{\sim} \pi_\gamma$, $d = \dim(\Gamma)$.
If $\alpha < d/2$:
$$\Pi_k\big(H(\theta^*,\theta)\le 1/n\big) \asymp n^{-(k^*d + k^* - 1 + (k-k^*)\alpha)/2}$$
[the (prior-)sparsest configuration empties the extra states].
If $\alpha > d/2$:
$$\Pi_k\big(H(\theta^*,\theta)\le 1/n\big) \asymp n^{-(k^*d + k^* - 1 + (k-k^*)d/2)/2}$$
[the (prior-)sparsest configuration merges the extra states].
In both cases the prior penalty satisfies $D^* < \dim(\Theta_k)$.
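The $(k-k^*)\alpha$ term reflects the Dirichlet cost of emptying one extra weight: if $(p_1,\dots,p_k)\sim\mathcal{D}(\alpha,\dots,\alpha)$ then $P(p_k\le\epsilon)\asymp\epsilon^\alpha$ as $\epsilon\to 0$. A quick Monte Carlo check of that scaling (our illustration; the values of $k$ and $\alpha$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
k, alpha = 3, 0.25
p = rng.dirichlet([alpha] * k, size=1_000_000)

# P(p_k <= eps) should scale like eps^alpha: the ratio below stabilises.
for eps in (0.1, 0.01, 0.001):
    mass = (p[:, -1] <= eps).mean()
    print(f"eps={eps:<6}  P(p_k<=eps)={mass:.3e}  ratio to eps^alpha={mass / eps**alpha:.3f}")
```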

Impact on consistency
$$BF_{k/k+1} = \frac{\int_{\Theta_k} f_\theta(X^n)\,\pi_k(\theta)\,d\theta}{\int_{\Theta_{k+1}} f_\theta(X^n)\,\pi_{k+1}(\theta)\,d\theta}.$$
If $\theta^*\in\Theta_{k+1}\setminus\Theta_k$ (regular models):
$$\log m_k(X^n) - \log m_{k+1}(X^n) = \ell_n(\theta_k^*) - \ell_n(\theta^*) + O_p(\log n) \approx -n\,KL(\theta^*,\theta_k^*) \to -\infty.$$
If $\theta^*\in\Theta_k\subset\Theta_{k+1}$ (irregular model): $D^* = \dim(\Theta_k)+\alpha$ if $\alpha < d/2$, and $D^* = \dim(\Theta_k)+d/2$ if $\alpha > d/2$. Then
$$\log BF_{k/k+1} = \frac{D^* - \dim(\Theta_k)}{2}\log n + o_p(\log n) \to +\infty.$$

Nonparametric cases: at least one of the models is nonparametric
Case 1, parametric vs nonparametric: $M_0:\theta_0\in\Theta\subset\mathbb{R}^d$, $M_1:\theta_1\in\Theta_1$, $\dim(\Theta_1)=+\infty$; e.g. goodness-of-fit tests, testing for linearity in regression, etc.
Case 2, nonparametric vs nonparametric: $M_0:\theta_0\in\Theta_0$, $M_1:\theta_1\in\Theta_1$, $\dim(\Theta_0)=\dim(\Theta_1)=+\infty$; e.g. testing for monotonicity, independence, two-sample tests, etc.
Case 3, high-dimensional model selection: $M_k:\theta\in\Theta_k\subset\mathbb{R}^{d_k}$, $k\in K$, $\mathrm{card}(K)$ large; e.g. regression models (variable selection).

Parametric vs nonparametric
$X^n = (X_1,\dots,X_n)\sim f_\theta^n$. Test problem, with $\Theta_0\subset\mathbb{R}^d$ and $\dim(\Theta_1)=+\infty$:
$$H_0: f_\theta^n\in\mathcal{F}_0 = \{f_\theta^n,\ \theta\in\Theta_0\}, \qquad H_1: f_\theta^n\in\mathcal{F}_1 = \{f_\theta^n,\ \theta\in\Theta_1\}.$$
Examples:
Goodness of fit in iid cases: $X_i\overset{iid}{\sim} f$, $\mathcal{F}_0 = \{f_\theta,\ \theta\in\Theta_0\}$, $\mathcal{F}_1 = \mathcal{F}_0^c$.
Linear/partially linear regression: $y_i = \alpha + \beta^t w_i + f(x_i) + \epsilon_i$, $\epsilon_i\sim N(0,\sigma^2)$; $H_0: f=0$, $H_1: f\neq 0$.

Nonparametric priors on $\mathcal{F}_1$
$H_0: f\in\mathcal{F}_0 = \{f_\theta,\ \theta\in\Theta\}$, $\Theta\subset\mathbb{R}^d$; $H_1: f\in\mathcal{F}_1$. Parameter space $\mathcal{F} = \mathcal{F}_0\cup\mathcal{F}_1$.
Nonparametric prior $\pi_1$ on $\mathcal{F}_1$, i.e. the prior under $H_1$: your favourite nonparametric prior, e.g. Gaussian processes, Lévy processes, mixtures.
Check that $\Pi_1[\mathcal{F}_0] = 0$; e.g. $\mathcal{F}_0 = \{N(\mu,\sigma^2),\ (\mu,\sigma)\in\mathbb{R}\times\mathbb{R}^+\}$ and $\Pi_1$ a prior on mixtures $\sum_{j=1}^k p_j N(\mu_j,\sigma_j^2)$ with $k>1$.
Should we care for more? Yes: consistency.

Case of a point null hypothesis (iid observations)
$H_0: f = f_0$, $H_1: f\neq f_0$. Prior on $\mathcal{F}_1 = \mathcal{F}\setminus\{f_0\}$; this corresponds to a prior on $\mathcal{F}$:
$$\Pi(f) = p_0\,\delta_{f_0}(f) + (1-p_0)\,\Pi_1(f), \qquad \Pi_1 \text{ a probability on }\mathcal{F} \text{ with } \Pi_1(\{f_0\}) = 0.$$
Dass & Lee: $BF_{0/1}$ is always consistent under $f_0$; and if $d\Pi_1(\cdot\mid X^n)$ is consistent at $f\in\mathcal{F}\setminus\{f_0\}$, then $BF_{0/1}$ is consistent under $f$.

Remark on Dass & Lee
Convergence under $f_0$ comes from Doob's theorem, because $\Pi[\{f_0\}] = p_0 > 0$ in $\Pi(f) = p_0\delta_{f_0} + (1-p_0)\Pi_1(f)$.
It is harder to detect $H_0$ than $H_1$: $f_0\in\bar{\mathcal{F}}$ can be approximated under $\mathcal{F}_1$. Typically,
under $H_1$: $BF_{0/1}\approx e^{-nc(f)}$; under $H_0$: $BF_{0/1}\approx n^{q_0}$.

On posterior concentration
Posterior concentration at rate $\epsilon_n = o(1)$: prove that $E_{\theta^*}[\Pi(d(\theta,\theta^*) > M\epsilon_n\mid Y^n)] = o(1)$, with $\ell_n(\theta) = \log f_\theta(Y^n)$.
Conditions:
1. KL condition:
$$\pi\big[KL_n(\theta^*,\theta)\le n\epsilon_n^2,\ V_n(\theta^*,\theta)\le\kappa n\epsilon_n^2\big] \ge e^{-nc\epsilon_n^2},$$
where $KL_n(\theta^*,\theta) = E_{\theta^*}(\ell_n(\theta^*)-\ell_n(\theta))$ and $V_n(\theta^*,\theta) = V_{\theta^*}(\ell_n(\theta^*)-\ell_n(\theta))$.
2. Testing condition: there exist $\Theta_n\subset\Theta$ with $\pi(\Theta_n^c)\le e^{-n(c+2)\epsilon_n^2}$ and tests $\phi_n\in[0,1]$ with
$$E_{\theta^*}(\phi_n) = o(1), \qquad \sup_{d(\theta,\theta^*)>M\epsilon_n,\ \theta\in\Theta_n} E_\theta(1-\phi_n)\le e^{-n(c+2)\epsilon_n^2}.$$

Proof
Write, with $B_n = \{\theta: d(\theta,\theta^*) > \epsilon_n\}$,
$$\Pi(d(\theta,\theta^*)>\epsilon_n\mid Y^n) = \frac{\int_{B_n} e^{\ell_n(\theta)-\ell_n(\theta^*)}\,d\pi(\theta)}{\int_\Theta e^{\ell_n(\theta)-\ell_n(\theta^*)}\,d\pi(\theta)} := \frac{N_n}{D_n}.$$
Then
$$E_{\theta^*}[\Pi(d(\theta,\theta^*)>\epsilon_n\mid Y^n)] = E_{\theta^*}\Big[\phi_n\frac{N_n}{D_n}\Big] + E_{\theta^*}\Big[(1-\phi_n)\frac{N_n}{D_n}\Big]$$
$$\le E_{\theta^*}[\phi_n] + P_{\theta^*}\big[D_n < e^{-n(c+3/2)\epsilon_n^2}\big] + e^{n(c+3/2)\epsilon_n^2}E_{\theta^*}\big(N_n(1-\phi_n)\big)$$
$$= E_{\theta^*}[\phi_n] + P_{\theta^*}\big[D_n < e^{-n(c+3/2)\epsilon_n^2}\big] + e^{n(c+3/2)\epsilon_n^2}\int_{B_n} E_\theta(1-\phi_n)\,d\pi(\theta)$$
$$\le o(1) + P_{\theta^*}\big[D_n < e^{-n(c+3/2)\epsilon_n^2}\big] + e^{n(c+3/2)\epsilon_n^2}\big(e^{-n(c+2)\epsilon_n^2} + \Pi(\Theta_n^c)\big)$$
$$= o(1) + P_{\theta^*}\big[D_n < e^{-n(c+3/2)\epsilon_n^2}\big].$$

Control of $P_{\theta^*}\big[D_n < e^{-n(c+3/2)\epsilon_n^2}\big]$: the KL condition
With $S_n = \{\theta:\ KL_n(\theta^*,\theta)\le n\epsilon_n^2,\ V_n(\theta^*,\theta)\le\kappa n\epsilon_n^2\}$,
$$D_n = \int_\Theta e^{\ell_n(\theta)-\ell_n(\theta^*)}\,d\pi(\theta) \ge \int_{S_n} \mathbf{1}_{\ell_n(\theta)-\ell_n(\theta^*) > -3n\epsilon_n^2/2}\, e^{\ell_n(\theta)-\ell_n(\theta^*)}\,d\pi(\theta)$$
$$\ge e^{-3n\epsilon_n^2/2}\,\pi\big(S_n\cap\{\ell_n(\theta)-\ell_n(\theta^*) > -3n\epsilon_n^2/2\}\big) = e^{-3n\epsilon_n^2/2}\Big(\pi(S_n) - \pi\big(S_n\cap\{\ell_n(\theta)-\ell_n(\theta^*)\le -3n\epsilon_n^2/2\}\big)\Big),$$
and
$$P_{\theta^*}\Big(\pi\big(S_n\cap\{\ell_n(\theta)-\ell_n(\theta^*)\le -3n\epsilon_n^2/2\}\big) > \pi(S_n)/2\Big) \le \frac{2}{\pi(S_n)}\int_{S_n}\underbrace{P_{\theta^*}\big(\ell_n(\theta)-\ell_n(\theta^*)\le -3n\epsilon_n^2/2\big)}_{\lesssim\, 1/(n\epsilon_n^2)\text{ by the KL condition}}\,d\pi(\theta) \lesssim \frac{1}{n\epsilon_n^2}.$$

What are these tests?
The tests $\phi_n$ depend on the metric $d(\theta,\theta^*)$.
Case of iid data with the weak topology, $U = \{f:\ |\int\varphi_j f(x)dx - \int\varphi_j f_0(x)dx|\le\epsilon,\ j=1,\dots,J\}$: tests always exist.
$L_1$ metric, $d(f,f') = \int |f(x)-f'(x)|\,dx$:
$$\phi_n = \max_l \phi_{n,f_l}, \qquad \phi_{n,f_l} = \mathbf{1}\Big\{\sum_{i=1}^n\big[\mathbf{1}_{y_i\in A(f_l,f_0)} - P_{f_0}(A(f_l,f_0))\big] > n\delta\|f_l-f_0\|_1\Big\}$$
with $A(f_l,f_0) = \{y:\ f_l(y) > f_0(y)\}$. [Existence of tests then follows from a bound on the $L_1$ covering number, $N(\cdot)\le e^{nc\epsilon_n^2}$.]
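To make $\phi_{n,f_l}$ concrete, here is a small sketch (ours) with $f_0 = N(0,1)$ and a single alternative $f_1 = N(0.5,1)$, for which $A(f_1,f_0) = \{y > 1/4\}$ and $\|f_1-f_0\|_1$ is available in closed form; the constant $\delta < 1/2$ is our choice.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n, delta = 1000, 0.25

f0, f1 = norm(0.0, 1.0), norm(0.5, 1.0)
a = 0.25                                     # A(f1, f0) = {y : f1(y) > f0(y)} = {y > 1/4}
p0 = f0.sf(a)                                # P_{f0}(A(f1, f0))
l1 = 2 * (norm.cdf(0.25) - norm.cdf(-0.25))  # ||f1 - f0||_1 for unit-variance Gaussians

def phi_n(y):
    """Reject (1) when the centred count of hits in A exceeds n * delta * ||f1-f0||_1."""
    s = ((y > a).astype(float) - p0).sum()
    return int(s > n * delta * l1)

y0 = f0.rvs(size=n, random_state=rng)        # under f0: expect phi_n = 0
y1 = f1.rvs(size=n, random_state=rng)        # under f1: expect phi_n = 1
print(phi_n(y0), phi_n(y1))
```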

Summary of the first day: $BF_{0/1} = m_0(Y^n)/m_1(Y^n)$
Parametric case:
$$\log m_j(Y^n) = \ell_n(\theta_j^*) \underbrace{- \frac{D_j}{2}\log n}_{\approx\, \log\pi_j(d(\theta_j^*,\theta)\le 1/n)} + o_p(\log n).$$
Regular model: $D_j = \dim(\Theta_j)$. Irregular model: $D_j = D_j(\theta^*)\le\dim(\Theta_j)$.
Important for convergence: if $\theta^*\in\Theta_j\subset\Theta_{j+1}$, then $D_j(\theta^*) < D_{j+1}(\theta^*)$.

Summary 2: towards the nonparametric case
Posterior concentration rates:
$$\pi(B_n\mid Y^n) = \frac{\int_{B_n} e^{\ell_n(\theta)-\ell_n(\theta^*)}\,d\pi(\theta)}{\int_\Theta e^{\ell_n(\theta)-\ell_n(\theta^*)}\,d\pi(\theta)} := \frac{N_n}{D_n}.$$
Lower bound on $D_n$: $P_{\theta^*}[D_n > e^{-n\epsilon_n^2(c+3/2)}] = 1+o(1)$ if
$$\pi\big(\{\theta:\ KL_n(\theta^*,\theta)\le n\epsilon_n^2,\ V_n(\theta^*,\theta)\le n\epsilon_n^2\}\big)\ge e^{-nc\epsilon_n^2}.$$
Upper bound on $N_n = \int_{B_n} e^{\ell_n(\theta)-\ell_n(\theta^*)}\,d\pi(\theta)$: $P_{\theta^*}[N_n = o(e^{-n\epsilon_n^2(c+3/2)})] = 1+o(1)$ if either tests $\phi_n$ of $H_0:\theta=\theta^*$ vs $H_1:\theta\in B_n$ exist, or $\pi(B_n) = o(e^{-n(c+3/2)\epsilon_n^2})$.

General result under non-iid observations: sufficient conditions
Context: under $M_0$, $Y\sim f_{0,\theta_0}^n$, $\theta_0\in\Theta_0\subset\mathbb{R}^d$; under $M_1$, $Y\sim f_{1,\theta_1}^n$, $\theta_1\in\Theta_1$ (infinite-dimensional). Let $d_n(\cdot,\cdot)$ be a (semi)metric on the distributions (e.g. total variation, KL, etc.) and set
$$\liminf_n \inf_{\theta_0\in\Theta_0} d_n(f_{1,\theta_1}, f_{0,\theta_0}) := d(\theta_1, M_0), \qquad \liminf_n \inf_{\theta_1\in\Theta_1} d_n(f_{0,\theta_0}, f_{1,\theta_1}) := d(\theta_0, M_1).$$
[Typically $M_0\subset M_1$.]

Conditions under $M_1$ (the bigger model), $\theta^* = \theta_1$: often easier
Strategy: bound $m_0(Y^n)$ from above and $m_1(Y^n)$ from below, where $m_j(Y^n) = \int_{\Theta_j} e^{\ell_n(j,\theta)-\ell_n(\theta^*)}\,d\pi_j(\theta)$.
Assume $d(\theta^*, M_0) > 0$ (separated case). If
$$\forall\epsilon>0,\quad P_{\theta^*}\big[m_1(Y^n) < e^{-n\epsilon^2}\big] = o(1) \qquad \text{[consistency under }M_1\text{]},$$
and if there exist $\epsilon>0$, $a>0$, tests $\phi_n$ and $\Theta_{0,n}\subset\Theta_0$ such that
$$E_{\theta^*}[\phi_n] = o(1), \qquad \sup_{\theta_0\in\Theta_{0,n}} E_{\theta_0}[1-\phi_n]\le e^{-an}, \qquad \pi_0(\Theta_{0,n}^c) < e^{-n\epsilon} \qquad \text{[statistical separation between }\theta^*\text{ and }M_0\text{]},$$
then $B_{0/1} < e^{-an/2}$ with probability going to 1 under $P_{\theta^*}$.

Conditions under $M_0$ (the smaller model), $\theta^*\in M_0\cap M_1$
Case $d(\theta^*, M_1) = 0$, with $\Theta_0\subset\mathbb{R}^d$ (parametric).
If there exists $k_0>0$ such that
$$n^{k_0/2}\,\pi_0\big[\{\theta\in\Theta_0:\ KL(f_{0,\theta^*}^n, f_{0,\theta}^n)\le 1,\ V(f_{0,\theta^*}^n, f_{0,\theta}^n)\le 1\}\big]\ge C$$
for some positive constant $C$, and (NP side, with $f_{\theta^*} = f_{0,\theta^*}$) if there exists $\epsilon_n\to 0$ with $A_{\epsilon_n}(\theta^*) = \{\theta\in\Theta_1:\ d_n(f_{\theta^*}^n, f_{1,\theta}^n) < \epsilon_n\}$ such that
1. $P_{\theta^*}^n\big[\pi_1\big(A_{\epsilon_n}(\theta^*)^c\mid Y\big)\big] = o(1)$,
2. $n^{k_0/2}\,\pi_1\big[A_{\epsilon_n}(\theta^*)\big] = o(1)$,
then $B_{0/1}\to+\infty$ under $P_{\theta^*}$.

Application to partially linear regression
$$y_i = \alpha + \beta^t w_i + f(x_i) + \epsilon_i, \qquad \epsilon_i\sim N(0,\sigma^2), \qquad (w_i,x_i)\sim G \text{ iid},\quad E[w]=0,\ E[ww^t]>0.$$
$M_0$: $(\alpha,\beta,f,\sigma)$ such that
$$H(f) := \inf_{(\alpha',\beta')\in\mathbb{R}^{d+1}} E\big[\big(\alpha' + (\beta')^t w - f(x)\big)^2\big] = 0,$$
i.e. $y_i = \alpha_1 + \beta_1^t w_i + \epsilon_i$.
$M_1$: $(\alpha,\beta,f,\sigma)$ such that $H(f) > 0$.

Prior: a hierarchical (adaptive) Gaussian process prior on $f$:
$$f = \sum_{k=1}^K \tau_k Z_k \xi_k, \qquad (\xi_k)\ \text{an orthonormal basis},\quad K\sim P,\quad Z_k\sim N(0,1),\quad (\alpha,\beta,\sigma)\sim\pi_0.$$
Let $h_j(x) = E[w_j\mid X = x]$, $w = (w_1,\dots,w_d)$. If there is $j$ such that $\langle h_j,\xi_1\rangle\neq 0$, then $B_{0/1}$ is consistent under $M_0$, at rate $n^\rho$.
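A minimal sketch of one draw from such a hierarchical basis prior, under assumptions of our own (cosine basis $\xi_k$, $K\sim 1+\mathrm{Poisson}(5)$, $\tau_k = k^{-1}$); the actual choices behind the consistency result may differ.

```python
import numpy as np

rng = np.random.default_rng(4)

def draw_f(x):
    """One prior draw of f = sum_k tau_k Z_k xi_k on a grid x in [0, 1]."""
    K = 1 + rng.poisson(5)                       # random truncation level K ~ P
    z = rng.normal(size=K)                       # Z_k ~ N(0, 1)
    tau = np.arange(1, K + 1, dtype=float) ** -1.0
    xi = np.array([np.sqrt(2.0) * np.cos(np.pi * k * x) for k in range(1, K + 1)])
    return (tau * z) @ xi

x = np.linspace(0.0, 1.0, 200)
f = draw_f(x)                                    # the nonparametric component under M1
```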

Nonparametric/nonparametric testing problem, $B_{0/1} = m_0(Y)/m_1(Y)$
Both $M_0$ and $M_1$ are nonparametric, and $M_0\subset M_1$.
Lower bound (not too hard, but is it sharp?):
$$m_j(Y)\ge e^{-(c_j+2)n\epsilon_{n,j}^2} \quad \text{if} \quad \pi_j\big(B_{KL}(\theta^*, n\epsilon_{n,j}^2)\big)\ge e^{-nc_j\epsilon_{n,j}^2}.$$
Upper bound: harder.
Tests control (we have tools for it):
$$\int_{d(\theta^*,\theta)>\epsilon_{n,j}} e^{\ell_n(\theta)-\ell_n(\theta^*)}\,d\pi_j(\theta)\le e^{-c_j n\epsilon_{n,j}^2}.$$
Control of $\int_{d(\theta^*,\theta)\le\epsilon_{n,j}} e^{\ell_n(\theta)-\ell_n(\theta^*)}\,d\pi_j(\theta)$: damn it. If $d(\theta^*, M_j) = 0$,
$$E_{\theta^*}[N_{n,j}] := E_{\theta^*}\Big[\int_{d(\theta^*,\theta)\le\epsilon_{n,j}} e^{\ell_n(\theta)-\ell_n(\theta^*)}\,d\pi_j(\theta)\Big] \overset{\text{Fubini}}{=} \pi_j\big(d(\theta^*,\theta)\le\epsilon_{n,j}\big) \overset{?}{\approx} e^{-c_j n\epsilon_{n,j}^2}.$$

Setups where this approach works [Ghosal, Lember & van der Vaart; Rousseau & Szabo]
Multiple models $\Theta_\lambda$, $\lambda\in\Lambda$ (possibly continuous). Aim: $\hat\lambda = \mathrm{argmax}\{\pi(\lambda\mid Y^n),\ \lambda\in\Lambda\}$, with $\pi(\lambda\mid Y^n)\propto\pi(\lambda)m_\lambda(Y^n)$. What is $\lambda^*$? Two contexts:
Almost finite case: $\Lambda\subset\mathbb{N}$, $\Theta_\lambda\subset\Theta_{\lambda+1}$, and there exists $\lambda_0$ such that $\theta^*\in\Theta_{\lambda_0}$; then $\lambda^* = \min\{\lambda:\ \theta^*\in\Theta_\lambda\}$.
Boundary case: $\theta^*\in \mathrm{Cl}(\cup_\lambda\Theta_\lambda)$ but $\theta^*\notin\Theta_\lambda$ for every $\lambda$.
Claim: $\hat\lambda\approx$ the oracle $\lambda^* = \mathrm{argmin}\{\epsilon_{n,\lambda},\ \lambda\in\Lambda\}$, where $\epsilon_{n,\lambda}$ solves $\pi_\lambda\big(d(\theta^*,\theta)\le\epsilon_{n,\lambda}\big) = e^{-cn\epsilon_{n,\lambda}^2}$. [Sharp tools, like BIC.]

Where does the oracle come from?
$$m_\lambda(Y^n) = \int_{\Theta_\lambda}\frac{f_{\theta,\lambda}(Y^n)}{f_{\theta^*}(Y^n)}\,d\pi_\lambda(\theta) \ge \int_{d(\theta,\theta^*)\le\epsilon_{n,\lambda}} e^{\ell_n(\theta)-\ell_n(\theta^*)}\,d\pi_\lambda(\theta) \ge e^{-c_1 n\epsilon_{n,\lambda}^2}\,\pi_\lambda\big(d(\theta^*,\theta)\le\epsilon_{n,\lambda}\big),$$
$$m_\lambda(Y^n) \le e^{c_2 n\epsilon_{n,\lambda}^2}\,\pi_\lambda\big(d(\theta^*,\theta)\le\epsilon_{n,\lambda}\big) + e^{-nc_2\epsilon_{n,\lambda}^2}.$$
Under stringent conditions, $\pi(\lambda:\ \epsilon_{n,\lambda}\le M\epsilon_{n,\lambda^*}\mid Y^n)\to 1$, so $\hat\lambda\in\{\lambda:\ \epsilon_{n,\lambda}\le M\epsilon_{n,\lambda^*}\}$. [Only useful if $\lambda\neq\lambda^*$ implies $\epsilon_{n,\lambda}\gg\epsilon_{n,\lambda^*}$.]

NP priors on $M_0$ and $M_1$: testing for monotonicity
$Y = (y_1,\dots,y_n)$ iid $\sim f$, or $y_i = f(x_i)+\epsilon_i$. Model $M_0$: $f$ decreasing; model $M_1$: $f$ has no shape constraint.
First idea. Prior on $M_0$: mixtures of uniforms $U(0,\theta)$,
$$f_P(y) = \int_0^\infty \frac{\mathbf{1}_{y\le\theta}}{\theta}\,dP(\theta), \qquad P\sim DP(A\,G).$$
Prior on smooth densities for $M_1$: e.g. mixtures of Betas,
$$f_P(y) = \int_0^\infty g_{z,z/\epsilon}(y)\,dP(\epsilon), \qquad (z,P)\sim\pi_z\otimes DP(A\,G).$$
[Bad idea.]

A first solution
$y_i\in[0,1]$,
$$f(x) = f_{\omega,k} := \sum_{j=1}^k k\,\omega_j\,\mathbf{1}_{[(j-1)/k,\,j/k)}(x), \qquad \sum_{j=1}^k\omega_j = 1.$$
Prior: $k\sim P$; $(\omega_1,\dots,\omega_k)\sim\pi_k$, e.g. $\mathcal{D}(\alpha_1,\dots,\alpha_k)$.
Posterior probabilities:
$$\pi[M_0\mid Y] = \pi\big[\max_j(\omega_j-\omega_{j-1})\le 0\ \big|\ Y\big].$$
Problem: what if $f\in M_0$ but has flat parts?
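With $k$ fixed and a Dirichlet prior, the weights are conjugate given the bin counts, so $\pi[M_0\mid Y]$ can be estimated directly by posterior sampling. A sketch (ours; $k$, $\alpha$ and the data-generating density are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k, alpha = 500, 10, 1.0

y = rng.beta(1.0, 3.0, size=n)                 # true density 3(1-y)^2 is decreasing
counts = np.histogram(y, bins=k, range=(0.0, 1.0))[0]

# Conjugacy: (w_1,...,w_k) | Y ~ Dirichlet(alpha + n_1, ..., alpha + n_k)
omega = rng.dirichlet(alpha + counts, size=20_000)

decreasing = np.all(np.diff(omega, axis=1) <= 0.0, axis=1)
print("pi[M0 | Y] ~", decreasing.mean())       # posterior probability of monotonicity
```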

Result
Question: is $\pi[M_0\mid Y] = 1 + o_{p_f}(1)$ whenever $f\in M_0$? Answer: not always. Define
$$g(\delta) := \mathrm{Leb}\Big(\Big\{x:\ \inf_{y>x}\frac{f(x)-f(y)}{y-x}\le\delta\Big\}\Big).$$
For all $f_0$ such that $g(\delta) = 0$: if $P^\pi[k\le (n/\log n)^{1/5}\mid Y] = 1+o_p(1)$, then $P^\pi[M_0\mid Y] = 1+o_p(1)$; this yields a consistent test, e.g. when $k-3\sim P(\lambda)$. For all $f_0$ such that $g(\delta)\lesssim\delta$: the same holds with $k-7\sim P(\lambda)$.

Inconsistencies
If $f_0$ is constant on $[a,b]\subset(0,1)$ and if $\pi\big(3/(b-a) < k < n/\log n\ \big|\ Y^n\big)\to 1$, then, asymptotically, with probability greater than $1/2$, $\pi(M_0\mid Y^n) < 1/2$. [Inconsistency of the testing procedure.]

Comments
Not satisfying: the procedure cannot be consistent over the whole set of decreasing functions, and increasing the set of functions by lowering $k$ leads to poor power.
Change of loss function: a new test based on $P^\pi[d(f, M_0)\le\epsilon\mid Y]$. Choice of $\epsilon$?

Moving (slightly) away from the 0-1 loss function, for non-separated hypotheses [J.-B. Salomond]
New loss function:
$$L_\tau(\theta,\delta=0) = \mathbf{1}_{d(\theta,M_0)>\tau}, \qquad L_\tau(\theta,\delta=1) = \mathbf{1}_{d(\theta,M_0)\le\tau}.$$
Equivalent to testing $H_0:\ d(\theta,M_0)\le\tau$ vs $H_1:\ d(\theta,M_0)>\tau$.
Bayes estimator:
$$\delta^\pi = \begin{cases} 0 & \text{if } \pi\big(d(\theta,M_0)\le\tau\mid Y^n\big) > 1/2, \\ 1 & \text{otherwise.}\end{cases}$$
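A sketch of this decision rule in the histogram setting above (ours; as a surrogate for $d(\theta,M_0)$ we use the largest upward jump of the weights, and $\tau$ is set by hand, whereas the next slide is precisely about calibrating it):

```python
import numpy as np

rng = np.random.default_rng(6)
n, k, alpha, tau = 500, 10, 1.0, 0.05

y = rng.uniform(0.0, 1.0, size=n)              # flat density: boundary of M0
counts = np.histogram(y, bins=k, range=(0.0, 1.0))[0]
omega = rng.dirichlet(alpha + counts, size=20_000)

# Surrogate distance to M0: largest positive increment of the weights.
d_to_M0 = np.clip(np.diff(omega, axis=1), 0.0, None).max(axis=1)
accept = (d_to_M0 <= tau).mean()               # pi(d(theta, M0) <= tau | Y^n)
delta = 0 if accept > 0.5 else 1
print(f"pi(d <= tau | Y) = {accept:.3f}, decision = {delta}")
```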

Choice of $\tau$
If prior information is available on the tolerance level, $\tau$ is chosen a priori. Otherwise, calibrate $\tau_n$ as the smallest value such that the type I error vanishes:
$$\sup_{\theta\in M_0} P_\theta\big[\pi\big(d(\theta,M_0)\le\tau_n\mid Y^n\big)\le 1/2\big] = o(1).$$
Then study the separation rate $\rho_n$ such that
$$\sup_{\theta:\ d(\theta,M_0)>\rho_n} P_\theta\big[\pi\big(d(\theta,M_0)\le\tau_n\mid Y^n\big)\ge 1/2\big] = o(1)$$
[control of the type II error over $\rho_n$-separated hypotheses]. In quite a few examples, $\rho_n\approx$ the minimax separation rate.

Difficulties: $\tau_n$ is chosen asymptotically
The analysis is difficult in some cases: it is not always possible to determine $\tau_n$ explicitly. E.g., for testing monotonicity, only for simple families of priors does one get the optimal $\tau_n = \tau_0 v_n$, where $v_n$ is known but $\tau_0$ is arbitrary.
How should $\tau_0$ be calibrated for finite $n$? Can we trust purely asymptotic results to calibrate a procedure?

Semiparametric case: $\theta = (\psi,\eta)$
$M_1:\ \psi = \psi_0$, $M_2:\ \psi\neq\psi_0$.
Cox hazard: $h_\theta(x\mid z) = e^{\psi^t z}\eta(x)$, $M_1:\ \psi = 0$.
Long-memory parameter: $M_1:\ h = 0$, with $Y\sim N(0, T_n(f_\theta))$, $f_\theta(x) = (1-\cos x)^{-h}\eta(x)$, $\eta > 0$ in $C^b$.
Saved by the Bernstein-von Mises theorem: if $\pi\big(v_n(\psi-\hat\psi)\in A\mid Y\big)\to \Pr\big(N_d(0,1)\in A\big)$, then $B_{1/2}\asymp v_n^{d/2}\to+\infty$.

The two-sample test
$X^n = (X_1,\dots,X_n)$ iid $\sim F_X$, $Y^n = (Y_1,\dots,Y_n)$ iid $\sim F_Y$.
$$H_0:\ F_X = F_Y = F \qquad \text{vs} \qquad H_1:\ F_X\neq F_Y.$$
Prior: under $H_0$, $F\sim\pi$; under $H_1$, $F_X, F_Y\sim\pi$ independently.
$$B_{1/2} = \frac{m_{2n}(X^n, Y^n)}{m_n(X^n)\,m_n(Y^n)}: \quad \text{consistency?}$$
There is hardly any (Bayesian) literature on the two-sample test: Holmes et al. [Polya tree], Dunson & Lock [parametric], Chen & Hansen [censored, Polya tree].
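A toy version of $B_{1/2}$ (ours), replacing the nonparametric prior by a conjugate parametric one so the marginals are closed-form: data are $N(\theta,1)$ and $\theta\sim N(0,\tau^2)$, hence $m_n$ is a Gaussian marginal computed via the matrix inversion lemma.

```python
import numpy as np

def log_marginal(x, tau=2.0):
    """log m_n(x) for x | theta ~ N(theta 1, I), theta ~ N(0, tau^2):
    marginally x ~ N(0, I + tau^2 * 11^t)."""
    n = len(x)
    s, q = x.sum(), (x ** 2).sum()
    v = 1.0 + n * tau ** 2                       # det(I + tau^2 J) = 1 + n tau^2
    return (-0.5 * n * np.log(2 * np.pi) - 0.5 * np.log(v)
            - 0.5 * q + 0.5 * tau ** 2 * s ** 2 / v)

rng = np.random.default_rng(7)
n = 300
x = rng.normal(0.0, 1.0, size=n)
y = rng.normal(0.5, 1.0, size=n)                 # shifted sample: H1 is true

log_B = log_marginal(np.concatenate([x, y])) - log_marginal(x) - log_marginal(y)
print(f"log B_1/2 = {log_B:.1f}")                # strongly negative: evidence against H0
```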

Other ways of getting away from 0-1 loss functions
Distance-penalized losses [Robert & Rousseau; Rousseau]. Goodness-of-fit test: $Y^n = (y_1,\dots,y_n)$, $y_i$ iid $\sim f$,
$$H_0:\ f\in\mathcal{F}_0 = \{f_\theta,\ \theta\in\Theta\subset\mathbb{R}^d\}, \qquad H_1:\ f\notin\mathcal{F}_0.$$
$$L(\theta,\delta) = \begin{cases} \big(d(\theta,M_0)-\epsilon\big)\mathbf{1}_{d(\theta,M_0)>\epsilon} & \text{when }\delta = 0, \\ \big(\epsilon - d(\theta,M_0)\big)\mathbf{1}_{d(\theta,M_0)\le\epsilon} & \text{when }\delta = 1.\end{cases}$$
Bayes test: $\delta^\pi = 1$ iff $E^\pi[d(\theta,M_0)\mid Y^n] > \epsilon$.
Calibration of $\epsilon$ using a conditional predictive p-value:
$$p(y^n) = \int_\Theta P_\theta\big(T(Y^n) > T(y^n)\ \big|\ \hat\theta, y^n\big)\,\pi_0(\theta)\,d\theta.$$
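The corresponding Bayes test is a one-liner once a posterior sample and a distance are available. Reusing the histogram posterior from the monotonicity example (our illustration; $\epsilon$ is set by hand rather than by the predictive p-value calibration):

```python
import numpy as np

rng = np.random.default_rng(8)
n, k, alpha, eps = 500, 10, 1.0, 0.03

y = rng.beta(3.0, 1.0, size=n)                 # increasing density: far from M0
counts = np.histogram(y, bins=k, range=(0.0, 1.0))[0]
omega = rng.dirichlet(alpha + counts, size=20_000)

d_to_M0 = np.clip(np.diff(omega, axis=1), 0.0, None).max(axis=1)
delta = int(d_to_M0.mean() > eps)              # delta = 1 iff E[d(theta, M0) | Y] > eps
print(f"E[d | Y] = {d_to_M0.mean():.3f}, decision = {delta}")
```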

Other ways
There are many other ways...

Conclusion
Point null hypothesis: consistency of the BF $\Leftrightarrow$ consistency of the posterior under $H_1$.
Parametric null: more involved; $\Pi_1$ can be more favourable to $f_{\theta_0}$ than $\Pi_0$.
Nonparametric null, e.g. $\mathcal{F}_0 = \{f:\ldots\}$: one needs to be even more careful.
Are BIC or BIC-type formulas better than the BF?
