Supplementary Materials: Proofs

The following lemmas prepare for the proof of the concentration properties of the prior density $\pi_\alpha(\beta_j/\sigma)$, $j = 1, \dots, p_n$, on the $m_n$-dimensional vectors $\beta_j/\sigma$. As introduced earlier, we consider the shrinkage prior
\[
\beta_j \mid \lambda_j, \sigma^2 \overset{ind}{\sim} N_{m_n}(0, \lambda_j \sigma^2 I_{m_n}), \tag{1}
\]
\[
\lambda_j \mid \psi_j \overset{ind}{\sim} \mathrm{Exp}(\psi_j), \qquad \psi_j \overset{iid}{\sim} \mathrm{Gamma}(\alpha, 2). \tag{2}
\]
For $j = 1, \dots, p_n$, we let $p(\lambda_j \mid \psi_j)$ denote the prior density of $\lambda_j$ given $\psi_j$, and $p_\alpha(\psi_j)$ the prior density of $\psi_j$ given the hyper-parameter $\alpha$. In the following lemmas on the multivariate Dirichlet-Laplace (DL) prior, we show that it concentrates within a small neighborhood of zero with a dominant and steep peak, while retaining heavy tails away from zero.

Lemma 1 For any positive constant $\mu > 0$ and thresholding value $a_n \asymp \sqrt{m_n}\,\epsilon_n/p_n$,
\[
1 - \int_{\|x\| \le a_n} \pi_\alpha(x)\, dx \lesssim p_n^{-(1+\mu)}, \tag{3}
\]
if the Dirichlet parameter satisfies $\alpha \lesssim p_n^{-(1+\nu)}$ for $\nu > \mu$.

Lemma 2 For a small constant $\eta > 0$ and $E$ defined in assumption $A_2$(2),
\[
-\log\bigg( \int_{\sum_{j=1}^{p_n-s} \|x_j\| \le \eta\sqrt{m_n}\,\epsilon_n} \prod_{j=1}^{p_n-s} \pi_\alpha(x_j)\, dx_1 \cdots dx_{p_n-s} \bigg) \lesssim n\epsilon_n^2, \tag{4}
\]
\[
-\log\Big( \inf_{\|x\| \le \sqrt{m_n}\,E} \pi_\alpha(x) \Big) \lesssim n\epsilon_n^2/s. \tag{5}
\]
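Before turning to Lemma 3, a small numerical illustration of Lemma 1 may help fix ideas. The following Monte Carlo sketch is ours and not part of the formal argument; it assumes the reading of (1)-(2) above, with $\mathrm{Exp}(\psi_j)$ parameterized by its mean and $\mathrm{Gamma}(\alpha, 2)$ by shape and scale, and the values of $p_n$, $m_n$ and $\epsilon_n$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_prior_norms(alpha, m_n, n_draws=100_000):
    """Draw ||x|| under the hierarchy (1)-(2): psi ~ Gamma(alpha, scale 2),
    lambda | psi ~ Exp(mean psi), x | lambda ~ N_{m_n}(0, lambda I)."""
    psi = rng.gamma(shape=alpha, scale=2.0, size=n_draws)
    lam = rng.exponential(scale=psi)          # exponential with mean psi
    # Given lambda, ||x|| = sqrt(lambda) * chi_{m_n}; draw the chi factor directly.
    chi = np.sqrt(rng.chisquare(df=m_n, size=n_draws))
    return np.sqrt(lam) * chi

p_n, m_n, eps_n = 500, 10, 0.05
a_n = np.sqrt(m_n) * eps_n / p_n              # thresholding value of Lemma 1
for alpha in [1.0, p_n**-1.5, p_n**-2.0]:
    norms = sample_prior_norms(alpha, m_n)
    print(f"alpha={alpha:.2e}: P(||x|| >= a_n) ~ {np.mean(norms >= a_n):.5f}")
```

For $\alpha$ of order $p_n^{-(1+\nu)}$, the estimated tail mass drops to roughly the order of $\alpha \log(p_n/a_n^2)$, which is the behavior that (3) formalizes.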
Lemma 3 The prior density $\pi_\alpha(\beta_j/\sigma)$ satisfies
\[
\max_{j \in \xi^*}\; \sup_{\|x_1\|, \|x_2\| \in [\|\beta_j^*\|/\sigma^* - \sqrt{m_n}\,C\epsilon_n,\ \|\beta_j^*\|/\sigma^* + \sqrt{m_n}\,C\epsilon_n]} \frac{\pi_\alpha(x_1)}{\pi_\alpha(x_2)} < l_n, \tag{6}
\]
for a constant $C$ and $s \log l_n \lesssim \log p_n$.

The three lemmas above specify the conditions on prior concentration. Lemma 1 and Lemma 2 establish that the prior probability that the $L_2$-norm of each block of basis coefficients exceeds $a_n$ is arbitrarily small, under the constraint $a_n \asymp \sqrt{m_n}\,\epsilon_n/p_n$. Therefore, the shrinkage prior on $\beta_j/\sigma$ has a dominant and steep peak around zero. Moreover, Lemma 3 controls the variation of the prior density of the basis coefficients at points not too close to zero. The shrinkage prior density thus concentrates in a very small area around zero while retaining heavy and sufficiently flat tails, so the prior mimics the mixture of a continuous distribution and a point mass at zero. The proofs of these lemmas follow.

Proof of Lemma 1 For a vector $x$ of dimension $m_n$, the prior density is defined in (1) and (2). When $\lambda$ is given, $T \overset{def}{=} \|x\|^2/\lambda \sim \chi^2_{m_n}$. Next, we have
\[
1 - \int_{\|x\| \le a_n} \pi_\alpha(x)\, dx = \int_0^\infty P\Big(T > \frac{a_n^2}{\lambda}\Big) \bigg[ \int_0^\infty p(\lambda \mid \psi)\, p_\alpha(\psi)\, d\psi \bigg] d\lambda
\]
\[
= \int_0^{a_n^2/p_n} P\Big(T > \frac{a_n^2}{\lambda}\Big) \bigg[ \int_0^\infty p(\lambda \mid \psi)\, p_\alpha(\psi)\, d\psi \bigg] d\lambda + \int_{a_n^2/p_n}^\infty P\Big(T > \frac{a_n^2}{\lambda}\Big) \bigg[ \int_0^\infty p(\lambda \mid \psi)\, p_\alpha(\psi)\, d\psi \bigg] d\lambda. \tag{7}
\]
For any constant $\mu > 0$, we can conclude that
\[
\int_0^{a_n^2/p_n} P\Big(T > \frac{a_n^2}{\lambda}\Big) \bigg[ \int_0^\infty p(\lambda \mid \psi)\, p_\alpha(\psi)\, d\psi \bigg] d\lambda \le P(T > p_n)\, P_\alpha(\lambda < a_n^2/p_n) \le e^{-p_n/2}, \tag{8}
\]
where the last inequality is obtained from the tail property of the $\chi^2$-distribution.
Next, from Lemma 3.3 in Bhattacharya et al. (2015), it follows that
\[
\int_{a_n^2/p_n}^\infty P\Big(T > \frac{a_n^2}{\lambda}\Big) \bigg[ \int_0^\infty p(\lambda \mid \psi)\, p_\alpha(\psi)\, d\psi \bigg] d\lambda \le P_\alpha(\lambda > a_n^2/p_n) \le C_1\, \alpha \log(p_n/a_n^2) \tag{9}
\]
for a constant $C_1 > 0$. Combining the inequalities in (7), (8) and (9), we have
\[
1 - \int_{\|x\| \le a_n} \pi_\alpha(x)\, dx \le e^{-p_n/2} + C_1\, \alpha \log(p_n/a_n^2) \lesssim p_n^{-(1+\mu)}
\]
for $\alpha \lesssim p_n^{-(1+\nu)}$ with $\nu > \mu$.

Proof of Lemma 2 When $\lambda_j$ is given, the $m_n$-dimensional vector $x_j$ satisfies $\|x_j\|/\sqrt{\lambda_j} \overset{iid}{\sim} \chi_{m_n}$ for $j = 1, \dots, p_n$. Therefore, for the prior on $\lambda_j$ defined in (2),
\[
E(\|x_j\|) = E\big[ E(\|x_j\| \mid \lambda_j) \big] = E\big[ \sqrt{\lambda_j}\, E\big( \|x_j\|/\sqrt{\lambda_j} \mid \lambda_j \big) \big] = \frac{\sqrt{2}\,\Gamma((m_n+1)/2)}{\Gamma(m_n/2)}\, E(\sqrt{\lambda_j}), \tag{10}
\]
and then $E(\sqrt{\lambda_j}) = E[E(\sqrt{\lambda_j} \mid \psi_j)] = \Gamma(3/2)\, E(\sqrt{\psi_j}) = \Gamma(3/2)\,\sqrt{2}\,\Gamma(\alpha+1/2)/\Gamma(\alpha)$. Then, for a constant $C_2 > 0$ free of $\alpha$, $E(\|x_j\|) \asymp C_2 \sqrt{m_n}\,\alpha$. By the Central Limit Theorem,
\[
P\bigg( \sum_{j=1}^{p_n-s} \|x_j\| \le \eta\sqrt{m_n}\,\epsilon_n \bigg) = P\bigg( \frac{1}{p_n-s} \sum_{j=1}^{p_n-s} \frac{\|x_j\|}{\sqrt{m_n}} \le \frac{\eta\,\epsilon_n}{p_n-s} \bigg) \ge 1/2 \tag{11}
\]
if $\alpha \le C_3\,\epsilon_n/(p_n-s)$, which can be verified since $\alpha \lesssim p_n^{-(1+\nu)}$ for $\nu > 0$. Since the probability in (11) is bounded away from zero, we obtain
\[
-\log\bigg( \int_{\sum_{j=1}^{p_n-s} \|x_j\| \le \eta\sqrt{m_n}\,\epsilon_n} \prod_{j=1}^{p_n-s} \pi_\alpha(x_j)\, dx_1 \cdots dx_{p_n-s} \bigg) \lesssim n\epsilon_n^2. \tag{12}
\]
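As a numerical sanity check of the moment identity in (10) (our own sketch again, under the same parameterization assumptions as before), the closed form can be compared with a Monte Carlo average; both scale like $\sqrt{m_n}\,\alpha$ for small $\alpha$.

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(1)

def closed_form(alpha, m_n):
    """E||x_j|| from (10): sqrt(2)*Gamma((m_n+1)/2)/Gamma(m_n/2)
    times Gamma(3/2)*sqrt(2)*Gamma(alpha+1/2)/Gamma(alpha), on the log scale."""
    log_val = (np.log(2.0)                                  # the two sqrt(2) factors
               + gammaln((m_n + 1) / 2) - gammaln(m_n / 2)
               + gammaln(1.5)
               + gammaln(alpha + 0.5) - gammaln(alpha))
    return np.exp(log_val)

alpha, m_n = 0.01, 10
psi = rng.gamma(shape=alpha, scale=2.0, size=1_000_000)
lam = rng.exponential(scale=psi)
norms = np.sqrt(lam * rng.chisquare(df=m_n, size=psi.size))
print(closed_form(alpha, m_n), norms.mean())  # both close to C_2 * sqrt(m_n) * alpha
```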
Next, we need to prove that $-\log\big( \inf_{\|x\| \le \sqrt{m_n}E} \pi_\alpha(x) \big) \lesssim n\epsilon_n^2/s$. For $s_1/s_2$ close to zero,
\[
\inf_{\|x\| \le \sqrt{m_n}E} \pi_\alpha(x) = \int_0^\infty \int_0^\infty (2\pi\lambda)^{-m_n/2} e^{-m_n E^2/(2\lambda)}\, p(\lambda \mid \psi)\, p_\alpha(\psi)\, d\lambda\, d\psi \ge (2\pi s_2)^{-m_n/2}\, e^{-m_n E^2/(2s_1)}\, P_\alpha(s_1 < \lambda < s_2). \tag{13}
\]
As $\lambda$ follows the prior distribution in (2) for a given hyper-parameter $\alpha$, we have
\[
P_\alpha(s_1 < \lambda < s_2) = \int_0^\infty \big( e^{-s_1/\psi} - e^{-s_2/\psi} \big)\, p_\alpha(\psi)\, d\psi \ge \frac{s_2 - s_1}{2^\alpha\,\Gamma(\alpha)} \int_{s_2}^\infty \psi^{\alpha-2}\, e^{-s_2/\psi}\, e^{-\psi/2}\, d\psi \ge C_4\, \alpha\, (1 - s_1/s_2)\, e^{-s_2/2}
\]
for a constant $C_4 > 0$. Therefore, when $m_n/s_1 \lesssim n\epsilon_n^2/s$ and $s_2 \lesssim n\epsilon_n^2/s$, we have
\[
-\log\Big( \inf_{\|x\| \le \sqrt{m_n}E} \pi_\alpha(x) \Big) \le \frac{m_n}{2}\log(2\pi s_2) + \frac{m_n E^2}{2 s_1} + (1+\nu)\log p_n + \frac{s_2}{2} + O(1) \lesssim n\epsilon_n^2/s. \tag{14}
\]

Proof of Lemma 3 By assumption, $\min_{j \in \xi^*} \|\beta_j^*/\sigma^*\|/\sqrt{m_n} > \epsilon_n$, $\max_{j \in \xi^*} \|\beta_j^*/\sigma^*\|/\sqrt{m_n} < E$, and
\[
\pi_\alpha(x) = \int_0^\infty \int_0^\infty \phi_\lambda(x)\, p(\lambda \mid \psi)\, p_\alpha(\psi)\, d\psi\, d\lambda = \int_0^\infty \int_0^\infty (2\pi\lambda)^{-m_n/2}\, e^{-\|x\|^2/(2\lambda)}\, p(\lambda \mid \psi)\, p_\alpha(\psi)\, d\psi\, d\lambda,
\]
where the function of $\lambda$ in the integrand, $f(\lambda) \overset{def}{=} (2\pi\lambda)^{-m_n/2} e^{-\|x\|^2/(2\lambda)}$, is unimodal: it is increasing on $(0, \|x\|^2/m_n]$ and decreasing on $(\|x\|^2/m_n, \infty)$. We can then separate the
integral at $\|x\|^2/(m_n \log p_n)$, below which $f(\lambda)$ is increasing. Since
\[
\int_0^{\|x\|^2/(m_n \log p_n)} \int_0^\infty f(\lambda)\, p(\lambda \mid \psi)\, p_\alpha(\psi)\, d\psi\, d\lambda \le \bigg( \frac{2\pi\|x\|^2}{m_n \log p_n} \bigg)^{-m_n/2} e^{-m_n \log p_n/2} \tag{15}
\]
and
\[
\int_0^\infty \int_0^\infty f(\lambda)\, p(\lambda \mid \psi)\, p_\alpha(\psi)\, d\psi\, d\lambda \ge \int_{2\|x\|^2/(m_n \log p_n)}^{\|x\|^2/m_n} \int_0^\infty f(\lambda)\, p(\lambda \mid \psi)\, p_\alpha(\psi)\, d\psi\, d\lambda
\]
\[
\ge \bigg( \frac{2\pi\|x\|^2}{m_n} \bigg)^{-m_n/2} e^{-m_n \log p_n/4}\, P_\alpha\bigg( \frac{2\|x\|^2}{m_n \log p_n} < \lambda < \frac{\|x\|^2}{m_n} \bigg) \ge C_5 \bigg( \frac{2\pi\|x\|^2}{m_n} \bigg)^{-m_n/2} e^{-m_n \log p_n/4}\, \alpha\, e^{-\|x\|^2/(2m_n)} \tag{16}
\]
for a constant $C_5 > 0$ free of $n$, where the last inequality in (16) is obtained from the bound on $P_\alpha(s_1 < \lambda < s_2)$ established in the proof of Lemma 2. Then, for $m_n \log p_n \gtrsim n\epsilon_n^2$, we can prove that
\[
\frac{ \int_0^{\|x\|^2/(m_n \log p_n)} \int_0^\infty f(\lambda)\, p(\lambda \mid \psi)\, p_\alpha(\psi)\, d\psi\, d\lambda }{ \int_0^\infty \int_0^\infty f(\lambda)\, p(\lambda \mid \psi)\, p_\alpha(\psi)\, d\psi\, d\lambda } \le \frac{ (\log p_n)^{m_n/2}\, e^{-m_n \log p_n/2} }{ C_5\, e^{-m_n \log p_n/4}\, \alpha\, e^{-\|x\|^2/(2m_n)} }
\]
\[
= \exp\Big\{ -\log C_5 + \frac{m_n}{2}\log\log p_n - \frac{m_n \log p_n}{4} + (1+\nu)\log p_n + \frac{\|x\|^2}{2m_n} \Big\} \le \exp\{ -C_5'\, n\epsilon_n^2 \}
\]
for a constant $C_5' > 0$. The inequality on the ratio implies that, for a given $x$ such that $\|x\|/\sqrt{m_n} \in (\epsilon_n, E)$, the prior density of $x$ has its dominant mass on $\lambda \in (\|x\|^2/(m_n \log p_n), \infty)$:
\[
\frac{\pi_\alpha(x)}{ \int_{\|x\|^2/(m_n \log p_n)}^\infty \int_0^\infty \phi_\lambda(x)\, p(\lambda \mid \psi)\, p_\alpha(\psi)\, d\psi\, d\lambda } \asymp 1. \tag{17}
\]
From (17), we have the following result: for any $x_1$ and $x_2$ such that $\|x_1\| < \|x_2\|$, the contributions of $\lambda$ to $\pi_\alpha(x_1)$ and $\pi_\alpha(x_2)$ outside of the interval $\lambda \in (s_3, \infty)$ are negligible,
where $s_3 \overset{def}{=} \min\big( \|x_1\|^2/(m_n \log p_n),\ \|x_2\|^2/(m_n \log p_n) \big)$. Then we have
\[
l_n = \max_{j \in \xi^*} \sup_{\|x_1\|, \|x_2\| \in \|\beta_j^*/\sigma^*\| \pm C\sqrt{m_n}\,\epsilon_n} \frac{\pi_\alpha(x_1)}{\pi_\alpha(x_2)}
\asymp \max_{j \in \xi^*} \sup_{\|x_1\|, \|x_2\| \in \|\beta_j^*/\sigma^*\| \pm C\sqrt{m_n}\,\epsilon_n} \frac{ \int_{s_3}^\infty \int_0^\infty \phi_\lambda(x_1)\, p(\lambda \mid \psi)\, p_\alpha(\psi)\, d\psi\, d\lambda }{ \int_{s_3}^\infty \int_0^\infty \phi_\lambda(x_2)\, p(\lambda \mid \psi)\, p_\alpha(\psi)\, d\psi\, d\lambda }
\]
\[
= \max_{j \in \xi^*} \sup_{\|x_1\|, \|x_2\| \in \|\beta_j^*/\sigma^*\| \pm C\sqrt{m_n}\,\epsilon_n} \frac{ \int_{s_3}^\infty \int_0^\infty e^{\frac{1}{2\lambda}(\|x_2\|^2 - \|x_1\|^2)}\, \phi_\lambda(x_2)\, p(\lambda \mid \psi)\, p_\alpha(\psi)\, d\psi\, d\lambda }{ \int_{s_3}^\infty \int_0^\infty \phi_\lambda(x_2)\, p(\lambda \mid \psi)\, p_\alpha(\psi)\, d\psi\, d\lambda }
\]
\[
\le \max_{j \in \xi^*} \sup_{\|x_1\|, \|x_2\| \in \|\beta_j^*/\sigma^*\| \pm C\sqrt{m_n}\,\epsilon_n} \exp\Big\{ \frac{\|x_2\|^2 - \|x_1\|^2}{2 s_3} \Big\}
\le \max_{j \in \xi^*} \exp\Big\{ 2 m_n \big( \|\beta_j^*/\sigma^*\|/\sqrt{m_n} + C\epsilon_n \big)\, C\epsilon_n / s_3 \Big\}.
\]
Hence $s \log l_n \le C_6\, s\, m_n E\, \epsilon_n/s_3$ for a constant $C_6 > 0$. In view of $s\, m_n \epsilon_n \lesssim s_3 \log p_n$, it follows that $s \log l_n \lesssim \log p_n$.

Proof of Theorem 4.1 The proof of Theorem 4.1 follows the results of Theorem A.1 and Theorem A.2 in Song and Liang (2016); the difference is that we have an additional bias term in the true model,
\[
Y = \sum_{j=1}^{p_n} f_j^*(X_j) + \sigma^*\varepsilon = B(X)\beta^* + \sigma^*\delta + \sigma^*\varepsilon, \tag{18}
\]
where each additive function is approximated by an $m_n$-dimensional basis expansion. Therefore, $Y$ includes both a bias term and a linear-regression part analogous to the model in Song and Liang (2016). The coefficient vector $\beta^* = (\beta_1^{*T}, \dots, \beta_{p_n}^{*T})^T$ is a $p_n m_n$-dimensional basis-coefficient vector, on which the multivariate DL prior is imposed blockwise, one block per covariate. The true model is the additive non-parametric model $\sum_{j=1}^{p_n} f_j^*(X_j) + \sigma^*\varepsilon$, where the $f_j^*$ are $\kappa$-smooth functions for $j \in \xi^*$ and $f_j^* \equiv 0$ for $j \notin \xi^*$. In our proposed model, each function is estimated by the basis expansion $B(X_j)\beta_j$, which introduces the extra bias term $\delta$ in (18) due to the approximation error. The magnitude of $\delta$ is bounded by a multiple of $m_n^{-\kappa}$.
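The order-$m_n^{-\kappa}$ behavior of the basis-expansion bias can be seen in a small simulation. The sketch below is ours (it uses SciPy's B-spline design matrix and an arbitrary smooth test function, not anything from the paper): the per-observation error $\|\delta\|/\sqrt{n}$ of the least-squares projection onto $m_n$ cubic B-splines shrinks polynomially as $m_n$ grows.

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(size=2000))
f_true = np.sin(2 * np.pi * x)                 # a smooth (kappa-smooth) test function

for m_n in [5, 10, 20, 40]:
    degree = 3                                  # cubic B-splines, m_n basis functions
    interior = np.linspace(0, 1, m_n - degree + 1)
    knots = np.r_[[0.0] * degree, interior, [1.0] * degree]
    B = BSpline.design_matrix(x, knots, degree).toarray()   # n x m_n design matrix
    beta_hat, *_ = np.linalg.lstsq(B, f_true, rcond=None)   # least-squares projection
    delta = f_true - B @ beta_hat                           # basis-expansion bias
    print(f"m_n={m_n:3d}: ||delta||/sqrt(n) = {np.linalg.norm(delta)/np.sqrt(x.size):.2e}")
```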
For $B_n = \{\text{at least } \tilde{p} \text{ of the } \|\beta_j\|/\sigma \text{ are larger than } a_n\}$ with $m_n\tilde{p} \asymp n\epsilon_n^2/\log p_n$ and $m_n(\tilde{p}+s) \lesssim n\epsilon_n^2$,
\[
C_n = \Big\{ \Big\| B(X)\beta - \sum_{j=1}^{p_n} f_j^*(X_j) \Big\| \ge \tilde{c}\,\sigma^*\sqrt{n}\,\epsilon_n \Big\} \cup \big\{ \sigma^2/\sigma^{*2} > (1+\epsilon_n)/(1-\epsilon_n) \big\} \cup \big\{ \sigma^2/\sigma^{*2} < (1-\epsilon_n)/(1+\epsilon_n) \big\}
\]
and $A_n = B_n \cup C_n$, we first consider the test function $\phi_n = \max\{\phi_n', \phi_n''\}$, where
\[
\phi_n' = \max_{\{\xi :\, \xi \supseteq \xi^*,\ |\xi| \le \tilde{p}+s\}} 1\Big\{ \Big| \frac{Y^T(I - H_\xi)Y}{(n - m_n|\xi|)\,\sigma^{*2}} - 1 \Big| \ge \tilde{c}\,\epsilon_n/2 \Big\}, \tag{19}
\]
\[
\phi_n'' = \max_{\{\xi :\, \xi \supseteq \xi^*,\ |\xi| \le \tilde{p}+s\}} 1\Big\{ \Big\| B(X_\xi)\big(B(X_\xi)^T B(X_\xi)\big)^{-1} B(X_\xi)^T Y - \sum_{j \in \xi} f_j^*(X_j) \Big\| \ge \tilde{c}'\,\sigma^*\sqrt{n}\,\epsilon_n/5 \Big\} \tag{20}
\]
for sufficiently large constants $\tilde{c}$ and $\tilde{c}'$, where $H_\xi = B(X_\xi)\big(B(X_\xi)^T B(X_\xi)\big)^{-1} B(X_\xi)^T$ is the hat matrix corresponding to the subgroup $\xi$, and $\sum_{j \in \xi} f_j^*(X_j)$ is the sum of the true non-parametric functions for the covariates $X_j$ within subgroup $\xi$. For any $\xi$ that satisfies $\xi \supseteq \xi^*$ and $m_n|\xi| \le m_n(\tilde{p}+s) \lesssim n\epsilon_n^2$,
\[
E_{(f^*,\sigma^{*2})} 1\Big\{ \Big| \frac{Y^T(I - H_\xi)Y}{(n - m_n|\xi|)\,\sigma^{*2}} - 1 \Big| \ge \tilde{c}\,\epsilon_n/2 \Big\}
= E_{(f^*,\sigma^{*2})} 1\Big\{ \Big| \frac{\varepsilon^T(I - H_\xi)\varepsilon}{n - m_n|\xi|} - 1 + \frac{2\,\varepsilon^T(I - H_\xi)\delta}{n - m_n|\xi|} + \frac{\delta^T(I - H_\xi)\delta}{n - m_n|\xi|} \Big| \ge \tilde{c}\,\epsilon_n/2 \Big\}
\]
\[
\le P\Big( \big| \varepsilon^T(I - H_\xi)\varepsilon - (n - m_n|\xi|) \big| + 2\big| \varepsilon^T(I - H_\xi)\delta \big| + \delta^T(I - H_\xi)\delta \ge \tilde{c}\,(n - m_n|\xi|)\,\epsilon_n/2 \Big).
\]
Since the magnitude of the basis-expansion bias is of order $m_n^{-\kappa}$, we have
\[
\delta^T(I - H_\xi)\delta \le \|\delta\|^2 \lesssim n m_n^{-2\kappa} \lesssim (n - m_n|\xi|)\,\epsilon_n, \qquad 2\,\varepsilon^T(I - H_\xi)\delta \le 2\,\|\delta\|\,\|\varepsilon\| \lesssim 2\sqrt{n}\, m_n^{-\kappa}\, \|\varepsilon\|.
\]
For the normally distributed error $\varepsilon$, given the tail property of the chi-square distribution, we have
\[
P\big( 2\,\varepsilon^T(I - H_\xi)\delta \ge (n - m_n|\xi|)\,\tilde{c}\,\epsilon_n \big) \le P\Big( \|\varepsilon\| \ge c\sqrt{n\, m_n^{2\kappa}\, \epsilon_n^2} \Big) \le \exp\{-c'\, n\epsilon_n^2\}.
\]
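The chi-square tail bounds used here (and earlier in the proof of Lemma 1) are easy to check numerically. The sketch below compares the exact $\chi^2_n$ tail with the sub-exponential bound $P(\chi^2_n \ge n(1+c)) \le e^{-nc^2/8}$ for $0 < c \le 1$; we quote that bound from the standard Laurent-Massart inequality, not from the paper itself.

```python
import numpy as np
from scipy.stats import chi2

# Exact chi-square_n tail at n(1+c) versus exp(-n c^2 / 8), the sub-exponential
# bound implied by the Laurent-Massart inequality for 0 < c <= 1.
for n, c in [(500, 0.1), (2000, 0.1), (2000, 0.2)]:
    tail = chi2.sf(n * (1 + c), df=n)
    bound = np.exp(-n * c**2 / 8)
    print(f"n={n:5d}, c={c}: P(chi2_n >= n(1+c)) = {tail:.2e} <= {bound:.2e}")
```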
Combining with the conclusion in (A.3 of Song and Liang (216, we can prove that { } Y T (I H ξ Y E (f,σ 2 1 (n m n ξ σ 1 2 c ɛ n exp{ c 3 nɛ 2 n}. (21 Next, fj (X j can be approximated by B-spline basis expansion B(X j β j with bias term δ, where β j is the B-spline basis coefficients for the true non-parametric function. We have pn B(Xβ fj (X j = B(X(β β δ B(X ξ (β ξ β ξ + B(X ξ cβ ξ c + δ, where B(Xξ (β ξ β ξ n βξ β ξ, B(Xξ cβ ξ c nan p n nɛ n and δ nɛn. By the condition that m n ( p + s nɛ 2 n, we have, for some constant c 3, E (f,σ 2 1 { B(X ξ ( B(X ξ T B(X ξ 1 B(Xξ T Y j ξ f j (X j c σ nɛ n /5 } =E (f,σ 2 1 { B(Xξ (β ξ β ξ ( fj (X j B(X ξ β ξ c σ nɛ n /5 } j ξ E (f,σ 2 1 { ( B(X ξ T B(X ξ 1 B(Xξ T Y β ξ c σ ɛ n /5 } E (f,σ 2 1 { B(Xξ ( B(X ξ T B(X ξ 1 B(Xξ T ε + ( B(X ξ T B(X ξ 1 B(Xξ T δ c ɛ n /5 } exp( c 3nɛ 2 n, (22 where the inequality is given by (A.4 of Song and Liang (216 and the fact that δ nm κ n. Therefore, combining (21 and (22, we can prove that E (f,σ 2 φ n exp( c 3 nɛ 2 n for some fixed c 3, according to (A.5 in Song and Liang (216. For the additive model, the process to prove that sup (β,σ 2 C n E β (1 φ n is similar to the proof of Theorem A.1 in Song and Liang (216, by replacing their covariate matrix X with 8
B(X. Therefore, we have inequalities E (f,σ 2 φ n exp( cnɛ 2 n, (23 sup E (β,σ 2 (1 φ n exp( c nɛ 2 n. (24 (β,σ 2 C n The next step is to show that lim n { m(dn } P L (D n exp( c 4nɛ 2 n > 1 exp{ c 5 nɛ 2 n}, (25 for any positive c 4, where m(d n is the marginal likelihood of the data after integrating out the parameters on their priors and L (D n is the likelihood under the truth so that m(d n L (D n = (σ n exp{ Y B(Xβ 2 /2σ 2 } σ n exp{ Y p n f j (X j 2 /σ 2 } π(β, σ2 dβdσ 2. From the proof of Theorem A.1 in Song and Liang (216, we only need to show that, conditional on σ 2 σ 2 η ɛ 2 n, for some constants c 5 and c 6, ( ( P π Y B(Xβ 2 /2σ 2 < p n Y fj (X j 2 / σ 2 + c 5 nɛ 2 n e c 6nɛ 2 n 1 e c 5nɛ 2 n For our proposed model, each element in B(X is smaller than 1 and δ 2 nm 2κ n nɛ 2 n. By the property of chi-square distribution and normal distribution, with probability larger than 1 exp{ c 5 nɛ 2 n}, we have ε 2 n(1+c 5, ε T B(X c 5nɛ n for constant c 5. Then 9
for the model parameters $(\beta, \sigma^2)$, we have
\[
\Big\{ \|Y - B(X)\beta\|^2/(2\sigma^2) < \big\|Y - \big( f_1^*(X_1) + \dots + f_{p_n}^*(X_{p_n}) \big)\big\|^2/(2\sigma^{*2}) + c_5'\, n\epsilon_n^2 \Big\}
= \Big\{ \big\| \varepsilon\sigma^*/\sigma + B(X)(\beta^* - \beta)/\sigma + \delta\sigma^*/\sigma \big\|^2 < \|\varepsilon\|^2 + 2c_5'\, n\epsilon_n^2 \Big\}
\]
\[
\supseteq \Big\{ \big\| B(X)(\beta - \beta^*)/\sigma \big\| < \sqrt{ \|\varepsilon\|^2 + 2c_5'\, n\epsilon_n^2 } - (\sigma^*/\sigma)\,\|\varepsilon\| - (\sigma^*/\sigma)\,\|\delta\| \Big\}
\supseteq \Big\{ \sum_{j=1}^{p_n} \big\| (\beta_j - \beta_j^*)/\sigma \big\| \le 2\tilde{\eta}\sqrt{m_n}\,\epsilon_n \Big\}
\]
\[
\supseteq \Big\{ \beta_j/\sigma \in \big[ \beta_j^*/\sigma - \tilde{\eta}\sqrt{m_n}\,\epsilon_n/s,\ \beta_j^*/\sigma + \tilde{\eta}\sqrt{m_n}\,\epsilon_n/s \big] \text{ for all } j \in \xi^* \Big\} \cap \Big\{ \sum_{j \notin \xi^*} \|\beta_j/\sigma\| \le \tilde{\eta}\sqrt{m_n}\,\epsilon_n \Big\}
\]
for some constant $\tilde{\eta}$. The second subset relation is given by Assumption $A_1$(3), under which the eigenvalues of $B(X)^T B(X)$ are of the same order as $n/m_n$. By Lemma 1 and Lemma 2,
\[
\pi_\alpha\Big( \sum_{j \notin \xi^*} \|\beta_j/\sigma\| \le \tilde{\eta}\sqrt{m_n}\,\epsilon_n \;\Big|\; |\sigma^2 - \sigma^{*2}| \le \tilde{\eta}\,\epsilon_n^2 \Big) = \int_{\sum_{j=1}^{p_n-s} \|x_j\| \le \tilde{\eta}\sqrt{m_n}\,\epsilon_n} \prod_{j=1}^{p_n-s} \pi_\alpha(x_j)\, dx_1 \cdots dx_{p_n-s} \ge \exp\{-c_6'\, n\epsilon_n^2\},
\]
\[
\pi_\alpha\Big( \beta_j/\sigma \in \big[ \beta_j^*/\sigma - \tilde{\eta}\sqrt{m_n}\,\epsilon_n/s,\ \beta_j^*/\sigma + \tilde{\eta}\sqrt{m_n}\,\epsilon_n/s \big] \;\Big|\; |\sigma^2 - \sigma^{*2}| \le \tilde{\eta}\,\epsilon_n^2 \Big) \ge \tilde{\eta}\,\epsilon_n \inf_{\|x\| \le \sqrt{m_n}E} \pi_\alpha(x)/s \ge \exp\{-c_6'\, n\epsilon_n^2\}.
\]
These bounds verify the inequality in (25). Finally, from the proof of Theorem A.1 in Song and Liang (2016), for $B_n$ defined above we have $P(B_n) < e^{-c_7\, n\epsilon_n^2}$ for some constant $c_7$, given that $\int_{\|x\| > a_n} \pi_\alpha(x)\, dx \lesssim p_n^{-(1+\mu)}$. Therefore, combining $P(B_n) < e^{-c_7\, n\epsilon_n^2}$ with (23), (24) and (25), we have
\[
P\bigg( \pi\Big( \Big\| B(X)\beta - \sum_{j=1}^{p_n} f_j^*(X_j) \Big\| \ge \sigma^*\sqrt{n}\,\epsilon_n \ \text{or}\ |\sigma^2 - \sigma^{*2}| > c_4\,\epsilon_n \;\Big|\; X, Y \Big) \ge e^{-c_1\, n\epsilon_n^2} \bigg) \le e^{-c_2\, n\epsilon_n^2}.
\]
In conclusion, the result of Theorem 4.1 is proved.
Proof of Theorem 4.2 The proof of Theorem 4.2 follows the proof of Theorem A.3 in Song and Liang (2016). Note that in our proposed model we have the basis-expansion matrix $B(X)$ instead of the covariate matrix $X$, and the coefficient vector $\beta = (\beta_1^T, \dots, \beta_{p_n}^T)^T$ is shrunk within the blocks $\beta_1, \dots, \beta_{p_n}$. Furthermore, our model has the additional bias term $\delta$ due to the B-spline basis expansion. We first define the set $S_1 = \{ \|\beta - \beta^*\| \le c_1\epsilon_n,\ |\sigma^2 - \sigma^{*2}| \le c_2\epsilon_n \}$, and
\[
\underline{\pi}(\beta \mid \sigma^2) = \inf_{(\beta,\sigma^2) \in S_1} \pi(\beta, \sigma^2)/\pi(\sigma^2), \qquad \bar{\pi}(\beta \mid \sigma^2) = \sup_{(\beta,\sigma^2) \in S_1} \pi(\beta, \sigma^2)/\pi(\sigma^2).
\]
The maximum of the $L_2$-norms of the blocks of the coefficient vector is defined as $\|\beta\|_{\max} = \max_j \|\beta_j\|$. By the proof of Theorem A.3 in Song and Liang (2016), we can conclude that, with $s' = m_n|\xi'|$,
\[
P(\xi = \xi' \mid X, Y) \lesssim \bar{\pi}\big(\beta_{\xi'} \mid \sigma^2\big)\, \big(\sqrt{2\pi}\big)^{s'}\, \det\big(B(X_{\xi'})^T B(X_{\xi'})\big)^{-1/2}\, \frac{ \Gamma\big(a + (n - s')/2\big) }{ \inf_{\|\beta_{(\xi')^c}\|_{\max} \le a_n(\sigma^* + c_2\epsilon_n)} \big( \mathrm{SSE}(\beta_{(\xi')^c})/2 + b \big)^{a + (n - s')/2} }, \tag{26}
\]
where $\mathrm{SSE}(\beta_{(\xi')^c})$ is the sum of squared errors when $\beta_{(\xi')^c}$ is given and fixed. Next, we let $\|\beta\|_{\min} = \min_j \|\beta_j\|$ be the minimum of the $L_2$-norms of the blocks of the coefficient vector. Similarly, with $s^* = m_n s$, we can show that
\[
P(\xi = \xi^* \mid X, Y) \gtrsim \underline{\pi}\big(\beta_{\xi^*} \mid \sigma^2\big)\, \big(\sqrt{2\pi}\big)^{s^*}\, \det\big(B(X_{\xi^*})^T B(X_{\xi^*})\big)^{-1/2}\, \frac{ \Gamma\big(a + (n - s^*)/2\big) }{ \sup_{\|\beta_{(\xi^*)^c}\|_{\max} \le a_n(\sigma^* + c_2\epsilon_n)} \big( \mathrm{SSE}(\beta_{(\xi^*)^c})/2 + b \big)^{a + (n - s^*)/2} },
\]
where $\mathrm{SSE}(\beta_{(\xi^*)^c})$ is the sum of squared errors when $\beta_{(\xi^*)^c}$ is given and fixed. Using the fact that the basis-expansion bias $\delta$ satisfies $\|\delta\|^2 \lesssim n m_n^{-2\kappa}$, we can follow the derivation of (A.13) and (A.14) of Song and Liang (2016). With probability larger than
$1 - 4 p_n\, p_n^{-c_6'}$,
\[
\mathrm{SSE}(\beta_{(\xi')^c}) = \big(Y - B(X_{(\xi')^c})\beta_{(\xi')^c}\big)^T (I - H_{\xi'}) \big(Y - B(X_{(\xi')^c})\beta_{(\xi')^c}\big)
\le \sigma^{*2}\, \varepsilon^T(I - H_{\xi'})\varepsilon + \sigma^{*2}\, \delta^T(I - H_{\xi'})\delta + \big\| B(X_{(\xi')^c})\beta_{(\xi')^c} \big\|^2 + 2\sigma^*\sqrt{2 c_6'\, n \log p_n} \sum_{j \in (\xi')^c} \|\beta_j\|; \tag{27}
\]
\[
\mathrm{SSE}(\beta_{(\xi')^c}) \ge \sigma^{*2}\, \varepsilon^T(I - H_{\xi'})\varepsilon + \sigma^{*2}\, \delta^T(I - H_{\xi'})\delta - 2\sigma^*\sqrt{2 c_6'\, n \log p_n} \sum_{j \in (\xi')^c} \|\beta_j\|, \tag{28}
\]
and with probability $1 - p_n^{-c_6'}$ for any constant $c_6'$, we have
\[
\big| \varepsilon^T(I - H_{\xi'})\varepsilon - \varepsilon^T(I - H_{\xi^*})\varepsilon \big| \le c_7'\, m_n\, |\xi' \setminus \xi^*| \log p_n.
\]
Since $\big| \delta^T(I - H_{\xi'})\delta - \delta^T(I - H_{\xi^*})\delta \big| \le c_7'\, m_n\, |\xi' \setminus \xi^*|\, n m_n^{-2\kappa}$, we can conclude that
\[
\frac{ \sup_{\|\beta_{(\xi^*)^c}\|_{\max} \le a_n(\sigma^* + c_2\epsilon_n)} \big( \mathrm{SSE}(\beta_{(\xi^*)^c})/2 + b \big)^{a + (n - s^*)/2} }{ \inf_{\|\beta_{(\xi')^c}\|_{\max} \le a_n(\sigma^* + c_2\epsilon_n)} \big( \mathrm{SSE}(\beta_{(\xi')^c})/2 + b \big)^{a + (n - s')/2} } \le \exp\big\{ c_8'\, m_n\, |\xi' \setminus \xi^*| \log p_n + c_8'\, m_n\, |\xi' \setminus \xi^*|\, n m_n^{-2\kappa} \big\}. \tag{29}
\]
If $m_n \gtrsim (n/\log p_n)^{1/(2\kappa)}$, we have $n m_n^{-2\kappa} \lesssim \log p_n$, so the right-hand side of (29) is bounded by $\exp\{c_8\, m_n\, |\xi' \setminus \xi^*| \log p_n\}$ for some constant $c_8 > 0$. Given (6) in Lemma 3, $\bar{\pi}(\beta_{\xi^*} \mid \sigma^2)/\underline{\pi}(\beta_{\xi^*} \mid \sigma^2) \le l_n^s$. We can proceed with
\[
\frac{P(\xi = \xi' \mid X, Y)}{P(\xi = \xi^* \mid X, Y)} \le l_n^s\, \frac{\sigma^* + c_2\epsilon_n}{\sigma^* - c_2\epsilon_n}\, \bigg[ \frac{p_n^{-(1+\mu)}}{1 - p_n^{-(1+\mu)}} \bigg]^{|\xi' \setminus \xi^*|} \exp\{ c_8\, m_n\, |\xi' \setminus \xi^*| \log p_n \} \le l_n^s\, \frac{\sigma^* + c_2\epsilon_n}{\sigma^* - c_2\epsilon_n}\, p_n^{-\mu'\, m_n\, |\xi' \setminus \xi^*|},
\]
with a proper choice of the constants. Therefore, with $\min_{j \in \xi^*} \|\beta_j^*\| \gtrsim \sqrt{m_n}\,\epsilon_n$, following the proof of Theorem A.2 in Song and
Liang (2016), we can prove that
\[
\sum_{\xi' \not\supseteq \xi^*} P(\xi = \xi' \mid X, Y) \le \pi\Big( \|\beta - \beta^*\| > \min_{j \in \xi^*} \|\beta_j^*\| - a_n\sigma \;\Big|\; X, Y \Big) \le e^{-c\, n\epsilon_n^2}.
\]
In conclusion, we have
\[
P\Big\{ \pi\big( \xi(a_n) = \xi^* \mid X, Y \big) > 1 - p_n^{-\mu'} \Big\} > 1 - p_n^{-\mu''}
\]
for some constants $\mu', \mu'' > 0$.
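As a final illustration of the thresholded selector $\xi(a_n) = \{ j : \|\beta_j\|/\sigma > a_n \}$ appearing in this conclusion (our reading of the notation), the sketch below shows how the selected set could be computed from posterior draws; the function, its arguments, and the synthetic data are all hypothetical.

```python
import numpy as np

def selected_groups(beta_draws, sigma_draws, a_n, prob=0.5):
    """Estimate xi(a_n) = {j : ||beta_j|| / sigma > a_n} from MCMC output.

    beta_draws:  shape (n_draws, p_n, m_n), one coefficient block beta_j per covariate
    sigma_draws: shape (n_draws,)
    Returns indices whose posterior exceedance frequency is above `prob`.
    """
    norms = np.linalg.norm(beta_draws, axis=2) / sigma_draws[:, None]
    exceedance = (norms > a_n).mean(axis=0)   # Monte Carlo estimate per covariate
    return np.flatnonzero(exceedance > prob)

# Toy usage with synthetic draws: three "active" blocks out of twenty.
rng = np.random.default_rng(3)
draws = rng.normal(scale=0.01, size=(1000, 20, 5))
draws[:, :3, :] += 1.0
print(selected_groups(draws, np.ones(1000), a_n=0.5))   # -> [0 1 2]
```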