Dimension-free PAC-Bayesian bounds for matrices, vectors, and linear least squares regression.

Olivier Catoni (CREST, CNRS UMR 9194, Université Paris Saclay, France; e-mail: olivier.catoni@ensae.fr) and Ilaria Giulini (Laboratoire de Probabilités et Modèles Aléatoires, Université Paris Diderot, France; e-mail: giulini@math.univ-paris-diderot.fr)

December 31, 2017

Abstract: This paper is focused on dimension-free PAC-Bayesian bounds, under weak polynomial moment assumptions, allowing for heavy tailed sample distributions. It covers the estimation of the mean of a vector or a matrix, with applications to least squares linear regression. Special efforts are devoted to the estimation of the Gram matrix, due to its prominent role in high-dimension data analysis.

Key words: PAC-Bayesian bounds, sub-Gaussian mean estimator, random vector, random matrix, least squares regression, dimension-free bounds.

MSC2010: 62J10, 62J05, 62H12, 62H20, 62F35, 15B52.

1 Introduction

The subject of this paper is to discuss dimension-free PAC-Bayesian bounds for matrices and vectors. It comes after Catoni [2016] and Giulini [2017a], the first paper discussing dimension dependent bounds and the second one dimension-free bounds, under a kurtosis like assumption about the data distribution. Here, in contrast, we envision even weaker assumptions, and focus on dimension-free bounds only.

Our main objective is the estimation of the mean of a random vector and of a random matrix. Finding sub-Gaussian estimators for the mean of a non necessarily sub-Gaussian random vector has been the subject of much research in the last few years, with important contributions from Joly, Lugosi and Oliveira [2017], Lugosi and Mendelson [2017] and Minsker [2015]. While in Joly, Lugosi and Oliveira [2017] the statistical error bound still has a residual dependence on the dimension of the ambient space, in Lugosi and Mendelson [2017] this dependence is removed, for an estimator of the median of means type. However, this estimator is not easy to compute and the bound contains large constants. We propose here another type of estimator, that can be seen as a multidimensional extension of Catoni [2012]. It provides a nonasymptotic confidence region with the same diameter (including the values of the constants) as the Gaussian concentration inequality stated in equation (1.1) of Lugosi and Mendelson [2017], although in our case, the confidence region is not necessarily a ball, but still a convex set. The Gaussian bound concerns the estimation of the expectation of a Gaussian random vector by the mean of an i.i.d. sample, whereas in our case, we only assume that the variance is finite, a much weaker hypothesis.

In Minsker [2016], the question of estimating the mean of a random matrix is addressed. The author uses exponential matrix inequalities in order to extend Catoni [2012] to matrices and to control the operator norm of the error. In the bounds at confidence level $1 - \delta$, the complexity term is multiplied by $\log(\delta^{-1})$. Here, we extend Catoni [2012] using PAC-Bayesian bounds to measure complexity, and define an estimator with a bound where the term $\log(\delta^{-1})$ is multiplied by some directional variance term only, and not by the complexity factor, that is larger.

After recalling in Section 2 the PAC-Bayesian inequality that will be at the heart of many of our proofs, we deal successively with the estimation of a random vector (Section 3) and of a random matrix (Section 4). Section 6 is devoted to the estimation of the Gram matrix, due to its prominent role in multidimensional data analysis. In Section 7 we introduce some applications to least squares regression.

2 Some well known PAC-Bayesian inequality

This is a preliminary section, where we state the PAC-Bayesian inequality that we will use throughout this paper to obtain deviation inequalities holding uniformly with respect to some parameter.

Consider a random variable $X \in \mathcal{X}$ and a measurable parameter space $\Theta$. Let $\mu \in \mathcal{M}_+^1(\Theta)$ be a probability measure on $\Theta$ and $f : \Theta \times \mathcal{X} \to \mathbb{R}$ a bounded measurable function. For any other probability measure $\rho$ on $\Theta$, define the Kullback divergence function $K(\rho, \mu)$ as usual by the formula
\[
K(\rho, \mu) = \begin{cases} \displaystyle \int \log \Bigl( \frac{d\rho}{d\mu} \Bigr) \, d\rho, & \rho \ll \mu, \\ + \infty, & \text{otherwise.} \end{cases}
\]
Let $(X_1, \dots, X_n)$ be $n$ independent copies of $X$.

Proposition 2.1 For any $\delta \in \, ]0,1[$, with probability at least $1 - \delta$, for any probability measure $\rho \in \mathcal{M}_+^1(\Theta)$,
\[
\frac{1}{n} \sum_{i=1}^n \int f(\theta', X_i) \, d\rho(\theta') \leq \int \log \bigl[ \mathbb{E} \exp f(\theta', X) \bigr] \, d\rho(\theta') + \frac{K(\rho, \mu) + \log(\delta^{-1})}{n}.
\]

Proof. It is a consequence of equation (5.2.1) page 159 of Catoni [2004]. Indeed, let us recall the identity
\[
\log \int \exp \bigl( h(\theta') \bigr) \, d\mu(\theta') = \sup_{\rho} \Bigl\{ \int h(\theta') \, d\rho(\theta') - K(\rho, \mu) \Bigr\},
\]

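where $h$ may be any bounded measurable function (extensions to unbounded $h$ are possible but will not be required in this paper), and where the supremum in $\rho$ is taken on all probability measures on the measurable parameter space $\Theta$. The proof may be found in Catoni [2004, page 159]. Combined with Fubini's lemma, it yields
\[
\begin{aligned}
\mathbb{E} \Bigl\{ \exp \sup_{\rho} \Bigl[ \int \Bigl( \sum_{i=1}^n f(\theta', X_i) & - n \log \bigl[ \mathbb{E} \exp f(\theta', X) \bigr] \Bigr) d\rho(\theta') - K(\rho, \mu) \Bigr] \Bigr\} \\
& = \mathbb{E} \int \exp \Bigl( \sum_{i=1}^n f(\theta', X_i) - n \log \bigl[ \mathbb{E} \exp f(\theta', X) \bigr] \Bigr) d\mu(\theta') \\
& = \int \mathbb{E} \exp \Bigl( \sum_{i=1}^n f(\theta', X_i) - n \log \bigl[ \mathbb{E} \exp f(\theta', X) \bigr] \Bigr) d\mu(\theta') \\
& = \int \prod_{i=1}^n \frac{\mathbb{E} \bigl[ \exp f(\theta', X_i) \bigr]}{\mathbb{E} \bigl[ \exp f(\theta', X) \bigr]} \, d\mu(\theta') = 1.
\end{aligned}
\]
Since $\mathbb{E} \exp(W) \leq 1$ implies that
\[
\mathbb{P} \bigl( W \geq \log(\delta^{-1}) \bigr) = \mathbb{E} \bigl[ \mathbb{1} \bigl( \delta \exp(W) \geq 1 \bigr) \bigr] \leq \mathbb{E} \bigl[ \delta \exp(W) \bigr] \leq \delta,
\]
we obtain the desired result, considering
\[
W = \sup_{\rho} \Bigl[ \int \Bigl( \sum_{i=1}^n f(\theta', X_i) - n \log \bigl[ \mathbb{E} \exp f(\theta', X) \bigr] \Bigr) d\rho(\theta') - K(\rho, \mu) \Bigr].
\]

3 Estimation of the mean of a random vector

Let $X \in \mathbb{R}^d$ be a random vector and let $(X_1, \dots, X_n)$ be $n$ independent copies of $X$. In this section, we will estimate the mean $\mathbb{E}(X)$ and obtain dimension-free non-asymptotic bounds for the estimation error.

Let $S_d = \{ \theta \in \mathbb{R}^d : \lVert \theta \rVert = 1 \}$ be the unit sphere of $\mathbb{R}^d$ and let $I_d$ be the identity matrix of size $d \times d$. Let $\rho_\theta = \mathcal{N}(\theta, \beta^{-1} I_d)$ be the normal distribution centered at $\theta \in \mathbb{R}^d$, whose covariance matrix is $\beta^{-1} I_d$, where $\beta$ is a positive real parameter.

Instead of estimating directly the mean vector $\mathbb{E}(X)$, our strategy will be rather to estimate its component $\langle \theta, \mathbb{E}(X) \rangle$ in each direction $\theta \in S_d$ of the unit sphere. For this, we introduce the estimator of $\langle \theta, \mathbb{E}(X) \rangle$ defined as
\[
\mathcal{E}(\theta) = \frac{1}{n \lambda} \sum_{i=1}^n \int \psi \bigl( \lambda \langle \theta', X_i \rangle \bigr) \, d\rho_\theta(\theta'), \qquad \theta \in S_d, \ \lambda > 0,
\]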

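where $\psi$ is the symmetric influence function
\[
\psi(t) = \begin{cases} t - t^3/6, & - \sqrt{2} \leq t \leq \sqrt{2}, \\ 2 \sqrt{2} / 3, & t > \sqrt{2}, \\ - 2 \sqrt{2} / 3, & t < - \sqrt{2}, \end{cases} \tag{1}
\]
and where the positive constants $\lambda$ and $\beta$ will be chosen afterward. As stated in the following lemma, we chose this influence function because it is close to the identity in a neighborhood of zero and is such that $\exp ( \psi(t) )$ is bounded by polynomial functions.

Lemma 3.1 For any $t \in \mathbb{R}$,
\[
- \log \bigl( 1 - t + t^2/2 \bigr) \leq \psi(t) \leq \log \bigl( 1 + t + t^2/2 \bigr).
\]

Proof. Put $f(t) = \log ( 1 + t + t^2/2 )$. Remark that
\[
f'(t) = \frac{1 + t}{1 + t + t^2/2}, \qquad t \in \mathbb{R},
\]
and that $\psi'(t) = 1 - t^2/2$ for $t \in [- \sqrt{2}, \sqrt{2}]$. As $\psi(0) = f(0) = 0$ and
\[
\bigl[ f'(t) - \psi'(t) \bigr] \bigl( 1 + t + t^2/2 \bigr) = \frac{t^3 (2 + t)}{4},
\]
we get $\psi'(t) \leq f'(t)$ for $0 \leq t \leq \sqrt{2}$ and $\psi'(t) \geq f'(t)$ for $- \sqrt{2} \leq t \leq 0$, and therefore
\[
\psi(t) \leq f(t), \qquad - \sqrt{2} \leq t \leq \sqrt{2}.
\]
Since $f$ is increasing on $[\sqrt{2}, + \infty[$ and decreasing on $] - \infty, - \sqrt{2}]$, while $\psi$ is constant on these two intervals, the above inequality can be extended to all $t \in \mathbb{R}$. From the symmetry $\psi(-t) = - \psi(t)$, we deduce the converse inequality
\[
- f(-t) \leq \psi(t), \qquad t \in \mathbb{R},
\]
that ends the proof.

Since $\lambda \langle \theta', X_i \rangle$ follows, under $\rho_\theta$, a normal distribution with mean $\lambda \langle \theta, X_i \rangle$ and standard deviation $\lambda \beta^{-1/2} \lVert X_i \rVert$, and since the influence function $\psi$ is piecewise polynomial, the estimator $\mathcal{E}$ can be computed explicitly in terms of the standard normal distribution function. This is done in the following lemma.

Lemma 3.2 Let $W \sim \mathcal{N}(0, 1)$ be a standard Gaussian real valued random variable. For any $m \in \mathbb{R}$ and any $\sigma \in \mathbb{R}_+$, define
\[
\varphi(m, \sigma) = \mathbb{E} \bigl[ \psi ( m + \sigma W ) \bigr].
\]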

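The function $\varphi$ can be computed as
\[
\varphi(m, \sigma) = m \bigl( 1 - \sigma^2/2 \bigr) - m^3/6 + r(m, \sigma),
\]
where, introducing $F(a) = \mathbb{P}(W \leq a)$, $a \in \mathbb{R}$, and writing $a_{\pm} = ( \sqrt{2} \pm m ) / \sigma$ for short, the correction term $r$ is
\[
\begin{aligned}
r(m, \sigma) = {} & \frac{2 \sqrt{2}}{3} \bigl[ F(- a_-) - F(- a_+) \bigr] - \bigl( m - m^3/6 \bigr) \bigl[ F(- a_+) + F(- a_-) \bigr] \\
& + \frac{\sigma ( 1 - m^2/2 )}{\sqrt{2 \pi}} \Bigl[ \exp \bigl( - \tfrac{1}{2} a_+^2 \bigr) - \exp \bigl( - \tfrac{1}{2} a_-^2 \bigr) \Bigr] \\
& + \frac{m \sigma^2}{2} \Bigl\{ F(- a_+) + F(- a_-) + \frac{1}{\sqrt{2 \pi}} \Bigl[ a_+ \exp \bigl( - \tfrac{1}{2} a_+^2 \bigr) + a_- \exp \bigl( - \tfrac{1}{2} a_-^2 \bigr) \Bigr] \Bigr\} \\
& + \frac{\sigma^3}{6 \sqrt{2 \pi}} \Bigl\{ \bigl[ a_-^2 + 2 \bigr] \exp \bigl( - \tfrac{1}{2} a_-^2 \bigr) - \bigl[ a_+^2 + 2 \bigr] \exp \bigl( - \tfrac{1}{2} a_+^2 \bigr) \Bigr\}.
\end{aligned}
\]
Remark that the correction term is small when $\lvert m \rvert$ is small and $\sigma$ is small, since
\[
1 - F(t) \leq \min \Bigl\{ \frac{1}{t \sqrt{2 \pi}}, \frac{1}{2} \Bigr\} \exp \Bigl( - \frac{t^2}{2} \Bigr), \qquad t \in \mathbb{R}_+.
\]

Proof. The proof of this lemma is a simple computation based on the expression
\[
\psi(t) = \Bigl( t - \frac{t^3}{6} \Bigr) \bigl[ \mathbb{1}( t \leq \sqrt{2} ) - \mathbb{1}( t \leq - \sqrt{2} ) \bigr] + \frac{2 \sqrt{2}}{3} \bigl[ 1 - \mathbb{1}( t \leq \sqrt{2} ) - \mathbb{1}( t \leq - \sqrt{2} ) \bigr], \qquad t \in \mathbb{R},
\]
on the identities
\[
\begin{aligned}
\mathbb{E} \bigl[ \mathbb{1}( W \leq a ) \bigr] & = F(a), &
\mathbb{E} \bigl[ \mathbb{1}( W \leq a ) W \bigr] & = - \frac{1}{\sqrt{2 \pi}} \exp \Bigl( - \frac{a^2}{2} \Bigr), \\
\mathbb{E} \bigl[ \mathbb{1}( W \leq a ) W^2 \bigr] & = F(a) - \frac{a}{\sqrt{2 \pi}} \exp \Bigl( - \frac{a^2}{2} \Bigr), &
\mathbb{E} \bigl[ \mathbb{1}( W \leq a ) W^3 \bigr] & = - \frac{a^2 + 2}{\sqrt{2 \pi}} \exp \Bigl( - \frac{a^2}{2} \Bigr),
\end{aligned}
\]
and on the fact that $F(-t) = 1 - F(t)$.

Accordingly, the estimator $\mathcal{E}$ can be computed as
\[
\begin{aligned}
\mathcal{E}(\theta) & = \frac{1}{n \lambda} \sum_{i=1}^n \varphi \bigl( \lambda \langle \theta, X_i \rangle, \lambda \beta^{-1/2} \lVert X_i \rVert \bigr) \\
& = \frac{1}{n} \sum_{i=1}^n \Bigl[ \langle \theta, X_i \rangle \Bigl( 1 - \frac{\lambda^2 \lVert X_i \rVert^2}{2 \beta} \Bigr) - \frac{\lambda^2 \langle \theta, X_i \rangle^3}{6} \Bigr] + \frac{1}{n \lambda} \sum_{i=1}^n r \bigl( \lambda \langle \theta, X_i \rangle, \lambda \beta^{-1/2} \lVert X_i \rVert \bigr).
\end{aligned}
\]

3.1 Estimation without centering

Proposition 3.3 Assume that
\[
\mathbb{E} \bigl( \lVert X \rVert^2 \bigr) = \operatorname{Tr} \bigl[ \mathbb{E} ( X X^\top ) \bigr] \leq T < \infty
\quad \text{and} \quad
\sup_{\theta \in S} \mathbb{E} \bigl( \langle \theta, X \rangle^2 \bigr) \leq v \leq T < \infty,
\]
where $T$ and $v$ are two known constants and where $S \subset S_d$ is an arbitrary symmetric subset of the unit sphere, meaning that if $\theta \in S$, then $- \theta \in S$. Choose any confidence parameter $\delta \in \, ]0,1[$ and set the constants $\lambda$ and $\beta$ used in the definition of the estimator $\mathcal{E}$ to
\[
\lambda = \sqrt{\frac{2 \log(\delta^{-1})}{n v}}, \qquad \beta = \sqrt{n T} \, \lambda = \sqrt{\frac{2 T \log(\delta^{-1})}{v}}.
\]
Non asymptotic confidence region: with probability at least $1 - \delta$,
\[
\sup_{\theta \in S} \bigl\lvert \mathcal{E}(\theta) - \langle \theta, \mathbb{E}(X) \rangle \bigr\rvert \leq \sqrt{\frac{T}{n}} + \sqrt{\frac{2 v \log(\delta^{-1})}{n}}.
\]
Consider an estimator $\widehat{m} \in \mathbb{R}^d$ of $\mathbb{E}(X)$ satisfying
\[
\sup_{\theta \in S} \bigl\lvert \mathcal{E}(\theta) - \langle \theta, \widehat{m} \rangle \bigr\rvert \leq \sqrt{\frac{T}{n}} + \sqrt{\frac{2 v \log(\delta^{-1})}{n}}.
\]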

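With probability at least $1 - \delta$, such a vector exists and
\[
\sup_{\theta \in S} \bigl\lvert \langle \theta, \widehat{m} - \mathbb{E}(X) \rangle \bigr\rvert \leq \sup_{\theta \in S} \bigl\lvert \mathcal{E}(\theta) - \langle \theta, \widehat{m} \rangle \bigr\rvert + \sqrt{\frac{T}{n}} + \sqrt{\frac{2 v \log(\delta^{-1})}{n}} \leq 2 \sqrt{\frac{T}{n}} + 2 \sqrt{\frac{2 v \log(\delta^{-1})}{n}}.
\]

Remark 3.1 In particular, in the case when $S = S_d$ is the whole unit sphere, we obtain with probability at least $1 - \delta$ the bound
\[
\lVert \widehat{m} - \mathbb{E}(X) \rVert = \sup_{\theta \in S_d} \langle \theta, \widehat{m} - \mathbb{E}(X) \rangle \leq 2 \Biggl( \sqrt{\frac{T}{n}} + \sqrt{\frac{2 v \log(\delta^{-1})}{n}} \Biggr).
\]
By choosing $\widehat{m}$ as the middle of a diameter of the confidence region, we could do a little better and replace the factor $2$ in this bound by a factor $\sqrt{3}$.

Proof. According to the PAC-Bayesian inequality of Proposition 2.1, with probability at least $1 - \delta$, for any $\theta \in S$,
\[
\mathcal{E}(\theta) \leq \frac{1}{\lambda} \int \log \bigl[ \mathbb{E} \exp \psi \bigl( \lambda \langle \theta', X \rangle \bigr) \bigr] \, d\rho_\theta(\theta') + \frac{K(\rho_\theta, \rho_0) + \log(\delta^{-1})}{n \lambda}.
\]
We can then use the polynomial approximation of $\exp ( \psi(t) )$ given by Lemma 3.1, remarking that $K(\rho_\theta, \rho_0) = \beta / 2$ and that $\log(1 + z) \leq z$, to deduce that
\[
\begin{aligned}
\mathcal{E}(\theta) & \leq \mathbb{E} \langle \theta, X \rangle + \frac{\lambda}{2} \int \mathbb{E} \bigl( \langle \theta', X \rangle^2 \bigr) \, d\rho_\theta(\theta') + \frac{\beta + 2 \log(\delta^{-1})}{2 n \lambda} \\
& = \mathbb{E} \langle \theta, X \rangle + \frac{\lambda}{2} \Bigl[ \mathbb{E} \bigl( \langle \theta, X \rangle^2 \bigr) + \frac{\mathbb{E} ( \lVert X \rVert^2 )}{\beta} \Bigr] + \frac{\beta + 2 \log(\delta^{-1})}{2 n \lambda} \\
& \leq \mathbb{E} \langle \theta, X \rangle + \frac{\lambda}{2} \bigl( v + T / \beta \bigr) + \frac{\beta + 2 \log(\delta^{-1})}{2 n \lambda} \\
& = \langle \theta, \mathbb{E}(X) \rangle + \sqrt{\frac{T}{n}} + \sqrt{\frac{2 v \log(\delta^{-1})}{n}}.
\end{aligned}
\]
We conclude by considering both $\theta \in S$ and $- \theta \in S$ to get the reverse inequality, using the assumption that $S$ is symmetric and remarking that $\mathcal{E}(- \theta) = - \mathcal{E}(\theta)$.

The existence with probability $1 - \delta$ of $\widehat{m}$ satisfying the required inequality is granted by the fact that, on the event defined by the above PAC-Bayesian inequality, the expectation $\mathbb{E}(X)$ belongs to the confidence region, that as a result cannot be empty.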
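The explicit computation of Lemma 3.2 and the tuning of Proposition 3.3 can be sketched in a few lines of code. The following is only an illustration under stated assumptions: the function names (`phi`, `phi_quadrature`, `tuning`, `directional_estimate`, `deviation_bound`) are hypothetical, the closed form is our reading of the lemma, and the trapezoidal quadrature is there purely as a numerical cross-check.

```python
import math

SQRT2 = math.sqrt(2.0)
CAP = 2.0 * SQRT2 / 3.0  # constant value of psi beyond sqrt(2)

def psi(t):
    """Symmetric influence function of equation (1)."""
    if t > SQRT2:
        return CAP
    if t < -SQRT2:
        return -CAP
    return t - t ** 3 / 6.0

def F(a):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(a / SQRT2))

def G(a):
    """Standard normal density."""
    return math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)

def phi(m, s):
    """Closed form of E[psi(m + s W)] for W ~ N(0, 1), as in Lemma 3.2."""
    if s == 0.0:
        return psi(m)
    a, b = (SQRT2 - m) / s, (SQRT2 + m) / s  # a_- and a_+ of the lemma
    r = (CAP * (F(-a) - F(-b))
         - (m - m ** 3 / 6.0) * (F(-a) + F(-b))
         + s * (1.0 - m * m / 2.0) * (G(b) - G(a))
         + 0.5 * m * s * s * (F(-a) + F(-b) + a * G(a) + b * G(b))
         + s ** 3 / 6.0 * ((a * a + 2.0) * G(a) - (b * b + 2.0) * G(b)))
    return m * (1.0 - s * s / 2.0) - m ** 3 / 6.0 + r

def phi_quadrature(m, s, half_width=10.0, steps=20000):
    """Trapezoidal approximation of E[psi(m + s W)], used only as a check."""
    h = 2.0 * half_width / steps
    total = 0.0
    for j in range(steps + 1):
        w = -half_width + j * h
        weight = 0.5 if j in (0, steps) else 1.0
        total += weight * psi(m + s * w) * G(w)
    return total * h

def tuning(n, v, T, delta):
    """Constants lambda and beta of Proposition 3.3."""
    lam = math.sqrt(2.0 * math.log(1.0 / delta) / (n * v))
    return lam, math.sqrt(n * T) * lam

def directional_estimate(theta, sample, lam, beta):
    """Estimator of <theta, E X> for a unit vector theta."""
    total = 0.0
    for x in sample:
        dot = sum(t * xi for t, xi in zip(theta, x))
        norm = math.sqrt(sum(xi * xi for xi in x))
        total += phi(lam * dot, lam * norm / math.sqrt(beta))
    return total / (len(sample) * lam)

def deviation_bound(n, v, T, delta):
    """Right-hand side sqrt(T/n) + sqrt(2 v log(1/delta) / n)."""
    return math.sqrt(T / n) + math.sqrt(2.0 * v * math.log(1.0 / delta) / n)
```

A point estimate would then be obtained by looking for a vector whose projections stay within `deviation_bound` of `directional_estimate` in every direction of interest; that search is not shown here.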

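3.2 Centered estimate

The bounds in the previous section are simple, but they are stated in terms of uncentered moments of order two, where we would have expected a variance. In this section, we explain how to deduce centered bounds from the uncentered bounds of the previous section, through the use of a sample splitting scheme.

Assume that
\[
\mathbb{E} \bigl( \lVert X - \mathbb{E}(X) \rVert^2 \bigr) \leq \overline{T} < \infty
\quad \text{and} \quad
\sup_{\theta \in S_d} \mathbb{E} \bigl( \langle \theta, X - \mathbb{E}(X) \rangle^2 \bigr) \leq \overline{v} \leq \overline{T} < \infty,
\]
where $\overline{v}$ and $\overline{T}$ are known constants. Remark that when these bounds hold, the bounds
\[
v = \overline{v} + \lVert \mathbb{E}(X) \rVert^2 \quad \text{and} \quad T = \overline{T} + \lVert \mathbb{E}(X) \rVert^2 \tag{2}
\]
hold in the previous section. Assume that we know also some bound $b$ such that
\[
\lVert \mathbb{E}(X) \rVert^2 \leq b.
\]
Split the sample in two parts $(X_1, \dots, X_k)$ and $(X_{k+1}, \dots, X_n)$. Use the first part to construct an estimator $\widetilde{m}$ of $\mathbb{E}(X)$ as described in Proposition 3.3, choosing $S = S_d$. According to this proposition and by equation (2), with probability at least $1 - \delta$,
\[
\lVert \widetilde{m} - \mathbb{E}(X) \rVert \leq 2 \Biggl( \sqrt{\frac{\overline{T} + b}{k}} + \sqrt{\frac{2 ( \overline{v} + b ) \log(\delta^{-1})}{k}} \Biggr) = \sqrt{\frac{A}{k}},
\]
where we have put
\[
A = 4 \Bigl( \sqrt{\overline{T} + b} + \sqrt{2 ( \overline{v} + b ) \log(\delta^{-1})} \Bigr)^2.
\]
We then construct an estimator $\mathcal{E}(\theta)$ of $\langle \theta, \mathbb{E}(X) \rangle$, $\theta \in S_d$, built as described in Proposition 3.3 from the recentered sample $(X_{k+1} - \widetilde{m}, \dots, X_n - \widetilde{m})$ and the constants $\overline{T} + A/k$ and $\overline{v} + A/k$, and then shifted back by $\langle \theta, \widetilde{m} \rangle$. With probability at least $1 - 2 \delta$,
\[
\sup_{\theta \in S_d} \bigl\lvert \mathcal{E}(\theta) - \langle \theta, \mathbb{E}(X) \rangle \bigr\rvert \leq B_{n,k} = \sqrt{\frac{\overline{T} + A/k}{n - k}} + \sqrt{\frac{2 ( \overline{v} + A/k ) \log(\delta^{-1})}{n - k}},
\]
and we can, if needed, deduce from $\mathcal{E}(\theta)$ an estimator $\widehat{m}$ such that with probability at least $1 - 2 \delta$,
\[
\lVert \widehat{m} - \mathbb{E}(X) \rVert \leq 2 B_{n,k}.
\]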
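To get a feeling for the size of the correction term $A/k$, here is a small numerical sketch of the two-stage constants. The function names and the numerical values in the usage note are illustrative assumptions, not part of the original text.

```python
import math

def split_constants(n, k, v_bar, T_bar, b, delta):
    """Constant A of the preliminary stage and bound B_{n,k} of the second stage."""
    log_inv = math.log(1.0 / delta)
    A = 4.0 * (math.sqrt(T_bar + b)
               + math.sqrt(2.0 * (v_bar + b) * log_inv)) ** 2
    B = (math.sqrt((T_bar + A / k) / (n - k))
         + math.sqrt(2.0 * (v_bar + A / k) * log_inv / (n - k)))
    return A, B

def centered_reference(n, v_bar, T_bar, delta):
    """Bound one would get if the mean were known: sqrt(T/n) + sqrt(2 v log(1/delta)/n)."""
    log_inv = math.log(1.0 / delta)
    return math.sqrt(T_bar / n) + math.sqrt(2.0 * v_bar * log_inv / n)

def ratio(n, v_bar, T_bar, b, delta):
    """Ratio B_{n,k} / reference for the split k = floor(sqrt(n))."""
    k = math.isqrt(n)
    _, B = split_constants(n, k, v_bar, T_bar, b, delta)
    return B / centered_reference(n, v_bar, T_bar, delta)
```

For instance, with the hypothetical values `v_bar = 1`, `T_bar = 10`, `b = 4` and `delta = 0.01`, calling `ratio(n, 1.0, 10.0, 4.0, 0.01)` for increasing `n` shows how the price of not knowing the mean shrinks as the sample grows.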

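If we want the correction term $A/k$ to behave as a second order term when $n$ tends to $\infty$, we can for example take $k = \sqrt{n}$, in which case $n - k$ is equivalent to $n$ at infinity, so that $B_{n,k}$ is equivalent to
\[
\sqrt{\frac{\overline{T}}{n}} + \sqrt{\frac{2 \overline{v} \log(\delta^{-1})}{n}}.
\]
Let us also mention that a simpler estimator, obtained by shrinking the norm of $X_i$, is also possible. It comes with a sub-Gaussian deviation bound under the slightly stronger hypothesis that $\mathbb{E} ( \lVert X \rVert^p ) < \infty$ for some non necessarily integer exponent $p > 2$, and is described in Catoni and Giulini [2017].

4 Mean matrix estimate

Let $M \in \mathbb{R}^{p \times q}$ be a random matrix and let $(M_1, \dots, M_n)$ be $n$ independent copies of $M$. In this section, we will provide an estimator for $\mathbb{E}(M)$.

From the previous section, we already have an estimator $\widehat{m}$ of $\mathbb{E}(M)$ with a bounded Hilbert-Schmidt norm $\lVert \widehat{m} - \mathbb{E}(M) \rVert_{\mathrm{HS}}$, since from the point of view of the Hilbert-Schmidt norm, $M$ is nothing but a random vector of size $pq$. Here, we will be interested in another natural norm, the operator norm. Indeed, recalling that
\[
\lVert M \rVert_\infty = \lVert M^\top \rVert_\infty = \sup_{\theta \in S_q} \lVert M \theta \rVert = \sup_{\theta \in S_q, \, \xi \in S_p} \langle \xi, M \theta \rangle = \sup_{\xi \in S_p} \lVert M^\top \xi \rVert = \sup_{\theta \in S_q, \, \xi \in S_p} \operatorname{Tr} \bigl( \theta \xi^\top M \bigr),
\]
we see that we can deduce results from the previous section on vectors, considering the scalar product between matrices
\[
\langle M, N \rangle_{\mathrm{HS}} = \operatorname{Tr} \bigl( M^\top N \bigr), \qquad M, N \in \mathbb{R}^{p \times q},
\]
and the part of the unit sphere defined as
\[
S = \bigl\{ \xi \theta^\top : \xi \in S_p, \ \theta \in S_q \bigr\}.
\]
Doing so, we obtain in the uncentered case a bound of the form
\[
\lVert \widehat{m} - \mathbb{E}(M) \rVert_\infty \leq 2 \sqrt{\frac{\mathbb{E} \bigl( \lVert M \rVert_{\mathrm{HS}}^2 \bigr)}{n}} + 2 \sqrt{\frac{2 \sup_{\xi \in S_p, \, \theta \in S_q} \mathbb{E} \bigl( \langle \xi, M \theta \rangle^2 \bigr) \log(\delta^{-1})}{n}}.
\]
We will show in the next section that the second $\delta$-dependent term is satisfactory, whereas the first $\delta$-independent term can be improved.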

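4.1 Estimation without centering

Consider the influence function $\psi$ defined by equation (1). For any $\xi \in \mathbb{R}^p$, let $\nu_\xi = \mathcal{N}(\xi, \beta^{-1} I_p)$, where $I_p$ is the identity matrix of size $p \times p$. In the same way, let $\rho_\theta = \mathcal{N}(\theta, \gamma^{-1} I_q)$, $\theta \in \mathbb{R}^q$. Consider the estimator of $\langle \xi, \mathbb{E}(M) \theta \rangle$ defined as
\[
\mathcal{E}(\xi, \theta) = \frac{1}{n \lambda} \sum_{i=1}^n \iint \psi \bigl( \lambda \langle \xi', M_i \theta' \rangle \bigr) \, d\nu_\xi(\xi') \, d\rho_\theta(\theta'), \qquad \xi \in \mathbb{R}^p, \ \theta \in \mathbb{R}^q.
\]

Proposition 4.1 For any parameters $\delta \in \, ]0,1[$, $\lambda, \beta, \gamma \in \, ]0, \infty[$, with probability at least $1 - \delta$, for any $\xi \in \mathbb{R}^p$ and any $\theta \in \mathbb{R}^q$,
\[
\mathcal{E}(\xi, \theta) \leq \mathbb{E} \langle \xi, M \theta \rangle + \frac{\lambda}{2} \Bigl[ \mathbb{E} \bigl( \langle \xi, M \theta \rangle^2 \bigr) + \frac{\mathbb{E} ( \lVert M \theta \rVert^2 )}{\beta} + \frac{\mathbb{E} ( \lVert M^\top \xi \rVert^2 )}{\gamma} + \frac{\mathbb{E} ( \lVert M \rVert_{\mathrm{HS}}^2 )}{\beta \gamma} \Bigr] + \frac{\beta \lVert \xi \rVert^2 + \gamma \lVert \theta \rVert^2 + 2 \log(\delta^{-1})}{2 n \lambda}.
\]
On the unit spheres, that is for $\xi \in S_p$ and $\theta \in S_q$, the last term reads $( \beta + \gamma + 2 \log(\delta^{-1}) ) / (2 n \lambda)$.

Proof. The PAC-Bayesian inequality of Proposition 2.1 tells us that with probability at least $1 - \delta$, for any $\xi \in \mathbb{R}^p$ and any $\theta \in \mathbb{R}^q$,
\[
\mathcal{E}(\xi, \theta) \leq \frac{1}{\lambda} \iint \log \bigl\{ \mathbb{E} \bigl[ \exp \psi \bigl( \lambda \langle \xi', M \theta' \rangle \bigr) \bigr] \bigr\} \, d\nu_\xi(\xi') \, d\rho_\theta(\theta') + \frac{K(\nu_\xi, \nu_0)}{n \lambda} + \frac{K(\rho_\theta, \rho_0)}{n \lambda} + \frac{\log(\delta^{-1})}{n \lambda}.
\]
Using the properties of $\psi$ (Lemma 3.1), Fubini's lemma, and the values $K(\nu_\xi, \nu_0) = \beta \lVert \xi \rVert^2 / 2$ and $K(\rho_\theta, \rho_0) = \gamma \lVert \theta \rVert^2 / 2$, we get
\[
\mathcal{E}(\xi, \theta) \leq \langle \xi, \mathbb{E}(M) \theta \rangle + \frac{\lambda}{2} \iint \mathbb{E} \bigl( \langle \xi', M \theta' \rangle^2 \bigr) \, d\nu_\xi(\xi') \, d\rho_\theta(\theta') + \frac{\beta \lVert \xi \rVert^2 + \gamma \lVert \theta \rVert^2 + 2 \log(\delta^{-1})}{2 n \lambda}.
\]
As
\[
\iint \langle \xi', M \theta' \rangle^2 \, d\nu_\xi(\xi') \, d\rho_\theta(\theta') = \langle \xi, M \theta \rangle^2 + \frac{\lVert M \theta \rVert^2}{\beta} + \frac{\lVert M^\top \xi \rVert^2}{\gamma} + \frac{\lVert M \rVert_{\mathrm{HS}}^2}{\beta \gamma},
\]
this concludes the proof.

Let us now discuss the question of computing $\mathcal{E}(\xi, \theta)$. Remark that, according to Lemma 3.2, for any $x \in \mathbb{R}^p$,
\[
\int \psi \bigl( \langle \xi', x \rangle \bigr) \, d\nu_\xi(\xi') = \varphi \bigl( \langle \xi, x \rangle, \beta^{-1/2} \lVert x \rVert \bigr)
\]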

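which, by the same lemma, expands as
\[
\varphi \bigl( \langle \xi, x \rangle, \beta^{-1/2} \lVert x \rVert \bigr) = \langle \xi, x \rangle - \frac{\lVert x \rVert^2 \langle \xi, x \rangle}{2 \beta} - \frac{\langle \xi, x \rangle^3}{6} + r \bigl( \langle \xi, x \rangle, \beta^{-1/2} \lVert x \rVert \bigr).
\]
It is also easy to check that
\[
\begin{aligned}
\int \lVert M_i \theta' \rVert^2 \, d\rho_\theta(\theta') & = \lVert M_i \theta \rVert^2 + \frac{\lVert M_i \rVert_{\mathrm{HS}}^2}{\gamma}, \\
\int \langle \xi, M_i \theta' \rangle \lVert M_i \theta' \rVert^2 \, d\rho_\theta(\theta') & = \langle \xi, M_i \theta \rangle \lVert M_i \theta \rVert^2 + \frac{1}{\gamma} \langle \xi, M_i \theta \rangle \lVert M_i \rVert_{\mathrm{HS}}^2 + \frac{2}{\gamma} \langle \xi, M_i M_i^\top M_i \theta \rangle, \\
\int \langle \xi, M_i \theta' \rangle^3 \, d\rho_\theta(\theta') & = \langle \xi, M_i \theta \rangle^3 + \frac{3}{\gamma} \langle \xi, M_i \theta \rangle \lVert M_i^\top \xi \rVert^2.
\end{aligned}
\]
Consider a standard random vector $W_q \sim \mathcal{N}(0, I_q)$. We obtain that
\[
\iint \psi \bigl( \lambda \langle \xi', M_i \theta' \rangle \bigr) \, d\nu_\xi(\xi') \, d\rho_\theta(\theta') = \int \Bigl[ \lambda \langle \xi, M_i \theta' \rangle - \frac{\lambda^3}{2 \beta} \langle \xi, M_i \theta' \rangle \lVert M_i \theta' \rVert^2 - \frac{\lambda^3}{6} \langle \xi, M_i \theta' \rangle^3 + r \bigl( \lambda \langle \xi, M_i \theta' \rangle, \lambda \beta^{-1/2} \lVert M_i \theta' \rVert \bigr) \Bigr] d\rho_\theta(\theta'),
\]
so that
\[
\begin{aligned}
\mathcal{E}(\xi, \theta) = {} & \frac{1}{n} \sum_{i=1}^n \Bigl[ \langle \xi, M_i \theta \rangle - \frac{\lambda^2}{6} \langle \xi, M_i \theta \rangle^3 - \frac{\lambda^2}{2 \beta} \langle \xi, M_i \theta \rangle \lVert M_i \theta \rVert^2 - \frac{\lambda^2}{2 \gamma} \langle \xi, M_i \theta \rangle \lVert M_i^\top \xi \rVert^2 \\
& \qquad - \frac{\lambda^2}{2 \beta \gamma} \langle \xi, M_i \theta \rangle \lVert M_i \rVert_{\mathrm{HS}}^2 - \frac{\lambda^2}{\beta \gamma} \langle \xi, M_i M_i^\top M_i \theta \rangle \Bigr] \\
& + \frac{1}{n \lambda} \sum_{i=1}^n \mathbb{E} \Bigl[ r \Bigl( \lambda \bigl\langle M_i^\top \xi, \theta + \gamma^{-1/2} W_q \bigr\rangle, \lambda \beta^{-1/2} \bigl\lVert M_i ( \theta + \gamma^{-1/2} W_q ) \bigr\rVert \Bigr) \Bigr].
\end{aligned}
\]
The last term is not explicit, since it contains an expectation, but should be most of the time a small remainder, and can be evaluated using a Monte-Carlo numerical scheme. This gives a more explicit and efficient method than evaluating directly $\mathcal{E}(\xi, \theta)$ using a Monte-Carlo simulation for the couple of random variables $(\xi', \theta') \sim \nu_\xi \otimes \rho_\theta$.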
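The Gaussian moment identities used above lend themselves to an exact numerical check. The sketch below is an illustration only: it relies on the fact that the two-point rule with nodes $\pm 1$ and weights $1/2$ integrates polynomials of degree at most three exactly against $\mathcal{N}(0,1)$, so that its tensor product is exact for the three integrands considered here. All function names and the test matrix are assumptions made for the example.

```python
import itertools
import math

def matvec(M, x):
    return [sum(a * b for a, b in zip(row, x)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def hs_squared(M):
    """Squared Hilbert-Schmidt norm."""
    return sum(a * a for row in M for a in row)

def gaussian_mean(f, theta, gamma):
    """E f(theta') for theta' ~ N(theta, I / gamma), by the tensorized
    two-point rule (exact if f has degree <= 3 in each coordinate)."""
    scale = 1.0 / math.sqrt(gamma)
    total = 0.0
    for signs in itertools.product((-1.0, 1.0), repeat=len(theta)):
        point = [t + scale * s for t, s in zip(theta, signs)]
        total += f(point)
    return total / 2 ** len(theta)

def check_identities(M, xi, theta, gamma):
    """Return (left, right) pairs for the three identities."""
    Mt = transpose(M)
    a = dot(xi, matvec(M, theta))
    norm_Mtheta_sq = dot(matvec(M, theta), matvec(M, theta))
    norm_Mtxi_sq = dot(matvec(Mt, xi), matvec(Mt, xi))
    cubic = dot(xi, matvec(M, matvec(Mt, matvec(M, theta))))
    hs = hs_squared(M)

    def sq(p):
        y = matvec(M, p)
        return dot(y, y)

    pairs = [
        (gaussian_mean(sq, theta, gamma), norm_Mtheta_sq + hs / gamma),
        (gaussian_mean(lambda p: dot(xi, matvec(M, p)) * sq(p), theta, gamma),
         a * norm_Mtheta_sq + a * hs / gamma + 2.0 * cubic / gamma),
        (gaussian_mean(lambda p: dot(xi, matvec(M, p)) ** 3, theta, gamma),
         a ** 3 + 3.0 * a * norm_Mtxi_sq / gamma),
    ]
    return pairs
```

The same quadrature is not exact for the remainder term $r$, which is not polynomial; for that term a genuine Monte-Carlo scheme, as suggested in the text, remains the natural choice.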

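Proposition 4.2 Assume that the following finite bounds are known:
\[
\begin{aligned}
v & \geq \sup_{\xi \in S_p, \, \theta \in S_q} \mathbb{E} \bigl( \langle \xi, M \theta \rangle^2 \bigr) = \sup_{\xi \in S_p, \, \theta \in S_q} \bigl\langle \xi \otimes \xi, \mathbb{E} ( M \otimes M ) \, \theta \otimes \theta \bigr\rangle, \\
t & \geq \sup_{\theta \in S_q} \mathbb{E} \bigl( \lVert M \theta \rVert^2 \bigr) = \sup_{\theta \in S_q} \bigl\langle \theta, \mathbb{E} ( M^\top M ) \theta \bigr\rangle = \bigl\lVert \mathbb{E} ( M^\top M ) \bigr\rVert_\infty, \\
u & \geq \sup_{\xi \in S_p} \mathbb{E} \bigl( \lVert M^\top \xi \rVert^2 \bigr) = \sup_{\xi \in S_p} \bigl\langle \xi, \mathbb{E} ( M M^\top ) \xi \bigr\rangle = \bigl\lVert \mathbb{E} ( M M^\top ) \bigr\rVert_\infty, \\
T & \geq \mathbb{E} \bigl( \lVert M \rVert_{\mathrm{HS}}^2 \bigr),
\end{aligned}
\]
and choose
\[
\lambda = \sqrt{\frac{\beta + \gamma + 2 \log(\delta^{-1})}{n \bigl( v + t/\beta + u/\gamma + T/(\beta \gamma) \bigr)}}.
\]
For any values of $\delta \in \, ]0,1[$, $\beta, \gamma \in \, ]0, \infty[$, with probability at least $1 - \delta$, for any $\xi \in S_p$, any $\theta \in S_q$,
\[
\bigl\lvert \mathcal{E}(\xi, \theta) - \langle \xi, \mathbb{E}(M) \theta \rangle \bigr\rvert \leq \frac{B}{\sqrt{n}} = \sqrt{\frac{\bigl( v + t/\beta + u/\gamma + T/(\beta \gamma) \bigr) \bigl( \beta + \gamma + 2 \log(\delta^{-1}) \bigr)}{n}}.
\]
Consider now any estimator $\widehat{m}$ of $\mathbb{E}(M)$. With probability at least $1 - \delta$,
\[
\lVert \widehat{m} - \mathbb{E}(M) \rVert_\infty \leq \sup_{\xi \in S_p, \, \theta \in S_q} \bigl\lvert \mathcal{E}(\xi, \theta) - \langle \xi, \widehat{m} \theta \rangle \bigr\rvert + \frac{B}{\sqrt{n}}.
\]
In particular, if we choose $\widehat{m}$ such that
\[
\sup_{\xi \in S_p, \, \theta \in S_q} \bigl\lvert \mathcal{E}(\xi, \theta) - \langle \xi, \widehat{m} \theta \rangle \bigr\rvert \leq \frac{B}{\sqrt{n}},
\]
with probability at least $1 - \delta$, this choice is possible and
\[
\lVert \widehat{m} - \mathbb{E}(M) \rVert_\infty \leq \frac{2 B}{\sqrt{n}}.
\]

Remark 4.1 In particular, choosing $\beta = \gamma = 2 \max \bigl\{ (t + u)/v, \sqrt{T/v} \bigr\}$, we get
\[
B \leq \sqrt{2 v \Bigl( 2 \log(\delta^{-1}) + 4 \max \Bigl\{ \frac{t + u}{v}, \sqrt{\frac{T}{v}} \Bigr\} \Bigr)}.
\]
The bound $B$ is of the type
\[
\sqrt{2 v \bigl[ C + \log(\delta^{-1}) \bigr]},
\]
with a complexity (or dimension) term $C$ equal to
\[
C = 4 \max \Bigl\{ \frac{t + u}{v}, \sqrt{\frac{T}{v}} \Bigr\} + \log(\delta^{-1}).
\]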

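Remark 4.2 Let us envision a simple case, to compare the precision of the bounds in a setting where dimension-free and dimension-dependent bounds coincide. Assume more specifically that the entries of the matrix $M$, $(M_{i,j}, 1 \leq i \leq p, 1 \leq j \leq q)$, are centered and i.i.d. Assume that $\sigma^2 = \mathbb{E} ( M_{i,j}^2 )$ is known and take
\[
\begin{aligned}
v & = \sup_{\xi \in S_p, \, \theta \in S_q} \mathbb{E} \bigl( \langle \xi, M \theta \rangle^2 \bigr) = \sigma^2, &
t & = \sup_{\theta \in S_q} \mathbb{E} \bigl( \lVert M \theta \rVert^2 \bigr) = p \sigma^2, \\
u & = \sup_{\xi \in S_p} \mathbb{E} \bigl( \lVert M^\top \xi \rVert^2 \bigr) = q \sigma^2, &
T & = \mathbb{E} \bigl( \lVert M \rVert_{\mathrm{HS}}^2 \bigr) = p q \sigma^2.
\end{aligned}
\]
Choosing $\beta = \gamma = 2 (p + q)$, we get a complexity term equal to
\[
C = 4 (p + q) + \log(\delta^{-1}),
\]
whereas the bound of the previous section, made for vectors, has a complexity factor equal to $pq$.

4.2 Controlling both the operator norm error and the Hilbert-Schmidt error

There are situations where it is desirable to control both $\lVert \widehat{m} - \mathbb{E}(M) \rVert_\infty$ and $\lVert \widehat{m} - \mathbb{E}(M) \rVert_{\mathrm{HS}}$. To do so, we can very easily combine Proposition 3.3 and Proposition 4.2, since these two propositions are based on the construction of confidence regions.

More precisely, first consider $M \in \mathbb{R}^{p \times q}$ as a vector and use the scalar product
\[
\langle \theta, M \rangle_{\mathrm{HS}} = \operatorname{Tr} \bigl( \theta^\top M \bigr), \qquad \theta \in \mathbb{R}^{p \times q}.
\]
Applying Proposition 3.3, we can build an estimator $\mathcal{E}_{\mathrm{HS}}(\theta)$ such that, with probability at least $1 - \delta$,
\[
\sup_{\theta \in \mathbb{R}^{p \times q}, \, \lVert \theta \rVert_{\mathrm{HS}} = 1} \bigl\lvert \mathcal{E}_{\mathrm{HS}}(\theta) - \operatorname{Tr} \bigl( \theta^\top \mathbb{E}(M) \bigr) \bigr\rvert \leq \frac{A}{\sqrt{n}} = \sqrt{\frac{T}{n}} + \sqrt{\frac{2 v \log(\delta^{-1})}{n}}.
\]
On the other hand, we can also apply Proposition 4.2 and build an estimator $\mathcal{E}(\xi, \theta)$, $\xi \in S_p$, $\theta \in S_q$, such that with probability at least $1 - \delta$,
\[
\sup_{\xi \in S_p, \, \theta \in S_q} \bigl\lvert \mathcal{E}(\xi, \theta) - \langle \xi, \mathbb{E}(M) \theta \rangle \bigr\rvert \leq \frac{B}{\sqrt{n}} = \sqrt{\frac{2 v}{n} \Bigl( 2 \log(\delta^{-1}) + 4 \max \Bigl\{ \frac{t + u}{v}, \sqrt{\frac{T}{v}} \Bigr\} \Bigr)}.
\]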
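The comparison made in Remark 4.2 and the two constants $A$ and $B$ just introduced are easy to evaluate numerically. The sketch below does so for the i.i.d.-entries setting; the function names and the dimensions used in the usage note are illustrative assumptions.

```python
import math

def hs_constant_A(v, T, delta):
    """A = sqrt(T) + sqrt(2 v log(1/delta)), the Hilbert-Schmidt constant."""
    return math.sqrt(T) + math.sqrt(2.0 * v * math.log(1.0 / delta))

def operator_constant_B(v, t, u, T, delta, beta, gamma):
    """B of Proposition 4.2 for given beta and gamma."""
    variance = v + t / beta + u / gamma + T / (beta * gamma)
    entropy = beta + gamma + 2.0 * math.log(1.0 / delta)
    return math.sqrt(variance * entropy)

def suggested_beta(v, t, u, T):
    """Common value beta = gamma of Remark 4.1."""
    return 2.0 * max((t + u) / v, math.sqrt(T / v))

def simplified_B(v, t, u, T, delta):
    """Upper bound on B stated in Remark 4.1."""
    c = max((t + u) / v, math.sqrt(T / v))
    return math.sqrt(2.0 * v * (2.0 * math.log(1.0 / delta) + 4.0 * c))

def iid_constants(p, q, sigma2):
    """Values (v, t, u, T) of Remark 4.2 for centered i.i.d. entries."""
    return sigma2, p * sigma2, q * sigma2, p * q * sigma2
```

For a hypothetical $50 \times 80$ matrix with unit-variance entries and $\delta = 0.01$, `simplified_B(*iid_constants(50, 80, 1.0), 0.01)` can be compared with `hs_constant_A(1.0, 4000.0, 0.01)` to see the gap between a complexity of order $p + q$ and one of order $pq$.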

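Proposition 4.3 Consider a matrix $\widehat{m}$ such that
\[
\sup_{\theta \in \mathbb{R}^{p \times q}, \, \lVert \theta \rVert_{\mathrm{HS}} = 1} \bigl\lvert \mathcal{E}_{\mathrm{HS}}(\theta) - \operatorname{Tr} \bigl( \theta^\top \widehat{m} \bigr) \bigr\rvert \leq \frac{A}{\sqrt{n}}
\quad \text{and} \quad
\sup_{\xi \in S_p, \, \theta \in S_q} \bigl\lvert \mathcal{E}(\xi, \theta) - \langle \xi, \widehat{m} \theta \rangle \bigr\rvert \leq \frac{B}{\sqrt{n}}.
\]
Combining Propositions 3.3 and 4.2 shows that with probability at least $1 - 2 \delta$, such a matrix $\widehat{m}$ exists and satisfies both
\[
\lVert \widehat{m} - \mathbb{E}(M) \rVert_{\mathrm{HS}} \leq \frac{2 A}{\sqrt{n}} \quad \text{and} \quad \lVert \widehat{m} - \mathbb{E}(M) \rVert_\infty \leq \frac{2 B}{\sqrt{n}}.
\]
Remark that $B$ is typically smaller than $A$, as expected, in interesting large dimension situations.

4.3 Centered estimator

As already done in the case of the estimation of the mean of a random vector, we deduce in this section centered bounds from the uncentered bounds of the previous sections, using sample splitting.

Put $m = \mathbb{E}(M)$ and $\overline{M} = M - m$. Assume that we know finite constants $\overline{v}$, $\overline{t}$, $\overline{u}$ and $\overline{T}$ such that
\[
\sup_{\xi \in S_p, \, \theta \in S_q} \mathbb{E} \bigl( \langle \xi, \overline{M} \theta \rangle^2 \bigr) \leq \overline{v} < \infty, \quad
\sup_{\theta \in S_q} \mathbb{E} \bigl( \lVert \overline{M} \theta \rVert^2 \bigr) \leq \overline{t} < \infty, \quad
\sup_{\xi \in S_p} \mathbb{E} \bigl( \lVert \overline{M}^\top \xi \rVert^2 \bigr) \leq \overline{u} < \infty, \quad
\mathbb{E} \bigl( \lVert \overline{M} \rVert_{\mathrm{HS}}^2 \bigr) \leq \overline{T} < \infty.
\]
When this is true, we can take for the previous uncentered constants
\[
v = \overline{v} + \lVert m \rVert_\infty^2, \quad t = \overline{t} + \lVert m \rVert_\infty^2, \quad u = \overline{u} + \lVert m \rVert_\infty^2, \quad T = \overline{T} + \lVert m \rVert_{\mathrm{HS}}^2.
\]
In view of this, it is suitable to assume that we also know some finite constants $b$ and $c$ such that
\[
\lVert m \rVert_\infty^2 \leq b \quad \text{and} \quad \lVert m \rVert_{\mathrm{HS}}^2 \leq c.
\]
As we see that the Hilbert-Schmidt norm $\lVert m \rVert_{\mathrm{HS}}$ comes into play, we will use the combined preliminary estimate provided by Proposition 4.3.

Given an i.i.d. matrix sample $(M_1, \dots, M_n)$, first use $M_1, \dots, M_k$ to build a preliminary estimator $\widetilde{m}$ as described in Proposition 4.3. With probability at least $1 - \delta/2$,
\[
\lVert \widetilde{m} - m \rVert_{\mathrm{HS}} \leq \sqrt{\frac{A}{k}} \quad \text{and} \quad \lVert \widetilde{m} - m \rVert_\infty \leq \sqrt{\frac{B}{k}},
\]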

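where
\[
A = 4 \Bigl( \sqrt{2 ( \overline{v} + b ) \log(4/\delta)} + \sqrt{\overline{T} + c} \Bigr)^2
\quad \text{and} \quad
B = 8 ( \overline{v} + b ) \Bigl( 2 \log(4/\delta) + 4 \max \Bigl\{ \frac{\overline{t} + \overline{u} + 2 b}{\overline{v} + b}, \Bigl( \frac{\overline{T} + c}{\overline{v} + b} \Bigr)^{1/2} \Bigr\} \Bigr).
\]
Then use the recentered sample $(M_{k+1} - \widetilde{m}, \dots, M_n - \widetilde{m})$ to build, following the construction described in Proposition 4.2 at confidence level $1 - \delta/2$ and adding $\langle \xi, \widetilde{m} \theta \rangle$ back, an estimator $\mathcal{E}(\xi, \theta)$ of $\langle \xi, m \theta \rangle$, $\xi \in S_p$, $\theta \in S_q$. It is such that with probability at least $1 - \delta$,
\[
\bigl\lvert \mathcal{E}(\xi, \theta) - \langle \xi, m \theta \rangle \bigr\rvert \leq \frac{C}{\sqrt{n - k}} = \sqrt{\frac{2 ( \overline{v} + B/k )}{n - k} \Bigl( 2 \log(2/\delta) + 4 \max \Bigl\{ \frac{\overline{t} + \overline{u} + 2 B/k}{\overline{v} + B/k}, \Bigl( \frac{\overline{T} + A/k}{\overline{v} + B/k} \Bigr)^{1/2} \Bigr\} \Bigr)}.
\]
If we choose for instance $k = \sqrt{n}$, we obtain that
\[
C \underset{n \to \infty}{\sim} \sqrt{2 \overline{v} \Bigl( 2 \log(2/\delta) + 4 \max \Bigl\{ \frac{\overline{t} + \overline{u}}{\overline{v}}, \Bigl( \frac{\overline{T}}{\overline{v}} \Bigr)^{1/2} \Bigr\} \Bigr)}.
\]

5 Adaptive estimators

The results presented in the previous sections assume that there exist known upper bounds for some quantities, such as $\mathbb{E} ( \lVert X \rVert^2 )$ in the case of a mean vector estimate, or $\mathbb{E} ( \lVert M \rVert_{\mathrm{HS}}^2 )$ in the matrix case. Here, we would like to adapt to these quantities in the case when those bounds are not known. To do so, we will use an asymmetric influence function $\psi : \mathbb{R}_+ \to \mathbb{R}_+$, defined on the positive real line only as
\[
\psi(t) = \begin{cases} t - t^2/2, & 0 \leq t \leq 1, \\ 1/2, & 1 \leq t. \end{cases} \tag{3}
\]

Lemma 5.1 For any $t \in \mathbb{R}_+$,
\[
- \log ( 1 - t + t^2 ) \leq \psi(t) \leq \log ( 1 + t ).
\]

Proof. Let us put $f(t) = - \log ( 1 - t + t^2 )$ and $g(t) = \log ( 1 + t )$. Remark that $f(0) = g(0) = \psi(0) = 0$. Remark also that for any $t \in [0, 1]$,
\[
f'(t) = \frac{1 - 2 t}{1 - t + t^2}, \qquad \psi'(t) - f'(t) = \frac{t^2 ( 2 - t )}{1 - t + t^2} \geq 0,
\]
and
\[
g'(t) = \frac{1}{1 + t} \geq 1 - t = \psi'(t).
\]
As on the interval $[1, \infty[$, $f$ is decreasing, $g$ is increasing, and $\psi$ is constant, this proves the lemma.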
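The two-sided bound of Lemma 5.1 can be checked numerically on a grid. This is only a sanity check, not a proof; the names `psi_plus`, `lower_envelope`, `upper_envelope` and `lemma_holds_on` are chosen for this sketch.

```python
import math

def psi_plus(t):
    """Asymmetric influence function of equation (3), defined for t >= 0."""
    if t < 0.0:
        raise ValueError("psi_plus is only defined on the non-negative half line")
    return t - t * t / 2.0 if t <= 1.0 else 0.5

def lower_envelope(t):
    """Left-hand side of Lemma 5.1: -log(1 - t + t^2)."""
    return -math.log(1.0 - t + t * t)

def upper_envelope(t):
    """Right-hand side of Lemma 5.1: log(1 + t)."""
    return math.log1p(t)

def lemma_holds_on(grid, tol=1e-12):
    """True if the two inequalities hold (up to rounding) at every grid point."""
    return all(
        lower_envelope(t) <= psi_plus(t) + tol
        and psi_plus(t) <= upper_envelope(t) + tol
        for t in grid
    )
```

Near zero the three functions agree up to terms of order $t^3$, which is why the tolerance `tol` only has to absorb floating point rounding.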

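Similarly to the previous case, considering a standard Gaussian real valued random variable $W \sim \mathcal{N}(0, 1)$, we can introduce the function
\[
\varphi(m, \sigma) = \mathbb{E} \bigl\{ \psi \bigl[ ( m + \sigma W )_+ \bigr] \bigr\},
\]
where $t_+ = \max \{ t, 0 \}$, and explicitly compute $\varphi$ as
\[
\begin{aligned}
\varphi(m, \sigma) = {} & \Bigl( m - \frac{m^2 + \sigma^2}{2} \Bigr) \Bigl[ F \Bigl( \frac{1 - m}{\sigma} \Bigr) - F \Bigl( - \frac{m}{\sigma} \Bigr) \Bigr] + \frac{\sigma ( 1 - m/2 )}{\sqrt{2 \pi}} \exp \Bigl( - \frac{m^2}{2 \sigma^2} \Bigr) \\
& + \frac{1}{2} F \Bigl( - \frac{1 - m}{\sigma} \Bigr) - \frac{\sigma ( 1 - m )}{2 \sqrt{2 \pi}} \exp \Bigl( - \frac{( 1 - m )^2}{2 \sigma^2} \Bigr),
\end{aligned}
\]
using the expression
\[
\psi ( t_+ ) = \bigl( t - t^2/2 \bigr) \bigl[ \mathbb{1}( t \leq 1 ) - \mathbb{1}( t \leq 0 ) \bigr] + \frac{1}{2} \bigl[ 1 - \mathbb{1}( t \leq 1 ) \bigr], \qquad t \in \mathbb{R}.
\]

5.1 Estimation of the mean of a random vector

Consider a discrete set $\Lambda$ of values of $\lambda$ and a probability measure $\mu$ on $\Lambda$, to be chosen more precisely later on. Let $\beta$ be some positive parameter that we will also choose later, and put as previously $\rho_\theta = \mathcal{N}(\theta, \beta^{-1} I_d)$. Writing $t_- = \max \{ - t, 0 \}$, define for any $\theta \in S_d$
\[
\begin{aligned}
\mathcal{E}_+(\theta) & = \sup_{\lambda \in \Lambda} \Bigl\{ \frac{1}{n \lambda} \sum_{i=1}^n \int \psi \bigl( \lambda \langle \theta', X_i \rangle_+ \bigr) \, d\rho_\theta(\theta') - \frac{\beta + 2 \log \bigl( \delta^{-1} \mu(\lambda)^{-1} \bigr)}{2 n \lambda} \Bigr\}, \\
\mathcal{E}_-(\theta) & = \sup_{\lambda \in \Lambda} \Bigl\{ \frac{1}{n \lambda} \sum_{i=1}^n \int \psi \bigl( \lambda \langle \theta', X_i \rangle_- \bigr) \, d\rho_\theta(\theta') - \frac{\beta + 2 \log \bigl( \delta^{-1} \mu(\lambda)^{-1} \bigr)}{2 n \lambda} \Bigr\},
\end{aligned}
\]
and $\mathcal{E}(\theta) = \mathcal{E}_+(\theta) - \mathcal{E}_-(\theta)$.

Thoughtful readers may wonder why we introduce $\lambda$ in this way and do not use instead $\rho_{\lambda \theta}$ to get a uniform result in $\lambda \theta$ in one shot, without introducing the discrete set $\Lambda$. It is because this option would produce the entropy factor $\lambda \beta / (2n)$ instead of $\beta / (2 n \lambda)$, requiring a value of $\beta$ depending on unknown moments of the distribution of $X$.

According to the PAC-Bayesian inequality of Proposition 2.1, with probability at least $1 - 2 \delta$,
\[
\int \mathbb{E} \bigl( \langle \theta', X \rangle_+ \bigr) \, d\rho_\theta(\theta') - \inf_{\lambda \in \Lambda} \Bigl\{ \lambda \int \mathbb{E} \bigl( \langle \theta', X \rangle_+^2 \bigr) \, d\rho_\theta(\theta') + \frac{\beta + 2 \log \bigl( \delta^{-1} \mu(\lambda)^{-1} \bigr)}{n \lambda} \Bigr\} \leq \mathcal{E}_+(\theta) \leq \int \mathbb{E} \bigl( \langle \theta', X \rangle_+ \bigr) \, d\rho_\theta(\theta').
\]
More precisely, to obtain the above inequalities, we have used a union bound with respect to $\lambda \in \Lambda$, starting from the fact that, when we replace the infimum in $\lambda$ in the previous equation with a fixed value of $\lambda \in \Lambda$, it holds with probability at least $1 - 2 \mu(\lambda) \delta$. Since $\int f(\theta') \, d\rho_{- \theta}(\theta') = \int f(- \theta') \, d\rho_\theta(\theta')$, this implies also that
\[
\int \mathbb{E} \bigl( \langle \theta', X \rangle_- \bigr) \, d\rho_\theta(\theta') - \inf_{\lambda \in \Lambda} \Bigl\{ \lambda \int \mathbb{E} \bigl( \langle \theta', X \rangle_-^2 \bigr) \, d\rho_\theta(\theta') + \frac{\beta + 2 \log \bigl( \delta^{-1} \mu(\lambda)^{-1} \bigr)}{n \lambda} \Bigr\} \leq \mathcal{E}_-(\theta) \leq \int \mathbb{E} \bigl( \langle \theta', X \rangle_- \bigr) \, d\rho_\theta(\theta').
\]
Therefore, with probability at least $1 - 2 \delta$,
\[
- B_-(\theta) \leq \mathcal{E}(\theta) - \langle \theta, \mathbb{E}(X) \rangle \leq B_+(\theta),
\]
where
\[
\begin{aligned}
B_-(\theta) & = \inf_{\lambda \in \Lambda} \Bigl\{ \lambda \int \mathbb{E} \bigl( \langle \theta', X \rangle_+^2 \bigr) \, d\rho_\theta(\theta') + \frac{\beta + 2 \log \bigl( \delta^{-1} \mu(\lambda)^{-1} \bigr)}{n \lambda} \Bigr\}, \\
B_+(\theta) & = \inf_{\lambda \in \Lambda} \Bigl\{ \lambda \int \mathbb{E} \bigl( \langle \theta', X \rangle_-^2 \bigr) \, d\rho_\theta(\theta') + \frac{\beta + 2 \log \bigl( \delta^{-1} \mu(\lambda)^{-1} \bigr)}{n \lambda} \Bigr\}.
\end{aligned}
\]
This defines for $\langle \theta, \mathbb{E}(X) \rangle$ a confidence interval of length no greater than
\[
B(\theta) = \inf_{\lambda \in \Lambda} \Bigl\{ \lambda \int \mathbb{E} \bigl( \langle \theta', X \rangle^2 \bigr) \, d\rho_\theta(\theta') + \frac{2 \beta + 4 \log \bigl( \delta^{-1} \mu(\lambda)^{-1} \bigr)}{n \lambda} \Bigr\}.
\]
Unfortunately, neither $B_+(\theta)$, $B_-(\theta)$ nor $B(\theta)$ are observable. But nevertheless, we can build an estimator $\widehat{m}$ such that
\[
\sup_{\theta \in S_d} \bigl\{ \langle \theta, \widehat{m} \rangle - \mathcal{E}(\theta) \bigr\} = \inf_{m \in \mathbb{R}^d} \sup_{\theta \in S_d} \bigl\{ \langle \theta, m \rangle - \mathcal{E}(\theta) \bigr\}.
\]
It satisfies with probability at least $1 - 2 \delta$
\[
\begin{aligned}
\lVert \widehat{m} - \mathbb{E}(X) \rVert & = \sup_{\theta \in S_d} \langle \theta, \widehat{m} - \mathbb{E}(X) \rangle \\
& \leq \sup_{\theta \in S_d} \bigl\{ \langle \theta, \widehat{m} \rangle - \mathcal{E}(\theta) \bigr\} + \sup_{\theta \in S_d} \bigl\{ \mathcal{E}(\theta) - \langle \theta, \mathbb{E}(X) \rangle \bigr\} \leq 2 \sup_{\theta \in S_d} \bigl\{ \mathcal{E}(\theta) - \langle \theta, \mathbb{E}(X) \rangle \bigr\} \\
& \leq 2 \sup_{\theta \in S_d} B_+(\theta) \leq \sup_{\theta \in S_d} \inf_{\lambda \in \Lambda} \Bigl\{ 2 \lambda \int \mathbb{E} \bigl( \langle \theta', X \rangle^2 \bigr) \, d\rho_\theta(\theta') + \frac{2 \beta + 4 \log \bigl( \delta^{-1} \mu(\lambda)^{-1} \bigr)}{n \lambda} \Bigr\} \\
& = \sup_{\theta \in S_d} \inf_{\lambda \in \Lambda} \Bigl\{ 2 \lambda \Bigl[ \mathbb{E} \bigl( \langle \theta, X \rangle^2 \bigr) + \frac{\mathbb{E} ( \lVert X \rVert^2 )}{\beta} \Bigr] + \frac{2 \beta + 4 \log \bigl( \delta^{-1} \mu(\lambda)^{-1} \bigr)}{n \lambda} \Bigr\}.
\end{aligned}
\]

Lemma 5.2 Let us choose $\beta = 2 \log(\delta^{-1})$ and put $v = \sup_{\theta \in S_d} \mathbb{E} ( \langle \theta, X \rangle^2 )$ and $T = \mathbb{E} ( \lVert X \rVert^2 )$. With probability at least $1 - 2 \delta$,
\[
\lVert \widehat{m} - \mathbb{E}(X) \rVert \leq \inf_{\lambda \in \Lambda} B(\lambda),
\quad \text{where} \quad
B(\lambda) = 2 \lambda \Bigl( v + \frac{T}{2 \log(\delta^{-1})} \Bigr) + \frac{8 \log(\delta^{-1}) + 4 \log \bigl( \mu(\lambda)^{-1} \bigr)}{n \lambda}.
\]

To turn this lemma into an explicit bound, we need now to choose $\Lambda$ and $\mu \in \mathcal{M}_+^1(\Lambda)$. Consider, for some real parameters $\sigma > 0$ and $\alpha > 1$,
\[
\Lambda = \Bigl\{ \frac{\alpha^k}{\sigma \sqrt{n}} : k \in \mathbb{Z} \Bigr\}.
\]
For any $\lambda_k = \alpha^k / ( \sigma \sqrt{n} ) \in \Lambda$, put
\[
\mu(\lambda_k) = \begin{cases} \dfrac{1}{2 ( \lvert k \rvert + 1 ) ( \lvert k \rvert + 2 )}, & k \neq 0, \\[1ex] 1/2, & k = 0, \end{cases}
\]
and remark that
\[
\mu(\lambda_k) \geq \frac{1}{2 ( \lvert k \rvert + 2 )^2}, \qquad k \in \mathbb{Z}.
\]
Put also
\[
\lambda_* = \sqrt{\frac{4 \log(\delta^{-1})}{n \bigl( v + T / ( 2 \log(\delta^{-1}) ) \bigr)}}.
\]
The bound $B(\lambda)$ appearing in the previous lemma can be written as
\[
B(\lambda) = 4 \sqrt{\frac{2 \bigl( 2 v \log(\delta^{-1}) + T \bigr)}{n}} \Bigl[ \cosh \Bigl( \log \frac{\lambda}{\lambda_*} \Bigr) + \frac{\lambda_*}{\lambda} \, \frac{\log \bigl( \mu(\lambda)^{-1} \bigr)}{4 \log(\delta^{-1})} \Bigr].
\]
Since $\log(\lambda_k) = k \log(\alpha) - \log(\sigma) - \log(n)/2$, there exists $k_* \in \mathbb{Z}$ such that
\[
\bigl\lvert \log ( \lambda_{k_*} / \lambda_* ) \bigr\rvert \leq \log(\alpha) / 2,
\]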

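so that
\[
\lvert k_* \rvert \leq \frac{\bigl\lvert \log ( \sigma \sqrt{n} \lambda_* ) \bigr\rvert}{\log(\alpha)} + \frac{1}{2}.
\]
Therefore
\[
\inf_{\lambda \in \Lambda} B(\lambda) \leq B(\lambda_{k_*}) \leq 4 C \sqrt{\frac{2 \bigl( 2 v \log(\delta^{-1}) + T \bigr)}{n}},
\]
where the constant $C$ is equal to
\[
C = \cosh \Bigl( \frac{\log(\alpha)}{2} \Bigr) + \frac{\sqrt{\alpha}}{2 \log(\delta^{-1})} \log \Bigl( \sqrt{2} \Bigl[ \frac{1}{2 \log(\alpha)} \Bigl\lvert \log \Bigl( \frac{2 v \log(\delta^{-1}) + T}{8 \sigma^2 \log(\delta^{-1})^2} \Bigr) \Bigr\rvert + \frac{5}{2} \Bigr] \Bigr).
\]
We see that the constant $\sigma^2$ can be interpreted as our best guess of the ratio
\[
\frac{2 v \log(\delta^{-1}) + T}{8 \log(\delta^{-1})^2}.
\]
However, this guess may be very loose without harming the constant $C$ too much. Indeed, to give an example, if we choose $\alpha = e$ and we assume that we made an error of magnitude $10^6$ on the choice of $\sigma^2$ compared to the optimal guess, we get
\[
C \leq \cosh(1/2) + \frac{\exp(1/2)}{2 \log(\delta^{-1})} \log \Bigl( \sqrt{2} \Bigl[ \frac{1}{2} \log(10^6) + \frac{5}{2} \Bigr] \Bigr) \leq 1.13 + \frac{2.2}{\log(\delta^{-1})},
\]
so that, if we work at the confidence level corresponding to $\delta = 1/100$, we obtain that $C \leq 1.6$. In brief, the message is that $C$ is typically between one and two.

5.2 Adaptive estimation of the mean of a random matrix

We consider here the same framework as in Section 4. Let $M \in \mathbb{R}^{p \times q}$ be a random matrix and $(M_1, \dots, M_n)$ be a sample made of $n$ independent copies of $M$. Using the asymmetric influence function $\psi$ defined by equation (3), given $\xi \in S_p$, $\theta \in S_q$, we define the estimators
\[
\begin{aligned}
\mathcal{E}_+(\xi, \theta) & = \sup_{\lambda \in \Lambda} \Bigl\{ \frac{1}{n \lambda} \sum_{i=1}^n \iint \psi \bigl[ \lambda \langle \xi', M_i \theta' \rangle_+ \bigr] \, d\nu_\xi(\xi') \, d\rho_\theta(\theta') - \frac{\beta + \gamma + 2 \log \bigl( \delta^{-1} \mu(\lambda)^{-1} \bigr)}{2 n \lambda} \Bigr\}, \\
\mathcal{E}_-(\xi, \theta) & = \mathcal{E}_+(- \xi, \theta) = \sup_{\lambda \in \Lambda} \Bigl\{ \frac{1}{n \lambda} \sum_{i=1}^n \iint \psi \bigl[ \lambda \langle \xi', M_i \theta' \rangle_- \bigr] \, d\nu_\xi(\xi') \, d\rho_\theta(\theta') - \frac{\beta + \gamma + 2 \log \bigl( \delta^{-1} \mu(\lambda)^{-1} \bigr)}{2 n \lambda} \Bigr\},
\end{aligned}
\]
and $\mathcal{E}(\xi, \theta) = \mathcal{E}_+(\xi, \theta) - \mathcal{E}_-(\xi, \theta)$.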

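Lemma 5.3 With probability at least $1 - 2 \delta$, for any $\xi \in S_p$, $\theta \in S_q$,
\[
- \inf_{\lambda \in \Lambda} \Bigl\{ \lambda \iint \mathbb{E} \bigl( \langle \xi', M \theta' \rangle_+^2 \bigr) \, d\nu_\xi(\xi') \, d\rho_\theta(\theta') + \frac{\beta + \gamma + 2 \log \bigl( \delta^{-1} \mu(\lambda)^{-1} \bigr)}{n \lambda} \Bigr\} \leq \mathcal{E}_+(\xi, \theta) - \iint \mathbb{E} \bigl( \langle \xi', M \theta' \rangle_+ \bigr) \, d\nu_\xi(\xi') \, d\rho_\theta(\theta') \leq 0,
\]
so that
\[
- B_+(- \xi, \theta) \leq \mathcal{E}(\xi, \theta) - \langle \xi, \mathbb{E}(M) \theta \rangle \leq B_+(\xi, \theta),
\]
and
\[
\begin{aligned}
B_+(\xi, \theta) & = \inf_{\lambda \in \Lambda} \Bigl\{ \lambda \iint \mathbb{E} \bigl( \langle \xi', M \theta' \rangle_-^2 \bigr) \, d\nu_\xi(\xi') \, d\rho_\theta(\theta') + \frac{\beta + \gamma + 2 \log \bigl( \delta^{-1} \mu(\lambda)^{-1} \bigr)}{n \lambda} \Bigr\} \\
& \leq \inf_{\lambda \in \Lambda} \Bigl\{ \lambda \Bigl[ \mathbb{E} \bigl( \langle \xi, M \theta \rangle^2 \bigr) + \frac{\mathbb{E} ( \lVert M \theta \rVert^2 )}{\beta} + \frac{\mathbb{E} ( \lVert M^\top \xi \rVert^2 )}{\gamma} + \frac{\mathbb{E} ( \lVert M \rVert_{\mathrm{HS}}^2 )}{\beta \gamma} \Bigr] + \frac{\beta + \gamma + 2 \log \bigl( \delta^{-1} \mu(\lambda)^{-1} \bigr)}{n \lambda} \Bigr\}.
\end{aligned}
\]

Choose $\beta = \gamma = \chi \log(\delta^{-1})$, with $\chi > 0$; this is the value with which the formulas below are consistent. Let
\[
\Lambda = \Bigl\{ \lambda_k = \frac{\alpha^k}{\sigma \sqrt{n}} : k \in \mathbb{Z} \Bigr\}, \qquad \mu(\lambda_k) \geq \frac{1}{2 ( \lvert k \rvert + 2 )^2},
\]
as in the previous section. Put
\[
\begin{aligned}
v & = \mathbb{E} \bigl( \langle \xi, M \theta \rangle^2 \bigr) \leq \sup_{\xi \in S_p, \, \theta \in S_q} \mathbb{E} \bigl( \langle \xi, M \theta \rangle^2 \bigr) = \overline{v}, &
t & = \mathbb{E} \bigl( \lVert M \theta \rVert^2 \bigr) \leq \sup_{\theta \in S_q} \mathbb{E} \bigl( \lVert M \theta \rVert^2 \bigr) = \overline{t}, \\
u & = \mathbb{E} \bigl( \lVert M^\top \xi \rVert^2 \bigr) \leq \sup_{\xi \in S_p} \mathbb{E} \bigl( \lVert M^\top \xi \rVert^2 \bigr) = \overline{u}, &
T & = \mathbb{E} \bigl( \lVert M \rVert_{\mathrm{HS}}^2 \bigr), \\
l & = \log(\delta^{-1}), &
\lambda_*^2 & = \frac{2 ( 1 + \chi ) l^2}{n \bigl( l v + (t + u)/\chi + T / ( l \chi^2 ) \bigr)}.
\end{aligned}
\]
Remark that, in a similar way to the case of a vector treated in the previous section,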

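\[
\begin{aligned}
B_+(\xi, \theta) & \leq \inf_{\lambda \in \Lambda} \Bigl\{ \lambda \Bigl( v + \frac{t + u}{\chi l} + \frac{T}{\chi^2 l^2} \Bigr) + \frac{2 ( \chi + 1 ) l + 2 \log \bigl( \mu(\lambda)^{-1} \bigr)}{n \lambda} \Bigr\} \\
& = \inf_{\lambda \in \Lambda} 2 \sqrt{\frac{2 ( 1 + \chi ) \bigl( l v + (t + u)/\chi + T / ( l \chi^2 ) \bigr)}{n}} \Bigl\{ \cosh \Bigl[ \log \Bigl( \frac{\lambda}{\lambda_*} \Bigr) \Bigr] + \frac{\lambda_* \log \bigl( \mu(\lambda)^{-1} \bigr)}{2 \lambda ( 1 + \chi ) l} \Bigr\}.
\end{aligned}
\]
Replacing $\lambda_*$ by its value, choosing $\lambda = \lambda_{k_*}$ such that $\lvert \log ( \lambda_{k_*} / \lambda_* ) \rvert \leq \log(\alpha)/2$, and remarking that
\[
\lvert k_* \rvert \leq \frac{\bigl\lvert \log ( \sigma \sqrt{n} \lambda_* ) \bigr\rvert}{\log(\alpha)} + \frac{1}{2},
\]
we obtain

Proposition 5.4 With probability at least $1 - 2 \delta$, for any $\xi \in S_p$, any $\theta \in S_q$,
\[
\bigl\lvert \mathcal{E}(\xi, \theta) - \langle \xi, \mathbb{E}(M) \theta \rangle \bigr\rvert \leq B(\xi, \theta) = 2 C \sqrt{\frac{2 ( 1 + \chi )}{n} \Bigl( v \log(\delta^{-1}) + \frac{t + u}{\chi} + \frac{T}{\chi^2 \log(\delta^{-1})} \Bigr)},
\]
where, using the abbreviation $l = \log(\delta^{-1})$,
\[
C = \cosh \Bigl( \frac{\log(\alpha)}{2} \Bigr) + \frac{\sqrt{\alpha}}{( 1 + \chi ) l} \log \Bigl( \sqrt{2} \Bigl[ \frac{1}{2 \log(\alpha)} \Bigl\lvert \log \Bigl( \frac{l v + (t + u)/\chi + T / ( l \chi^2 )}{2 \sigma^2 ( 1 + \chi ) l^2} \Bigr) \Bigr\rvert + \frac{5}{2} \Bigr] \Bigr).
\]
Let us now consider an estimator $\widehat{m}$ such that
\[
\sup_{\xi \in S_p, \, \theta \in S_q} \bigl\{ \mathcal{E}(\xi, \theta) - \langle \xi, \widehat{m} \theta \rangle \bigr\} = \inf_{m \in \mathbb{R}^{p \times q}} \sup_{\xi \in S_p, \, \theta \in S_q} \bigl\{ \mathcal{E}(\xi, \theta) - \langle \xi, m \theta \rangle \bigr\}.
\]
With probability at least $1 - 2 \delta$,
\[
\lVert \widehat{m} - \mathbb{E}(M) \rVert_\infty \leq 2 \sup_{\xi \in S_p, \, \theta \in S_q} B(\xi, \theta).
\]
Remark that we can bound $\sup_{\xi \in S_p, \, \theta \in S_q} B(\xi, \theta)$ by the explicit expression for $B(\xi, \theta)$ where $v$, $t$ and $u$ are replaced by their upper bounds $\overline{v}$, $\overline{t}$ and $\overline{u}$ with respect to $\xi \in S_p$ and $\theta \in S_q$. Remark also that we can weaken the influence of $T$ by choosing $\chi > 1$, but that we can reach the optimal bound for $\lVert \widehat{m} - \mathbb{E}(M) \rVert_\infty$ only if we know an upper bound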

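for the ratio $T / \overline{v}$. Indeed, if we know $T / \overline{v}$, or an upper bound of the same order of magnitude, up to a constant, we can choose
\[
\chi = \max \Bigl\{ \frac{1}{\log(\delta^{-1})} \sqrt{\frac{T}{\overline{v}}}, \, 1 \Bigr\}.
\]
In this case, with probability at least $1 - 2 \delta$,
\[
\lVert \widehat{m} - \mathbb{E}(M) \rVert_\infty \leq 8 C \sqrt{\frac{\overline{v} \log(\delta^{-1}) + \overline{t} + \overline{u} + \sqrt{\overline{v} \, T}}{n}}.
\]
Most likely, we do not know
\[
\frac{T}{\overline{v}} = \frac{\mathbb{E} \bigl( \lVert M \rVert_{\mathrm{HS}}^2 \bigr)}{\sup_{\xi \in S_p, \, \theta \in S_q} \mathbb{E} \bigl( \langle \xi, M \theta \rangle^2 \bigr)},
\]
but we can still choose $\chi$ greater than one to lower the influence of $T = \mathbb{E} ( \lVert M \rVert_{\mathrm{HS}}^2 )$ in the bound.

6 Adaptive Gram matrix estimate

We devote a section to the adaptive estimation of a Gram matrix, since it is an important subject, for applications to principal component analysis and to least squares regression. We recall that, given a random vector $X \in \mathbb{R}^d$, the Gram matrix of $X$ is defined as
\[
G = \mathbb{E} \bigl( X X^\top \bigr) \in \mathbb{R}^{d \times d}.
\]
The general approach of the previous section uses an estimator that cannot be computed explicitly without recourse to a Monte Carlo sampling algorithm. In the special case of the Gram matrix, we will produce an estimator that does not suffer from this drawback.

Consequences of what is proved in this section regarding robust principal component analysis can easily be drawn from the method exposed in Giulini [2017b]. We refer to this paper for further details. Consequences regarding least squares regression are discussed at the end of this paper.

In this section, we will use the asymmetric influence function defined by equation (3). The explicit computation of our estimator, however, will use the modified auxiliary function
\[
\varphi_2(m, \sigma) = \mathbb{E} \bigl[ \psi \bigl( ( m + \sigma W )^2 \bigr) \bigr], \qquad m \in \mathbb{R}, \ \sigma \in \mathbb{R}_+,
\]
where $W \sim \mathcal{N}(0, 1)$ is a standard Gaussian random variable. Observe that it is possible to explicitly compute the function $\varphi_2$ in terms of the Gaussian distribution function $F(a) = \mathbb{P}(W \leq a)$.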
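As an illustration of this explicit computation, the sketch below evaluates $\varphi_2$ through the closed form stated in Lemma 6.1 (as we read it) and compares it with a plain trapezoidal quadrature. The function names are chosen for this example, and the quadrature is only a cross-check, not part of the estimator.

```python
import math

def psi_plus(t):
    """Asymmetric influence function of equation (3), for t >= 0."""
    return t - t * t / 2.0 if t <= 1.0 else 0.5

def F(a):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

def G(a):
    """Standard normal density."""
    return math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)

def phi2(m, s):
    """Closed form of E[psi((m + s W)^2)] for W ~ N(0, 1)."""
    if s == 0.0:
        return psi_plus(m * m)
    poly = m * m + s * s - 0.5 * (m ** 4 + 6.0 * m * m * s * s + 3.0 * s ** 4)
    a, b = (1.0 - m) / s, (1.0 + m) / s
    r2 = (0.5 * ((m * m - 1.0) ** 2 + (6.0 * m * m - 2.0) * s * s + 3.0 * s ** 4)
          * (F(-a) + F(-b))
          + 0.5 * s * (s * s * (3.0 - 5.0 * m) - (1.0 + m) * (1.0 - m) ** 2) * G(b)
          + 0.5 * s * (s * s * (3.0 + 5.0 * m) - (1.0 - m) * (1.0 + m) ** 2) * G(a))
    return poly + r2

def phi2_quadrature(m, s, half_width=10.0, steps=20000):
    """Trapezoidal approximation of the same expectation, used as a check."""
    h = 2.0 * half_width / steps
    total = 0.0
    for j in range(steps + 1):
        w = -half_width + j * h
        weight = 0.5 if j in (0, steps) else 1.0
        total += weight * psi_plus((m + s * w) ** 2) * G(w)
    return total * h
```

When $\lvert m \rvert$ and $\sigma$ are small, the truncation at $t = 1$ is almost never active and `phi2(m, s)` is essentially the polynomial part $m^2 + \sigma^2 - \tfrac{1}{2} ( m^4 + 6 m^2 \sigma^2 + 3 \sigma^4 )$.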

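Lemma 6.1 For any $m \in \mathbb{R}$ and $\sigma \in \mathbb{R}_+$,
\[
\varphi_2(m, \sigma) = m^2 + \sigma^2 - \frac{1}{2} \bigl( m^4 + 6 m^2 \sigma^2 + 3 \sigma^4 \bigr) + r_2(m, \sigma),
\]
where
\[
\begin{aligned}
r_2(m, \sigma) = {} & \frac{1}{2} \bigl[ ( m^2 - 1 )^2 + ( 6 m^2 - 2 ) \sigma^2 + 3 \sigma^4 \bigr] \Bigl[ F \Bigl( - \frac{1 - m}{\sigma} \Bigr) + F \Bigl( - \frac{1 + m}{\sigma} \Bigr) \Bigr] \\
& + \frac{\sigma}{2 \sqrt{2 \pi}} \bigl[ \sigma^2 ( 3 - 5 m ) - ( 1 + m ) ( 1 - m )^2 \bigr] \exp \Bigl( - \frac{( 1 + m )^2}{2 \sigma^2} \Bigr) \\
& + \frac{\sigma}{2 \sqrt{2 \pi}} \bigl[ \sigma^2 ( 3 + 5 m ) - ( 1 - m ) ( 1 + m )^2 \bigr] \exp \Bigl( - \frac{( 1 - m )^2}{2 \sigma^2} \Bigr).
\end{aligned}
\]

Proof. The proof is based on the expression
\[
\psi(t) = t - t^2/2 + \mathbb{1}( t \geq 1 ) ( 1 - t )^2 / 2, \qquad t \in \mathbb{R}_+,
\]
and on the identities
\[
\begin{aligned}
\mathbb{E} \bigl[ \mathbb{1}( W \geq a ) \bigr] & = F(- a) = 1 - \mathbb{E} \bigl[ \mathbb{1}( W \leq a ) \bigr], \\
\mathbb{E} \bigl[ W \mathbb{1}( W \geq a ) \bigr] & = \frac{1}{\sqrt{2 \pi}} \exp \Bigl( - \frac{a^2}{2} \Bigr) = - \mathbb{E} \bigl[ W \mathbb{1}( W \leq a ) \bigr], \\
\mathbb{E} \bigl[ W^2 \mathbb{1}( W \geq a ) \bigr] & = \frac{a}{\sqrt{2 \pi}} \exp \Bigl( - \frac{a^2}{2} \Bigr) + F(- a) = 1 - \mathbb{E} \bigl[ W^2 \mathbb{1}( W \leq a ) \bigr], \\
\mathbb{E} \bigl[ W^3 \mathbb{1}( W \geq a ) \bigr] & = \frac{a^2 + 2}{\sqrt{2 \pi}} \exp \Bigl( - \frac{a^2}{2} \Bigr) = - \mathbb{E} \bigl[ W^3 \mathbb{1}( W \leq a ) \bigr], \\
\mathbb{E} \bigl[ W^4 \mathbb{1}( W \geq a ) \bigr] & = \frac{a^3 + 3 a}{\sqrt{2 \pi}} \exp \Bigl( - \frac{a^2}{2} \Bigr) + 3 F(- a) = 3 - \mathbb{E} \bigl[ W^4 \mathbb{1}( W \leq a ) \bigr].
\end{aligned}
\]
Let us put $G(t) = \frac{1}{\sqrt{2 \pi}} \exp ( - t^2 / 2 )$. We get
\[
\mathbb{E} \bigl\{ \psi \bigl[ ( m + \sigma W )^2 \bigr] \bigr\} = \mathbb{E} \Bigl[ ( m + \sigma W )^2 - \frac{1}{2} ( m + \sigma W )^4 \Bigr] + r_2(m, \sigma),
\]
where, writing $a = ( 1 - m ) / \sigma$ and $b = ( 1 + m ) / \sigma$,
\[
\begin{aligned}
2 r_2(m, \sigma) = {} & \mathbb{E} \Bigl\{ \bigl[ ( m + \sigma W )^4 - 2 ( m + \sigma W )^2 + 1 \bigr] \bigl[ \mathbb{1}( W \geq a ) + \mathbb{1}( W \leq - b ) \bigr] \Bigr\} \\
= {} & \mathbb{E} \Bigl\{ \bigl[ ( m^2 - 1 )^2 + 4 m ( m^2 - 1 ) \sigma W + ( 6 m^2 - 2 ) \sigma^2 W^2 + 4 m \sigma^3 W^3 + \sigma^4 W^4 \bigr] \bigl[ \mathbb{1}( W \geq a ) + \mathbb{1}( W \leq - b ) \bigr] \Bigr\} \\
= {} & ( m^2 - 1 )^2 \bigl[ F(- a) + F(- b) \bigr] + 4 m ( m^2 - 1 ) \sigma \bigl[ G(a) - G(b) \bigr] \\
& + ( 6 m^2 - 2 ) \sigma^2 \bigl[ a \, G(a) + b \, G(b) + F(- a) + F(- b) \bigr] \\
& + 4 m \sigma^3 \bigl\{ ( a^2 + 2 ) G(a) - ( b^2 + 2 ) G(b) \bigr\} \\
& + \sigma^4 \bigl\{ ( a^3 + 3 a ) G(a) + ( b^3 + 3 b ) G(b) + 3 \bigl[ F(- a) + F(- b) \bigr] \bigr\},
\end{aligned}
\]
so that
\[
\begin{aligned}
r_2(m, \sigma) = {} & \frac{1}{2} \bigl[ ( m^2 - 1 )^2 + ( 6 m^2 - 2 ) \sigma^2 + 3 \sigma^4 \bigr] \bigl[ F(- a) + F(- b) \bigr] \\
& + \frac{\sigma}{2} \bigl[ \sigma^2 ( 3 - 5 m ) - ( 1 + m ) ( 1 - m )^2 \bigr] G(b) + \frac{\sigma}{2} \bigl[ \sigma^2 ( 3 + 5 m ) - ( 1 - m ) ( 1 + m )^2 \bigr] G(a).
\end{aligned}
\]

Observe now that, when $\theta'$ is distributed according to $\rho_\theta = \mathcal{N}(\theta, \beta^{-1} I_d)$, the real valued random variable $\langle \theta', x \rangle$ is Gaussian with mean $\langle \theta, x \rangle$ and standard deviation $\lVert x \rVert / \sqrt{\beta}$. Thus we can state the following.

Lemma 6.2 For any $\theta, x \in \mathbb{R}^d$,
\[
\int \psi \bigl( \langle \theta', x \rangle^2 \bigr) \, d\rho_\theta(\theta') = \varphi_2 \Bigl( \langle \theta, x \rangle, \frac{\lVert x \rVert}{\sqrt{\beta}} \Bigr).
\]

Introduce
\[
A_{\lambda, \beta}(\theta, x) = \varphi_2 \Bigl( \lambda^{1/2} \langle \theta, x \rangle, \frac{\lVert x \rVert}{\sqrt{\beta}} \Bigr) - \log \Bigl( 1 + \frac{\lVert x \rVert^2}{\beta} \Bigr),
\]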

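where $\lambda \in \mathbb{R}_+$ is a constant modifying the norm of $\theta$. The next proposition provides some upper and lower bounds.

Proposition 6.3 With probability at least $1 - \delta$, for any $\theta \in \mathbb{R}^d$, any $\lambda \in \mathbb{R}_+$,
\[
\frac{1}{n \lambda} \sum_{i=1}^n A_{\lambda, \beta}(\theta, X_i) - \frac{\beta \lVert \theta \rVert^2}{2 n} - \frac{\log(\delta^{-1})}{n \lambda} \leq \mathbb{E} \bigl( \langle \theta, X \rangle^2 \bigr) + \frac{\mathbb{E} ( \lVert X \rVert^4 )}{\lambda \beta^2}.
\]
Moreover, with probability at least $1 - \delta$, for any $\theta \in \mathbb{R}^d$, any $\lambda \in \mathbb{R}_+$,
\[
\frac{1}{n \lambda} \sum_{i=1}^n A_{\lambda, \beta}(\theta, X_i) + \frac{\beta \lVert \theta \rVert^2}{2 n} + \frac{\log(\delta^{-1})}{n \lambda} \geq \mathbb{E} \bigl( \langle \theta, X \rangle^2 \bigr) - \lambda \mathbb{E} \bigl( \langle \theta, X \rangle^4 \bigr) - \frac{6 \mathbb{E} \bigl( \lVert X \rVert^2 \langle \theta, X \rangle^2 \bigr)}{\beta} - \frac{3 \mathbb{E} ( \lVert X \rVert^4 )}{\lambda \beta^2}.
\]

Proof. According to Proposition 2.1, with probability at least $1 - \delta$, for any $\theta \in \mathbb{R}^d$ and any $\lambda \in \mathbb{R}_+$,
\[
\frac{1}{n \lambda} \sum_{i=1}^n \Bigl[ \int \psi \bigl( \langle \theta', X_i \rangle^2 \bigr) \, d\rho_{\lambda^{1/2} \theta}(\theta') - \log \Bigl( 1 + \frac{\lVert X_i \rVert^2}{\beta} \Bigr) \Bigr] - \frac{\beta \lVert \theta \rVert^2}{2 n}
\leq \frac{1}{\lambda} \int \log \Bigl\{ \mathbb{E} \Bigl[ \exp \Bigl( \psi \bigl( \langle \theta', X \rangle^2 \bigr) - \log \Bigl( 1 + \frac{\lVert X \rVert^2}{\beta} \Bigr) \Bigr) \Bigr] \Bigr\} \, d\rho_{\lambda^{1/2} \theta}(\theta') + \frac{\log(\delta^{-1})}{n \lambda}.
\]
According to Lemma 5.1,
\[
\psi(t) - \log(1 + u) \leq \log \Bigl( \frac{1 + t}{1 + u} \Bigr) = \log \Bigl( 1 - u + \frac{t + u^2}{1 + u} \Bigr) \leq \log \bigl( 1 + t - u + u^2 \bigr), \qquad t, u \in \mathbb{R}_+.
\]
Thus the right-hand side of the previous inequality is not greater than
\[
\frac{1}{\lambda} \int \mathbb{E} \bigl( \langle \theta', X \rangle^2 \bigr) \, d\rho_{\lambda^{1/2} \theta}(\theta') - \frac{\mathbb{E} ( \lVert X \rVert^2 )}{\lambda \beta} + \frac{\mathbb{E} ( \lVert X \rVert^4 )}{\lambda \beta^2} + \frac{\log(\delta^{-1})}{n \lambda} = \mathbb{E} \bigl( \langle \theta, X \rangle^2 \bigr) + \frac{\mathbb{E} ( \lVert X \rVert^4 )}{\lambda \beta^2} + \frac{\log(\delta^{-1})}{n \lambda}.
\]
In the same time, due to Lemma 6.2, its left-hand side is equal to
\[
\frac{1}{n \lambda} \sum_{i=1}^n A_{\lambda, \beta}(\theta, X_i) - \frac{\beta \lVert \theta \rVert^2}{2 n}.
\]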

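This achieves the proof for the upper bound. Let us now come to the lower bound. As a consequence of Lemma 5.1, for any $t \in [0, 1]$ and any $y \in \mathbb{R}_+$,
\[
- \psi(t) + \log(1 + y) \leq \log \bigl( 1 - t + t^2 \bigr) + \log(1 + y) = \log \bigl( 1 - t + t^2 + ( 1 - t + t^2 ) y \bigr) \leq \log \bigl( 1 - t + t^2 + y \bigr).
\]
When $t \in [1, \infty[$, the same inequality is also obviously true:
\[
- \psi(t) + \log(1 + y) \leq \log(1 + y) \leq \log \bigl( 1 - t + t^2 + y \bigr).
\]
As a consequence, for any $x \in \mathbb{R}^d$,
\[
- \int \psi \bigl( \langle \theta', x \rangle^2 \bigr) \, d\rho_\theta(\theta') + \log \Bigl( 1 + \frac{\lVert x \rVert^2}{\beta} \Bigr) \leq \int \log \Bigl( 1 - \langle \theta', x \rangle^2 + \langle \theta', x \rangle^4 + \frac{\lVert x \rVert^2}{\beta} \Bigr) \, d\rho_\theta(\theta').
\]
Thus, according to the PAC-Bayesian inequality stated in Proposition 2.1, with probability at least $1 - \delta$, for any $\theta \in \mathbb{R}^d$ and any $\lambda \in \mathbb{R}_+$,
\[
\begin{aligned}
\frac{1}{n \lambda} \sum_{i=1}^n \Bigl[ - \int \psi \bigl( \langle \theta', X_i \rangle^2 \bigr) \, d\rho_{\lambda^{1/2} \theta}(\theta') & + \log \Bigl( 1 + \frac{\lVert X_i \rVert^2}{\beta} \Bigr) \Bigr] - \frac{\beta \lVert \theta \rVert^2}{2 n} - \frac{\log(\delta^{-1})}{n \lambda} \\
& \leq \frac{1}{\lambda} \int \log \Bigl\{ \mathbb{E} \Bigl[ \exp \Bigl( - \psi \bigl( \langle \theta', X \rangle^2 \bigr) + \log \Bigl( 1 + \frac{\lVert X \rVert^2}{\beta} \Bigr) \Bigr) \Bigr] \Bigr\} \, d\rho_{\lambda^{1/2} \theta}(\theta') \\
& \leq \frac{1}{\lambda} \int \mathbb{E} \Bigl( - \langle \theta', X \rangle^2 + \langle \theta', X \rangle^4 + \frac{\lVert X \rVert^2}{\beta} \Bigr) \, d\rho_{\lambda^{1/2} \theta}(\theta').
\end{aligned}
\]
To conclude the proof, it is enough to use the explicit expression of the moments of a Gaussian random variable, remembering that, when $\theta'$ is distributed according to $\rho_{\lambda^{1/2} \theta}$, the distribution of $\langle \theta', X \rangle$ given $X$ is equal to $\mathcal{N} \bigl( \lambda^{1/2} \langle \theta, X \rangle, \lVert X \rVert^2 / \beta \bigr)$.

The next proposition defines an estimator of the quadratic form $\mathbb{E} ( \langle \theta, X \rangle^2 )$. Note that, since we introduced a parameter $\lambda$ that takes care of the norm of $\theta$, we will assume in the following, without loss of generality, that $\theta \in S_d$, the unit sphere of $\mathbb{R}^d$.

Proposition 6.4 Let us assume that
\[
\mathbb{E} \bigl( \lVert X \rVert^4 \bigr) \leq T < \infty
\]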

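for a known constant $T$. For any $\theta \in S_d$, consider the estimator of $\mathbb{E} ( \langle \theta, X \rangle^2 )$ defined as
\[
\mathcal{E}(\theta) = \sup_{\lambda \in \mathbb{R}_+} \Bigl\{ \frac{1}{n \lambda} \sum_{i=1}^n A_{\lambda, \beta}(\theta, X_i) - \frac{\beta}{2 n} - \frac{\log(\delta^{-1})}{n \lambda} - \frac{T}{\lambda \beta^2} \Bigr\}.
\]
With probability at least $1 - \delta$, for any $\theta \in S_d$,
\[
\mathcal{E}(\theta) \leq \mathbb{E} \bigl( \langle \theta, X \rangle^2 \bigr).
\]
Moreover, with probability at least $1 - \delta$, for any $\theta \in S_d$,
\[
\mathbb{E} \bigl( \langle \theta, X \rangle^2 \bigr) \leq \mathcal{E}(\theta) + 2 \sqrt{2 \mathbb{E} \bigl( \langle \theta, X \rangle^4 \bigr) \Bigl( \frac{2 T}{\beta^2} + \frac{\log(\delta^{-1})}{n} \Bigr)} + \frac{6 \mathbb{E} \bigl( \lVert X \rVert^2 \langle \theta, X \rangle^2 \bigr)}{\beta} + \frac{\beta}{n}.
\]

Remark 6.1 Introducing $\alpha = 2 n T / \beta^2$, we can also express the previous bound as
\[
\begin{aligned}
\mathbb{E} \bigl( \langle \theta, X \rangle^2 \bigr) & \leq \mathcal{E}(\theta) + 2 \sqrt{\frac{2 \mathbb{E} ( \langle \theta, X \rangle^4 ) \bigl[ \alpha + \log(\delta^{-1}) \bigr]}{n}} + 3 \sqrt{\frac{2 \alpha}{n T}} \, \mathbb{E} \bigl( \langle \theta, X \rangle^2 \lVert X \rVert^2 \bigr) + \sqrt{\frac{2 T}{n \alpha}} \\
& \leq \mathcal{E}(\theta) + \Bigl[ 2 \sqrt{2 \mathbb{E} ( \langle \theta, X \rangle^4 )} + 3 \sqrt{\frac{2}{T}} \, \mathbb{E} \bigl( \langle \theta, X \rangle^2 \lVert X \rVert^2 \bigr) \Bigr] \sqrt{\frac{\alpha}{n}} + \sqrt{\frac{2 T}{n \alpha}} + 2 \sqrt{\frac{2 \mathbb{E} ( \langle \theta, X \rangle^4 ) \log(\delta^{-1})}{n}} \\
& \leq \mathcal{E}(\theta) + 5 \sqrt{\frac{2 \alpha \mathbb{E} ( \langle \theta, X \rangle^4 )}{n}} + \sqrt{\frac{2 T}{n \alpha}} + 2 \sqrt{\frac{2 \mathbb{E} ( \langle \theta, X \rangle^4 ) \log(\delta^{-1})}{n}},
\end{aligned}
\]
where the last inequality is a consequence of the Cauchy-Schwarz inequality
\[
\mathbb{E} \bigl( \langle \theta, X \rangle^2 \lVert X \rVert^2 \bigr) \leq \sqrt{\mathbb{E} \bigl( \langle \theta, X \rangle^4 \bigr) \mathbb{E} \bigl( \lVert X \rVert^4 \bigr)} \leq \sqrt{T \, \mathbb{E} \bigl( \langle \theta, X \rangle^4 \bigr)}.
\]

Proof. Proposition 6.4 follows from Proposition 6.3 and the definition of the estimator $\mathcal{E}$. To get the second inequality, observe that the value of $\lambda$ minimizing
\[
\lambda \mathbb{E} \bigl( \langle \theta, X \rangle^4 \bigr) + \frac{6 \mathbb{E} \bigl( \lVert X \rVert^2 \langle \theta, X \rangle^2 \bigr)}{\beta} + \frac{4 T}{\lambda \beta^2} + \frac{\beta}{n} + \frac{2 \log(\delta^{-1})}{n \lambda}
\]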

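is given by
\[
\lambda = \sqrt{2 \, \mathbb{E} \bigl( \langle \theta, X \rangle^4 \bigr)^{-1} \Bigl( \frac{2 T}{\beta^2} + \frac{\log(\delta^{-1})}{n} \Bigr)}.
\]

In the following proposition, we make the estimator adaptive in $\alpha$ as well as in $\lambda$, and we introduce our estimator $\widehat{G}$ of the Gram matrix $G$.

Proposition 6.5 Let us assume that $\mathbb{E} ( \lVert X \rVert^4 ) \leq T < \infty$, where $T$ is a known constant. Consider the estimator
\[
\begin{aligned}
\widetilde{\mathcal{E}}(\theta) = \sup_{\lambda \in \mathbb{R}_+} \sup_{k \in \mathbb{N}} \Bigl\{ & \frac{1}{n \lambda} \sum_{i=1}^n \Bigl[ \varphi_2 \Bigl( \sqrt{\lambda} \langle \theta, X_i \rangle, \Bigl( \frac{\exp(k)}{10 n T} \Bigr)^{1/4} \lVert X_i \rVert \Bigr) - \log \Bigl( 1 + \sqrt{\frac{\exp(k)}{10 n T}} \, \lVert X_i \rVert^2 \Bigr) \Bigr] \\
& - \sqrt{\frac{5 T}{2 n \exp(k)}} - \frac{\exp(k)}{10 n \lambda} - \frac{\log \bigl[ (k + 1)(k + 2) / \delta \bigr]}{n \lambda} \Bigr\}, \qquad \theta \in S_d.
\end{aligned}
\]
With probability at least $1 - \delta$, for any $\theta \in S_d$,
\[
\widetilde{\mathcal{E}}(\theta) \leq \mathbb{E} \bigl( \langle \theta, X \rangle^2 \bigr).
\]
With probability at least $1 - \delta$, for any $\theta \in S_d$,
\[
\mathbb{E} \bigl( \langle \theta, X \rangle^2 \bigr) \leq \widetilde{\mathcal{E}}(\theta) + \frac{B(\theta)}{\sqrt{n}},
\]
where
\[
B(\theta) = 2 \sqrt{\mathbb{E} \bigl( \langle \theta, X \rangle^4 \bigr)} \Biggl( 3.3 \Bigl( \frac{T}{\mathbb{E} ( \langle \theta, X \rangle^4 )} \Bigr)^{1/4} + \sqrt{4 \log \Bigl( \frac{1}{2} \log \Bigl( \frac{T}{\mathbb{E} ( \langle \theta, X \rangle^4 )} \Bigr) + \frac{5}{2} \Bigr) + 2 \log(\delta^{-1})} \Biggr).
\]
Consider an estimator $\widehat{G} \in \mathbb{R}^{d \times d}$ such that
\[
0 \leq \inf_{\theta \in S_d} \bigl( \langle \theta, \widehat{G} \theta \rangle - \widetilde{\mathcal{E}}(\theta) \bigr)
\]
and
\[
\sup_{\theta \in S_d} \bigl( \langle \theta, \widehat{G} \theta \rangle - \widetilde{\mathcal{E}}(\theta) \bigr) = \inf \Bigl\{ \sup_{\theta \in S_d} \bigl( \langle \theta, M \theta \rangle - \widetilde{\mathcal{E}}(\theta) \bigr) : M \in \mathbb{R}^{d \times d}, \ M^\top = M, \ 0 \leq \inf_{\theta \in S_d} \bigl( \langle \theta, M \theta \rangle - \widetilde{\mathcal{E}}(\theta) \bigr) \Bigr\}.
\]
With probability at least $1 - 2 \delta$,
\[
\lVert \widehat{G} - G \rVert_\infty \leq \sup_{\theta \in S_d} \frac{B(\theta)}{\sqrt{n}}.
\]

Remark 6.2 It is interesting to rephrase this result in terms of the directional kurtosis
\[
\kappa(\theta) = \begin{cases} \dfrac{\mathbb{E} \bigl( \langle \theta, X \rangle^4 \bigr)}{\mathbb{E} \bigl( \langle \theta, X \rangle^2 \bigr)^2}, & \mathbb{E} \bigl( \langle \theta, X \rangle^2 \bigr) > 0, \\[1ex] 1, & \text{otherwise.} \end{cases}
\]
We obtain with probability at least $1 - 2 \delta$
\[
1 - 2 \sqrt{\frac{\kappa(\theta)}{n}} \Biggl( 3.3 \Bigl( \frac{T}{\kappa(\theta) \mathbb{E} ( \langle \theta, X \rangle^2 )^2} \Bigr)^{1/4} + \sqrt{4 \log \Bigl( \frac{1}{2} \log \Bigl( \frac{T}{\kappa(\theta) \mathbb{E} ( \langle \theta, X \rangle^2 )^2} \Bigr) + \frac{5}{2} \Bigr) + 2 \log(\delta^{-1})} \Biggr) \leq \frac{\widetilde{\mathcal{E}}(\theta)}{\mathbb{E} \bigl( \langle \theta, X \rangle^2 \bigr)} \leq 1,
\]
with the appropriate convention that $r/0 = + \infty$ when $r > 0$ and $0/0 = 1$. This inequality shows under which circumstances it is possible to estimate the order of magnitude of $\mathbb{E} ( \langle \theta, X \rangle^2 )$, and consequently the eigenvalues of the Gram matrix $G$. Indeed, introducing $\kappa = \sup_{\theta \in S_d} \kappa(\theta)$, we deduce with probability at least $1 - 2 \delta$ a bound of the form
\[
1 - \frac{f \bigl( \kappa, \mathbb{E} ( \langle \theta, X \rangle^2 ) \bigr)}{\sqrt{n}} \leq \frac{\widetilde{\mathcal{E}}(\theta)}{\mathbb{E} \bigl( \langle \theta, X \rangle^2 \bigr)} \leq 1,
\]
where the function $F_\kappa(\tau) = \tau \bigl( 1 - f(\kappa, \tau) / \sqrt{n} \bigr)$ is non-decreasing. Let us write $G = \mathbb{E} ( X X^\top )$ as
\[
G = \sum_{i=1}^d \sigma_i \, e_i e_i^\top,
\]
where $(e_1, \dots, e_d)$ is an orthonormal basis of eigenvectors and where $\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_d$ are the eigenvalues of $G$, counted with their multiplicities and sorted in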

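decreasing order. Introducing $\mathfrak{L}_i$, the set of all linear subspaces of $\mathbb{R}^d$ of dimension $i$, it is well known that
\[ \mu_i = \sup\Bigl\{ \inf\bigl\{ \langle\theta,G\theta\rangle : \theta\in L\cap\mathbb{S}_d \bigr\} : L\in\mathfrak{L}_i \Bigr\}. \]
A proof can for instance be found in Kato (1982, page 62). Based on this formula, we can introduce the estimator
\[ \widetilde{\mu}_i = \sup\Bigl\{ \inf\bigl\{ \widetilde{E}(\theta) : \theta\in L\cap\mathbb{S}_d \bigr\} : L\in\mathfrak{L}_i \Bigr\}. \]
It is such that
\[ F_\kappa(\mu_i) = F_\kappa\Bigl(\sup\Bigl\{ \inf\bigl\{ \langle\theta,G\theta\rangle : \theta\in L\cap\mathbb{S}_d \bigr\} : L\in\mathfrak{L}_i \Bigr\}\Bigr) = \sup\Bigl\{ \inf\bigl\{ F_\kappa(\langle\theta,G\theta\rangle) : \theta\in L\cap\mathbb{S}_d \bigr\} : L\in\mathfrak{L}_i \Bigr\} \le \widetilde{\mu}_i \le \sup\Bigl\{ \inf\bigl\{ \langle\theta,G\theta\rangle : \theta\in L\cap\mathbb{S}_d \bigr\} : L\in\mathfrak{L}_i \Bigr\} = \mu_i, \]
proving that, with probability at least $1-2\delta$,
\[ 1 - \frac{f_\kappa(\mu_i)}{\sqrt{n}} \le \frac{\widetilde{\mu}_i}{\mu_i} \le 1, \qquad 1\le i\le d. \]

Proof of Proposition 6.5. The optimal value of $\alpha$ in the last bound given in Remark 6.1 is
\[ \alpha_* = \frac{1}{5}\sqrt{\frac{T}{\mathbb{E}(\langle\theta,X\rangle^4)}} \ge \frac{1}{5}\sqrt{\frac{\mathbb{E}(\|X\|^4)}{\mathbb{E}(\langle\theta,X\rangle^4)}} \ge \frac{1}{5}. \]
According to the simplified inequality stated at the end of Remark 6.1, with probability at least $1-\delta$, for any $\theta\in\mathbb{S}_d$,
\[
\begin{aligned}
\mathbb{E}\bigl(\langle\theta,X\rangle^2\bigr) &\le \widehat{E}(\theta) + 2\sqrt{\frac{10}{n}}\,\bigl[T\,\mathbb{E}(\langle\theta,X\rangle^4)\bigr]^{1/4}\,\frac{\sqrt{\alpha/\alpha_*}+\sqrt{\alpha_*/\alpha}}{2} + 2\sqrt{\frac{2\,\mathbb{E}(\langle\theta,X\rangle^4)\log(\delta^{-1})}{n}} \\
&= \widehat{E}(\theta) + 2\sqrt{\frac{10}{n}}\,\bigl[T\,\mathbb{E}(\langle\theta,X\rangle^4)\bigr]^{1/4}\cosh\Bigl(\frac{1}{2}\log(\alpha/\alpha_*)\Bigr) + 2\sqrt{\frac{2\,\mathbb{E}(\langle\theta,X\rangle^4)\log(\delta^{-1})}{n}}.
\end{aligned}
\]
We will take a weighted union bound on all values of $\alpha$ belonging to $\{\exp(k)/5 : k\in\mathbb{N}\}$. To perform this, we have to modify accordingly the definition of the estimator and consider the estimator $\widetilde{E}$ defined in the proposition. In this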

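change of definition, we have replaced β with 10 T expk and $\delta$ with $\delta/[(k+1)(k+2)]$, and we have taken the supremum in $k\in\mathbb{N}$ as well as in $\lambda\in\mathbb{R}_+$. As
\[ \sum_{k\in\mathbb{N}} \frac{\delta}{(k+1)(k+2)} = \delta, \]
we get from Proposition 6.3 that, with probability at least $1-\delta$, for any $\theta\in\mathbb{S}_d$,
\[ \widetilde{E}(\theta) \le \mathbb{E}\bigl(\langle\theta,X\rangle^2\bigr). \]
Recalling that α = 2T β 2 = $\exp(k)/5$, we get, with probability at least $1-\delta$, for any $\theta\in\mathbb{S}_d$,
\[ \mathbb{E}\bigl(\langle\theta,X\rangle^2\bigr) \le \widetilde{E}(\theta) + \inf_{k\in\mathbb{N}}\Biggl\{ 2\sqrt{\frac{10}{n}}\,\bigl[T\,\mathbb{E}(\langle\theta,X\rangle^4)\bigr]^{1/4}\cosh\biggl(\frac{1}{2}\log\Bigl(\frac{\exp(k)}{5\alpha_*}\Bigr)\biggr) + 2\sqrt{\frac{2\,\mathbb{E}(\langle\theta,X\rangle^4)\log\bigl[(k+2)^2/\delta\bigr]}{n}} \Biggr\}. \]
We can take the infimum in $k$ because the inequality holds with probability $1-\delta$ for any value of $k\in\mathbb{N}$. We can now choose $k$ to be the closest integer to $\log(5\alpha_*)$, which is known to be a non-negative quantity. It is such that
\[ \Bigl|\log\Bigl(\frac{\exp(k)}{5\alpha_*}\Bigr)\Bigr| \le \frac{1}{2}, \]
and therefore
\[ k+2 \le \log(5\alpha_*) + \frac{5}{2} = \frac{1}{2}\log\biggl(\frac{T}{\mathbb{E}(\langle\theta,X\rangle^4)}\biggr) + \frac{5}{2}. \]
Remarking that $\sqrt{10}\cosh(1/4) \le 3.3$ ends the proof.

7 Linear least squares regression

Consider a couple of random variables $(X,Y)\in\mathbb{R}^d\times\mathbb{R}$ whose distribution is assumed to be unknown. Let $(X_1,Y_1),\dots,(X_n,Y_n)$ be an observed sample made of independent copies of $(X,Y)$. In this section, we consider the question of estimating
\[ \inf_{\theta\in\mathbb{R}^d} \mathbb{E}\bigl[(\langle\theta,X\rangle - Y)^2\bigr]. \]
Introduce the Gram matrix
\[ G = \mathbb{E}(XX^{\top}) \in \mathbb{R}^{d\times d}, \]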

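the vector
\[ V = \mathbb{E}(YX) \in \mathbb{R}^d \]
and the risk function
\[ R(\theta) = \langle\theta,G\theta\rangle - 2\langle\theta,V\rangle, \qquad \theta\in\mathbb{R}^d. \]
Remark that
\[ \mathbb{E}\bigl[(\langle\theta,X\rangle - Y)^2\bigr] = \mathbb{E}(Y^2) + R(\theta), \qquad \theta\in\mathbb{R}^d, \]
so that minimizing the quadratic loss is equivalent to minimizing $R$. We have seen in the previous sections various methods to estimate $G$ and $V$. As a straightforward consequence, we state a first result concerning the minimization over a bounded domain.

Proposition 7.1 Assume that $\widehat{G}\in\mathbb{R}^{d\times d}$ and $\widehat{V}\in\mathbb{R}^d$ are such that
\[ \|\widehat{G}-G\| \le \epsilon \quad\text{and}\quad \|\widehat{V}-V\| \le \eta. \tag{4} \]
Assume also that $\widehat{G}$ is a symmetric positive semi-definite matrix. Let $\Theta$ be a closed bounded set in $\mathbb{R}^d$ and let $B = \sup_{\theta\in\Theta}\|\theta\|$. Consider the estimated risk
\[ \widehat{R}(\theta) = \langle\theta,\widehat{G}\theta\rangle - 2\langle\theta,\widehat{V}\rangle, \qquad \theta\in\mathbb{R}^d, \]
and an estimator $\widehat{\theta}\in\arg\min_{\Theta}\widehat{R}$. It is such that
\[ R(\widehat{\theta}) - \inf_{\Theta} R \le 2B(\epsilon B + 2\eta). \]

Proof. Remark that
\[ R(\widehat{\theta}) \le \widehat{R}(\widehat{\theta}) + B^2\epsilon + 2B\eta = \inf_{\theta\in\Theta}\widehat{R}(\theta) + B^2\epsilon + 2B\eta \le \inf_{\theta\in\Theta}R(\theta) + 2B^2\epsilon + 4B\eta. \]

Corollary 7.2 Assume that we know constants $v$, $T$, $v'$, $T'$ such that
\[ \sup_{\theta\in\mathbb{S}_d}\mathbb{E}\bigl(\langle\theta,X\rangle^4\bigr) \le v < \infty, \qquad \mathbb{E}\bigl(\|X\|^4\bigr) \le T < \infty, \qquad \sup_{\theta\in\mathbb{S}_d}\mathbb{E}\bigl(Y^2\langle\theta,X\rangle^2\bigr) \le v' < \infty, \]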

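\[ \mathbb{E}\bigl(Y^2\|X\|^2\bigr) \le T' < \infty. \]
Using Propositions 3.3 and 4.2, we can define estimators $\widehat{G}$ and $\widehat{V}$ such that, with probability at least $1-2\delta$, $\|\widehat{G}-G\| \le \epsilon$ and $\|\widehat{V}-V\| \le \eta$, where

ɛ = 2v 2 2 logδ 1 + 12 T/v and η = 2 T / + 2v logδ 1 /

Consequently, the estimator $\widehat{\theta}$ of the previous proposition based on $\widehat{G}$ and $\widehat{V}$ is such that, with probability at least $1-2\delta$,
\[ R(\widehat{\theta}) - \inf_{\Theta} R \le O\Bigl(\sqrt{\log(\delta^{-1})/n}\Bigr), \]
where the constant hidden behind the notation $O$ depends only on $v$, $T$, $v'$, $T'$ and $\sup_{\theta\in\Theta}\|\theta\|$.

Remark 7.1 We get only a slow speed of order $n^{-1/2}$ and not $n^{-1}$, but we think it is the price to pay to have a dimension-free bound under such hypotheses. In the following, we will release the constraint that $\theta$ belongs to a bounded domain. We will also propose conditions under which a fast rate of order $O(\log(\delta^{-1})/n)$ is possible. We will be interested first in defining some non-asymptotic confidence region for
\[ \theta_* \in \arg\min_{\theta\in\mathbb{R}^d} R(\theta). \]
We will broaden our analysis to the estimation of the ridge regression
\[ \theta_\lambda \in \arg\min_{\theta\in\mathbb{R}^d} R(\theta) + \lambda\|\theta\|^2, \]
since this extension is quite natural in this context. Indeed, the ridge regression problem consists in minimizing $R$ on a ball centered at the origin, and ridge regressors, as we will see, will anyhow play a role in the definition of a robust estimator.

Proposition 7.3 Make the same assumptions as at the beginning of Proposition 7.1 and consider some parameter $\lambda\in\mathbb{R}_+$. Introduce the ridge regression loss function and its empirical counterpart
\[ R_\lambda(\theta) = R(\theta) + \lambda\|\theta\|^2 = \langle\theta,(G+\lambda I)\theta\rangle - 2\langle\theta,V\rangle, \]
\[ \widehat{R}_\lambda(\theta) = \widehat{R}(\theta) + \lambda\|\theta\|^2 = \langle\theta,(\widehat{G}+\lambda I)\theta\rangle - 2\langle\theta,\widehat{V}\rangle. \]
Let $\theta_\lambda\in\arg\min_{\theta\in\mathbb{R}^d}R_\lambda(\theta)$ and $\widehat{\theta}_\lambda\in\arg\min_{\theta\in\mathbb{R}^d}\widehat{R}_\lambda(\theta)$. Define the confidence region
\[ \widehat{\Theta}_\lambda = \bigl\{ \theta\in\mathbb{R}^d : \|(\widehat{G}+\lambda I)(\theta-\widehat{\theta}_\lambda)\| \le \epsilon\|\theta\| + \eta \bigr\}. \]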

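On the event defined by equation (4), $\theta_\lambda\in\widehat{\Theta}_\lambda$. Moreover, for any estimator $\widetilde{\theta}\notin\widehat{\Theta}_\lambda$, the improved pick
\[ \widehat{\theta} \in \arg\min_{\theta\in\mathbb{R}^d}\Bigl\{ \widehat{R}_\lambda(\theta) - \widehat{R}_\lambda(\widetilde{\theta}) + \epsilon\|\theta-\widetilde{\theta}\|^2 + 2\|\theta-\widetilde{\theta}\|\bigl(\epsilon\|\widetilde{\theta}\|+\eta\bigr) \Bigr\} \]
is such that
\[ R_\lambda(\widehat{\theta}) < R_\lambda(\widetilde{\theta}), \]
and more precisely such that
\[ R_\lambda(\widehat{\theta}) - R_\lambda(\widetilde{\theta}) \le \widehat{R}_\lambda(\widehat{\theta}) - \widehat{R}_\lambda(\widetilde{\theta}) + \|\widehat{\theta}-\widetilde{\theta}\|\bigl(\epsilon\|\widehat{\theta}+\widetilde{\theta}\| + 2\eta\bigr) < 0. \]

Proof. Note that for any $\theta,\xi\in\mathbb{R}^d$,
\[
\begin{aligned}
R_\lambda(\xi) - R_\lambda(\theta) &= \langle\xi-\theta,(G+\lambda I)(\xi+\theta)\rangle - 2\langle\xi-\theta,V\rangle \\
&\le \widehat{R}_\lambda(\xi) - \widehat{R}_\lambda(\theta) + \|\xi-\theta\|\bigl(\epsilon\|\xi+\theta\| + 2\eta\bigr) \\
&\le \widehat{R}_\lambda(\xi) - \widehat{R}_\lambda(\theta) + \epsilon\|\xi-\theta\|^2 + 2\|\xi-\theta\|\bigl(\epsilon\|\theta\|+\eta\bigr) \overset{\text{def}}{=} \gamma_\lambda(\theta,\xi).
\end{aligned}
\]
As $\xi\mapsto\gamma_\lambda(\theta,\xi)$ is strictly convex,
\[ \inf_{\xi\in\mathbb{R}^d}\gamma_\lambda(\theta,\xi) = 0 = \gamma_\lambda(\theta,\theta) \]
if and only if its subdifferential satisfies
\[ 0 \in \partial_\xi\big|_{\xi=\theta}\,\gamma_\lambda(\theta,\xi) = 2(\widehat{G}+\lambda I)\theta - 2\widehat{V} + 2B_d\bigl(\epsilon\|\theta\|+\eta\bigr), \]
where $B_d$ is the unit ball of $\mathbb{R}^d$. Remarking that $\widehat{V} = (\widehat{G}+\lambda I)\widehat{\theta}_\lambda$, we see that this is equivalent to
\[ \|(\widehat{G}+\lambda I)(\theta-\widehat{\theta}_\lambda)\| \le \epsilon\|\theta\| + \eta. \]
To complete the proof, it is enough to remark that, due to its definition,
\[ 0 \ge \inf_{\xi\in\mathbb{R}^d}\gamma_\lambda(\theta_\lambda,\xi) \ge \inf_{\xi\in\mathbb{R}^d}R_\lambda(\xi) - R_\lambda(\theta_\lambda) = 0, \]
so that $\theta_\lambda\in\widehat{\Theta}_\lambda$.

Note that $\widehat{\theta}$ is the solution of a strictly convex minimization problem. It is characterized by the equation
\[ (\widehat{G}+\lambda I)\widehat{\theta} - \widehat{V} + \epsilon(\widehat{\theta}-\widetilde{\theta}) + \frac{\widehat{\theta}-\widetilde{\theta}}{\|\widehat{\theta}-\widetilde{\theta}\|}\bigl(\epsilon\|\widetilde{\theta}\|+\eta\bigr) = 0. \]
In view of the shape of the confidence region, it is natural to consider the estimator
\[ \widetilde{\theta}_\lambda \in \arg\min_{\theta\in\widehat{\Theta}_\lambda}\|\theta\|. \]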
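The membership test that defines the confidence region of Proposition 7.3 can be illustrated numerically. The following is a minimal sketch in dimension 2, not the authors' implementation: the matrices, vectors and the values of the error levels and of the ridge parameter are hypothetical numbers chosen so that the two error bounds of equation (4) hold, and all function names are illustrative.

```python
import math

def mat_vec(a, x):
    """Product of a 2x2 matrix (nested lists) with a 2-vector."""
    return [a[0][0] * x[0] + a[0][1] * x[1],
            a[1][0] * x[0] + a[1][1] * x[1]]

def shifted(g, lam):
    """Return g + lam * I for a 2x2 matrix g."""
    return [[g[0][0] + lam, g[0][1]], [g[1][0], g[1][1] + lam]]

def solve(a, b):
    """Solve the 2x2 linear system a x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]

def norm(x):
    """Euclidean norm of a 2-vector."""
    return math.hypot(x[0], x[1])

def ridge(g, v, lam):
    """Minimizer of <theta, (g + lam I) theta> - 2 <theta, v>."""
    return solve(shifted(g, lam), v)

def in_region(theta, g_hat, v_hat, lam, eps, eta):
    """Check ||(g_hat + lam I)(theta - theta_hat)|| <= eps ||theta|| + eta,
    where theta_hat is the empirical ridge solution."""
    theta_hat = ridge(g_hat, v_hat, lam)
    diff = [theta[0] - theta_hat[0], theta[1] - theta_hat[1]]
    lhs = norm(mat_vec(shifted(g_hat, lam), diff))
    return lhs <= eps * norm(theta) + eta

# Hypothetical population quantities and perturbed estimates (illustrative only).
G = [[2.0, 0.5], [0.5, 1.0]]
V = [1.0, -1.0]
EPS, ETA, LAM = 0.05, 0.05, 0.1
G_HAT = [[2.05, 0.5], [0.5, 0.95]]  # G + diag(0.05, -0.05): operator-norm error 0.05
V_HAT = [1.03, -0.96]               # V + (0.03, 0.04): Euclidean error 0.05

theta_lam = ridge(G, V, LAM)        # population ridge solution
print(in_region(theta_lam, G_HAT, V_HAT, LAM, EPS, ETA))   # population solution is covered
print(in_region([5.0, 5.0], G_HAT, V_HAT, LAM, EPS, ETA))  # a far-away point is rejected
```

The first check reflects the statement that the population ridge solution belongs to the region whenever the two error bounds hold; the second shows that the region excludes points far from the empirical ridge solution.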

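Proposition 7.4 Let $\xi\in\widehat{\Theta}_\lambda$ be any parameter value within the above defined confidence region. Under the event defined by equation (4), it is such that
\[ \|(G+\lambda I)(\xi-\theta_\lambda)\| \le 2\bigl(\epsilon\|\xi\|+\eta\bigr). \]
In particular, since $\theta_\lambda\in\widehat{\Theta}_\lambda$, we see from the definition of $\widetilde{\theta}_\lambda$ that $\|\widetilde{\theta}_\lambda\| \le \|\theta_\lambda\|$, and therefore that
\[ \|(G+\lambda I)(\widetilde{\theta}_\lambda-\theta_\lambda)\|^2 \le 4\bigl(\epsilon\|\widetilde{\theta}_\lambda\|+\eta\bigr)^2 \le 4\bigl(\epsilon\|\theta_\lambda\|+\eta\bigr)^2. \]
Thus, when $\epsilon = O\bigl(\sqrt{\log(\delta^{-1})/n}\bigr)$ and $\eta = O\bigl(\sqrt{\log(\delta^{-1})/n}\bigr)$, we get a convergence speed of order $O(\log(\delta^{-1})/n)$, but for a modified definition of the loss function. Using a basis $(e_i)_{1\le i\le d}$ of eigenvectors of $G$ with corresponding eigenvalues $\mu_1\ge\mu_2\ge\dots\ge\mu_d\ge 0$, we see more precisely that for any $\theta\in\mathbb{R}^d$,
\[ R_\lambda(\theta) - R_\lambda(\theta_\lambda) = \sum_{i=1}^{d}(\mu_i+\lambda)\,\langle\theta-\theta_\lambda,e_i\rangle^2, \]
whereas
\[ \|(G+\lambda I)(\theta-\theta_\lambda)\|^2 = \sum_{i=1}^{d}(\mu_i+\lambda)^2\,\langle\theta-\theta_\lambda,e_i\rangle^2 = \frac{1}{4}\|\nabla R_\lambda(\theta)\|^2. \]
The relation between the two risks is that
\[ (\mu_d+\lambda)\bigl[R_\lambda(\theta)-R_\lambda(\theta_\lambda)\bigr] \le \|(G+\lambda I)(\theta-\theta_\lambda)\|^2 \le (\mu_1+\lambda)\bigl[R_\lambda(\theta)-R_\lambda(\theta_\lambda)\bigr]. \]
Consequently,
\[ R_\lambda(\widetilde{\theta}_\lambda) - R_\lambda(\theta_\lambda) \le \frac{4\bigl(\epsilon\|\widetilde{\theta}_\lambda\|+\eta\bigr)^2}{\mu_d+\lambda} \le \frac{4\bigl(\epsilon\|\theta_\lambda\|+\eta\bigr)^2}{\mu_d+\lambda}. \]

Proof. For any $\xi\in\widehat{\Theta}_\lambda$,
\[ \|(G+\lambda I)(\xi-\theta_\lambda)\| = \|(G+\lambda I)\xi - V\| \le \|(\widehat{G}+\lambda I)\xi - \widehat{V}\| + \epsilon\|\xi\| + \eta \le 2\bigl(\epsilon\|\xi\|+\eta\bigr), \]
from which the other statements made in the proposition are straightforward consequences.

From this proposition, we conclude that we have a dimension-free bound for $\|(G+\lambda I)(\widetilde{\theta}_\lambda-\theta_\lambda)\|^2$, whereas the bound we obtain for $R_\lambda(\widetilde{\theta}_\lambda)-R_\lambda(\theta_\lambda)$ depends on the dimension through $\mu_d+\lambda$, so that it is dimension-free only for large enough values of $\lambda$.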
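The relation between the two risks stated in Proposition 7.4 can be checked on a small example. The sketch below assumes a diagonal Gram matrix in dimension 3, so that the canonical basis is the eigenbasis; the eigenvalues, the vector $V$, the ridge parameter and the test point are hypothetical values, and the function names are illustrative.

```python
# Diagonal Gram matrix: the canonical basis is then its eigenbasis.
EIG = [3.0, 1.0, 0.2]   # hypothetical eigenvalues, sorted in decreasing order
V = [1.5, -0.5, 0.1]    # hypothetical vector V
LAM = 0.05              # hypothetical ridge parameter

def ridge_risk(theta):
    """R_lambda(theta) = <theta, (G + lam I) theta> - 2 <theta, V> for diagonal G."""
    return sum((e + LAM) * t * t - 2.0 * t * v for e, t, v in zip(EIG, theta, V))

# Ridge solution: solves (G + lam I) theta = V coordinate by coordinate.
theta_lam = [v / (e + LAM) for e, v in zip(EIG, V)]

def excess_risk(theta):
    """R_lambda(theta) - R_lambda(theta_lam)."""
    return ridge_risk(theta) - ridge_risk(theta_lam)

def residual_sq(theta):
    """||(G + lam I)(theta - theta_lam)||^2, a quarter of the squared gradient norm."""
    return sum(((e + LAM) * (t - s)) ** 2 for e, t, s in zip(EIG, theta, theta_lam))

theta = [0.3, 0.2, -1.0]                        # arbitrary test point
lower = (EIG[-1] + LAM) * excess_risk(theta)    # smallest eigenvalue plus lam
upper = (EIG[0] + LAM) * excess_risk(theta)     # largest eigenvalue plus lam
print(lower <= residual_sq(theta) <= upper)
```

When the smallest eigenvalue is close to zero and the ridge parameter is small, the lower factor becomes small, which is the dimension-dependent degradation discussed after the proposition.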

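For small values of $\lambda$, depending on $n$, we can obtain a dimension-free slow rate in the following way. Remark that, since $\mu_i \le (\mu_i+\lambda)^2/(4\lambda)$,
\[ R_0(\widetilde{\theta}_\lambda) - R_0(\theta_0) = \sum_{i=1}^{d}\mu_i\,\langle\widetilde{\theta}_\lambda-\theta_0,e_i\rangle^2 \le \sum_{i=1}^{d}\frac{(\mu_i+\lambda)^2}{4\lambda}\,\langle\widetilde{\theta}_\lambda-\theta_0,e_i\rangle^2 = \frac{1}{4\lambda}\|(G+\lambda I)(\widetilde{\theta}_\lambda-\theta_0)\|^2. \]
Since $V = G\theta_0 = (G+\lambda I)\theta_\lambda$,
\[ (G+\lambda I)(\widetilde{\theta}_\lambda-\theta_0) = (G+\lambda I)(\widetilde{\theta}_\lambda-\theta_\lambda) - \lambda\theta_0. \]
Moreover $\|\theta_\lambda\| \le \|\theta_0\|$; indeed
\[ R_\lambda(\theta_\lambda) = R_0(\theta_\lambda) + \lambda\|\theta_\lambda\|^2 \le R_0(\theta_0) + \lambda\|\theta_0\|^2 \le R_0(\theta_\lambda) + \lambda\|\theta_0\|^2. \]
Therefore
\[ \|(G+\lambda I)(\widetilde{\theta}_\lambda-\theta_0)\| \le \|(G+\lambda I)(\widetilde{\theta}_\lambda-\theta_\lambda)\| + \lambda\|\theta_0\| \le 2\bigl(\epsilon\|\widetilde{\theta}_\lambda\|+\eta\bigr) + \lambda\|\theta_0\| \le 2\bigl[(\epsilon+\lambda/2)\|\theta_0\|+\eta\bigr], \]
and, coming back to $R_0$,
\[ R_0(\widetilde{\theta}_\lambda) - R_0(\theta_0) \le \frac{1}{\lambda}\bigl[(\epsilon+\lambda/2)\|\theta_0\|+\eta\bigr]^2. \]
Choose $\lambda = 2(\epsilon+\eta)$ to obtain
\[ R_0(\widetilde{\theta}_{2(\epsilon+\eta)}) - R_0(\theta_0) \le \bigl[\|\theta_0\|+1/2\bigr]\bigl[(2\epsilon+\eta)\|\theta_0\|+\eta\bigr]. \]
This is a dimension-free bound for $R_0(\widetilde{\theta}_\lambda)-R_0(\theta_0)$, but it is of order $O\bigl(\sqrt{\log(\delta^{-1})/n}\bigr)$ instead of $O(\log(\delta^{-1})/n)$. Notice that it is adaptive in $\|\theta_0\|$ though. To get faster dimension-free rates for $R_0(\widetilde{\theta})$, we need to introduce some restrictions. First of all, let us notice that the previous results hold uniformly in any linear subspace of $\mathbb{R}^d$.

Proposition 7.5 Let us make the same assumptions as in Proposition 7.1. For any linear subspace $L$ of $\mathbb{R}^d$, define
\[ \theta_{L,\lambda} \in \arg\min_{\xi\in L} R_\lambda(\xi), \]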

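\[ \widehat{\theta}_{L,\lambda} \in \arg\min_{\xi\in L}\widehat{R}_\lambda(\xi). \]
Let
\[ \pi_L(\theta) = \arg\min_{\xi\in L}\|\theta-\xi\| \]
be the orthogonal projection on $L$, and let
\[ \widehat{\Theta}_{L,\lambda} = \bigl\{ \xi\in L : \|\pi_L(\widehat{G}+\lambda I)(\xi-\widehat{\theta}_{L,\lambda})\| \le \epsilon\|\xi\|+\eta \bigr\} \]
and
\[ \widetilde{\theta}_{L,\lambda} \in \arg\min_{\xi\in\widehat{\Theta}_{L,\lambda}}\|\xi\|. \]
Finally, introduce the least eigenvalue of $\pi_L G\pi_L$,
\[ \mu_L = \inf\bigl\{ \|G\xi\| : \xi\in L,\ \|\xi\| = 1 \bigr\}. \]
Whenever equation (4) is satisfied, for any linear subspace $L$ of $\mathbb{R}^d$ and any parameter $\lambda\in\mathbb{R}_+$,
\[ \|\pi_L(G+\lambda I)(\widetilde{\theta}_{L,\lambda}-\theta_{L,\lambda})\|^2 \le 4\bigl(\epsilon\|\widetilde{\theta}_{L,\lambda}\|+\eta\bigr)^2 \le 4\bigl(\epsilon\|\theta_{L,\lambda}\|+\eta\bigr)^2 \]
and
\[ R_\lambda(\widetilde{\theta}_{L,\lambda}) - R_\lambda(\theta_{L,\lambda}) \le \frac{4\bigl(\epsilon\|\widetilde{\theta}_{L,\lambda}\|+\eta\bigr)^2}{\mu_L+\lambda} \le \frac{4\bigl(\epsilon\|\theta_{L,\lambda}\|+\eta\bigr)^2}{\mu_L+\lambda}. \]
Remark that we can estimate $\mu_L$ by
\[ \widehat{\mu}_L = \inf\bigl\{ \|\widehat{G}\xi\| : \xi\in L,\ \|\xi\| = 1 \bigr\}. \]
It is such that, for any linear subspace $L$,
\[ \widehat{\mu}_L - \epsilon \le \mu_L \le \widehat{\mu}_L + \epsilon. \]
Obtaining a fast convergence rate for the minimization of $R_\lambda(\theta)$ when $\lambda$ is small or null and $\mu_d$ is small is possible in a sparse recovery framework.

Proposition 7.6 Consider a family $\mathcal{L}$ of linear subspaces of $\mathbb{R}^d$. Assume that $\theta_\lambda\in L_*\in\mathcal{L}$ and that $\|\theta_\lambda\| \le A$, a known constant. Consider the confidence region
\[ \widehat{\Theta}_\lambda = \bigl\{ \xi\in\mathbb{R}^d : \|(\widehat{G}+\lambda I)(\xi-\widehat{\theta}_\lambda)\| \le \epsilon\|\xi\|+\eta,\ \|\xi\| \le A \bigr\}. \]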

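Define the model selector
\[ \widehat{\mathcal{L}} = \bigl\{ L\in\mathcal{L} : \widehat{\Theta}_\lambda\cap L \ne \emptyset \bigr\}, \qquad \widehat{L} \in \arg\max\bigl\{ \widehat{\mu}_L : L\in\widehat{\mathcal{L}} \bigr\}, \]
and the estimator
\[ \widehat{\theta} \in \arg\min\bigl\{ \|\xi\| : \xi\in\widehat{\Theta}_\lambda\cap\widehat{L} \bigr\}. \]
Define
\[ \mu_* = \inf\bigl\{ \mu_{L+\mathbb{R}\theta_\lambda} : L\in\mathcal{L},\ \mu_L \ge \mu_{L_*} - 2\epsilon \bigr\}. \]
Under the event described by equation (4),
\[ (\mu_*+\lambda)\|\widehat{\theta}-\theta_\lambda\| \le \|(G+\lambda I)(\widehat{\theta}-\theta_\lambda)\| \le 2\bigl(\epsilon\|\widehat{\theta}\|+\eta\bigr) \le 2(\epsilon A+\eta) \]
and
\[ R_\lambda(\widehat{\theta}) - R_\lambda(\theta_\lambda) \le \frac{4}{\mu_*+\lambda}\bigl(\epsilon\|\widehat{\theta}\|+\eta\bigr)^2 \le \frac{4}{\mu_*+\lambda}(\epsilon A+\eta)^2. \]

Proof. Since $\widehat{\theta}\in\widehat{\Theta}_\lambda$,
\[ \|(G+\lambda I)(\widehat{\theta}-\theta_\lambda)\| \le 2\bigl(\epsilon\|\widehat{\theta}\|+\eta\bigr) \le 2(\epsilon A+\eta). \]
On the other hand,
\[ \|(G+\lambda I)(\widehat{\theta}-\theta_\lambda)\| \ge \|\pi_{\widehat{L}+\mathbb{R}\theta_\lambda}(G+\lambda I)(\widehat{\theta}-\theta_\lambda)\| \ge \bigl(\mu_{\widehat{L}+\mathbb{R}\theta_\lambda}+\lambda\bigr)\|\widehat{\theta}-\theta_\lambda\|. \]
Moreover $L_*\in\widehat{\mathcal{L}}$, since $\theta_\lambda\in\widehat{\Theta}_\lambda\cap L_*$. Thus
\[ \mu_{\widehat{L}} \ge \widehat{\mu}_{\widehat{L}} - \epsilon \ge \widehat{\mu}_{L_*} - \epsilon \ge \mu_{L_*} - 2\epsilon, \]
so that $\mu_{\widehat{L}+\mathbb{R}\theta_\lambda} \ge \mu_*$, according to the definition of $\mu_*$, implying that
\[ (\mu_*+\lambda)\|\widehat{\theta}-\theta_\lambda\| \le \|(G+\lambda I)(\widehat{\theta}-\theta_\lambda)\| \le 2\bigl(\epsilon\|\widehat{\theta}\|+\eta\bigr), \]
and consequently that
\[ R_\lambda(\widehat{\theta}) - R_\lambda(\theta_\lambda) \le \|\widehat{\theta}-\theta_\lambda\|\,\|(G+\lambda I)(\widehat{\theta}-\theta_\lambda)\| \le \frac{4\bigl(\epsilon\|\widehat{\theta}\|+\eta\bigr)^2}{\mu_*+\lambda}. \]

Remark that the constant $\mu_*$ is defined in terms of restricted eigenvalues of the Gram matrix, a concept that has been used by other authors, for example in Bickel, Ritov and Tsybakov (2009), to set the conditions of sparse recovery. In the case of nested models, we can replace the constant $\mu_*$ with a simpler one, as in the following proposition.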

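Proposition 7.7 Consider a nested family of linear subspaces of $\mathbb{R}^d$,
\[ \mathcal{L} = \{ L_1 \subset L_2 \subset \dots \subset L_K \}. \]
Assume that $\theta_\lambda\in L_*\in\mathcal{L}$, where $L_*$ is unknown, and that $\|\theta_\lambda\| \le A$, where $A$ is known. Consider the confidence region
\[ \widehat{\Theta}_\lambda = \bigl\{ \xi\in\mathbb{R}^d : \|(\widehat{G}+\lambda I)(\xi-\widehat{\theta}_\lambda)\| \le \epsilon\|\xi\|+\eta,\ \|\xi\| \le A \bigr\}. \]
Define the model selector
\[ \widehat{k} = \arg\min\bigl\{ j : \widehat{\Theta}_\lambda\cap L_j \ne \emptyset \bigr\}, \qquad \widehat{L} = L_{\widehat{k}}, \]
and the estimator
\[ \widehat{\theta} \in \arg\min\bigl\{ \|\xi\| : \xi\in\widehat{\Theta}_\lambda\cap\widehat{L} \bigr\}. \]
Under the event described by equation (4),
\[ (\mu_{L_*}+\lambda)\|\widehat{\theta}-\theta_\lambda\| \le \|(G+\lambda I)(\widehat{\theta}-\theta_\lambda)\| \le 2\bigl(\epsilon\|\widehat{\theta}\|+\eta\bigr) \le 2(\epsilon A+\eta) \]
and
\[ R_\lambda(\widehat{\theta}) - R_\lambda(\theta_\lambda) \le \frac{4}{\lambda+\mu_{L_*}}\bigl(\epsilon\|\widehat{\theta}\|+\eta\bigr)^2 \le \frac{4}{\lambda+\mu_{L_*}}(\epsilon A+\eta)^2. \]

Proof. As in the previous proposition, $\widehat{\theta}\in\widehat{\Theta}_\lambda$, so that
\[ \|(G+\lambda I)(\widehat{\theta}-\theta_\lambda)\| \le 2\bigl(\epsilon\|\widehat{\theta}\|+\eta\bigr). \]
Moreover $L_*\cap\widehat{\Theta}_\lambda \ne \emptyset$, so that $\widehat{L}\subset L_*$, implying that
\[ (\mu_{L_*}+\lambda)\|\widehat{\theta}-\theta_\lambda\| \le \|\pi_{L_*}(G+\lambda I)(\widehat{\theta}-\theta_\lambda)\| \le \|(G+\lambda I)(\widehat{\theta}-\theta_\lambda)\|, \]
and that
\[ R_\lambda(\widehat{\theta}) - R_\lambda(\theta_\lambda) \le \frac{4\bigl(\epsilon\|\widehat{\theta}\|+\eta\bigr)^2}{\mu_{L_*}+\lambda}. \]

References

Bickel, P. J., Ritov, Y. and Tsybakov, A. (2009). Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics 37, 1705-1732.
Catoni, O. (2004). Statistical Learning Theory and Stochastic Optimization. Lectures on Probability Theory and Statistics, École d'Été de Probabilités de Saint-Flour XXXI, 2001. Lecture Notes in Mathematics 1851, Springer, pages 1-269.
Catoni, O. (2012). Challenging the empirical mean and empirical variance: a deviation study. Ann. Inst. Henri Poincaré 48, 1148-1185.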

Catoni, O. (2016). PAC-Bayesian bounds for the Gram matrix and least squares regression with a random design. Preprint on arXiv.
Catoni, O. and Giulini, I. (2017). Dimension free PAC-Bayesian bounds for the estimation of the mean of a random vector. In NIPS 2017, to appear.
Giulini, I. (2017a). Robust dimension-free Gram operator estimates. Bernoulli, to appear.
Giulini, I. (2017b). Robust PCA and pairs of projections in a Hilbert space. Electron. J. Statist. 11, 3903-3926.
Joly, E., Lugosi, G. and Oliveira, R. I. (2017). On the estimation of the mean of a random vector. Electronic Journal of Statistics 11, 440-451.
Kato, T. (1982). A Short Introduction to Perturbation Theory for Linear Operators. Springer-Verlag, New York.
Lugosi, G. and Mendelson, S. (2017). Sub-Gaussian estimators of the mean of a random vector. Annals of Statistics, to appear.
Minsker, S. (2015). Geometric Median and Robust Estimation in Banach Spaces. Bernoulli 4, 2308-2335.
Minsker, S. (2016). Sub-Gaussian estimators of the mean of a random matrix with heavy-tailed entries. Annals of Statistics, to appear.