ECE598: Information-theoretic methods in high-dimensional statistics    Spring 2016
Lecture 7: Information bound
Lecturer: Yihong Wu    Scribe: Shiyu Liang, Feb 6, 2016 [Ed. Mar 9]

Recall the chi-squared divergence and the Hammersley–Chapman–Robbins (HCR) bound from last class. Suppose that $P, Q$ are two probability distributions defined on $\mathcal{X}$ and that $X$ is a random variable. The chi-squared divergence admits the variational representation
$$\chi^2(P\|Q) = \sup_{g:\mathcal{X}\to\mathbb{R}} \frac{(\mathbb{E}_P[g(X)] - \mathbb{E}_Q[g(X)])^2}{\mathrm{Var}_Q[g(X)]}.$$
Furthermore, for real-valued $X$, choosing an affine function $g$ yields
$$\chi^2(P\|Q) \ge \frac{(\mathbb{E}_P[X] - \mathbb{E}_Q[X])^2}{\mathrm{Var}_Q[X]},$$
which gives the HCR bound.

7.1 HCR Lower Bound

We now continue with the HCR lower bound from the last class, illustrating it with an example of an estimation problem.

Example 7.1 (Estimation). Let $\theta \in \mathbb{R}$ be an unknown, deterministic parameter, and let $X$ be a random variable, interpreted as a measurement of $\theta$, or data. Suppose $\hat\theta$ is an unbiased estimate of $\theta$ based on $X$. The relationship can be shown as
$$\theta \to X \to \hat\theta.$$
The estimation loss is the quadratic loss $\ell(\theta, \hat\theta) = (\theta - \hat\theta)^2$. Let $P = P_{\theta'}$ and $Q = P_\theta$; then the risk is lower bounded by
$$R_\theta(\hat\theta) = \mathrm{Var}_\theta(\hat\theta) \ge \frac{(\mathbb{E}_{\theta'}[\hat\theta] - \mathbb{E}_{\theta}[\hat\theta])^2}{\chi^2(P_{\theta'}\|P_\theta)}.$$
Since $\hat\theta$ is unbiased, $\mathbb{E}_{\theta'}[\hat\theta] = \theta'$ and $\mathbb{E}_{\theta}[\hat\theta] = \theta$, so
$$R_\theta(\hat\theta) \ge \sup_{\theta'\neq\theta} \frac{(\theta'-\theta)^2}{\chi^2(P_{\theta'}\|P_\theta)} \ge \lim_{\theta'\to\theta} \frac{(\theta'-\theta)^2}{\chi^2(P_{\theta'}\|P_\theta)}.$$

7.2 Fisher Information

The Fisher information is a way of measuring the amount of information that an observable random variable $X$ carries about an unknown, deterministic parameter $\theta$ upon which the probability of the observation $X$ depends. Assume the probability density function of $X$ given the value of $\theta$ is $p_\theta$. The Fisher information is defined as follows.
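As a quick numerical illustration (not part of the original notes), consider the Gaussian location family $p_\theta = N(\theta, \sigma^2)$, for which $\chi^2(N(\theta',\sigma^2)\|N(\theta,\sigma^2)) = e^{(\theta'-\theta)^2/\sigma^2} - 1$. The sketch below checks that the HCR bound recovers $\sigma^2$ (which equals $1/I(\theta)$ for this family, and is attained by the unbiased estimator $\hat\theta = X$) as $\theta' \to \theta$:

```python
import numpy as np

# Gaussian location family p_theta = N(theta, sigma^2):
# chi^2(N(t', s^2) || N(t, s^2)) = exp((t' - t)^2 / s^2) - 1.
sigma = 2.0

def hcr_bound(delta, sigma):
    """HCR lower bound on Var(theta_hat) from the two points theta and theta + delta."""
    chi2 = np.expm1(delta**2 / sigma**2)   # chi^2(P_{theta+delta} || P_theta)
    return delta**2 / chi2

# As delta -> 0, the HCR bound tends to sigma^2 = 1/I(theta); for any fixed
# delta it is strictly smaller, since e^x - 1 > x for x > 0.
for delta in [1.0, 0.1, 1e-4]:
    print(delta, hcr_bound(delta, sigma))
```

Note that the supremum over $\theta'$ is approached in the limit $\theta' \to \theta$ here, which is exactly the Cramér–Rao regime discussed next.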
Definition 7.1 (Fisher information). The Fisher information of the parametric family of densities $\{p_\theta : \theta \in \Theta\}$ (with respect to a dominating measure $\mu$) at $\theta$ is
$$I(\theta) = \mathbb{E}_\theta\left[\left(\frac{\partial}{\partial\theta}\log p_\theta(X)\right)^2\right]. \tag{7.1}$$

Theorem 7.1 (Fisher information). If $p_\theta$ is twice differentiable with respect to $\theta$, the Fisher information can be written as
$$I(\theta) = -\mathbb{E}_\theta\left[\frac{\partial^2}{\partial\theta^2}\log p_\theta(X)\right].$$

Proof. Since
$$\frac{\partial}{\partial\theta}\log p_\theta = \frac{\dot p_\theta}{p_\theta}, \qquad \frac{\partial^2}{\partial\theta^2}\log p_\theta = \frac{\ddot p_\theta}{p_\theta} - \left(\frac{\dot p_\theta}{p_\theta}\right)^2,$$
and, exchanging differentiation and integration,
$$\mathbb{E}_\theta\left[\frac{\ddot p_\theta}{p_\theta}\right] = \int \ddot p_\theta\,\mu(dx) = \frac{\partial^2}{\partial\theta^2}\int p_\theta\,\mu(dx) = 0,$$
we have
$$-\mathbb{E}_\theta\left[\frac{\partial^2}{\partial\theta^2}\log p_\theta\right] = \mathbb{E}_\theta\left[\left(\frac{\partial}{\partial\theta}\log p_\theta\right)^2\right] = I(\theta). \qquad\blacksquare$$

Theorem 7.2 (Fisher information: multiple samples). Suppose the random samples $X_1,\dots,X_n$ are independently and identically drawn from a distribution $p_\theta$. The Fisher information $I_n(\theta)$ provided by $X_1,\dots,X_n$ is
$$I_n(\theta) = nI(\theta),$$
where $I(\theta)$ is the Fisher information provided by a single sample $X$.

Proof. We first denote the joint pdf of $X_1,\dots,X_n$ by $p_\theta(x_1,\dots,x_n) = \prod_{i=1}^n p_\theta(x_i)$. Then the Fisher information $I_n(\theta)$ provided by $X_1,\dots,X_n$ is
$$I_n(\theta) = \mathbb{E}\left[\left(\frac{\partial \log p_\theta(X_1,\dots,X_n)}{\partial\theta}\right)^2\right] = \int\cdots\int \left(\frac{\partial \log p_\theta(x_1,\dots,x_n)}{\partial\theta}\right)^2 p_\theta(x_1,\dots,x_n)\,dx_1\cdots dx_n,$$
which is an $n$-dimensional integral. Thus, by Theorem 7.1, the Fisher information provided by $X_1,\dots,X_n$ can instead be calculated as
$$I_n(\theta) = -\mathbb{E}\left[\frac{\partial^2}{\partial\theta^2}\log p_\theta(X_1,\dots,X_n)\right] = -\mathbb{E}\left[\sum_{i=1}^n \frac{\partial^2}{\partial\theta^2}\log p_\theta(X_i)\right] = \sum_{i=1}^n\left(-\mathbb{E}\left[\frac{\partial^2}{\partial\theta^2}\log p_\theta(X_i)\right]\right) = nI(\theta). \qquad\blacksquare$$
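The additivity $I_n(\theta) = nI(\theta)$ can be checked by Monte Carlo. The sketch below (not from the notes; the Gaussian family and all constants are illustrative choices) uses $p_\theta = N(\theta, \sigma^2)$, whose score is $\frac{\partial}{\partial\theta}\log p_\theta(x) = (x-\theta)/\sigma^2$, and estimates $I_n(\theta)$ as the variance of the joint score:

```python
import numpy as np

# Monte Carlo check of I_n(theta) = n * I(theta) for the Gaussian family
# N(theta, sigma^2), where the single-sample score is (x - theta) / sigma^2
# and I(theta) = 1 / sigma^2.
rng = np.random.default_rng(0)
theta, sigma, n, trials = 1.0, 2.0, 5, 200_000

x = rng.normal(theta, sigma, size=(trials, n))
score = ((x - theta) / sigma**2).sum(axis=1)   # score of the joint density
I_n_hat = score.var()                          # I_n(theta) = Var of the joint score
print(I_n_hat, n / sigma**2)                   # should agree: n / sigma^2
```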
7.3 Variations of the HCR/CR Lower Bound

This section contains the following three versions of the HCR/CR lower bound:

- Multiple-samples version
- Multivariate version
- Functional version

7.3.1 Multiple-Samples Version

Suppose $\theta$ is an unknown, deterministic parameter and $X_1,\dots,X_n$ are $n$ random variables independently and identically drawn from the distribution $P_\theta$. The estimate $\hat\theta$ is computed from $X_1,\dots,X_n$. The relationship is as follows:
$$\theta \to (X_1,\dots,X_n) \to \hat\theta.$$
Then, for unbiased $\hat\theta$, the risk is lower bounded by
$$R_\theta(\hat\theta) = \mathrm{Var}_\theta(\hat\theta) \ge \frac{(\mathbb{E}_{\theta'}\hat\theta - \mathbb{E}_\theta\hat\theta)^2}{\chi^2(P_{\theta'}^{\otimes n}\|P_\theta^{\otimes n})}.$$
For the HCR lower bound, using the tensorization identity $\chi^2(P^{\otimes n}\|Q^{\otimes n}) = (1+\chi^2(P\|Q))^n - 1$,
$$R_\theta(\hat\theta) \ge \sup_{\theta'\neq\theta} \frac{(\theta'-\theta)^2}{(1+\chi^2(P_{\theta'}\|P_\theta))^n - 1} \ge \lim_{\theta'\to\theta} \frac{(\theta'-\theta)^2}{(1+\chi^2(P_{\theta'}\|P_\theta))^n - 1} = \frac{1}{nI(\theta)}.$$

We next show the counterpart of the scalar bound $\chi^2(P\|Q) \ge (\mathbb{E}_P X - \mathbb{E}_Q X)^2/\mathrm{Var}_Q X$ in higher dimensions. Suppose $P, Q$ are two distributions on $\mathbb{R}^p$; then
$$\chi^2(P\|Q) = \sup_{g:\mathbb{R}^p\to\mathbb{R}} \frac{(\mathbb{E}_P[g(X)] - \mathbb{E}_Q[g(X)])^2}{\mathrm{Var}_Q[g(X)]}.$$
Further, if $g(x) = \langle a, x\rangle + b$, then
$$\chi^2(P\|Q) \ge \frac{(\mathbb{E}_P[\langle a,X\rangle + b] - \mathbb{E}_Q[\langle a,X\rangle + b])^2}{\mathrm{Var}_Q(\langle a,X\rangle + b)}.$$
If we further assume $\mathbb{E}_Q X = 0$ (without loss of generality, by centering), then we have
$$\chi^2(P\|Q) \ge \frac{\langle a, \mathbb{E}_P X\rangle^2}{a^\top \mathbb{E}_Q[XX^\top]a}.$$
Therefore, optimizing over $a$ (the supremum is attained at $a = \mathrm{cov}_Q(X)^{-1}(\mathbb{E}_P X - \mathbb{E}_Q X)$), we finally have
$$\chi^2(P\|Q) \ge (\mathbb{E}_P X - \mathbb{E}_Q X)^\top \mathrm{cov}_Q(X)^{-1}(\mathbb{E}_P X - \mathbb{E}_Q X).$$
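The optimization over linear test functions can be visualized numerically. The sketch below (matrices and seed are hypothetical choices, not from the notes) checks that among $g(x) = \langle a, x\rangle$, the ratio $\langle a, \Delta\rangle^2 / (a^\top \Sigma a)$, with $\Delta = \mathbb{E}_P X - \mathbb{E}_Q X$ and $\Sigma = \mathrm{cov}_Q(X)$, is maximized at $a^* = \Sigma^{-1}\Delta$, where it equals the quadratic form $\Delta^\top \Sigma^{-1}\Delta$:

```python
import numpy as np

# Among linear test functions g(x) = <a, x>, the HCR ratio
# <a, Delta>^2 / (a^T Sigma a) is maximized at a* = Sigma^{-1} Delta
# (by Cauchy-Schwarz), with maximum Delta^T Sigma^{-1} Delta.
rng = np.random.default_rng(1)
p = 4
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)        # a generic positive-definite covariance
Delta = rng.normal(size=p)

def ratio(a):
    return (a @ Delta) ** 2 / (a @ Sigma @ a)

a_star = np.linalg.solve(Sigma, Delta)   # optimal direction Sigma^{-1} Delta
best = Delta @ a_star                    # Delta^T Sigma^{-1} Delta
max_random_ratio = max(ratio(rng.normal(size=p)) for _ in range(1000))
print(ratio(a_star), best, max_random_ratio)   # no random a beats a*
```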
7.3.2 Multivariate Version

Let the loss function be $\ell(\theta,\hat\theta) = \|\theta - \hat\theta\|^2$ and let $\hat\theta$ be an unbiased estimate of $\theta \in \mathbb{R}^p$, i.e., $\mathbb{E}_\theta\hat\theta = \theta$. Then
$$(\theta'-\theta)^\top \mathrm{cov}_\theta(\hat\theta)^{-1}(\theta'-\theta) \le \chi^2(P_{\theta'}\|P_\theta) = (\theta'-\theta)^\top I(\theta)(\theta'-\theta) + o(\|\theta'-\theta\|^2),$$
where the equality follows from the Taylor expansion and the Fisher information matrix is given by
$$I(\theta) = \int \frac{(\nabla_\theta p_\theta)(\nabla_\theta p_\theta)^\top}{p_\theta}\,d\mu.$$
If we take $\theta' = \theta + \epsilon u$ and let $\epsilon \to 0$, then we have
$$u^\top \mathrm{cov}_\theta(\hat\theta)^{-1}u \le u^\top I(\theta)u \quad\text{for all } u,$$
which is equivalent to
$$\mathrm{cov}_\theta(\hat\theta) \succeq I(\theta)^{-1},$$
and further indicates
$$R_\theta(\hat\theta) = \mathbb{E}\|\hat\theta - \theta\|^2 = \sum_{i=1}^p \mathbb{E}(\hat\theta_i - \theta_i)^2 = \mathrm{tr}(\mathrm{cov}_\theta(\hat\theta)) \ge \mathrm{tr}(I(\theta)^{-1}) \ge \sum_{i=1}^p \frac{1}{I_i},$$
where $I_i \triangleq (I(\theta))_{ii}$, since $(I(\theta)^{-1})_{ii} \ge 1/(I(\theta))_{ii}$. Note that if we applied the one-dimensional CRLB to each coordinate separately, we would only get the rightmost inequality, which is weaker. In addition, the Fisher information matrix can be written as
$$I(\theta) = \mathbb{E}\left[(\nabla_\theta \log p_\theta)(\nabla_\theta \log p_\theta)^\top\right] = \mathrm{cov}(\nabla_\theta \log p_\theta) = -\mathbb{E}\left[\frac{\partial^2}{\partial\theta_i\partial\theta_j}\log p_\theta\right]_{i,j}.$$

7.3.3 Functional Version

Assume that $\theta$ is an unknown parameter, that the random variable $X$ comes from the distribution $P_\theta$, and that $\hat T(X)$ is an estimate of $T(\theta)$, where $T:\Theta\to\mathbb{R}$. The relationship is as follows:
$$\theta \to X \to \hat T.$$
If we further assume $\hat T(X)$ is an unbiased estimate of $T(\theta)$, then
$$\mathrm{Var}_\theta(\hat T) \ge \frac{(T'(\theta))^2}{I(\theta)}.$$
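The matrix fact behind the last chain of inequalities, $(I^{-1})_{ii} \ge 1/I_{ii}$ for positive definite $I$ (hence $\mathrm{tr}(I^{-1}) \ge \sum_i 1/I_{ii}$), can be sanity-checked numerically. A sketch (random test matrices, not from the notes):

```python
import numpy as np

def diag_inverse_dominates(I_mat):
    """Check (I^{-1})_{ii} >= 1 / I_{ii} for a positive definite matrix."""
    inv_diag = np.diag(np.linalg.inv(I_mat))
    return bool(np.all(inv_diag >= 1.0 / np.diag(I_mat) - 1e-12))

rng = np.random.default_rng(2)
results = []
for _ in range(100):
    B = rng.normal(size=(5, 5))
    results.append(diag_inverse_dominates(B @ B.T + 0.1 * np.eye(5)))
print(all(results))
```

The inequality follows from Cauchy–Schwarz: $(e_i^\top I^{-1} e_i)(e_i^\top I e_i) \ge (e_i^\top e_i)^2 = 1$, with equality only when $e_i$ is an eigenvector of $I$.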
7.4 Bayesian Cramér–Rao Lower Bound

This class will introduce two methods of proving the Bayesian Cramér–Rao lower bound:

- Method 1: $\chi^2$ → Bayesian HCR → Bayesian CR
- Method 2: classical method

The notation used in this section is as follows: $\Theta = \mathbb{R}$; $\ell(\theta,\hat\theta) = (\theta-\hat\theta)^2$; $\pi$ is a nice prior density on $\mathbb{R}$. The relationship can be described as
$$\theta \sim \pi \to X \to \hat\theta.$$

Theorem 7.3 (Bayesian Cramér–Rao lower bound). Assuming suitable regularity conditions,
$$R_\pi \triangleq \inf_{\hat\theta} \mathbb{E}_\pi[\ell(\theta,\hat\theta)] \ge \frac{1}{\mathbb{E}_\pi[I(\theta)] + I(\pi)},$$
where $R_\pi$ is the Bayes risk and $I(\pi) = \int \frac{(\pi')^2}{\pi}$ is the Fisher information of the prior.

Let $P$ and $Q$ be two joint distributions of $(\theta, X, \hat\theta)$, each with the Markov structure $\theta \to X \to \hat\theta$:
$$Q: \theta \sim Q_\theta,\ X|\theta \sim Q_{X|\theta},\ \hat\theta = \hat\theta(X); \qquad P: \theta \sim P_\theta,\ X|\theta \sim P_{X|\theta},\ \hat\theta = \hat\theta(X).$$
Then
$$\chi^2(P_{\theta X}\|Q_{\theta X}) \ge \chi^2(P_{\theta\hat\theta}\|Q_{\theta\hat\theta}) \quad\text{(data processing inequality)}$$
$$\ge \chi^2(P_{\theta-\hat\theta}\|Q_{\theta-\hat\theta}) \quad\text{(data processing inequality)}$$
$$\ge \frac{(\mathbb{E}_P(\theta-\hat\theta) - \mathbb{E}_Q(\theta-\hat\theta))^2}{\mathrm{Var}_Q(\theta-\hat\theta)}.$$
Further, if we choose $Q_\theta = \pi$, $Q_{X|\theta} = P_{X|\theta}$, $P_\theta = T_\delta\pi$ (the prior shifted by $\delta$), and $P_{X|\theta} = P_{X|\theta-\delta}$, then $P_X = Q_X$, which further indicates $P_{\hat\theta} = Q_{\hat\theta}$: the mean of $\hat\theta$ under $P$ equals its mean under $Q$, while $\mathbb{E}_P\theta = \mathbb{E}_Q\theta + \delta$, so the numerator above equals $\delta^2$. Since also $\mathrm{Var}_Q(\theta-\hat\theta) \le \mathbb{E}_Q(\theta-\hat\theta)^2$, we obtain the Bayesian HCR lower bound
$$R_\pi \ge \sup_{\delta\neq 0} \frac{\delta^2}{\chi^2(P_{\theta X}\|Q_{\theta X})} \ge \lim_{\delta\to 0} \frac{\delta^2}{\chi^2(P_{\theta X}\|Q_{\theta X})} = \frac{1}{I(\pi) + \mathbb{E}_\pi[I(\theta)]}. \tag{7.2}$$
We give a short proof of (7.2) here.
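As a sanity check (not from the notes; all constants are illustrative), in the conjugate Gaussian model $\theta \sim N(0, s^2)$, $X_i \mid \theta \sim N(\theta, \sigma^2)$, the Bayesian Cramér–Rao bound is tight: $\mathbb{E}_\pi[I_n(\theta)] = n/\sigma^2$ and $I(\pi) = 1/s^2$, so the bound equals $1/(n/\sigma^2 + 1/s^2)$, which is exactly the posterior variance attained by the posterior mean:

```python
import numpy as np

# Conjugate Gaussian model: theta ~ N(0, s^2), X_i | theta ~ N(theta, sigma^2).
# Bayes risk = posterior variance = 1 / (n/sigma^2 + 1/s^2), matching the
# Bayesian CR bound 1 / (E_pi[I_n(theta)] + I(pi)) with I(pi) = 1/s^2.
s, sigma, n = 1.5, 2.0, 10
bcr_bound = 1.0 / (n / sigma**2 + 1.0 / s**2)
bayes_risk = 1.0 / (n / sigma**2 + 1.0 / s**2)

# Monte Carlo confirmation that the posterior mean attains this risk.
rng = np.random.default_rng(3)
trials = 100_000
theta = rng.normal(0.0, s, size=trials)
xbar = theta + rng.normal(0.0, sigma / np.sqrt(n), size=trials)
w = (n / sigma**2) / (n / sigma**2 + 1.0 / s**2)   # posterior mean = w * xbar
risk_hat = np.mean((w * xbar - theta) ** 2)
print(risk_hat, bayes_risk)
```

For non-Gaussian priors or likelihoods the bound is generally strict, which is why the local analysis in Section 7.5 takes a limit.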
Proof. Decompose
$$P_\theta P_{X|\theta} - Q_\theta Q_{X|\theta} = P_\theta(P_{X|\theta} - Q_{X|\theta}) + (P_\theta - Q_\theta)Q_{X|\theta}.$$
Then
$$\chi^2(P_{\theta X}\|Q_{\theta X}) = \iint \frac{(P_\theta P_{X|\theta} - Q_\theta Q_{X|\theta})^2}{Q_\theta Q_{X|\theta}} = \iint \frac{P_\theta^2(P_{X|\theta} - Q_{X|\theta})^2}{Q_\theta Q_{X|\theta}} + \iint \frac{(P_\theta - Q_\theta)^2 Q_{X|\theta}}{Q_\theta} + 2\iint \frac{P_\theta(P_\theta - Q_\theta)(P_{X|\theta} - Q_{X|\theta})}{Q_\theta},$$
where the cross term vanishes since $\int (P_{X|\theta} - Q_{X|\theta})\,dx = 0$. Hence
$$\chi^2(P_{\theta X}\|Q_{\theta X}) = \chi^2(P_\theta\|Q_\theta) + \mathbb{E}_{\theta\sim Q_\theta}\left[\left(\frac{P_\theta}{Q_\theta}\right)^2 \chi^2(P_{X|\theta}\|Q_{X|\theta})\right].$$
Then applying
$$\chi^2(P_\theta\|Q_\theta) = \chi^2(T_\delta\pi\|\pi) = \delta^2[I(\pi) + o(1)] \quad\text{by Taylor expansion},$$
$$\chi^2(P_{X|\theta}\|Q_{X|\theta}) = \chi^2(P_{X|\theta-\delta}\|P_{X|\theta}) = \delta^2[I(\theta) + o(1)] \quad\text{by Taylor expansion},$$
we obtain (7.2). $\blacksquare$

7.5 Information Bound

In this section, we introduce the local version of the minimax lower bound. The local minimax risk is defined in a quadratic form:
$$\inf_{\hat\theta} \sup_{|\theta-\theta_0|\le\epsilon} \mathbb{E}_\theta(\hat\theta-\theta)^2.$$
For any prior $\pi$ supported on $[\theta_0-\epsilon, \theta_0+\epsilon]$, by (7.2) with $n$ samples we have
$$\inf_{\hat\theta} \sup_{|\theta-\theta_0|\le\epsilon} \mathbb{E}_\theta(\hat\theta-\theta)^2 \ge R_\pi \ge \frac{1}{n\mathbb{E}_\pi[I(\theta)] + I(\pi)}.$$
If $I(\theta)$ is continuous, then $\mathbb{E}_\pi[I(\theta)] = I(\theta_0) + o(1)$ as $\epsilon \to 0$.

Assume the random variable $Z$ comes from the distribution $\pi$, $Z \sim \pi$, and write $I(Z) \triangleq I(\pi)$. For constants $\alpha$ and $\beta \neq 0$, we have $I(Z+\alpha) = I(Z)$ and $I(\beta Z) = I(Z)/\beta^2$. If $\pi$ has the density $\cos^2\frac{\pi x}{2}$ on $[-1,1]$, then $I(\pi) = \pi^2$; in fact this prior attains
$$\min_{\pi \text{ on } [-1,1]} I(\pi) = \pi^2.$$
If the distribution $\pi_\epsilon$ has the density $\frac{1}{\epsilon}\cos^2\frac{\pi(x-\theta_0)}{2\epsilon}$ on $[\theta_0-\epsilon, \theta_0+\epsilon]$, then by the shift and scaling properties $I(\pi_\epsilon) = \pi^2/\epsilon^2$. Then we have
$$\inf_{\hat\theta}\sup_{|\theta-\theta_0|\le\epsilon} \mathbb{E}_\theta(\hat\theta-\theta)^2 \ge R_{\pi_\epsilon} \ge \frac{1}{n\mathbb{E}_{\pi_\epsilon}[I(\theta)] + \pi^2/\epsilon^2}.$$
Now if we pick $\epsilon = n^{-1/4}$, we have
$$R \triangleq \inf_{\hat\theta}\sup_{|\theta-\theta_0|\le n^{-1/4}} \mathbb{E}_\theta(\hat\theta-\theta)^2 \ge \frac{1}{nI(\theta_0) + o(n) + \pi^2\sqrt{n}} = \frac{1+o(1)}{nI(\theta_0)}.$$
Optimizing over $\theta_0$, the global minimax risk satisfies
$$\inf_{\hat\theta}\sup_{\theta\in\Theta} \mathbb{E}_\theta(\hat\theta-\theta)^2 \ge \frac{1+o(1)}{n\inf_{\theta_0\in\Theta} I(\theta_0)}.$$
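The value $I(\pi) = \pi^2$ for the $\cos^2$ prior can be verified directly: with $p(x) = \cos^2\frac{\pi x}{2}$ on $[-1,1]$, one has $p'(x) = -\frac{\pi}{2}\sin(\pi x)$, and the integrand $(p')^2/p$ simplifies to $\pi^2\sin^2\frac{\pi x}{2}$, which integrates to $\pi^2$. A quick numerical check (quadrature grid is an arbitrary choice):

```python
import numpy as np

# Numerical check of I(pi) = pi^2 for the prior density p(x) = cos^2(pi x / 2)
# on [-1, 1], where I(pi) = integral of (p')^2 / p and p'(x) = -(pi/2) sin(pi x).
x = np.linspace(-1 + 1e-6, 1 - 1e-6, 400_001)   # avoid the zeros of p at +/- 1
p = np.cos(np.pi * x / 2) ** 2
dp = -(np.pi / 2) * np.sin(np.pi * x)
f = dp**2 / p                                   # equals pi^2 * sin^2(pi x / 2)
I_pi = float(np.sum((f[1:] + f[:-1]) * np.diff(x)) / 2)   # trapezoid rule
print(I_pi, np.pi**2)
```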