Bayes Rule and its Applications

Bayes Rule:
$$P(B_k \mid A) = \frac{P(A \mid B_k)\,P(B_k)}{\sum_{i=1}^{n} P(A \mid B_i)\,P(B_i)}$$

Example 1: In a certain factory, machines A, B, and C are all producing springs of the same length. Of their production, machines A, B, and C produce 2%, 1%, and 3% defective springs, respectively. Of the total production of springs in the factory, machine A produces 35%, machine B produces 25%, and machine C produces 40%. Then we have
$$P(D \mid A) = .02,\ P(A) = .35; \quad P(D \mid B) = .01,\ P(B) = .25; \quad P(D \mid C) = .03,\ P(C) = .40.$$
If one spring is selected at random from the total springs produced in a day, the probability that it is defective equals
$$P(D) = \sum_{X \in \{A,B,C\}} P(D \mid X)\,P(X) = 215/10^4 = .0215$$
If the selected spring is defective, the conditional probability that it was produced by machine A, B, or C can be calculated by
$$P(A \mid D) = P(D \mid A)P(A)/P(D) = 70/215$$
$$P(B \mid D) = P(D \mid B)P(B)/P(D) = 25/215$$
$$P(C \mid D) = P(D \mid C)P(C)/P(D) = 120/215$$
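A quick Matlab check of these numbers (a minimal sketch; the variable names are ours):

priors = [0.35 0.25 0.40];        % P(A), P(B), P(C)
defect = [0.02 0.01 0.03];        % P(D|A), P(D|B), P(D|C)
pD = sum(defect .* priors);       % total probability: P(D) = 0.0215
post = defect .* priors / pD;     % Bayes rule: [P(A|D) P(B|D) P(C|D)]
fprintf('P(D) = %.4f, posteriors = %.4f %.4f %.4f\n', pD, post);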
Foundation for Normal Distributions

$$\Gamma(\alpha) = \int_0^\infty e^{-t}\,t^{\alpha-1}\,dt, \quad \text{for } \alpha > 0$$
$$\Gamma(\alpha + 1) = \alpha\,\Gamma(\alpha), \qquad \Gamma(1) = 1, \qquad \Gamma(n+1) = n! \ \text{for integer } n \ge 0, \ \text{where } 0! = 1.$$
$$\int_{-\infty}^{\infty} e^{-x^2}\,dx = \sqrt{\pi}, \qquad \Gamma(\tfrac{1}{2}) = \sqrt{\pi}$$
$X \sim N(\mu, \sigma^2)$ means that
$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-(x-\mu)^2/(2\sigma^2)}, \quad -\infty < x < \infty$$
Sampling $X \sim N(\mu, \sigma^2)$ by Matlab: Y = random('Normal', mu, sigma, SampleSize, 1)
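For instance (an illustrative sketch; mu, sigma, and the sample size are arbitrary, and random requires the Statistics Toolbox):

mu = 2; sigma = 3; SampleSize = 1e5;
Y = random('Normal', mu, sigma, SampleSize, 1);   % draw N(mu, sigma^2) samples
fprintf('sample mean %.3f (mu = %g), sample std %.3f (sigma = %g)\n', ...
        mean(Y), mu, std(Y), sigma);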
Expectation and Covariance Matrix

Let $X_1, X_2, \ldots, X_n$ be random variables such that the expectation, variance, and covariance are defined as follows.
$$\mu_j = E[X_j], \qquad \sigma_j^2 = Var(X_j) = E[(X_j - \mu_j)^2] \tag{1}$$
$$\rho_{ij}\,\sigma_i\sigma_j = Cov(X_i, X_j) = E[(X_i - \mu_i)(X_j - \mu_j)] \tag{2}$$
Suppose $X = [X_1, X_2, \ldots, X_n]^t$ is a random vector; then the expectation and covariance matrix of $X$ are defined as
$$E[X] = [\mu_1, \mu_2, \ldots, \mu_n]^t = \mu \tag{3}$$
$$Cov(X) = \big[E[(X_i - \mu_i)(X_j - \mu_j)]\big] \tag{4}$$
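A small Matlab sketch estimating (3) and (4) from samples (the mean vector and covariance below are illustrative; mvnrnd requires the Statistics Toolbox):

n  = 1e4;
mu = [1; 2];                       % true mean vector
C  = [2 0.6; 0.6 1];               % true covariance matrix
X  = mvnrnd(mu', C, n);            % n x 2 sample matrix
muHat = mean(X)'                   % sample estimate of (3)
CHat  = cov(X)                     % sample estimate of (4)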
Bayes Decision Theory

(1) $p(\omega_i)$: a priori probability
(2) $p(x \mid \omega_i)$: class conditional density function
(3) $p(\omega_i \mid x)$: a posteriori probability
(4) $\alpha(x)$: an action (a decision)
(5) $\lambda(\alpha(x) \mid \omega_j)$: the loss function

$$p(\text{error}) = \int_x p(\text{error} \mid x)\,p(x)\,dx = \int_x p(\alpha(x) = \omega_i,\ x \in \omega_j,\ i \ne j)\,p(x)\,dx$$
$R(\alpha(x) \mid x) = \sum_{j=1}^{C} \lambda(\alpha(x) \mid \omega_j)\,p(\omega_j \mid x)$: conditional risk for pattern $x$
$\int_x R(\alpha(x) \mid x)\,p(x)\,dx$: average probability of error (error rate)

Bayes Decision Rule: For each $x$, find $\alpha(x)$ which minimizes $R(\alpha(x) \mid x)$.
For the 0-1 loss function, i.e.
$$\lambda(\alpha(x) \mid \omega_j) = \begin{cases} 0 & \text{if } \alpha(x) = \omega_j, \\ 1 & \text{otherwise,} \end{cases}$$
the Bayes decision rule can be reduced to
$$\min\,[R(\alpha(x) \mid x)] = \min_j\,[1 - p(\omega_j \mid x)] = \max_i\, p(\omega_i \mid x)$$
or: Assign $x$ to class $\omega_i$ if $p(\omega_i \mid x) > p(\omega_j \mid x)$ for all $j \ne i$.
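A minimal Matlab sketch of the 0-1 loss rule at a single point x (the densities and priors below are hypothetical values, not from the text):

p_x_given_w = [0.10; 0.25; 0.05];     % p(x|w_i) at some fixed x
priors      = [0.5; 0.3; 0.2];        % p(w_i)
post = p_x_given_w .* priors;         % proportional to p(w_i|x)
post = post / sum(post);              % normalize by p(x)
[~, alpha] = max(post);               % decision alpha(x): largest posterior
fprintf('assign x to class %d (posterior %.3f)\n', alpha, post(alpha));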
Example 2: $X \mid \omega_i \sim N(\mu_i, \sigma_i^2)$
$$p(x \mid \omega_i) = \frac{1}{\sqrt{2\pi}\,\sigma_i}\exp[-(x - \mu_i)^2/(2\sigma_i^2)]$$
The Bayes decision rule is to assign $x$ to class $\omega_i$ if
$$p(\omega_i \mid x) > p(\omega_j \mid x) \ \text{for } j \ne i$$
iff $p(x \mid \omega_i)\,p(\omega_i) > p(x \mid \omega_j)\,p(\omega_j)$
iff
$$\ln[\sigma_j/\sigma_i] + \frac{(x - \mu_j)^2}{2\sigma_j^2} - \frac{(x - \mu_i)^2}{2\sigma_i^2} > \ln[p(\omega_j)/p(\omega_i)]$$
In particular,
(a) if $\sigma_i = \sigma$ for each $i$, then the Bayes rule is to assign $x$ to class $\omega_i$ if
$$(\mu_i - \mu_j)\,x - \tfrac{1}{2}(\mu_i^2 - \mu_j^2) > \sigma^2 \ln[p(\omega_j)/p(\omega_i)]$$
(b) if $\sigma_i = \sigma$ and $p(\omega_i) = 1/C$ for each $i$, then the Bayes rule is to assign $x$ to class $\omega_i$ if
$$(\mu_i - \mu_j)\,[x - \tfrac{1}{2}(\mu_i + \mu_j)] > 0$$
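A sketch of this rule on a grid of x values (the parameters mu, sigma, prior are illustrative; normpdf requires the Statistics Toolbox):

mu = [0 2]; sigma = [1 1.5]; prior = [0.6 0.4];
x = linspace(-4, 6, 101);
g1 = log(normpdf(x, mu(1), sigma(1))) + log(prior(1));   % ln p(x|w_1)p(w_1)
g2 = log(normpdf(x, mu(2), sigma(2))) + log(prior(2));   % ln p(x|w_2)p(w_2)
labels = 1 + (g2 > g1);           % assign each x to the class with larger g
plot(x, labels, '.');             % shows the decision regions on the line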
Example 3: $X \mid \omega_i \sim N(\mu_i, C_i)$
$$p(x \mid \omega_i) = \frac{1}{(2\pi)^{d/2}\,|C_i|^{1/2}}\exp[-(x - \mu_i)^t C_i^{-1}(x - \mu_i)/2]$$
The Bayes decision rule is to assign $x$ to class $\omega_i$ if
$$p(\omega_i \mid x) > p(\omega_j \mid x) \ \text{for } j \ne i$$
iff
$$\tfrac{1}{2}\ln(|C_j|/|C_i|) + \tfrac{1}{2}\big[(x - \mu_j)^t C_j^{-1}(x - \mu_j) - (x - \mu_i)^t C_i^{-1}(x - \mu_i)\big] > \ln[p(\omega_j)/p(\omega_i)]$$
In particular,
(a) if $C_i = C$ for each $i$, then the Bayes decision rule is to assign $x$ to class $\omega_i$ if
$$(\mu_i - \mu_j)^t C^{-1} x - \tfrac{1}{2}\big[\mu_i^t C^{-1}\mu_i - \mu_j^t C^{-1}\mu_j\big] > \ln[p(\omega_j)/p(\omega_i)]$$
(b) if $C_i = \sigma^2 I$ for each $i$, then the Bayes decision rule is to assign $x$ to class $\omega_i$ if
$$(\mu_i - \mu_j)^t\big[x - \tfrac{1}{2}(\mu_i + \mu_j)\big] > \sigma^2 \ln[p(\omega_j)/p(\omega_i)]$$
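A sketch of the quadratic discriminant comparison for two classes (all parameters below are illustrative):

mu1 = [0; 0]; C1 = eye(2);
mu2 = [2; 1]; C2 = [2 0.5; 0.5 1];
P1 = 0.5; P2 = 0.5;
% g(x) = -0.5 ln|C| - 0.5 (x-mu)' C^{-1} (x-mu) + ln P, up to a common constant
g = @(x, mu, C, P) -0.5*log(det(C)) - 0.5*(x-mu)'*(C\(x-mu)) + log(P);
x = [1; 0.5];
if g(x, mu1, C1, P1) > g(x, mu2, C2, P2)
    disp('assign x to omega_1');
else
    disp('assign x to omega_2');
end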
Example 4: $X \mid \omega_i \sim N(\mu_i, C)$, $i = 1, 2$

Let $\Lambda(x) = \ln[p(x \mid \omega_1)/p(x \mid \omega_2)] = [x - \tfrac{1}{2}(\mu_1 + \mu_2)]^t C^{-1}(\mu_1 - \mu_2)$, and let $r = \ln[p(\omega_2)/p(\omega_1)]$. The Bayes decision rule is to assign $x$ to class $\omega_1$ if $\Lambda(x) > r$. Note that
$$x \in \omega_1 \Rightarrow \Lambda(X) \sim N(\Delta/2,\ \Delta), \qquad x \in \omega_2 \Rightarrow \Lambda(X) \sim N(-\Delta/2,\ \Delta)$$
where $\Delta = (\mu_1 - \mu_2)^t C^{-1}(\mu_1 - \mu_2)$ is called the squared Mahalanobis distance.
Hint: Prove that $\int \Lambda(x)\,p(x \mid \omega_1)\,dx = \Delta/2$ and $\int [\Lambda(x) - \Delta/2]^2\,p(x \mid \omega_1)\,dx = \Delta$.
Then the Bayes error rate can be computed by
$$E = p[x \in \omega_2,\ \Lambda(x) > r] + p[x \in \omega_1,\ \Lambda(x) < r]$$
$$= p(\omega_2)\int_r^\infty \frac{1}{\sqrt{2\pi\Delta}}\exp[-(y + \Delta/2)^2/(2\Delta)]\,dy + p(\omega_1)\int_{-\infty}^r \frac{1}{\sqrt{2\pi\Delta}}\exp[-(y - \Delta/2)^2/(2\Delta)]\,dy$$
Proof of $\Lambda(X) \mid \omega_1 \sim N(\Delta/2, \Delta)$

$X$ has a multivariate normal distribution, so is its linear mapping $\Lambda(X)$.
$$E[\Lambda(X) \mid \omega_1] = \int \Lambda(x)\,p(x \mid \omega_1)\,dx = \int \big\{[x - \tfrac{1}{2}(\mu_1 + \mu_2)]^t C^{-1}(\mu_1 - \mu_2)\big\}\,p(x \mid \omega_1)\,dx$$
$$= \int x^t C^{-1}(\mu_1 - \mu_2)\,p(x \mid \omega_1)\,dx - \tfrac{1}{2}(\mu_1 + \mu_2)^t C^{-1}(\mu_1 - \mu_2)\int p(x \mid \omega_1)\,dx$$
$$= \Big[\int x^t\,p(x \mid \omega_1)\,dx\Big]\,C^{-1}(\mu_1 - \mu_2) - \tfrac{1}{2}(\mu_1 + \mu_2)^t C^{-1}(\mu_1 - \mu_2)$$
$$= \mu_1^t C^{-1}(\mu_1 - \mu_2) - \tfrac{1}{2}(\mu_1 + \mu_2)^t C^{-1}(\mu_1 - \mu_2) = \tfrac{1}{2}(\mu_1 - \mu_2)^t C^{-1}(\mu_1 - \mu_2) = \Delta/2$$
$$Var[\Lambda(X) \mid \omega_1] = \int [\Lambda(x) - \Delta/2]^2\,p(x \mid \omega_1)\,dx = \int \big[(x - \mu_1)^t C^{-1}(\mu_1 - \mu_2)\big]^2\,p(x \mid \omega_1)\,dx$$
$$= (\mu_1 - \mu_2)^t C^{-1}\Big\{\int (x - \mu_1)(x - \mu_1)^t\,p(x \mid \omega_1)\,dx\Big\}\,C^{-1}(\mu_1 - \mu_2)$$
$$= (\mu_1 - \mu_2)^t C^{-1} C\,C^{-1}(\mu_1 - \mu_2) = (\mu_1 - \mu_2)^t C^{-1}(\mu_1 - \mu_2) = \Delta$$
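A Monte Carlo check of this proof (our sketch, reusing the numbers of the next example; mvnrnd requires the Statistics Toolbox):

mu1 = [1; 1]; mu2 = [-1; -1]; C = eye(2); n = 1e5;
X = mvnrnd(mu1', C, n)';                           % 2 x n samples from omega_1
Lam = (X - (mu1 + mu2)/2)' * (C \ (mu1 - mu2));    % n x 1 values of Lambda(X)
Delta = (mu1 - mu2)' * (C \ (mu1 - mu2));          % squared Mahalanobis distance
fprintf('mean %.3f (Delta/2 = %g), var %.3f (Delta = %g)\n', ...
        mean(Lam), Delta/2, var(Lam), Delta);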
In Example 4, let $\mu_1 = [1\ 1]^t$, $\mu_2 = [-1\ -1]^t$, $C = I$, and let $p(\omega_1) = p(\omega_2) = 1/2$. Then
$$\Lambda(x) = \big(x - \tfrac{1}{2}(\mu_1 + \mu_2)\big)^t I\,[2\ 2]^t = 2(x_1 + x_2), \qquad r = \ln[p(\omega_2)/p(\omega_1)] = 0$$
$$\Delta = (\mu_1 - \mu_2)^t I\,(\mu_1 - \mu_2) = 8$$
The Bayes decision rule is to assign $x$ to $\omega_1$ if $x_1 + x_2 > 0$. The Bayes error rate is
$$p(\text{error}) = \Phi\Big(-\frac{\Delta/2}{\sqrt{\Delta}}\Big) = \Phi\Big(-\frac{\sqrt{8}}{2}\Big) = \Phi(-\sqrt{2}) \approx .0786$$
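The same numbers in Matlab, using the error integral of Example 4 (normcdf is from the Statistics Toolbox):

mu1 = [1; 1]; mu2 = [-1; -1]; C = eye(2);
P1 = 0.5; P2 = 0.5;
Delta = (mu1 - mu2)' * (C \ (mu1 - mu2));     % squared Mahalanobis distance = 8
r = log(P2 / P1);                             % = 0 for equal priors
E = P2 * (1 - normcdf(r, -Delta/2, sqrt(Delta))) ...
  + P1 * normcdf(r, Delta/2, sqrt(Delta));    % the two tail integrals
fprintf('Delta = %g, Bayes error = %.4f\n', Delta, E);   % prints 0.0786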
Example 5: Bayes decision theory in a discrete case

Consider a two-class problem and assume that each component $x_j$ of the pattern $x$ is either 0 or 1 with the conditional probabilities
$$p_j = p(x_j = 1 \mid \omega_1) \quad \text{and} \quad q_j = p(x_j = 1 \mid \omega_2) \tag{5}$$
Suppose that the components of $x$ are conditionally independent. Then
$$p(x \mid \omega_1) = \prod_{j=1}^{d} p_j^{x_j}(1 - p_j)^{1 - x_j}, \qquad p(x \mid \omega_2) = \prod_{j=1}^{d} q_j^{x_j}(1 - q_j)^{1 - x_j} \tag{6}$$
Let the log-likelihood ratio $\Lambda(x) = \ln[p(x \mid \omega_1)/p(x \mid \omega_2)]$ and $r = \ln[p(\omega_2)/p(\omega_1)]$; then
$$\Lambda(x) = \sum_{j=1}^{d} x_j \ln\frac{p_j(1 - q_j)}{q_j(1 - p_j)} + \sum_{j=1}^{d} \ln\frac{1 - p_j}{1 - q_j} \tag{7}$$
The Bayes decision rule is to assign $x$ to $\omega_1$ if $\Lambda(x) > r$. The decision boundary is $\sum_{j=1}^{d} w_j x_j + w_0 = 0$, where
$$w_j = \ln\frac{p_j(1 - q_j)}{q_j(1 - p_j)},\ 1 \le j \le d, \qquad w_0 = \sum_{j=1}^{d} \ln\frac{1 - p_j}{1 - q_j} + \ln\frac{p(\omega_1)}{p(\omega_2)}$$
In particular, if $p(\omega_1) = p(\omega_2) = 1/2$, $p_j = p$, $q_j = q = 1 - p$ for $1 \le j \le d$, $d$ is odd and $p > q$, then the Bayes error rate is
$$E = \sum_{k=0}^{(d-1)/2} \frac{d!}{(d-k)!\,k!}\,p^k(1 - p)^{d-k} \tag{8}$$
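A sketch evaluating (8) for hypothetical values of p and d:

p = 0.8; d = 7;                        % assumes p > 1/2 and d odd
k = 0:(d-1)/2;
E = sum(arrayfun(@(kk) nchoosek(d, kk), k) .* p.^k .* (1-p).^(d-k));
fprintf('d = %d, p = %.2f, Bayes error = %.4f\n', d, p, E);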
$X \mid \omega_1 \sim N(3, 2^2)$ and $X \mid \omega_2 \sim N(6, 1)$
$$p(x \mid \omega_i) = \frac{1}{\sqrt{2\pi}\,\sigma_i}\exp[-(x - \mu_i)^2/(2\sigma_i^2)], \quad i = 1, 2$$
The Maximum Likelihood (ML) decision is to assign $x$ to $\omega_1$ if $p(x \mid \omega_1) > p(x \mid \omega_2)$.
The Bayes decision is to assign $x$ to $\omega_1$ if $p(x \mid \omega_1)\,p(\omega_1) > p(x \mid \omega_2)\,p(\omega_2)$.
Note that ML is the special case obtained by assuming $p(\omega_1) = p(\omega_2) = 1/2$, which need not be true in practical applications. We shall show the effect of $p(\omega_1) = 2/3$, $p(\omega_2) = 1/3$.

ML Decision: $y \in \omega_1$ if $y < 7 - \sqrt{4 + (8\ln 2)/3}$ or $y > 7 + \sqrt{4 + (8\ln 2)/3}$
Bayes Decision: $y \in \omega_1$ if $y < 5$ or $y > 9$

The error probability can be computed by
$$\text{Err} = p(\omega_1)\int_a^b \frac{1}{2\sqrt{2\pi}}\exp[-(x-3)^2/(2 \cdot 4)]\,dx + p(\omega_2)\Big\{\int_{-\infty}^a \frac{1}{\sqrt{2\pi}}\exp[-(x-6)^2/2]\,dx + \int_b^\infty \frac{1}{\sqrt{2\pi}}\exp[-(x-6)^2/2]\,dx\Big\}$$
where $a = 4.5817$, $b = 9.4183$ for the ML decision and $a = 5$, $b = 9$ for the Bayes decision; then $\text{Err}_{ML} \approx .169$ and $\text{Err}_{Bayes} \approx .158$.
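A numeric check of both rules (our sketch; normcdf requires the Statistics Toolbox):

P1 = 2/3; P2 = 1/3;                       % priors
err = @(a, b) P1*(normcdf(b,3,2) - normcdf(a,3,2)) ...      % class 1 in (a,b)
            + P2*(normcdf(a,6,1) + 1 - normcdf(b,6,1));     % class 2 in tails
aML = 7 - sqrt(4 + 8*log(2)/3);
bML = 7 + sqrt(4 + 8*log(2)/3);
fprintf('ML:    a = %.4f, b = %.4f, Err = %.4f\n', aML, bML, err(aML, bML));
fprintf('Bayes: a = 5,      b = 9,      Err = %.4f\n', err(5, 9));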
$X \mid \omega_1 \sim \text{Rayleigh}(\sigma_1^2)$ and $X \mid \omega_2 \sim \text{Rayleigh}(\sigma_2^2)$
$$p(x \mid \omega_i) = \frac{x}{\sigma_i^2}\exp\Big(-\frac{x^2}{2\sigma_i^2}\Big), \quad x > 0, \ i = 1, 2$$
The Maximum Likelihood (ML) decision is to assign $x$ to $\omega_1$ if $p(x \mid \omega_1) > p(x \mid \omega_2)$.
The Bayes decision is to assign $x$ to $\omega_1$ if $p(x \mid \omega_1)\,p(\omega_1) > p(x \mid \omega_2)\,p(\omega_2)$.
Note that ML is the special case obtained by assuming $p(\omega_1) = p(\omega_2) = 1/2$, which need not be true in practical applications. We shall show the effect of $p(\omega_1) = 3/4$, $p(\omega_2) = 1/4$ with $\sigma_1 < \sigma_2$.

ML Decision: $y \in \omega_1$ if $y < t$, where $t^2 = \dfrac{2\sigma_1^2\sigma_2^2}{\sigma_2^2 - \sigma_1^2}\ln\dfrac{\sigma_2^2}{\sigma_1^2}$

Bayes Decision: $y \in \omega_1$ if $y < t$, where $t^2 = \dfrac{2\sigma_1^2\sigma_2^2}{\sigma_2^2 - \sigma_1^2}\Big[\ln\dfrac{\sigma_2^2}{\sigma_1^2} + \ln\dfrac{p(\omega_1)}{p(\omega_2)}\Big]$

The error probability can be computed by
$$\text{Err} = p(\omega_1)\int_t^\infty \frac{x}{\sigma_1^2}\exp\Big(-\frac{x^2}{2\sigma_1^2}\Big)dx + p(\omega_2)\int_0^t \frac{x}{\sigma_2^2}\exp\Big(-\frac{x^2}{2\sigma_2^2}\Big)dx = p(\omega_1)\exp\Big(-\frac{t^2}{2\sigma_1^2}\Big) + p(\omega_2)\Big[1 - \exp\Big(-\frac{t^2}{2\sigma_2^2}\Big)\Big]$$
Since $p(\omega_1) > p(\omega_2)$ here, the Bayes threshold exceeds the ML threshold, and the Bayes rule attains the smaller error probability.
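A short Matlab sketch evaluating both thresholds and error rates; the values $\sigma_1^2 = 1$ and $\sigma_2^2 = 9$ are hypothetical stand-ins, since the example's numeric parameters are not given above:

s1sq = 1; s2sq = 9;                  % hypothetical sigma_1^2 < sigma_2^2
P1 = 3/4; P2 = 1/4;                  % priors from the text
K = 2*s1sq*s2sq/(s2sq - s1sq);
tML = sqrt(K*log(s2sq/s1sq));                  % ML threshold
tB  = sqrt(K*(log(s2sq/s1sq) + log(P1/P2)));   % Bayes threshold
err = @(t) P1*exp(-t^2/(2*s1sq)) + P2*(1 - exp(-t^2/(2*s2sq)));
fprintf('ML:    t = %.3f, Err = %.4f\n', tML, err(tML));
fprintf('Bayes: t = %.3f, Err = %.4f\n', tB, err(tB));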
$X \mid \omega_1 \sim \chi^2(r_1)$ and $X \mid \omega_2 \sim \chi^2(r_2)$
$$p(x \mid \omega_i) = \frac{1}{\Gamma(r_i/2)\,2^{r_i/2}}\,x^{(r_i/2)-1}e^{-x/2}, \quad x > 0, \ i = 1, 2$$
The Maximum Likelihood (ML) decision is to assign $x$ to $\omega_1$ if $p(x \mid \omega_1) > p(x \mid \omega_2)$.
The Bayes decision is to assign $x$ to $\omega_1$ if $p(x \mid \omega_1)\,p(\omega_1) > p(x \mid \omega_2)\,p(\omega_2)$.
Note that ML is the special case obtained by assuming $p(\omega_1) = p(\omega_2) = 1/2$, which need not be true in practical applications. We shall show the effect of $p(\omega_1) = 1/7$, $p(\omega_2) = 6/7$ with $r_1 < r_2$.

ML Decision: $y \in \omega_1$ if $\dfrac{\Gamma(r_2/2)}{\Gamma(r_1/2)}\,2^{(r_2 - r_1)/2} > y^{(r_2 - r_1)/2}$

Bayes Decision: $y \in \omega_1$ if $\dfrac{\Gamma(r_2/2)}{\Gamma(r_1/2)}\,2^{(r_2 - r_1)/2}\,\dfrac{p(\omega_1)}{p(\omega_2)} > y^{(r_2 - r_1)/2}$

The error probability can be computed by
$$\text{Err} = p(\omega_1)\int_t^\infty \frac{1}{\Gamma(r_1/2)\,2^{r_1/2}}\,x^{(r_1/2)-1}e^{-x/2}\,dx + p(\omega_2)\int_0^t \frac{1}{\Gamma(r_2/2)\,2^{r_2/2}}\,x^{(r_2/2)-1}e^{-x/2}\,dx$$
When $r_1 = 4$, $r_2 = 8$, $p(\omega_1) = 1/7$, $p(\omega_2) = 6/7$, we have $t = \sqrt{24} \approx 4.899$ for the ML decision and $t = 2$ for the Bayes decision, and $\text{Err}_{ML} \approx .241$ and $\text{Err}_{Bayes} \approx .121$.
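A numeric check of this example (chi2cdf requires the Statistics Toolbox):

r1 = 4; r2 = 8; P1 = 1/7; P2 = 6/7;
c = gamma(r2/2)*2^(r2/2) / (gamma(r1/2)*2^(r1/2));   % = 24
tML = sqrt(c);                 % ML boundary: x^2 = 24, t ~ 4.899
tB  = sqrt(c*P1/P2);           % Bayes boundary: x^2 = 4, t = 2
err = @(t) P1*(1 - chi2cdf(t, r1)) + P2*chi2cdf(t, r2);
fprintf('ML:    t = %.3f, Err = %.4f\n', tML, err(tML));
fprintf('Bayes: t = %.3f, Err = %.4f\n', tB, err(tB));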
4 X ω N(µ, σ ) and X ω N(µ, σ ) p(x ω i ) = exp[ (x µ i ) /σi ], i =, πσi The Maximum Likelihood (ML) decision is to assign x to ω if p(x ω ) > p(x ω ) The Bayes decision is to assign x to ω if p(x ω )p(ω ) > p(x ω )p(ω ) Note that ML is the special case by assuming p(ω ) = p(ω ) = in practical applications. which need not be true The error probability can be computed by Err = p(ω ) T + p(ω ) T πσ exp[ (x µ ) /(σ )]dx πσ exp[ (x µ ) /(σ )]dx As soon as T is chosen, Parameters p(ω i ), µ i, σi, i =, could be estimated, so is Err. Otsu (979) and Fisher (936) chose T to maximize the following criterion, respectively σ B = p(ω )p(ω )(µ µ ) F isher = (µ µ ) σ +σ
A Simple Thresholding Algorithm
By Otsu, IEEE Trans. on SMC, 62-66, 1979

(1) $p_i \leftarrow n_i/n$, where $n = \sum_{i=1}^{G} n_i$
(2) $u_T = \sum_{k=1}^{G} k\,p_k$
(3) Do for $k = 1, \ldots, G$:
$$\omega(k) \leftarrow \sum_{i=1}^{k} p_i, \qquad u(k) \leftarrow \sum_{i=1}^{k} i\,p_i, \qquad \sigma_B^2(k) \leftarrow \frac{[u_T\,\omega(k) - u(k)]^2}{\omega(k)\,[1 - \omega(k)]}$$
(4) Select $k^*$ such that $\sigma_B^2(k^*)$ is maximized.

Note that
$$\omega_1 = \omega(k), \quad u_1 = \sum_{i=1}^{k} i\,p(i \mid C_1) = u(k)/\omega(k)$$
$$\omega_2 = 1 - \omega(k), \quad u_2 = \sum_{i=k+1}^{G} i\,p(i \mid C_2) = [u_T - u(k)]/[1 - \omega(k)]$$
$$\sigma_\omega^2 = \omega_1\sigma_1^2 + \omega_2\sigma_2^2, \qquad \sigma_B^2 = \omega_1(u_1 - u_T)^2 + \omega_2(u_2 - u_T)^2$$
$$\sigma_T^2 = \sum_{i=1}^{G}(i - u_T)^2\,p_i, \qquad \sigma_\omega^2 + \sigma_B^2 = \sigma_T^2$$
$$\epsilon = \frac{\sigma_B^2}{\sigma_T^2}, \qquad \kappa = \frac{\sigma_T^2}{\sigma_\omega^2}, \qquad \lambda = \frac{\sigma_B^2}{\sigma_\omega^2}$$
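A direct Matlab implementation of steps (1)-(4), as a sketch assuming a grayscale image with G = 256 levels (the function name otsu_threshold is ours):

function k = otsu_threshold(A)
    G = 256;
    counts = histcounts(double(A(:)), 0:G);    % n_i for gray levels 0..G-1
    p = counts / sum(counts);                  % step (1)
    i = 1:G;
    uT = sum(i .* p);                          % step (2)
    w = cumsum(p);                             % omega(k)
    u = cumsum(i .* p);                        % u(k)
    sigmaB2 = (uT*w - u).^2 ./ (w .* (1 - w)); % step (3)
    sigmaB2(w == 0 | w == 1) = 0;              % avoid 0/0 at the two ends
    [~, k] = max(sigmaB2);                     % step (4)
end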
Gamma Function and the Volumes of High Dimensional Spheres

Define $\Gamma(x) = \int_0^\infty e^{-t}t^{x-1}\,dt$ for $x > 0$ and let $\gamma = \int_{-\infty}^{\infty} e^{-x^2}\,dx$. Then
(a) $\gamma = \int_{-\infty}^{\infty} e^{-x^2}\,dx = \sqrt{\pi}$
(b) Show that $\Gamma(\tfrac{1}{2}) = \gamma = \sqrt{\pi}$
(c) $\Gamma(x + 1) = x\,\Gamma(x)$ for $x > 0$; $\Gamma(n) = (n-1)!$ if $n \in N$.
(d) The volume of a $d$-dimensional unit sphere is $\pi^{d/2}/\Gamma(\tfrac{d}{2} + 1)$.

[Figure 1: The volume of a $d$-dimensional unit sphere, $\pi^{d/2}/\Gamma((d/2)+1)$, plotted for $d = 1, \ldots, 15$.]

The plot in Figure 1 is produced by the following Matlab code:

V(1)=2; V(2)=pi; V(3)=4*pi/3; V(4)=pi*pi/2; V(5)=8*pi*pi/15;
V(6)=pi*pi*pi/6; V(7)=16*pi*pi*pi/105; V(8)=(pi)^4/24; V(9)=32*(pi)^4/945;
V(10)=(pi)^5/120; V(11)=64*(pi)^5/10395; V(12)=(pi)^6/720;
V(13)=128*(pi)^6/135135; V(14)=(pi)^7/5040; V(15)=256*(pi)^7/2027025;
D=1:15; plot(D,V,'b-v')
title('The volume of a d-dim unit sphere is \pi^{d/2}/\Gamma((d/2)+1)')
The Derivation of Volume for an n-dimensional Sphere

For $n = 2$: $x_1 = r\cos\theta$, $x_2 = r\sin\theta$, with $0 \le \theta \le 2\pi$, $0 \le r \le R$, and
$$J_2 = \frac{\partial(x_1, x_2)}{\partial(r, \theta)} = \begin{vmatrix} \frac{\partial x_1}{\partial r} & \frac{\partial x_1}{\partial \theta} \\ \frac{\partial x_2}{\partial r} & \frac{\partial x_2}{\partial \theta} \end{vmatrix} = \begin{vmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{vmatrix} = r$$
The volume is computed by
$$V_2 = \int_0^R \int_0^{2\pi} |J_2|\,d\theta\,dr = \int_0^R \int_0^{2\pi} r\,d\theta\,dr = \pi R^2$$
For $n = 3$: $x_1 = r\cos\theta_1\cos\theta_2$, $x_2 = r\cos\theta_1\sin\theta_2$, $x_3 = r\sin\theta_1$, with $0 \le \theta_2 \le 2\pi$, $-\pi/2 \le \theta_1 \le \pi/2$, $0 \le r \le R$, and
$$J_3 = \frac{\partial(x_1, x_2, x_3)}{\partial(r, \theta_1, \theta_2)} = r^2\cos\theta_1$$
The volume is computed by
$$V_3 = \int_0^R \int_{-\pi/2}^{\pi/2} \int_0^{2\pi} |J_3|\,d\theta_2\,d\theta_1\,dr = \int_0^R \int_{-\pi/2}^{\pi/2} \int_0^{2\pi} r^2\cos\theta_1\,d\theta_2\,d\theta_1\,dr = \frac{4\pi R^3}{3}$$
For $n \ge 4$:
$$x_1 = r\cos\theta_1\cos\theta_2\cdots\cos\theta_{n-2}\cos\theta_{n-1}, \quad 0 \le \theta_{n-1} \le 2\pi$$
$$x_2 = r\cos\theta_1\cos\theta_2\cdots\cos\theta_{n-2}\sin\theta_{n-1}, \quad -\pi/2 \le \theta_{n-2} \le \pi/2$$
$$x_3 = r\cos\theta_1\cos\theta_2\cdots\cos\theta_{n-3}\sin\theta_{n-2}, \quad -\pi/2 \le \theta_{n-3} \le \pi/2$$
$$\vdots$$
$$x_j = r\cos\theta_1\cos\theta_2\cdots\cos\theta_{n-j}\sin\theta_{n-j+1}, \quad -\pi/2 \le \theta_{n-j} \le \pi/2$$
$$\vdots$$
$$x_{n-1} = r\cos\theta_1\sin\theta_2, \qquad x_n = r\sin\theta_1, \quad 0 \le r \le R$$
Note that $\sum_{i=1}^{n} x_i^2 = r^2$. Denote $c_i = \cos\theta_i$ and $s_i = \sin\theta_i$ for $1 \le i \le n-1$. The Jacobian
$$J_n = \frac{\partial(x_1, x_2, \ldots, x_n)}{\partial(r, \theta_1, \ldots, \theta_{n-1})}$$
is the determinant of the $n \times n$ matrix whose first column is $\partial x_i/\partial r = x_i/r$ and whose remaining columns are the partial derivatives with respect to $\theta_1, \ldots, \theta_{n-1}$; each entry is $\pm r$ times a product of the $c_i$ and $s_i$.
Then $J_n$ can be written as $r^{n-1}$ times common cosine and sine factors times a determinant whose entries involve $t_i = \sin\theta_i/\cos\theta_i$. Subtracting each column from the preceding one and doing further simplifications, using $t_j^2 + 1 = (s_j^2 + c_j^2)/c_j^2 = 1/c_j^2$, the determinant collapses, and we obtain
$$J_n = r^{n-1}\cos^{n-2}\theta_1\,\cos^{n-3}\theta_2\cdots\cos^2\theta_{n-3}\,\cos\theta_{n-2}$$
Therefore, the volume of an $n$-dimensional sphere can be calculated by
$$V_n = \int_0^R \int_0^{2\pi} \int_{-\pi/2}^{\pi/2}\cdots\int_{-\pi/2}^{\pi/2} |J_n|\,d\theta_1\cdots d\theta_{n-2}\,d\theta_{n-1}\,dr$$
$$= \int_0^R \int_0^{2\pi} \int_{-\pi/2}^{\pi/2}\cdots\int_{-\pi/2}^{\pi/2} r^{n-1}\cos^{n-2}\theta_1\,\cos^{n-3}\theta_2\cdots\cos\theta_{n-2}\,d\theta_1\cdots d\theta_{n-2}\,d\theta_{n-1}\,dr = \frac{\pi^{n/2}}{\Gamma(\frac{n}{2}+1)}R^n$$
Note that the above computations exploit the properties of the following Gamma and Beta functions, and trigonometry.
$$\Gamma(\alpha) = \int_0^\infty e^{-t}t^{\alpha-1}\,dt \ \text{ for } \alpha > 0$$
$$\text{Beta}(\alpha, \beta) = \int_0^1 x^{\alpha-1}(1-x)^{\beta-1}\,dx, \ \text{ where } \alpha, \beta > 0; \qquad \text{Beta}(\alpha, \beta) = \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha + \beta)}$$
Relationship between Gamma and Beta functions:
$$\Gamma(x)\Gamma(y) = \int_0^\infty e^{-u}u^{x-1}\,du \int_0^\infty e^{-v}v^{y-1}\,dv = \int_0^\infty\!\!\int_0^\infty e^{-u-v}\,u^{x-1}v^{y-1}\,du\,dv$$
$$= \int_{z=0}^\infty \int_{t=0}^1 e^{-z}(zt)^{x-1}[z(1-t)]^{y-1}\,z\,dt\,dz \quad \text{by putting } u = zt,\ v = z(1-t)$$
$$= \int_{z=0}^\infty e^{-z}z^{x+y-1}\,dz \int_{t=0}^1 t^{x-1}(1-t)^{y-1}\,dt = \Gamma(x+y)\,\text{Beta}(x, y)$$
Also
$$\Gamma(\alpha) = (\alpha - 1)\,\Gamma(\alpha - 1) \ \text{for } \alpha > 1, \qquad \Gamma(1) = 1, \qquad \Gamma(\tfrac{1}{2}) = \sqrt{\pi}$$
$$\int_0^{2\pi} d\theta_{n-1} = 2\pi, \qquad \int_{-\pi/2}^{\pi/2}\cos\theta_{n-2}\,d\theta_{n-2} = 2, \qquad \int_{-\pi/2}^{\pi/2}\cos^2\theta_{n-3}\,d\theta_{n-3} = \frac{\pi}{2}$$
$$\int_{-\pi/2}^{\pi/2}\cos^m\theta\,d\theta = 2\int_0^{\pi/2}\cos^m\theta\,d\theta, \qquad 3 \le m \le n-2$$
Now
$$\int_0^{\pi/2}\cos^m\theta\,d\theta = \int_0^1 x^m(1 - x^2)^{-1/2}\,dx \quad \text{by letting } x = \cos\theta$$
$$= \int_0^1 y^{m/2}(1 - y)^{-1/2}\,\tfrac{1}{2}y^{-1/2}\,dy = \frac{1}{2}\int_0^1 y^{(m-1)/2}(1 - y)^{-1/2}\,dy, \quad \text{where } x^2 = y$$
$$= \frac{1}{2}\,\text{Beta}\Big(\frac{m+1}{2}, \frac{1}{2}\Big) = \frac{1}{2}\,\frac{\Gamma(\frac{m+1}{2})\,\Gamma(\frac{1}{2})}{\Gamma(\frac{m}{2} + 1)}$$
Therefore,
$$V_n = \int_0^R \int_0^{2\pi} \int_{-\pi/2}^{\pi/2}\cdots\int_{-\pi/2}^{\pi/2} r^{n-1}\cos^{n-2}\theta_1\,\cos^{n-3}\theta_2\cdots\cos\theta_{n-2}\,d\theta_1\cdots d\theta_{n-2}\,d\theta_{n-1}\,dr$$
$$= \frac{R^n}{n}\,(2\pi)\,(2)\,\Big(\frac{\pi}{2}\Big)\prod_{m=3}^{n-2}\text{Beta}\Big(\frac{m+1}{2}, \frac{1}{2}\Big) = \frac{\pi^{n/2}\,R^n}{\Gamma(\frac{n}{2} + 1)}$$
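The product formula implies the recursion $V_n = V_{n-1}\,\text{Beta}(\frac{n+1}{2}, \frac{1}{2})$; a short Matlab check (our sketch, with $R = 1$) confirms it agrees with the closed form $\pi^{n/2}/\Gamma(\frac{n}{2}+1)$:

n = 1:15;
V = pi.^(n/2) ./ gamma(n/2 + 1);        % closed form derived above (R = 1)
Vrec = zeros(1, 15); Vrec(1) = 2;       % V_1 = 2
for k = 2:15
    Vrec(k) = Vrec(k-1) * beta((k+1)/2, 1/2);  % V_k = V_{k-1} Beta((k+1)/2, 1/2)
end
fprintf('max |V - Vrec| = %.2e\n', max(abs(V - Vrec)));   % machine precision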