Bayesian statistics DS GA 1002 Probability and Statistics for Data Science http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall17 Carlos Fernandez-Granda
Frequentist vs Bayesian statistics
In frequentist statistics the data are modeled as realizations of a distribution that depends on deterministic parameters. In Bayesian statistics the parameters are modeled as random variables. This allows us to quantify our prior uncertainty and incorporate additional information.
Learning Bayesian models Conjugate priors Bayesian estimators
Prior distribution and likelihood
The data x ∈ R^n are a realization of a random vector X, which depends on a vector of parameters Θ
Modeling choices:
Prior distribution: distribution of Θ encoding our uncertainty about the model before seeing the data
Likelihood: conditional distribution of X given Θ
Posterior distribution
The posterior distribution is the conditional distribution of Θ given X
Evaluating the posterior at the observed data x allows us to update our uncertainty about Θ using the data
Bernoulli distribution
Goal: estimate the Bernoulli parameter from iid data
We consider two different Bayesian estimators Θ_1 and Θ_2:
1. Θ_1 is a conservative estimator with a uniform prior pdf
   f_{Θ_1}(θ) = 1 for 0 ≤ θ ≤ 1, and 0 otherwise
2. Θ_2 has a prior pdf skewed towards 1
   f_{Θ_2}(θ) = 2θ for 0 ≤ θ ≤ 1, and 0 otherwise
Prior distributions
[Figure: the two prior pdfs on [0, 1]]
Bernoulli distribution: likelihood
The data are assumed to be iid, so the likelihood is
p_{X|Θ}(x | θ) = θ^{n_1} (1 − θ)^{n_0}
where n_0 is the number of zeros and n_1 the number of ones
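As a quick sanity check, the likelihood above can be computed directly from a data vector (a minimal sketch; the function name `bernoulli_likelihood` is mine, not from the slides):

```python
def bernoulli_likelihood(theta, data):
    """Likelihood of iid Bernoulli data: theta^n1 * (1 - theta)^n0."""
    n1 = sum(data)           # number of ones
    n0 = len(data) - n1      # number of zeros
    return theta ** n1 * (1 - theta) ** n0

# For data with n0 = 1 zero and n1 = 3 ones:
print(bernoulli_likelihood(0.5, [1, 1, 0, 1]))  # 0.5^4 = 0.0625
```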
Bernoulli distribution: posterior distribution
f_{Θ_1|X}(θ | x) = f_{Θ_1}(θ) p_{X|Θ_1}(x | θ) / p_X(x)
= f_{Θ_1}(θ) p_{X|Θ_1}(x | θ) / ∫ f_{Θ_1}(u) p_{X|Θ_1}(x | u) du
= θ^{n_1} (1 − θ)^{n_0} / ∫ u^{n_1} (1 − u)^{n_0} du
= θ^{n_1} (1 − θ)^{n_0} / β(n_1 + 1, n_0 + 1)
where β(a, b) := ∫ u^{a−1} (1 − u)^{b−1} du
Bernoulli distribution: posterior distribution
f_{Θ_2|X}(θ | x) = f_{Θ_2}(θ) p_{X|Θ_2}(x | θ) / p_X(x)
= θ^{n_1+1} (1 − θ)^{n_0} / ∫ u^{n_1+1} (1 − u)^{n_0} du
= θ^{n_1+1} (1 − θ)^{n_0} / β(n_1 + 2, n_0 + 1)
where β(a, b) := ∫ u^{a−1} (1 − u)^{b−1} du
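Both posteriors can be evaluated numerically: the normalizing constant β(a, b) equals Γ(a)Γ(b)/Γ(a + b), which Python's `math.gamma` supplies (a sketch; the function names are mine):

```python
from math import gamma

def beta_fn(a, b):
    """Beta function via the gamma function: beta(a, b) = Γ(a)Γ(b)/Γ(a+b)."""
    return gamma(a) * gamma(b) / gamma(a + b)

def posterior_uniform_prior(theta, n0, n1):
    """Posterior pdf of Θ1 (uniform prior): beta with a = n1 + 1, b = n0 + 1."""
    return theta ** n1 * (1 - theta) ** n0 / beta_fn(n1 + 1, n0 + 1)

def posterior_skewed_prior(theta, n0, n1):
    """Posterior pdf of Θ2 (prior 2θ): beta with a = n1 + 2, b = n0 + 1."""
    return theta ** (n1 + 1) * (1 - theta) ** n0 / beta_fn(n1 + 2, n0 + 1)

# Example from the slides: n0 = 1, n1 = 3
print(posterior_uniform_prior(0.75, 1, 3))  # 2.109375
```

A quick midpoint-rule integration confirms the pdf integrates to 1 over [0, 1].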
Bernoulli distribution: n_0 = 1, n_1 = 3
[Figure: posterior pdfs under the two priors]
Bernoulli distribution: n_0 = 3, n_1 = 1
[Figure: posterior pdfs under the two priors]
Bernoulli distribution: n_0 = 91, n_1 = 9
[Figure: posterior pdfs, marking the posterior mean (uniform prior), posterior mean (skewed prior), and ML estimator]
Learning Bayesian models Conjugate priors Bayesian estimators
Beta random variable
Useful in Bayesian statistics
Unimodal continuous distribution on the unit interval
The pdf of a beta distribution with parameters a and b is defined as
f_β(θ; a, b) := θ^{a−1} (1 − θ)^{b−1} / β(a, b) if 0 ≤ θ ≤ 1, and 0 otherwise
where β(a, b) := ∫ u^{a−1} (1 − u)^{b−1} du
Beta random variables
[Figure: beta pdfs for (a, b) = (1, 1), (1, 2), (3, 3), (6, 2), (3, 15)]
Learning a Bernoulli distribution The first prior is beta with parameters a = 1 and b = 1 The second prior is beta with parameters a = 2 and b = 1 The posteriors are beta with parameters a = n 1 + 1, b = n 0 + 1 and a = n 1 + 2, b = n 0 + 1 respectively
Conjugate priors A conjugate family of distributions for a certain likelihood satisfies the following property: If the prior belongs to the family, the posterior also belongs to the family Beta distributions are conjugate priors when the likelihood is binomial
The beta distribution is conjugate to the binomial likelihood
Θ is beta with parameters a and b; X is binomial with parameters n and Θ
f_{Θ|X}(θ | x) = f_Θ(θ) p_{X|Θ}(x | θ) / p_X(x)
= f_Θ(θ) p_{X|Θ}(x | θ) / ∫ f_Θ(u) p_{X|Θ}(x | u) du
= θ^{a−1} (1 − θ)^{b−1} (n choose x) θ^x (1 − θ)^{n−x} / ∫ u^{a−1} (1 − u)^{b−1} (n choose x) u^x (1 − u)^{n−x} du
= θ^{x+a−1} (1 − θ)^{n−x+b−1} / ∫ u^{x+a−1} (1 − u)^{n−x+b−1} du
= f_β(θ; x + a, n − x + b)
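The conjugacy can be checked numerically: the beta prior times the binomial likelihood is proportional to the claimed beta posterior, so their ratio is constant in θ (a sketch; the function names are mine):

```python
from math import comb, gamma

def beta_pdf(theta, a, b):
    """Beta pdf: θ^(a-1) (1-θ)^(b-1) Γ(a+b) / (Γ(a) Γ(b))."""
    return (theta ** (a - 1) * (1 - theta) ** (b - 1)
            * gamma(a + b) / (gamma(a) * gamma(b)))

def unnormalized_posterior(theta, a, b, n, x):
    """Prior f_β(θ; a, b) times binomial likelihood C(n, x) θ^x (1-θ)^(n-x)."""
    return beta_pdf(theta, a, b) * comb(n, x) * theta ** x * (1 - theta) ** (n - x)

# Beta(3, 3) prior, n = 10 trials, x = 7 successes: posterior should be Beta(10, 6)
ratios = [unnormalized_posterior(t, 3, 3, 10, 7) / beta_pdf(t, 10, 6)
          for t in (0.2, 0.5, 0.8)]
print(ratios)  # all (numerically) equal: the ratio is the constant marginal p_X(x)
```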
Poll in New Mexico
429 participants: 227 people intend to vote for Clinton and 202 for Trump
Probability that Trump wins in New Mexico?
Assumptions:
The fraction of Trump voters is modeled as a random variable Θ
Poll participants are selected uniformly at random with replacement
The number of Trump voters in the poll is binomial with parameters n = 429 and p = Θ
Poll in New Mexico
Prior is uniform, so beta with parameters a = 1 and b = 1
Likelihood is binomial
Posterior is beta with parameters a = 202 + 1 and b = 227 + 1
The probability that Trump wins in New Mexico is the posterior probability that Θ is greater than 0.5
Poll in New Mexico
[Figure: posterior beta pdf; 88.6% of the posterior probability lies below 0.5 and 11.4% above]
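The 11.4% figure can be reproduced by integrating the beta(203, 228) posterior above 0.5. Since Γ(431) overflows double precision, the pdf is evaluated through `math.lgamma` (a sketch under the slides' assumptions; the names are mine):

```python
from math import exp, lgamma, log

def log_beta_pdf(theta, a, b):
    """Log of the beta pdf, numerically stable for large a and b."""
    log_norm = lgamma(a) + lgamma(b) - lgamma(a + b)
    return (a - 1) * log(theta) + (b - 1) * log(1 - theta) - log_norm

# Posterior for the poll: uniform prior, 202 Trump voters, 227 Clinton voters
a, b = 202 + 1, 227 + 1

# Midpoint-rule integral of the posterior pdf over (0.5, 1)
m = 20000
width = 0.5 / m
p_trump_wins = sum(exp(log_beta_pdf(0.5 + (k + 0.5) * width, a, b))
                   for k in range(m)) * width
print(round(p_trump_wins, 3))  # ≈ 0.114
```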
Learning Bayesian models Conjugate priors Bayesian estimators
Bayesian estimators What estimator should we use? Two main options: The posterior mean The posterior mode
Posterior mean
Mean of the posterior distribution:
θ_MMSE(x) := E(Θ | X = x)
Minimum mean-square-error (MMSE) estimate: for any other estimator θ_other(x),
E((θ_other(X) − Θ)²) ≥ E((θ_MMSE(X) − Θ)²)
Posterior mean
E((θ_other(X) − Θ)² | X = x)
= E((θ_other(X) − θ_MMSE(X) + θ_MMSE(X) − Θ)² | X = x)
= (θ_other(x) − θ_MMSE(x))² + E((θ_MMSE(X) − Θ)² | X = x)
  + 2 (θ_other(x) − θ_MMSE(x)) (θ_MMSE(x) − E(Θ | X = x))
= (θ_other(x) − θ_MMSE(x))² + E((θ_MMSE(X) − Θ)² | X = x)
since the cross term vanishes: θ_MMSE(x) = E(Θ | X = x)
Posterior mean
By iterated expectation,
E((θ_other(X) − Θ)²) = E(E((θ_other(X) − Θ)² | X))
= E((θ_other(X) − θ_MMSE(X))²) + E(E((θ_MMSE(X) − Θ)² | X))
= E((θ_other(X) − θ_MMSE(X))²) + E((θ_MMSE(X) − Θ)²)
≥ E((θ_MMSE(X) − Θ)²)
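The MMSE optimality can be illustrated with a small Monte Carlo experiment for the Bernoulli setting: under a uniform prior, the posterior mean (n_1 + 1)/(n + 2) attains a lower average squared error than the ML estimate n_1/n (a sketch; the simulation setup is mine, not from the slides):

```python
import random

random.seed(0)
n, trials = 5, 20000
se_mmse = se_ml = 0.0
for _ in range(trials):
    theta = random.random()                               # Θ ~ uniform prior
    n1 = sum(random.random() < theta for _ in range(n))   # iid Bernoulli(Θ) data
    se_mmse += ((n1 + 1) / (n + 2) - theta) ** 2          # posterior mean
    se_ml += (n1 / n - theta) ** 2                        # ML estimate
print(se_mmse / trials, se_ml / trials)  # MMSE risk is smaller
```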
Bernoulli distribution: n_0 = 1, n_1 = 3 [Figure: posterior pdfs under the two priors, revisited]
Bernoulli distribution: n_0 = 3, n_1 = 1 [Figure: posterior pdfs under the two priors, revisited]
Bernoulli distribution: n_0 = 91, n_1 = 9 [Figure: posterior pdfs, marking the posterior mean (uniform prior), posterior mean (skewed prior), and ML estimator]
Posterior mode
The maximum-a-posteriori (MAP) estimator is the mode of the posterior distribution:
θ_MAP(x) := arg max_θ p_{Θ|X}(θ | x) if Θ is discrete
θ_MAP(x) := arg max_θ f_{Θ|X}(θ | x) if Θ is continuous
Maximum-likelihood estimator
If the prior is uniform, the ML estimator coincides with the MAP estimator:
arg max_θ f_{Θ|X}(θ | x) = arg max_θ f_Θ(θ) f_{X|Θ}(x | θ) / ∫ f_Θ(u) f_{X|Θ}(x | u) du
= arg max_θ f_{X|Θ}(x | θ)
= arg max_θ L_x(θ)
Uniform priors are only well defined over bounded domains
Probability of error
If Θ is discrete, the MAP estimator minimizes the probability of error: for any other estimator θ_other(x),
P(θ_other(X) ≠ Θ) ≥ P(θ_MAP(X) ≠ Θ)
Probability of error
P(Θ = θ_other(X)) = ∫ f_X(x) P(Θ = θ_other(x) | X = x) dx
= ∫ f_X(x) p_{Θ|X}(θ_other(x) | x) dx
≤ ∫ f_X(x) p_{Θ|X}(θ_MAP(x) | x) dx
= P(Θ = θ_MAP(X))
Sending bits
Model for a communication channel: the signal Θ encodes a single bit
Prior knowledge indicates that a 0 is 3 times more likely than a 1:
p_Θ(1) = 1/4, p_Θ(0) = 3/4
The channel is noisy, so we send the signal n times. At the receiver we observe
X_i = Θ + Z_i, 1 ≤ i ≤ n,
where the Z_i are iid standard Gaussian
Sending bits: ML estimator
The likelihood is equal to
L_x(θ) = ∏_{i=1}^n f_{X_i|Θ}(x_i | θ) = ∏_{i=1}^n (1/√(2π)) e^{−(x_i − θ)²/2}
The log-likelihood is equal to
log L_x(θ) = −∑_{i=1}^n (x_i − θ)²/2 − (n/2) log 2π
Sending bits: ML estimator
θ_ML(x) = 1 if
log L_x(1) = −∑_{i=1}^n (x_i² − 2 x_i + 1)/2 − (n/2) log 2π
≥ −∑_{i=1}^n x_i²/2 − (n/2) log 2π = log L_x(0)
Equivalently,
θ_ML(x) = 1 if (1/n) ∑_{i=1}^n x_i > 1/2, and 0 otherwise
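The resulting decision rule is just a threshold test on the sample mean (a minimal sketch; the function name is mine):

```python
def ml_decode(x):
    """ML estimate of the transmitted bit: 1 iff the sample mean exceeds 1/2."""
    return 1 if sum(x) / len(x) > 0.5 else 0

print(ml_decode([0.9, 1.3, -0.2]))  # mean ≈ 0.67 > 0.5, so decode 1
print(ml_decode([0.1, -0.4, 0.6]))  # mean = 0.1 ≤ 0.5, so decode 0
```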
Sending bits: ML estimator
The probability of error is
P(Θ ≠ θ_ML(X))
= P(Θ ≠ θ_ML(X) | Θ = 0) P(Θ = 0) + P(Θ ≠ θ_ML(X) | Θ = 1) P(Θ = 1)
= P((1/n) ∑ X_i > 1/2 | Θ = 0) P(Θ = 0) + P((1/n) ∑ X_i < 1/2 | Θ = 1) P(Θ = 1)
= Q(√n / 2)
Sending bits: MAP estimator
The logarithm of the posterior is equal to
log p_{Θ|X}(θ | x) = log( ∏_{i=1}^n f_{X_i|Θ}(x_i | θ) p_Θ(θ) / f_X(x) )
= ∑_{i=1}^n log f_{X_i|Θ}(x_i | θ) + log p_Θ(θ) − log f_X(x)
= −∑_{i=1}^n (x_i² − 2 x_i θ + θ²)/2 − (n/2) log 2π + log p_Θ(θ) − log f_X(x)
Sending bits: MAP estimator
θ_MAP(x) = 1 if
log p_{Θ|X}(1 | x) + log f_X(x) = −∑_{i=1}^n (x_i² − 2 x_i + 1)/2 − (n/2) log 2π − log 4
≥ −∑_{i=1}^n x_i²/2 − (n/2) log 2π − log 4 + log 3 = log p_{Θ|X}(0 | x) + log f_X(x)
Equivalently,
θ_MAP(x) = 1 if (1/n) ∑_{i=1}^n x_i > 1/2 + (log 3)/n, and 0 otherwise
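The MAP rule is the same threshold test shifted by (log 3)/n, reflecting the prior's bias toward 0 (a sketch; the names are mine):

```python
from math import log

def map_decode(x, prior_ratio=3.0):
    """MAP estimate: 1 iff the sample mean exceeds 1/2 + log(prior_ratio)/n."""
    n = len(x)
    return 1 if sum(x) / n > 0.5 + log(prior_ratio) / n else 0

# A borderline observation that ML would decode as 1 but MAP decodes as 0:
x = [0.6, 0.6, 0.6]  # mean = 0.6 > 0.5, but 0.6 < 0.5 + log(3)/3 ≈ 0.866
print(map_decode(x))  # 0
```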
Sending bits: MAP estimator
The probability of error is
P(Θ ≠ θ_MAP(X))
= P(Θ ≠ θ_MAP(X) | Θ = 0) P(Θ = 0) + P(Θ ≠ θ_MAP(X) | Θ = 1) P(Θ = 1)
= P((1/n) ∑ X_i > 1/2 + (log 3)/n | Θ = 0) P(Θ = 0) + P((1/n) ∑ X_i < 1/2 + (log 3)/n | Θ = 1) P(Θ = 1)
= (3/4) Q(√n/2 + (log 3)/√n) + (1/4) Q(√n/2 − (log 3)/√n)
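Both error probabilities can be evaluated with the standard library, using the identity Q(z) = erfc(z/√2)/2 (a sketch; the function names are mine). The MAP error is below the ML error for every n, as expected since MAP minimizes the probability of error under this prior:

```python
from math import erfc, log, sqrt

def q(z):
    """Gaussian tail probability Q(z) = P(N(0, 1) > z)."""
    return 0.5 * erfc(z / sqrt(2))

def ml_error(n):
    return q(sqrt(n) / 2)

def map_error(n):
    shift = log(3) / sqrt(n)
    return 0.75 * q(sqrt(n) / 2 + shift) + 0.25 * q(sqrt(n) / 2 - shift)

for n in (1, 5, 10, 20):
    print(n, ml_error(n), map_error(n))  # MAP error is smaller at each n
```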
Sending bits: probability of error
[Figure: probability of error of the ML and MAP estimators as a function of n]