5.1 logistic regression

Chris Parrish
July 3, 2016

Contents

logistic regression model
1992 vote
  data
  model
  fit with stan
  fitted model
  figures
    figure 5.1a
    figure 5.1b
  fit with glm
  fit with stan_glm

5.1 logistic regression

reference:
- ARM chapter 05, github

library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
library(ggplot2)

logistic regression model

$$ y_i \sim \mathrm{Bernoulli}(p_i) $$
$$ \mathrm{logit}(p_i) = X_i \beta $$

logit <- function(x){
  log(x / (1 - x))
}
logistic <- function(x){
  1 / (1 + exp(-x))    # logistic = invlogit
}

1992 vote

data

# Data
source("nes1992_vote.data.r", echo = TRUE)
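As a quick sanity check (a minimal sketch, not part of the original script), we can confirm that logistic() inverts logit() and glance at how the sourced data split across income categories:

# sketch: logistic() undoes logit()
p <- c(0.1, 0.5, 0.9)
all.equal(logistic(logit(p)), p)

# proportion voting Republican within each income category (1 = poor, 5 = rich)
tapply(vote, income, mean)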
> N <- 1179

> vote <- c(1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0,
+     0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0,
+     1, 1, 0, 1, 0, 1, 0, 0, 1... [TRUNCATED]

> income <- c(4, 2, 1, 2, 3, 4, 2, 4, 1, 4, 4, 1, 3,
+     2, 3, 3, 2, 3, 4, 3, 3, 2, 4, 3, 4, 4, 2, 3, 2, 3, 3, 3,
+     2, 5, 3, 3, 3, 4, 1, 4, 3,... [TRUNCATED]

model

nes_logit.stan

data {
  int<lower=0> N;
  vector[N] income;
  int<lower=0,upper=1> vote[N];
}
parameters {
  vector[2] beta;
}
model {
  vote ~ bernoulli_logit(beta[1] + beta[2] * income);
}

fit with stan

# Logistic model: vote ~ income
data.list <- c("N", "vote", "income")
nes_logit.sf <- stan(file = 'nes_logit.stan', data = data.list,
                     iter = 1000, chains = 4)

plot(nes_logit.sf)

ci_level: 0.8 (80% intervals)
outer_level: 0.95 (95% intervals)

[Figure: interval plot of beta[1] and beta[2] with 80% and 95% posterior intervals]

pairs(nes_logit.sf)

[Figure: pairs plot of beta[1], beta[2], and lp__]
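One more diagnostic worth running at this point (standard rstan functionality, not shown in the original) is a traceplot of the chains:

# sketch: visual check that the four chains mix and agree after warmup
traceplot(nes_logit.sf, pars = "beta", inc_warmup = FALSE)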
print(nes_logit.sf, pars = c("beta", "lp__"))

Inference for Stan model: nes_logit.
4 chains, each with iter=1000; warmup=500; thin=1;
post-warmup draws per chain=500, total post-warmup draws=2000.

           mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff
beta[1]   -1.40    0.01 0.18   -1.75   -1.52   -1.40   -1.28   -1.05   417
beta[2]    0.33    0.00 0.05    0.22    0.29    0.32    0.36    0.43   408
lp__    -779.49    0.03 0.98 -782.07 -779.85 -779.22 -778.77 -778.48   912
        Rhat
beta[1] 1.01
beta[2] 1.00
lp__    1.00

Samples were drawn using NUTS(diag_e) at Tue Jul 5 02:10:05 2016.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

The estimated Bayesian Fraction of Missing Information is a measure of the
efficiency of the sampler, with values close to 1 being ideal. For each chain,
these estimates are 1 1 0.9 1.2.

summary(nes_logit.sf)

$summary
                mean     se_mean        sd         2.5%          25%
beta[1]   -1.4035034 0.008712917 0.1778856   -1.7506390   -1.5233285
beta[2]    0.3261005 0.002664932 0.0538314    0.2224667    0.2897287
lp__    -779.4941677 0.032600951 0.9844308 -782.0674429 -779.8542303
                 50%          75%        97.5%    n_eff     Rhat
beta[1]   -1.4035687   -1.2798508   -1.0459163 416.8257 1.005759
beta[2]    0.3245881    0.3621653    0.4323567 408.0366 1.004427
lp__    -779.2227841 -778.7732889 -778.4833578 911.8215 0.999417

$c_summary
, , chains = chain:1

         stats
parameter         mean        sd        2.5%          25%          50%
  beta[1]   -1.3912375 0.1784808   -1.740405   -1.5082427   -1.3909907
  beta[2]    0.3232579 0.0544724    0.209821    0.2899055    0.3217933
  lp__    -779.4943383 0.9838639 -781.840308 -779.8810535 -779.1938310
         stats
parameter          75%        97.5%
  beta[1]   -1.2779641   -1.0209224
  beta[2]    0.3600318    0.4246833
  lp__    -778.7633786 -778.4780030

, , chains = chain:2

         stats
parameter        mean        sd         2.5%         25%         50%
  beta[1]   -1.402426 0.1660308   -1.7337512   -1.511317   -1.407834
  beta[2]    0.324499 0.0498667    0.2313725    0.285950    0.324505
  lp__    -779.491774 0.9277705 -781.8355663 -779.854504 -779.279869
         stats
parameter          75%        97.5%
  beta[1]   -1.2754141   -1.0879357
  beta[2]    0.3585538    0.4191033
  lp__    -778.8159181 -778.4775169

, , chains = chain:3

         stats
parameter         mean         sd         2.5%          25%          50%
  beta[1]   -1.3941353 0.18587868   -1.7912372   -1.5008939   -1.3803333
  beta[2]    0.3237461 0.05639666    0.2178067    0.2864599    0.3212421
  lp__    -779.4638482 1.02833725 -782.1013213 -779.7760841 -779.1316373
         stats
parameter          75%        97.5%
  beta[1]   -1.2728737   -1.0584102
  beta[2]    0.3544079    0.4416326
  lp__    -778.7501222 -778.4983735

, , chains = chain:4

         stats
parameter         mean         sd         2.5%          25%          50%
  beta[1]   -1.4262147 0.17898108   -1.7493863   -1.5567923   -1.4158572
  beta[2]    0.3328991 0.05395974    0.2240717    0.2947117    0.3317793
  lp__    -779.5267100 0.99701066 -782.2523550 -779.8778098 -779.3197711
         stats
parameter          75%        97.5%
  beta[1]   -1.2964137   -1.0472599
  beta[2]    0.3724878    0.4311487
  lp__    -778.8106075 -778.4843026
fitted model

$$ P(y_i = 1) = \mathrm{logit}^{-1}(-1.40 + 0.33 \cdot \mathrm{income}) $$

figures

# Figures
beta.post <- extract(nes_logit.sf, "beta")$beta
beta.mean <- colMeans(beta.post)

figure 5.1a

# Figure 5.1 (a)
len <- 20
x <- seq(1, 5, length.out = len)
y <- 1 / (1 + exp(- beta.mean[1] - beta.mean[2] * x))
nes_vote.ggdf.1 <- data.frame(x, y)

p1 <- ggplot(data.frame(income, vote), aes(x = income, y = vote)) +
  geom_jitter(position = position_jitter(height = 0.04, width = 0.4),
              shape = 20, color = "darkred") +
  geom_line(aes(x, y), data = nes_vote.ggdf.1, size = 2) +
  stat_function(fun = function(x)
    1 / (1 + exp(- beta.mean[1] - beta.mean[2] * x))) +
  scale_x_continuous("Income", limits = c(-2, 8), breaks = seq(1, 5),
                     labels = c("1\n(poor)", "2", "3", "4", "5\n(rich)")) +
  scale_y_continuous("Pr(Republican Vote)", limits = c(-0.05, 1.05),
                     breaks = seq(0, 1, 0.2))
print(p1)

[Figure 5.1a: jittered vote vs. income with the fitted logistic curve; x axis Income (1 = poor to 5 = rich), y axis Pr(Republican Vote)]
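Reading the fitted curve of figure 5.1a at the endpoints of the income scale makes the effect concrete (a quick check, not part of the original script):

# sketch: predicted Pr(Republican vote) at the extremes of the income scale
logistic(beta.mean[1] + beta.mean[2] * 1)   # income = 1 (poor),  roughly 0.25
logistic(beta.mean[1] + beta.mean[2] * 5)   # income = 5 (rich),  roughly 0.56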
figure 5.1b

# Figure 5.1 (b)
# dev.new()
n <- 20
ndx <- sample(nrow(beta.post), n)
min.x <- 0.5
max.x <- 5.5
x <- seq(min.x, max.x, length.out = len)
nes_vote.ggdf.2 <- data.frame(c(), c(), c())    # empty data frame
for (i in ndx) {
  y <- 1 / (1 + exp(- beta.post[i, 1] - beta.post[i, 2] * x))
  nes_vote.ggdf.2 <- rbind(nes_vote.ggdf.2,
                           data.frame(id = rep(i, len), x, y))
}

p2 <- ggplot(data.frame(income, vote), aes(x = income, y = vote)) +
  geom_jitter(position = position_jitter(height = .04, width = .4),
              shape = 20, color = "darkred") +
  geom_line(aes(x, y, group = id), data = nes_vote.ggdf.2, alpha = 0.1) +
  geom_line(aes(x, y = 1 / (1 + exp(- beta.mean[1] - beta.mean[2] * x))),
            data = nes_vote.ggdf.2) +
  scale_x_continuous("Income", limits = c(min.x, max.x), breaks = seq(1, 5),
                     labels = c("1\n(poor)", "2", "3", "4", "5\n(rich)")) +
  scale_y_continuous("Pr(Republican Vote)", limits = c(-0.05, 1.05),
                     breaks = seq(0, 1, 0.2))
print(p2)

[Figure 5.1b: jittered vote vs. income with the posterior-mean curve and 20 curves drawn from the posterior; x axis Income (1 = poor to 5 = rich), y axis Pr(Republican Vote)]
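An alternative to the spaghetti plot (a sketch, not in the original) is a pointwise posterior band: evaluate the curve at each grid point for every posterior draw and take quantiles.

# sketch: 95% pointwise posterior band for Pr(vote = 1 | income = x)
prob.draws <- sapply(x, function(xi)
  1 / (1 + exp(- beta.post[, 1] - beta.post[, 2] * xi)))   # draws x grid points
band <- apply(prob.draws, 2, quantile, probs = c(0.025, 0.5, 0.975))
round(band[, 1:5], 2)   # first few grid points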
fit with glm

Gelman and Hill, p. 79

library(arm)
glm.fit1 <- glm(vote ~ income, family = binomial(link = "logit"))
display(glm.fit1)

glm(formula = vote ~ income, family = binomial(link = "logit"))
            coef.est coef.se
(Intercept) -1.40     0.19
income       0.33     0.06
---
  n = 1179, k = 2
  residual deviance = 1556.9, null deviance = 1591.2 (difference = 34.3)

options(show.signif.stars = FALSE)
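Since the Stan program puts flat priors on the coefficients, the maximum-likelihood fit from glm should essentially reproduce the Stan posterior means; a quick side-by-side check (not in the original):

# sketch: compare ML estimates with the Stan posterior means
round(cbind(glm = coef(glm.fit1), stan = beta.mean), 2)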
summary(glm.fit1)

Call:
glm(formula = vote ~ income, family = binomial(link = "logit"))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.2756  -1.0034  -0.8796   1.2194   1.6550

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.40213    0.18946  -7.401 1.35e-13
income       0.32599    0.05688   5.731 9.97e-09

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1591.2  on 1178  degrees of freedom
Residual deviance: 1556.9  on 1177  degrees of freedom
AIC: 1560.9

Number of Fisher Scoring iterations: 4

fit with stan_glm

library(rstanarm)
data <- data.frame(income, vote)
stan_glm.fit1 <- stan_glm(vote ~ income, data = data,
                          family = binomial(link = "logit"), chains = 4)

plot(stan_glm.fit1)

ci_level: 0.8 (80% intervals)
outer_level: 0.95 (95% intervals)

[Figure: interval plot of (Intercept) and income with 80% and 95% posterior intervals]

pairs(stan_glm.fit1)

[Figure: pairs plot of (Intercept), income, mean_ppd, and the log-posterior]
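Unlike the hand-written Stan program, stan_glm places weakly informative default priors on the intercept and coefficient. In more recent versions of rstanarm than the one used here, prior_summary() reports the priors that were actually used (shown as a suggestion; output omitted):

# sketch: inspect the default priors applied by stan_glm (recent rstanarm)
prior_summary(stan_glm.fit1)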
summary(stan_glm.fit1)

stan_glm(formula = vote ~ income, family = binomial(link = "logit"),
         data = data, chains = 4)

Family: binomial (logit)
Algorithm: sampling
Posterior sample size: 4000
Observations: 1179

Estimates:
                 mean   sd    2.5%   25%    50%    75%    97.5%
(Intercept)     -1.4   0.2   -1.8   -1.5   -1.4   -1.3   -1.0
income           0.3   0.1    0.2    0.3    0.3    0.4    0.4
mean_ppd         0.4   0.0    0.4    0.4    0.4    0.4    0.4
log-posterior -783.7   1.0 -786.5 -784.0 -783.3 -782.9 -782.7

Diagnostics:
              mcse Rhat n_eff
(Intercept)   0.0  1.0  3282
income        0.0  1.0  3343
mean_ppd      0.0  1.0  3449
log-posterior 0.0  1.0  2158

For each parameter, mcse is Monte Carlo standard error, n_eff is a crude
measure of effective sample size, and Rhat is the potential scale reduction
factor on split chains (at convergence Rhat = 1).
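As a possible follow-up (not part of the original document), rstanarm also provides posterior_interval() for credible intervals on the coefficient scale, which can be compared directly with the 2.5% and 97.5% columns above:

# sketch: 95% posterior intervals for the stan_glm fit
round(posterior_interval(stan_glm.fit1, prob = 0.95), 2)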