5.1 logistic regression, Chris Parrish, July 3, 2016


Contents

logistic regression model
1992 vote
    data
    model
    fit with stan
    fitted model
    figures
        figure 5.1a
        figure 5.1b
    fit with glm
    fit with stan_glm

reference:
- ARM chapter 05, github

    library(rstan)
    rstan_options(auto_write = TRUE)
    options(mc.cores = parallel::detectCores())
    library(ggplot2)

logistic regression model

$y_i \sim \mathrm{Bernoulli}(p_i)$
$\mathrm{logit}(p_i) = X_i \beta$

    logit <- function(x) {
      log(x / (1 - x))
    }
    logistic <- function(x) {
      1 / (1 + exp(-x))
    }
    # logistic = invlogit

1992 vote

data

    # Data
    source("nes1992_vote.data.r", echo = TRUE)

    > N <- 1179
    > vote <- c(1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0,
    +     0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0,
    +     1, 1, 0, 1, 0, 1, 0, 0, 1... [TRUNCATED]
    > income <- c(4, 2, 1, 2, 3, 4, 2, 4, 1, 4, 4, 1, 3,
    +     2, 3, 3, 2, 3, 4, 3, 3, 2, 4, 3, 4, 4, 2, 3, 2, 3, 3, 3,
    +     2, 5, 3, 3, 3, 4, 1, 4, 3,... [TRUNCATED]

model

nes_logit.stan

    data {
      int<lower=0> N;
      vector[N] income;
      int<lower=0,upper=1> vote[N];
    }
    parameters {
      vector[2] beta;
    }
    model {
      vote ~ bernoulli_logit(beta[1] + beta[2] * income);
    }

fit with stan

    # Logistic model: vote ~ income
    data.list <- c("N", "vote", "income")
    nes_logit.sf <- stan(file = 'nes_logit.stan', data = data.list,
                         iter = 1000, chains = 4)
    plot(nes_logit.sf)

    ci_level: 0.8 (80% intervals)
    outer_level: 0.95 (95% intervals)

[Figure: posterior interval plot for beta[1] and beta[2]]

    pairs(nes_logit.sf)

[Figure: pairs plot of beta[1], beta[2], and lp__]
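As a rough illustration of what the `bernoulli_logit` sampling statement in the Stan model adds to the log posterior, the sketch below writes the Bernoulli-logit log-likelihood out by hand in base R and checks it against `dbinom`. The data (`income_sim`, `vote_sim`), the coefficient values (`beta_sim`), and the function names are hypothetical placeholders invented for this example; they are not the NES 1992 data or part of the original analysis.

```r
# Hypothetical simulated data; NOT the NES 1992 data used in this section.
set.seed(1)
income_sim <- sample(1:5, 200, replace = TRUE)
beta_sim   <- c(-1.4, 0.33)   # illustrative values only
vote_sim   <- rbinom(200, size = 1,
                     prob = plogis(beta_sim[1] + beta_sim[2] * income_sim))

# Bernoulli-logit log-likelihood written out by hand:
#   sum_i [ y_i * eta_i - log(1 + exp(eta_i)) ],  eta_i = beta1 + beta2 * x_i
loglik_manual <- function(beta, x, y) {
  eta <- beta[1] + beta[2] * x
  sum(y * eta - log1p(exp(eta)))
}

# The same quantity through the built-in Bernoulli (binomial, size 1) density
loglik_dbinom <- function(beta, x, y) {
  p <- plogis(beta[1] + beta[2] * x)
  sum(dbinom(y, size = 1, prob = p, log = TRUE))
}

ll_manual <- loglik_manual(beta_sim, income_sim, vote_sim)
ll_dbinom <- loglik_dbinom(beta_sim, income_sim, vote_sim)
all.equal(ll_manual, ll_dbinom)
```

Working on the linear-predictor scale, as the hand-written version does, avoids forming probabilities that round to 0 or 1 when the linear predictor is large in magnitude.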

    print(nes_logit.sf, pars = c("beta", "lp__"))

    Inference for Stan model: nes_logit.
    4 chains, each with iter=1000; warmup=500; thin=1;
    post-warmup draws per chain=500, total post-warmup draws=2000.

               mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
    beta[1]   -1.40    0.01 0.18   -1.75   -1.52   -1.40   -1.28   -1.05   417 1.01
    beta[2]    0.33    0.00 0.05    0.22    0.29    0.32    0.36    0.43   408 1.00
    lp__    -779.49    0.03 0.98 -782.07 -779.85 -779.22 -778.77 -778.48   912 1.00

    Samples were drawn using NUTS(diag_e) at Tue Jul 5 02:10:05 2016.
    For each parameter, n_eff is a crude measure of effective sample size,
    and Rhat is the potential scale reduction factor on split chains (at
    convergence, Rhat=1).
    The estimated Bayesian Fraction of Missing Information is a measure of
    the efficiency of the sampler with values close to 1 being ideal.
    For each chain, these estimates are
    1 1 0.9 1.2

    summary(nes_logit.sf)

    $summary
                    mean     se_mean        sd         2.5%          25%
    beta[1]   -1.4035034 0.008712917 0.1778856   -1.7506390   -1.5233285
    beta[2]    0.3261005 0.002664932 0.0538314    0.2224667    0.2897287
    lp__    -779.4941677 0.032600951 0.9844308 -782.0674429 -779.8542303
                     50%          75%        97.5%    n_eff     Rhat
    beta[1]   -1.4035687   -1.2798508   -1.0459163 416.8257 1.005759
    beta[2]    0.3245881    0.3621653    0.4323567 408.0366 1.004427
    lp__    -779.2227841 -778.7732889 -778.4833578 911.8215 0.999417

    $c_summary
    , , chains = chain:1
    parameter         mean        sd         2.5%          25%          50%          75%        97.5%
      beta[1]   -1.3912375 0.1784808   -1.740405    -1.5082427   -1.3909907   -1.2779641   -1.0209224
      beta[2]    0.3232579 0.0544724    0.209821     0.2899055    0.3217933    0.3600318    0.4246833
      lp__    -779.4943383 0.9838639 -781.840308  -779.8810535 -779.1938310 -778.7633786 -778.4780030

    , , chains = chain:2
    parameter         mean        sd         2.5%          25%          50%          75%        97.5%
      beta[1]   -1.402426  0.1660308   -1.7337512   -1.511317    -1.407834    -1.2754141   -1.0879357
      beta[2]    0.324499  0.0498667    0.2313725    0.285950     0.324505     0.3585538    0.4191033
      lp__    -779.491774  0.9277705 -781.8355663 -779.854504  -779.279869  -778.8159181 -778.4775169

    , , chains = chain:3
    parameter         mean         sd         2.5%          25%          50%          75%        97.5%
      beta[1]   -1.3941353 0.18587868   -1.7912372   -1.5008939   -1.3803333   -1.2728737   -1.0584102
      beta[2]    0.3237461 0.05639666    0.2178067    0.2864599    0.3212421    0.3544079    0.4416326
      lp__    -779.4638482 1.02833725 -782.1013213 -779.7760841 -779.1316373 -778.7501222 -778.4983735

    , , chains = chain:4
    parameter         mean         sd         2.5%          25%          50%          75%        97.5%
      beta[1]   -1.4262147 0.17898108   -1.7493863   -1.5567923   -1.4158572   -1.2964137   -1.0472599
      beta[2]    0.3328991 0.05395974    0.2240717    0.2947117    0.3317793    0.3724878    0.4311487
      lp__    -779.5267100 0.99701066 -782.2523550 -779.8778098 -779.3197711 -778.8106075 -778.4843026

fitted model

$\Pr(y_i = 1) = \mathrm{logit}^{-1}(-1.40 + 0.33 \cdot \mathrm{income})$

figures

    # Figures
    beta.post <- extract(nes_logit.sf, "beta")$beta
    beta.mean <- colMeans(beta.post)

figure 5.1a

    # Figure 5.1 (a)
    len <- 20
    x <- seq(1, 5, length.out = len)
    y <- 1 / (1 + exp(- beta.mean[1] - beta.mean[2] * x))
    nes_vote.ggdf.1 <- data.frame(x, y)
    p1 <- ggplot(data.frame(income, vote), aes(x = income, y = vote)) +
      geom_jitter(position = position_jitter(height = 0.04, width = 0.4),
                  shape = 20, color = "darkred") +
      geom_line(aes(x, y), data = nes_vote.ggdf.1, size = 2) +
      stat_function(fun = function(x)
        1 / (1 + exp(- beta.mean[1] - beta.mean[2] * x))) +
      scale_x_continuous("Income", limits = c(-2, 8), breaks = seq(1, 5),
                         labels = c("1\n(poor)", "2", "3", "4", "5\n(rich)")) +
      scale_y_continuous("Pr(Republican Vote)", limits = c(-0.05, 1.05),
                         breaks = seq(0, 1, 0.2))
    print(p1)

[Figure 5.1a: Pr(Republican Vote) against Income, jittered votes with the fitted logistic curve]
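The fitted model reported above can be evaluated directly to see what the coefficients imply on the probability scale. The sketch below plugs the rounded posterior means (-1.40 and 0.33, taken from the output in this section) into the inverse logit at the five income categories, and also computes the steepest possible slope of the curve, which for a logistic curve is the coefficient divided by 4 and occurs where the probability is 0.5. The variable names (`b0`, `b1`, `income_levels`, `p_hat`, `max_slope`) are introduced here for illustration only.

```r
# Rounded posterior means from the Stan fit reported in this section
b0 <- -1.40
b1 <- 0.33

# Predicted Pr(Republican vote) at each income category, 1 (poor) to 5 (rich);
# plogis() is base R's inverse logit, the same function as logistic() above
income_levels <- 1:5
p_hat <- plogis(b0 + b1 * income_levels)
round(p_hat, 2)   # roughly 0.26, 0.32, 0.40, 0.48, 0.56

# The derivative of logit^-1(b0 + b1 * x) is b1 * p * (1 - p), which is
# largest at p = 0.5, giving an upper bound of b1 / 4 per income category
max_slope <- b1 / 4
max_slope
```

Under these rounded values, moving up one income category changes the predicted probability by at most about 8 percentage points, which matches the gaps between adjacent entries of `p_hat`.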

figure 5.1b

    # Figure 5.1 (b)
    # dev.new()
    n <- 20
    ndx <- sample(nrow(beta.post), n)
    min.x <- 0.5
    max.x <- 5.5
    x <- seq(min.x, max.x, length.out = len)
    nes_vote.ggdf.2 <- data.frame(c(), c(), c()) # empty data frame
    for (i in ndx) {
      y <- 1 / (1 + exp(- beta.post[i, 1] - beta.post[i, 2] * x))
      nes_vote.ggdf.2 <- rbind(nes_vote.ggdf.2,
                               data.frame(id = rep(i, len), x, y))
    }
    p2 <- ggplot(data.frame(income, vote), aes(x = income, y = vote)) +
      geom_jitter(position = position_jitter(height = .04, width = .4),
                  shape = 20, color = "darkred") +
      geom_line(aes(x, y, group = id), data = nes_vote.ggdf.2, alpha = 0.1) +
      geom_line(aes(x, y = 1 / (1 + exp(- beta.mean[1] - beta.mean[2] * x))),
                data = nes_vote.ggdf.2) +
      scale_x_continuous("Income", limits = c(min.x, max.x), breaks = seq(1, 5),
                         labels = c("1\n(poor)", "2", "3", "4", "5\n(rich)")) +
      scale_y_continuous("Pr(Republican Vote)", limits = c(-0.05, 1.05),
                         breaks = seq(0, 1, 0.2))
    print(p2)

[Figure 5.1b: Pr(Republican Vote) against Income, jittered votes with 20 posterior-draw curves and the posterior-mean curve]

fit with glm

Gelman and Hill, p.79

    library(arm)
    glm.fit1 <- glm(vote ~ income, family = binomial(link = "logit"))
    display(glm.fit1)

    glm(formula = vote ~ income, family = binomial(link = "logit"))
                coef.est coef.se
    (Intercept) -1.40     0.19
    income       0.33     0.06
    ---
      n = 1179, k = 2
      residual deviance = 1556.9, null deviance = 1591.2 (difference = 34.3)

    options(show.signif.stars = FALSE)
    summary(glm.fit1)

    Call:
    glm(formula = vote ~ income, family = binomial(link = "logit"))

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -1.2756  -1.0034  -0.8796   1.2194   1.6550

    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept) -1.40213    0.18946  -7.401 1.35e-13
    income       0.32599    0.05688   5.731 9.97e-09

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 1591.2  on 1178  degrees of freedom
    Residual deviance: 1556.9  on 1177  degrees of freedom
    AIC: 1560.9

    Number of Fisher Scoring iterations: 4

fit with stan_glm

    library(rstanarm)
    data <- data.frame(income, vote)
    stan_glm.fit1 <- stan_glm(vote ~ income, data = data,
                              family = binomial(link = "logit"), chains = 4)
    plot(stan_glm.fit1)

    ci_level: 0.8 (80% intervals)
    outer_level: 0.95 (95% intervals)

[Figure: posterior interval plot for (Intercept) and income]

    pairs(stan_glm.fit1)

[Figure: pairs plot of the stan_glm.fit1 posterior draws]
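The glm estimates and standard errors above are very close to the posterior means and standard deviations from the Stan fit. One rough way to see why, sketched below, is to approximate the posterior by a normal distribution centered at the maximum likelihood estimate with the covariance matrix from `vcov()`, and to summarize draws from it the way the Stan tables do. This is a normal approximation, not what `stan` or `stan_glm` actually compute, and the data (`x_sim`, `y_sim`) are simulated placeholders rather than the NES 1992 data; all names are hypothetical.

```r
# Hypothetical simulated data; NOT the NES 1992 data used in this section.
set.seed(42)
n_obs <- 500
x_sim <- sample(1:5, n_obs, replace = TRUE)
y_sim <- rbinom(n_obs, size = 1, prob = plogis(-1.4 + 0.33 * x_sim))

# Maximum likelihood fit, as in the glm() call above
fit_sim  <- glm(y_sim ~ x_sim, family = binomial(link = "logit"))
beta_hat <- coef(fit_sim)
V        <- vcov(fit_sim)

# Draw from N(beta_hat, V): chol(V) is upper triangular with t(R) %*% R = V,
# so rows of z %*% chol(V) have covariance V
n_draws <- 4000
z       <- matrix(rnorm(n_draws * length(beta_hat)), ncol = length(beta_hat))
draws   <- sweep(z %*% chol(V), 2, beta_hat, "+")
colnames(draws) <- names(beta_hat)

# Summarize the approximate draws: one row per coefficient
approx_summary <- t(apply(draws, 2, function(d) {
  c(mean = mean(d), sd = sd(d), quantile(d, c(0.025, 0.5, 0.975)))
}))
round(approx_summary, 2)
```

With a sample this large and no strongly informative prior, the likelihood dominates, so an approximation of this kind and a full MCMC fit would be expected to give similar summaries.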

    summary(stan_glm.fit1)

    stan_glm(formula = vote ~ income, family = binomial(link = "logit"),
        data = data, chains = 4)

    Family: binomial (logit)
    Algorithm: sampling
    Posterior sample size: 4000
    Observations: 1179

    Estimates:
                    mean    sd   2.5%    25%    50%    75%  97.5%
    (Intercept)     -1.4   0.2   -1.8   -1.5   -1.4   -1.3   -1.0
    income           0.3   0.1    0.2    0.3    0.3    0.4    0.4
    mean_ppd         0.4   0.0    0.4    0.4    0.4    0.4    0.4
    log-posterior -783.7   1.0 -786.5 -784.0 -783.3 -782.9 -782.7

    Diagnostics:
                  mcse Rhat n_eff
    (Intercept)   0.0  1.0  3282
    income        0.0  1.0  3343
    mean_ppd      0.0  1.0  3449
    log-posterior 0.0  1.0  2158

    For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample