Generalized additive models in R


Magne Aldrin, Norwegian Computing Center and the University of Oslo
Sharp workshop, Copenhagen, October 2012
www.nr.no

Generalized Linear Models - GLM

y is distributed with mean $\mu$ and perhaps an additional parameter.

$h(\mu) = \eta$
$\eta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$

$\eta$: linear predictor
$h$: link function
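As a minimal sketch of the GLM notation above, the following fits a Poisson GLM with a log link to simulated data and rebuilds the linear predictor by hand. The sample size, seed and coefficient values are assumptions chosen only for illustration; they do not come from the slides.

```r
# Simulated example (all numbers are illustrative assumptions)
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- runif(n)

eta <- 0.5 + 0.8 * x1 - 1.0 * x2   # linear predictor eta = b0 + b1*x1 + b2*x2
mu  <- exp(eta)                    # log link: h(mu) = log(mu) = eta
y   <- rpois(n, lambda = mu)

glmobj   <- glm(y ~ x1 + x2, family = poisson(link = log))
beta.hat <- coef(glmobj)

# Linear predictor rebuilt from the estimated coefficients
eta.hat <- drop(cbind(1, x1, x2) %*% beta.hat)
```

Here `eta.hat` should agree with `predict(glmobj, type = "link")`, and `exp(eta.hat)` with the fitted means.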

Generalized Additive Models - GAM

$\eta$ is additive, but each term can be non-linear:

$\eta = \beta_0 + s_1(x_1) + s_2(x_2) + \dots + s_p(x_p)$

$\eta$: additive predictor
$s_i(\cdot)$ is a smooth function, estimated from the data

Estimation of s

Find s such that the fit to the data is as good as possible and s is as smooth as possible.

This can, for instance, be formulated as:

Minimize $-\log \text{likelihood} + \sum_i \lambda_i \int s_i''(x)^2 \, dx$

where $s''(x) = d^2 s(x)/dx^2$ is the second derivative of s and the $\lambda_i$ are constants that control the degree of smoothing.

Effective number of parameters

Very high $\lambda_i$ gives maximal smoothness: a straight line, described by 1 parameter (per explanatory variable).

Smaller values of $\lambda_i$ give more flexible functions: effective number of parameters > 1.
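The link between the penalty weight and the effective number of parameters can be sketched with `smooth.spline` from base R, which also penalizes the integrated squared second derivative. This is a stand-in for illustration, not the mgcv fit used in the slides; the data, seed and the two `lambda` values are assumptions. Note that `smooth.spline` counts the intercept as well, so its limiting value for a straight line is 2 rather than 1.

```r
# Simulated data (illustrative assumption)
set.seed(1)
n <- 200
x <- sort(runif(n, -3, 3))
y <- sin(x) + rnorm(n, sd = 0.3)

# Heavy penalty: the fit approaches a straight line
fit.smooth <- smooth.spline(x, y, lambda = 1e6)
# Light penalty: the fit is much more flexible
fit.flex   <- smooth.spline(x, y, lambda = 1e-6)

# Effective degrees of freedom of the two fits
df.both <- c(smooth = fit.smooth$df, flexible = fit.flex$df)
print(df.both)
```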

Choice of smoothness in the mgcv package

The default is to estimate the smoothness of each s-function, controlled by $\lambda_i$ or a corresponding sp[i], by generalized cross-validation, minimizing

$-2 \log \text{likelihood} \cdot n/(n - p)^2$

where
n = number of observations
p = effective number of parameters

This tends to give curves that are not smooth enough.

Ex: Ozone data

y = log(O3), Gaussian

> require(faraway)
> data(ozone)
>
> require(mgcv)
> gamobj<-gam(log(O3)~s(vh)+s(wind)+s(humidity)+s(temp)+s(ibh)+
              s(dpg)+s(ibt)+s(vis)+s(doy),
              family=gaussian(link=identity),data=ozone)
> summary(gamobj)

Family: gaussian
Link function: identity

Formula:
log(O3) ~ s(vh) + s(wind) + s(humidity) + s(temp) + s(ibh) +
    s(dpg) + s(ibt) + s(vis) + s(doy)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.21297    0.01717   128.9   <2e-16 ***

Approximate significance of smooth terms:
              edf Ref.df      F  p-value
s(vh)       1.000  1.000 10.780 0.001146 **
s(wind)     1.021  1.040  8.713 0.003036 **
s(humidity) 2.406  3.025  2.567 0.054130 .
s(temp)     3.801  4.740  4.161 0.001418 **
s(ibh)      2.774  3.393  5.341 0.000797 ***
s(dpg)      3.285  4.176 14.247 5.27e-11 ***
s(ibt)      1.000  1.000  0.495 0.482255
s(vis)      5.477  6.635  6.023 2.30e-06 ***
s(doy)      4.612  5.738 25.162  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.826   Deviance explained =   84%
GCV score = 0.10569  Scale est. = 0.097247  n = 330
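The GCV score in the summary above can be checked by hand from the reported numbers. For a Gaussian model the scale estimate is the residual sum of squares divided by n - p, so the GCV criterion n * RSS / (n - p)^2 reduces to n * scale / (n - p). The sketch below assumes that p is the sum of the reported edf values plus 1 for the intercept; all numbers are copied from the output, nothing is refitted.

```r
# Values copied from the summary output above
n     <- 330
scale <- 0.097247
edf   <- c(vh = 1.000, wind = 1.021, humidity = 2.406, temp = 3.801,
           ibh = 2.774, dpg = 3.285, ibt = 1.000, vis = 5.477, doy = 4.612)

p <- sum(edf) + 1                 # effective number of parameters, intercept included
gcv.by.hand <- n * scale / (n - p)
print(gcv.by.hand)                # compare with the reported GCV score 0.10569
```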

Plot results

> pdf("gamozone.pdf")
> par(mfrow=c(3,3))
> plot(gamobj)
> dev.off()
null device
          1

[Figure: 3 x 3 panels of the estimated smooth terms s(vh,1), s(wind,1.02), s(humidity,2.41), s(temp,3.8), s(ibh,2.77), s(dpg,3.29), s(ibt,1), s(vis,5.48) and s(doy,4.61), each plotted against its covariate.]

Ex: Precipitation - binary logistic regression

> Tryvann.dat<-read.table("/nr/user/aldrin/Sharp/data/Tryvann.dat")
>
> ### load the mgcv library
> require(mgcv)
>
> gamobj<-gam(P01~P01.L1+P01.L2+P01.L3+
              s(Prep.L1)+s(Prep.L2)+s(Prep.L3),
              data=Tryvann.dat,family=binomial(link=logit))
> summary(gamobj)

Family: binomial
Link function: logit

Formula:
P01 ~ P01.L1 + P01.L2 + P01.L3 + s(Prep.L1) + s(Prep.L2) + s(Prep.L3)

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.64629    0.08114  -7.965 1.65e-15 ***
P01.L1       0.95902    0.08893  10.784  < 2e-16 ***
P01.L2       0.31057    0.08590   3.615    3e-04 ***
P01.L3       0.41355    0.08487   4.873 1.10e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
              edf Ref.df Chi.sq  p-value
s(Prep.L1)  2.530  3.154 34.803 1.67e-07 ***
s(Prep.L2)  1.008  1.015  2.208    0.140
s(Prep.L3)  1.110  1.212  0.564    0.532
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.128   Deviance explained = 9.81%
UBRE score = 0.2394  Scale est. = 1  n = 3420
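The parametric coefficients reported above are on the logit scale, so exponentiating them gives odds ratios. The sketch below only transforms the printed estimates. It assumes that P01 is a 0/1 precipitation indicator and that P01.L1 to P01.L3 are its lagged values, which the slides suggest but do not state; the baseline probability also assumes all smooth terms are at zero.

```r
# Estimates copied from the summary output above (logit scale)
beta <- c("(Intercept)" = -0.64629,
          P01.L1 = 0.95902, P01.L2 = 0.31057, P01.L3 = 0.41355)

# Odds ratios for the three lagged indicators
odds.ratio <- exp(beta[-1])

# Probability implied by the intercept alone (all indicators 0, smooth terms 0)
p.base <- plogis(beta[["(Intercept)"]])

print(round(odds.ratio, 2))
print(round(p.base, 3))
```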

Plot results

> pdf("gamlogistic.pdf")
> par(mfrow=c(2,2))
> plot(gamobj)
> dev.off()
null device
          1

[Figure: estimated smooth terms s(Prep.L1,2.53), s(Prep.L2,1.01) and s(Prep.L3,1.11), plotted against Prep.L1, Prep.L2 and Prep.L3.]

Ex: Simulated data

y Poisson

> n<-200
> x1<-rnorm(n)          # N(0,1)
> x2<-runif(n,-10,10)   # Uniform(-10,10)
> x3<-rnorm(n)
>
> eta<- 1 + 2*x1 - 0.2*x2^2 + 0*x3
> mu<-exp(eta)
>
> y<-rpois(n=n,lambda=mu)
>
> data.obj<-data.frame(y=y,x1=x1,x2=x2,x3=x3)
>
> gamobj<-gam(y~s(x1)+s(x2)+s(x3),family=poisson(link=log),
              data=data.obj)

Results

> summary(gamobj)

Family: poisson
Link function: log

Formula:
y ~ s(x1) + s(x2) + s(x3)

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -3.7679     0.5131  -7.343 2.08e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
        edf Ref.df  Chi.sq p-value
s(x1) 1.508  1.849 657.653  <2e-16 ***
s(x2) 4.357  5.138 119.824  <2e-16 ***
s(x3) 1.000  1.000   0.427   0.514
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.992   Deviance explained = 97.4%
UBRE score = -0.50231  Scale est. = 1  n = 200

Plot results

> pdf("gampoisson.pdf")
> par(mfrow=c(2,2))
> plot(gamobj)
> dev.off()
null device
          1

[Figure: estimated smooth terms s(x1,1.51), s(x2,4.36) and s(x3,1), plotted against x1, x2 and x3.]

Option: Terms can be forced to be linear

> ### The terms can be forced to be linear
> gamobj<-gam(log(O3)~vh+wind+s(humidity)+s(temp)+s(ibh)+
              s(dpg)+ibt+s(vis)+s(doy),
              family=gaussian(link=identity),data=ozone)
> summary(gamobj)

Family: gaussian
Link function: identity

Formula:
log(O3) ~ vh + wind + s(humidity) + s(temp) + s(ibh) + s(dpg) +
    ibt + s(vis) + s(doy)

Parametric coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.0874893  2.4699664  -2.465  0.01427 *
vh           0.0014501  0.0004384   3.308  0.00105 **
wind        -0.0311454  0.0101294  -3.075  0.00230 **
ibt          0.0006986  0.0010388   0.673  0.50177

Approximate significance of smooth terms:
              edf Ref.df      F  p-value
s(humidity) 2.398  3.017  2.476 0.061153 .
s(temp)     3.811  4.753  4.081 0.001638 **
s(ibh)      2.761  3.378  5.431 0.000711 ***
s(dpg)      3.354  4.259 14.320 3.23e-11 ***
s(vis)      4.559  5.627  6.689 2.15e-06 ***
s(doy)      4.571  5.692 25.575  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.826   Deviance explained = 83.9%
GCV score = 0.10575  Scale est. = 0.097591  n = 330

Controlling the smoothness by the sp option

> gamobj<-gam(log(O3)~s(vh)+s(wind)+s(humidity)+s(temp)+s(ibh)+
              s(dpg)+s(ibt)+s(vis)+s(doy),sp=rep(0.5,9),
              family=gaussian(link=identity),data=ozone)
> summary(gamobj)

Family: gaussian
Link function: identity

Formula:
log(O3) ~ s(vh) + s(wind) + s(humidity) + s(temp) + s(ibh) +
    s(dpg) + s(ibt) + s(vis) + s(doy)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.21297    0.01954   113.2   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
              edf Ref.df      F  p-value
s(vh)       1.633  2.022  0.679 0.509227
s(wind)     2.562  3.241  1.189 0.314667
s(humidity) 1.559  1.928  2.478 0.087614 .
s(temp)     1.846  2.299 13.333 7.16e-07 ***
s(ibh)      1.467  1.771  4.698 0.012703 *
s(dpg)      1.957  2.506 10.365 8.06e-06 ***
s(ibt)      1.705  2.107  0.437 0.656654
s(vis)      2.044  2.547  7.610 0.000172 ***
s(doy)      1.407  1.727 14.618 3.45e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.775   Deviance explained = 78.6%
GCV score = 0.13294  Scale est. = 0.12602  n = 330

Plot results

> pdf("gamsmoothozone.pdf")
> par(mfrow=c(3,3))
> plot(gamobj)
> dev.off()
null device
          1

[Figure: 3 x 3 panels of the estimated smooth terms with sp fixed at 0.5: s(vh,1.63), s(wind,2.56), s(humidity,1.56), s(temp,1.85), s(ibh,1.47), s(dpg,1.96), s(ibt,1.71), s(vis,2.04) and s(doy,1.41), each plotted against its covariate.]
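The effect of fixing `sp` can also be seen on a single simulated covariate, where the result is easier to read than with nine terms. This is a minimal sketch assuming the mgcv package is installed; the data, seed and the value `1e6` are illustrative assumptions, not taken from the slides.

```r
library(mgcv)

# Simulated data (illustrative assumption)
set.seed(1)
n <- 300
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

# Smoothing parameter estimated from the data
fit.auto  <- gam(y ~ s(x), family = gaussian(link = identity))
# Smoothing parameter fixed at a very large value through the sp argument
fit.fixed <- gam(y ~ s(x), family = gaussian(link = identity), sp = 1e6)

# Total effective number of parameters (intercept included) for both fits
edf.both <- c(auto = sum(fit.auto$edf), fixed = sum(fit.fixed$edf))
print(edf.both)
```

With the heavy fixed penalty the smooth term should collapse to a straight line, so the total should be close to 2 (intercept plus slope), while the automatic fit keeps more flexibility.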

AIC or BIC to choose degree of smoothness

> gamobj<-gam(log(O3)~s(vh)+s(wind)+s(humidity)+s(temp)+s(ibh)+
              s(dpg)+s(ibt)+s(vis)+s(doy),
              family=gaussian(link=identity),data=ozone)
> sp<-gamobj$sp
>
> tuning.scale<-c(1e-5,1e-4,1e-3,1e-2,1e-1,1e0,1e1,1e2,1e3,1e4,1e5)
> ###tuning.scale<-c(0.00001,0.0001,0.001,0.01,0.1,1,10,100,1000,10000
> scale.exponent<-log10(tuning.scale)
> n.tuning<-length(tuning.scale)
> edf<-rep(NA,n.tuning)
> min2ll<-rep(NA,n.tuning)
> aic<-rep(NA,n.tuning)
> bic<-rep(NA,n.tuning)

> for (i in 1:n.tuning) {
+   gamobj<-gam(log(O3)~s(vh)+s(wind)+s(humidity)+s(temp)+s(ibh)+
+               s(dpg)+s(ibt)+s(vis)+s(doy),
+               family=gaussian(link=identity),data=ozone,
+               sp=tuning.scale[i]*sp)
+   min2ll[i]<- -2*logLik(gamobj)
+   edf[i]<-sum(gamobj$edf)+1
+   aic[i]<-AIC(gamobj)
+   bic[i]<-BIC(gamobj)
+ }

Plot results

> pdf("aicbicozone.pdf")
> par(mfrow=c(2,2))
> plot(scale.exponent,min2ll,type="b",main="-2 log likelihood")
> plot(scale.exponent,edf,ylim=c(0,70),type="b",main="effective number of parameters")
> plot(scale.exponent,aic,type="b",main="AIC")
> plot(scale.exponent,bic,type="b",main="BIC")
> dev.off()
null device
          1

[Figure: four panels, "-2 log likelihood" (min2ll), "effective number of parameters" (edf), "AIC" (aic) and "BIC" (bic), each plotted against scale.exponent from -4 to 4.]

Scaling corresponding to minimum BIC

> min.bic<-1e100
> opt.tuning.scale<-NULL
> for (i in 1:n.tuning) {
+   if (bic[i]<min.bic) {
+     min.bic<-bic[i]
+     opt.tuning.scale<-tuning.scale[i]
+   }
+ }
> opt.sp<-opt.tuning.scale*sp
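The search loop above can be written more compactly with `which.min`, which returns the index of the first minimum and so picks the same element as the strict `<` comparison. The sketch below uses made-up BIC values on the same grid of scale factors, purely to show the equivalence; they are not the values from the ozone fit.

```r
# Grid of scale factors as in the slides; the BIC values are hypothetical
tuning.scale <- 10^(-5:5)
bic <- c(420, 395, 360, 335, 318, 310, 305, 312, 330, 350, 352)

# Compact version: index of the smallest BIC
i.opt <- which.min(bic)
opt.tuning.scale <- tuning.scale[i.opt]

# Loop version from the slides, kept for comparison
min.bic <- 1e100
opt.loop <- NULL
for (i in seq_along(bic)) {
  if (bic[i] < min.bic) {
    min.bic  <- bic[i]
    opt.loop <- tuning.scale[i]
  }
}

print(c(which.min = opt.tuning.scale, loop = opt.loop))
```

In the slides' setting, `opt.sp <- tuning.scale[which.min(bic)] * sp` would then replace the loop.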

Estimate final model with optimal value of sp

> gamobj<-gam(log(O3)~s(vh)+s(wind)+s(humidity)+s(temp)+s(ibh)+
              s(dpg)+s(ibt)+s(vis)+s(doy),
              family=gaussian(link=identity),data=ozone,sp=opt.sp)
> summary(gamobj)

Family: gaussian
Link function: identity

Formula:
log(O3) ~ s(vh) + s(wind) + s(humidity) + s(temp) + s(ibh) +
    s(dpg) + s(ibt) + s(vis) + s(doy)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.21297    0.01838   120.4   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
              edf Ref.df      F  p-value
s(vh)       1.000  1.000  3.650  0.05699 .
s(wind)     1.002  1.004  5.834  0.01617 *
s(humidity) 1.369  1.646  2.349  0.10744
s(temp)     2.098  2.649  9.436 1.52e-05 ***
s(ibh)      1.657  2.030  6.658  0.00139 **
s(dpg)      1.727  2.192 15.471 1.39e-07 ***
s(ibt)      1.000  1.000  0.037  0.84715
s(vis)      3.171  3.965  7.033 2.10e-05 ***
s(doy)      2.462  3.196 27.265  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.801   Deviance explained =   81%
GCV score = 0.11734  Scale est. = 0.11147  n = 330

Plot results

> pdf("gamoptsmoothozone.pdf")
> par(mfrow=c(3,3))
> plot(gamobj)
> dev.off()
null device
          1

[Figure: 3 x 3 panels of the estimated smooth terms for the BIC-optimal sp: s(vh,1), s(wind,1), s(humidity,1.37), s(temp,2.1), s(ibh,1.66), s(dpg,1.73), s(ibt,1), s(vis,3.17) and s(doy,2.46), each plotted against its covariate.]