Statistical Data Analysis III. Linear Regression with S.P.S.S.


Statistical Data Analysis III: Linear Regression with S.P.S.S., Part B (multiple regression). Nikos Tsantas, Postgraduate Programme "Mathematics and Modern Applications", Department of Mathematics. Academic year 2011-12.

Simple regression is nothing more than a model for predicting the values of one variable from the values of another. Multiple regression is the logical generalization of this model: it is used to predict the values of one variable from the values of several other variables. It is a hypothetical model of the relationship among several variables.

MULTIPLE REGRESSION

Linear prediction models: in multiple regression the relationship sought is written as a variant of the equation of a straight line, $y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$. The task is to find $b_0, b_1, \dots, b_n$.

$b_0$, the intercept, is the value of the y variable when all the $x$'s equal 0. It is the point at which the regression plane cuts the y-axis (the vertical axis). Each $b_i$ is the regression coefficient of the variable $x_i$; the $b_i$ are the slopes with respect to $x_1, x_2, \dots$

Multiple regression example: (female life expectancy) = $b_0 + b_1$(infant mortality) + $b_2$(fertility) + $\varepsilon$

In SPSS: Analyze > Regression > Linear... Under Statistics select Descriptives; under Plots select "Produce all partial plots".

The correlations show that both independent variables are strongly correlated with life expectancy, and that the correlation between the two independent variables is also strong. Infant mortality appears to have the strongest linear relation with life expectancy (note the negative sign).

For multiple regression models, R is the correlation between the observed and the predicted values of the dependent variable, and $R^2$ is the square of this correlation. For this model with two variables $R^2$ is 0.929, more than 25 percentage points above the model that uses just one variable (literacy, with $R^2 = 0.67$). Knowing infant mortality and fertility explains almost 93% of the variability of life expectancy.

The sample estimate $R^2$ tends to overestimate the population parameter. Adjusted $R^2$ is designed to compensate for this optimistic bias. It is a function of $R^2$, the number of independent variables in the model $p$ and the sample size $n$:

$$R_a^2 = R^2 - \frac{p\,(1 - R^2)}{n - p - 1}$$
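The adjustment can be written as a small function. This is only a sketch: $R^2 = 0.929$ and $p = 2$ come from the text, while the sample size `n = 100` is a made-up value used purely for illustration, since the slides do not state it.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R^2 = R^2 - p(1 - R^2) / (n - p - 1).

    r2: sample R^2, n: sample size, p: number of independent variables.
    """
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1")
    return r2 - p * (1.0 - r2) / (n - p - 1)


# R^2 and p are taken from the text; n = 100 is a hypothetical sample size.
example = adjusted_r2(0.929, n=100, p=2)
print(round(example, 4))
```

The correction shrinks as n grows and grows with p, so adding predictors that barely raise $R^2$ can lower the adjusted value.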

The F test looks at whether the variance explained by the model ($SS_M$) is significantly greater than the error within the model ($SS_R$). So it tells us whether using the regression model is significantly better at predicting values of the outcome (dependent variable) than using the mean. Here the F statistic is highly significant, so the hypothesis that all coefficients are simultaneously 0 is rejected. The fact that the associated probability (Sig) is so small does not imply that each of the independent variables makes a meaningful contribution to the fit of the model.

Multiple regression: (female life expectancy) = 82.667 - 0.240(mortality) - 0.662(fertility)

For example, for baby mortality = 10 and average number of kids = 2: female life expectancy = 82.667 - 0.240 × 10 - 0.662 × 2 = 78.943.

The b values are the change in the outcome associated with a unit change in the predictor, with the other predictor held fixed (e.g. $b_2 = -0.662$, so each additional child lowers life expectancy by 0.662 years).

To assess the usefulness of each predictor in the model, you cannot simply compare the coefficients. Even if the independent variables are all measured in the same units, a comparison of their size may not be revealing. When they are correlated, it is hard to quantify the unique contribution of each variable.
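The worked prediction above can be reproduced in a few lines. The coefficients are the ones reported in the text; the function and constant names are illustrative only.

```python
# Coefficients of the fitted two-predictor model, as reported in the text.
B0 = 82.667
B_MORTALITY = -0.240   # change in life expectancy per extra death per 1000 live births
B_FERTILITY = -0.662   # change in life expectancy per extra child


def predict_life_expectancy(mortality: float, fertility: float) -> float:
    """Predicted female life expectancy (years) from the fitted equation."""
    return B0 + B_MORTALITY * mortality + B_FERTILITY * fertility


predicted = predict_life_expectancy(mortality=10, fertility=2)
print(round(predicted, 3))
```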

Beta coefficients are an attempt to make the regression coefficients more comparable. You would get the same coefficients if you transformed the data to z-scores prior to your regression run. As infant mortality increases by 38.1531 deaths per 1000 live births (one standard deviation), average female life expectancy decreases by 0.863 × 10.615 = 9.1607 years. As fertility increases by 1.9025 (average number of kids, one standard deviation), average female life expectancy decreases by 0.119 × 10.615 = 1.263 years.

Multiple regression: (female life expectancy) = 82.667 - 0.240(mortality) - 0.662(fertility)

The t statistics provide some clue regarding the relative importance of each variable in the model. The probabilities should not be used for a formal test regarding the importance of each variable. These probabilities are appropriate if you want to do one preselected test, and not if you are looking, say, for the strongest variable. For such a test you would need the distribution of the largest t, and that is affected by the number of variables scanned, their correlation structure and the sample size.

As a guide regarding useful predictors, look for t values well below -2 or above +2. In this example the t's are -18.3 and -2.5, so both independent variables meet the guideline. However, infant mortality is the strongest predictor; fertility is more marginal (see the partial residual plots).
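The link between the raw coefficients and the beta coefficients can be checked numerically. The sketch below assumes that 38.1531, 1.9025 and 10.615 are the sample standard deviations of infant mortality, fertility and life expectancy, which is how the text uses them; the function name is illustrative.

```python
def beta_from_b(b: float, sd_x: float, sd_y: float) -> float:
    """Standardized (beta) coefficient: b rescaled by sd(x) / sd(y)."""
    return b * sd_x / sd_y


SD_LIFEEXP = 10.615     # assumed standard deviation of life expectancy (years)
SD_MORTALITY = 38.1531  # assumed standard deviation of infant mortality
SD_FERTILITY = 1.9025   # assumed standard deviation of fertility

beta_mortality = beta_from_b(-0.240, SD_MORTALITY, SD_LIFEEXP)
beta_fertility = beta_from_b(-0.662, SD_FERTILITY, SD_LIFEEXP)
print(round(beta_mortality, 3), round(beta_fertility, 3))
```

Up to rounding of the published coefficients, the results match the betas 0.863 and 0.119 (in absolute value) used in the text.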

The partial residual plots provide a graphical version of the t statistic for each predictor, or of its partial correlation with the dependent variable after removing the linear effect of the other variables. These plots are useful for identifying cases that mask, or falsely enhance, the predictive power of a particular independent variable. If the target model holds, linearity must be evident in the display.

Two sets of residuals are displayed: those from regressing the dependent variable, life expectancy, on fertility, and those from regressing infant mortality on fertility. (When the model has three or more independent variables, each regression is computed using all the other independent variables.) The correlation between the two sets of residuals is the partial correlation, which measures the relation between life expectancy and infant mortality after adjusting for fertility.

Notice: observations with influence on the infant mortality coefficient stand out on the x axis (Afghanistan is furthest from 0). The slope of the regression line through the origin is the same as that for the full model (-0.24). The residuals (vertical deviations from this regression line) are the same as those from the full model.
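The two properties noted above (same slope and same residuals as the full model) can be verified on simulated data. The sketch below uses NumPy and made-up data, not the life expectancy data set; `x1` and `x2` merely play the roles of infant mortality and fertility.

```python
import numpy as np


def ols(X, y):
    """Least-squares coefficients for y ~ X (X already contains a constant column)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Simulated data (hypothetical): two correlated predictors and an outcome.
rng = np.random.default_rng(0)
n = 60
x2 = rng.normal(size=n)
x1 = 0.8 * x2 + rng.normal(size=n)
y = 80.0 - 0.25 * x1 - 0.7 * x2 + rng.normal(scale=0.5, size=n)

ones = np.ones(n)
X_full = np.column_stack([ones, x1, x2])
full = ols(X_full, y)
full_residuals = y - X_full @ full

# The two sets of residuals shown in the partial plot for x1.
Z = np.column_stack([ones, x2])
res_y = y - Z @ ols(Z, y)      # y with the linear effect of x2 removed
res_x1 = x1 - Z @ ols(Z, x1)   # x1 with the linear effect of x2 removed

# Regression through the origin of res_y on res_x1.
slope_origin = (res_x1 @ res_y) / (res_x1 @ res_x1)
plot_residuals = res_y - slope_origin * res_x1

# Partial correlation between y and x1 after adjusting for x2.
partial_corr = np.corrcoef(res_x1, res_y)[0, 1]
print(slope_origin, full[1], partial_corr)
```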

The simple correlation between fertility and life expectancy is -0.838. In the plot here, after the effect of infant mortality has been removed, this relation is considerably diminished. Zambia is well separated from the other plot points vertically, but is not extreme in the x direction. Thus it may not exert undue influence on the coefficient for fertility.

We know that the numbers in the regression output do not provide a complete picture of how well the estimated model fits. We need to plot the residuals in order to identify such problems as unusual values in the dependent variable, nonlinearity and a need for transformation.

Here we look at diagnostics for identifying outliers among the independent variables: Mahalanobis distance and leverage. The computations for both ignore the dependent variable and use the values of the independent variables to compute the distance of each case from the mean of all cases. Values of leverage range from 0 to (n-1)/n. Values less than 0.2 appear safe, values between 0.2 and 0.5 risky, and values above 0.5 are to be avoided.
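Both diagnostics can be computed directly from the predictor values. The sketch below uses NumPy and simulated data, and assumes that the leverage meant here is the centered leverage (hat value minus 1/n), which is consistent with the stated range from 0 to (n-1)/n. Under that assumption the Mahalanobis distance is simply (n-1) times the leverage.

```python
import numpy as np

# Simulated predictor values (hypothetical), with one deliberately unusual case.
rng = np.random.default_rng(1)
n = 40
X = rng.normal(size=(n, 2))
X[0] = [6.0, -5.0]

# Squared Mahalanobis distance of each case from the mean of all cases.
Xc = X - X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))  # sample covariance, divisor n - 1
mahal = np.einsum("ij,jk,ik->i", Xc, S_inv, Xc)

# Hat values from the design matrix with a constant, then centered leverage.
design = np.column_stack([np.ones(n), X])
hat = np.diag(design @ np.linalg.inv(design.T @ design) @ design.T)
centered_leverage = hat - 1.0 / n

print(centered_leverage.argmax(), centered_leverage.max())
```

Neither quantity uses the dependent variable, so a case can have high leverage and still be fitted well.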

Cook's distance is a statistic that measures the change in all of the regression coefficients if the i-th case is deleted. Cook's distance for a case depends on both the Studentized residual and the leverage value. The countries exerting the most influence on the estimates of the coefficients, ordered by Cook's measure, are Afghanistan, Zambia, Somalia and Uganda.

It is possible to run the regression analysis with a case included and then rerun the analysis with that case excluded. If we did this, there would undoubtedly be some difference between the regression coefficients. The difference between a parameter estimated using all cases and the same parameter estimated when one case is excluded is known as the DFBeta. This difference tells us how much influence a particular case has on the parameters of the regression model. DFBeta is calculated for every case and for each of the parameters in the model. Since the units of measurement affect these values, SPSS also produces the standardized DFBeta.
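The delete-and-refit idea can be sketched as follows, using NumPy and simulated data rather than the country data set. The loop computes the unstandardized DFBeta for every case and coefficient, and checks that Cook's distance obtained from the refits agrees with the usual closed form based on the residual and the hat value.

```python
import numpy as np


def fit(X, y):
    """Least-squares coefficients for y ~ X (X already contains a constant column)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Simulated data (hypothetical).
rng = np.random.default_rng(2)
n = 30
x = rng.normal(size=(n, 2))
y = 70.0 - 2.0 * x[:, 0] - 0.5 * x[:, 1] + rng.normal(scale=0.8, size=n)
X = np.column_stack([np.ones(n), x])
k = X.shape[1]                      # number of parameters, constant included

b = fit(X, y)
resid = y - X @ b
s2 = resid @ resid / (n - k)        # residual mean square
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)   # hat (leverage) values

# Closed form: depends only on the residual and the leverage of each case.
cooks_formula = resid**2 / (k * s2) * h / (1.0 - h) ** 2

# Refit without each case in turn.
dfbeta = np.empty((n, k))
cooks_refit = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    diff = b - fit(X[keep], y[keep])            # DFBeta for case i
    dfbeta[i] = diff
    cooks_refit[i] = diff @ (X.T @ X) @ diff / (k * s2)

print(np.argsort(cooks_formula)[::-1][:4])      # four most influential cases
```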

In SPSS: Graphs > Sequence (first sort the cases by region).

The first 21 cases are OECD countries. Cases 36 through 52 are countries in the Pacific Asia group. The DFBeta value for Afghanistan is the tall spike at the beginning of this group. The descending spike for fertility is Zambia (in the fourth group). We suspect that the sample may not be homogeneous; the model seems to fit less well for one or more subsets of observations.

DFFit is a related statistic: it is the difference between the predicted value for a case when the model is calculated including that case and when the model is calculated excluding that case. SPSS also produces standardized versions of the DFFit values. Apart from Afghanistan, the estimates of the coefficients and their variances are most sensitive to countries in the fourth interval (Africa).

SPSS can estimate coefficients for multiple linear regression models with more than two independent variables. When fitting such a model, you may know just which variables you want to include as predictors. If so, proceed using the SPSS default equation-building method, Enter: all variables you select as independent variables are included in the model. Often, however, a researcher may not know just which subset of variables makes up a good model. SPSS provides several methods for controlling the entry or removal of independent variables from the regression model.

Forward selection enters variables into the model step by step. The first variable entered at step 1 is the one with the strongest simple correlation with the dependent variable. At each subsequent step the variable with the strongest partial correlation enters. At each step, the hypothesis that the coefficient of the entered variable is 0 is tested using its t statistic. Stepping stops when an established criterion for the F no longer holds.

Backward elimination begins with all candidate variables in the model and at each step removes the least useful predictor (lowest F-to-remove). Variables are removed until an established criterion for the F no longer holds.

Stepwise selection begins like forward stepping, but at each step tests the variables already in the model for removal. This is the most commonly used method, especially when there are correlations among the independent variables.

Dependent: lifeexp. Independent(s): babymort, fertility, urban, log_gdp, b_to_d, pop_incr.
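Forward selection as described above can be sketched in a few lines. This is an illustration on simulated data, not a reproduction of the SPSS procedure: the function names are hypothetical, the only stopping rule is an F-to-enter threshold (3.84, the value quoted in these notes), and no removal step is included, so it is forward selection rather than full stepwise selection.

```python
import numpy as np


def rss(X, y):
    """Residual sum of squares of the least-squares fit y ~ X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ coef
    return float(r @ r)


def forward_select(X, y, f_to_enter=3.84):
    """Return column indices in order of entry, using an F-to-enter rule."""
    n, m = X.shape
    selected = []
    current = np.ones((n, 1))            # start from the constant-only model
    while len(selected) < m:
        rss_old = rss(current, y)
        best_j, best_f = None, f_to_enter
        for j in range(m):
            if j in selected:
                continue
            trial = np.column_stack([current, X[:, j]])
            rss_new = rss(trial, y)
            df_resid = n - trial.shape[1]
            # Partial F for adding column j; it equals the square of its t.
            f_stat = (rss_old - rss_new) / (rss_new / df_resid)
            if f_stat > best_f:
                best_j, best_f = j, f_stat
        if best_j is None:               # no candidate passes the criterion
            break
        selected.append(best_j)
        current = np.column_stack([current, X[:, best_j]])
    return selected


# Simulated data (hypothetical): only columns 0 and 2 truly matter.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + 2.0 * X[:, 2] + rng.normal(size=200)
order = forward_select(X, y)
print(order)
```

Choosing the candidate with the largest partial F is the same as choosing the one with the strongest partial correlation, which is why the two descriptions in the text agree.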

The simple correlations of each candidate predictor with the dependent variable, life expectancy, are displayed in the top row of this matrix. Infant mortality has the highest correlation (-0.962) and the birth-to-death ratio the lowest (-0.079). The correlations not in the first row or first column are the correlations among the independent variables.

This overview of the stepping process indicates that four of the six candidate predictors are included in the final model. Five variables are entered into the equation, in this order: babymort, urban, fertility, b_to_d, log_gdp. The second variable to enter, urban, is then removed at step 6.

Here, $R^2$ for the final model is 0.946 and adjusted $R^2$ is 0.944. Notice that these values did not drop when urban was removed. Including irrelevant variables also increases the standard error of the estimate; here the standard error of the estimate decreases from 2.93 years, when infant mortality is the only predictor, to 2.52 years when the model includes four variables.

At step 1, SPSS enters babymort as the first variable because it has the highest correlation with the dependent variable. After that, you can use either of two statistics, reported in the Excluded Variables table, to see how variables are entered or removed.

First, the t statistic for each candidate. At step 1, urban has the largest t (in absolute value) among the candidates, so it is entered into the step 2 model. At step 5, the t for pop_incr fails the default entrance criterion that the F statistic must be greater than 3.84 ($\sqrt{3.84} = 1.96$). Meanwhile, SPSS checks the t values of the variables already entered to see whether any is smaller than 1.65 in absolute value ($\sqrt{2.71} = 1.65$, the default F removal criterion). In step 5, the t for urban is 1.152, so urban is removed.

Second, the partial correlation of each candidate variable with the dependent variable after removing the linear effect of the variables already entered. Notice that at each step the candidate variable with the largest t also has the strongest partial correlation.
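Because the F statistic for a single coefficient is the square of its t statistic, the F thresholds quoted above translate directly into t thresholds. A minimal check, with illustrative names:

```python
import math

F_TO_ENTER = 3.84    # entrance criterion quoted in the text
F_TO_REMOVE = 2.71   # removal criterion quoted in the text

t_enter = math.sqrt(F_TO_ENTER)
t_remove = math.sqrt(F_TO_REMOVE)


def stays_in_model(t: float, f_to_remove: float = F_TO_REMOVE) -> bool:
    """A variable already entered is kept while t squared is at least F-to-remove."""
    return t * t >= f_to_remove


print(round(t_enter, 2), round(t_remove, 2), stays_in_model(1.152))
```

With t = 1.152 for urban at step 5, t squared is about 1.33, below 2.71, which is why urban leaves the model.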

Multiple regression: (female life expectancy) = 70.754 - 0.166(babymort) - 1.625(fertility) + 0.750(b_to_d) + 2.867(log_gdp).

Collinearity refers to the troublesome situation where the correlations among the independent variables are strong. When you suspect that collinearity may be a problem, study the tolerance statistic. Only the values of the independent variables are used to calculate it. Values of tolerance range from 0 to 1. When its value is small (close to 0), the variable is almost a linear combination of the other independent variables. Tolerance should be more than 0.2.

The variance inflation factor (VIF) is the reciprocal of tolerance. So, by definition, the variables here with low tolerances have large variance inflation factors. VIF should be less than 10.
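Tolerance can be computed by regressing each independent variable on all the others: it is 1 minus the $R^2$ of that regression, and VIF is its reciprocal. The sketch below uses NumPy and simulated predictors (not the data of the example); the function name is illustrative.

```python
import numpy as np


def tolerance_and_vif(X):
    """Tolerance (1 - R^2 of each column on the others) and VIF = 1 / tolerance."""
    n, m = X.shape
    tol = np.empty(m)
    for j in range(m):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ coef
        centered = X[:, j] - X[:, j].mean()
        tol[j] = (resid @ resid) / (centered @ centered)   # equals 1 - R^2
    return tol, 1.0 / tol


# Simulated predictors (hypothetical): b is almost a copy of a, c is unrelated.
rng = np.random.default_rng(4)
a = rng.normal(size=100)
b = a + 0.1 * rng.normal(size=100)
c = rng.normal(size=100)
tol, vif = tolerance_and_vif(np.column_stack([a, b, c]))
print(np.round(tol, 3), np.round(vif, 1))
```

Note that the dependent variable never appears in the computation, as the text points out.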

Eigenvalues provide an indication of how many distinct dimensions there are among the independent variables. When several eigenvalues are close to 0, the variables are highly intercorrelated (the problem is ill-conditioned). Condition indices are the square roots of the ratios of the largest eigenvalue to each successive eigenvalue. A condition index greater than 15 indicates a possible problem, and an index greater than 30 suggests a serious problem with collinearity.

The variance proportions are the proportions of the variance of each estimate accounted for by the principal component associated with each of the eigenvalues. Collinearity is a problem when a component associated with a high condition index contributes substantially to the variance of two or more variables.

Here, for the final set of four variables (Model 6), the last condition index is 32.897, and the last component accounts for 99% of the variance of the constant, 99% of the variance of log_gdp and 43% of the variance of babymort. Thus, for a more stable model, it might be wise to explore models with three variables.
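Condition indices can be sketched with NumPy as follows. The scaling used here (each column of the design matrix, constant included, rescaled to unit length without centering) is an assumption about the convention behind the SPSS collinearity diagnostics, and the data are simulated.

```python
import numpy as np


def condition_indices(X):
    """Eigenvalues (largest first) and condition indices sqrt(lambda_max / lambda_i)."""
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])
    # Assumed convention: unit-length columns, no centering, constant included.
    scaled = design / np.linalg.norm(design, axis=0)
    eigenvalues = np.linalg.eigvalsh(scaled.T @ scaled)[::-1]   # descending order
    return eigenvalues, np.sqrt(eigenvalues[0] / eigenvalues)


# Simulated predictors (hypothetical): b is nearly identical to a.
rng = np.random.default_rng(5)
a = rng.normal(size=100)
b = a + 0.01 * rng.normal(size=100)
eigenvalues, indices = condition_indices(np.column_stack([a, b]))
print(np.round(eigenvalues, 5), np.round(indices, 1))
```

The first condition index is always 1; an eigenvalue close to 0 produces a large last index, which is the pattern described in the text.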

Bibliography

Andy Field (2009). Discovering Statistics Using SPSS, 3rd edition. SAGE Publications.

M.J. Norusis (2011). IBM SPSS Statistics 19 Guide to Data Analysis. Prentice Hall.