Statistical Data Analysis III
Linear Regression with SPSS, Part B (Multiple Regression)
Nikos Tsantas
Postgraduate Programme, Department of Mathematics: Mathematics and Modern Applications
Academic year 2011-12

MULTIPLE REGRESSION

Simple regression is nothing more than a model for predicting the values of one variable from the values of another. Multiple regression is the logical generalization of this model: it is used to predict the values of one variable from the values of several other variables. It is a hypothetical model of the relationship among several variables.
Linear prediction models

In multiple regression the relationship sought is written as a variant of the equation of a straight line:

$$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n + \varepsilon$$

The task is to find $b_0, b_1, \dots, b_n$.

- $b_0$ (the intercept) is the value of the y variable when all the x's equal 0. It is the point at which the regression plane intersects the y-axis (the vertical axis).
- The b's are the regression coefficients of the corresponding x variables: the slopes with respect to $x_1, x_2, \dots$

Example:

(female life expectancy) = $b_0$ + $b_1$ (infant mortality) + $b_2$ (fertility) + $\varepsilon$

In SPSS: Analyze > Regression > Linear... Under Statistics select Descriptives; under Plots select Produce all partial plots.
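As a rough illustration of what "finding the b's" means, the least-squares fit can be sketched outside SPSS. The data below are synthetic and the variable names (`x1`, `x2`, `true_b`) are hypothetical; this is a minimal sketch, not the computation SPSS performs internally.

```python
import numpy as np

# Synthetic data (assumption): two predictors and a known linear relation plus noise.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
true_b = np.array([5.0, -2.0, 0.5])           # b0, b1, b2 (hypothetical values)
y = true_b[0] + true_b[1] * x1 + true_b[2] * x2 + rng.normal(scale=0.1, size=n)

# Design matrix: a column of ones for the intercept b0, then the predictors.
X = np.column_stack([np.ones(n), x1, x2])

# Least-squares estimates of b0, b1, b2.
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)
```

The column of ones is what produces the intercept; dropping it would force the fitted plane through the origin.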
The independent variables have strong correlations with life expectancy, and the correlation between the two independent variables is also strong. Infant mortality appears to have the strongest linear relation with life expectancy (the sign is negative).

For multiple regression models, R is the correlation between the observed and predicted values of the dependent variable, and $R^2$ is the square of this correlation. For this model with the two variables, $R^2$ is 0.929, an increase of more than 25 percentage points over the model using just one variable (literacy, with $R^2 = 0.67$). Knowing infant mortality and fertility explains almost 93% of the variability of life expectancy.

The sample estimate $R^2$ tends to overestimate the population parameter. Adjusted $R^2$ is designed to compensate for this optimistic bias. It is a function of $R^2$ adjusted by the number of variables in the model, p, and the sample size, n:

$$R_a^2 = R^2 - \frac{p\,(1 - R^2)}{n - p - 1}$$
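The adjustment can be sketched as a small function. The sample size is not stated on this slide, so `n = 100` below is purely a hypothetical value used to show the mechanics.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p predictors fitted to n cases."""
    return r2 - p * (1.0 - r2) / (n - p - 1)


def adjusted_r2_alt(r2, n, p):
    """Algebraically equivalent form often seen in textbooks."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# R^2 = 0.929 with p = 2 predictors; n = 100 is an assumed sample size.
example = adjusted_r2(0.929, n=100, p=2)
print(example)
```

The penalty shrinks as n grows and grows with p, which is why adding weak predictors can lower the adjusted value even though $R^2$ itself never decreases.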
The F test examines whether the variance explained by the model ($SS_M$) is significantly greater than the error within the model ($SS_R$). It therefore tells us whether using the regression model is significantly better at predicting values of the outcome (dependent variable) than using the mean.

The F statistic is highly significant, so the hypothesis that all coefficients are simultaneously 0 is rejected. The fact that the associated probability (Sig.) is so small does not imply that each of the independent variables makes a meaningful contribution to the fit of the model.

Fitted model:

(female life expectancy) = 82.667 − 0.240 (mortality) − 0.662 (fertility)

For example, for infant mortality = 10 and average number of children = 2:

female life expectancy = 82.667 − 0.240 × 10 − 0.662 × 2 = 78.943

The b values are the change in the outcome associated with a unit change in the predictor, holding the other predictor fixed (for example, $b_2 = -0.662$, so each additional child decreases life expectancy by 0.662 years).

To assess the usefulness of each predictor in the model, you cannot simply compare the coefficients. Even if the independent variables are all measured in the same units, a comparison of their sizes may not be revealing. When they are correlated, it is hard to quantify the unique contribution of each variable.
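The worked prediction above can be checked with a few lines of code. The function name is hypothetical; the coefficients are the ones reported on the slide.

```python
def predict_life_expectancy(mortality, fertility):
    """Fitted two-predictor model from the slide (coefficients as reported)."""
    return 82.667 - 0.240 * mortality - 0.662 * fertility


# Infant mortality of 10 per 1000 live births, 2 children on average.
prediction = predict_life_expectancy(10, 2)
print(f"{prediction:.3f}")  # 78.943
```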
Beta coefficients are an attempt to make the regression coefficients more comparable. You would get the same coefficients if you transformed the data to z-scores before running the regression.

- As infant mortality increases by one standard deviation, 38.1531 (deaths per 1000 live births), average female life expectancy decreases by 0.863 × 10.615 = 9.1607 years.
- As fertility increases by one standard deviation, 1.9025 (average number of children), average female life expectancy decreases by 0.119 × 10.615 = 1.263 years.

(female life expectancy) = 82.667 − 0.240 (mortality) − 0.662 (fertility)

The t statistics provide some clue regarding the relative importance of each variable in the model. The probabilities should not be used for a formal test of the importance of each variable. They are appropriate if you want to do one preselected test, not if you are looking, say, for the strongest variable. For such a test you would need the distribution of the largest t, which is affected by the number of variables scanned, their correlation structure and the sample size.

As a guide to useful predictors, look for t values well below −2 or above +2. In this example the t's are −18.3 and −2.5, so both independent variables meet the guideline. However, infant mortality is the strongest predictor; fertility is more marginal (see the partial residual plots).
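The relation between raw and standardized coefficients is beta = b × (SD of predictor) / (SD of outcome). The sketch below plugs in the figures quoted on the slide; the function name is hypothetical.

```python
def standardized_beta(b, sd_x, sd_y):
    """Convert an unstandardized coefficient b into a beta coefficient."""
    return b * sd_x / sd_y


sd_lifeexp = 10.615  # standard deviation of female life expectancy (from the slide)

beta_mortality = standardized_beta(-0.240, 38.1531, sd_lifeexp)
beta_fertility = standardized_beta(-0.662, 1.9025, sd_lifeexp)

print(round(beta_mortality, 3))  # -0.863
print(round(beta_fertility, 3))  # -0.119
```

Both values agree with the betas used in the two bullet points, which is a quick consistency check on the reported output.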
The partial residual plots provide a graphical version of the t statistic for each predictor, or of its partial correlation with the dependent variable after removing the linear effect of the other variables. These plots are useful for identifying cases that mask, or falsely enhance, the predictive power of a particular independent variable. If the target model holds, linearity must be evident in the display.

Two sets of residuals are displayed, obtained by:
- regressing the dependent variable (life expectancy) on fertility;
- regressing infant mortality on fertility.

(When your model has three or more independent variables, each regression is computed using all the other independent variables.) The correlation between the two sets of residuals is the partial correlation that measures the relation between life expectancy and infant mortality after adjusting for fertility.

Notes:
- Observations with influence on the infant mortality coefficient stand out on the x axis (Afghanistan is furthest from 0).
- The slope of the regression line through the origin is the same as that for the full model (−0.24).
- The residuals (vertical deviations from this regression line) are the same as those from the full model.
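The two properties listed in the notes (same slope, same residuals as the full model) can be verified numerically. The sketch below uses synthetic data with hypothetical variable names chosen to mirror the example; it is not the lecture's data set.

```python
import numpy as np


def ols(X, y):
    """Least-squares coefficients for design matrix X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


# Synthetic, correlated predictors (assumption: values are illustrative only).
rng = np.random.default_rng(1)
n = 100
fertility = rng.normal(4.0, 2.0, n)
mortality = 10.0 * fertility + rng.normal(0.0, 15.0, n)
lifeexp = 83.0 - 0.24 * mortality - 0.66 * fertility + rng.normal(0.0, 3.0, n)

ones = np.ones(n)
X_full = np.column_stack([ones, mortality, fertility])
b_full = ols(X_full, lifeexp)
resid_full = lifeexp - X_full @ b_full

# Remove the linear effect of fertility from both the outcome and mortality.
Z = np.column_stack([ones, fertility])
res_y = lifeexp - Z @ ols(Z, lifeexp)      # plotted on the y axis
res_x = mortality - Z @ ols(Z, mortality)  # plotted on the x axis

# Regression through the origin of one residual set on the other.
slope = (res_x @ res_y) / (res_x @ res_x)
resid_partial = res_y - slope * res_x

# Partial correlation of life expectancy with mortality, adjusting for fertility.
partial_r = np.corrcoef(res_x, res_y)[0, 1]
print(slope, b_full[1], partial_r)
```

Plotting `res_x` against `res_y` would reproduce the kind of display SPSS produces for infant mortality.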
The simple correlation between fertility and life expectancy is −0.838. In the plot here, after the effect of infant mortality has been removed, this relation is considerably diminished. Zambia is well separated from the other plot points vertically, but is not extreme in the x direction. Thus, it may not exert undue influence on the coefficient for fertility.

The numbers in the regression output do not provide a complete picture of how well the estimated model fits. We need to plot the residuals in order to identify problems such as unusual values in the dependent variable, nonlinearity and a need for transformation.

Here we look at diagnostics for identifying outliers among the independent variables: Mahalanobis distance and leverage. The computations for both ignore the dependent variable and use the values of the independent variables to compute the distance of each case from the mean of all cases. Values of leverage range from 0 to (n−1)/n. Values below 0.2 appear safe, values between 0.2 and 0.5 are risky and values above 0.5 are to be avoided.
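Both diagnostics can be computed from the predictors alone. The sketch below uses synthetic predictors and assumes the leverage in question is the centered leverage (hat value minus 1/n), which is consistent with the range 0 to (n−1)/n quoted above; under that assumption the Mahalanobis distance is simply (n−1) times the leverage.

```python
import numpy as np

# Synthetic predictor values (assumption: two predictors, 50 cases).
rng = np.random.default_rng(2)
n = 50
X = rng.normal(size=(n, 2))

# Hat matrix of the design with an intercept column; its diagonal is the leverage.
Xd = np.column_stack([np.ones(n), X])
hat_diag = np.diag(Xd @ np.linalg.inv(Xd.T @ Xd) @ Xd.T)
centered_leverage = hat_diag - 1.0 / n

# Squared Mahalanobis distance of each case from the mean of the predictors.
diff = X - X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))  # sample covariance (n - 1 divisor)
mahalanobis = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

# Flag cases using the rule of thumb from the text.
risky = np.where(centered_leverage > 0.2)[0]
print(centered_leverage.max(), mahalanobis.max(), risky)
```

Because the two measures are proportional, they rank the cases identically; they differ only in scale.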
Cook's distance is a statistic that measures the change in all of the regression coefficients when the i-th case is deleted. Cook's distance for a case depends on both its Studentized residual and its leverage. The countries exerting the most influence on the estimates of the coefficients, ordered by Cook's measure, are Afghanistan, Zambia, Somalia and Uganda.

It is possible to run the regression analysis with a case included and then rerun it with that case excluded. If we did this, there would undoubtedly be some difference between the regression coefficients. The difference between a parameter estimated using all cases and the same parameter estimated when one case is excluded is known as the DFBeta. This difference tells us how much influence a particular case has on the parameters of the regression model. DFBeta is calculated for every case and for each of the parameters in the model. Since the units of measurement affect these values, SPSS also produces the standardized DFBeta.
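The "include, then exclude" procedure described above can be sketched directly. The data are synthetic and all names are hypothetical; the closed-form expression for Cook's distance (squared residual times a leverage factor) is the standard textbook one and is checked against the brute-force refits.

```python
import numpy as np


def fit(X, y):
    """Least-squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


# Synthetic data (assumption): intercept plus two predictors.
rng = np.random.default_rng(3)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)
p = X.shape[1]                      # number of parameters, intercept included

b = fit(X, y)
e = y - X @ b                       # residuals of the full fit
s2 = e @ e / (n - p)                # residual mean square
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)  # leverage (hat values)

dfbeta = np.empty((n, p))
cook_refit = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_without_i = fit(X[keep], y[keep])
    dfbeta[i] = b - b_without_i     # DFBeta: all cases minus case i excluded
    shift = X @ dfbeta[i]           # change in all fitted values
    cook_refit[i] = shift @ shift / (p * s2)

# Closed form: depends only on the residual and the leverage of each case.
cook_formula = e**2 / (p * s2) * h / (1.0 - h) ** 2

most_influential = np.argsort(cook_refit)[::-1][:4]
print(most_influential)
```

The closed form is why no actual refitting is needed in practice: one fit yields the diagnostic for every case.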
Graphs > Sequence (first, sort the cases by region)

The first 21 cases are OECD countries. Cases 36 through 52 are countries in the Pacific/Asia group. The DFBeta value for Afghanistan is the tall spike at the beginning of this group. The descending spike for fertility is Zambia (fourth group). We suspect that the sample may not be homogeneous; the model seems to fit less well for one or more subsets of observations.

DFFit is a related statistic: the difference between the predicted value for a case when the model is calculated including that case and when it is calculated excluding that case. SPSS also produces standardized versions of the DFFit values. Apart from Afghanistan, the estimates of the coefficients and their variances are most sensitive to countries in the fourth interval (Africa).
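DFFit can be sketched the same way, by refitting without each case and comparing the two predictions for that case. The data are synthetic and the names hypothetical; the shortcut formula in the last line is a standard identity, verified here against the refits.

```python
import numpy as np

# Synthetic data (assumption): intercept plus two predictors.
rng = np.random.default_rng(4)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([2.0, 1.0, -1.0]) + rng.normal(size=n)

b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

dffit_refit = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_without_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    # Prediction for case i with the case included minus with it excluded.
    dffit_refit[i] = X[i] @ b - X[i] @ b_without_i

# Shortcut using only the residual and leverage from the full fit.
dffit_formula = h * e / (1.0 - h)
print(np.abs(dffit_refit).max())
```

Plotting `dffit_refit` in case order, after sorting by region, would give a display analogous to the sequence charts described above.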
SPSS can estimate coefficients for multiple linear regression models with more than two independent variables. When fitting such a model, you may know exactly which variables you want to include as predictors. If so, proceed using the SPSS default equation-building method, Enter: all variables you select as independent variables are included in the model.

Often, however, a researcher may not know which subset of variables yields a good model. SPSS provides several methods for controlling the entry or removal of independent variables from the regression model:

- Forward selection enters variables into the model step by step. The variable entered at step 1 is the one with the strongest simple correlation with the dependent variable. At each subsequent step the variable with the strongest partial correlation enters. At each step, the hypothesis that the coefficient of the entered variable is 0 is tested using its t statistic. Stepping stops when an established criterion for the F no longer holds.
- Backward elimination begins with all candidate variables in the model and at each step removes the least useful predictor (lowest F-to-remove). Variables are removed until an established criterion for the F no longer holds.
- Stepwise selection begins like forward stepping, but at each step tests the variables already in the model for removal. This is the most commonly used method, especially when there are correlations among the independent variables.

Dependent: lifeexp
Independent(s): babymort, fertility, urban, log_gdp, b_to_d, pop_incr
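A minimal sketch of forward selection is given below. It is not SPSS's implementation: it assumes a fixed F-to-enter threshold of 3.84 (the value quoted later in these notes), has no removal step, and runs on synthetic data with hypothetical column roles.

```python
import numpy as np


def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ b
    return r @ r


def forward_select(X, y, f_enter=3.84):
    """Return the column indices entered, in order of entry."""
    n, k = X.shape
    ones = np.ones((n, 1))
    chosen = []
    while len(chosen) < k:
        base = np.hstack([ones, X[:, np.array(chosen, dtype=int)]])
        rss_base = rss(base, y)
        best_f, best_j = -np.inf, None
        for j in range(k):
            if j in chosen:
                continue
            cand = np.hstack([base, X[:, [j]]])
            rss_cand = rss(cand, y)
            df_resid = n - cand.shape[1]
            # Partial F for adding column j (equals the square of its t statistic).
            f = (rss_base - rss_cand) / (rss_cand / df_resid)
            if f > best_f:
                best_f, best_j = f, j
        if best_f <= f_enter:
            break                     # no remaining candidate meets the criterion
        chosen.append(best_j)
    return chosen


# Synthetic example: only columns 0 and 2 actually drive the outcome.
rng = np.random.default_rng(5)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] + 1.0 * X_demo[:, 2] + rng.normal(size=200)
entered = forward_select(X_demo, y_demo)
print(entered)
```

Stepwise selection would add, after each entry, a check of the variables already in the model against an F-to-remove threshold.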
The simple correlations of each candidate predictor with the dependent variable, life expectancy, are displayed in the top row of this matrix. Infant mortality has the highest correlation (−0.962) and the birth-to-death ratio the lowest (−0.079). The correlations not in the first row or first column are the correlations among the independent variables.

This overview of the stepping process indicates that four of the six candidate predictors are included in the final model. Five variables are entered into the equation, in this order: babymort, urban, fertility, b_to_d, log_gdp. The second variable to enter, urban, is then removed at step 6.
Here, $R^2$ for the final model is 0.946 and adjusted $R^2$ is 0.944. Notice that these values did not drop when urban was removed. Including irrelevant variables would also increase the standard error of the estimate; here it decreases from 2.93 years when infant mortality is the only predictor to 2.52 years when the model includes four variables.

At step 1, SPSS enters babymort as the first variable because it has the highest correlation with the dependent variable. From then on you can use either of the following statistics to see how variables are entered or removed:

- The t statistic for each candidate. At step 1, urban has the largest t, so it is entered into the step 2 model. At step 5, the t for pop_incr fails the default entrance criterion that the F statistic must be greater than 3.84 ($\sqrt{3.84} = 1.96$). Meanwhile, alongside the values reported in the Excluded Variables table, SPSS checks the t values of the variables already entered to see whether any has a value below 1.65 ($\sqrt{2.71} = 1.65$, the default F removal criterion). At step 5, the t for urban is 1.152.
- The partial correlation of each candidate variable with the dependent variable after removing the linear effect of the variables already entered. Notice that at each step the candidate variable with the largest t also has the strongest partial correlation.
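The link between the F thresholds and the t values, and between t and the partial correlation, can be sketched as follows. The residual degrees of freedom used in the example call are hypothetical, since the sample size is not given here.

```python
import math

# An F statistic with 1 numerator degree of freedom is the square of a t statistic.
t_enter = math.sqrt(3.84)    # about 1.96
t_remove = math.sqrt(2.71)   # about 1.65


def partial_r_from_t(t, df_resid):
    """Partial correlation implied by a t statistic with df_resid residual df."""
    return t / math.sqrt(t * t + df_resid)


# urban at step 5 has t = 1.152, which is below the removal threshold.
urban_removed = abs(1.152) < t_remove

# For a fixed df, a larger |t| always gives a larger |partial correlation|.
r_small = partial_r_from_t(2.0, df_resid=100)   # df_resid is an assumed value
r_large = partial_r_from_t(3.0, df_resid=100)
print(round(t_enter, 2), round(t_remove, 2), urban_removed)
```

Because the mapping from t to partial correlation is monotonic at a given step, ranking candidates by either statistic selects the same variable.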
Final model:

(female life expectancy) = 70.754 − 0.166 (babymort) − 1.625 (fertility) + 0.750 (b_to_d) + 2.867 (log_gdp)

Collinearity refers to the troublesome situation where the correlations among the independent variables are strong. When you suspect that collinearity may be a problem, study the tolerance statistic. Only the values of the independent variables are used to calculate it. Values of tolerance range from 0 to 1. When its value is small (close to 0), the variable is almost a linear combination of the other independent variables. Tolerance should be greater than 0.2.

The variance inflation factor (VIF) is the reciprocal of tolerance. So, by definition, the variables with low tolerances have large variance inflation factors. VIF should be less than 10.
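Tolerance for a predictor is 1 minus the $R^2$ obtained when that predictor is regressed on the others, and VIF is its reciprocal. The sketch below uses synthetic predictors in which the third column is deliberately built as a near-linear combination of the first two; names are hypothetical.

```python
import numpy as np


def tolerance_and_vif(X):
    """Tolerance and VIF for each column of the predictor matrix X."""
    n, k = X.shape
    tol = np.empty(k)
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef = np.linalg.lstsq(others, X[:, j], rcond=None)[0]
        resid = X[:, j] - others @ coef
        centered = X[:, j] - X[:, j].mean()
        tol[j] = (resid @ resid) / (centered @ centered)   # 1 - R_j^2
    return tol, 1.0 / tol


# Synthetic predictors (assumption): column 2 is nearly x0 + x1.
rng = np.random.default_rng(6)
x0 = rng.normal(size=150)
x1 = rng.normal(size=150)
x2 = x0 + x1 + rng.normal(scale=0.1, size=150)
X_demo = np.column_stack([x0, x1, x2])

tol, vif = tolerance_and_vif(X_demo)
flagged = np.where((tol < 0.2) | (vif > 10))[0]
print(tol, vif, flagged)
```

Note that the outcome variable never appears in the computation, in line with the statement that only the independent variables are used.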
Eigenvalues provide an indication of how many distinct dimensions there are among the independent variables. When several eigenvalues are close to 0, the variables are highly intercorrelated (the problem is ill-conditioned). Condition indices are the square roots of the ratios of the largest eigenvalue to each successive eigenvalue. A condition index greater than 15 indicates a possible problem, and an index greater than 30 suggests a serious problem with collinearity.

The variance proportions are the proportions of the variance of each estimate accounted for by the principal component associated with each of the eigenvalues. Collinearity is a problem when a component associated with a high condition index contributes substantially to the variance of two or more variables.

Here, for the final set of four variables (Model 6), the last condition index is 32.897, and the last component accounts for 99% of the variance of the constant, 99% of the variance of log_gdp and 43% of the variance of babymort. Thus, for a more stable model, it might be wise to explore models with three variables.
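These diagnostics can be sketched as follows, assuming the usual Belsley-Kuh-Welsch style computation: the columns of the design matrix (constant included) are scaled to unit length without centering, and the eigen-decomposition of their cross-product matrix is used. Whether SPSS follows exactly this recipe is an assumption here; the data are synthetic.

```python
import numpy as np


def collinearity_diagnostics(X):
    """Eigenvalues, condition indices and variance proportions for predictors X."""
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])            # include the constant
    Xs = Xd / np.linalg.norm(Xd, axis=0)             # unit-length columns, uncentered
    eigvals, eigvecs = np.linalg.eigh(Xs.T @ Xs)
    order = np.argsort(eigvals)[::-1]                # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cond_index = np.sqrt(eigvals[0] / eigvals)
    # phi[k, j]: contribution of component j to the variance of coefficient k.
    phi = eigvecs**2 / eigvals
    var_prop = (phi / phi.sum(axis=1, keepdims=True)).T  # rows: components
    return eigvals, cond_index, var_prop


# Synthetic predictors (assumption): the third is almost x0 + x1.
rng = np.random.default_rng(7)
x0 = rng.normal(size=120)
x1 = rng.normal(size=120)
x2 = x0 + x1 + rng.normal(scale=0.01, size=120)
eigvals, cond_index, var_prop = collinearity_diagnostics(np.column_stack([x0, x1, x2]))
print(cond_index)
print(var_prop[-1])   # variance proportions for the last (weakest) component
```

In the returned table each column (one per coefficient) sums to 1, so a row with a high condition index and two or more large entries points to the variables involved in the near-dependency.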
Bibliography

Field, A. (2009). Discovering Statistics Using SPSS, 3rd edition. SAGE Publications.
Norusis, M. J. (2011). IBM SPSS Statistics 19 Guide to Data Analysis. Prentice Hall.