Multivariate Models, PCA, ICA and ICS

Transcript

1 Multivariate Models, PCA, ICA and ICS Klaus Nordhausen 1 Hannu Oja 1 Esa Ollila 2 David E. Tyler 3 1 Tampere School of Public Health University of Tampere 2 Department of Mathematical Science University of Oulu 3 Department of Statistics Rutgers - The State University of New Jersey

2 Outline Preliminaries Multivariate Models Whitening PCA ICA ICS R

3 Important Matrices and Decompositions A p p permutation matrix P is obtained by permuting the rows and/or columns of the identity matrix I p. The sign-change matrix J is a diagonal matrix with diagonal elements ±1. The p p matrix O is an orthogonal matrix if O T O = OO T = I p. P and J are orthogonal matrices. D is a diagonal matrix. Let A be p p matrix. The singular value decomposition (SVD) of A is A = O DO T. Let A be p p matrix. The eigenvalue decomposition of A is A = ODO T.

4 Affine transformation Let x be a p-vector. An affine transformation has then the form y = Ax + b, where A is a full rank p p matrix with SVD A = O DO T and b is a p vector. This transformation can be interpreted as a change in coordinate system. The original system of x is first rotated / reflected using O T then componentwise rescaled using D, again rotated / reflected using O and finally the origin of the system is shifted by b. In statistics methods and estimates are preferred that do not depend on the underlying coordinate system. A method like for example a hypothesis test is called affine invariant if it does not depend on the coordinate system, i.e. the p-value does not change when the coordinate system is changed. Estimates are called affine equivariant if they follow a change in the coordinate system in an appropriate way.

5 Location and Scatter Functionals Let x be a p-variate random vector with cdf F. A p-vector valued functional T = T(F ) = T(x) is called a location functional if it is affine equivariant in the sense that T(Ax + b) = AT(x) + b for all full-rank p p-matrices A and for all p-vectors b. A p p matrix S = S(F) = S(x) is a scatter functional if it is affine equivariant in the sense that S(Ax + b) = AS(x)A T for all full-rank p p-matrices A and for all p-vectors b. The corresponding sample versions are called location statistic and scatter statistic.

6 Special Scatter Functionals A symmetrized scatter functional version of a scatter functional S can be obtained by using two independent copies x 1 and x 2 of x S sym (x) := S(x 1 x 2 ). A shape matrix is only affine equivariant in the sense that S(Ax + b) AS(x)A T.

7 Independence property A scatter functional S has the independence property if S(x) is a diagonal matrix for all x with independent margins. Note that in general scatter functionals do not have the independence property. Only the covariance matrix, the matrix of fourth moments and symmetrized scatter matrices have this property. If, however, x has independent and at least p 1 symmetric components all scatter matrices will be diagonal matrices.

8 M-estimates of Location and Scatter There are a lot of techniques to construct location and scatter estimates. In this talk we will mainly use M-estimates. M-estimators for Location (T) and Scatter (S) satisfy the following two implicit equations: and T = T(x) = (E(w 1 (r))) 1 E (w 1 (r)x) S = S(x) = E ( w 2 (r)(x T)(x T) T ) for some suitably chosen weight functions w 1 (r) and w 2 (r). The scalar r is the Mahalanobis distance between x and T, that is, r = x T S.

9 Examples of Scatter Matrices The most common location and scatter statistics are the vector of means E(x) and the regular covariance matrix COV(x). Using those two and compute a so called 1 step M-estimator one yields as a scatter the matrix of fourth moments. COV 4 (x) = 1 p + 2 E(r 2 (x E(x))(x E(x)) T ) Another famous M-estimator is Tyler s shape matrix. ( ) (x T(x))(x T(x)) T S Tyl (x) = pe x T(x) 2 S The symmetrized version of Tyler s shape matrix is known as Dümbgens s shape matrix.

10 Location - Scatter Model All models discussed in this talk can be derived from the location scatter model x = Ωɛ + µ, where x = (x 1,..., x p ) T is a p-variate random vector, ɛ = (ɛ 1,..., ɛ p ) T is a p-variate random vector standardized in a way explained later, Ω a full rank p p mixing matrix and µ a p-variate location vector. The quantity Σ = ΩΩ T is the scatter matrix parameter. For further analysis the assumptions imposed on ɛ are crucial. Note however, that the standardized vector ɛ is actually not observed, only x is directly measureable. The vector ɛ is rather a mental construction than something with a physical meaning. Yet, in some cases, ɛ can have an interpretation and it might even be the goal of the analysis to recover it when only x is observed. Either way, one of the first challenges one faces in practical data analysis is to evaluate which assumptions on ɛ can be justified for the data at hand.

11 Symmetric Models A1: Multivariate normal model. ɛ has a standard multivariate normal distribution N(0, I p ). A2: Elliptic model. ɛ has a spherical distribution around the origin, i.e. Oɛ ɛ for all orthogonal p p matrices O. A3: Exchangeable sign-symmetric model. In this model ɛ is symmetric around the origin in the sense that PJɛ ɛ for all permutation matrices P and sign change matrices J. A4: Sign-symmetric model. ɛ is symmetric around the origin in the sense that Jɛ ɛ for all sign change matrices J. A5: Central symmetric model. ɛ is symmetric around the origin in the sense that ɛ ɛ.

12 Asymmetric Models B1: Finite mixtures of elliptical distributions with proportional scatter matrices. For a fixed k is ɛ = k i=1 p i(ɛ i + µ i ), where p = (p 1,..., p k ) is Multin(1, π) distributed with 0 π i 1 and k i=1 π i = 1 and ɛ i s are all independent and follow A2. B2: Skew-elliptical model. ɛ = sgn(ɛ p+1 α βɛ p)ɛ, where (ɛ T, ɛ p+1 )T satisfies A2, but with dimension p + 1, and α, β R are constants. B3: Independent component model. The components of ɛ are independent with E(ɛ i ) = 0 and Var(ɛ i ) = 1.

13 Relationships Between the Models The symmetric models A1-A5 are ranked from the strongest symmetry assumption to the weakest one, which means A1 A2 A3 A4 A5. The symmetric models can be seen as border cases of the asymmetric models and we have the following relationships: A1 = A2 B3 A2 B1 and A2 B2

14 Uniqueness of Model parameters The location parameter µ is well-defined in models A1-A5. In that case µ is the center of symmetry. The scatter parameter Σ is only well defined in models A1 and B3. It is the covariance matrix. The mixing matrix Ω is not well defined in any of the models. Further assumptions are needed in most models in order to have well-defined parameters Ω and Σ. Relation between location and scatter functionals in the different models: Location statistics estimate in A1-A5 the same population quantity, the center of symmetry. Scatter statistics measure in A1-A3 the same population quantity. They measure the shape of data, i.e. S(x) Σ.

15 Data Transformations In practical data analysis data is often transformed to different coordinates. The transformation can be model based or not and may be motivated by different reasons. Transformations we will have a closer look at are: Whitening Principal components. Independent components. Invariant components. Other transformations are for example: Canonical correlations. Factor analysis. Most transformations are based usually on E(x) and COV(x).

16 Whitening The whitening transformation centers the variables and jointly uncorrelates and scales them to have unit variances. The transformation is where then Properties: y = COV 1/2 (x)(x E(x)), E(y) = 0 and COV(y) = I p. In A1 it makes the marginal variables independent. In A2 y will be spherical. It does usually not recover ɛ. The resulting coordinate system is not affine invariant. It only holds that COV 1 2 (Ax + b)(ax + b E(Ax + b)) = O COV 1 2 (x)(x E(x)).

17 Principal Components The principal components also uncorrelates marginals. It does however not scale them to have unit variances but chooses the new variables successively in such a way that they are linear combination of the original variables which have maximal variation under the condition of being orthogonal with the previous ones. The transformation is based on the eigenvalue decomposition of the regular covariance COV(x) = ODO T. The principal components are with COV(y) = D. Variations of PCA: y = O T x, centered PCA. The variables are first centered, i.e. x x E(x). scaled PCA. Each variable is scaled to have unit variance, i.e. x i x i /σ i. Where σ 2 i is the variance of x i.

18 Properties and Purpose of PCA PCA has the properties: not affine invariant. not model based, only assumes existence of the first two moments. lot of nice geometrical properties, especially in A1. Main purposes: dimension reduction. in connection with other multivariate techniques. For example in regression to avoid multicollinearity problems or in clustering. outlier detection.

19 Comparison Whitening vs. PCA In the following example 70 observations are sampled from a bivariate normal distribution. Original data Whitenend data Principal components

20 Robust Whitening and Robust PCA The mean vector and the covariance matrix are not very robust and therefore the transformations based on them can be dominated by only a few influential observations. A common method to robustify classical methods is to replace the mean vector and the covariance matrix by more robust robust location and scatter functionals. This of course leads only to the same method if the functionals estimate the same population quantity. Which basically means this is possible only in models A1-A3.

21 Independent Component Analysis Independent component analysis (ICA) can be seen as a model based transformation. Let assume for simplicity that µ = 0. Then the IC model B3 reduces to x = Ωɛ, where ɛ has independent components. The model is however ill defined since also x = (ΩPDJ)(JD 1 P T )ɛ = Ω ɛ where ɛ has independent components. Therefore the components can be rescaled, permuted and their sign changed. To fix the scale a common assumption is E(ɛ) = 0 and COV(ɛ) = I p.

22 Estimation of the Mixing Matrix In ICA the main goal is to estimate the mixing matrix Ω respectively an unmixing matrix Λ. This is possible up to sign and permutation if at most one component are gaussian. In some cases then the up to sign and order recovered underlying independent components z = Λx have physical meaning. Application areas of ICA of are: signal processing. medical image analysis. dimension reduction.

23 Common Structure of Most ICA Algorithms Most ICA algorithms consist of the following two steps. Step 1: Whitening. x is whitened so that E(x) = 0 and COV(x) = I p. Then x = Oz with an orthogonal matrix O and independent components z which have E(z ) = 0 and COV(z ) = I p. Step 2: Rotation. Find an orthogonal matrix O whose columns maximize (minimize) a chosen criterion g(o T x). Common criteria are measures of marginal nongaussianity (negentropy, moments) and likelihood functions. Well known algorithms are fastica, FOBI, JADE, PearsonICA,...

24 Picture example for ICA

25 Multiple functionals for transformation So far the transformations involved mainly one location and / or one scatter functional. Recently is was suggested to use two location functionals and / or two scatter functionals for this purpose. Useful to recall in this context is that skewness can be interpreted as the difference of two different location functionals and kurtosis as the ratio of two different scatter functionals. For example the classical univariate measures: β 1 (x) = E((x E(x))3 ) var(x) 3/2 = E 3(x) E(x) var(x) with E 3 (x) = E (( ) ) x E(x) x var(x) β 2 (x) = E((x E(x))4 ) var(x) 2 = var 4(x) var(x) with var 4 (x) = E (( ) ) x E(x) (x E(x)) 2 var(x)

26 Two Scatter Matrices for ICS Two different scatter matrices S 1 = S 1 (x) and S 2 = S 2 (x) can be used to find an invariant coordinate system (ICS) as follows: Starting with S 1 and S 2, define a p p transformation matrix B = B(x) and a diagonal matrix D = D(x) by S 1 1 S 2B T = B T D that is, B gives the eigenvectors of S 1 1 S 2. The following result can then be shown to hold. The transformation x z = B(x)x yields an invariant coordinate system in the sense that B(Ax)(Ax) = JB(x)x for some p p sign change matrix J.

27 Interpretation of the ICS Transformation This transformation can be seen as a double decorrelation. First x is whitened using S 1 and on the whitened data a PCA is performed with respect to S 2. Basically this can also be seen as a way to compare two scatter matrices. If the PCA step finds still an interesting direction after the whitening step then the two scatter matrices measure different population quantities. The diagonal elements of D can be seen as ratios of two different scatter functionals. Hence the values can be seen as kurtosis measures and therefore the invariant coordinates are according to their kurtosis measures with respect to S 1 and S 2.

28 An extension of ICS The sign of the column vectors of B, respectively of the invariant coordinates z is not fixed. One option is here to use two pairs (T 1, S 1 ) and (T 2, S 2 ) and fix the sign then such that The the a way to construct the invariant coordinate system is to center x using T 1, i.e. x x T 1 (x) estimate B and the invariant coordinate z fix the signs of z so that T 2 (z) 0. In that case T 2 (z) 0 can be seen as the difference of two location measures and therefore as a measure of skewness. If the D has distinct diagonal entries and T 2 (z) > 0 the transformation is well defined. Denote this version of ICS from now on as centered ICS.

29 Applications of ICS ICS is useful in the following context: Descriptive statistics and model selection (centered ICS). Outlier detection and clustering. Dimension reduction. ICA, if S 1 and S 2 have the independence property and the IC model holds then B is an estimate of the unmixing matrix. In the transformation-retransformation framework of multivariate nonparametric methods. Some of the applications will be now discussed in more detail.

30 On the choice of S 1 and S 2 As shown previously there are a lot of possibilities for S 1 and S 2 to choose from. Since all the scatter matrices have different properties, these can yield different invariant coordinate systems. Unfortunately, there are so far no theoretic results about the optimal choice. This is still an open research question. Some comments are however already possible: For two given scatter matrices, the order has no effect Depending on the application in mind the scatter matrices should fulfill some further conditions. Practise showed so far that for most data sets different choices yield only minor differences.

31 Descriptive Statistics based on Centered ICS Using the two pairs (T 1, S 1 ) and (T 2, S 2 ) then a data set can be described via The location: T 1 (x) The scatter: S 1 (x) Measure of skewness: T 2 (z) Measure of Kurtosis: S 2 (z) The last two measures can then be used to construct tests for multinormality or ellipticity (see for example Kankainen et al. 2007).

32 Moment Based Descriptive Statistics using ICS An interesting choice for model selection is T 1 = E(x), S 1 = COV(x), T 2 = 1 p E( x E(x) 2 COV x), the vector of third moments, S 2 the matrix of fourth moments. Denote D(k) as the set of all positive definite diagonal matrices with at most k distinct diagonal entries, then the values of s and D in the eight models are: Model Skewness s Kurtosis D A1 0 I p A2 0 D(1) A3 0 D(1) A4 0 D(p) A5 0 D(p) B1 0 D(k) B2 e p or e 1 D(2) B3 0 D(p)

33 Finding outliers, structure and clusters In general, that the coordinates are ordered according to their kurtosis is a useful feature. One can assume that interesting directions are ones with extreme measures of kurtosis. Outliers for example are usually shown in the first coordinate. If the original data is coming from an elliptical distribution (A2), S 1 and S 2 measure the same population quantity and therefore the values of D in that case should be all the same. For non-elliptical distributions however they measure different quantities and therefore the coordinates can reveal "hidden" structures. In the case of x coming from an mixture of elliptical distributions (B1), ICS estimates heuristically spoken to Fisher s linear discriminant space. (Without knowing the class labels!)

34 Finding the outliers The modified wood gravity data is a classical data set for outlier detection methods. It has 6 measurements on 20 observations containing four outliers. Here S 1 is a M-estimator based on t 1 and S 2 based on t 2. x x2 x x4 x y Original data IC IC.2 IC IC.4 IC IC.6 Invariant coordinates

35 Finding the structure I The following data set is simulated and has 400 observations for 6 variables. It looks like an elliptical data set. V V2 V V4 V V6 Original data

36 Finding the structure II Comp Comp.2 Comp Comp.4 Comp Comp.6 PCA

37 Finding the structure III IC IC.2 IC IC.4 IC IC.6 ICS

38 Clustering and dimension reduction To illustrate clustering via an ICS we use the Iris data set. The different species are colored differently and we can see, that the last component is enough for doing the clustering. IC IC.2 IC IC.4

39 ICS and ICA When the IC model holds then the ICS transformation recovers the independent components upto scale, sign and permutation if S 1 and S 2 have the independence property. the diagonal elements of D are all distinct. In the case of S 1 = COV and S 2 = COV 4 this corresponds to the FOBI algorithm and the condition of D having distinct elements means the independent components must have all different kurtosis values.

40 Implementation in R The two scatter transformation for ICS / ICA is implemented in R in the package ICS. The main function ics can take the name of two functions for S 1 and S 2 or two in advance computed scatter matrices. The R package offers then several functions to work with an ics object and offers also several scatter matrices and two tests of multinormality. Basically everything shown in this talk can be easily done in R using the methods of base R and the package ICS.

41 Australian Athletes Data, a More Detailed Example Using R We have now a closer look at the Australian Athletes data set using PCA and ICS on the four variables: Body Mass Index Body Fat Sum of Skin Folds Lean Body Mass > library(daag) > library(ics) > data(ais) > data.ana <- data.frame(bmi=ais$bmi, Bfat=ais$pcBfat, + ssf=ais$ssf, LBM=ais$lbm) > pairs(data.ana)

42 Australian Athletes Data, a More Detailed Example Using R > pairs(data.ana) BMI Bfat ssf LBM

43 Australian Athletes Data II Performing the PCA > pca.as<-princomp(data.ana,score=true) > pairs(pca.as$scores) Comp Comp.2 Comp Comp.4

44 Australian Athletes Data III > pairs(pca.as$scores,col=ais$sex) Comp Comp.2 Comp Comp.4

45 Australian Athletes Data IV Descriptive statistics using ICS > LOCATION <- colmeans(data.ana) > LOCATION BMI Bfat ssf LBM > > SCATTER <- cov(data.ana) > SCATTER BMI Bfat ssf LBM BMI Bfat ssf LBM

46 Australian Athletes Data V > data.ana.centered <- sweep(data.ana, 2, LOCATION, "-") > ics.as <- ics(data.ana.centered, stdkurt = FALSE, + stdb="z") > summary(ics.as) ICS based on two scatter matrices S1: cov S2: cov4 The generalized kurtosis measures of the components are: [1] The Unmixing matrix is: [,1] [,2] [,3] [,4] [1,] [2,] [3,] [4,]

47 Australian Athletes Data VI > Z <- sweep(ics.components(ics.as),2, + sign(mean3(ics.components(ics.as))),"*") > > SKEWNESS <- mean3(z) > SKEWNESS IC.1 IC.2 IC.3 IC > KURTOSIS <- ics.as@gkurt > KURTOSIS [1]

48 Australian Athletes Data VII > pairs(z) IC IC.2 IC IC.4

49 Australian Athletes Data VIII > pairs(z, col=ais$sex) IC IC.2 IC IC.4

50 Australian Athletes Data IX Analyzing only the females > data.ana.females<-data.ana[ais$sex=="f",] > pca.as.f<-princomp(data.ana.females,score=true) > pairs(pca.as.f$scores) Comp Comp.2 Comp Comp.4

51 Australian Athletes Data X Analyzing only the females > LOCATION <- colmeans(data.ana.females) > LOCATION BMI Bfat ssf LBM > > SCATTER <- cov(data.ana.females) > SCATTER BMI Bfat ssf LBM BMI Bfat ssf LBM

52 Australian Athletes Data XI > data.ana.f.centered <- sweep(data.ana.females, 2, + LOCATION, "-") > ics.as.f <- ics(data.ana.f.centered, stdkurt = FALSE, + stdb="z") > summary(ics.as.f) ICS based on two scatter matrices S1: cov S2: cov4 The generalized kurtosis measures of the components are: [1] The Unmixing matrix is: [,1] [,2] [,3] [,4] [1,] [2,] [3,] [4,]

53 Australian Athletes Data XII > Z.f <- sweep(ics.components(ics.as.f),2, + sign(mean3(ics.components(ics.as.f))),"*") > > SKEWNESS <- mean3(z.f) > SKEWNESS IC.1 IC.2 IC.3 IC > KURTOSIS <- ics.as.f@gkurt > KURTOSIS [1]

54 Australian Athletes Data XIII > pairs(z.f) IC IC.2 IC IC.4

55 Key references I Tyler, D. E., Critchley, F., Dümbgen, L. and Oja, H. (2008). Exploring multivariate data via multiple scatter matrices. Journal of the Royal Statistical Society, Series B, accepted. Oja, H., Sirkiä, S. and Eriksson, J. (2006). Scatter matrices and independent component analysis. Austrian Journal of Statistics, 19, Nordhausen, K., Oja, H. and Tyler, D. E. (2008). Two scatter matrices for multivariate data analysis: The package ICS. Journal of Statistical Software, accepted. Nordhausen, K., Oja, H. and Ollila, E. (2008). Multivariate models and their first four moments. Submitted.

56 Key references II Nordhausen, K., Oja, H. and Ollila, E. (2008): Robust independent component analysis based on two scatter matrices. Austrian Journal of Statistics, 37, Nordhausen, K., Oja, H. and Tyler, D. E. (2006): On the efficiency of invariant multivariate sign and rank tests. In Liski, E.P., Isotalo, J., Niemelä, J., Puntanen, S., and Styan, G.P.H. (editors) Festschrift for Tarmo Pukkila on his 60th birthday, , University of Tampere, Tampere. Kankainen, A., Taskinen, S. and Oja, H. (2007). Tests of multinormality based on location vectors and scatter matrices. Statistical Methods & Applications, 16,