How to compute correlations in R? There are many web sites showing how to compute correlations; we shall use the cor() function (base R, stats package). We have an example data set (coleman1.dat), with measurements of several variables (x1 -- x6) on 20 subjects. The variables X6, X1 and X2 were the dependent (outcome) variables, whose variability depended upon the effects of 2 regression variables, X4 and X5. We shall assume that we want the correlations amongst the continuous, Normally distributed variables (X6, X1 and X2). Note: Pearson correlations assume that the variables are continuous and Normally distributed, and that the observations are a random, representative sample, completely independent of each other.
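As a minimal sketch of the cor() function itself (the tiny data frame here is made up, standing in for coleman1.dat, which is not reproduced):

```r
# cor() lives in base R's stats package; no extra package is needed
d <- data.frame(a = c(1, 2, 3, 4),
                b = c(2, 4, 5, 9))
cor(d)                       # Pearson correlations (the default)
cor(d, method = "spearman")  # rank-based alternative, no Normality assumption
```

Called on a data frame, cor() returns the full matrix of pairwise correlations amongst the columns.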
Data file with recorded observations
We have noted that the variables X6, X1 and X2 are those we want the correlations amongst. So, we shall create a new dataset with just those variables:
# R code
y612 <- ds[, c(7, 2, 3)]
Note that our dataframe (ds) contains the variables PID, X1, X2, X3, X4, X5 and X6; thus we want to extract columns 7, 2 and 3 (corresponding to variables X6, X1 and X2). Then:
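Selecting the columns by name (rather than by position) achieves the same thing and is less error-prone if the column order ever changes; a sketch using a made-up data frame with the same column layout as ds:

```r
# made-up data frame with the same column layout as ds (PID, x1 .. x6)
ds <- data.frame(PID = 1:5, x1 = rnorm(5), x2 = rnorm(5), x3 = rnorm(5),
                 x4 = rnorm(5), x5 = rnorm(5), x6 = rnorm(5))
y612 <- ds[, c("x6", "x1", "x2")]   # same columns as ds[, c(7, 2, 3)]
```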
r1 <- cor(y612)
will provide us with correlations amongst these 3 variables. Here are the estimates of the correlations:
          x6        x1        x2
x6 1.0000000 0.1922916 0.7534008
x1 0.1922916 1.0000000 0.1811398
x2 0.7534008 0.1811398 1.0000000
There is just one small problem, which many students overlook, ignore or do not understand: the effects of X4 and X5 completely mess things up. It is not appropriate to compute the correlations as we have just done; we need to account for the effects of our independent (regression) variables X4 and X5. Some students who realise that they need to 'account' for X4 and X5 think (mistakenly) that they can just include X4 and X5 in their correlation calculations and then calculate the partial correlations of X6, X1 and X2 accounting for X4 and X5; this is erroneous. Such calculations would assume that X4 and X5 were random and Normally distributed, which may not be the case (we might well have chosen the amounts of X4 and X5 [they might be amounts of Energy and Protein fed to subjects in a controlled trial]).
What we DO need to do is to 'adjust', or account for, the [fixed] effects of X4 and X5 on each of our dependent variables (X6, X1 and X2) and then compute the correlations; effectively we are computing the correlations amongst the residuals. Here is our R code:
# multi-dependent variables, X6, X1 and X2, ie MANOVA
lm612 <- lm(cbind(x6, x1, x2) ~ x4 + x5, data = ds)
r <- cor(lm612$residuals)   # same result as SAS proc glm manova
r
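To see that this multivariate fit is simply correlating the residuals of separate adjustments, here is a self-contained check on simulated data (the sample size and regression coefficients are made up; coleman1.dat itself is not reproduced here):

```r
set.seed(42)
n <- 20
sim <- data.frame(x4 = rnorm(n), x5 = rnorm(n))
sim$x6 <- 1.0 * sim$x4 + 0.5 * sim$x5 + rnorm(n)
sim$x1 <- 0.5 * sim$x4 + rnorm(n)
sim$x2 <- 0.3 * sim$x5 + rnorm(n)

# one multivariate fit, as in the text
fit <- lm(cbind(x6, x1, x2) ~ x4 + x5, data = sim)
r_multi <- cor(residuals(fit))

# ... equals correlating residuals from separate univariate fits
res6 <- residuals(lm(x6 ~ x4 + x5, data = sim))
res1 <- residuals(lm(x1 ~ x4 + x5, data = sim))
cor(res6, res1)   # matches r_multi["x6", "x1"]
```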
Here are our estimates of the correlations amongst X6, X1 and X2, correctly accounting for the fixed effects of X4 and X5:
           x6          x1         x2
x6  1.0000000 -0.12663048 0.38438405
x1 -0.1266305  1.00000000 0.09787269
x2  0.3843840  0.09787269 1.00000000
Note the difference: our raw correlation between X6 and X1 gave a value of 0.192, whereas the correlation after accounting for X4 and X5 shows a value of -0.1266 (not just a small change, but a complete sign change, from +ve to -ve!). The other correlation estimates are also different. This example illustrates the danger of drawing quite the wrong conclusions if we do not properly account for the effects of the various independent variables. It should be noted that the above methods are appropriate when we are looking at fixed effects (X4 and X5); such fixed effects may be regressions (as in this example) or classification effects. If we have other random effects in our model then things become a lot more complicated, not least because we shall have 2 or more sets of correlations (amongst our residuals and also amongst the random effects). In such a case I recommend using either the ASREML software (commercial), or WOMBAT (free for research, produced by Dr Karin Meyer, Australia). Both packages are orientated towards estimation of variance and covariance components in quantitative genetics and animal breeding, but are applicable elsewhere.