We will illustrate the problems of using a simplistic approach to testing for normality and homogeneity of variance. Consider the following data on the weight of male and female piglets reared to 100 days of age.
sex | Wt. (kg) | sex | Wt. (kg) | sex | Wt. (kg) | sex | Wt. (kg) | |||
---|---|---|---|---|---|---|---|---|---|---|
m | 96.3 | m | 85.3 | m | 86.0 | f | 76.5 | |||
f | 79.7 | f | 80.9 | f | 85.0 | f | 76.4 | |||
m | 86.5 | f | 78.6 | f | 79.1 | f | 81.2 | |||
m | 84.9 | m | 90.4 | m | 88.7 | m | 87.2 | |||
m | 90.0 | m | 91.8 | m | 86.4 | f | 74.0 | |||
m | 90.2 | f | 79.4 | f | 78.2 | f | 81.1 | |||
m | 92.0 | f | 73.1 | f | 77.8 | m | 86.4 | |||
m | 91.1 | m | 92.7 | f | 78.7 | m | 81.0 | |||
f | 79.8 | m | 89.9 | m | 80.0 | f | 77.0 | |||
f | 79.0 | m | 87.0 | f | 79.7 | m | 86.3 | |||
m | 76.7 | f | 75.7 | m | 90.2 | f | 87.1 | |||
m | 92.7 | m | 82.9 | m | 89.7 | f | 76.7 | |||
f | 85.4 | f | 75.5 | m | 81.0 | f | 78.2 | |||
m | 88.5 | f | 72.7 | m | 87.4 | m | 84.8 | |||
m | 95.6 | f | 79.9 | m | 92.1 | f | 82.0 | |||
f | 77.0 | f | 84.4 | m | 93.5 | m | 86.0 | |||
f | 83.5 | f | 78.9 | m | 93.0 | f | 84.0 | |||
f | 82.6 | f | 80.4 | m | 84.9 | f | 82.4 | |||
f | 78.4 | m | 84.0 | m | 91.0 | f | 81.7 | |||
m | 95.3 | m | 82.4 | m | 82.9 | m | 90.1 | |||
m | 89.8 | f | 80.9 | f | 79.6 | f | 79.3 | |||
m | 82.7 | f | 80.8 | f | 81.8 | f | 76.8 | |||
m | 88.9 | f | 81.0 | f | 84.0 | f | 83.7 | |||
m | 90.7 | f | 77.1 | m | 87.6 | m | 90.0 | |||
f | 74.5 | f | 83.4 | f | 84.2 | f | 80.9 | |||
f | 82.2 | f | 79.4 | m | 91.2 | m | 84.1 | |||
m | 88.9 | m | 88.2 | f | 77.8 | f | 80.8 | |||
f | 75.9 | f | 87.8 | m | 93.1 | f | 80.4 | |||
m | 84.9 | m | 96.0/td> | f | 79.9 | f | 83.8 | |||
f | 80.8 | f | 81.1 | m | 81.0 | f | 81.4 | |||
m | 94.1 | m | 93.6 | m | 86.8 | f | 81.3 | |||
f | 81.2 | f | 76.7 | m | 88.4 | f | 81.7 | |||
f | 78.2 | m | 87.2 | m | 90.6 | f | 85.8 | |||
m | 88.2 | m | 88.1 | f | 77.5 | m | 95.1 | |||
f | 76.5 | m | 85.9 | f | 79.4 | m | 91.1 | |||
f | 85.3 | m | 83.5 | f | 78.3 | f | 81.4 | |||
f | 81.2 | m | 86.3 | f | 85.2 | m | 85.2 | |||
m | 91.1 | f | 79.0 | m | 89.9 | f | 74.0 | |||
f | 88.6 | f | 75.0 | f | 82.2 | m | 87.6 | |||
m | 85.7 | f | 82.0 | m | 91.9 | f | 80.5 | |||
m | 85.4 | f | 86.6 | m | 94.6 | f | 80.1 | |||
m | 95.9 | f | 84.3 | m | 93.2 | f | 78.5 | |||
f | 88.7 | m | 92.1 | m | 89.2 | f | 79.9 | |||
f | 87.1 | m | 86.8 | f | 86.3 | m | 95.1 | |||
m | 91.0 | f | 82.1 | m | 90.6 | m | 87.9 | |||
f | 73.0 | f | 82.0 | m | 90.3 | f | 83.9 | |||
f | 73.2 | m | 84.9 | f | 84.4 | m | 88.8 | |||
f | 77.2 | f | 83.4 | f | 78.2 | m | 85.2 | |||
m | 86.8 | f | 78.7 | f | 78.9 | m | 85.9 | |||
f | 82.4 | f | 77.3 | m | 85.7 | m | 89.0 |
Suppose we take the weights for all the male pigs and test for Normality. The SAS procedure PROC UNIVARIATE provides a test for Normality, the W statistic (or if the data/sample size is large enough the K-S test). We find that the probability associated with the W statistic is of the order of 50% (Pr<W 0.52). This is the probability of getting such a sample distribution when the true distribution is Normal. So we will accept that the weights of male pigs at 100 days follow a Normal distribution. Likewise, if we take the weights of the female pigs and use the same approach of testing for Normality we arrive at the same conclusion; the data are Normally distributed (Pr<W 0.68). Thus we accept the Null Hypothesis that the weights of male pigs and of female pigs are Normally distributed. The data given in the table above and the SAS code to read the data, to create seperate male and female data sets and to run PROC UNIVARIATE on each data set are given: data, SAS statements for Normality.
The output from these SAS statements are given: SAS output testing for Normality.
When we test (simplisticly) the combined data (Males+Females) for Normality we find that the W statistic indicates that there is only about a 2 percent chance (Pr<W 0.0205) that such a sample of data comes from a Normal distribution, thus if we are continueing to use our 5% probability level as our cut-off to accept or reject the Null Hypothesis we will reject the Null Hypothesis and instead conclude that the data are not Normally distributed! This is at variance with the conclusion that we have reached for Males and for Females seperately. This illustrates the danger of not computing the residuals and testing them for Normality. This point IS important because if we refer to the basic statistical assumptions of our ANOVA we know that the requirements for the ANOVA to be valid are that the ERRORS be normally distributed, not the various fixed effects of the model!, nor the actual data (the Y's).
If we compute the mean for each sex and then compute the deviation from the mean (specific to each sex), the "residuals", we CAN test these for Normality. This is what is shown in the subsequent sections of the SAS output where we have the Weight and the Residual (Wt - Mean) tested for Normality and they are both EXACTLY the same as the above test in each sex. Then combining the residuals (Combined Male and Female Deviations: Normality) and testing for Normality IS a legitimate and sensible thing to do. When we do this we find that the Normality test gives us Pr<W 0.53; indicating that these data ARE Normally distributed. This is in agreement with our original tests within each sex. Note that if we had not computed the Deviations (Residuals) and tested these we would have come to exactly the wrong conclusion. This is why it is absolutely necessary to develop a model that is appropriate so that one can then sensibly compute the residuals and test these for Normality, and then go back and test if the model is appropriate; in an iterative fashion.
The usual Analysis of Variance (using PROC GLM) requires certain assumptions be met if the statistical tests are to be valid. One of the assumptions we make is that the errors (residuals) all come from the same (Normal) distribution. Thus we have to test not only for Normality, but we must also ensure (i.e. test) that the variances are homogeneous. We can do this by subdividing the data into appropriate groups, computing the variances in each of the groups and testing that they are consistent with being sampled from a single, Normal distribution. The statistical test for homogeneity of variance is due to Bartlett and is a modification of the Neyman-Pearson likelihood ratio test.
What does the above formula mean and what do we require to compute it? We need to compute the variance in each group as well as the pooled, overall variance.
Consider the data from the Completely Randomised Design where we had 6 treatments applied to 28 experimental units. We want to verify that the variances in the 6 treatment groups are homogeneous. If we fit our CRD model and calculate the fitted value for each observation then the residuals e_{i} can be computed for each observation as the actual observation - the fitted (predicted) value. Then the Corrected Sums of Squares amongst the residuals is computed for each group, CSS_{i}, and the variance amongst the residuals in each group is computed (as CSS_{i}/d.f._{i}), as well as the pooled overall variance across groups (calculated from the sum of the Corrected Sums of Squares, CSS_{i}/sum of the d.f._{i}) and the raw c^{2} can be calculated using the above formula.
Much of these calculations can be done using SAS. We can output the residuals for each observation, from PROC GLM, to a new SAS dataset, sort the resulting data set by the group variable, use PROC MEANS to compute the variance and Corrected Sums of Squares in each group and then use the above formula in a further SAS data step to compute the raw c^{2}.
proc glm data=oneway; classes trt; model y = trt; output out=onewaye r=ehat; run; proc sort data=onewaye; by trt; run; proc means data=onewaye; by trt; var ehat; output out=vars mean=ebars n=ni var=s css=csse; run; proc print; run;
This approach can be extended to other designs and models, however, now with the use of mixed models procedures which can accommodate heterogeneous variances it is probably easier to use such mixed models, see below.
The SAS procedure PROC MIXED actually provides us with a quite convenient means of testing whether a common homogeneous variance is sufficient, or whether a seperate variance is necessary and desireable for each group, i.e. whether the variances are heterogeneous. Not only can we directly test for homogeneity of variance, but we also have the variances estimated.
Consider the same One-way Completely Randomised Design. We can analyse this data using the SAS procedure PROC MIXED:
proc mixed data=oneway; classes trt; model y = trt; repeated /group=trt; run;
We fit essentially the same model (y = trt) that we would usually do with a conventional fixed effects model using PROC GLM. Then we specify that each of the experimental units for each treatment is a repeated measurement (of the treatment effect) and that we are to estimate a seperate residual variance within each group (the /group=trt), that is to say that the residual variances are heterogeneous. PROC MIXED is also kind enough to provide us with a chi-squared test of the statistical significance. We could also re-run PROC MIXED without the repeated statement and use either the BIC (Schwarz's) Bayesian Information Criterion, or the AIC criteria to compare the 2 models (one with a homogeneous residual variance and the other, with the repeated statement for heterogeneous variances) to decide whether heterogeneous variances are necessary.
Using the above data and using the residuals output from PROC GLM together with Bartlett's test we obtain a raw c^{2} of 16.2858 and a correction factor of 1.1081, giving a corrected c^{2} of 14.697, statistically significant on 5 degrees of freedom. Using PROC MIXED directly, as outlined above, we find the the Null Model LRT c^{2} is 16.2859, with 5 d.f., and a Null Model LRT P-Value of 0.0061. This c^{2} is just the same as that which we computed using Bartlett's test with the residuals. Using a method, such as PROC MIXED, which can accommodate heterogeneous variances has several advantages. If we had used PROC GLM and Bartlett's test and arrived at the conclusion that the variances are heterogeneous we would be faced with the problem of how to best analyse the data and how to best estimate effects (treatment means and/or treatment differences) correctly accounting for the heterogeneous variances. PROC MIXED obviates these problems, the Mixed Model equations are defined quite generally with an Error (Residual) Variance-Covariance matrix R.
Consider another example, the same data on pig weights that we used for testing for Normality; are the variances for male pigs and female pigs homogeneous or heterogeneous? We can fit the following SAS code:
proc mixed data=pigs; classes sex; model wt = sex; run;
This model fits one common residual variance for both sexes. The statistical model is:
Wt_{ij} = µ + sex_{i} + e_{ij}
with the same variance common to both sexes.
We obtain the following results (Note obtained from SAS v8.1):
Homogeneous Variance |
The Mixed Procedure |
Model Information | |
Data Set | WORK.PIGS |
Dependent Variable | wt |
Covariance Structure | Diagonal |
Estimation Method | REML |
Residual Variance Method | Profile |
Fixed Effects SE Method | Model-Based |
Degrees of Freedom Method | Residual |
Class Level Information | ||
Class | Levels | Values |
sex | 2 | f m |
Dimensions | |
Covariance Parameters | 1 |
Columns in X | 3 |
Columns in Z | 0 |
Subjects | 1 |
Max Obs Per Subject | 200 |
Observations Used | 200 |
Observations Not Used | 0 |
Total Observations | 200 |
Covariance Parameter Estimates | |
Cov Parm | Estimate |
Residual | 14.2629 |
Fit Statistics | |
-2 Res Log Likelihood | 1097.3 |
AIC (smaller is better) | 1099.3 |
AICC (smaller is better) | 1099.3 |
BIC (smaller is better) | 1102.6 |
Type 3 Tests of Fixed Effects | ||||
Effect | Num DF | Den DF | F Value | Pr > F |
sex | 1 | 198 | 233.00 | <.0001 |
In our model we see that there is one factor (SEX) with 2 levels (f and m). Note that since the levels are letters they are in alphabetic order, so f corresponds to the first of the two columns corresponding to the two levels of sex, and m is the second; this is an important factor to note and be aware of, otherwise you could end up testing (f-m) rather than (m-f) which might not be what you wanted. The common, pooled, estimate of the residual (co)variance is 14.25. The -2 Res(idual) Log Likelihood is 1097.3, and the BIC value is 1102.6. Note, again, these results were obtained using SAS v8.1, the results shown from SAS v6.12 have the sign reversed!
We can also fit a similar model, where the fixed effects ( µ and sex_{i}) are the same, but where we have 2 error (residual) variances, one for male pigs and one for female pigs:
proc mixed data=pigs; classes sex; model wt = sex; repeated /group=sex; /* seperate error variances for each group */ run;
Heterogeneous Variance |
The Mixed Procedure |
Model Information | |
Data Set | WORK.PIGS |
Dependent Variable | wt |
Covariance Structure | Variance Components |
Group Effect | sex |
Estimation Method | REML |
Residual Variance Method | None |
Fixed Effects SE Method | Model-Based |
Degrees of Freedom Method | Between-Within |
Class Level Information | ||
Class | Levels | Values |
sex | 2 | f m |
Dimensions | |
Covariance Parameters | 2 |
Columns in X | 3 |
Columns in Z | 0 |
Subjects | 200 |
Max Obs Per Subject | 1 |
Observations Used | 200 |
Observations Not Used | 0 |
Total Observations | 200 |
Iteration History | |||
Iteration | Evaluations | -2 Res Log Like | Criterion |
0 | 1 | 1097.32394532 | |
1 | 1 | 1095.75636824 | 0.00000000 |
Convergence criteria met. |
Covariance Parameter Estimates | ||
Cov Parm | Group | Estimate |
Residual | sex f | 12.5719 |
Residual | sex m | 16.1721 |
Fit Statistics | |
-2 Res Log Likelihood | 1095.8 |
AIC (smaller is better) | 1099.8 |
AICC (smaller is better) | 1099.8 |
BIC (smaller is better) | 1106.4 |
Null Model Likelihood Ratio Test | ||
DF | Chi-Square | Pr > ChiSq |
1 | 1.57 | 0.2106 |
Type 3 Tests of Fixed Effects | ||||
Effect | Num DF | Den DF | F Value | Pr > F |
sex | 1 | 198 | 229.50 | <.0001 |
Here we see that again we have 2 sexes (we should have, it is the same data), but this time we started with 1 initial evaluation and then had to make another iteration until the model objective function converged, that is to say that it did not change at the next iteration. You might like to think of this as being like Newton-Raphson iteration from an initial estimate to a precise value for a square root of a number that you will have surely seen in high school. To continue; the estimated variance amongst female pigs was 12.57 and amongst male pigs was 16.15. The -2 Res(tricted) Log Likelihood was 1095.8, and the BIC value was 1106.4.
The two models that we have just fitted differ only in the 1 extra variance component in the second model; the fixed effects of µ and sex_{i} are present in both models. Was it worthwhile increasing the complexity of the model by having 2 variance components, one for females and one for males, i.e. are the variance estimates of 12.57 and 16.15 really different from oneanother. An alternative way of stating this question is to ask whether the 'goodness of fit' of the model has improved. This is where the Restricted Log Likelihoods come in to play. Standard Maximum Likelihood theory tells us that -2 the difference in the Log Likelihoods of out models has a chi-squared distribution. Thus since the models differ only in the second model having 1 extra variance parameter there will be 1 degree of freedom difference. Thus:
Using 5% as our accept/reject criterion we find, from statistical tables, that a 1 degree of freedom c^{2} has a critical value of 3.84; that is to say that a c^{2} value of less than 3.84 is not statistically significant. We can also look at the BIC values from the 2 models, the one with the smallest BIC value is the model which is the most parsimonious, but still the best adequately fitting model:
BIC_{common variance} = 1102.6
BIC_{heterogenous variances} = 1106.4
Since the 2 models have the same fixed effects (sex), but differ only in the number of variance component parameters (1 vs. 2) then we can directly compare the BIC values, the model with the lower value is the best; in this case the homogeneous variance model. Therefore we conclude (whether from -2LnL or BIC values)that there was no significant improvement in the model by including the extra variance component and that the original model with 1 common variance component for both sexes was quite suitable. So we have in fact concluded that the data for weights of female and male pigs are Normally distributed and have a common variance (for which our best estimate is 14.25). It is often easier to compare two models via their BIC values rather than have to compute differences in Log Likelihoods and chi-squares. Why use BIC rather than AIC, etc? A good question; there are many opinions; I have reached the conclusion that the BIC value seems to provide a reasonable balance between a complex model with many parameters and a too simple model which does not adequately describe the data.
From our initial analysis, which is the most appropriate since the second was not significantly better, we also have the test of the fixed effects.
Tests of Fixed Effects Source NDF DDF Type III F Pr > F
SEX 1 198 233.00 <0.0001
We can see that there is a highly statistically significant difference between the sexes. There is 1 numerator degree of freedom (NDF) and 198 denominator dgrees of freedom (DDF) and the Type III F-ratio (from the Type III Sums of Squares, the marginal effect of sex) is 233.00. This substantially exceeds the tabulated critical value at even the 1% level. If we want to know what is the difference (estimate) we could use the estimate statement:
proc mixed data=pigs; classes sex; model wt = sex; estimate 'm - f' sex -1 1; /* NOTE the ordering of the coefficients */ run;
Reference: Steel, Torrie and Dickey. Chapter 19.3 Homogeneity of Variance
Reference: SAS Procedures Guide, PROC MEANS.
Reference: SAS System for Mixed Models.