### Nested, Subsampling Design, take 2 (aka follow-up)

In many cases we have a model with subsamples; as we have seen earlier. This arises when the experimental unit and the sampling unit are not the same. For example, imagine that we have a group of 54 pigs and we divide them up into groups of 6 pigs and put them in 9 pens. We have 3 diets and we randomly assign the 3 diets to the pens, so that we have 3 pens on each diet. In each pen there is a trough into which we put the feed, so that all pigs have free and equal access. We weigh each pig at the start of the experiment and at the end and then calculate te difference; it is that difference that we shall be looking at as our dependent variable (outcome). We have 54 pigs and hence 54 weight-gains. BUT, the experimental unit, to which the treatment was applied, was the pen and NOT the pig; pigs are the subsampling unit. It does not matter that it was the pigs that ate the feed and not the pen; the pigs were still a group. If one ignored this elementary fact and analysed the data one would in all likelihood come up with overly optimistic results; i.e. rubbish. Additionally, we may well have a mixture of male and female pigs. It does not matter that it is the pigs that ate the feed and not the 'pen' eating the feed; the animals were a 'group' and this MUST be accounted for.

The statistical model will be:
Yijk = mu + dieti + penij + ...

The data, in the form of the SAS code and data step to input said data are given below.

### Here are the input data and SAS statements

```
data subsamp2;
input diet pen pig gain sex;
cards;
1  1  1   241.47   2
1  1  2   266.13   1
1  1  3   201.13   2
1  1  4   314.11   1
1  1  5   234.54   1
1  1  6   256.76   2
1  2  1   343.85   1
1  2  2   278.91   1
1  2  3   263.21   2
1  2  4   299.79   2
1  2  5   329.00   2
1  2  6   343.61   1
1  3  1   257.04   2
1  3  2   340.12   1
1  3  3   276.31   2
1  3  4   293.03   1
1  3  5   322.97   1
1  3  6   313.31   1
2  1  1   268.52   1
2  1  2   271.46   2
2  1  3   269.41   2
2  1  4   242.54   2
2  1  5   265.53   2
2  1  6   281.51   1
2  2  1   355.27   1
2  2  2   291.54   1
2  2  3   308.28   1
2  2  4   275.09   1
2  2  5   301.42   2
2  2  6   312.66   2
2  3  1   287.78   1
2  3  2   356.63   1
2  3  3   347.70   1
2  3  4   339.20   1
2  3  5   334.41   2
2  3  6   259.72   2
3  1  1   316.33   2
3  1  2   354.47   1
3  1  3   340.59   2
3  1  4   395.03   2
3  1  5   372.19   2
3  1  6   375.58   1
3  2  1   317.48   1
3  2  2   287.57   2
3  2  3   348.24   2
3  2  4   336.53   2
3  2  5   283.04   2
3  2  6   310.32   1
3  3  1   341.39   2
3  3  2   350.02   1
3  3  3   380.93   1
3  3  4   395.15   1
3  3  5   365.06   1
3  3  6   364.22   1
;

```

Anything else to our model?

Well. IFF all our pigs were the same sex then no, our model would be complete (apart from the residual error [the variation amongst pigs within pen].

As described, there IS more; we have piglets of different sexes, males and females. We cannot and should not ignore this. How do we consider the effect of sex?

WELL. We should recognise that this is an example of a 'split-plot'. We have sub-divided our group of piglets (pen, aka 'plot') into 2 sub-groups (males and females). So, we shall add to our model an effect of sex and a diet-by-sex interaction.

Our model is now:
Yijk = mu + dieti + penij + sexk + diet*sexik + eijk

and our SAS code is:

```
proc mixed data=subsamp2 lognote;
class diet pen sex;
model gain = diet sex diet*sex/ddfm=kr;
random pen(diet);
run;

```

### SAS output

The Mixed Procedure

Model Information
Data Set WORK.SUBSAMP2
Dependent Variable gain
Covariance Structure Variance Components
Estimation Method REML
Residual Variance Method Profile
Fixed Effects SE Method Kenward-Roger
Degrees of Freedom Method Kenward-Roger

Class Level Information
Class Levels Values
diet 3 1 2 3
pen 3 1 2 3
pig 6 1 2 3 4 5 6
sex 2 1 2

Dimensions
Covariance Parameters 2
Columns in X 12
Columns in Z 9
Subjects 1
Max Obs per Subject 54

Number of Observations
Number of Observations Used 54
Number of Observations Not Used 0

Iteration History
Iteration Evaluations -2 Res Log Like Criterion
0 1 488.99068327
1 2 477.15648448 0.00000003
2 1 477.15647939 0.00000000

 Convergence criteria met.

Covariance Parameter Estimates
Cov Parm Estimate
pen(diet) 614.06
Residual 744.20

Fit Statistics
-2 Res Log Likelihood 477.2
AIC (Smaller is Better) 481.2
AICC (Smaller is Better) 481.4
BIC (Smaller is Better) 481.6

Type 3 Tests of Fixed Effects
Effect Num DF Den DF F Value Pr > F
diet 2 5.91 4.24 0.0721
sex 1 44 8.47 0.0057
diet*sex 2 43.9 0.94 0.4000

Least Squares Means
Effect diet sex Estimate Standard Error DF t Value Pr > |t|
diet 1   285.41 15.7023 5.92 18.18 <.0001
diet 2   297.22 15.7036 5.92 18.93 <.0001
diet 3   346.34 15.6854 5.89 22.08 <.0001
sex   1 321.20 9.7810 7.88 32.84 <.0001
sex   2 298.12 10.0036 8.59 29.80 <.0001
diet*sex 1 1 304.33 16.7309 7.6 18.19 <.0001
diet*sex 1 2 266.50 17.2912 8.63 15.41 <.0001
diet*sex 2 1 306.56 16.8111 7.69 18.24 <.0001
diet*sex 2 2 287.89 17.4123 8.77 16.53 <.0001
diet*sex 3 1 352.72 17.2763 8.37 20.42 <.0001
diet*sex 3 2 339.97 17.2763 8.37 19.68 <.0001

This provides us with our basic analysis, and provides the appropriate tests of the various fixed effects. However, it does not tell us if the effect of pen can be considered statistically significant or not. for that we need to re-fit the model, BUT without the random effect of pen, so that we can compare the Fit Statistics (we shall use the BIC values to compare with and without pen).

The Mixed Procedure, model without random pen effect

Model Information
Data Set WORK.SUBSAMP2
Dependent Variable gain
Covariance Structure Diagonal
Estimation Method REML
Residual Variance Method Profile
Fixed Effects SE Method Model-Based
Degrees of Freedom Method Residual

Class Level Information
Class Levels Values
diet 3 1 2 3
pen 3 1 2 3
pig 6 1 2 3 4 5 6
sex 2 1 2

Dimensions
Covariance Parameters 1
Columns in X 12
Columns in Z 0
Subjects 1
Max Obs per Subject 54

Number of Observations
Number of Observations Used 54
Number of Observations Not Used 0

Covariance Parameter Estimates
Cov Parm Estimate
Residual 1182.39

Fit Statistics
-2 Res Log Likelihood 489.0
AIC (Smaller is Better) 491.0
AICC (Smaller is Better) 491.1
BIC (Smaller is Better) 492.9

Type 3 Tests of Fixed Effects
Effect Num DF Den DF F Value Pr > F
diet 2 48 15.97 <.0001
sex 1 48 10.16 0.0025
diet*sex 2 48 0.31 0.7334

#### Deciding whether the effect of pen, nested within diet

The classical approach to comparing (RE)Maximum Likelihood models has been to compare the log likelihood values; Fisher showed that -twice the difference in the Log Likelihood (LnL) has a chi-squared distribution. The BIC statistics can also be used to compare the models. We shall present both comparisons, but will use the BIC approach for making our decision.

Fit Statistics to compare models
LnL   BIC
With   477.2   481.6   <- smaller
Without   489.0   492.9
Difference   11.8   11.3

Using a Chi-squared, the difference in the -2LnL is 11.8. Comparing this to the tabulated Chi-squared value (for 1 d.f.) we can consider that this is statistically significant. Likewise, using the BIC values the model with the random effect of pen has a smaller value than the model without the effect of pen: the difference is 11.3. Using the BIC statistics, a model with a smaller BIC value is preferred, thus we should consider that the model with the random effect of pen is a better fitting model than the one without pen. The difference should be greater than 5 for us to have a reasonable measure of confidence that the model IS better. Therefore we conclude that there is an effect of pen.

## GLM analysis, WRONG

The GLM Procedure, Wrong analysis, using GLM, ignoring pen

Class Level Information
Class Levels Values
diet 3 1 2 3
pen 3 1 2 3
pig 6 1 2 3 4 5 6
sex 2 1 2

 Number of Observations Read 54 54

The GLM Procedure

Dependent Variable: gain

Source DF Sum of Squares Mean Square F Value Pr > F
Model 5 48044.2681 9608.8536 8.13 <.0001
Error 48 56754.7356 1182.3903
Corrected Total 53 104799.0037

R-Square Coeff Var Root MSE gain Mean
0.458442 11.06704 34.38590 310.7056

Source DF Type I SS Mean Square F Value Pr > F
diet 2 35325.87903 17662.93952 14.94 <.0001
sex 1 11980.36755 11980.36755 10.13 0.0026
diet*sex 2 738.02153 369.01076 0.31 0.7334

Source DF Type III SS Mean Square F Value Pr > F
diet 2 37758.13169 18879.06584 15.97 <.0001
sex 1 12008.48965 12008.48965 10.16 0.0025
diet*sex 2 738.02153 369.01076 0.31 0.7334

The GLM Procedure
Least Squares Means

diet gain LSMEAN Standard Error Pr > |t|
1 285.323375 8.155333 <.0001
2 296.647875 8.155333 <.0001
3 346.341111 8.104835 <.0001

### Conclusions?

We have conclded that pen is significant. The analysis shown in the first model is the most appropriate one. We can see that there is no significant interaction (between Diet and Sex) and that there is no real effect of Diet (Prob = 0.0721.

However, if we had erroneously ignored the pen effect and had thought that the pig was the experimental unit and looked at the GLM analysis we would think that there was a highly significant effect of Diet, i.e. that there were real differences between the diets, whereas in fact there are no differences. So, if we got our model wrong then we would come to completely the wrong conclusions, possibly to our detriment if our research results and manuscript are rejected!

In general, if we have a model with a nested (aka subsampling) structure and we ignore the structure and test the varoous effects against the residual then we shall overestimate the significance of the effects.