Although (multiple) linear regression models are extremely useful,
they are not the only way to describe a biological relationship between two variables.
A linear regression (linear in the relationship between the variables, not linear in
the parameters) implies that as the value of X (the independent
variable) increases, Y increases by an amount equal to the
regression coefficient (b_{i}). However, many biological
relationships are not linear; often a curvilinear,
quadratic relationship exists, with an intermediate optimum (which
may be a maximum or a minimum, depending upon the relationship).
For example, if we look at corn yield per hectare and its relationship
with the amount of fertiliser used, we will likely find that, initially,
as we use more fertiliser the corn yield increases. However, we
know that this increase in yield with increasing fertiliser cannot
continue *ad infinitum*. The corn yield will probably reach a plateau,
where increasing fertiliser use does not cause any increase in yield, and
may even cause a decline. This type of relationship is a curvilinear
relationship, perhaps adequately described by a quadratic relationship
(perhaps not!). If a quadratic relationship is a reasonable representation
then there will be an intermediate optimum (a maximum).

Another example, this time closer to home (*sic*): if we look at the mortality
rate of newborn babies and its relationship with birthweight, we see a
curvilinear relationship. Babies with a very low birthweight have a high
probability (risk) of death, babies with an intermediate (average)
birthweight have a low probability of death, and babies with a high
birthweight again have a higher risk (probability) of death. Thus
a quadratic relationship between risk of death and birthweight seems
to exist, with an intermediate optimum (minimum-risk) birthweight at which
the risk of death is minimized.

How do we handle this quadratic relationship in our model and analysis?
Well, it's not **too** difficult! We can include a term for the
square (quadratic) of the independent variable as an additional regression
covariate:

Y = b_{0} + b_{1}X + b_{2}X^{2} + e

This will give us linear and quadratic regressions of Y on X.

We could take our data, square each observation of X_{1},
enter the squares as a new column (variable), and
proceed just as for a multiple regression problem. However, we might make
[careless] arithmetic mistakes, and it would take more time; let's let
the computer do the work, that is what it is there for!
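The idea of "squaring the column and proceeding as a multiple regression" can be sketched quickly outside of SAS as well. The following Python/NumPy fragment (not part of the original notes; the small data set is made up purely for illustration) builds the squared covariate by hand and fits Y = b_{0} + b_{1}X + b_{2}X^{2} by ordinary least squares:

```python
# Sketch only: fit a quadratic regression by adding a squared column
# to the design matrix and solving by ordinary least squares.
import numpy as np

# Toy data, invented for illustration: Y rises and then falls with X.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 5.2, 5.1, 4.0])

# Design matrix: intercept column, linear term, and the new squared column.
X = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = beta

# For this concave data, b2 comes out negative and b1 positive,
# so the fitted curve has an intermediate maximum at -b1 / (2 * b2).
print(b0, b1, b2)
```

This is exactly what PROC GLM does internally when we write the squared term into the model: the quadratic is still a *linear* model in the parameters, so ordinary least squares applies unchanged.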

Consider the following experiment: a group of 50 cows was fed diets with various levels of feed intake (50 to 140 lb of haylage) and various energy densities (0.8 to 1.6). The milk yield for the complete lactation was measured (in kg). The data are:

Cow | Feed Intake (lb) | Energy Density | Milk Yield (kg)
---|---|---|---
1 | 50 | 0.8 | 5731.05
2 | 50 | 1.0 | 4607.40
3 | 50 | 1.2 | 5169.25
4 | 50 | 1.4 | 6345.16
5 | 50 | 1.6 | 6477.83
6 | 60 | 0.8 | 4970.22
7 | 60 | 1.0 | 5263.30
8 | 60 | 1.2 | 5414.44
9 | 60 | 1.4 | 7102.82
10 | 60 | 1.6 | 6670.46
11 | 70 | 0.8 | 6371.27
12 | 70 | 1.0 | 5594.80
13 | 70 | 1.2 | 6033.55
14 | 70 | 1.4 | 7248.72
15 | 70 | 1.6 | 7288.52
16 | 80 | 0.8 | 5499.63
17 | 80 | 1.0 | 6644.66
18 | 80 | 1.2 | 6880.00
19 | 80 | 1.4 | 7542.48
20 | 80 | 1.6 | 7916.68
21 | 90 | 0.8 | 6758.12
22 | 90 | 1.0 | 7547.07
23 | 90 | 1.2 | 7855.26
24 | 90 | 1.4 | 7879.89
25 | 90 | 1.6 | 7938.86
26 | 100 | 0.8 | 6371.87
27 | 100 | 1.0 | 6996.44
28 | 100 | 1.2 | 7095.97
29 | 100 | 1.4 | 8360.18
30 | 100 | 1.6 | 8206.27
31 | 110 | 0.8 | 6750.66
32 | 110 | 1.0 | 7567.50
33 | 110 | 1.2 | 8222.51
34 | 110 | 1.4 | 8336.00
35 | 110 | 1.6 | 8967.15
36 | 120 | 0.8 | 6575.70
37 | 120 | 1.0 | 8261.29
38 | 120 | 1.2 | 7488.05
39 | 120 | 1.4 | 9299.34
40 | 120 | 1.6 | 8629.58
41 | 130 | 0.8 | 7165.49
42 | 130 | 1.0 | 7047.87
43 | 130 | 1.2 | 7764.65
44 | 130 | 1.4 | 8740.82
45 | 130 | 1.6 | 9101.40
46 | 140 | 0.8 | 7608.81
47 | 140 | 1.0 | 7843.19
48 | 140 | 1.2 | 8400.67
49 | 140 | 1.4 | 9421.99
50 | 140 | 1.6 | 9010.69

We could use the following SAS code to read the data in and fit a
multiple regression model with **ed**, **fi** and **fi^{2}**:

```sas
data quad1;
input cow fi ed yield;
cards;
1 50 0.8 5731.05
2 50 1.0 4607.40
3 50 1.2 5169.25
4 50 1.4 6345.16
. . .
48 140 1.2 8400.67
49 140 1.4 9421.99
50 140 1.6 9010.69
;
proc glm data=quad1;
model yield = ed fi fi*fi;
run;
```

Note how we have included the term **fi*fi**, which is fi^{2}!


We obtain the following SAS output:

The SAS System, The GLM Procedure

Number of observations: 50

Dependent Variable: Yield

Source | DF | Sum of Squares | Mean Square | F Value | Pr > F
---|---|---|---|---|---
Model | 3 | 61129561.86 | 20376520.62 | 103.21 | <.0001
Error | 46 | 9081842.78 | 197431.36 | |
Corrected Total | 49 | 70211404.64 | | |

R-Square | Coeff Var | Root MSE | Yield Mean
---|---|---|---
0.870650 | 6.137434 | 444.3325 | 7239.711

Source | DF | Type I SS | Mean Square | F Value | Pr > F
---|---|---|---|---|---
ed | 1 | 20896893.40 | 20896893.40 | 105.84 | <.0001
fi | 1 | 38518744.03 | 38518744.03 | 195.10 | <.0001
fi*fi | 1 | 1713924.43 | 1713924.43 | 8.68 | 0.0050

Source | DF | Type III SS | Mean Square | F Value | Pr > F
---|---|---|---|---|---
ed | 1 | 20896893.40 | 20896893.40 | 105.84 | <.0001
fi | 1 | 4481068.57 | 4481068.57 | 22.70 | <.0001
fi*fi | 1 | 1713924.43 | 1713924.43 | 8.68 | 0.0050

Parameter | Estimate | Standard Error | t Value | Pr > \|t\|
---|---|---|---|---
Intercept | -495.414245 | 788.0807136 | -0.63 | 0.5327
ed | 2285.656000 | 222.1662467 | 10.29 | <.0001
fi | 78.969322 | 16.5758454 | 4.76 | <.0001
fi*fi | -0.254797 | 0.0864781 | -2.95 | 0.0050

What can we see from this analysis? We see that the Model over and above the Mean, R(ed, fi, fi*fi | µ), accounts for a statistically significant amount of the variation (F-ratio = 103.21). We can also see that the marginal effect of Energy Density, R(ed | µ, fi, fi*fi), is statistically significant (F-ratio = 105.84), as is the marginal effect of fi*fi, the quadratic effect of Feed Intake (F-ratio = 8.68). We shall not test the statistical significance of the linear regression component for Feed Intake: if the quadratic effect is significant then we are going to include the linear regression effect in the model regardless, so testing its statistical significance makes no sense.
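The parameter estimates above can be cross-checked outside of SAS. The following Python/NumPy fragment (not part of the original notes; it assumes the data table above was transcribed exactly) refits the same model by ordinary least squares and recovers essentially the same coefficients as PROC GLM:

```python
# Cross-check of the PROC GLM fit: yield = b0 + b1*ed + b2*fi + b3*fi^2 + e
import numpy as np

# Rebuild the data set: fi runs 50..140 in blocks of 5 rows,
# ed cycles 0.8..1.6 within each block, as in the table above.
fi = np.repeat(np.arange(50, 150, 10), 5).astype(float)
ed = np.tile([0.8, 1.0, 1.2, 1.4, 1.6], 10)
yld = np.array([
    5731.05, 4607.40, 5169.25, 6345.16, 6477.83,
    4970.22, 5263.30, 5414.44, 7102.82, 6670.46,
    6371.27, 5594.80, 6033.55, 7248.72, 7288.52,
    5499.63, 6644.66, 6880.00, 7542.48, 7916.68,
    6758.12, 7547.07, 7855.26, 7879.89, 7938.86,
    6371.87, 6996.44, 7095.97, 8360.18, 8206.27,
    6750.66, 7567.50, 8222.51, 8336.00, 8967.15,
    6575.70, 8261.29, 7488.05, 9299.34, 8629.58,
    7165.49, 7047.87, 7764.65, 8740.82, 9101.40,
    7608.81, 7843.19, 8400.67, 9421.99, 9010.69,
])

# Design matrix: intercept, ed, fi, and the quadratic term fi^2.
X = np.column_stack([np.ones_like(fi), ed, fi, fi**2])
beta, *_ = np.linalg.lstsq(X, yld, rcond=None)
print(beta)  # order: intercept, ed, fi, fi*fi
```

Because the model is linear in the parameters, ordinary least squares and PROC GLM's normal-equations solution coincide here.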

What is the optimum feed intake? Well, let us look at the prediction equation that we have obtained:

Ŷ = -495.414 + 2285.656 ed + 78.969322 fi - 0.254797 fi^{2}

We can differentiate this with respect to feed intake, equate to zero and solve. Obvious, is it not? It almost takes us back to high school, solving for maxima and minima. Bet you never thought that you'd ever have any use for the calculus that you learnt! What do we get? Setting dŶ/d(fi) = 78.969322 - 2(0.254797) fi = 0 gives fi = 78.969322 / 0.509594 ≈ 155.
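The calculus step reduces to one line of arithmetic, which we can let the computer do as well (a Python sketch, using the coefficients from the SAS output above):

```python
# Differentiate Yhat = b0 + b_ed*ed + b_fi*fi + b_fifi*fi^2 with respect to fi:
#   dYhat/dfi = b_fi + 2*b_fifi*fi = 0   =>   fi_opt = -b_fi / (2*b_fifi)
b_fi = 78.969322     # linear coefficient of feed intake, from PROC GLM
b_fifi = -0.254797   # quadratic coefficient, from PROC GLM

fi_opt = -b_fi / (2 * b_fifi)
print(fi_opt)  # about 155 lb of feed intake
```

Since b_fifi is negative, the second derivative 2·b_fifi is negative and this stationary point is a maximum, not a minimum.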

Note that the estimated optimum (~155 lb) actually lies outside the range of our data (50 to 140 lb); we have a curve which is approaching a maximum, but our data do not in fact encompass that maximum. Since extrapolating outside the data range is somewhat speculative, we should be quite cautious about these results. We would probably want to repeat the experiment, feeding increased amounts of feed, to check the prediction. It would be most desirable to have feed intakes (X values) spanning the area of the optimum, so that we are not extrapolating.

R.I. Cue ©

Department of Animal Science, McGill University

last updated : 2010 April 28