What does linear regression do?


Assumptions


Linear regression using lm()

Some data:

set.seed(777)
x = 1:12
y = 2*x + rnorm(length(x), mean = 0, sd = 1.5)     # linear, with some noise
plot(x, y, main = "Measurements", font.main = 1)

The plot looks fairly linear, so a linear regression can be conducted without any further data manipulation.

Let us create a data frame with the observations first:

dat = data.frame(x = x, y = y)
class(dat)
## [1] "data.frame"
head(dat)
##   x         y
## 1 1  2.734679
## 2 2  3.402188
## 3 3  6.766254
## 4 4  7.401782
## 5 5 12.458029
## 6 6 12.931911

We run the function, telling it that \(y\) is to be treated as a function of \(x\):

lm.out <- lm(y ~ x, data = dat)
summary(lm.out)
## 
## Call:
## lm(formula = y ~ x, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.22999 -0.60183  0.00169  0.27844  2.00234 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.74984    0.61946    1.21    0.254    
## x            1.94117    0.08417   23.06 5.31e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.007 on 10 degrees of freedom
## Multiple R-squared:  0.9815, Adjusted R-squared:  0.9797 
## F-statistic: 531.9 on 1 and 10 DF,  p-value: 5.307e-10


Reading the output
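
The coefficient table lists the estimated intercept (0.75, not significantly different from zero) and slope (1.94, close to the true value of 2 used to generate the data), each with its standard error, t value, and p-value. The residual standard error and the R-squared of 0.98 summarize how well the line fits. These quantities can also be extracted from the fitted object directly; a minimal sketch:

coef(lm.out)                  # estimated intercept and slope
confint(lm.out)               # 95% confidence intervals for the coefficients
summary(lm.out)$sigma         # residual standard error
summary(lm.out)$r.squared     # multiple R-squared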


Draw the regression line using the lm() results

# str(lm.out)
alpha = lm.out$coefficients[1]
beta = lm.out$coefficients[2]
plot(x, y)
abline(a = alpha, b = beta, col = "red")   # abline draws a line with slope b and intercept a
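
Note that abline() also accepts the fitted model object, so the same line could have been drawn without extracting the coefficients by hand:

abline(lm.out, col = "red")   # intercept and slope are taken from the lm object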


Checking if the assumptions were fulfilled

Yes! This is done after the fit by looking at the residuals (the deviations of the observed points from the fitted line).
Check homoscedasticity and normality:

# plot(lm.out)  
residuals = resid(lm.out)  # get the residuals 
par(mfrow = c(1, 2))
plot(residuals, col = "blue", main = "Residuals")
abline(h = 0, col = "red", lty = 2)
qqnorm(residuals, col = "blue")
qqline(residuals, col = "red", lty = 2)

par(mfrow = c(1, 1))
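
The graphical checks can be complemented by a formal test of normality, e.g. the Shapiro-Wilk test (an additional check, not part of the original workflow); a large p-value means that normality of the residuals is not rejected:

shapiro.test(resid(lm.out))   # H0: the residuals come from a normal distribution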

Independence of the residuals can be checked using an autocorrelation plot:

auto.cor = acf(resid(lm.out), plot = FALSE)   # compute the autocorrelation function of the residuals
plot(auto.cor, main = "Autocorrelation of the residuals")
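
A formal counterpart to the autocorrelation plot is the Ljung-Box test from base R (again an additional check, not shown in the original example):

Box.test(resid(lm.out), type = "Ljung-Box")   # H0: no autocorrelation in the residuals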

We can also directly have a look at some key data of the residuals:

residuals = resid(lm.out)
sd(residuals)     # standard deviation 
## [1] 0.9596625
summary(residuals)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -1.229993 -0.601828  0.001691  0.000000  0.278440  2.002337


Why is plotting necessary?

Technically, lm() will work in most cases. But very different datasets can yield practically the same regression line, which only becomes apparent when the data are plotted!
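
A classic illustration is Anscombe's quartet, built into R as the dataset anscombe: its four x-y pairs give nearly identical regression coefficients, yet only a plot reveals how different the datasets really are. A minimal sketch:

par(mfrow = c(2, 2))
for (i in 1:4) {
  xi = anscombe[[paste0("x", i)]]
  yi = anscombe[[paste0("y", i)]]
  plot(xi, yi, col = "blue", pch = 18, main = paste("Anscombe dataset", i))
  abline(lm(yi ~ xi), col = "red")   # intercept ~ 3.0 and slope ~ 0.5 in all four cases
}
par(mfrow = c(1, 1))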


Fitting non-linear functions

No problem!

Fitting a square function

Create some data points:

set.seed(777) 
x = 1:12
y = x^2 + rnorm(length(x), mean = 0, sd = 10)  # square function with some noise added
plot(x, y, col = "red", pch = 18)

It would be possible to define a new variable \(x' = x^2\) and to feed this variable to lm().
However, it is probably easier to use the I() function inside the model formula (Wilkinson-Rogers notation):

dat2 = data.frame(y = y, x = x)
lm.out = lm(y ~ I(x^2), data = dat2)
summary(lm.out)
## 
## Call:
## lm(formula = y ~ I(x^2), data = dat2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.432 -3.687 -0.148  2.010 12.776 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.60544    2.92698   1.573    0.147    
## I(x^2)       0.96020    0.04115  23.334 4.73e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.571 on 10 degrees of freedom
## Multiple R-squared:  0.982,  Adjusted R-squared:  0.9802 
## F-statistic: 544.5 on 1 and 10 DF,  p-value: 4.732e-10

Plot original data and the fitted curve:

txt = "Original data + fit"
plot(x, y, col = "red", pch = 18, main = txt)             # original data
c1 = lm.out$coefficients["(Intercept)"]                   # intercept
c2 = lm.out$coefficients["I(x^2)"]                        # coefficient of x^2
curve(c1 + c2*x^2, from = min(x), to = max(x), n = 101, col = "blue", lwd = 1.5, lty = 2, add = TRUE)

We used the curve() function to plot the fit.
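
Alternatively, the fitted values stored in the model object can be added with lines(); since the x values are already sorted, no reordering is needed:

lines(x, fitted(lm.out), col = "darkgreen", lty = 3)   # fitted values at the original x positions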


Fitting a polynomial function

Create some data points:

set.seed(777) 
x = runif(100, min = -1, max = 5)               # create 100 uniformly distributed numbers between -1 and 5
y = x^3 - 5*x^2 + 2*x - 1 + rnorm(100, 0, 0.5)  # a third order polynomial with some noise added
plot(x, y, col = "red", pch = 18)

Do the fit using the lm-function with W-R-notation:

dat3 = data.frame(y = y, x = x)
fit = lm(y ~ x + I(x^2) + I(x^3))   # x and y are taken from the workspace; data = dat3 could be supplied as well
summary(fit)
## 
## Call:
## lm(formula = y ~ x + I(x^2) + I(x^3))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.14497 -0.31654 -0.02124  0.30661  1.24284 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.03723    0.07852  -13.21   <2e-16 ***
## x            2.05217    0.10294   19.94   <2e-16 ***
## I(x^2)      -4.97719    0.06695  -74.34   <2e-16 ***
## I(x^3)       0.99243    0.01091   91.01   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4719 on 96 degrees of freedom
## Multiple R-squared:  0.9915, Adjusted R-squared:  0.9912 
## F-statistic:  3713 on 3 and 96 DF,  p-value: < 2.2e-16
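
The same model can be written more compactly with poly(); with raw = TRUE the coefficients refer to the plain powers of x and therefore match the output above (shown here only as an equivalent alternative):

fit.poly = lm(y ~ poly(x, 3, raw = TRUE), data = dat3)   # equivalent to y ~ x + I(x^2) + I(x^3)
coef(fit.poly)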

Plot original data and the fitted curve:

c1 = fit$coefficients["(Intercept)"]       # intercept
c2 = fit$coefficients["x"]                 # coefficient of x
c3 = fit$coefficients["I(x^2)"]            # coefficient of x^2
c4 = fit$coefficients["I(x^3)"]            # coefficient of x^3

txt = "Original data + fit"
plot(x, y, col = "red", pch = 18, main = txt)
curve(c1 + c2*x + c3 *x^2 + c4*x^3, from = min(x), to = max(x), n = 101, col = "blue", lwd = 1.7, lty = 2, add = TRUE)

Prediction

Above, as a result of the regression, we obtained an object fit. This object represents a statistical model for the relationship between the x- and y-data. We can use this model to predict values at arbitrary positions on the x-axis, provided they are not too far outside the range of the original data.

In order to do so, we can use the function predict. The points intended for prediction must be supplied as a data frame whose column name matches the predictor in the model (here x):

to_predict <- data.frame(x = seq(-2, 3, 0.01))
head(to_predict)
##       x
## 1 -2.00
## 2 -1.99
## 3 -1.98
## 4 -1.97
## 5 -1.96
## 6 -1.95
pdout <- predict(fit, newdata = to_predict, interval = "confidence")
head(pdout)
##         fit       lwr       upr
## 1 -32.98977 -34.02908 -31.95045
## 2 -32.65216 -33.68226 -31.62205
## 3 -32.31673 -33.33768 -31.29578
## 4 -31.98348 -32.99532 -30.97164
## 5 -31.65239 -32.65518 -30.64960
## 6 -31.32347 -32.31725 -30.32968

Here, we conducted a prediction using the model fit at the points defined by to_predict, simultaneously calculating a 95% confidence interval (the confidence level can be changed via the level argument; see the full description of predict). The output of predict has 3 columns if interval = "confidence" has been chosen: the column fit contains the predicted values, while lwr and upr denote the lower and upper confidence limits, respectively.
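
Besides the confidence interval for the fitted curve, predict() can also return a prediction interval for new observations, which additionally accounts for the residual scatter and is therefore wider. A sketch of the call, using the level argument to request 99% coverage:

pi.out <- predict(fit, newdata = to_predict, interval = "prediction", level = 0.99)
head(pi.out)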

Let us plot the original data together with the predicted points and the confidence limits:

txt = "Original data + prediction + CI"
plot(x, y, col = "red", pch = 18, main = txt, xlim = c(-2, 6), ylim = c(min(pdout[,"lwr"]), max(pdout[,"upr"])))   
lines(to_predict[,1], pdout[,"fit"], col = "blue", lwd = 1.8, lty = 2)
lines(to_predict[,1], pdout[,"lwr"], col = "green", lwd = 1.8, lty = 3)
lines(to_predict[,1], pdout[,"upr"], col = "green", lwd = 1.8, lty = 3)

For this example, the confidence limits are very narrow, which can already be seen in the output of the head command above. This is due to the large number of available data points and the weak noise.

Let us try out an example with fewer data points and stronger noise:

Now create only 15 data points and increase the standard deviation of the noise term:

set.seed(888)
x = runif(15, min = -1, max = 5)                      # create 15 uniformly distributed numbers
y = x^3 - 5*x^2 + 2*x - 1 + rnorm(length(x), 0, 2.5)  # a third order polynomial with noise
dat3 = data.frame(y = y, x = x)                       # observed data

We build the model using lm() and make predictions at many intermediate points as well as at some points outside the original range:

fit = lm(y ~ x + I(x^2) + I(x^3))                                     # build the model
to_predict <- data.frame(x = seq(-2, 6, 0.01))                        # points to predict 
pdout <- predict(fit, newdata = to_predict, interval = "confidence")  # prediction 

We plot original data, prediction, and confidence limits:

txt = "Original data + prediction + CI"
plot(x, y, col = "red", pch = 18, main = txt, xlim = c(-2, 6), ylim = c(min(pdout[,"lwr"]), max(pdout[,"upr"])))   
lines(to_predict[,1], pdout[,"fit"], col = "blue", lwd = 1.8, lty = 2)
lines(to_predict[,1], pdout[,"lwr"], col = "green", lwd = 1.8, lty = 3)
lines(to_predict[,1], pdout[,"upr"], col = "green", lwd = 1.8, lty = 3)

We see that the confidence intervals are wider now. They are largest outside the range of the original data and smallest where many observations are located.
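
The widening can be made explicit by plotting the interval width against x (a small additional check based on the objects computed above):

ci.width = pdout[, "upr"] - pdout[, "lwr"]    # width of the confidence interval at each x
plot(to_predict$x, ci.width, type = "l", col = "blue",
     xlab = "x", ylab = "CI width", main = "Width of the confidence interval")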


uwe.menzel@matstat.org