lm()
Some data:
set.seed(777)
x = 1:12
y = 2*x + rnorm(length(x), mean = 0, sd = 1.5) # linear, with some noise
plot(x, y, main = "Measurements", font.main = 1)
The plot looks fairly linear, so a linear regression can be conducted without any data manipulation.
Let us create a data frame with the observations first:
dat = data.frame(x = x, y = y)
class(dat)
## [1] "data.frame"
head(dat)
## x y
## 1 1 2.734679
## 2 2 3.402188
## 3 3 6.766254
## 4 4 7.401782
## 5 5 12.458029
## 6 6 12.931911
We run the function, telling it that \(y\) is to be treated as a function of \(x\):
lm.out <- lm(y ~ x, data = dat)
summary(lm.out)
##
## Call:
## lm(formula = y ~ x, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.22999 -0.60183 0.00169 0.27844 2.00234
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.74984 0.61946 1.21 0.254
## x 1.94117 0.08417 23.06 5.31e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.007 on 10 degrees of freedom
## Multiple R-squared: 0.9815, Adjusted R-squared: 0.9797
## F-statistic: 531.9 on 1 and 10 DF, p-value: 5.307e-10
Coefficients/Estimate displays the calculated values for \(\alpha\) (the intercept) and \(\beta\) (the slope).
Coefficients/Pr(>|t|) are the p-values of the t-tests on the individual coefficients (null hypothesis: the coefficient is zero).
R squared (coefficient of determination) is the proportion of the variance in \(y\) explained by the model; values close to 1 indicate a good fit.
The p-value of the F-test should be small. Roughly speaking, a small value means that the model as a whole explains the data significantly better than a constant (intercept-only) model.
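These quantities can also be extracted programmatically from the fit instead of being read off the printed summary; a short sketch:
coef(summary(lm.out))           # matrix with Estimate, Std. Error, t value and Pr(>|t|)
summary(lm.out)$r.squared       # coefficient of determination
summary(lm.out)$adj.r.squared   # adjusted R squared
confint(lm.out)                 # 95% confidence intervals for the coefficients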
lm() results
# str(lm.out) # show the full structure of the fit object
alpha = lm.out$coefficients[1] # intercept
beta = lm.out$coefficients[2] # slope
plot(x, y)
abline(a = alpha, b = beta, col = "red") # abline draws a line with slope b and intercept a
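As a side note, abline() also accepts the fitted model directly, so the two coefficients do not have to be extracted by hand:
plot(x, y)
abline(lm.out, col = "red") # abline() reads intercept and slope from the lm object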
Should the fit be checked? Yes! This is done afterwards by looking at the residuals (the deviations of the original points from the fitted line). Check homoscedasticity and normality:
# plot(lm.out)
residuals = resid(lm.out) # get the residuals
par(mfrow = c(1, 2))
plot(residuals, col = "blue", main = "Residuals")
abline(h = 0, col = "red", lty = 2)
qqnorm(residuals, col = "blue")
qqline(residuals, col = "red", lty = 2)
par(mfrow = c(1, 1))
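The commented-out plot(lm.out) above refers to R's built-in diagnostic plots for lm objects (residuals vs. fitted, normal QQ plot, scale-location, residuals vs. leverage); they can be shown all at once, for example:
par(mfrow = c(2, 2)) # 2 x 2 grid for the four default diagnostic plots
plot(lm.out)
par(mfrow = c(1, 1))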
Independence of the residuals can be checked using an autocorrelation plot:
auto.cor = acf(resid(lm.out), plot = FALSE) # compute the autocorrelation, plot it in the next step
plot(auto.cor)
We can also directly have a look at some key data of the residuals:
residuals = resid(lm.out)
sd(residuals) # standard deviation
## [1] 0.9596625
summary(residuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.229993 -0.601828 0.001691 0.000000 0.278440 2.002337
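The graphical checks can be complemented by formal tests on the residuals, for example a Shapiro-Wilk test for normality and a Ljung-Box test for autocorrelation; a minimal sketch:
shapiro.test(residuals)                           # H0: the residuals are normally distributed
Box.test(residuals, lag = 5, type = "Ljung-Box")  # H0: no autocorrelation (lag = 5 is an arbitrary choice here)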
Technically, lm() will work in most cases. But be careful: very different datasets can yield exactly the same regression line, which is why the fit should always be inspected graphically.
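The classic illustration is R's built-in anscombe data set: its four x/y pairs look completely different when plotted, yet all four give (nearly) the same regression coefficients:
coef(lm(y1 ~ x1, data = anscombe)) # intercept ~ 3.0, slope ~ 0.5
coef(lm(y2 ~ x2, data = anscombe)) # (nearly) the same
coef(lm(y3 ~ x3, data = anscombe))
coef(lm(y4 ~ x4, data = anscombe))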
What if the relationship is not linear? No problem!
Create some data points:
set.seed(777)
x = 1:12
y = x^2 + rnorm(length(x), mean = 0, sd = 10) # quadratic function with some noise added
plot(x, y, col = "red", pch = 18)
It would be possible to define a new variable \(x' = x^2\) and feed this variable to lm(). However, it is probably easier to use Wilkinson-Rogers (W-R) notation: wrapping the term in I() tells R to evaluate x^2 arithmetically instead of interpreting ^ as a formula operator:
dat2 = data.frame(y = y, x = x)
lm.out = lm(y ~ I(x^2), data = dat2)
summary(lm.out)
##
## Call:
## lm(formula = y ~ I(x^2), data = dat2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.432 -3.687 -0.148 2.010 12.776
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.60544 2.92698 1.573 0.147
## I(x^2) 0.96020 0.04115 23.334 4.73e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.571 on 10 degrees of freedom
## Multiple R-squared: 0.982, Adjusted R-squared: 0.9802
## F-statistic: 544.5 on 1 and 10 DF, p-value: 4.732e-10
Plot original data and the fitted curve:
txt = "Original data + fit"
plot(x, y, col = "red", pch = 18, main = txt) # original data
c1 = lm.out$coefficients["(Intercept)"] # intercept
c2 = lm.out$coefficients["I(x^2)"] # coefficient of x^2
curve(c1 + c2*x^2, from = min(x), to = max(x), n = 101, col = "blue", lwd = 1.5, lty = 2, add = TRUE)
We used the curve() function to plot the fit; add = TRUE overlays the curve on the existing plot.
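As mentioned above, the same model can also be obtained by defining the squared variable explicitly; a short sketch (the column name xsq is just an illustrative choice):
dat2$xsq = dat2$x^2                 # explicit new predictor x' = x^2
lm.alt = lm(y ~ xsq, data = dat2)   # equivalent to lm(y ~ I(x^2), data = dat2)
coef(lm.alt)                        # same coefficients as before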
Create some data points:
set.seed(777)
x = runif(100, min = -1, max = 5) # create 100 uniformly distributed numbers between -1 and 5
y = x^3 - 5*x^2 + 2*x - 1 + rnorm(100, 0, 0.5) # a third order polynomial with some noise added
plot(x, y, col = "red", pch = 18)
Do the fit using lm() with W-R notation:
dat3 = data.frame(y = y, x = x)
fit = lm(y ~ x + I(x^2) + I(x^3)) # x and y are taken from the workspace; data = dat3 could be given equivalently
summary(fit)
##
## Call:
## lm(formula = y ~ x + I(x^2) + I(x^3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.14497 -0.31654 -0.02124 0.30661 1.24284
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.03723 0.07852 -13.21 <2e-16 ***
## x 2.05217 0.10294 19.94 <2e-16 ***
## I(x^2) -4.97719 0.06695 -74.34 <2e-16 ***
## I(x^3) 0.99243 0.01091 91.01 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4719 on 96 degrees of freedom
## Multiple R-squared: 0.9915, Adjusted R-squared: 0.9912
## F-statistic: 3713 on 3 and 96 DF, p-value: < 2.2e-16
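As a side note, the same raw polynomial can be written more compactly with poly(); raw = TRUE keeps the ordinary powers of x so that the coefficients are directly comparable:
fit.poly = lm(y ~ poly(x, 3, raw = TRUE)) # equivalent to y ~ x + I(x^2) + I(x^3)
coef(fit.poly)                            # same coefficients, only the names differ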
Plot original data and the fitted curve:
c1 = fit$coefficients["(Intercept)"] # intercept
c2 = fit$coefficients["x"] # coefficient of x
c3 = fit$coefficients["I(x^2)"] # coefficient of x^2
c4 = fit$coefficients["I(x^3)"] # coefficient of x^3
txt = "Original data + fit"
plot(x, y, col = "red", pch = 18, main = txt)
curve(c1 + c2*x + c3 *x^2 + c4*x^3, from = min(x), to = max(x), n = 101, col = "blue", lwd = 1.7, lty = 2, add = TRUE)
Above, as a result of the regression, we obtained an object fit. This object represents a statistical model for the relationship between the x- and y-data. We can use this model to predict values at arbitrary positions on the x-axis, provided they are not too far outside the range of the original data. For this we use the function predict(). The points at which predictions are wanted must be supplied as a data frame:
to_predict <- data.frame(x = seq(-2, 3, 0.01))
head(to_predict)
## x
## 1 -2.00
## 2 -1.99
## 3 -1.98
## 4 -1.97
## 5 -1.96
## 6 -1.95
pdout <- predict(fit, newdata = to_predict, interval = "confidence")
head(pdout)
## fit lwr upr
## 1 -32.98977 -34.02908 -31.95045
## 2 -32.65216 -33.68226 -31.62205
## 3 -32.31673 -33.33768 -31.29578
## 4 -31.98348 -32.99532 -30.97164
## 5 -31.65239 -32.65518 -30.64960
## 6 -31.32347 -32.31725 -30.32968
Here, we made a prediction with the model fit at the points defined by to_predict, simultaneously calculating a 95% confidence interval (the confidence level can be changed; see the full description of predict()). The output of the predict command has 3 columns if interval = "confidence" has been chosen: the column fit contains the predicted values, while lwr and upr denote the lower and upper confidence limits, respectively.
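For example, the level argument controls the confidence level, and interval = "prediction" gives prediction intervals for individual new observations instead of the confidence band for the mean; a brief sketch:
ci99 = predict(fit, newdata = to_predict, interval = "confidence", level = 0.99) # 99% confidence band
pi95 = predict(fit, newdata = to_predict, interval = "prediction")               # 95% prediction interval (wider)
head(ci99)
head(pi95)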
Let us plot the original data together with the predicted points and the confidence limits:
txt = "Original data + prediction + CI"
plot(x, y, col = "red", pch = 18, main = txt, xlim = c(-2, 6), ylim = c(min(pdout[,"lwr"]), max(pdout[,"upr"])))
lines(to_predict[,1], pdout[,"fit"], col = "blue", lwd = 1.8, lty = 2)
lines(to_predict[,1], pdout[,"lwr"], col = "green", lwd = 1.8, lty = 3)
lines(to_predict[,1], pdout[,"upr"], col = "green", lwd = 1.8, lty = 3)
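As a side note, the three lines() calls can be replaced by a single matlines() call, which draws one line per column of pdout:
plot(x, y, col = "red", pch = 18, main = txt, xlim = c(-2, 6), ylim = range(pdout))
matlines(to_predict$x, pdout, col = c("blue", "green", "green"), lty = c(2, 3, 3), lwd = 1.8)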
For this example, the confidence limits are very narrow, which can already be seen in the output of the head command above. This is caused by the large number of available data points and the weak noise. Let us try an example with fewer data points and stronger noise:
Now create only 15 data points and increase the standard deviation of the noise term:
set.seed(888)
x = runif(15, min = -1, max = 5) # create 15 uniformly distributed numbers
y = x^3 - 5*x^2 + 2*x - 1 + rnorm(length(x), 0, 2.5) # a third order polynomial with noise
dat3 = data.frame(y = y, x = x) # observed data
We build the model using lm() and make predictions at many intermediate points as well as at some points outside the original range:
fit = lm(y ~ x + I(x^2) + I(x^3)) # build the model (x and y taken from the workspace)
to_predict <- data.frame(x = seq(-2, 6, 0.01)) # points to predict
pdout <- predict(fit, newdata = to_predict, interval = "confidence") # prediction
We plot the original data, the prediction, and the confidence limits:
txt = "Original data + prediction + CI"
plot(x, y, col = "red", pch = 18, main = txt, xlim = c(-2, 6), ylim = c(min(pdout[,"lwr"]), max(pdout[,"upr"])))
lines(to_predict[,1], pdout[,"fit"], col = "blue", lwd = 1.8, lty = 2)
lines(to_predict[,1], pdout[,"lwr"], col = "green", lwd = 1.8, lty = 3)
lines(to_predict[,1], pdout[,"upr"], col = "green", lwd = 1.8, lty = 3)
We see that the confidence intervals are wider now. They are largest outside the range of the original data and smallest where many observations are located.
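This can be made explicit by plotting the width of the confidence band against x; a minimal sketch:
ci.width = pdout[, "upr"] - pdout[, "lwr"] # width of the confidence band at each prediction point
plot(to_predict$x, ci.width, type = "l", col = "darkgreen", xlab = "x", ylab = "CI width", main = "Width of the confidence band")
abline(v = range(x), lty = 3) # dashed lines mark the range of the observed x values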