Linear Regression - Part 2 ~ Python and Machine Learning Blog

In the previous post we discussed how to calculate the coefficients for simple and multiple regression.
In this post we will study how to check the accuracy of the coefficient estimates and how to evaluate the model fit.

Assumptions

Before jumping into coefficient accuracy, let's list the assumption that we make when calculating the coefficients:
- The estimates calculated from the sample are unbiased.
- This means that the model does not systematically overestimate or underestimate the coefficients. For a particular sample, the model may overestimate or underestimate a coefficient, but if many samples are taken, the average of the coefficients calculated over those samples will be very close to the true value.
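
The idea of averaging over many samples can be checked with a small simulation. The sketch below is only an illustration: the "true" line (intercept 2, slope 3), the noise level and the helper name `fit_simple` are all assumptions made up for this example, not values from the post.

```python
import random

def fit_simple(xs, ys):
    """Least-squares estimates (b0, b1) for y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Assumed "true" population line, chosen only for this illustration.
TRUE_B0, TRUE_B1, NOISE_SD = 2.0, 3.0, 1.0

rng = random.Random(0)            # fixed seed so the run is repeatable
xs = [i / 2 for i in range(20)]   # fixed X values: 0.0, 0.5, ..., 9.5

slopes = []
for _ in range(2000):
    # Each pass draws a fresh sample of Y values around the true line.
    ys = [TRUE_B0 + TRUE_B1 * x + rng.gauss(0, NOISE_SD) for x in xs]
    _, b1 = fit_simple(xs, ys)
    slopes.append(b1)

mean_slope = sum(slopes) / len(slopes)
print(f"single-sample slopes range from {min(slopes):.3f} to {max(slopes):.3f}")
print(f"average slope over all samples: {mean_slope:.3f} (true value {TRUE_B1})")
```

Individual samples give slopes on both sides of the true value, while the average over all samples lands very close to it.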

Standard Error of coefficients

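Standard error describes the sample-to-sample variability of the estimates of β0 and β1. It can also be thought of as the average amount by which an estimated coefficient differs from its TRUE value. For simple linear regression, the standard errors of β0 and β1 are given by:

SE(β0)² = σ² [ 1/n + x̄² / Σ(xi - x̄)² ]

SE(β1)² = σ² / Σ(xi - x̄)²

Here σ² is the variance of the error term (the scatter of Y around the true regression line), n is the number of observations and x̄ is the mean of X.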

From the above equations we can note that:

- The more the 'X' values are spread out, the lower the standard error will be, because the spread of 'X' appears in the denominator of both SE(β0) and SE(β1).

- The more the 'Y' values are spread around the regression line, the higher the standard error will be.

The following diagram depicts this behavior: the more the X values are spread out, the closer the lines fitted to different samples will be to each other, and hence the lower the standard error.

More variation (spread) in 'X' results in a better estimate of the slope β1.
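
The effect of the spread of X can be seen numerically. The following is a minimal sketch using the usual textbook standard-error formulas for simple linear regression, with the error variance estimated as RSS / (n - 2). The data, the true line and the function name `standard_errors` are assumptions invented for this illustration.

```python
import math
import random

def standard_errors(xs, ys):
    """Return (b0, b1, se_b0, se_b1) for simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    sigma2 = rss / (n - 2)  # estimate of the error variance
    se_b1 = math.sqrt(sigma2 / sxx)
    se_b0 = math.sqrt(sigma2 * (1 / n + x_bar ** 2 / sxx))
    return b0, b1, se_b0, se_b1

rng = random.Random(1)
noise = [rng.gauss(0, 1.0) for _ in range(30)]

narrow_x = [4.5 + i / 30 for i in range(30)]  # X squeezed into [4.5, 5.5)
wide_x = [i / 3 for i in range(30)]           # X spread over [0, 10)

# Same assumed true line and same noise in both cases; only the spread of X differs.
narrow_y = [2.0 + 3.0 * x + e for x, e in zip(narrow_x, noise)]
wide_y = [2.0 + 3.0 * x + e for x, e in zip(wide_x, noise)]

*_, se_b1_narrow = standard_errors(narrow_x, narrow_y)
*_, se_b1_wide = standard_errors(wide_x, wide_y)

print(f"SE(b1) with narrow X: {se_b1_narrow:.4f}")
print(f"SE(b1) with wide X:   {se_b1_wide:.4f}")
```

Because the wide X values are the narrow ones stretched by a factor of 10 while the noise is unchanged, the standard error of the slope shrinks by the same factor.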

Based on the standard errors calculated above, we can also define an approximate 95% confidence interval for each coefficient:

Confidence interval of β1

[ β1 - 2SE(β1) , β1 + 2SE(β1) ]

Confidence interval of β0

[ β0 - 2SE(β0) , β0 + 2SE(β0) ]

In simple words, if we repeatedly draw new samples and construct this interval each time, then about 95% of those intervals will contain the TRUE values of β1 and β0.
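
This repeated-sampling reading of the interval can be tried out directly. The sketch below builds the "estimate plus or minus 2 standard errors" interval for the slope on many simulated samples and counts how often it contains the true slope. The true line, the sample size and the name `slope_interval` are assumptions chosen for this example.

```python
import math
import random

def slope_interval(xs, ys):
    """Approximate 95% interval for the slope: b1 minus/plus 2 * SE(b1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    se_b1 = math.sqrt(rss / (n - 2) / sxx)
    return b1 - 2 * se_b1, b1 + 2 * se_b1

# Assumed true line for the simulation (not taken from the post).
TRUE_B0, TRUE_B1 = 2.0, 3.0

rng = random.Random(2)
xs = [i / 5 for i in range(50)]

n_runs = 2000
hits = 0
for _ in range(n_runs):
    ys = [TRUE_B0 + TRUE_B1 * x + rng.gauss(0, 1.0) for x in xs]
    low, high = slope_interval(xs, ys)
    if low <= TRUE_B1 <= high:
        hits += 1

coverage = hits / n_runs
print(f"fraction of intervals containing the true slope: {coverage:.3f}")
```

The fraction should come out close to 0.95. The factor 2 is a rounded value, so the match is approximate rather than exact.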

Check for NULL Hypothesis

To establish a relationship between an input variable and the response variable, we need to check that the corresponding coefficient is sufficiently far from zero.

How far from zero?

t = (β1 - 0) / SE(β1)

The larger the absolute value of 't', the greater our confidence that the coefficient is far from 0. If the absolute value of 't' is greater than 2, then we can say that '0' lies outside the approximate 95% confidence interval for the coefficient. In other words, if the true value of β1 were 0, an estimate this far from 0 would be seen in fewer than about 5% of samples. Note that if β1 is equal to zero, then there is no linear association between the corresponding input and the response variable.
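
As a rough illustration of the t formula, the sketch below computes the t-stat of the slope for two made-up data sets: one where the response really depends on X and one where it does not. The data and the function name `slope_t_stat` are assumptions for this example only.

```python
import math
import random

def slope_t_stat(xs, ys):
    """t = (b1 - 0) / SE(b1) for simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    se_b1 = math.sqrt(rss / (n - 2) / sxx)
    return b1 / se_b1

rng = random.Random(3)
xs = [i / 4 for i in range(40)]

related_y = [1.0 + 0.8 * x + rng.gauss(0, 1.0) for x in xs]  # y depends on x
unrelated_y = [1.0 + rng.gauss(0, 1.0) for _ in xs]          # y ignores x

t_related = slope_t_stat(xs, related_y)
t_unrelated = slope_t_stat(xs, unrelated_y)

print(f"t-stat when y depends on x: {t_related:.2f}")
print(f"t-stat when y ignores x:    {t_unrelated:.2f}")
```

The first t-stat should be far above 2 in absolute value, while the second is typically small, although about 1 sample in 20 will exceed 2 purely by chance.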

Importance of an input variable:
The t-stat and the corresponding p-value indicate how important a particular variable is in the model. The higher the absolute value of the t-stat, the lower the p-value, and hence the stronger the evidence that the variable matters. When doing regression, we need to check which variables are not important (low t-stat and therefore high p-value); based on that, we can consider removing those variables from the model.
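
The inverse relation between the t-stat and the p-value can be sketched as follows. Note the assumption: this uses the normal distribution as a large-sample stand-in for the t-distribution, so the numbers are approximate, especially for small samples. The function name `approx_two_sided_p` is made up for this example; a statistics package would normally report the exact p-value.

```python
from statistics import NormalDist

def approx_two_sided_p(t_stat):
    """Two-sided p-value using the normal approximation to the t-distribution."""
    return 2 * (1 - NormalDist().cdf(abs(t_stat)))

# Larger |t| gives a smaller p-value.
for t in (0.5, 2.0, 4.0):
    print(f"t = {t:>4}: approximate p-value = {approx_two_sided_p(t):.4f}")
```

With this approximation, a t-stat of about 2 in absolute value corresponds to a p-value of roughly 0.05, which is where the "greater than 2" rule of thumb comes from.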

Model Accuracy

The quality of a linear regression model fit is assessed using the following:

RSE (Residual Standard Error): It is the average amount by which the response will deviate from the true regression line. RSE measures the lack of fit of the model. One drawback of RSE is that it is expressed in the units of the response variable, and hence it is difficult to devise any standard for deciding whether a given RSE is large or small.

R2 (R-squared): R2 measures the same thing as RSE but addresses this drawback. It is a proportion (the share of the variability in the response that the model explains), so its value always lies in the range 0 to 1, where '0' means that the model explains none of the variability (a poor fit) and '1' means that it explains all of it (a good fit).
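
A minimal sketch of both measures for simple linear regression is shown below, using the usual definitions RSE = sqrt(RSS / (n - 2)) and R2 = 1 - RSS / TSS. The data values are made up for illustration, and `rse_and_r2` is a hypothetical helper name. Rescaling the response (for example, changing its unit) shows the drawback of RSE described above.

```python
import math

def rse_and_r2(xs, ys):
    """Return (RSE, R2) for a simple linear regression of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))  # unexplained variation
    tss = sum((y - y_bar) ** 2 for y in ys)                      # total variation
    rse = math.sqrt(rss / (n - 2))
    r2 = 1 - rss / tss
    return rse, r2

# Made-up data, roughly following y = 2x.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

rse, r2 = rse_and_r2(xs, ys)

# Same data with the response expressed in a unit 1000 times smaller.
ys_scaled = [1000 * y for y in ys]
rse_scaled, r2_scaled = rse_and_r2(xs, ys_scaled)

print(f"original units: RSE = {rse:.4f}, R2 = {r2:.4f}")
print(f"rescaled units: RSE = {rse_scaled:.4f}, R2 = {r2_scaled:.4f}")
```

The RSE grows by the same factor of 1000 as the response, while R2 stays the same, which is why R2 is easier to compare across problems.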

Why do we need to check both the t-stat and R2?

Note that R2 can't be the only measure used to judge the model, because R2 never decreases, and usually increases, as more input variables (predictors) are added, even when they are unrelated to the response. That's why we also need to check the t-stat of every input variable to assess its statistical significance.
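
This behavior of R2 can be demonstrated with a small experiment. The sketch below fits the response first on one real predictor and then on that predictor plus a column of pure random noise. The two-predictor fit solves the 2x2 normal equations on centered data. All data and function names (`r2_one`, `r2_two`) are assumptions invented for this example.

```python
import random

def centered(values):
    """Subtract the mean so that the intercept drops out of the fit."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def r2_one(x1, y):
    """R2 of a least-squares fit of y on x1 (with intercept)."""
    x1c, yc = centered(x1), centered(y)
    b = dot(x1c, yc) / dot(x1c, x1c)
    rss = sum((yi - b * xi) ** 2 for xi, yi in zip(x1c, yc))
    return 1 - rss / dot(yc, yc)

def r2_two(x1, x2, y):
    """R2 of a least-squares fit of y on x1 and x2 (with intercept)."""
    x1c, x2c, yc = centered(x1), centered(x2), centered(y)
    s11, s22, s12 = dot(x1c, x1c), dot(x2c, x2c), dot(x1c, x2c)
    s1y, s2y = dot(x1c, yc), dot(x2c, yc)
    det = s11 * s22 - s12 ** 2            # determinant of the 2x2 normal equations
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    rss = sum((yi - b1 * u - b2 * v) ** 2 for u, v, yi in zip(x1c, x2c, yc))
    return 1 - rss / dot(yc, yc)

rng = random.Random(4)
x1 = [i / 3 for i in range(30)]
y = [1.0 + 0.5 * x + rng.gauss(0, 1.0) for x in x1]
junk = [rng.gauss(0, 1.0) for _ in x1]   # pure noise, unrelated to y by construction

r2_before = r2_one(x1, y)
r2_after = r2_two(x1, junk, y)

print(f"R2 with the real predictor only:    {r2_before:.4f}")
print(f"R2 after adding the noise variable: {r2_after:.4f}")
```

The second R2 is never lower than the first, even though the added variable carries no information about the response. Its t-stat, on the other hand, would usually be small.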

Summary

In this article we saw how to measure the standard error of the coefficients and how to evaluate the regression model. We also saw how to check the importance of a particular input variable and how to decide whether to include it in or exclude it from the model. In the next article we will write R code for regression and explain the model based on the various parameters provided by the model.


Wednesday, 6 May 2020