In our previous post we looked at single-variable (univariate) linear regression using R. Single-variable regression is a good start, but it is of limited use in most practical situations, where several variables usually help predict the value of a dependent variable (DV). In this post we will look at multiple linear regression and also use the model created on a test dataset. We go through the same steps as we did for single-variable linear regression.
There are a few things that we need to do before we start building our model. The steps are listed here, and the details can be found at this site:
- Loading the Data
- Structure of the data (str and summary functions)
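As a minimal sketch of these two steps: the post's own dataset is not reproduced here, so R's built-in mtcars data is written to a temporary CSV file and used as a stand-in. The file path and the name cars are assumptions made only for this illustration.

```r
# Stand-in data: write the built-in mtcars dataset to a temporary CSV
# so that read.csv() has a file to load.
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)

# Loading the data
cars <- read.csv(csv_path)

# Structure of the data
str(cars)      # variable names, types, and the first few values
summary(cars)  # min, quartiles, mean, and max of each column
```

With your own data, replace csv_path with the path to your CSV file.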
After getting to know about our dataset we will start creating the model. The syntax is:
Model1 = lm(DV ~ IV1 + IV2 + … + IVk, data = dataset name)
- lm is the R function to create linear regression models
- DV is the dependent variable
- The ~ (tilde) sign separates the DV on its left from the IVs on its right; it tells R to model the DV as a function of the IVs
- IV1 through IVk are the k independent variables (IVs)
- data = dataset name helps R identify the dataset that it will use to create the model
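A concrete call following this syntax might look like the sketch below. It assumes R's built-in mtcars dataset as a stand-in for your own data, with mpg playing the role of the DV and wt, hp, and disp the IVs; these variable choices are only illustrative.

```r
# Fit a multiple linear regression: mpg is the DV; wt, hp, and disp are the IVs.
# mtcars ships with R, so no file needs to be loaded for this sketch.
Model1 <- lm(mpg ~ wt + hp + disp, data = mtcars)

# The fitted coefficients: one intercept plus one slope per IV
coef(Model1)
```

Either = or <- can be used for the assignment; <- is the more common convention in R code.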
The syntax above creates a linear regression model of the form:
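In standard notation, a multiple linear regression with k independent variables is usually written as shown here. The symbols β (coefficients) and ε (error term) are assumed notation for this sketch, not taken from the original post.

$$
DV = \beta_0 + \beta_1 \, IV_1 + \beta_2 \, IV_2 + \dots + \beta_k \, IV_k + \varepsilon
$$

Here β_0 is the intercept, β_1 through β_k are the coefficients that lm estimates for each IV, and ε is the part of the DV that the IVs do not explain.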
The syntax mentioned above creates the model, but nothing is displayed in the console window. To see the results of the linear model, we need to look at its summary by calling the summary function on the model object.
There are various things that we can see in the summary of the model, as shown in the image below. For details about the individual elements, please visit this site.
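A minimal sketch of viewing the summary, again assuming the built-in mtcars dataset as a stand-in and the illustrative names Model1 and model_summary:

```r
# Fit the model (the assignment itself prints nothing)
Model1 <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Printing the summary shows coefficients, significance stars, and R-squared values
model_summary <- summary(Model1)
model_summary

# Individual pieces can also be pulled out of the summary object
model_summary$coefficients   # estimate, std. error, t value, p-value per term
model_summary$r.squared      # Multiple R-squared
model_summary$adj.r.squared  # Adjusted R-squared
```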
For multiple linear regression, one important factor is multicollinearity. Multicollinearity is a situation where some of the independent variables are highly correlated with each other. So, what is correlation?
Correlation measures the strength of the linear relationship between 2 variables and is a number between -1 and +1. We use the following formula to calculate the correlation between 2 variables:
The call cor(dataset name) calculates the pairwise linear correlation between all the variables in a dataset (all of them must be numeric). A snapshot of it is given below:
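As a sketch of the two-variable case, the snippet below calls cor on two columns and checks the result against the Pearson correlation written out by hand. The built-in mtcars columns wt and disp are stand-ins chosen only for illustration.

```r
# Two numeric variables from the built-in mtcars dataset (stand-in data)
x <- mtcars$wt
y <- mtcars$disp

# Built-in Pearson correlation
r_builtin <- cor(x, y)

# The same quantity from its definition:
# covariation of x and y divided by the product of their spreads
r_by_hand <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

all.equal(r_builtin, r_by_hand)  # the two agree up to floating-point error
```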
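The sketch below computes such a pairwise correlation matrix and then lists the pairs whose correlation is large in absolute value. It assumes four columns of the built-in mtcars dataset as stand-in data, and the 0.8 cutoff is an arbitrary threshold picked for illustration, not a rule from this post.

```r
# Pairwise correlations between a few numeric columns (stand-in data)
cor_matrix <- cor(mtcars[, c("mpg", "wt", "hp", "disp")])
round(cor_matrix, 3)

# Find pairs with |r| above an assumed threshold of 0.8.
# upper.tri() keeps each pair once and skips the diagonal of 1s.
high <- which(abs(cor_matrix) > 0.8 & upper.tri(cor_matrix), arr.ind = TRUE)
pairs <- data.frame(
  var1 = rownames(cor_matrix)[high[, "row"]],
  var2 = colnames(cor_matrix)[high[, "col"]],
  r    = cor_matrix[high]
)
pairs
```

Pairs of IVs that show up in this list are candidates for the multicollinearity problem described above.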
The number in the box gives us the correlation between Age and FrancePop (the population of France), which is 0.994485097.
Now that we know about correlation, we will focus on the coefficients and the R-squared values of the model. A snapshot of these is shown here:
In the above image we see that Age and FrancePop do not have a star (*) or a period (.) at the end of their respective rows. This means that Age and FrancePop are not statistically significant in the model that we have created. Remember to remove insignificant variables one at a time and not all together, since 2 of the variables could be highly correlated, and removing one can change the significance of the other. We can see that the Multiple R-squared and Adjusted R-squared values are 0.8294 and 0.7845. These numbers are pretty good. But since we have 2 insignificant IVs in the model, let's remove FrancePop from our model and see how the values for R-squared and Adjusted R-squared change.
In the above image we have removed FrancePop from our model and created a new model with the remaining variables. We see that our Adjusted R-squared has actually increased to 0.7943 from 0.7845. We can also see that Age has now become a very significant IV with 2 stars (**) at the end. Age became significant because FrancePop was removed (remember that we saw that Age and FrancePop were highly correlated). Since most of the IVs are significant in this model, we will stick with it. So our final equation for predicting the price of wine given the various IVs would be:
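The one-at-a-time removal can be sketched with update, which refits a model with a modified formula. The wine data is not reproduced here, so this assumes the built-in mtcars dataset as a stand-in, and dropping disp is an illustrative choice rather than a recommendation.

```r
# Start with a model containing all candidate IVs (stand-in data)
full_model <- lm(mpg ~ wt + hp + disp + qsec, data = mtcars)
summary(full_model)$coefficients  # the last column holds the p-values

# Drop ONE variable and refit; ". ~ . - disp" means
# "same DV, same IVs, minus disp"
reduced_model <- update(full_model, . ~ . - disp)
summary(reduced_model)$coefficients

# Compare Adjusted R-squared before and after the removal
c(full    = summary(full_model)$adj.r.squared,
  reduced = summary(reduced_model)$adj.r.squared)
```

Multiple R-squared can never go up when a variable is removed, which is why Adjusted R-squared is the more useful number for this comparison.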
Price = -3.4299802 + 0.6072093 * AGST + (-0.0039715) * HarvestRain + 0.0239308 * Age + 0.0010755 * WinterRain
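The equation can be wrapped in a small function to make the arithmetic explicit. The function name predict_price and the input values in the example call are hypothetical, chosen only for illustration; the coefficients are the ones from the equation above.

```r
# Predicted Price from the final equation, using the coefficients given above
predict_price <- function(AGST, HarvestRain, Age, WinterRain) {
  -3.4299802 +
    0.6072093 * AGST +
    (-0.0039715) * HarvestRain +
    0.0239308 * Age +
    0.0010755 * WinterRain
}

# Hypothetical input values, not taken from the dataset
predict_price(AGST = 17, HarvestRain = 150, Age = 20, WinterRain = 600)
```

Each coefficient is the change in predicted Price for a one-unit increase in that IV while the other IVs are held fixed.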
In our next post we will look at how to use this same model to predict values for an unknown dataset. We will also touch upon how to divide our data into training and test datasets.
For readers who want a deeper understanding of linear regression and the issues that can be encountered, please refer to the R-bloggers.com website.