Multiple Linear Regression with the Carseats Data Set
Linear regression with more than one predictor
Karen Mazidi
Using the Carseats data set from the ISLR package
The Carseats data set tracks sales information for car seats. It has 400 observations (each at a different store) and 11 variables:
Sales: unit sales in thousands
CompPrice: price charged by competitor at each location
Income: community income level in 1000s of dollars
Advertising: local ad budget at each location in 1000s of dollars
Population: regional population in thousands
Price: price for car seats at each site
ShelveLoc: Bad, Good or Medium indicates quality of shelving location
Age: average age of the local population
Education: education level at each location
Urban: Yes/No, whether the store is in an urban location
US: Yes/No, whether the store is in the US
library(ISLR)
attach(Carseats)
names(Carseats)
##  [1] "Sales"       "CompPrice"   "Income"      "Advertising" "Population"
##  [6] "Price"       "ShelveLoc"   "Age"         "Education"   "Urban"
## [11] "US"
Multiple linear model
Let's build a model using all the predictors
# we could do it like this:
# lm1 = lm(Sales~CompPrice+Income+...
# but there are a lot of predictors
# instead, use the .
lm1 = lm(Sales~., data=Carseats)
summary(lm1)
##
## Call:
## lm(formula = Sales ~ ., data = Carseats)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.8692 -0.6908  0.0211  0.6636  3.4115
##
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)      5.6606231  0.6034487   9.380  < 2e-16 ***
## CompPrice        0.0928153  0.0041477  22.378  < 2e-16 ***
## Income           0.0158028  0.0018451   8.565 2.58e-16 ***
## Advertising      0.1230951  0.0111237  11.066  < 2e-16 ***
## Population       0.0002079  0.0003705   0.561    0.575
## Price           -0.0953579  0.0026711 -35.700  < 2e-16 ***
## ShelveLocGood    4.8501827  0.1531100  31.678  < 2e-16 ***
## ShelveLocMedium  1.9567148  0.1261056  15.516  < 2e-16 ***
## Age             -0.0460452  0.0031817 -14.472  < 2e-16 ***
## Education       -0.0211018  0.0197205  -1.070    0.285
## UrbanYes         0.1228864  0.1129761   1.088    0.277
## USYes           -0.1840928  0.1498423  -1.229    0.220
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.019 on 388 degrees of freedom
## Multiple R-squared:  0.8734, Adjusted R-squared:  0.8698
## F-statistic: 243.4 on 11 and 388 DF,  p-value: < 2.2e-16
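To see what the `.` shorthand in a formula expands to, the formula can be inspected with `terms()`. The following is a minimal sketch on a small made-up data frame (`toy`, with hypothetical columns `y`, `a` and `b`), not on the Carseats data:

```r
# Hypothetical toy data frame, used only to illustrate the `.` shorthand.
toy <- data.frame(y = c(1, 2, 3, 4), a = c(2, 1, 4, 3), b = c(0, 1, 0, 1))

# terms() expands `.` to every column of `data` except the response.
expanded <- attr(terms(y ~ ., data = toy), "term.labels")
print(expanded)
```

Applied to `Sales ~ .` with `data=Carseats`, the same expansion would list the ten remaining columns as predictors.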
Dummy variables
Why do we have variables ShelveLocGood, ShelveLocMedium, USYes and UrbanYes when those variables don't exist in the data set?
R generates dummy variables for us from qualitative variables. The contrasts() function returns the coding that R uses.
##        Good Medium
## Bad       0      0
## Good      1      0
## Medium    0      1
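The coding table above looks like the result of calling `contrasts()` on the ShelveLoc factor, although the call itself is not shown. As a small sketch of the same idea, here is a made-up factor (`shelf`, a hypothetical stand-in with the same three levels) and the dummy columns R derives from it:

```r
# Hypothetical factor with the same three levels as ShelveLoc.
shelf <- factor(c("Bad", "Good", "Medium", "Good"))

# Default treatment coding: the first level ("Bad") is the baseline.
coding <- contrasts(shelf)
print(coding)

# model.matrix() shows the dummy columns lm() would actually use.
design <- model.matrix(~ shelf)
print(colnames(design))
```

The baseline level gets no column of its own, which is why the regression output lists ShelveLocGood and ShelveLocMedium but no ShelveLocBad.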
Correlations with quantitative data
cor(subset(Carseats, select=-c(ShelveLoc,Urban,US))) # omit qualitative data
##                   Sales   CompPrice       Income  Advertising   Population
## Sales        1.00000000  0.06407873  0.151950979  0.269506781  0.050470984
## CompPrice    0.06407873  1.00000000 -0.080653423 -0.024198788 -0.094706516
## Income       0.15195098 -0.08065342  1.000000000  0.058994706 -0.007876994
## Advertising  0.26950678 -0.02419879  0.058994706  1.000000000  0.265652145
## Population   0.05047098 -0.09470652 -0.007876994  0.265652145  1.000000000
## Price       -0.44495073  0.58484777 -0.056698202  0.044536874 -0.012143620
## Age         -0.23181544 -0.10023882 -0.004670094 -0.004557497 -0.042663355
## Education   -0.05195524  0.02519705 -0.056855422 -0.033594307 -0.106378231
##                   Price          Age    Education
## Sales       -0.44495073 -0.231815440 -0.051955242
## CompPrice    0.58484777 -0.100238817  0.025197050
## Income      -0.05669820 -0.004670094 -0.056855422
## Advertising  0.04453687 -0.004557497 -0.033594307
## Population  -0.01214362 -0.042663355 -0.106378231
## Price        1.00000000 -0.102176839  0.011746599
## Age         -0.10217684  1.000000000  0.006488032
## Education    0.01174660  0.006488032  1.000000000
Model with selected predictors: Price, Urban, US
The linear regression suggests a relationship between Price and Sales, given the low p-value of the t-statistic. The coefficient indicates a negative relationship: holding Urban and US fixed, each one-unit increase in Price is associated with a decrease in Sales of about 0.054 thousand, or roughly 54 units.
The linear regression suggests that there isn't a relationship between whether the store is in an urban location and the number of sales, based on the high p-value of the t-statistic for UrbanYes.
The linear regression suggests there is a relationship between whether the store is in the US and the amount of sales. The coefficient indicates a positive relationship between USYes and Sales: holding Price and Urban fixed, a store in the US sells about 1.2 thousand (approximately 1,201) more units on average than a store outside the US.
Model in equation form
Sales = 13.04 - 0.05 Price - 0.02 UrbanYes + 1.20 USYes
lm2 = lm(Sales~Price+Urban+US)
summary(lm2)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -6.9206 -1.6220 -0.0564  1.5786  7.0581
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16
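To make the equation form concrete, the fitted equation can be evaluated by hand. This is only a sketch: the coefficients are copied from the summary output for lm2, while the helper name `predict_sales` and the example price of 100 are made up for illustration.

```r
# Coefficients copied from the summary(lm2) output.
b <- c(intercept = 13.043469, price = -0.054459,
       urban_yes = -0.021916, us_yes = 1.200573)

# Hypothetical helper: predicted Sales (in thousands of units) for one store.
# The dummy variables are 1 for "Yes" and 0 for "No".
predict_sales <- function(price, urban, us) {
  unname(b["intercept"] + b["price"] * price +
    b["urban_yes"] * as.numeric(urban) + b["us_yes"] * as.numeric(us))
}

# Same price, urban store, inside vs. outside the US.
in_us  <- predict_sales(price = 100, urban = TRUE, us = TRUE)
out_us <- predict_sales(price = 100, urban = TRUE, us = FALSE)
print(in_us - out_us)
```

The difference between the two predictions is exactly the USYes coefficient, which is how the "about 1.2 thousand more units" reading of that coefficient arises.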
Fitting a smaller model
Since the Urban variable had a high p-value, let's build a model without it.
lm3 = lm(Sales ~ Price + US)
summary(lm3)
##
## Call:
## lm(formula = Sales ~ Price + US)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -6.9269 -1.6286 -0.0574  1.5766  7.0515
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16
Comparing the models
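Based on the RSE and R^2 of the two linear regressions, they fit the data about equally well, with the smaller model (lm3, Sales ~ Price + US) fitting slightly better: its RSE is 2.469 versus 2.472, and its adjusted R^2 is 0.2354 versus 0.2335. The multiple R^2 is 0.2393 for both.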
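Why the adjusted R^2 improves when Urban is dropped, even though the multiple R^2 is unchanged, can be checked with the standard adjustment formula. The sketch below uses the rounded R^2 printed in the summaries, so its results agree with the printed adjusted values only up to rounding; the helper name `adj_r2` is made up for illustration.

```r
# Adjusted R^2 from R^2, sample size n, and number of predictors p.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# R^2 taken from the two summaries (rounded to 4 decimals there).
adj_lm2 <- adj_r2(0.2393, n = 400, p = 3)  # Price + Urban + US
adj_lm3 <- adj_r2(0.2393, n = 400, p = 2)  # Price + US
print(round(c(lm2 = adj_lm2, lm3 = adj_lm3), 4))
```

With the same R^2, the model with fewer predictors is penalized less, so its adjusted R^2 comes out higher.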
Confidence intervals
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632
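The interval table above has the shape of `confint()` output and presumably came from `confint(lm3)`, though the call is not shown. As a rough check, the same 95% intervals can be rebuilt by hand as estimate plus or minus the t critical value times the standard error. The numbers below are copied from the summary of lm3 and are rounded, so the results match the table only approximately.

```r
# Estimates and standard errors copied from the summary(lm3) output.
est <- c("(Intercept)" = 13.03079, Price = -0.05448, USYes = 1.19964)
se  <- c("(Intercept)" = 0.63098,  Price = 0.00523,  USYes = 0.25846)

# 97.5% quantile of the t distribution with the residual degrees of freedom.
t_crit <- qt(0.975, df = 397)

ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(round(ci, 4))
```

Neither the Price nor the USYes interval contains zero, which is consistent with the small p-values in the summary.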
Looking for outliers
All studentized residuals appear to fall between -3 and 3, so the plot below suggests no potential outliers.
plot(predict(lm3), rstudent(lm3))
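Reading the bound off a plot can be backed up with a direct check. The following is a sketch, assuming the usual rule of thumb of flagging studentized residuals beyond 3 in absolute value; the helper name `flag_outliers` is made up, and the Carseats part runs only if the ISLR package is installed.

```r
# Hypothetical helper: indices of observations whose studentized
# residual exceeds the cutoff in absolute value.
flag_outliers <- function(fit, cutoff = 3) {
  which(abs(rstudent(fit)) > cutoff)
}

# Apply it to the same model as lm3 if the ISLR package is available.
if (requireNamespace("ISLR", quietly = TRUE)) {
  fit <- lm(Sales ~ Price + US, data = ISLR::Carseats)
  print(flag_outliers(fit))
}
```

If the observation above holds, the printed result for the Carseats model should be empty.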
Let's plot the residuals.
par(mfrow=c(2,2))
plot(lm3)
Here is an explanation of these plots: