The Kaggle House Prices challenge provides detailed information on houses in Ames, Iowa, from the number of bedrooms to the quality of the basement. The task is to predict sale prices from this data. I fit two penalized linear regression models, ridge and lasso, and both predict prices fairly accurately: the minimum mean cross-validation training error is around 0.017 with a standard deviation of 0.002 for both models.
train <- read.csv("~/Google Drive/kaggle/houseprices/train.csv", stringsAsFactors = F)
#Let's look at missing values in numeric variables to begin with.
Let's write a quick routine to visualize the density of missing values across all numeric variables. Each black line corresponds to a missing value, so a band with a high density of black lines indicates a variable with many missing values. From the figure below, three numeric variables have missing values, and LotFrontage has by far the most.
# missing values in LotFrontage (259), MasVnrArea (8), GarageYrBlt (81). Let's visualize.
numeric_train <- train[sapply(train, is.numeric)]
plot_start <- 2 # skip variable Id
plot_end <- 38  # length(numeric_train) = 38
plots <- plot_end - plot_start + 1
# each variable gets its own horizontal band; a black spike marks a missing value
plot(y = as.numeric(is.na(numeric_train[, plot_start])) / plots,
     x = 1:nrow(numeric_train), type = "l", ylim = c(0, 1),
     xlab = "observation", ylab = "variable")
for (n in 2:plots) {
  col <- plot_start + n - 1
  lines(y = as.numeric(is.na(numeric_train[, col])) / plots + (n - 1) / plots,
        x = 1:nrow(numeric_train))
}
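If you only need the counts rather than where the NAs fall, a simpler sketch with colSums gives the same information:
# count missing values per numeric column and print the non-zero ones
na_counts <- colSums(is.na(numeric_train))
na_counts[na_counts > 0]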
Let’s see the number of missing values in each column.
test <- read.csv("~/Google Drive/kaggle/houseprices/test.csv", stringsAsFactors = F)
missing.test <- sapply(test, function(x) {sum(is.na(x))}) #number of missing data in each column of test
missing.test[missing.test != 0] # printing only non-zero values
## MSZoning LotFrontage Alley Utilities Exterior1st
## 4 227 1352 2 1
## Exterior2nd MasVnrType MasVnrArea BsmtQual BsmtCond
## 1 16 15 44 45
## BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2
## 44 42 1 42 1
## BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath KitchenQual
## 1 1 2 2 1
## Functional FireplaceQu GarageType GarageYrBlt GarageFinish
## 2 730 76 78 78
## GarageCars GarageArea GarageQual GarageCond PoolQC
## 1 1 78 78 1456
## Fence MiscFeature SaleType
## 1169 1408 1
Combine train and test so we can deal with all missing values in one go.
# create indicators train. train = 1 for train and = 0 for test.
train$train <- 1
test$train <- 0
# exclude SalePrice from train. Now train and test can be combined into a single df.
y <- train$SalePrice
train$SalePrice <- NULL
# combine train and test data.
train.test <- rbind(train, test, stringsAsFactors = F)
Sometimes missing values are not really missing: they simply encode the absence of a feature. For example, for houses with no pool, PoolQC is NA. I fill these with a new quality level, ‘no’. A house that has a pool but an NA for pool quality is a genuine missing value. Let us separate out the not-really-missing values and fill them first.
# NA in PoolQC when PoolArea = 0 is "no" pool, therefore "no" PoolQC.
train.test$PoolQC <- ifelse(train.test$PoolArea == 0, "no", train.test$PoolQC)
# sum(is.na(train.test$PoolQC)) the remaining are missing values. PoolQC has 3 missing values
The same logic applies to the rest of the variables with missing values.
train.test[ ,c("Fence", "Alley", "MiscFeature") ] <- lapply(train.test[ ,c("Fence", "Alley", "MiscFeature") ], function(x) {ifelse(is.na(train.test$Fence), "no", train.test$Fence)})
# 730 houses in test have no fireplace, so FireplaceQu is NA for them
train.test$FireplaceQu <- ifelse(is.na(train.test$FireplaceQu) & train.test$Fireplaces == 0,
                                 "no", train.test$FireplaceQu)
# sum(is.na(train.test$FireplaceQu)) = 0
From the data dictionary, a missing LotFrontage implies there is no street connected to the property, so fill these missing values with 0.
train.test[is.na(train.test$LotFrontage), "LotFrontage"] <- 0 # there is no street connected to property.
# if TotalBsmtSF == 0, NA in other bsmt variables is not missing info.
# ex: sum(is.na(test$BsmtQual) & test$TotalBsmtSF == 0, na.rm = T ) = 41
train.test[ , c("BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinSF1", "BsmtFinType2" )] <-
lapply(train.test[ , c("BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinSF1", "BsmtFinType2" )],
function(x) {ifelse(is.na(x) & train.test$TotalBsmtSF == 0, "no", x)} )
# if TotalBsmtSF != 0 (or is itself NA), an NA in the other bsmt variables is genuinely missing info.
# same idea for garages: GarageArea == 0 means there is no garage.
train.test[ , c("GarageQual", "GarageCond", "GarageFinish", "GarageType", "GarageYrBlt", "GarageCars")] <-
  lapply(train.test[ , c("GarageQual", "GarageCond", "GarageFinish", "GarageType", "GarageYrBlt", "GarageCars")],
         function(x) {ifelse(is.na(x) & train.test$GarageArea == 0, "no", x)})
Now the remaining missing values are genuinely missing.
missing.train.test <- sapply(train.test, function(x) {sum(is.na(x))}) #number of missing data in each column of train.test
missing.train.test[missing.train.test != 0]
## MSZoning Utilities Exterior1st Exterior2nd MasVnrType
## 4 2 1 1 24
## MasVnrArea BsmtQual BsmtCond BsmtExposure BsmtFinType1
## 23 3 4 4 1
## BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1 2 1 1 1
## Electrical BsmtFullBath BsmtHalfBath KitchenQual Functional
## 1 2 2 1 2
## GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 2 2 1 1 2
## GarageCond PoolQC SaleType
## 2 3 1
# na in remaining variables is missing values
# filling these missing values with mode of the variable.
mode <- function(x) {
  # returns the modal value of a vector; if not unique, returns the first
  # (note: this shadows base::mode, which reports storage mode, not the statistical mode)
  y <- table(x)
  if (is.numeric(x)) mode <- as.numeric(dimnames(y)$x)[y == max(y)]
  else if (is.character(x)) mode <- dimnames(y)$x[y == max(y)]
  return(mode[1])
}
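A quick sanity check of the helper on toy vectors (not part of the original script):
mode(c("a", "b", "b", "c")) # returns "b"
mode(c(1, 1, 2, 5)) # returns 1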
train.test <- as.data.frame(lapply(train.test, function(x) {ifelse(is.na(x), mode(x), x)}))
Now let’s check if we’re done with missing values.
missing.train.test <- sapply(train.test, function(x) {sum(is.na(x))}) #number of missing data in each column of train.test
missing.train.test[missing.train.test != 0]
## named integer(0)
Looks like it! Let’s move on to estimation.
# separate out the train and test parts from train.test
train <- subset(train.test, train == 1)
test <- subset(train.test, train == 0)
# put back SalePrice into train
train$SalePrice <- y
# drop variable train.
train$train <- NULL
test$train <- NULL
# let's look at the distribution of sales prices
par(mfrow = c(1,2), mar = c(3,2,3,2))
hist(train$SalePrice) # looks skewed to the right
hist(log(train$SalePrice)) #looks normal!
The log-transformed variable looks normal, so I log transform SalePrice and the other right-skewed variables.
library("moments")
numeric_train <- train[sapply(train, is.numeric)] # numeric features of data set
numeric_train[ , c("MSSubClass", "Id")] <- NULL
skewed_feats <- sapply(numeric_train, skewness) > 0.75 # TRUE for skewed variables
train[names(skewed_feats)[skewed_feats]] <- NULL # deleting skewed features from training set
# log transform skewed features (SalePrice is among them, so the target
# becomes log(SalePrice + 1) from here on)
skewed_feats <- log(numeric_train[skewed_feats] + 1)
train <- cbind(train, skewed_feats)
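As a sanity check, the transformed columns should now be far less skewed (a quick sketch):
sapply(skewed_feats, skewness) # values should now be much closer to 0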
# log transform test values
numeric_test <- test[sapply(test, is.numeric)] # numeric features of data set
numeric_test[ , c("MSSubClass", "Id")] <- NULL
skewed_feats <- sapply(numeric_test, skewness) > 0.75 # TRUE for skewed variables
test[names(skewed_feats)[skewed_feats]] <- NULL # deleting skewed features from testing set
skewed_feats <- log(numeric_test[skewed_feats] + 1)
test <- cbind(test, skewed_feats)
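One caveat: computing skewness separately on train and test can, in principle, flag different columns (test has no SalePrice, and its skewness values differ slightly), which would leave the two sets with mismatched features. A more defensive variant, sketched below with hypothetical helper variables common and skewed_cols, would select the skewed set once on the training data and reuse it for test:
# alternative: pick skewed columns on train only, then reuse for test
# common <- intersect(names(numeric_train), names(numeric_test))
# skewed_cols <- names(which(sapply(numeric_train[common], skewness) > 0.75))
# test[skewed_cols] <- NULL
# test <- cbind(test, log(numeric_test[skewed_cols] + 1))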
# combine train and test and create dummy variables for all categorical variables
train$train <- 1
test$train <- 0
y <- train$SalePrice
train$SalePrice <- NULL
train.test <- rbind(train, test, stringsAsFactors = F)
library("dummies")
train.test$MSSubClass <- as.character(train.test$MSSubClass) # MSSubClass is a categorical code, not a quantity
train.test <- dummy.data.frame(train.test)
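For reference, dummy.data.frame expands each character (or factor) column into one 0/1 indicator column per level. A toy example (hypothetical, not part of the pipeline):
# df <- data.frame(g = c("a", "b", "a"), v = 1:3, stringsAsFactors = FALSE)
# dummy.data.frame(df) # v is unchanged; g becomes indicator columns ga and gb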
# separate out train for training the model
train <- subset(train.test, train == 1)
test <- subset(train.test, train == 0)
train$SalePrice <- y
train$train <- NULL
test$train <- NULL
# fit penalized linear models; lambda scales the penalty and hence the model's flexibility
# (ridge, alpha = 0, penalizes squared coefficients; lasso, alpha = 1, penalizes absolute values)
library("glmnet")
# creating matrices, x and y for glmnet
# drop the intercept column from the design matrix; glmnet adds its own,
# and this keeps the columns aligned with the test matrix built later
x <- model.matrix(SalePrice ~ . , train[names(train) != "Id"])[ , -1]
y <- train$SalePrice
lambda <- c(0.001, 0.002, 0.006, 0.01, 0.1, 0.3, 0.5, 1, 1.5, 3, 6, 8, 10, 15, 20, 40, 50, 60, 80, 100)
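Note that cv.glmnet can also generate its own lambda sequence, and it uses 10-fold cross-validation by default; the hand-picked grid above just keeps the run reproducible. An equivalent call without the grid would be:
# ridge.reg <- cv.glmnet(x, y, alpha = 0) # glmnet picks its own lambda path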
ridge.reg <- cv.glmnet(x, y, alpha = 0, lambda = lambda)
plot(ridge.reg$lambda[ridge.reg$lambda < 20], ridge.reg$cvm[ridge.reg$lambda < 20],
     type = "b", xlab = "lambda", ylab = "cross-validated error mean") # cvm against lambda
optimum_lambda <- ridge.reg$lambda.min
y_ridge <- predict(ridge.reg, x, s = optimum_lambda) # in-sample fit at the best lambda
cvm_ridge <- min(ridge.reg$cvm)
cvsd_ridge <- ridge.reg$cvsd[ridge.reg$cvm == cvm_ridge]
error_ridge <- abs(y_ridge - y)
rmse_prices <- sqrt(mean((exp(y) - exp(y_ridge))^2))
rmse_ridge <- sqrt((mean((y - y_ridge)^2)))
# fit lasso model and choose optimum lambda
lasso.reg <- cv.glmnet(x, y, alpha = 1, lambda = lambda)
optimum_lambda <- lasso.reg$lambda.min
y_lasso <- predict(lasso.reg, x, s = optimum_lambda)
cvm_lasso <- min(lasso.reg$cvm)
cvsd_lasso <- lasso.reg$cvsd[lasso.reg$cvm == cvm_lasso]
lasso.coef <- coef(lasso.reg) # lasso sets some coefficients exactly to zero (feature selection)
error_lasso <- abs(y_lasso - y)
rmse_lasso <- sqrt((mean((y - y_lasso)^2)))
Both models give a similar cross-validation error mean and standard deviation, with the lasso performing slightly better. For ridge regression, the minimum CV error mean is 0.016 with a standard deviation of 0.0026. For the lasso, the minimum CV error is 0.016 with a standard deviation of 0.0018, achieved at lambda = 0.006. Based on these CV errors, the models seem to be doing a good job of predicting sale price.
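To put those numbers side by side, here is a small sketch using the objects computed above:
# side-by-side cross-validation summary for the two models
data.frame(model = c("ridge", "lasso"),
           cv_error_mean = c(cvm_ridge, cvm_lasso),
           cv_error_sd = c(cvsd_ridge, cvsd_lasso),
           lambda_min = c(ridge.reg$lambda.min, lasso.reg$lambda.min))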
par(mfrow = c(1,2))
plot(train$SalePrice, error_ridge)
plot(train$SalePrice, error_lasso)
Since most values of log sale price lie around 12, the model predicts well in this range; for prices further out, the error increases. The error plots for the ridge and lasso models are similar, in agreement with the RMSE calculations.
Finally, predict on the test data and create the submission data frame for Kaggle.
#create x for prediction
test.id <- test$Id
test$Id <- NULL
test.x <- as.matrix(test) # columns line up with x since train and test were dummy-coded together
# predict with the cross-validated fits at the best lambda; no refit is needed,
# since predict() on a cv.glmnet object accepts s = "lambda.min" directly
y.test <- predict(ridge.reg, newx = test.x, s = "lambda.min")
y.test.lasso <- predict(lasso.reg, newx = test.x, s = "lambda.min")
# create the submission data frame with Id and predicted sale price
price <- exp(y.test.lasso) - 1 # invert the log(price + 1) transform
submission <- data.frame(Id = test.id, SalePrice = as.numeric(price), stringsAsFactors = F)
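All that remains is writing the file out; a minimal sketch, assuming the standard Kaggle CSV format (the file name is arbitrary):
# write the submission file for upload to Kaggle
write.csv(submission, "submission.csv", row.names = FALSE)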