The Kaggle House Prices challenge provides a data set with detailed information on houses in Ames, Iowa, covering everything from the number of bedrooms to the quality of the basement. The challenge is to predict house prices from this data. I used penalized linear regression models, ridge and lasso, and both give fairly good prediction accuracy: the minimum mean cross-validation training error is around 0.016 with a standard deviation of about 0.002 for both models.

train <- read.csv("~/Google Drive/kaggle/houseprices/train.csv", stringsAsFactors = F)
#Let's look at missing values in numeric variables to begin with.


Missing values

Below is a quick plot to visualize the density of missing values in all the numeric variables. Each variable occupies one horizontal band, and each black spike corresponds to a missing value, so a band with a high density of spikes indicates a variable with many missing values. From the figure below, 3 numeric variables have missing values; LotFrontage has a large number of them.

# missing values in LotFrontage (259), MasVnrArea (8), GarageYrBlt (81). Let's visualize.
numeric_train <- train[sapply(train, is.numeric)] 
plot_start = 2  # skip variable Id
plot_end = 38   # length(numeric_train) = 38
plots = plot_end - plot_start + 1  # number of variables plotted, one horizontal band each

# draw the first variable as the bottom band; missing values show up as spikes
plot(y = as.numeric(is.na(numeric_train[, plot_start])) / plots, x = 1:nrow(numeric_train),
     type = "l", ylim = c(0, 1))

# stack the remaining variables above it, one band per variable
col = plot_start + 1
n = 2 
for (i in c(plot_start:(plot_end - 1))) {
    lines(y = as.numeric(is.na(numeric_train[, col])) / plots + (n - 1) / plots,
          x = 1:nrow(numeric_train), type = "l")
    col = col + 1
    n = n + 1
}


Let’s see the number of missing values in each column of the test set.

test <- read.csv("~/Google Drive/kaggle/houseprices/test.csv", stringsAsFactors = F)
missing.test <- sapply(test, function(x) {sum(is.na(x))})   #number of missing data in each column of test
missing.test[missing.test != 0] # printing only non-zero values
##     MSZoning  LotFrontage        Alley    Utilities  Exterior1st 
##            4          227         1352            2            1 
##  Exterior2nd   MasVnrType   MasVnrArea     BsmtQual     BsmtCond 
##            1           16           15           44           45 
## BsmtExposure BsmtFinType1   BsmtFinSF1 BsmtFinType2   BsmtFinSF2 
##           44           42            1           42            1 
##    BsmtUnfSF  TotalBsmtSF BsmtFullBath BsmtHalfBath  KitchenQual 
##            1            1            2            2            1 
##   Functional  FireplaceQu   GarageType  GarageYrBlt GarageFinish 
##            2          730           76           78           78 
##   GarageCars   GarageArea   GarageQual   GarageCond       PoolQC 
##            1            1           78           78         1456 
##        Fence  MiscFeature     SaleType 
##         1169         1408            1


Combine train and test so that all missing values can be dealt with in one go.

# create indicators train. train = 1 for train and = 0 for test.
train$train <- 1
test$train <- 0

# exclude SalePrice from train. Now train and test can be combined into a single df.
y <- train$SalePrice
train$SalePrice <- NULL

# combine train and test data.  
train.test <- rbind(train, test, stringsAsFactors = F)


Sometimes missing values are not really missing: they simply mark the absence of a feature. For example, for houses with no pool, the pool quality is NA; I fill these with a new quality level ‘no’. A house that has a pool but an NA for pool quality is a real missing value. Let us separate out the feature-absence NAs and fill them first.

# NA in PoolQC when PoolArea = 0 is "no" pool, therefore "no" PoolQC. 
train.test$PoolQC <- ifelse(train.test$PoolArea == 0, "no", train.test$PoolQC)
# sum(is.na(train.test$PoolQC)) the remaining are missing values. PoolQC has 3 missing values
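
A quick check for the genuinely missing cases, i.e. houses with a non-zero pool area but no recorded pool quality (a small sketch using the objects above):

sum(is.na(train.test$PoolQC) & train.test$PoolArea > 0) # should match the 3 missing values noted above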


Similarly for the rest of the variables whose NAs mark the absence of a feature.

train.test[ , c("Fence", "Alley", "MiscFeature")] <- 
    lapply(train.test[ , c("Fence", "Alley", "MiscFeature")], 
           function(x) {ifelse(is.na(x), "no", x)}) # NA in these variables means the feature is absent

train.test$FireplaceQu <- ifelse(is.na(train.test$FireplaceQu) & train.test$Fireplaces == 0, "no", train.test$FireplaceQu) # 730 houses in test don't have a fireplace and FireplaceQu is NA for 730 houses
# sum(is.na(train.test$FireplaceQu)) = 0 


From the data dictionary, a missing LotFrontage implies there is no street connected to the property, so fill the missing values with 0.

train.test[is.na(train.test$LotFrontage), "LotFrontage"] <- 0 # there is no street connected to property.

# if TotalBsmtSF == 0, NA in other bsmt variables is not missing info.
# ex: sum(is.na(test$BsmtQual) & test$TotalBsmtSF == 0, na.rm = T )  =  41

train.test[ , c("BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinSF1", "BsmtFinType2" )] <- 
lapply(train.test[ , c("BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinSF1", "BsmtFinType2" )], 
function(x) {ifelse(is.na(x) & train.test$TotalBsmtSF == 0, "no", x)} )
# if TotalBsmtSF != 0 or na, NA in other bsmt variables is missing info.

# GarageArea = 0 means there is no garage, so NA in the other garage variables is not missing info.
# (note: ifelse() coerces the numeric columns GarageYrBlt and GarageCars to character here)
train.test[ , c("GarageQual", "GarageCond", "GarageFinish", "GarageType", "GarageYrBlt", "GarageCars")] <- 
lapply(train.test[ , c("GarageQual", "GarageCond", "GarageFinish", "GarageType", "GarageYrBlt", "GarageCars")], 
function(x) {ifelse(is.na(x) & train.test$GarageArea == 0, "no", x)})


Now the remaining NAs are genuinely missing values.

missing.train.test <- sapply(train.test, function(x) {sum(is.na(x))})   #number of missing data in each column of train.test
missing.train.test[missing.train.test != 0]
##     MSZoning    Utilities  Exterior1st  Exterior2nd   MasVnrType 
##            4            2            1            1           24 
##   MasVnrArea     BsmtQual     BsmtCond BsmtExposure BsmtFinType1 
##           23            3            4            4            1 
##   BsmtFinSF1 BsmtFinType2   BsmtFinSF2    BsmtUnfSF  TotalBsmtSF 
##            1            2            1            1            1 
##   Electrical BsmtFullBath BsmtHalfBath  KitchenQual   Functional 
##            1            2            2            1            2 
##  GarageYrBlt GarageFinish   GarageCars   GarageArea   GarageQual 
##            2            2            1            1            2 
##   GarageCond       PoolQC     SaleType 
##            2            3            1
# na in remaining variables is missing values

# filling these missing values with mode of the variable. 
mode <- function(x){
    # returns mode of vector, if not unique returns first element
    y = table(x)
    if (is.numeric(x)) mode = as.numeric(dimnames(y)$x)[y == max(y)]
    else if (is.character(x)) mode = dimnames(y)$x[y == max(y)]
    return(mode[1])
}
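
As a quick sanity check of this helper (a sketch; table() drops NAs by default, so missing values do not influence the mode):

mode(c("a", "b", "b", NA))  # returns "b"
mode(c(2, 5, 5, 7, NA))     # returns 5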


train.test <- as.data.frame(lapply(train.test, function(x) {x <- ifelse(is.na(x), mode(x), x)}))

Now let’s check if we’re done with missing values.

missing.train.test <- sapply(train.test, function(x) {sum(is.na(x))})   #number of missing data in each column of train.test
missing.train.test[missing.train.test != 0]
## named integer(0)


Looks like it! Let’s move on to estimation.

Estimation

# 1. Separate out the train and test part from train.test

train <- subset(train.test, train == 1)
test <- subset(train.test, train == 0)

# put back SalePrice into train

train$SalePrice <- y

# drop variable train.
train$train <- NULL
test$train <- NULL


Log transform skewed variables
# let's look at the distribution of sales prices
par(mfrow = c(1,2), mar = c(3,2,3,2))
hist(train$SalePrice) # looks skewed to the right
hist(log(train$SalePrice)) #looks normal! 
Figure: distribution of sale prices before and after log transform


The log-transformed variable looks normal. Therefore I log transform SalePrice and the other right-skewed variables.

library("moments")
numeric_train <- train[sapply(train, is.numeric)] # numeric features of data set
numeric_train[ , c("MSSubClass", "Id")] <- NULL  
skewed_feats <- sapply(numeric_train, skewness) > 0.75 # TRUE for skewed variables
train[names(skewed_feats)[skewed_feats]] <- NULL  # deleting skewed features from training set

# log transform skewed features
skewed_feats <- log(numeric_train[skewed_feats] + 1) 
train <- cbind(train, skewed_feats)

# log transform test values
numeric_test <- test[sapply(test, is.numeric)] # numeric features of data set
numeric_test[ , c("MSSubClass", "Id")] <- NULL  
skewed_feats <- sapply(numeric_test, skewness) > 0.75 # TRUE for skewed variables
test[names(skewed_feats)[skewed_feats]] <- NULL  # deleting skewed features from testing set
skewed_feats <- log(numeric_test[skewed_feats] + 1) 
test <- cbind(test, skewed_feats)
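
Since the skewed features are identified separately on the train and test data, it is worth confirming that the two data frames still line up column by column before recombining them below (a small sanity-check sketch; both calls should return an empty result):

# columns in train (other than SalePrice) missing from test, and vice versa
setdiff(setdiff(names(train), "SalePrice"), names(test))
setdiff(names(test), names(train))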


Create dummy variables for all categorical variables
# combine train and test and create dummy variables for all categorical variables
train$train <- 1
test$train <- 0
y <- train$SalePrice
train$SalePrice <- NULL
train.test <- rbind(train, test, stringsAsFactors = F)
library("dummies")
train.test$MSSubClass <- as.character(train.test$MSSubClass)
train.test <- dummy.data.frame(train.test) 

# 4. separate out train for training the model
train <- subset(train.test, train == 1)
test <- subset(train.test, train == 0)
train$SalePrice <- y
train$train <- NULL
test$train <- NULL


Fit linear model
# fit penalized linear models and choose lambda, which determines the flexibility of the model

library("glmnet")
# creating matrices, x and y for glmnet 
x <- model.matrix(SalePrice ~ . , train[names(train) != "Id"])[ , -1] # drop the intercept column so x has the same columns as the test matrix used for prediction later
y <- train$SalePrice

lambda = c(0.001, 0.002, 0.006, 0.01, 0.1, 0.3, 0.5, 1, 1.5, 3, 6, 8, 10, 15,20, 40, 50, 60, 80, 100)
ridge.reg <- cv.glmnet(x, y, alpha = 0, lambda = lambda) 
plot(ridge.reg$lambda[ridge.reg$lambda < 20], ridge.reg$cvm[ridge.reg$lambda < 20], typ = "b", xlab = "lambda", ylab = "cross validated error mean")  # plot of the cvm.

optimum_lambda <- ridge.reg$lambda.min
y_ridge <- predict(ridge.reg, x, optimum_lambda)
cvm_ridge <- min(ridge.reg$cvm)
cvsd_ridge <- ridge.reg$cvsd[ridge.reg$cvm == cvm_ridge]
error_ridge <- abs(y_ridge - y)
rmse_prices <- sqrt(mean((exp(y) - exp(y_ridge))^2))
rmse_ridge <- sqrt((mean((y - y_ridge)^2)))

# fit lasso model and choose optimum lambda 

lasso.reg <- cv.glmnet(x, y, alpha = 1, lambda = lambda)
optimum_lambda <- lasso.reg$lambda.min
y_lasso  <- predict(lasso.reg, x, optimum_lambda)
cvm_lasso <- min(lasso.reg$cvm)
cvsd_lasso <- lasso.reg$cvsd[lasso.reg$cvm == cvm_lasso]
lasso.coef <- coef(lasso.reg)
error_lasso <- abs(y_lasso - y)
rmse_lasso <- sqrt((mean((y - y_lasso)^2)))


Both models give a similar cross-validation error mean and standard deviation, with the lasso regression performing slightly better. For ridge regression, the minimum mean cross-validation error is 0.016 with a standard deviation of 0.0026. For lasso, the minimum cross-validation error is 0.016 with a standard deviation of 0.0018, achieved when lambda is 0.006. Based on these cross-validation errors, the models seem to be doing a good job of predicting sale price.
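
A small sketch that tabulates these quantities side by side from the objects computed above:

# cross-validated error of ridge and lasso, side by side
data.frame(model  = c("ridge", "lasso"),
           lambda = c(ridge.reg$lambda.min, lasso.reg$lambda.min),
           cvm    = c(cvm_ridge, cvm_lasso),
           cvsd   = c(cvsd_ridge, cvsd_lasso),
           rmse   = c(rmse_ridge, rmse_lasso))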

par(mfrow = c(1,2))
plot(train$SalePrice, error_ridge)
plot(train$SalePrice, error_lasso)
Figure: error vs sale price for ridge and lasso


Since most values of the log of sale price lie around 12, the model predicts well in this price range; for prices far from it, the error increases. In addition, the error plots for the ridge and lasso models are similar, in agreement with the RMSE calculations.
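
As a rough numerical check of this pattern (a sketch; the bin boundaries are arbitrary):

# mean absolute training error of the lasso fit by log sale price bin
price_bin <- cut(y, breaks = c(-Inf, 11.5, 12.5, Inf), labels = c("low", "mid", "high"))
tapply(as.numeric(error_lasso), price_bin, mean)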


Finally, predict on the test data and create a submission data frame to upload to Kaggle.

#create x for prediction
test.id <- test$Id
test$Id <- NULL
test.x <- as.matrix(test)

# predicting using the fitted model 

#optimum.ridge.reg <- glmnet(x, y, alpha = 0, lambda = optimum_lambda)
y.test = predict(ridge.reg, newx = test.x, s = "lambda.min")

#optimum.lasso.reg <- glmnet(x, y, alpha = 1, lambda = optimum_lambda)
y.test.lasso = predict(lasso.reg, newx = test.x, s = "lambda.min")

# creating data frame with id and sale price 
price <- exp(y.test.lasso) -1 
submission <- data.frame(Id = test.id, SalePrice = as.numeric(price), stringsAsFactors = F)
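
To produce the file for upload, the submission data frame can be written to CSV (the file name here is just an example):

write.csv(submission, "lasso_submission.csv", row.names = FALSE)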