We are given data about users who visited the site: whether they converted, along with characteristics such as their country, the marketing channel (source), their age, whether they are new or repeat users, and the number of pages visited during that session (a proxy for site activity / time spent on the site). The project is to:
- Predict conversion rate
- Come up with recommendations for the product team and the marketing team to improve conversion rate
I will take the following steps:
1. Create dummy variables for the categorical variables.
2. Set aside a random 10% of the sample as test data and use the remaining rows as training data.
3. Fit a logistic regression and obtain predicted probabilities.
4. Compute the root mean squared error.
5. Evaluate the predictive power of the model.
6. Draw insights.
conversion <- read.csv("~/Google Drive/take_home_challenge/challenge_1/conversion_data.csv", stringsAsFactors = F)
# create dummy variables for all categorical variables
library("dummies")
## dummies-1.5.6 provided by Decision Patterns
conversion <- dummy.data.frame(conversion)
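The dummies package may not be available on newer R installations. A minimal alternative sketch, assuming the raw categorical columns are named country and source: skip explicit dummy creation and let glm() build the encoding internally from factor columns.
conversion.raw <- read.csv("~/Google Drive/take_home_challenge/challenge_1/conversion_data.csv",
                           stringsAsFactors = TRUE)  # factors instead of character columns
# glm(converted ~ age + new_user + total_pages_visited + country + source,
#     data = conversion.raw, family = binomial) would then create one dummy per
# non-reference level of country and source automatically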
testRows <- sample(1:nrow(conversion), 0.1*nrow(conversion)) # 10 percent random sample for test data
test <- conversion[testRows, ]
training <- conversion[-testRows, ]
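Note that the split is random, so the counts reported below will vary from run to run. A minimal reproducibility sketch (not used for the numbers reported here) is to fix the RNG seed before sampling:
# set.seed(123)  # arbitrary seed (an assumption); call this before sample() above to pin down the split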
logit.reg <- glm(formula = converted ~ age + new_user + total_pages_visited + countryChina + countryUS + countryUK + sourceAds + sourceDirect, data = training, family = binomial(link = "logit"))
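The coefficient table below presumably comes from calling summary() on the fitted model:
summary(logit.reg)  # prints estimates, standard errors, z values and p-values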
# Coefficients:
#                       Estimate Std. Error  z value Pr(>|z|)
# (Intercept)          -6.485532   0.114905  -56.442  < 2e-16 ***
# age                  -0.073999   0.002502  -29.579  < 2e-16 ***
# new_user             -1.753462   0.037573  -46.669  < 2e-16 ***
# total_pages_visited   0.758075   0.006545  115.826  < 2e-16 ***
# countryChina         -3.927790   0.141169  -27.823  < 2e-16 ***
# countryUS            -0.631205   0.071361   -8.845  < 2e-16 ***
# countryUK            -0.268258   0.077433   -3.464 0.000531 ***
# sourceAds             0.020707   0.042051    0.492 0.622426
# sourceDirect         -0.137314   0.046554   -2.950 0.003182 **
thresh <- 0.5 # classification threshold (avoid calling it T, which shadows TRUE in R)
test.prob <- predict(logit.reg, test, type = "response")
test.predict <- ifelse(test.prob >= thresh, 1, 0)
conversionRate <- sum(test.predict)/length(test.predict) # share of test users predicted to convert
test.true <- test$converted
# evaluate error: RMSE of the predicted probabilities against the 0/1 outcomes
rmse <- sqrt(mean((test.prob - test.true)^2)) # doesn't depend on the threshold
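To judge whether this RMSE is good, one sketch is to compare it against a constant baseline that predicts the overall training conversion rate for every test user; the model should come in clearly below it.
baseline.prob <- mean(training$converted)                  # overall conversion rate in the training data
baseline.rmse <- sqrt(mean((baseline.prob - test.true)^2)) # RMSE of the naive constant prediction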
# fpr and tpr
sum(test.predict == 1) # total values for which 1 was predicted
## [1] 795
fp <- sum(test.true == 0 & test.predict == 1) # num of negative events wrongly categorized as positive
tp <- sum(test.true[test.predict == 1] == 1)
total.n <- sum(test.true == 0)
total.p <- sum(test.true == 1)
fpr <- fp/total.n # false positive rate: share of non-converters wrongly flagged as conversions = 0.004, which is good
tpr <- tp/total.p # true positive rate: share of actual conversions the model catches = 0.697 ~ 0.7, i.e. about 70% of conversions are predicted as "conversion". Not bad.
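Since FPR and TPR both depend on the chosen threshold, a threshold-free check of how well the model ranks users is the ROC AUC. A sketch, assuming the pROC package is installed:
library("pROC")
roc.obj <- roc(test.true, test.prob) # response first, then predicted probabilities
auc(roc.obj)                         # area under the ROC curve; closer to 1 is better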
# confusion matrix for training and test data
library("caret")
library("e1071")
train.prob <- predict(logit.reg, training, type = "response")
train.predict <- ifelse(train.prob >= thresh, 1, 0) # same 0.5 threshold as for the test set
t(confusionMatrix(factor(train.predict), factor(training$converted), positive = '1')$table)
## Prediction
## Reference 0 1
## 0 274282 1082
## 1 2860 6356
# From this matrix: FPR = 1082/(274282 + 1082) = 0.0039 and TPR = 6356/(2860 + 6356) = 0.69
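caret's confusionMatrix() also reports these rates directly, so they don't have to be computed by hand; a sketch (Sensitivity is the TPR, Specificity is 1 - FPR):
cm.train <- confusionMatrix(factor(train.predict), factor(training$converted), positive = '1')
cm.train$byClass[c("Sensitivity", "Specificity")]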
t(confusionMatrix(factor(test.predict), factor(test$converted), positive = '1')$table)
## Prediction
## Reference 0 1
## 0 30518 118
## 1 307 677
# From this matrix: FPR = 118/(30518 + 118) = 0.004 and TPR = 677/(307 + 677) = 0.69
# similar values of FPR and TPR on the training and test data show we are not overfitting
# threshold selection: the exact threshold is not critical in this context, but let us see how f0(x) and f1(x), the densities of predicted probabilities for non-converters and converters, look.
f0x <- test.prob[test.true == 0] # predicted probabilities for non-converters
f1x <- test.prob[test.true == 1] # predicted probabilities for converters
plot(density(f1x), main = "Density of predicted probabilities", xlab = "predicted probability")
lines(density(f0x), col = "red") # lines() overlays the second density; par(new = TRUE) is not needed
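Rather than only eyeballing the densities, a small sketch that sweeps candidate thresholds and recomputes TPR and FPR on the test set:
for (thr in seq(0.1, 0.9, by = 0.1)) {
  pred <- ifelse(test.prob >= thr, 1, 0)
  cat(sprintf("T = %.1f  TPR = %.3f  FPR = %.3f\n", thr,
              sum(pred == 1 & test.true == 1) / sum(test.true == 1),
              sum(pred == 1 & test.true == 0) / sum(test.true == 0)))
}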
From the density plot above it looks like we would gain by lowering the threshold, so let us try 0.3. Confusion matrix for threshold = 0.3:
test.predict.3 <- ifelse(test.prob >= 0.3, 1, 0) # re-classify the test set at the lower threshold
t(confusionMatrix(factor(test.predict.3), factor(test$converted), positive = '1')$table)
# Prediction
# Reference 0 1
# 0 30296 286 FPR = 0.009
# 1 238 800 TPR = 0.77
Even though the FPR increases slightly, this threshold is better at predicting a conversion when one actually occurs, so I would choose a threshold of 0.3. Since the model has fairly good predictive power, we can extract trustworthy insights from its coefficients.
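One convenient way to read the coefficients is as odds ratios: exponentiating them gives the multiplicative change in the odds of conversion per unit change in each predictor (a sketch).
round(exp(coef(logit.reg)), 3)
# e.g. exp(0.758) ≈ 2.13 for total_pages_visited: each extra page visited roughly
# doubles the odds of conversion, holding the other predictors fixed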
y <- tapply(conversion$converted, conversion$age, mean) # conversion rate at each age
y <- y[ !(names(y) %in% c('111', '123')) ]              # ages 111 and 123 don't make sense, drop them
x <- as.numeric(names(y))                               # the remaining ages, already sorted
plot(x = x, y = y, type = "b", xlab = "age", ylab = "conversion rate")
The plot shows that the probability of conversion decreases with age; try improving the UI to make the site more useful for older users.
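A related check straight from the model is to plot the model-implied conversion probability across ages while holding the other predictors fixed. The profile used below (a new US user arriving via the baseline marketing channel who visited 5 pages, ages 18 to 65) is purely illustrative, an assumption for the sketch.
ages <- 18:65
profile <- data.frame(age = ages, new_user = 1, total_pages_visited = 5,
                      countryChina = 0, countryUS = 1, countryUK = 0,
                      sourceAds = 0, sourceDirect = 0)
plot(ages, predict(logit.reg, profile, type = "response"), type = "l",
     xlab = "age", ylab = "model-implied conversion probability")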