Problem Statement

We are given data about users who visited the site: whether they converted, along with characteristics such as their country, the marketing channel, their age, whether they are repeat users, and the number of pages visited during that session (a proxy for site activity/time spent on site). The project is to:
- Predict conversion rate
- Come up with recommendations for the product team and the marketing team to improve conversion rate


Solution

I will take the following steps:
1. Create dummy variables for the categorical variables.
2. Set aside a random 10% of the sample as test data and use the remaining rows as training data.
3. Fit a logistic regression and get predicted probabilities.
4. Compute the root mean squared error.
5. Evaluate the predictive power of the model.
6. Draw insights.

conversion <- read.csv("~/Google Drive/take_home_challenge/challenge_1/conversion_data.csv", stringsAsFactors = F)

# create dummy variables for all categorical variables
library("dummies")
conversion <- dummy.data.frame(conversion)
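
As an aside, the manual dummy step can be skipped: base R's glm() treatment-codes factor columns automatically. A minimal sketch, assuming the raw CSV is re-read with strings kept as factors:

# alternative sketch: let glm() encode the categorical columns itself
raw <- read.csv("~/Google Drive/take_home_challenge/challenge_1/conversion_data.csv", stringsAsFactors = TRUE)
fit <- glm(converted ~ age + new_user + total_pages_visited + country + source, data = raw, family = binomial(link = "logit"))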

testRows <- sample(1:nrow(conversion), 0.1*nrow(conversion)) # 10 percent random sample for test data; call set.seed() first for a reproducible split
test <- conversion[testRows, ]
training <- conversion[-testRows, ]

logit.reg <- glm(formula = converted ~ age + new_user + total_pages_visited + countryChina + countryUS + countryUK + sourceAds + sourceDirect, data = training, family = binomial(link = "logit"))
summary(logit.reg)

# Coefficients:
                     # Estimate Std. Error z value Pr(>|z|)
# (Intercept)         -6.485532   0.114905 -56.442  < 2e-16 ***
# age                 -0.073999   0.002502 -29.579  < 2e-16 ***
# new_user            -1.753462   0.037573 -46.669  < 2e-16 ***
# total_pages_visited  0.758075   0.006545 115.826  < 2e-16 ***
# countryChina        -3.927790   0.141169 -27.823  < 2e-16 ***
# countryUS           -0.631205   0.071361  -8.845  < 2e-16 ***
# countryUK           -0.268258   0.077433  -3.464 0.000531 ***
# sourceAds            0.020707   0.042051   0.492 0.622426
# sourceDirect        -0.137314   0.046554  -2.950 0.003182 **
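
Since logistic regression coefficients are on the log-odds scale, exponentiating them gives each variable's multiplicative effect on the odds of conversion:

exp(coef(logit.reg))  # multiplicative effect of each variable on the odds
# e.g. each additional page visited multiplies the odds by exp(0.758) ~ 2.13,
# while each extra year of age multiplies them by exp(-0.074) ~ 0.93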

threshold <- 0.5 # classification threshold (avoid naming it T, which would mask R's built-in TRUE)
test.prob <- predict(logit.reg, test, type = "response")
test.predict <- ifelse(test.prob >= threshold, 1, 0)
conversionRate <- sum(test.predict)/length(test.predict) # fraction of test users predicted to convert

test.true <- test$converted
# evaluate error
rmse <- sqrt(mean((test.prob - test.true)^2))  # doesn't depend on the threshold
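
Another threshold-free way to gauge predictive power is the area under the ROC curve; a minimal sketch, assuming the pROC package is installed:

library("pROC")
roc.obj <- roc(test.true, test.prob)  # ROC curve from true labels and predicted probabilities
auc(roc.obj)                          # AUC closer to 1 means better discrimination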

# fpr and tpr
sum(test.predict == 1)  # total values for which 1 was predicted
## [1] 795
fp <- sum(test.true == 0 & test.predict == 1) # num of negative events wrongly categorized as positive
tp <- sum(test.true[test.predict == 1] == 1)
total.n <- sum(test.true  == 0)
total.p <- sum(test.true == 1)
fpr <- fp/total.n # prob of predicting a conversion when none occurred; 0.004, which is good
tpr <- tp/total.p # prob of predicting a conversion when one occurred; 0.697 ~ 0.7, i.e. about 70% of true conversions are caught. Not bad.

# confusion matrix for training and test data
library("caret")
library("e1071")
train.prob <- predict(logit.reg, training, type = "response")
train.predict <- ifelse(train.prob >= 0.5, 1, 0)

t(confusionMatrix(factor(train.predict), factor(training$converted), positive = '1')$table)
##          Prediction
## Reference      0      1
##         0 274282   1082
##         1   2860   6356
# FPR = 1082/275364 = 0.0039
# TPR = 6356/9216 = 0.69

t(confusionMatrix(factor(test.predict), factor(test$converted), positive = '1')$table)
##          Prediction
## Reference     0     1
##         0 30518   118
##         1   307   677
# FPR = 118/30636 = 0.0039
# TPR = 677/984 = 0.69

# similar FPR and TPR on the training and test data suggest we are not overfitting

# threshold selection: although the choice of threshold is not critical in this context, let's see how f0(x) and f1(x) look (the densities of predicted probabilities among non-converters and converters, respectively)
f0x <- test.prob[test.true == 0]
f1x <- test.prob[test.true == 1]
plot(density(f1x), main = "Densities of predicted probabilities", xlab = "predicted probability")
lines(density(f0x), col = "red")  # lines() draws onto the existing plot, so par(new = ...) is unnecessary
legend("topright", legend = c("f1(x): converters", "f0(x): non-converters"), col = c("black", "red"), lty = 1)
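
To make the comparison concrete, a small sweep over candidate thresholds (the grid here is just an illustrative choice) shows the TPR/FPR trade-off on the test set:

for (th in c(0.2, 0.3, 0.4, 0.5)) {
  pred <- ifelse(test.prob >= th, 1, 0)
  cat(sprintf("threshold = %.1f  TPR = %.3f  FPR = %.3f\n", th,
              sum(pred == 1 & test.true == 1) / total.p,
              sum(pred == 1 & test.true == 0) / total.n))
}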


From the plot of the two distributions it looks like we would gain by reducing the threshold; let us try 0.3. Confusion matrix for threshold = 0.3:

test.predict <- ifelse(test.prob >= 0.3, 1, 0)
t(confusionMatrix(factor(test.predict), factor(test$converted), positive = '1')$table)
#          Prediction
# Reference     0     1
#         0 30296   286   FPR = 0.009
#         1   238   800   TPR = 0.77


Insights

Even though the FPR increases slightly, this threshold is better at predicting a conversion when one actually occurs, so I would choose a threshold of 0.3. Since our model has fairly good predictive power, we can extract trustworthy insights from the coefficients.

  1. Younger people are more likely to convert: use marketing channels that are popular with young people.
  2. Users with older accounts are more likely to convert: send targeted emails with offers to bring them back to the site.
  3. Germans seem to convert well but are currently a small percentage of the users: market more to Germans (e.g., a German translation or a site version specific to Germany).
  4. Check whether something is wrong with the Chinese version of the site, since there are many Chinese users but they convert far less.
  5. Let's plot conversion rate against age.
# conversion rate by age; ages 111 and 123 are implausible and likely data errors
x <- sort(unique(conversion$age))
x <- x[!x %in% c(111, 123)]

y <- tapply(conversion$converted, conversion$age, mean)  # mean conversion rate per age
y <- y[!names(y) %in% c("111", "123")]
plot(x = x, y = y, type = "b", xlab = "age", ylab = "conversion rate")

The plot shows that the probability of conversion decreases with age; try improving the UI to make the site more usable for older people.

  6. Since total_pages_visited signals purchase intent, users with a high total_pages_visited can be sent offers; send reminders to people who have something in the cart. A sketch of how such a target list could be built follows below.
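
A minimal sketch of building that target list; the cutoff of 10 pages is an illustrative assumption, not something derived from the data:

# flag high-intent users who have not (yet) converted
# (the >= 10 page cutoff is an assumed, illustrative threshold)
high.intent <- conversion[conversion$converted == 0 & conversion$total_pages_visited >= 10, ]
nrow(high.intent)  # number of candidate users for offers / cart reminders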