Fraud Detection

Solution:

Build a model that predicts whether a user has a high probability of using the site to perform some illegal activity. 1. Create features and visualize importance of features in predicting probability of default. 2. Fit model 3. Use model to decide which user transaction is fraudulent 4. Find variable importance and compare with insights from plots.

Get user’s country using the ip_address provided and data frame ip_country.

library(data.table)
library("bit64")

fraud <- fread( "~/Google Drive/take_home_challenge/challenge_4/Fraud/fraud.csv")
library("bit64")
ip_country <- fread( "~/Google Drive/take_home_challenge/challenge_4/Fraud/IpAddress_to_Country.csv")

country.ip <- function(ip) {
     
     x <- ip_country[ lower_bound_ip_address <= ip & upper_bound_ip_address >= ip, country ]
     if (length(x) == 1) return(x)
     else return(NA_character_)
     
    }
fraud$country <- sapply(fraud$ip_address, country.ip )

1a. Creating variables/ features.

# variables: 
fraud$sign_pur_diff <- with(fraud, as.POSIXct(purchase_time) - as.POSIXct(signup_time))
#uniq.dev <- unique(sub.fraud$device_id)
fraud$dev.mult.users <-  table(fraud$device_id)[fraud$device_id] >  1  
# unique ip address
fraud$not.uniq.ip <- table(as.character(fraud$ip_address))[as.factor(fraud$ip_address)] > 1 
fraud$purchase_week <- format(as.Date(fraud$purchase_time), "%U")
fraud$purch_day_wk <- format(as.Date(fraud$purchase_time), "%a")
# fill missing values in country:
fraud$country[is.na(fraud$country)] <- "Not_found"
# reducing country categories by keeping top 50 countries
top50 <- names(sort(table(fraud$country), dec = T))[1:50]
fraud$country <- with(fraud, ifelse(country %in% top50, country, "Other"))

# delete unwanted variables
fraud[ , c("V1", "user_id", "signup_time", "device_id", "ip_address", "purchase_time")] <- NULL

1b. Visualizing: Plots of variables vs probability of fraud to judge variable importance.

Purchase week looks like an important determinant of fraud. Purchases made in the first few weeks of the year (around New years) are highly likely to be fraudulent. Similarly, purchases by people who sign up and purchase on the same day are more likely to be fraudulent - makes sense - people using illegal means will want to signup and purchase as soon as possible. Multiple users on the same device seems to have predictive power - people committing fraud might use public computers.

2. Modeling - random forest.

#Converting the character variables and the variable named 'class' to factors.
#convert to data.frame
fraud <- as.data.frame(fraud)
# convert character variables into factors
fraud[sapply(fraud, is.character)] <- lapply(fraud[sapply(fraud, is.character)], as.factor)
# convert class to factor
fraud$class <- as.factor(fraud$class)

x <- fraud[ , ! (names(fraud) %in% "class") ]
library(randomForest)
rf <- randomForest( x = x , y = fraud$class)

Define function that returns true positive rate and false positive rate to choose appropriate threshold.

p_prob <- predict(rf, newdata = x, type = "prob")

fpr_tpr <- function( thr, prob = p_prob[, 2] ){
    
     true <- as.numeric(as.character(fraud$class))
     true_0 <- sum(true == 0)
     true_1 <- sum(true == 1)
     pred_thr <- ifelse(prob > thr, 1, 0)
     fp <- sum(true == 0 & pred_thr == 1)
     tp <- sum(true == 1 & pred_thr == 1)
     fpr <- fp/(true_0)
     tpr <- tp/(true_1)
     return(c(fpr, tpr))
}

thresholds = seq(0.001, 0.999, 0.009)
roc_values <- data.frame(t(sapply(thresholds, fpr_tpr)))
colnames(roc_values) <- c("fpr", "tpr")

plot(roc_values$fpr, roc_values$tpr, type = "l")
abline(a = 0, b = 1, col = "red")

Figure: ROC curve

3. Using the model

If we are looking for a low false positive rate, we can pick a TPR of around 0.5, where the FPR is ~0. If we try to optimize both, i.e a low FPR but a decently high TPR, the best the model can do is have a TPR of 0.70 and FPR ~ 0.06. If the person has committed a fraud, 70% of the times the model predicts ‘fraud’. However, if a person hasn’t committed fraud, it predicts ‘fraud’ 6% of times. Therefore, additional checks or inquiries are required. Or perhaps a different model with more accurate predictions can be used. However what is most important is good data. For example, the data set doesn’t have past credit card usage of each person to track user behavior - this might the most helpful in detecting abnormal behavior and hence possibility of fraud.

In any case, using the current model, false ‘fraud’ labelling can be reduced if we don’t use a single threshold. We can directly use the probability that a person commits a fraud and have multiple thresholds. for example:

u <- fraud[2, -6]
prob <- predict(rf, newdata = u, type = "prob")

Gives probability that that person commits a fraud. Using that one can make a decision, say by having two thresholds. T1 and T2.
1. if prob(fraud = 1) < T1 –> innocent.
2. < T1 and > T2 –> further review - create additional verification step- like enter code sent on phone number.
3. > T2 most likely fraud - put session on hold - review manually.
the good feature about the current model is that false positive for a wide range of threshold values. So a non-fraud won’t be labeled as fraud. but since true positive rate is not very high, frauds may go undetected, so additional check may be required.

4. Further let’s see which variables are important

 importance(rf)
                 MeanDecreaseGini
# purchase_value       1816.52789
# source                313.98526
# browser               546.60445
# sex                   228.00008
# age                  1563.97618
# country              1313.84146
# sign_pur_diff        6236.54362
# dev.mult.users       3075.85805
# not.uniq.ip            22.03062
# purch_day_wk          738.14537
# purchase_wk          6230.42813

Looks like purchase week and signup purchase diff is important. Followed by variable indicating multiple users on device and then the purchase value. Multiple users using the same device is also an indication of fraud. Looks like ip address being unique and purchase day of the week is not important in determining the fraud probability. This inference is in agreement with the inference drawn from plot of probability of fraud against variables (drawn above).

Note: Random forest does not provide the sign of the variable. At times the sign is is obvious but at times it is not. One must use visualizations to draw inferences. Also with less data and few trees, one can draw the trees to see how the outcome is splitting at different nodes. getTree(rf, 2, TRUE) gives the second tree. Here, it is hard to see using tree the increase of probability of fraud due to the variable since the tree is so big.

Final Recommendations: watch out for users making purchases around New Years. Users signing up from a device that has been used previously (this indicates creating a new account for the purposes of fraud transaction) and users signing up and making a purchase right away.