In this machine learning fraud detection tutorial, I will elaborate how got I started on the Credit Card Fraud Detection competition on Kaggle. The goal of the task is to automatically identify fraudulent credit card transactions using Machine Learning. My Pythonic approach is explained step-by-step.
The PwC global economic crime survey of 2016 suggests that approximately 36% of organizations experienced economic crime. Therefore, there is definitely a need to solve the problem of credit card fraud detection. The task of fraud detection often boils down to outlier detection, in which a dataset is scanned through to find potential anomalies in the data. In the past, this was done by employees which checked all transactions manually. With the rise of machine learning, artificial intelligence, deep learning and other relevant fields of information technology, it becomes feasible to automate this process and to save some of the intensive amount of labor that is put into detecting credit card fraud. In the following sections, my machine learning based Pythonic approach is explained.
In this article, I use Python 3. Execute the following line to install all the dependencies:
pip install pandas==0.19.2 scikit-learn==0.18.1
The following list gives an overview of what all the dependencies do:
- Pandas is a library which allows you to perform common statistical operations on your data and quickly skim through your dataset.
- The scikit-learn library has a lot of out-of-the-box Machine Learning algorithms. This is great for testing some simple models.
Gather and First Glance at the Data
The very first step is to gather the data. Download creditcard.csv into your Python project folder. Now we can take a quick look at the data using Pandas. Make a file named explore.py with the following contents:
import pandas as pd # Read the CSV file df = pd.read_csv('creditcard.csv') # Show the contents print(df)
For the Pandas newcomers, pd is a commonly used abbreviation for Pandas and df is a commonly used abbreviation for DataFrame. A DataFrame is one of the main data structures in the Pandas library.
If the code is executed, the following is shown:
Apparently, the data consists of 28 variables (V1, …, V28), an “Amount” field a “Class” field and the “Time” field. We do not know the exact meanings of the variables (due to privacy concerns). The Class field takes values 0 (when the transaction is not fraudulent) and value 1 (when a transaction is fraudulent). The data is unbalanced: the number of non-fraudulent transactions (where Class equals 0) is way more than the number of fraudulent transactions (where Class equals 1). Furthermore, there is a Time field. Further inspection shows that these are integers, starting from 0.
There is a small trick for getting more information than only the raw records. We can use the following code:
This code will give a statistically summary of all the columns. It shows for example that the Amount field ranges between 0.00 and 25691.16. Thus, there are no negative transactions in the data.
Splitting the data
It would be unfair to train a Machine Learning algorithm on the data and then test the approach on the same data. If you do that, then memorization could be used to achieve optimal performance: just remember all the data that is seen and by that you can perfectly know what to answer on the same dataset. A problem arises when data is unknown. The memorization technique can by no means predict a label for unseen data. We will split the data in a train set and a test set. The Machine Learning algorithm is then trained on the train set and its performance is computed by letting it predict labels on the test set. The test set is unseen data for the algorithm: it was not shown to the algorithm before.
This is not the only problem. Another problem is that we are dealing with unbalanced data. Luckily, the scikit-learn library provides us some tools for splitting the unbalanced data fairly.
As a model, I will use Logistic Regression. This model is often used in problems with binary target variables. Our Class variable is indeed a binary variable. It is not the best approach, but at least it offers some insights in the data.
The very first step is to include all the dependencies. This is done by the following lines of code:
from sklearn.model_selection import StratifiedShuffleSplit from sklearn.linear_model import LogisticRegression from sklearn.metrics import classification_report import pandas as pd
The next step is to read the data:
data = pd.read_csv('creditcard.csv')
Now we have the data in place, we can select the features we would like to use during the training:
# Only use the 'Amount' and 'V1', ..., 'V28' features features = ['Amount'] + ['V%d' % number for number in range(1, 29)] # The target variable which we would like to predict, is the 'Class' variable target = 'Class' # Now create an X variable (containing the features) and an y variable (containing only the target variable) X = data[features] y = data[target]
Notice that some of the variables have a wide range of values (like the Amount variable). In order to get all variables in an equivalent range, we subtract the mean and divide by the standard deviation such that the distribution of the values is normalized:
def normalize(X): """ Make the distribution of the values of each variable similar by subtracting the mean and by dividing by the standard deviation. """ for feature in X.columns: X[feature] -= X[feature].mean() X[feature] /= X[feature].std() return X
Okay, now it is time for some action! We will first define the model (the Logistic Regression model) and then loop through a train and test set which have approximately the same Class distribution. First, we train the model by the train set and then valide the results with the test set. The StratisfiedShuffleSplit makes sure that the Class variable has roughly the same distribution in both the train set and the test set. The random state specification makes sure that the result is deterministic: in other words, we will get the same results if we would run the analysis again. The normalization is done for both the train set and test set. If this was done before the split, some information of the test set would be used in the normalization of the train set and this is not fair since the test set is not completely unseen then. The following code does the job:
# Define the model model = LogisticRegression() # Define the splitter for splitting the data in a train set and a test set splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0) # Loop through the splits (only one) for train_indices, test_indices in splitter.split(X, y): # Select the train and test data X_train, y_train = X.iloc[train_indices], y.iloc[train_indices] X_test, y_test = X.iloc[test_indices], y.iloc[test_indices] # Normalize the data X_train = normalize(X_train) X_test = normalize(X_test) # Fit and predict! model.fit(X_train, y_train) y_pred = model.predict(X_test) # And finally: show the results print(classification_report(y_test, y_pred))
This results into the following:
This is actually a great result! The 0 classes (transactions without fraud) are predicted with 100% precision and recall. It has some issues with detecting the 1 classes (transactions which are fraudulent). It can predict fraud with 88% precision. This means that 12% of the transactions which are fraudulent remain undetected by the system. But, 88% is still quite good!
This machine learning fraud detection tutorial showed how to tackle the problem of credit card fraud detection using machine learning. It is fairly easy to come up with a simple model, implement it in Python and get great results for the Credit Card Fraud Detection task on Kaggle. If you have any questions or comments, feel free to ask in the comment section below!