Skip to main content
A credit card.

Fraud Detection: A Simple Machine Learning Approach

In this tutorial, I will elaborate how got I started on the Credit Card Fraud Detection competition on Kaggle. The goal of the task is to automatically identify fraudulent credit card transactions using Machine Learning. My Pythonic approach is explained step-by-step.

Setup

In this article, I use Python 3. Execute the following line to install all the dependencies:

pip install pandas==0.19.2 scikit-learn==0.18.1

The following list gives an overview of what all the dependencies do:

  • Pandas is a library which allows you to perform common statistical operations on your data and quickly skim through your dataset.
  • The scikit-learn library has a lot of out-of-the-box Machine Learning algorithms. This is great for testing some simple models.

Gather and First Glance at the Data

The very first step is to gather the data. Download creditcard.csv into your Python project folder. Now we can take a quick look at the data using Pandas. Make a file named explore.py with the following contents:

import pandas as pd

# Read the CSV file
df = pd.read_csv('creditcard.csv')

# Show the contents
print(df)

For the Pandas newcomers, pd is a commonly used abbreviation for Pandas and df is a commonly used abbreviation for DataFrame. A DataFrame is one of the main data structures in the Pandas library.

If the code is executed, the following is shown:

Time V1 V2 V3 V4 V5 
0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321
1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018
2 1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198

              V6        V7        V8        V9  ...         V21       V22  
0       0.462388  0.239599  0.098698  0.363787  ...   -0.018307  0.277838   
1      -0.082361 -0.078803  0.085102 -0.255425  ...   -0.225775 -0.638672   
2       1.800499  0.791461  0.247676 -1.514654  ...    0.247998  0.771679   

             V23       V24       V25       V26       V27       V28  Amount  
0      -0.110474  0.066928  0.128539 -0.189115  0.133558 -0.021053  149.62   
1       0.101288 -0.339846  0.167170  0.125895 -0.008983  0.014724    2.69   
2       0.909412 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752  378.66   

        Class  
0           0  
1           0  
2           0

Apparently, the data consists of 28 variables (V1, …, V28), an “Amount” field a “Class” field and the “Time” field. We do not know the exact meanings of the variables (due to privacy concerns). The Class field takes values 0 (when the transaction is not fraudulent) and value 1 (when a transaction is fraudulent). The data is unbalanced: the number of non-fraudulent transactions (where Class equals 0) is way more than the number of fraudulent transactions (where Class equals 1). Furthermore, there is a Time field. Further inspection shows that these are integers, starting from 0.

There is a small trick for getting more information than only the raw records. We can use the following code:

print(df.describe())

This code will give a statistically summary of all the columns. It shows for example that the Amount field ranges between 0.00 and 25691.16. Thus, there are no negative transactions in the data.

Splitting the data

It would be unfair to train a Machine Learning algorithm on the data and then test the approach on the same data. If you do that, then memorization could be used to achieve optimal performance: just remember all the data that is seen and by that you can perfectly know what to answer on the same dataset. A problem arises when data is unknown. The memorization technique can by no means predict a label for unseen data. We will split the data in a train set and a test set. The Machine Learning algorithm is then trained on the train set and its performance is computed by letting it predict labels on the test set. The test set is unseen data for the algorithm: it was not shown to the algorithm before.

This is not the only problem. Another problem is that we are dealing with unbalanced data. Luckily, the scikit-learn library provides us some tools for splitting the unbalanced data fairly.

Model

As a model, I will use Logistic Regression. This model is often used in problems with binary target variables. Our Class variable is indeed a binary variable. It is not the best approach, but at least it offers some insights in the data.

Implementation

The very first step is to include all the dependencies. This is done by the following lines of code:

from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import pandas as pd

The next step is to read the data:

data = pd.read_csv('creditcard.csv')

Now we have the data in place, we can select the features we would like to use during the training:

# Only use the 'Amount' and 'V1', ..., 'V28' features
features = ['Amount'] + ['V%d' % number for number in range(1, 29)]

# The target variable which we would like to predict, is the 'Class' variable
target = 'Class'

# Now create an X variable (containing the features) and an y variable (containing only the target variable)
X = data[features]
y = data[target]

Notice that some of the variables have a wide range of values (like the Amount variable). In order to get all variables in an equivalent range, we subtract the mean and divide by the standard deviation such that the distribution of the values is normalized:

def normalize(X):
    """
    Make the distribution of the values of each variable similar by subtracting the mean and by dividing by the standard deviation.
    """
    for feature in X.columns:
        X[feature] -= X[feature].mean()
        X[feature] /= X[feature].std()
    return X

Okay, now it is time for some action! We will first define the model (the Logistic Regression model) and then loop through a train and test set which have approximately the same Class distribution. The StratisfiedShuffleSplit makes sure that the Class variable has roughly the same distribution in both the train set and the test set. The random state specification makes sure that the result is deterministic: in other words, we will get the same results if we would run the analysis again. The normalization is done for both the train set and test set. If this was done before the split, some information of the test set would be used in the normalization of the train set and this is not fair since the test set is not completely unseen then. The following code does the job:

# Define the model
model = LogisticRegression()

# Define the splitter for splitting the data in a train set and a test set
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)

# Loop through the splits (only one)
for train_indices, test_indices in splitter.split(X, y):
    # Select the train and test data
    X_train, y_train = X.iloc[train_indices], y.iloc[train_indices]
    X_test, y_test = X.iloc[test_indices], y.iloc[test_indices]
    
    # Normalize the data
    X_train = normalize(X_train)
    X_test = normalize(X_test)
    
    # Fit and predict!
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    # And finally: show the results
    print(classification_report(y_test, y_pred))

This results into the following:

             precision    recall  f1-score   support

          0       1.00      1.00      1.00    142158
          1       0.88      0.61      0.72       246

avg / total       1.00      1.00      1.00    142404

This is actually a great result! The 0 classes (transactions without fraud) are predicted with 100% precision and recall. It has some issues with detecting the 1 classes (transactions which are fraudulent). It can predict fraud with 88% precision. This means that 12% of the transactions which are fraudulent remain undetected by the system. But, 88% is still quite good!

Conclusion

It is fairly easy to come up with a simple model, implement it in Python and get great results for the Credit Card Fraud Detection task on Kaggle. If you have any questions or comments, feel free to ask in the comment section below!

Kevin Jacobs

Kevin Jacobs

Kevin Jacobs is a certified Data Scientist and blog writer for Data Blogger. He is passionate about any project that involves large amounts of data and statistical data analysis. Kevin can be reached using Twitter (@kmjjacobs), LinkedIn or via e-mail: mail@kevinjacobs.nl.