In this tutorial, I will explain how I got started on the Credit Card Fraud Detection competition on Kaggle. The goal of the task is to automatically identify fraudulent credit card transactions using Machine Learning. My Pythonic approach is explained step by step.
In this article, I use Python 3. The dependencies can be installed with pip, for example by running pip install pandas scikit-learn on the command line.
The following list gives an overview of what all the dependencies do:
- Pandas is a library which allows you to perform common statistical operations on your data and quickly skim through your dataset.
- The scikit-learn library has a lot of out-of-the-box Machine Learning algorithms. This is great for testing some simple models.
Gather and First Glance at the Data
The very first step is to gather the data. Download creditcard.csv into your Python project folder. Now we can take a quick look at the data using Pandas. Make a file named explore.py with the following contents:
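A minimal sketch of such a script, assuming creditcard.csv sits next to explore.py. The helper name load_transactions and the existence check are illustrative additions, and showing three rows is a guess based on the output further down:

```python
# explore.py
import os

import pandas as pd

CSV_PATH = "creditcard.csv"  # assumed location: the project folder


def load_transactions(path=CSV_PATH):
    """Read the transactions into a DataFrame."""
    return pd.read_csv(path)


# Only run the exploration when the file has actually been downloaded.
if os.path.exists(CSV_PATH):
    df = load_transactions()
    # Show the first three records.
    print(df.head(3))
```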
For the Pandas newcomers, pd is a commonly used abbreviation for Pandas and df is a commonly used abbreviation for DataFrame. A DataFrame is one of the main data structures in the Pandas library.
If the code is executed, the following is shown:
   Time        V1        V2        V3        V4        V5
0   0.0 -1.359807 -0.072781  2.536347  1.378155 -0.338321
1   0.0  1.191857  0.266151  0.166480  0.448154  0.060018
2   1.0 -1.358354 -1.340163  1.773209  0.379780 -0.503198

         V6        V7        V8        V9  ...       V21       V22
0  0.462388  0.239599  0.098698  0.363787  ... -0.018307  0.277838
1 -0.082361 -0.078803  0.085102 -0.255425  ... -0.225775 -0.638672
2  1.800499  0.791461  0.247676 -1.514654  ...  0.247998  0.771679

        V23       V24       V25       V26       V27       V28  Amount
0 -0.110474  0.066928  0.128539 -0.189115  0.133558 -0.021053  149.62
1  0.101288 -0.339846  0.167170  0.125895 -0.008983  0.014724    2.69
2  0.909412 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752  378.66

   Class
0      0
1      0
2      0
Apparently, the data consists of 28 variables (V1, …, V28), an “Amount” field, a “Class” field and a “Time” field. We do not know the exact meanings of the variables (due to privacy concerns). The Class field takes the value 0 when the transaction is not fraudulent and the value 1 when it is fraudulent. The data is unbalanced: the number of non-fraudulent transactions (where Class equals 0) is far larger than the number of fraudulent transactions (where Class equals 1). Further inspection of the Time field shows that it contains integer values, starting from 0.
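One way to check the imbalance is to count the records per class. The sketch below uses a small made-up DataFrame in place of the real data, and class_balance is a hypothetical helper name:

```python
import pandas as pd


def class_balance(df, label="Class"):
    """Return the number and the share of records for each class label."""
    counts = df[label].value_counts()
    return pd.DataFrame({"count": counts, "share": counts / len(df)})


# Toy data standing in for creditcard.csv: 8 legitimate and 2 fraudulent records.
toy = pd.DataFrame({"Class": [0] * 8 + [1] * 2})
balance = class_balance(toy)
print(balance)
```

Calling class_balance on the DataFrame loaded from creditcard.csv gives the same table for the real transactions.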
There is a small trick for getting more information than only the raw records. We can use the following code:
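A sketch of what this could look like, using the describe method of a DataFrame. The toy Amount values are made up for illustration, and summarize is a hypothetical wrapper:

```python
import pandas as pd


def summarize(df):
    """Return count, mean, std, min, quartiles and max for every numeric column."""
    return df.describe()


# Toy data standing in for the real transactions.
toy = pd.DataFrame({"Amount": [0.0, 2.69, 149.62, 378.66]})
summary = summarize(toy)
print(summary)
```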
This code will give a statistical summary of all the columns. It shows, for example, that the Amount field ranges between 0.00 and 25691.16. Thus, there are no negative transactions in the data.
Splitting the data
It would be unfair to train a Machine Learning algorithm on the data and then test the approach on the same data. If you do that, then memorization could be used to achieve optimal performance: just remember all the data that is seen and you know exactly what to answer on the same dataset. The problem arises with data that has not been seen before: the memorization technique can by no means predict a label for unseen data. We will therefore split the data into a train set and a test set. The Machine Learning algorithm is then trained on the train set and its performance is computed by letting it predict labels on the test set. The test set is unseen data for the algorithm: it was not shown to the algorithm before.
This is not the only problem. Another problem is that we are dealing with unbalanced data. Luckily, the scikit-learn library provides us with some tools for splitting the unbalanced data fairly.
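To illustrate what a fair split means here, the sketch below splits a made-up, unbalanced label vector with StratifiedShuffleSplit from sklearn.model_selection. The toy counts, the 50/50 split and the random state are assumptions chosen for the example:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy labels: 96 legitimate (0) and 4 fraudulent (1) transactions.
y = np.array([0] * 96 + [1] * 4)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
train_idx, test_idx = next(splitter.split(X, y))

# Both halves should contain the same share of fraudulent records.
print("fraud share in train:", y[train_idx].mean())
print("fraud share in test: ", y[test_idx].mean())
```

A plain random split could, by chance, put most of the few fraudulent records into one half. Stratifying on the label avoids that.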
As a model, I will use Logistic Regression. This model is often used in problems with binary target variables. Our Class variable is indeed a binary variable. It is not the best approach, but at least it offers some insights into the data.
The very first step is to include all the dependencies. This is done by the following lines of code:
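With a recent scikit-learn release, the imports could look as follows. The module paths are those of current scikit-learn versions, which is an assumption since older releases organized some of them differently:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import StratifiedShuffleSplit
```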
The next step is to read the data:
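A minimal sketch, again assuming creditcard.csv is in the project folder. The read_data helper and its check for the Class column are illustrative additions:

```python
import os

import pandas as pd

DATA_PATH = "creditcard.csv"  # assumed to sit in the project folder


def read_data(path=DATA_PATH):
    """Load the transactions and check that the label column is present."""
    df = pd.read_csv(path)
    if "Class" not in df.columns:
        raise ValueError("expected a 'Class' column with the fraud labels")
    return df


# Only load the data when the file has actually been downloaded.
if os.path.exists(DATA_PATH):
    df = read_data()
    print(df.shape)
```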
Now that we have the data in place, we can select the features we would like to use during the training:
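Which features are selected is a modelling choice. The sketch below assumes the 28 anonymized variables plus Amount as features and Class as the target; both that choice and the select_features helper are assumptions for illustration:

```python
import pandas as pd

# Assumed feature set: V1 ... V28 plus the transaction amount.
FEATURES = [f"V{i}" for i in range(1, 29)] + ["Amount"]
TARGET = "Class"


def select_features(df, features=FEATURES, target=TARGET):
    """Split a DataFrame into a feature matrix X and a label vector y."""
    return df[features], df[target]


# Toy frame with the same column names as creditcard.csv.
toy = pd.DataFrame({name: [0.0, 1.0] for name in FEATURES})
toy["Time"] = [0.0, 1.0]
toy[TARGET] = [0, 1]

X, y = select_features(toy)
print(X.shape, y.tolist())
```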
Notice that some of the variables have a wide range of values (like the Amount variable). In order to get all variables in an equivalent range, we subtract the mean and divide by the standard deviation, so that every variable ends up with a mean of 0 and a standard deviation of 1:
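A sketch of such a normalization with Pandas. The normalize helper and its optional mean and std arguments are assumptions; the optional arguments make it possible to reuse the statistics of one dataset on another:

```python
import pandas as pd


def normalize(X, mean=None, std=None):
    """Subtract the column mean and divide by the column standard deviation.

    If mean and std are not given, they are computed from X itself.
    Note that pandas uses the sample standard deviation (ddof=1), and that
    a column with zero standard deviation would turn into NaN values.
    """
    mean = X.mean() if mean is None else mean
    std = X.std() if std is None else std
    return (X - mean) / std


# Toy data with two columns on very different scales.
toy = pd.DataFrame({"Amount": [0.0, 10.0, 20.0], "V1": [-1.0, 0.0, 1.0]})
scaled = normalize(toy)
print(scaled)
```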
Okay, now it is time for some action! We will first define the model (the Logistic Regression model) and then loop through a train and test set which have approximately the same Class distribution. The StratifiedShuffleSplit makes sure that the Class variable has roughly the same distribution in both the train set and the test set. The random state specification makes sure that the result is deterministic: in other words, we will get the same results if we run the analysis again. The normalization is done for both the train set and the test set, after the split. If it were done before the split, some information from the test set would be used in the normalization of the train set, and that is not fair since the test set would then no longer be completely unseen. The following code does the job:
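A sketch of this procedure under several assumptions: the feature set (V1 to V28 plus Amount), a single 50/50 split, random_state=0 and max_iter=1000 are illustrative choices, and evaluate is a hypothetical helper. The sketch computes the mean and standard deviation on the train set only and applies them to both sets, which is one common way to keep the test set unseen:

```python
import os

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import StratifiedShuffleSplit

FEATURES = [f"V{i}" for i in range(1, 29)] + ["Amount"]  # assumed feature set


def evaluate(X, y, test_size=0.5, random_state=0):
    """Train and test a Logistic Regression model on stratified splits."""
    model = LogisticRegression(max_iter=1000)
    splitter = StratifiedShuffleSplit(
        n_splits=1, test_size=test_size, random_state=random_state
    )
    reports = []
    for train_idx, test_idx in splitter.split(X, y):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

        # Normalize after the split, using statistics of the train set only.
        # A constant column (std of 0) would need special handling here.
        mean, std = X_train.mean(), X_train.std()
        X_train = (X_train - mean) / std
        X_test = (X_test - mean) / std

        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        reports.append(classification_report(y_test, y_pred))
    return reports


# Only run on the real data when the file has actually been downloaded.
if os.path.exists("creditcard.csv"):
    df = pd.read_csv("creditcard.csv")
    for report in evaluate(df[FEATURES], df["Class"]):
        print(report)
```

The exact numbers depend on the split, the random state and the scikit-learn version, so a run of this sketch will not necessarily reproduce the report below digit for digit.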
This results in the following:
             precision    recall  f1-score   support

          0       1.00      1.00      1.00    142158
          1       0.88      0.61      0.72       246

avg / total       1.00      1.00      1.00    142404
This is actually a great result! The 0 class (transactions without fraud) is predicted with a precision and recall that both round to 100%. The model has more trouble detecting the 1 class (transactions which are fraudulent). It predicts fraud with 88% precision. This means that 12% of the transactions flagged as fraudulent are in fact legitimate. The recall for fraud is 61%, which means that 39% of the fraudulent transactions remain undetected by the system. But, for such a simple model, this is still quite good!
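To make precision and recall concrete, the sketch below computes them from confusion-matrix counts. The counts are illustrative values chosen to be roughly consistent with the report (246 fraudulent transactions in the test set); they are not the actual confusion matrix of the model:

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute precision, recall and F1 score from confusion-matrix counts."""
    precision = true_pos / (true_pos + false_pos)  # share of flagged cases that are fraud
    recall = true_pos / (true_pos + false_neg)     # share of fraud cases that are flagged
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative counts: 150 frauds caught, 96 frauds missed, 20 false alarms.
precision, recall, f1 = precision_recall_f1(true_pos=150, false_pos=20, false_neg=96)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Precision says how trustworthy an alarm is, while recall says how many frauds trigger an alarm at all. In fraud detection, the missed cases counted by recall are often the more costly ones.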
It is fairly easy to come up with a simple model, implement it in Python and get great results for the Credit Card Fraud Detection task on Kaggle. If you have any questions or comments, feel free to ask in the comment section below!