A Simple Spam Filter Using Python and Machine Learning

Written by Kevin Jacobs

I'm Kevin, a Data Scientist, PhD student in NLP and Law and blog writer for Data Blogger.

Many of the e-mails you’ll find in the average inbox are spam e-mails. In this tutorial, I will guide you through the steps of building a simple spam classifier written in Python.

In this tutorial, we will use the Python library scikit-learn, which contains many machine learning models. Make sure that you install this library first. The package can be installed with the following command:

pip install -U scikit-learn

Introduction

A spam filter can be seen as a text classification problem. An e-mail (a text document) belongs to either the “spam” or the “not spam” class. This is called single-label text classification, since every e-mail receives exactly one label; with only two classes, it is also known as binary classification. A classifier is an algorithm that decides whether a text document belongs to the “spam” or the “not spam” category. In this article, we will first decide which textual features to extract from our documents. Then we will create a small dataset and train a classifier on it. Finally, we will classify a test dataset and view the results.

Feature extraction

Take a look at the following e-mails:

$1,000 ALARMING!
An appointment on July 2nd
New course content for Text Classification
BUY VIAGRA!!

The first and the last e-mail most probably belong to the “spam” class: both try to advertise something and put pressure on the reader to buy as soon as possible. First of all, do the following imports in Python:

import numpy as np
from sklearn.linear_model import SGDClassifier

The next step is to decide which features we will use. E-mails that contain characters such as “!” or “$”, or that consist largely of capital letters, are probably spam. So, we will describe every e-mail with a feature vector of three entries. The first entry is 1 if there is a “!” in the text (0 otherwise). The second entry is 1 if there is a “$” in the text (0 otherwise). The last entry is the ratio of uppercase characters to the total number of uppercase and lowercase characters.

In Python we can create a training set and a test set as follows (the “isspam” variables define whether the text is spam or not):

# Define a training set
training_data = ["$1,000 ALARMING!", "An appointment on July 2nd", "New course content for Text Classification", "BUY VIAGRA!!"]
training_isspam = [True, False, False, True]

# Define the testing set
testing_data = ["New course content for Information Retrieval", "MAkE $$$!", "Grades available for Text Mining", "SELL HOUSE FOR $1,000,000!!"]
testing_isspam = [False, True, False, True]

For the feature extraction, we can write the following function:

def extract_features(text):
    """
    Extract features from a given text.

    :param text: Text to extract features for.
    :return:     A vector where:
                    - The 0th element is 1 if there is a "!" inside the text (0 otherwise).
                    - The 1st element is 1 if there is a "$" inside the text (0 otherwise).
                    - The 2nd element is the ratio of uppercase characters with respect to the sum of all uppercase and
                      lowercase characters.
    """
    features = np.zeros((3,))
    if "!" in text:
        features[0] = 1
    if "$" in text:
        features[1] = 1
    # A list consisting of lowercase characters
    lowercase = list('abcdefghijklmnopqrstuvwxyz')
    # A list consisting of uppercase characters
    uppercase = list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
    # Set the counts of lowercase and uppercase characters to 0
    num_lowercase = 0
    num_uppercase = 0
    # And count the lowercase and uppercase characters
    for char in text:
        if char in lowercase:
            num_lowercase += 1
        elif char in uppercase:
            num_uppercase += 1
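    # Define the third feature as the ratio of uppercase characters
    # (it stays 0 when the text contains no letters, which avoids a division by zero)
    num_letters = num_lowercase + num_uppercase
    if num_letters > 0:
        features[2] = num_uppercase / num_letters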
    return features
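
As a quick sanity check of the feature definition, the sketch below recomputes the three features for the four training e-mails. The helper `quick_features` is a hypothetical, compact restatement of the function above (it is not part of the original code), written so that the snippet can run on its own.

```python
import string

import numpy as np


def quick_features(text):
    """Hypothetical compact variant of extract_features, for illustration only."""
    # Count ASCII uppercase and lowercase letters, ignoring digits and punctuation
    num_upper = sum(ch in string.ascii_uppercase for ch in text)
    num_lower = sum(ch in string.ascii_lowercase for ch in text)
    total = num_upper + num_lower
    # Leave the ratio at 0 when the text contains no letters at all
    ratio = num_upper / total if total else 0.0
    return np.array([float("!" in text), float("$" in text), ratio])


training_data = [
    "$1,000 ALARMING!",
    "An appointment on July 2nd",
    "New course content for Text Classification",
    "BUY VIAGRA!!",
]

for text in training_data:
    print(text, "->", quick_features(text))
```

For example, “An appointment on July 2nd” contains 2 uppercase letters out of 21 letters in total, so its third feature is 2/21, while “BUY VIAGRA!!” consists of uppercase letters only and gets a ratio of 1.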

Text Classification

Note that in our case the numerical values per document live in a three-dimensional space: there are three dimensions, one per feature. A linear classifier tries to separate the data points in this space by fitting a hyperplane. You can think of a hyperplane as the generalization of a line or a plane to higher dimensions. Everything on one side of the hyperplane is classified as spam and everything on the other side is classified as not spam. There are many possible choices for a text classification algorithm. One that works well in simple cases is scikit-learn's SGDClassifier, a linear classifier that is trained with Stochastic Gradient Descent. We can now train our classifier as follows:

# Make an array of features where the ith row corresponds to the ith document and the columns correspond to the features
training_features = np.vstack([extract_features(text) for text in training_data])

# Make a Stochastic Gradient Descent classifier
clf = SGDClassifier()
# And fit it to the training set
clf.fit(training_features, training_isspam)

# Predict the labels of the test set
for text, is_spam in zip(testing_data, testing_isspam):
    features = extract_features(text)
    # predict() expects a 2D array (one row per document), so reshape the single feature vector
    prediction = clf.predict(features.reshape(1, -1))[0]
    print("Test case:")
    print(20 * '=')
    print('Text:', 4 * "\t", text)
    print('Features:', 3 * "\t", features)
    print('Predicted is spam:', "\t", prediction)
    print('Is spam:', 3 * "\t", is_spam)
    print('')
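
To make the hyperplane idea concrete, here is a minimal sketch that reads the fitted weights from an SGDClassifier and reproduces its predictions by hand. The feature matrix is written out manually from the feature definition above, and `random_state=0` is an assumption added only to make the sketch repeatable; neither is part of the original code.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Feature vectors of the four training e-mails, written out by hand
X = np.array([
    [1.0, 1.0, 1.0],     # "$1,000 ALARMING!"
    [0.0, 0.0, 2 / 21],  # "An appointment on July 2nd"
    [0.0, 0.0, 3 / 37],  # "New course content for Text Classification"
    [1.0, 0.0, 1.0],     # "BUY VIAGRA!!"
])
y = [True, False, False, True]

# random_state is fixed here only for repeatability (an assumption of this sketch)
clf = SGDClassifier(random_state=0)
clf.fit(X, y)

# The fitted hyperplane is the set of points x where w . x + b = 0
w = clf.coef_[0]
b = clf.intercept_[0]

# The signed score tells us on which side of the hyperplane each document lies
scores = X @ w + b
manual_predictions = scores > 0  # positive side corresponds to the class True (spam)

print("Weights:", w)
print("Intercept:", b)
print("Signed scores:", scores)
print("Manual predictions:", manual_predictions)
print("clf.predict:       ", clf.predict(X))
```

The manual predictions agree with `clf.predict` because, for two classes, a linear classifier in scikit-learn assigns the second class (here `True`) exactly when the signed score is positive.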

You might also be interested in text classification using Neural Networks. This article shows an advanced implementation of a neural network in Python.

Results

After executing the code, we get the following results:

Test case:
====================
Text: 				 New course content for Information Retrieval
Features: 			 [ 0.          0.          0.07692308]
Predicted is spam: 	         False
Is spam: 			 False

Test case:
====================
Text: 				 MAkE $$$!
Features: 			 [ 1.    1.    0.75]
Predicted is spam: 	         True
Is spam: 			 True

Test case:
====================
Text: 				 Grades available for Text Mining
Features: 			 [ 0.          0.          0.10714286]
Predicted is spam: 	         False
Is spam: 			 False

Test case:
====================
Text: 				 SELL HOUSE FOR $1,000,000!!
Features: 			 [ 1.  1.  1.]
Predicted is spam: 	         True
Is spam: 			 True

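As you can see, the classifier labels all four test messages correctly! Keep in mind that Stochastic Gradient Descent involves randomness and that the training set contains only four e-mails, so the results can differ between runs. Still, we can now filter spammy e-mail messages ourselves!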
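
The comparison can also be summarized in a single number. The sketch below computes the accuracy with scikit-learn's `accuracy_score`, using the predicted and actual labels copied from the results listed above (the variable names are chosen for illustration only).

```python
from sklearn.metrics import accuracy_score

# Labels copied from the four test cases in the results above
predicted = [False, True, False, True]
actual = [False, True, False, True]

# Fraction of test messages for which the prediction matches the true label
accuracy = accuracy_score(actual, predicted)
print("Accuracy:", accuracy)
```

With four test messages, every wrong prediction would lower the accuracy by 0.25, so a larger test set is needed for a meaningful estimate.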

Exercise

Try to implement more features, like the words used in text messages. For example, the word “VIAGRA” is often found in spam e-mails. Try to classify some more text messages. If you need any help, you can send me a message.
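
As one possible starting point for the exercise, the sketch below turns the words of each training e-mail into count features with scikit-learn's `CountVectorizer`. This is only one of many ways to build word features, and the variable names are chosen for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

training_data = [
    "$1,000 ALARMING!",
    "An appointment on July 2nd",
    "New course content for Text Classification",
    "BUY VIAGRA!!",
]

# By default, CountVectorizer lowercases the text and keeps tokens of at least two characters
vectorizer = CountVectorizer()
word_counts = vectorizer.fit_transform(training_data).toarray()

# Every column corresponds to one word of the vocabulary
print("Vocabulary:", sorted(vectorizer.vocabulary_))

# Look up the column of the word "viagra" and show its count per e-mail
viagra_column = vectorizer.vocabulary_["viagra"]
print("Counts for 'viagra':", word_counts[:, viagra_column])
```

The resulting matrix has one row per e-mail, just like the hand-made feature matrix, so the two could be placed side by side (for example with `np.hstack`) before training the classifier.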
