Nowadays, many e-mails are sent every day, and a large share of them are spam. In this tutorial, I will guide you through the steps of building a small spam detection application written in Python.
Many data scientists prefer the Python library scikit-learn, which implements a wide range of machine learning algorithms. So, make sure that you install this library first.
The spam detection problem is in fact a text classification problem. An e-mail (a text document) is either “spam” or “no spam”. In text mining, this is called single-label text classification: every document receives exactly one label, and because there are only two possible outcomes it is also a binary classification problem. A classifier is an algorithm that decides whether a text document is “spam” or “no spam”. In this article, we first set up a small dataset on which we train a classifier. Then we create a test dataset and view the results. Before we can do any of this, we need to extract features from the texts.
Take a look at the following e-mails:
$1,000 ALARMING!
An appointment on July 2nd
New course content for Text Classification
BUY VIAGRA!!
Hopefully it is clear that the first and the last e-mail are spam. The third e-mail could arguably be viewed as spam as well, but the labels are ours to choose, so we decide that the second and the third e-mail are not spam.
First of all, do the following imports in Python:
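A minimal set of imports for what follows might look like this. It is a sketch that assumes NumPy is used for the feature vectors and that scikit-learn's `SGDClassifier` serves as the linear classifier discussed later on; the exact imports in your own script may differ.

```python
# Assumed imports for this tutorial (a sketch, adjust to your own script).
import numpy as np                              # numeric arrays for the feature vectors
from sklearn.linear_model import SGDClassifier  # linear classifier trained with SGD
```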
So, the burning question is: which features should we use? E-mails with many “!”, “$” or capital letters are probably spam. So, let's create three features, and then our training and test sets. The feature vector consists of three entries: the first is 1 if there is a “!” in the text (0 otherwise), the second is 1 if there is a “$” in the text (0 otherwise), and the last is the ratio of uppercase characters to the total number of uppercase and lowercase characters. For example, “MAkE $$$!” contains three uppercase letters and one lowercase letter, so its ratio is 3/4 = 0.75.
In Python we can create a training set and a test set as follows (the “isspam” variables define whether the text is spam or not):
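One possible layout is sketched below. The variable names `training_set` and `test_set` and the dictionary structure are assumptions made for illustration. The training texts are the four example e-mails from above with the labels we decided on, and the test texts are the four messages that appear in the results further down.

```python
# Each entry pairs a text with its label; "isspam" is True for spam.
# The dictionary layout and the variable names are illustrative assumptions.
training_set = [
    {"text": "$1,000 ALARMING!", "isspam": True},
    {"text": "An appointment on July 2nd", "isspam": False},
    {"text": "New course content for Text Classification", "isspam": False},
    {"text": "BUY VIAGRA!!", "isspam": True},
]

test_set = [
    {"text": "New course content for Information Retrieval", "isspam": False},
    {"text": "MAkE $$$!", "isspam": True},
    {"text": "Grades available for Text Mining", "isspam": False},
    {"text": "SELL HOUSE FOR $1,000,000!!", "isspam": True},
]
```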
For the feature extraction, we can write the following method:
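A sketch of such a function, following the three features described above, could look like this. The name `extract_features` and the guard for texts without any letters are assumptions added for illustration.

```python
import numpy as np

def extract_features(text):
    """Return [has '!', has '$', uppercase ratio] for a single text."""
    has_exclamation = 1.0 if "!" in text else 0.0
    has_dollar = 1.0 if "$" in text else 0.0

    n_upper = sum(1 for ch in text if ch.isupper())
    n_lower = sum(1 for ch in text if ch.islower())
    n_letters = n_upper + n_lower

    # Assumption: a text without any letters gets a ratio of 0
    # instead of raising a division-by-zero error.
    upper_ratio = n_upper / n_letters if n_letters else 0.0

    return np.array([has_exclamation, has_dollar, upper_ratio])

print(extract_features("MAkE $$$!"))
```

For “MAkE $$$!” the three entries are 1, 1 and 0.75, which matches the feature vector shown for that message in the results below.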
What do we do now that we have translated the text into numerical values? First note that, in our case, the numerical values per document live in a three-dimensional space: there are three dimensions, one per feature. A linear classifier tries to separate the data points in this space by placing a so-called hyperplane. You can think of a hyperplane as the generalization of a line (in two dimensions) or a plane (in three dimensions) to higher dimensions. Everything on one side of the hyperplane is classified as spam and everything on the other side as not spam (which side is which depends on how you define the hyperplane). There are many choices for a text classification algorithm. One that works well in simple cases is the Stochastic Gradient Descent classifier, a linear classifier whose hyperplane is fitted with stochastic gradient descent. In Python, it is easy to train the classifier on the training set and test it on the test set. This can be done as follows:
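The sketch below puts the pieces together under a few assumptions: the helper `extract_features`, the tuple layout of the two datasets, and `random_state=0` (added only to make runs repeatable) are illustrative choices, and `SGDClassifier` is used with its default settings. The last part recomputes the decision by hand to show that the prediction is simply the side of the hyperplane `w · x + b = 0` on which a feature vector falls.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def extract_features(text):
    """Return [has '!', has '$', uppercase ratio] for a single text."""
    n_upper = sum(1 for ch in text if ch.isupper())
    n_lower = sum(1 for ch in text if ch.islower())
    n_letters = n_upper + n_lower
    ratio = n_upper / n_letters if n_letters else 0.0
    return [float("!" in text), float("$" in text), ratio]

# (text, isspam) pairs; the layout is an assumption for this sketch.
training_set = [
    ("$1,000 ALARMING!", True),
    ("An appointment on July 2nd", False),
    ("New course content for Text Classification", False),
    ("BUY VIAGRA!!", True),
]
test_set = [
    ("New course content for Information Retrieval", False),
    ("MAkE $$$!", True),
    ("Grades available for Text Mining", False),
    ("SELL HOUSE FOR $1,000,000!!", True),
]

X_train = np.array([extract_features(text) for text, _ in training_set])
y_train = np.array([int(is_spam) for _, is_spam in training_set])  # 1 = spam
X_test = np.array([extract_features(text) for text, _ in test_set])

# random_state is fixed only so that repeated runs give the same model.
classifier = SGDClassifier(random_state=0)
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)

for (text, is_spam), features, predicted in zip(test_set, X_test, predictions):
    print("Test case:")
    print("=" * 20)
    print("Text:", text)
    print("Features:", features)
    print("Predicted is spam:", bool(predicted))
    print("Is spam:", is_spam)

# The learned hyperplane: a weight vector w and an offset b.
w = classifier.coef_[0]
b = classifier.intercept_[0]
scores = X_test @ w + b  # positive score = spam side of the hyperplane
```

Because stochastic gradient descent is a randomized procedure and the training set is tiny, the learned weights (and, in principle, the predictions) can vary with the random seed and the library version.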
After executing the code, we get the following results:
Test case:
====================
Text: New course content for Information Retrieval
Features: [ 0. 0. 0.07692308]
Predicted is spam: False
Is spam: False

Test case:
====================
Text: MAkE $$$!
Features: [ 1. 1. 0.75]
Predicted is spam: True
Is spam: True

Test case:
====================
Text: Grades available for Text Mining
Features: [ 0. 0. 0.10714286]
Predicted is spam: False
Is spam: False

Test case:
====================
Text: SELL HOUSE FOR $1,000,000!!
Features: [ 1. 1. 1.]
Predicted is spam: True
Is spam: True
As you can see, the classifier labels all four test messages correctly! Now we can classify short, spammy e-mail messages.
Try to implement more features, like the words used in text messages. For example, the word “VIAGRA” is often used in spam e-mails. Try to classify some more text messages. If you need any help, you can send me a (not so spammy) e-mail message.
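As a starting point for the word-based features suggested above, here is a minimal sketch of a fourth feature that counts known spam words. The word list `SPAM_WORDS` and the function name `count_spam_words` are hypothetical and chosen only for illustration; a realistic word list would be derived from a much larger collection of labeled e-mails.

```python
import re

# Hypothetical word list for illustration only.
SPAM_WORDS = {"viagra", "buy", "sell"}

def count_spam_words(text):
    """Count the words in the text that appear in SPAM_WORDS (case-insensitive)."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(1 for word in words if word in SPAM_WORDS)

print(count_spam_words("BUY VIAGRA!!"))                # 2
print(count_spam_words("An appointment on July 2nd"))  # 0
```

The returned count could be appended to the existing three-entry feature vector, giving the classifier a fourth dimension to work with.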