Active Learning: An Introduction
What is Active Learning, and how can it help you speed up your model development? In this blog post, we will look at what Active Learning is, why it is useful, and how you can implement it with some pseudo code. But first things first: what is Active Learning?
What Is Active Learning?
Active Learning is a method for annotating data in a smart way. With conventional data annotation, you are shown randomly selected items to label. This often means labeling many near-identical items, which is a waste of time. A better approach is Active Learning. First, a batch of random items is selected and annotated. Then, a lightweight classifier is trained on the data annotated so far (in the very first round there is no annotated data yet, so the classifier is not trained at all and the batch is simply random). This classifier assigns a probability to each of its predictions, and the items the classifier is least sure about are selected to be annotated first.
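To make this loop concrete, here is a small toy sketch using scikit-learn on synthetic data. It is not part of the original post: the synthetic labels play the role of the human annotator ("oracle"), and the loop simply asks the oracle for the label of the item the classifier is least sure about.

```python
# Toy sketch of the Active Learning loop (uncertainty sampling) on synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# A synthetic pool of 1,000 items; y_true stands in for the human annotator.
X_pool, y_true = make_classification(n_samples=1000, random_state=0)

# First round: no trained classifier yet, so start from a random batch.
labeled = list(rng.choice(len(X_pool), size=10, replace=False))

for _ in range(50):  # each iteration simulates one annotation step
    # Train the lightweight classifier on everything annotated so far.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_pool[labeled], y_true[labeled])

    # Uncertainty sampling: the item whose predicted probability is
    # closest to 0.5 is the one the classifier is least sure about.
    probs = clf.predict_proba(X_pool)[:, 1]
    candidates = [i for i in range(len(X_pool)) if i not in labeled]
    most_uncertain = min(candidates, key=lambda i: abs(probs[i] - 0.5))

    labeled.append(most_uncertain)  # "annotate" it by asking the oracle
```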

Why Active Learning?
So, why would you do this? Research has shown that Active Learning speeds up the annotation process, and it also keeps you from wasting time on annotating near-duplicate items. A potential pitfall is that the lightweight classifier may be too different from the model you will use in production: Active Learning then finds the weak spots of the lightweight classifier, which may differ from the weak spots of the production model. Another issue is that the bias of the lightweight classifier can introduce a bias into your dataset. In a plot of model quality against the number of annotated items, the effect of Active Learning versus Random Sampling is clear: the same score is reached with far fewer labels.
Active Learning Implementation
In this section, we will see how Active Learning can be implemented in practice, sketched in pseudo code. For simplicity, we will look at a simple spam classification task in which a message is classified as either spam or no_spam.
First, define one file for each class: spam.txt and no_spam.txt. Each line of spam.txt contains a spam message, and each line of no_spam.txt contains a message that is not spam. Furthermore, create an additional file, dataset.txt, which contains one unlabeled item per line. In the beginning, only dataset.txt needs to contain data; the other files can be empty.
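One way to read these three files could look like the snippet below; the helper name load_lines is only an illustrative choice, not something from the original post.

```python
from pathlib import Path

def load_lines(path):
    """Return the non-empty lines of a text file, or [] if it does not exist yet."""
    file = Path(path)
    if not file.exists():
        return []
    return [line.strip() for line in file.read_text().splitlines() if line.strip()]

spam = load_lines("spam.txt")          # labeled spam messages (may be empty at first)
no_spam = load_lines("no_spam.txt")    # labeled non-spam messages (may be empty at first)
unlabeled = load_lines("dataset.txt")  # pool of items that still need a label
```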
Now we can implement Active Learning. First, load the data from dataset.txt. Then, build a classifier of your choice, for example TF-IDF + Logistic Regression. Train this classifier on a balanced selection of spam.txt and no_spam.txt. Next, predict a score for every item in dataset.txt and select the item whose score is closest to 0.5, for example the item with the smallest squared distance to 0.5. Put this item into either spam.txt or no_spam.txt and repeat the process. Once you have reached a satisfactory score, you can stop.
Conclusion
Active Learning is a powerful method for labeling new items and outperforms Random Sampling. Some effort is required to set up Active Learning, but it pays off after only a few iterations. In this blog post, pseudo code was given for implementing Active Learning in practice. If you have any questions, feel free to ask them via Twitter.