In this article, I will share my findings on creating a character-based Sequence-to-Sequence model (Seq2Seq) and I will share some of the results I have found. All of this is just a tiny part of my Master Thesis and it took quite a while for me to learn how to convert the theoretical concepts into practical models. I will also share the lessons that I have learned.
Natural Language Processing (NLP)
This blog post is about Natural Language Processing (NLP in short). It is not easy for computers to interpret texts. The way you are able to read these sentences is way better than computers can do. However, it depends on the task. A computer is good at extracting facts but not good at understanding a text. Understanding is not well-defined here, but you get the message. If we read, the information transfers from the medium (a paper, your computer or another screen) to your eyes and then the information gets translated to brain signals. For a computer, a similar process happens. The computer needs to transfer the textual information into numerical information and nowadays we use Neural Networks for that. There are other ways of converting text to numbers, but that will be discussed in a different blog post.
Character-based Seq2Seq model
Recurrent Neural Networks are a special type of Neural Networks. An overview of different types of neural networks can be found here. An LSTM is a special kind of RNNs but don’t worry about the terminology! We will use these networks as building block for our architecture. You do not need to worry about the details. The only thing to remember is that RNNs take into account sequences of data and merge information of the items seen so far with new information.
The Seq2Seq (sequence-to-sequence) model has the following architecture:
As you can see, ‘HEY’ is the input. The processing is based on the sequence. So first ‘H’ is fed into the network. A intermediate state vector is formed containing the information of ‘H’. This is an internal vector which is used for remembering important parts of the data. Then, the intermediate state vector is fed into the second RNN and it combines the information seen thus far (‘H’) with the new data (‘E’). At the last step, ‘Y’ is added to the state vector and this intermediate state vector is then shown in yellow. The procedure so far is done by a so-called encoder. It takes characters (or other items) as input and it forms a thought vector as output.
Now comes the decoder into play. It takes the thought vector as input and should produce an output. We can train it on anything we like! You could train a chatbot using the same model but by using words instead of characters as inputs! I will devote a future blog post on that topic. For now, we will train our model such that it can reproduce the input.
Now, why is this so special? Think about it. The decoder is able to convert some numbers to a text. If the thought vector is small enough, then all input data is compressed into a small vector which contains just enough information. The model acts as a compressor. It learns how to store information as efficient as possible into the thought vector and how to decode the compression into text. So the model has one clear advantage:
- Compressing input information, i.e. make the input information smaller.
You will see at the end of this article why this is useful!
To be or not to be a dataset, that’s the question. You probably guessed it, we will use Shakespear as input data! In particular, we will use the following file: Hamlet. First, we use NLTK to extract words and then we convert the words to
Character RNN implementation in Python and Tensorflow
I will share the full code I used for the implementation. I will leave out the details. If you have questions or if you like to learn more, please leave a comment below. The code can be found here on GitHub.
- Complex models need sanity checks.
- Build your model step-by-step. After adding a new component to your network, check whether the full model still works.
- Work on small batches of data when developing a model.
Feel free to share your experience in the comments!
Actually, the model is quite good at compression. Consider the following sentence:
to b or not to b tht s th qston
Are you able to figure out what the original sentence is? Probably, you can! If not, please leave a message in the comments! So how is this result obtained? In the model, we train character embeddings. If we visualize the embeddings in 2D (using PCA), we get the following plot:
As you can see, the most frequent characters like “e” and “n” are close to the origin (where the black lines cross). Why? If information is irrelevant, its vector is close to zero because it does not contribute a lot to the information. If I leave out all the e’s in a sentence, you can perfectly read that sentence. So the further away the characters are from the origin, the more they contribute to the information in a sentence.
Not only character information is available. It is also possible to compare thought vectors of entire words. In particular, it would be interesting to see the relation between suffixes (-ly, -s) between the word “most” and “mostly” and “dog” and “dogs” and the relation between prefixes (il-, dis-) as in “illegal” versus “legal” and “agree” versus “disagree”. The results (visualized using PCA) are shown in the following plot:
You can clearly see the similarity between the pair (“dog”, “dogs”) and (“cat”, “cats”). The difference between the two vectors approximately has the same direction. The same holds for (“mostly”, “most”) and (“frequently”, “frequent”). Clearly, this information is stored in the thought vector!
It is not easy to implement a sequence to sequence model, but it can lead to promising results. I described the things I learned during my Seq2Seq model building journey. If you liked this article, please share it, sharing is caring :-)!
Get updates in your inbox
Join over 8,000 data science learners.