# Implement spelling correction using Language Models

Spelling correction is not a trivial task for a computer. Better and better models are invented to tackle problems such as spelling correction. Language models are the kind of models that are being used for this task. Language models are also used for correcting errors in speech recognition, machine translation, for language and authorship identification, text compression and topic relevance ranking. In this article, language models are being used for a simple spelling correction application.

There are many ways to tackle spelling errors. One of the simplest ideas is to check each word against a dictionary. There are some problems that can occur here. For example, if I made an error in the word “ra”, did I mean “rat” or “read” originally? It depends on the context. In the sentence “I saw a ra running”, I probably mean “I saw a rat running” (and not “I saw a read running”). Furthermore, it would take a vast amount of space to store the dictionary and besides that it does not take into account new words that are not stored in the dictionary. Another way of handling spelling errors, is to convert a sentence to its Soundex and then convert it back to a sentence. The Soundex was invented by (Knuth 1973, Davidson 1962) and it convert words to only their sound related features. The word “bicycle” is converted code “B224” in the Soundex system. Also “bycicle” and “bicycle” are converted to “B224”, so even with spelling errors we get the same code. Creating the code is done by discarding vowels and grouping consonant letters. If we have converted all words to their corresponding Soundex code, we can try to reconstruct the original sentence without spelling errors.

## Language Model?

Now what is a language model? A language model is a simplified statistical model of text. It is a data driven approach instead of the many rule based approaches that exist. Rule based approaches are based on expert knowledge. The problem here is that languages (at least the human language) are approximately infinite. Every word has at least many variants and to make things worse, words can be combined to form new words. If a new brand pops up, then new words are introduced. So language is changing over time. That means that a static set of rules can never define all aspects of a language. Therefore, we need another approach to model language. This can be done by a statistical language model.

## A Topic Model

Suppose we have a language model $M_{\mbox{English}}$. Then we are able to compare how well-formed a sentence is. Suppose we have a sentence “The bycicle was green.” You can see the spelling error “bycicle” here. The correct sentence would be “The bicycle was green.” So, in our statistical model, we would say that it is more probable that “The bicycle was green.” was generated by the English language model than “The bycicle was green.” In mathematics, we would say: $P(\mbox{The bicycle was green.} | M_{\mbox{English}}) > P(\mbox{The bycicle was green.} | M_{\mbox{English}})$.

Notice that we are also capable of finding the right topic for a text. Suppose we have two models: $M_{\mbox{sports}}$ and $M_{\mbox{food}}$. Suppose we have a sentence “Hot coffee.” Then, we now that $P(\mbox{Hot coffee.} | M_{\mbox{food}} )$$> P(\mbox{Hot coffee.} | M_{\mbox{sports}} )$. In words: it is more probable that “Hot coffee.” is generated by a language model about food than that it was generated by a language model about sports.

By the way, maybe this article is interesting for you. It is about the implementation of a e-mail spam detector and is closely related to language modeling. Or maybe you are interested in a more advanced model in which we will use a sequence-to-sequence model.

## The Spelling Correction problem

Now we can turn to our spelling correction problem. First, we need a sentence which has a spelling error in it. Take the sentence “buy chocolat chip cookies” as an example. Our goal is to find that “chocolat” has a spelling error. Suppose we have a language model $M$ which we would like to train and use for our example sentence. So we first need some correctly spelled sentences. For simplicity we will use the following correctly spelled sentences:

"cookies", "chocolate", "chip", "chocolate chip cookie", "chocolate chip cookies", "buy"

## Decompose it

The first goal is to see that the sentence has a spelling error in it. So we will first look at $P(\mbox{buy, chocolat, chip, cookies} | M)$. Notice that this probability can be decomposed: $P(\mbox{buy, chocolat, chip, cookies} | M) = P(\mbox{buy} | M) P(\mbox{chocolat} | M) P(\mbox{chip} | M) P(\mbox{cookies} | M)$. This decomposition might be done if and only if we assume that the words are independent of each other (which is actually not the case). But for simplicity, we will assume that the words are independent of each other (and identically distributed).

## A Simple Unigram Model

First, we look at the unigrams. Unigrams are single words in a text. So from the example sentences, we now that $P(\mbox{buy} | M)=1$. In words, it is 100% probable that “buy” is a unigram (a word). We also now from the example sentences that $P(\mbox{chocolat} | M)=0$. We haven’t seen the word “chocolat” in our training sentences. And furthermore we now that $P(\mbox{chip} | M) = 1$ and $P(\mbox{cookies} | M) = 1$. So using this simple model, we can already see that “chocolat” is not spelled correctly. So $P(\mbox{buy, chocolat, chip, cookies} | M)$$=P(\mbox{buy} | M) P(\mbox{chocolat} | M) P(\mbox{chip} | M) P(\mbox{cookies} | M)$$= 1 \cdot 0 \cdot 1 \cdot 1 = 0$. The probability that “buy chocolate chip cookies” is a valid sentence (according to our model) is $0$! Now we only need to correct “chocolat”.

The correction can be done easily by looking at the context. For example, around “chocolat”, we have the words “chip” and “cookies”. We can ask our model to fill in the blanks in the following sentence: “_ chip cookies” (notice that we could also take another number of words, but for simplicity we will use 3 words). The model would come up with “chocolate” since it has seen the sentence “chocolate chip cookies”. The “distance” (here: Levenshtein distance) between “chocolat” and “chocolate” is 1, so there is a small difference between the words (probably a typing error, which often is equal to one mistake). But the word “chocolate” would make the sentence perfectly valid! In mathematics: $P(\mbox{buy, chocolate, chip, cookies} | M)$$=P(\mbox{buy} | M) P(\mbox{chocolate} | M) P(\mbox{chip} | M) P(\mbox{cookies} | M)$$= 1 \cdot 1 \cdot 1 \cdot 1 = 1$ which is indeed a perfectly valid sentence. So now we have succesfully applied spelling error detection and spelling error correction, which was our goal.

The previous example is shown in the following Python code. Only the unigram spelling correction is implemented.

from collections import defaultdict
import math

def get_words(sentence):
"""
Get a list of all words which are found in a given sentence.

Parameters
----------
sentence : str
The sentence to find words in.

Returns
-------
list
A list containing all words found in the sentence.
"""
return sentence.split()

def calculate_distance(word1, word2):
"""
Calculate the distance between two words. This method could be replaced by a method calculating the Levenshtein
distance.

Parameters
----------
word1 : str
First word.
word2 : str
Second word.

Returns
-------
float
Distance between word1 and word2
"""
return len(word1.replace(word2, ''))

# Define the training and test data
training_sentences = ["cookies", "chocolate", "chip", "chocolate chip cookie", "chocolate chip cookies", "buy"]
sentence = "buy chocolat chip cookies"

# Find all words in the data
training_data = [get_words(sentence) for sentence in training_sentences]
words = get_words(sentence)

# Show the sentence with spelling errors
print("Incorrect sentence:\t{0}\n".format(sentence))

# Count unigrams
counts = defaultdict(int)
for sentence_data in training_data:
for unigram in sentence_data:
counts[unigram] += 1

# Normalize probabilities
totals = sum(counts.values())
for unigram in counts:
counts[unigram] /= totals

# Now check every word in the test sentence
correct_unigrams = []
for word in words:
if counts[word] == 0:
# The word did not occur in the training data!
print("Spelling error detected:\t{0}".format(word))
# Now find a good replacement
minimum_distance = math.inf
best_replacement = None
for vocab_word in counts:
distance = calculate_distance(vocab_word, word)
# Make sure the vocabulary word is not the same as the spelling error and make sure it is better than the
# result found so far
if 0 < distance < minimum_distance:
minimum_distance = distance
best_replacement = vocab_word
# Show the correction
print("Best found correct word:\t{0} (distance: {1})\n".format(best_replacement, minimum_distance))
correct_unigrams.append(best_replacement)
else:
correct_unigrams.append(word)

print("Corrected sentence:\t{0}".format(' '.join(correct_unigrams)))

## Conclusions (TL;DR)

So by using a statistical approach with a language model, we were able to find and correct a spelling error. By using a rule based approach, this would also be possible, but it would not work for relatively new words. There are many extensions possible to this statistical approach, but notice that it is a powerful approach which can be easily transformed to give solutions to related problems (such as finding a topic for a given sentence). If you have any questions or additions, feel free to write a comment.

#### Kevin Jacobs

Kevin Jacobs is a certified Data Scientist and blog writer for Data Blogger. He is passionate about any project that involves large amounts of data and statistical data analysis. Kevin can be reached using Twitter (@kmjjacobs), LinkedIn or via e-mail: kevin8080nl@gmail.com. Want to write for us? Then please check out Meta Blogger!