Spelling correction using Language Models

Words

Spelling correction is not a trivial task for a computer. Better and better models are invented to tackle problems such as spelling correction. Language models are the kind of models that are being used for this task. Language models are also used for correcting errors in speech recognition, machine translation, for language and authorship identification, text compression and topic relevance ranking. In this article, language models are being used for a simple spelling correction application.

There are many ways to tackle spelling errors. One of the simplest ideas is to check each word against a dictionary. There are some problems that can occur here. For example, if I made an error in the word “ra”, did I mean “rat” or “read” originally? It depends on the context. In the sentence “I saw a ra running”, I probably mean “I saw a rat running” (and not “I saw a read running”). Furthermore, it would take a vast amount of space to store the dictionary and besides that it does not take into account new words that are not stored in the dictionary. Another way of handling spelling errors, is to convert a sentence to its Soundex and then convert it back to a sentence. The Soundex was invented by (Knuth 1973, Davidson 1962) and it convert words to only their sound related features. The word “bicycle” is converted code “B224” in the Soundex system. Also “bycicle” and “bicycle” are converted to “B224”, so even with spelling errors we get the same code. Creating the code is done by discarding vowels and grouping consonant letters. If we have converted all words to their corresponding Soundex code, we can try to reconstruct the original sentence without spelling errors.

Now what is a language model? A language model is a simplified statistical model of text. It is a data driven approach instead of the many rule based approaches that exist. Rule based approaches are based on expert knowledge. The problem here is that languages (at least the human language) are approximately infinite. Every word has at least many variants and to make things worse, words can be combined to form new words. If a new brand pops up, then new words are introduced. So language is changing over time. That means that a static set of rules can never define all aspects of a language. Therefore, we need another approach to model language. This can be done by a statistical language model.

Suppose we have a language model M_{\mbox{English}}. Then we are able to compare how well-formed a sentence is. Suppose we have a sentence “The bycicle was green.” You can see the spelling error “bycicle” here. The correct sentence would be “The bicycle was green.” So, in our statistical model, we would say that it is more probable that “The bicycle was green.” was generated by the English language model than “The bycicle was green.” In mathematics, we would say: P(\mbox{The bicycle was green.} | M_{\mbox{English}}) > P(\mbox{The bycicle was green.} | M_{\mbox{English}}).

Notice that we are also capable of finding the right topic for a text. Suppose we have two models: M_{\mbox{sports}} and M_{\mbox{food}}. Suppose we have a sentence “Hot coffee.” Then, we now that P(\mbox{Hot coffee.} | M_{\mbox{food}} ) > P(\mbox{Hot coffee.} | M_{\mbox{sports}} ). In words: it is more probable that “Hot coffee.” is generated by a language model about food than that it was generated by a language model about sports.

Now we can turn to our spelling correction problem. First, we need a sentence which has a spelling error in it. Take the sentence “buy chocolat chip cookies” as an example. Our goal is to find that “chocolat” has a spelling error. Suppose we have a language model M which we would like to train and use for our example sentence. So we first need some correctly spelled sentences. For simplicity we will use the following correctly spelled sentences:

"cookies", "chocolate", "chip", "chocolate chip cookie", "chocolate chip cookies", "buy"

The first goal is to see that the sentence has a spelling error in it. So we will first look at P(\mbox{buy, chocolat, chip, cookies} | M). Notice that this probability can be decomposed: P(\mbox{buy, chocolat, chip, cookies} | M) = P(\mbox{buy} | M) P(\mbox{chocolat} | M) P(\mbox{chip} | M) P(\mbox{cookies} | M). This decomposition might be done if and only if we assume that the words are independent of each other (which is actually not the case). But for simplicity, we will assume that the words are independent of each other (and identically distributed).

First, we look at the unigrams. Unigrams are single words in a text. So from the example sentences, we now that P(\mbox{buy} | M)=1. In words, it is 100% probable that “buy” is a unigram (a word). We also now from the example sentences that P(\mbox{chocolat} | M)=0. We haven’t seen the word “chocolat” in our training sentences. And furthermore we now that P(\mbox{chip} | M) = 1 and P(\mbox{cookies} | M) = 1. So using this simple model, we can already see that “chocolat” is not spelled correctly. So P(\mbox{buy, chocolat, chip, cookies} | M) =P(\mbox{buy} | M) P(\mbox{chocolat} | M) P(\mbox{chip} | M) P(\mbox{cookies} | M) = 1 \cdot 0 \cdot 1 \cdot 1 = 0. The probability that “buy chocolate chip cookies” is a valid sentence (according to our model) is 0! Now we only need to correct “chocolat”.

The correction can be done easily by looking at the context. For example, around “chocolat”, we have the words “chip” and “cookies”. We can ask our model to fill in the blanks in the following sentence: “_ chip cookies” (notice that we could also take another number of words, but for simplicity we will use 3 words). The model would come up with “chocolate” since it has seen the sentence “chocolate chip cookies”. The “distance” (here: Levenshtein distance) between “chocolat” and “chocolate” is 1, so there is a small difference between the words (probably a typing error, which often is equal to one mistake). But the word “chocolate” would make the sentence perfectly valid! In mathematics: P(\mbox{buy, chocolate, chip, cookies} | M) =P(\mbox{buy} | M) P(\mbox{chocolate} | M) P(\mbox{chip} | M) P(\mbox{cookies} | M) = 1 \cdot 1 \cdot 1 \cdot 1 = 1 which is indeed a perfectly valid sentence. So now we have succesfully applied spelling error detection and spelling error correction, which was our goal.

The previous example is shown in the following Python code. Only the unigram spelling correction is implemented.

from collections import defaultdict
import math


def get_words(sentence):
    """
    Get a list of all words which are found in a given sentence.

    Parameters
    ----------
    sentence : str
        The sentence to find words in.

    Returns
    -------
    list
        A list containing all words found in the sentence.
    """
    return sentence.split()


def calculate_distance(word1, word2):
    """
    Calculate the distance between two words. This method could be replaced by a method calculating the Levenshtein
    distance.

    Parameters
    ----------
    word1 : str
        First word.
    word2 : str
        Second word.

    Returns
    -------
    float
        Distance between word1 and word2
    """
    return len(word1.replace(word2, ''))

# Define the training and test data
training_sentences = ["cookies", "chocolate", "chip", "chocolate chip cookie", "chocolate chip cookies", "buy"]
sentence = "buy chocolat chip cookies"

# Find all words in the data
training_data = [get_words(sentence) for sentence in training_sentences]
words = get_words(sentence)

# Show the sentence with spelling errors
print("Incorrect sentence:\t{0}\n".format(sentence))

# Count unigrams
counts = defaultdict(int)
for sentence_data in training_data:
    for unigram in sentence_data:
        counts[unigram] += 1

# Normalize probabilities
totals = sum(counts.values())
for unigram in counts:
    counts[unigram] /= totals

# Now check every word in the test sentence
correct_unigrams = []
for word in words:
    if counts[word] == 0:
        # The word did not occur in the training data!
        print("Spelling error detected:\t{0}".format(word))
        # Now find a good replacement
        minimum_distance = math.inf
        best_replacement = None
        for vocab_word in counts:
            distance = calculate_distance(vocab_word, word)
            # Make sure the vocabulary word is not the same as the spelling error and make sure it is better than the
            # result found so far
            if 0 < distance < minimum_distance:
                minimum_distance = distance
                best_replacement = vocab_word
        # Show the correction
        print("Best found correct word:\t{0} (distance: {1})\n".format(best_replacement, minimum_distance))
        correct_unigrams.append(best_replacement)
    else:
        correct_unigrams.append(word)

print("Corrected sentence:\t{0}".format(' '.join(correct_unigrams)))

So by using a statistical approach with a language model, we were able to find and correct a spelling error. By using a rule based approach, this would also be possible, but it would not work for relatively new words. There are many extensions possible to this statistical approach, but notice that it is a powerful approach which can be easily transformed to give solutions to related problems (such as finding a topic for a given sentence). If you have any questions or additions, feel free to write a comment.

Kevin Jacobs

Kevin Jacobs

Kevin Jacobs is a certified Data Scientist and blog writer for Data Blogger. He is passionate about any project that involves large amounts of data and statistical data analysis. Kevin can be reached using Twitter (@kmjjacobs), LinkedIn or via e-mail: mail@kevinjacobs.nl.