Spelling correction is not a trivial task for a computer. Increasingly sophisticated models are being invented to tackle problems like this one, and language models are the kind of models that are used for this task. Language models are also used for correcting errors in speech recognition, machine translation, language and authorship identification, text compression and topic relevance ranking. In this article, a language model is used for a simple spelling correction application.
There are many ways to tackle spelling errors. One of the simplest ideas is to check each word against a dictionary. Some problems can occur here. For example, if I made an error in the word “ra”, did I mean “rat” or “read” originally? It depends on the context. In the sentence “I saw a ra running”, I probably mean “I saw a rat running” (and not “I saw a read running”). Furthermore, it takes a vast amount of space to store the dictionary, and a dictionary does not account for new words that are not stored in it. Another way of handling spelling errors is to convert a sentence to its Soundex codes and then convert these back to a sentence. The Soundex code is described in (Knuth 1973, Davidson 1962) and it reduces words to their sound-related features. The word “bicycle” is converted to the code “B224” in the Soundex system. The misspelling “bycicle” is also converted to “B224”, so even with spelling errors we get the same code. Creating the code is done by discarding vowels and grouping similar-sounding consonants. If we have converted all words to their corresponding Soundex codes, we can try to reconstruct the original sentence without spelling errors.
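To make the Soundex idea concrete, here is a minimal sketch of the classic algorithm (keep the first letter, drop vowels, group similar-sounding consonants into digit classes, and collapse adjacent duplicates). The grouping table below follows the standard American Soundex rules:

```python
def soundex(word):
    # Digit classes for similar-sounding consonants (standard Soundex groups).
    codes = {c: d for d, cs in enumerate(
        ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1) for c in cs}
    word = word.lower()
    first = word[0].upper()
    digits = []
    prev = codes.get(word[0])
    for c in word[1:]:
        d = codes.get(c)  # None for vowels, h, w, y
        if d is not None and d != prev:
            digits.append(str(d))
        if c not in "hw":  # h and w do not separate duplicate codes
            prev = d
    # Pad with zeros and truncate to the usual 4-character code.
    return (first + "".join(digits) + "000")[:4]

print(soundex("bicycle"))  # B224
print(soundex("bycicle"))  # B224
```

Both the correct and the misspelled word map to “B224”, which is exactly the property the article relies on.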
Now what is a language model? A language model is a simplified statistical model of text. It is a data driven approach, in contrast to the many rule based approaches that exist. Rule based approaches are based on expert knowledge. The problem here is that human languages are practically infinite. Every word has many variants and, to make things worse, words can be combined to form new words. If a new brand pops up, new words are introduced. So language changes over time. That means that a static set of rules can never define all aspects of a language. Therefore, we need another approach to model language. This can be done with a statistical language model.
Suppose we have a language model M. Then we are able to compare how well-formed sentences are. Suppose we have the sentence “The bycicle was green.” You can see the spelling error “bycicle” here. The correct sentence would be “The bicycle was green.” So, in our statistical model, we would say that it is more probable that “The bicycle was green.” was generated by the English language model than “The bycicle was green.” In mathematics, we would say: P(“The bicycle was green.” | M) > P(“The bycicle was green.” | M).
Notice that we are also capable of finding the right topic for a text. Suppose we have two models: M_food and M_sports. Suppose we have the sentence “Hot coffee.” Then we know that P(“Hot coffee.” | M_food) > P(“Hot coffee.” | M_sports). In words: it is more probable that “Hot coffee.” is generated by a language model about food than that it was generated by a language model about sports.
Now we can turn to our spelling correction problem. First, we need a sentence which has a spelling error in it. Take the sentence “buy chocolat chip cookies” as an example. Our goal is to find that “chocolat” has a spelling error. Suppose we have a language model which we would like to train and use for our example sentence. So we first need some correctly spelled sentences. For simplicity we will use the following correctly spelled sentences:
"cookies", "chocolate", "chip", "chocolate chip cookie", "chocolate chip cookies", "buy"
The first goal is to detect that the sentence has a spelling error in it. So we will first look at P(“buy chocolat chip cookies”). Notice that this probability can be decomposed: P(“buy chocolat chip cookies”) = P(“buy”) · P(“chocolat”) · P(“chip”) · P(“cookies”). This decomposition is only valid if we assume that the words are independent of each other (which is actually not the case). But for simplicity, we will assume that the words are independent of each other (and identically distributed).
First, we look at the unigrams. Unigrams are the single words in a text. The training sentences contain 10 words in total, so from the example sentences we know that P(“buy”) = 1/10. In words: one out of every ten words we have seen is the word “buy”. We also know from the example sentences that P(“chocolat”) = 0, since we haven’t seen the word “chocolat” in our training sentences. Furthermore, we know that P(“chip”) = 3/10 and P(“cookies”) = 2/10. So using this simple model, we can already see that “chocolat” is not spelled correctly. The probability that “buy chocolat chip cookies” is a valid sentence (according to our model) is P(“buy”) · P(“chocolat”) · P(“chip”) · P(“cookies”) = 0! Now we only need to correct “chocolat”.
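These unigram probabilities can be computed directly from the training sentences. A minimal sketch (the function name `unigram_probability` is just a label chosen for this illustration):

```python
from collections import Counter

# The correctly spelled training sentences from the article.
sentences = ["cookies", "chocolate", "chip",
             "chocolate chip cookie", "chocolate chip cookies", "buy"]

# Split the sentences into words and count each word.
tokens = [word for sentence in sentences for word in sentence.split()]
counts = Counter(tokens)
total = len(tokens)  # 10 words in total

def unigram_probability(word):
    # Maximum likelihood estimate: relative frequency of the word.
    return counts[word] / total

for word in "buy chocolat chip cookies".split():
    print(word, unigram_probability(word))
# buy 0.1, chocolat 0.0, chip 0.3, cookies 0.2
```

Any word with probability 0 (here “chocolat”) is flagged as a likely spelling error.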
The correction can be done easily by looking at the context. For example, around “chocolat” we have the words “chip” and “cookies”. We can ask our model to fill in the blank in the following sentence: “_ chip cookies” (notice that we could also take another number of context words, but for simplicity we will use 3 words). The model would come up with “chocolate”, since it has seen the sentence “chocolate chip cookies”. The “distance” (here: the Levenshtein distance) between “chocolat” and “chocolate” is 1, so there is only a small difference between the words (probably a typing error, which often amounts to a single mistake). And the word “chocolate” would make the sentence perfectly valid! In mathematics: P(“buy chocolate chip cookies”) = P(“buy”) · P(“chocolate”) · P(“chip”) · P(“cookies”) = 1/10 · 3/10 · 3/10 · 2/10 = 0.0018 > 0, so the corrected sentence is indeed a valid sentence according to our model. So now we have successfully applied spelling error detection and spelling error correction, which was our goal.
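The Levenshtein distance used above can be computed with the classic dynamic programming recurrence, as in this short sketch:

```python
def levenshtein(a, b):
    # prev[j] holds the edit distance between a[:i-1] and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein("chocolat", "chocolate"))  # 1
```

A distance of 1 means a single insertion, deletion or substitution turns one word into the other, which matches the intuition of a single typing mistake.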
The previous example is shown in the following Python code. Only the unigram spelling correction is implemented.
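A minimal sketch of such a unigram corrector might look as follows. It flags words with zero unigram probability and replaces them with the most similar known word; here the standard library helper `difflib.get_close_matches` stands in for an explicit Levenshtein search, which is a simplification of the context-based correction described above:

```python
from collections import Counter
import difflib

# The correctly spelled training sentences from the article.
sentences = ["cookies", "chocolate", "chip",
             "chocolate chip cookie", "chocolate chip cookies", "buy"]
counts = Counter(w for s in sentences for w in s.split())
vocab = sorted(counts)

def correct(sentence):
    corrected = []
    for word in sentence.split():
        if word in counts:
            # Nonzero unigram probability: keep the word as-is.
            corrected.append(word)
        else:
            # Unseen word: replace it with the closest known word, if any.
            match = difflib.get_close_matches(word, vocab, n=1)
            corrected.append(match[0] if match else word)
    return " ".join(corrected)

print(correct("buy chocolat chip cookies"))  # buy chocolate chip cookies
```

Running it on the example sentence detects “chocolat” as unseen and replaces it with “chocolate”, yielding the corrected sentence.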
So by using a statistical approach with a language model, we were able to find and correct a spelling error. By using a rule based approach, this would also be possible, but it would not work for relatively new words. There are many extensions possible to this statistical approach, but notice that it is a powerful approach which can be easily transformed to give solutions to related problems (such as finding a topic for a given sentence). If you have any questions or additions, feel free to write a comment.