Artificial Vision and Language Processing for Robotics

Language Modeling

So far, we have reviewed the most basic techniques for pre-processing text data. Now we are going to dive deeper into the structure of natural language by studying language models. We can consider this topic an introduction to machine learning in NLP.

Introduction to Language Models

A statistical Language Model (LM) is a probability distribution over sequences of words; in other words, it assigns a probability to a particular sentence. For example, an LM can be used to calculate the probability of an upcoming word in a sentence. Building one involves making some assumptions about the structure of the LM and how it will be formed. An LM is never totally correct in its output, but using one is often necessary.

LMs are used in many other NLP tasks. For example, in machine translation, it is important to know which sequence of words is the most probable in the target language. LMs are also used in speech recognition to resolve ambiguity, as well as for spelling correction and summarization.

Let's see how an LM is mathematically represented:

  • P(W) = P(w1, w2, w3, w4, …, wn)

P(W) is our LM and the wi are the words included in W. As we mentioned before, we can use it to compute the probability of an upcoming word in this way:

  • P(w5|w1,w2,w3,w4)

This states the probability of w5 (the upcoming word) given the preceding sequence of words (w1, w2, w3, w4).

Looking at this example, P(w5|w1, w2, w3, w4), we can read it as follows:

  • P(current word | previous words)

Depending on the number of previous words we look at to obtain the probability of the current word, there are different models we can use. So, now we are going to introduce some important concepts regarding such models.

The Bigram Model

A bigram is a pair of consecutive words, and the bigram model conditions each word only on the word immediately before it. For example, the sentence "My cat is white" contains these bigrams:

My cat

Cat is

Is white

Mathematically, a bigram has this form:

  • Bigram model: P(wi|wi-1)
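
To make this concrete, here is a minimal sketch (not part of the book's exercises) that extracts the bigrams of a sentence by pairing each token with the one that follows it:

    # Minimal sketch: pair every token with the next one to obtain the bigrams
    sentence = "My cat is white"
    words = sentence.split()  # simple whitespace tokenization, for illustration only
    bigrams = [(words[i], words[i + 1]) for i in range(len(words) - 1)]
    print(bigrams)  # [('My', 'cat'), ('cat', 'is'), ('is', 'white')]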

N-gram Model

If we increase the number of previous words we look at, we obtain the n-gram model. It works just like the bigram model, but it conditions on a longer sequence of previous words.

Using the previous example of "My cat is white," this is what we can obtain:

  • Trigram

    My cat is

    Cat is white

  • 4-gram

    My cat is white
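
Generalizing the previous sketch, a short helper (again only an illustration, not part of the exercises) can extract n-grams for any value of n:

    # Minimal sketch: extract n-grams of any size from a list of tokens
    def ngrams(words, n):
        return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

    words = "My cat is white".split()
    print(ngrams(words, 3))  # [('My', 'cat', 'is'), ('cat', 'is', 'white')]
    print(ngrams(words, 4))  # [('My', 'cat', 'is', 'white')]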

N-Gram Problem

At this point, you could think that the n-gram model is more accurate than the bigram model because it has access to additional "previous knowledge." However, n-gram models are limited because of long-distance dependencies. Take, for example, the sentence "After thinking about it a lot, I bought a television," which we compute as:

  • P(television| after thinking about it a lot, I bought a)

The sentence "After thinking about it a lot, I bought a television" is probably the only sequence of words with this structure in our corpus. If we change the word "television" for another word, for example "computer," the sentence "After thinking about it a lot, I bought a computer" is also valid, but in our model, the following would be the case:

  • P(computer| after thinking about it a lot, I bought a) = 0

The sentence is valid, but our model assigns it a probability of 0, so we need to be careful when using n-gram models.

Calculating Probabilities

Unigram Probability

The unigram probability is the simplest case to calculate. It is based on the number of times a word appears in the set of documents. Here is the formula:

Figure 3.27: Unigram probability estimation

  • c(wi) is the number of times wi appears in the whole corpus. The size of the corpus is simply the total number of tokens it contains.
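
For example (a made-up illustration, not taken from the book's corpus), if the word "cat" appeared 3 times in a corpus of 100 tokens, its unigram probability would be P(cat) = 3/100 = 0.03.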

Bigram Probability

To estimate bigram probability, we are going to use maximum likelihood estimation:

Figure 3.27: Bigram probability estimation

To understand this formula better, let's look at an example.

Imagine our corpus is composed of these three sentences:

My name is Charles.

Charles is my name.

My dog plays with the ball.

The size of the corpus is 14 words, and now we are going to estimate the probability of the sequence "my name":

Figure 3.28: Example of bigram estimation
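
To make the figure concrete: treating "My" and "my" as the same token, "my" appears 3 times in the corpus and the bigram "my name" appears 2 times, so P(name|my) = 2/3 ≈ 0.67.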

The Chain Rule

Now that we know the concepts of bigrams and n-grams, we need to know how to obtain those probabilities.

If you have basic statistics knowledge, you might think the best option is to apply the chain rule and multiply the conditional probabilities together. For example, for the sentence "My cat is white," the probability is as follows:

  • P(my cat is white) = p(white|my cat is) p(is|my cat) p(cat|my) p(my)

This seems feasible for a short sentence, but for a much longer sentence the long-distance dependency problem appears: the long conditional probabilities can hardly ever be estimated reliably from a corpus, so the resulting estimates could be incorrect.
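
In general, the chain rule decomposes the probability of any sequence as P(w1, w2, …, wn) = P(w1) P(w2|w1) P(w3|w1, w2) … P(wn|w1, …, wn-1). The Markov assumption introduced later in this section is what makes these long conditioning contexts manageable.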

Smoothing

So far, we have a probabilistic model, and if we want to estimate its parameters, we can use maximum likelihood estimation.

One of the big problems with LMs is insufficient data. Our data is limited, so there will be many unseen events. What does this mean? It means we'll end up with an LM that assigns a probability of 0 to unseen words and word sequences.

To solve this problem, we are going to use a smoothing method, which ensures that every probability estimate is greater than zero. The method we are going to use is add-one smoothing:

Figure 3.29: Add-one smoothing in bigram estimation

V is the number of distinct tokens in our corpus.
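
In the bigram case, the smoothed estimate becomes P(wi|wi-1) = (c(wi-1, wi) + 1) / (c(wi-1) + V), so even a bigram that never occurs in the corpus receives a small, non-zero probability.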

Note

There are more smoothing methods with better performance; this is the most basic method.

Markov Assumption

The Markov assumption is very useful for estimating the probabilities of a long sentence, and it lets us work around the problem of long-distance dependencies. It simplifies the chain rule so that the estimate for each word depends only on the previous step:

Figure 3.30: Markov assumption

We can also use a second-order Markov assumption, in which each word depends on the two previous words, but here we are going to use the first-order Markov assumption:

Figure 3.31: Example of Markov

If we apply this to the whole sentence, we get this:

Figure 3.32: Example of Markov for a whole sentence

Decomposing the sequence of words in this way lets us estimate the probabilities much more reliably.
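
For example, under the first-order Markov assumption, the chain rule product shown earlier for "My cat is white" is approximated as P(my cat is white) ≈ P(my) P(cat|my) P(is|cat) P(white|is), where every factor is just a unigram or bigram probability.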

Exercise 13: Create a Bigram Model

In this exercise, we are going to create a simple LM with unigrams and bigrams. We will also compare the results of creating the LM with and without add-one smoothing. One application of n-grams is in keyboard apps, which can predict your next word; that prediction can be done with a bigram model:

  1. Open up your Google Colab interface.
  2. Create a folder for the book.
  3. Declare a small, easy training corpus:

    import numpy as np

    corpus = [

         'My cat is white',

         'I am the major of this city',

         'I love eating toasted cheese',

         'The lazy cat is sleeping',

    ]

  4. Import the required libraries and load the model:

    import spacy

    import en_core_web_sm

    from spacy.lang.en.stop_words import STOP_WORDS

    nlp = en_core_web_sm.load()

  5. Tokenize the corpus with spaCy. To make computing the smoothing and the bigrams easier, we are going to create three lists:

    tokens: All the tokens of the corpus

    tokens_doc: A list of lists containing the tokens of each sentence

    distinc_tokens: All the tokens with duplicates removed:

    tokens = []

    tokens_doc = []

    distinc_tokens = []

    Let's create a first loop to iterate over the sentences in our corpus. The doc variable will contain the sequence of tokens of each sentence:

    for c in corpus:

        doc = nlp(c)

        tokens_aux = []

    Now we are going to create a second loop to iterate through the tokens and append them to the corresponding lists. The t variable will be each token of the sentence:

        for t in doc:

            tokens_aux.append(t.text)

            if t.text not in tokens:

                distinc_tokens.append(t.text) # without duplicates

            tokens.append(t.text)

        tokens_doc.append(tokens_aux)

        tokens_aux = []

    # Print the three lists once every sentence has been processed
    print(tokens)

    print(distinc_tokens)

    print(tokens_doc)

  6. Create the unigram model and test it:

    def unigram_model(word):

        return tokens.count(word)/len(tokens)

    unigram_model("cat")

    Result = 0.1388888888888889

  7. Add the smoothing and test it with the same word:

    def unigram_model_smoothing(word):

        return (tokens.count(word) + 1)/(len(tokens) + len(distinc_tokens))

    unigram_model_smoothing("cat")

    Result = 0.1111111111111111

    Note

    The problem with this smoothing method is that every unseen word has the same probability.
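
    This happens because any unseen word has a count of 0, so its smoothed estimate is always 1 / (len(tokens) + len(distinc_tokens)).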

  8. Create the bigram model:

    def bigram_model(word1, word2):

        hit = 0

  9. We need to iterate through all of the tokens in each document to count the number of times that word1 and word2 appear together, with word2 immediately following word1:

        for d in tokens_doc:

            for t,i in zip(d, range(len(d))): # i is the index of token t in d

                if i <= len(d)-2:

                    if word1 == d[i] and word2 == d[i+1]:

                        hit += 1

        print("Hits: ",hit)

        return hit/tokens.count(word1)

    bigram_model("I","am")

    The output is as follows:

    Figure 3.33: Output showing the times word1 and word2 appear together in the document

  10. Add the smoothing to the bigram model:

    def bigram_model_smoothing(word1, word2):

        hit = 0

        for d in tokens_doc:

            for t,i in zip(d, range(len(d))):

                if i <= len(d)-2:

                    if word1 == d[i] and word2 == d[i+1]:

                        hit += 1

        return (hit+1)/(tokens.count(word1)+len(distinc_tokens))

    bigram_model("I","am")

    The output is as follows:

Figure 3.34: Output after adding smoothing to the model
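
As a follow-up to the keyboard-app idea mentioned at the start of this exercise, here is a minimal sketch of how the smoothed bigram model could be used to suggest the next word. The predict_next helper is hypothetical (it is not part of the book's exercise) and assumes the lists and functions defined above are still in scope:

    def predict_next(word):
        # Score every known token as a candidate next word using the
        # smoothed bigram model and return the highest-scoring one.
        scores = {w: bigram_model_smoothing(word, w) for w in distinc_tokens}
        return max(scores, key=scores.get)

    predict_next("cat")  # with this corpus, "is" should be the top suggestion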

Congratulations! You have completed the last exercise of this chapter. In the next chapter, you will see that language modeling is a fundamental building block of deep NLP approaches. You can now take a huge corpus and create your own LM.

Note

When you apply the Markov assumption to a long sentence, the final probability will be very close to 0. I recommend taking the log() of each component probability and adding them instead of multiplying the raw probabilities. Also, keep an eye on the floating-point precision used in your code (float16 < float32 < float64).
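
For example, a minimal sketch of this log-space trick (the probability values are made up purely for illustration):

    import math

    # Multiplying many small probabilities underflows quickly;
    # summing their logs is numerically stable.
    probs = [0.66, 0.15, 0.05]  # illustrative component probabilities
    log_prob = sum(math.log(p) for p in probs)
    print(log_prob)  # log P(sentence); apply math.exp only if you really need the raw probability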