N-gram Models


What are N-grams?

An n-gram is a sequence of n adjacent phonemes, syllables, letters, or words in a particular order. N-grams are classified by their value of n: if n is 1, we call it a unigram; if n is 2, a bigram; and so on, with the prefix describing the length of the sequence.

Let's take the sentence "The quick brown fox jumps over the lazy dog." and generate n-grams from it:
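As a concrete sketch (not part of the app itself), word-level n-grams can be generated from the sentence like this:

```python
# Generate word-level n-grams from the example sentence.
sentence = "The quick brown fox jumps over the lazy dog."
words = sentence.rstrip(".").split()

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

unigrams = ngrams(words, 1)
bigrams = ngrams(words, 2)
trigrams = ngrams(words, 3)

print(unigrams[:3])  # [('The',), ('quick',), ('brown',)]
print(bigrams[:3])   # [('The', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
print(trigrams[:2])  # [('The', 'quick', 'brown'), ('quick', 'brown', 'fox')]
```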

As you can see, the bigrams carry more context than the unigrams, and the trigrams carry more context than both. Now we will move on to N-gram models.
If you want to learn more about n-grams, you can go here and here.


What are N-gram Language Models?

An N-gram language model predicts the probability of a given N-gram occurring within a sequence of words in a language (or, more precisely, in its training data). To predict the next word, the model takes the last n − 1 words of the input and looks up which word is most likely to follow them in the training data. Always picking the most likely word gives output that accurately reflects the relationships between words; if we want more variety, we can instead randomize the selection of the next word.
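To illustrate the most-likely-next-word idea with a bigram (the tiny corpus and function name here are invented for illustration, not taken from the model described below):

```python
from collections import Counter

# Count word pairs in a tiny corpus, then pick the most
# frequent follower of the given word.
corpus = "the cat sat on the mat and the cat slept".split()
pair_counts = Counter(zip(corpus, corpus[1:]))

def predict_next(word):
    """Most likely word to follow `word`, by raw bigram count."""
    followers = {w2: c for (w1, w2), c in pair_counts.items() if w1 == word}
    return max(followers, key=followers.get) if followers else None

print(predict_next("the"))  # 'cat' follows 'the' twice, 'mat' only once
```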

You can find more information about this here.


How my model works

My N-gram model is fairly simple. It doesn't use any smoothing techniques, and it doesn't need separate functions to create and use different n-grams.
It works with nested dictionaries, where each previous word is a key whose value maps possible next words to their probabilities. For a bigram, it generates a dictionary like this:
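A hand-written sketch of the shape such a bigram dictionary takes (the words and probabilities here are invented for illustration):

```python
# Hypothetical bigram table: outer key = previous word,
# inner key = possible next word, value = probability of that pair.
bigram = {
    "the": {"cat": 0.75, "mat": 0.25},
    "cat": {"sat": 0.5, "slept": 0.5},
}

# The probabilities for each previous word sum to 1.
print(sum(bigram["the"].values()))  # 1.0
```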

The numbers at the innermost level of the dictionary give the probability of that combination of words occurring.
The N-gram model then uses this table to predict more words. You can go to the GitHub repository to see how it's done.
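As a rough sketch of the idea (with an invented table; the repository's actual code may differ), generation can walk the dictionary and sample each next word according to its probability:

```python
import random

# Hypothetical bigram probability table.
table = {
    "the": {"cat": 0.75, "dog": 0.25},
    "cat": {"sat": 1.0},
    "dog": {"ran": 1.0},
}

def generate(start, count, rng=random):
    """Extend `start` by up to `count` words, sampling by probability."""
    words = [start]
    for _ in range(count):
        options = table.get(words[-1])
        if not options:  # last word never appeared as a previous word
            break
        next_word = rng.choices(list(options), weights=list(options.values()))[0]
        words.append(next_word)
    return " ".join(words)

print(generate("the", 2))  # e.g. "the cat sat" or "the dog ran"
```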


How to use

Training

  1. Enter the (preferably large) text you want to train the N-gram on in the corpus field.
  2. Choose the type of N-gram (1 = unigram; 2 = bigram ...).
  3. Choose what symbols to remove from the "Additional settings" dropdown.
  4. Click the download button (this downloads the JSON file to your computer).
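The training steps above can be sketched roughly like this (a hypothetical reimplementation, not the app's actual code; the symbol list and file-name format are assumptions based on the settings described):

```python
import json
import re
import time
from collections import defaultdict

def train_bigram(corpus, remove=r"[,.!?]"):
    """Strip unwanted symbols, count word pairs, normalize to probabilities."""
    tokens = re.sub(remove, "", corpus).split()
    counts = defaultdict(lambda: defaultdict(int))
    for prev, cur in zip(tokens, tokens[1:]):
        counts[prev][cur] += 1
    return {
        prev: {w: c / sum(nxt.values()) for w, c in nxt.items()}
        for prev, nxt in counts.items()
    }

model = train_bigram("the cat sat, the cat slept.")
# Save under a timestamped name, similar to the app's "Ngram_{Time-stamp}.json".
with open(f"Ngram_{int(time.time())}.json", "w") as f:
    json.dump(model, f)
```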

Uploading

This is where you upload the N-gram that will be used to predict text.
If you already have a JSON file, you don't need to train again.
  1. Click the field that asks you to upload (drag and drop is not implemented yet).
  2. Browse to your JSON file. Its name should look like "Ngram_{Time-stamp}.json".
  3. Click the upload button.

Generating

  1. Enter text that appears in your corpus.
  2. Check "choose?" if you want to select only one word.
    Uncheck "choose?" if you want to see what the available words are.
  3. Check "predict many?" if you want to generate words repeatedly. ("choose?" will be disabled and new field would appear.)
  4. Choose the word count. (This is the amount of words that the n-gram will generate.)
  5. Click "Generate" and you will find your output under "Output:".
