Part 1 – extracting n-grams from a sentence (10 pts)
Complete the function get_ngrams, which takes a list of strings and an integer n as input,
and returns padded n-grams over the list of strings. The result should be a list of Python
tuples.
For example:
>>> get_ngrams(["natural","language","processing"],1)
[('START',), ('natural',), ('language',), ('processing',), ('STOP',)]
>>> get_ngrams(["natural","language","processing"],2)
[('START', 'natural'), ('natural', 'language'), ('language', 'processing'), ('processing', 'STOP')]
>>> get_ngrams(["natural","language","processing"],3)
[('START', 'START', 'natural'), ('START', 'natural', 'language'), ('natural', 'language', 'processing'), ('language', 'processing', 'STOP')]
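One way to satisfy these examples is sketched below. This is not the required implementation, only a minimal sketch assuming the padding convention shown above: n-1 'START' tokens for n > 1 (a single 'START' for n == 1) and one 'STOP' token.

```python
def get_ngrams(sequence, n):
    # Pad with max(1, n-1) 'START' tokens and a single 'STOP' token,
    # matching the unigram/bigram/trigram examples above.
    padded = ["START"] * max(1, n - 1) + list(sequence) + ["STOP"]
    # Slide a window of size n over the padded sequence; each window
    # becomes one tuple in the result list.
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]
```

The list comprehension produces the n-grams in left-to-right order, so the output matches the doctest-style examples exactly.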
Part 2 – counting n-grams in a corpus (10 pts)
We will work with two different data sets. The first data set is the Brown corpus, which is
a sample of American written English collected in the 1950s. The format of the data is a
plain text file brown_train.txt, containing one sentence per line. Each sentence has
already been tokenized. For this assignment, no further preprocessing is necessary.
Don't touch brown_test.txt yet. We will use this data to compute the perplexity of our
language model.
Reading the Corpus and Dealing with Unseen Words
This part has been implemented for you and is explained in this section. Take a look at
the function corpus_reader in trigram_model.py. This function takes the name of a text
file as a parameter and returns a Python generator object. Generators allow you to iterate
over a collection one item at a time, without ever having to represent the entire data set
in a data structure (such as a list). This is a form of lazy evaluation. You could use this
function as follows:
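For illustration, here is a hedged sketch of both the reader and its use. The body of corpus_reader below is an assumption based on the description above (one pre-tokenized sentence per line, split on whitespace); consult trigram_model.py for the actual implementation.

```python
def corpus_reader(corpusfile):
    # Sketch only: yield one tokenized sentence (a list of strings)
    # per non-blank line, without ever building a list of all sentences.
    with open(corpusfile, "r") as f:
        for line in f:
            if line.strip():
                yield line.strip().split()

# Typical use: iterate lazily over the corpus, one sentence at a time.
# for sentence in corpus_reader("brown_train.txt"):
#     print(sentence)
```

Because the function yields rather than returns, each call to the loop body sees one sentence while the rest of the file stays on disk.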