Movie Name Generation Using GPT-2

Since its introduction in 2017 in the paper Attention Is All You Need (https://arxiv.org/abs/1706.03762), the Transformer has quickly become the dominant model in NLP. Because it processes text in parallel rather than sequentially (as RNNs do), it allows much larger models to be trained, and the attention mechanism it introduced has proved extremely effective at capturing context in text.

Following the paper, several popular Transformer-based models surfaced, among the best known of which is GPT. GPT models are developed and trained by OpenAI, one of the leaders in AI research. The latest release is GPT-3, which has 175 billion parameters. The model is advanced enough that OpenAI chose not to open-source it; instead, it can be accessed through an API after a signup process and a long waitlist.

However, GPT-2, their previous release, is open-source and available through many deep learning frameworks.

In this exercise, we use Huggingface and PyTorch to fine-tune a GPT-2 model for movie name generation.

Overview:

  • Imports and Data Loading
  • Data Preprocessing
  • Setup and Training
  • Movie Name Generation
  • Model Saving and Loading

Imports and Data Loading

Please use pip install {library name} to install any of the libraries below that are not already installed. "transformers" is the Huggingface library.
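
For example, in a notebook environment a single cell along these lines (assuming a standard pip setup) should install everything used below:

!pip install transformers torch pandas numpy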

In [2]:
import re
import pandas as pd
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModelWithLMHead
import torch.optim as optim

We set the device to enable GPU processing.

In [3]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
device
Out[3]:
device(type='cuda', index=0)

Data Preprocessing

In [5]:
movies_file = "movies.csv"

Since the file is in CSV format, we use pandas.read_csv() to read it.

In [7]:
raw_df = pd.read_csv(movies_file)
raw_df
Out[7]:
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy
... ... ... ...
9737 193581 Black Butler: Book of the Atlantic (2017) Action|Animation|Comedy|Fantasy
9738 193583 No Game No Life: Zero (2017) Animation|Comedy|Fantasy
9739 193585 Flint (2017) Drama
9740 193587 Bungo Stray Dogs: Dead Apple (2018) Action|Animation
9741 193609 Andrew Dice Clay: Dice Rules (1991) Comedy

9742 rows × 3 columns

We can see that we have 9742 movie names in the title column. Since the other columns are not useful for our purposes, we keep only that column.

In [29]:
movie_names = raw_df['title']
movie_names
Out[29]:
0                                Toy Story (1995)
1                                  Jumanji (1995)
2                         Grumpier Old Men (1995)
3                        Waiting to Exhale (1995)
4              Father of the Bride Part II (1995)
                          ...                    
9737    Black Butler: Book of the Atlantic (2017)
9738                 No Game No Life: Zero (2017)
9739                                 Flint (2017)
9740          Bungo Stray Dogs: Dead Apple (2018)
9741          Andrew Dice Clay: Dice Rules (1991)
Name: title, Length: 9742, dtype: object

As seen, the movie names all end with the release year. While it might be interesting to keep the years and let the model generate release years as well, we can safely assume they do not help the model learn the structure of movie names.

We remove them with a simple regular expression:

In [30]:
movie_list = list(movie_names)
In [31]:
def remove_year(name):
    return re.sub(r"\([0-9]+\)", "", name).strip()
In [32]:
movie_list = [remove_year(name) for name in movie_list]
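
As a quick sanity check (not part of the original notebook), the helper should strip only the parenthesized year:

remove_year("Toy Story (1995)")   # -> 'Toy Story'
remove_year("Jumanji (1995)")     # -> 'Jumanji'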

The final movie list looks ready for training. Notice that we do not need to tokenize or process the text any further, since GPT-2 comes with its own tokenizer that handles raw text in the appropriate way.

In [34]:
movie_list[:5]
Out[34]:
['Toy Story',
 'Jumanji',
 'Grumpier Old Men',
 'Waiting to Exhale',
 'Father of the Bride Part II']

However, we still need fixed-length inputs. We use the average movie name length (in words) to choose a safe maximum length.

In [39]:
avg_length = sum([len(name.split()) for name in movie_list])/len(movie_list)
avg_length
Out[39]:
3.2991172243892426

Since the average movie name is about 3.3 words long, we can assume that a maximum length of 10 will cover most of the instances.

In [40]:
max_length = 10
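
If you want to verify that assumption rather than rely on the average alone, a quick optional check of the coverage could look like this:

coverage = sum(len(name.split()) <= max_length for name in movie_list) / len(movie_list)
print(f"{coverage:.1%} of titles have at most {max_length} words")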

Setup and Training

Before creating the dataset, we download the model and its tokenizer; the tokenizer is needed to convert the text into token ids.

In [120]:
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelWithLMHead.from_pretrained("gpt2")
/usr/local/lib/python3.7/dist-packages/transformers/models/auto/modeling_auto.py:698: FutureWarning: The class `AutoModelWithLMHead` is deprecated and will be removed in a future version. Please use `AutoModelForCausalLM` for causal language models, `AutoModelForMaskedLM` for masked language models and `AutoModelForSeq2SeqLM` for encoder-decoder models.
  FutureWarning,
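
As the warning suggests, AutoModelWithLMHead is deprecated; if you prefer to avoid the warning, the same GPT-2 checkpoint can be loaded with AutoModelForCausalLM and the rest of the code stays unchanged:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")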

We send the model to the device and initialize the optimizer.

In [121]:
model = model.to(device)
In [122]:
optimizer = optim.AdamW(model.parameters(), lr=3e-4)

Following the GPT-2 paper, we fine-tune the model using a task designator.

For our purposes, the designator is simply "movie: ". It will be added to the beginning of every example.

To correctly pad and truncate the instances, we find the number of tokens used by this designator:

In [108]:
tokenizer.encode("movie: ")
Out[108]:
[41364, 25, 220]
In [109]:
extra_length = len(tokenizer.encode("movie: ")) 
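
To see what a complete training example looks like before padding, you can encode the designator, a movie title, and the EOS token together (a quick illustrative check, not part of the original notebook):

sample = "movie: " + movie_list[0] + tokenizer.eos_token
print(sample)                    # movie: Toy Story<|endoftext|>
print(tokenizer.encode(sample))  # the token ids the dataset will pad/truncate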

We create a simple dataset that extends the PyTorch Dataset class:

In [110]:
class MovieDataset(Dataset):  
    def __init__(self, tokenizer, init_token, movie_titles, max_len):
        self.max_len = max_len
        self.tokenizer = tokenizer
        self.eos = self.tokenizer.eos_token
        self.eos_id = self.tokenizer.eos_token_id
        self.movies = movie_titles
        self.result = []

        for movie in self.movies:
            # Encode the text using tokenizer.encode(). We add the EOS token at the end
            tokenized = self.tokenizer.encode(init_token + movie + self.eos)
            
            # Padding/truncating the encoded sequence to max_len 
            padded = self.pad_truncate(tokenized)            

            # Creating a tensor and adding to the result
            self.result.append(torch.tensor(padded))

    def __len__(self):
        return len(self.result)


    def __getitem__(self, item):
        return self.result[item]

    def pad_truncate(self, name):
        # Length of the title portion, excluding the task-designator tokens
        name_length = len(name) - extra_length
        if name_length < self.max_len:
            difference = self.max_len - name_length
            result = name + [self.eos_id] * difference
        elif name_length > self.max_len:
            result = name[:self.max_len + 2]+[self.eos_id] 
        else:
            result = name
        return result

Then, we create the dataset:

In [111]:
dataset = MovieDataset(tokenizer, "movie: ", movie_list, max_length)
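
To confirm the preprocessing, you can decode a single dataset item back into text; it should show the designator, the title, and EOS padding up to the fixed length (an optional check):

print(dataset[0])                    # tensor of token ids, padded with the EOS id
print(tokenizer.decode(dataset[0]))  # e.g. "movie: Toy Story" followed by <|endoftext|> padding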

Using a batch_size of 32, we create the dataloader:

In [112]:
dataloader = DataLoader(dataset, batch_size=32, shuffle=True, drop_last=True)

GPT-2 is capable of several tasks, including summarization, generation, and translation. To train it for generation, we use the input itself as the labels:

In [114]:
def train(model, optimizer, dl, epochs):    
    for epoch in range(epochs):
        for idx, batch in enumerate(dl):
            with torch.set_grad_enabled(True):
                optimizer.zero_grad()
                batch = batch.to(device)
                output = model(batch, labels=batch)
                loss = output[0]
                loss.backward()
                optimizer.step()
                if idx % 50 == 0:
                    print("loss: %f, %d"%(loss, idx))

When training a language model, it is easy to overfit, because there is no clear evaluation metric for the generated text. With most tasks, one can hold out a validation set (or use cross-validation) to detect overfitting. For our purposes, we simply train for 2 epochs.
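
If you do want a rough signal of overfitting, one simple option (not used in this tutorial, and a plain held-out split rather than full cross-validation) is to reserve part of the movie list for validation and compare its loss between epochs. A minimal sketch:

split = int(0.9 * len(movie_list))
val_dataset = MovieDataset(tokenizer, "movie: ", movie_list[split:], max_length)
val_dataloader = DataLoader(val_dataset, batch_size=32)

def eval_loss(model, dl):
    # Average language-modeling loss over the held-out titles
    model.eval()
    losses = []
    with torch.no_grad():
        for batch in dl:
            batch = batch.to(device)
            losses.append(model(batch, labels=batch)[0].item())
    model.train()
    return sum(losses) / len(losses)

The training dataset would then be built from movie_list[:split] instead of the full list.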

In [123]:
train(model=model, optimizer=optimizer, dl=dataloader, epochs=2)
loss: 9.313371, 0
loss: 2.283597, 50
loss: 1.748692, 100
loss: 2.109853, 150
loss: 1.902950, 200
loss: 2.051265, 250
loss: 2.213011, 300
loss: 1.370941, 0
loss: 1.346577, 50
loss: 1.278894, 100
loss: 1.373716, 150
loss: 1.419072, 200
loss: 1.505586, 250
loss: 1.493220, 300

The loss decreased consistently over both epochs, which indicates that the model was learning.

Movie Name Generation

To verify the results, we generate 20 movie names that do not already exist in the movie list.

The generation methodology is as follows:

  1. The task designator is initially fed into the model.
  2. A token is sampled from the model's top-k choices. A common question is why not always take the highest-ranked choice; the simple answer is that introducing randomness helps the model produce varied outputs. There are several sampling methods in the literature, such as top-k and nucleus sampling. In this example, we use top-k, where k = 9. k is a hyperparameter that can be tuned to improve the results, so feel free to play around with it to see the effects.
  3. The choice is added to the sequence and the current sequence is fed to the model.
  4. Steps 2 and 3 are repeated until either max_len is reached or the EOS token is generated.
In [116]:
def topk(probs, n=9):
    # The scores are initially softmaxed to convert to probabilities
    probs = torch.softmax(probs, dim= -1)
    
    # PyTorch has its own topk method, which we use here
    tokensProb, topIx = torch.topk(probs, k=n)
    
    # The new selection pool (9 choices) is normalized
    tokensProb = tokensProb / torch.sum(tokensProb)

    # Send to CPU for numpy handling
    tokensProb = tokensProb.cpu().detach().numpy()

    # Make a random choice from the pool based on the new prob distribution
    choice = np.random.choice(n, 1, p = tokensProb)
    tokenId = topIx[choice][0]

    return int(tokenId)
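
As a design note, the round trip through NumPy is not strictly necessary: torch.multinomial can sample directly from the renormalized top-k pool. A roughly equivalent sketch:

def topk_torch(logits, n=9):
    probs = torch.softmax(logits, dim=-1)
    tokens_prob, top_ix = torch.topk(probs, k=n)
    tokens_prob = tokens_prob / tokens_prob.sum()            # renormalize the top-k pool
    choice = torch.multinomial(tokens_prob, num_samples=1)   # sample one index from the pool
    return int(top_ix[choice])
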
In [125]:
def model_infer(model, tokenizer, init_token, max_length=10):
    # Preprocess the init token (task designator)
    init_id = tokenizer.encode(init_token)
    result = init_id
    init_input = torch.tensor(init_id).unsqueeze(0).to(device)

    with torch.set_grad_enabled(False):
        # Feed the init token to the model
        output = model(init_input)

        # Flatten the logits at the final time step
        logits = output.logits[0,-1]

        # Make a top-k choice and append to the result
        result.append(topk(logits))

        # For max_length times:
        for i in range(max_length):
            # Feed the current sequence to the model and make a choice
            input = torch.tensor(result).unsqueeze(0).to(device)
            output = model(input)
            logits = output.logits[0,-1]
            res_id = topk(logits)

            # If the chosen token is EOS, return the result
            if res_id == tokenizer.eos_token_id:
                return tokenizer.decode(result)
            else: # Append to the sequence 
                result.append(res_id)
    # If no EOS token is generated, return the result after max_length steps
    return tokenizer.decode(result)

Generating 20 unique movie names:

In [131]:
results = set()
while len(results) < 20:
    name = model_infer(model, tokenizer, "movie:").replace("movie: ", "").strip()
    if name not in movie_list:
        results.add(name)
        print(name)
The Final Days
American Psycho II
The Last Christmas
Last Kiss, The (Koumashi-
American Pie Presents: The Last Christmas
American Psycho II
My Best Fiend
The Final Cut
Last Summer
Last Night's Night
I Love You, I Love You
My Best Fiend
American Pie Presents: American Pie 2
I'm Sorry I Feel That Way
American Pie Presents: The Next Door, The (
Last Summer, The
I'll Do Anything... Any...
My Girl the Hero
My Best Fiend (La vie en f
The Man with the Golden Arm
The Last Train Home
I'm Here To Help

As shown, most of the generated names look like plausible movie titles, which suggests that the model learned the structure of movie names.
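
As a side note, instead of the manual sampling loop above, the transformers library also provides a built-in model.generate() method that supports top-k sampling. A minimal sketch with parameters chosen to roughly mirror our loop:

input_ids = torch.tensor(tokenizer.encode("movie: ")).unsqueeze(0).to(device)
sample = model.generate(
    input_ids,
    do_sample=True,                        # sample instead of greedy decoding
    top_k=9,                               # same k as in our topk() function
    max_length=max_length + extra_length,  # cap the total sequence length
    pad_token_id=tokenizer.eos_token_id,   # GPT-2 has no pad token; reuse EOS
)
print(tokenizer.decode(sample[0], skip_special_tokens=True))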

Model Saving and Loading

PyTorch makes it very easy to save the model:

In [ ]:
torch.save(model.state_dict(), "movie_gpt.pth")

And if you need to load the model later for quick inference without retraining:

In [ ]:
model.load_state_dict(torch.load("movie_gpt.pth"))
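
Note that load_state_dict only restores the weights, so in a fresh session the architecture has to be instantiated first. A minimal loading sketch could look like this:

import torch
from transformers import AutoTokenizer, AutoModelWithLMHead

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelWithLMHead.from_pretrained("gpt2")
model.load_state_dict(torch.load("movie_gpt.pth", map_location=device))
model = model.to(device)
model.eval()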

In this tutorial, we learned how to fine-tune a Huggingface GPT-2 model to perform movie name generation. The same methodology can be applied to any language model available at https://huggingface.co/models.