Author: Michał Brzozowski
Models based on the transformer architecture have become the state-of-the-art solution in NLP. Indeed, "transformer" is what the letter "T" stands for in the names of the famous BERT, GPT-3 and the massively popular ChatGPT. A common obstacle when applying these models is the constraint on the input length. For example, the BERT model cannot process texts longer than 512 tokens (roughly speaking, one token corresponds to one word).
A method to overcome this issue was suggested by Devlin (one of the authors of BERT) in a GitHub discussion. In this article, we will describe in detail how to modify the process of fine-tuning a pre-trained BERT model for the classification task. The code is available as open source here.
Overview of BERT classification
Let us start with the description of the three stages in the life of the BERT classifier model:
- Model pre-training.
- Model fine-tuning.
- Using the model for predictions.
In the first stage, BERT is pre-trained on a large corpus of data in a self-supervised fashion. That is, the training data consists of raw texts only, without human labelling. The model is trained with two objectives: guessing masked words in a sentence and predicting whether one sentence follows another.
Observe that both tasks are concerned only with individual sentences (or pairs of sentences), not with entire documents. Hence there is no need to truncate longer texts. The book saga "In Search of Lost Time" can be used during pre-training despite having more than 1 200 000 words; it is simply processed sentence by sentence.
We can load the pre-trained base BERT model using the transformers library:
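A minimal sketch of this step using the transformers library (the exact loading call in the article's repository may differ; we assume a binary-classification head here, which transformers will report as freshly initialized):

```python
from transformers import AutoModelForSequenceClassification

# Pre-trained BERT weights plus a newly initialized binary-classification
# head; transformers warns that this head still needs to be fine-tuned.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
```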
The warning informs us that the downloaded model must be fine-tuned on the downstream task (in our case, this will be a binary classification of sequences). This step will be described in the following subsection.
We use a similar approach to get the tokenizer:
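For example:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# The tokenizer carries the 512-token limit discussed below.
print(tokenizer.model_max_length)  # 512
```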
Note the parameter model_max_length=512 listed above. It is the main obstacle we will work around in this article. In fact, applying this model without modification simply truncates every text to 512 tokens. All the information and context in the rest of the document is discarded during the fine-tuning and prediction stages.
The most straightforward and natural idea is to divide the text into smaller chunks and feed them to the model separately. This is our strategy; however, as we will see, the devil is in the details.
Clearly, after reading many books and the entire Wikipedia, the downloaded pre-trained model is knowledgeable. However, its knowledge is very general.
Assume we want to train the model to recognize whether a movie review is positive or negative based on its text, ignoring the model's vast and intricate wisdom of quantum mechanics and Proust. In other words, we need to adapt the model to our specific task of binary sequence classification.
To do this, we use a supervised learning approach. More precisely, we prepare a training set of reviews manually labelled as positive or negative and then feed it to the model with an additional classification layer on top.
Modifying the fine-tuning step to look at the entire text, and not just the first 512 tokens, turned out to be non-trivial and will be described in detail later.
The last stage is applying the trained model to the new data and obtaining classifications.
Using the fine-tuned classifier on longer texts
It will be instructive first to describe the more straightforward process of modifying the already fine-tuned BERT classifier to apply it to longer texts. This section will be mainly based on the excellent tutorial article: How to Apply Transformers to Any Length of Text.
The main difference between the two approaches is that we allow the chunks of text to overlap.
Finding a long review
In what follows, we will consider the well-known dataset of movie reviews from IMDB. We are interested in classifying them based on their sentiment, that is, whether they are positive or negative.
After basic exploration, we load the dataset from the Hugging Face hub and find a very long review of David Lynch’s Mulholland Drive:
As we can see, the review is rather elaborate and consists of 2278 words. We want to split it into chunks which are small enough to fit into the 512-token limit of the BERT input.
Loading the already fine-tuned BERT classifier
In this section, we will assume that we have an already fine-tuned BERT classifier. Let us download one trained on the IMDB dataset from the Hugging Face hub:
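For instance (the checkpoint name below is one publicly available IMDB classifier, not necessarily the one used in the article):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# A BERT classifier already fine-tuned on IMDB sentiment
# (assumption: any similar two-label checkpoint works the same way).
model_name = "textattack/bert-base-uncased-imdb"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```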
Tokenization of the whole text
Now we want to tokenize the entire review:
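A sketch of this call, using a synthetic long text as a stand-in for the actual review:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
long_review = "A truly hypnotic and unforgettable film. " * 500  # stand-in text

tokens = tokenizer(
    long_review,
    add_special_tokens=False,  # [CLS]/[SEP] are added manually after splitting
    truncation=False,          # keep every token
    return_tensors="pt",       # return PyTorch tensors
)
print(tokens["input_ids"].shape[1])  # far more than 512 tokens
```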
Observe the following:
- We set add_special_tokens to False, because we will add special tokens at the beginning and the end manually after the splitting procedure.
- We set truncation to False, because we do not want to throw away any part of the text.
- We set return_tensors to “pt” to get the result in the form of a PyTorch tensor.
The warning informs us that the tokenized sequence is too long (after tokenization we obtained 3155 tokens, which is significantly more than the number of words). If we just put such a tensor into the model, it will not work.
Indeed, let us try it:
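A sketch of the failure, using a small BERT checkpoint (prajjwal1/bert-tiny, an assumption chosen only to keep the example lightweight; any BERT with 512 position embeddings behaves the same):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# A tiny BERT checkpoint as a lightweight stand-in for the full classifier.
tokenizer = AutoTokenizer.from_pretrained("prajjwal1/bert-tiny")
model = AutoModelForSequenceClassification.from_pretrained("prajjwal1/bert-tiny")

tokens = tokenizer("a very long review " * 800, add_special_tokens=False,
                   truncation=False, return_tensors="pt")

try:
    with torch.no_grad():
        model(**tokens)
except Exception as err:
    # The position embeddings cover only 512 positions, so longer
    # inputs fail inside the embedding layer.
    print(type(err).__name__)
```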
What are the tokens?
Let us now take a look at what exactly these tokens are.
As we can see, the tokenized text is essentially a Python dictionary with the following keys:
- input_ids — this part is crucial — it encodes the words as integers. It can also contain some special tokens indicating the beginning (value 101) and the end of the text (value 102). We will add them manually after the splitting procedure.
- token_type_ids — this binary tensor is used to separate question and answer in some specific applications of BERT. Because we are interested only in the classification task, we can ignore this part.
- attention_mask — this binary tensor tells the model which positions contain real tokens (value 1) and which contain padding (value 0). Later we will manually add zeroes there to ensure that all chunks have precisely the demanded size of 512.
Splitting the tokens
To fit the tokens to the model, we need to split them into chunks of at most 512 tokens. However, we also need to add two special tokens at the beginning and the end of each chunk; hence the effective upper bound is 510.
Three parameters will determine the splitting procedure: chunk_size, stride and minimal_chunk_size with the following meaning:
- The parameter chunk_size defines the length of each chunk. Note that splitting the tokens into equal parts might be impossible, so chunks at the end might be smaller than chunk_size.
- The parameter stride modifies the amount of movement over the token list (this is analogous to the meaning of this parameter in the context of convolutional neural networks). In other words, this allows chunks to overlap.
- The parameter minimal_chunk_size specifies the minimal allowed size of a chunk. As we have already mentioned, after splitting the token list we may obtain some leftover parts at the end which are too small to contain any meaningful information; chunks shorter than minimal_chunk_size are discarded.
For clarity, we will demonstrate this procedure with a few examples:
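One possible implementation of such a splitting function (a sketch; the helper in the article's open-source repository may differ in details), demonstrated on a toy token list:

```python
def split_tokens(token_ids, chunk_size, stride, minimal_chunk_size):
    """Split a flat list of token ids into (possibly overlapping) chunks."""
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunk = token_ids[start:start + chunk_size]
        if len(chunk) >= minimal_chunk_size:  # drop too-small leftovers
            chunks.append(chunk)
    return chunks

tokens = list(range(11))

# stride == chunk_size: non-overlapping chunks, small leftover kept.
print(split_tokens(tokens, chunk_size=5, stride=5, minimal_chunk_size=1))

# stride < chunk_size: overlapping chunks; leftovers shorter than 3 dropped.
print(split_tokens(tokens, chunk_size=5, stride=3, minimal_chunk_size=3))
```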
Adding special tokens
After splitting into smaller chunks, we must add special tokens at the beginning and the end:
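In BERT's uncased vocabulary, [CLS] and [SEP] have the ids 101 and 102 mentioned earlier. A sketch of this step in plain PyTorch:

```python
import torch

CLS_ID, SEP_ID = 101, 102  # BERT's [CLS] and [SEP] token ids

def add_special_tokens(chunk):
    """Prepend [CLS] and append [SEP] to a 1-D tensor of token ids."""
    return torch.cat([torch.tensor([CLS_ID]), chunk, torch.tensor([SEP_ID])])

chunk = torch.tensor([2023, 3185, 2001])
print(add_special_tokens(chunk).tolist())  # [101, 2023, 3185, 2001, 102]
```

The attention mask of each chunk grows by two ones accordingly, since the special tokens are real tokens the model should attend to.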
Next, we must add some padding tokens to ensure that all chunks have the size of precisely 512:
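A sketch of the padding step (the 0 used below is BERT's [PAD] token id; the mask gets zeroes so the model ignores the padding):

```python
import torch

PAD_ID = 0  # BERT's [PAD] token id

def pad_chunk(input_ids, attention_mask, target_length=512):
    """Right-pad ids with [PAD] and the mask with zeroes up to target_length."""
    pad_len = target_length - len(input_ids)
    if pad_len > 0:
        input_ids = torch.cat([input_ids, torch.full((pad_len,), PAD_ID)])
        attention_mask = torch.cat([attention_mask, torch.zeros(pad_len)])
    return input_ids, attention_mask
```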
Stacking the tensors
After applying this procedure to a single text, input_ids is a list of K tensors of size 512, where K is the number of chunks. To feed this into the BERT model, we must stack these K tensors into one tensor of size K x 512 and ensure that the tensor values have the appropriate integer type:
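For instance, with two already-padded chunks:

```python
import torch

# Two padded chunks of length 512 each (placeholder values).
input_id_chunks = [torch.zeros(512), torch.ones(512)]
mask_chunks = [torch.ones(512), torch.ones(512)]

# Stack into K x 512 tensors and cast to the integer types BERT expects.
input_ids = torch.stack(input_id_chunks).long()
attention_mask = torch.stack(mask_chunks).int()
print(input_ids.shape)  # torch.Size([2, 512])
```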
Wrapping it into one function
For convenience, we can wrap all the previous steps into a single function:
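A possible shape of such a function, combining the splitting, special-token, padding and stacking steps (a sketch; the function in the article's repository may differ):

```python
import torch

def transform_for_bert(token_ids, chunk_size=510, stride=510,
                       minimal_chunk_size=1):
    """Turn a flat list of token ids into stacked 512-wide model inputs."""
    id_chunks, mask_chunks = [], []
    for start in range(0, len(token_ids), stride):
        chunk = token_ids[start:start + chunk_size]
        if len(chunk) < minimal_chunk_size:
            continue  # discard too-small leftover chunks
        ids = [101] + chunk + [102]        # add [CLS] and [SEP]
        mask = [1] * len(ids)
        pad = 512 - len(ids)               # pad up to exactly 512
        id_chunks.append(torch.tensor(ids + [0] * pad))
        mask_chunks.append(torch.tensor(mask + [0] * pad))
    return {
        "input_ids": torch.stack(id_chunks).long(),
        "attention_mask": torch.stack(mask_chunks).int(),
    }
```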
Procedure for the selected long review
Let us now combine all the mentioned steps for our example long review. We will use the parameters chunk_size=510, stride=510 and minimal_chunk_size=1, which corresponds to simply splitting into non-overlapping parts:
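We can verify the resulting chunk count with a quick computation on the token length reported earlier (3155 tokens):

```python
n_tokens, chunk_size, stride = 3155, 510, 510

chunk_lengths = [min(chunk_size, n_tokens - start)
                 for start in range(0, n_tokens, stride)]
print(len(chunk_lengths))  # 7
print(chunk_lengths)       # six full chunks of 510 tokens plus one of 95
```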
Hence the review was divided into 7 chunks.
Using the fine-tuned model on the prepared data
The prepared data is ready to plug into our fine-tuned classifier:
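In the real pipeline the logits come from a call like model(input_ids=..., attention_mask=...).logits under torch.no_grad(); here we sketch only the softmax-and-slice step, on illustrative logit values (we assume column 1 corresponds to the "positive" class):

```python
import torch

# Illustrative per-chunk logits as returned by the classifier (K = 3 chunks).
logits = torch.tensor([[-4.0, 4.2], [-3.9, 4.1], [0.1, 0.26]])

# Softmax over the two classes, then slice out P(positive) per chunk.
probabilities = torch.nn.functional.softmax(logits, dim=1)
positive_probs = probabilities[:, 1]
print(positive_probs)
```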
Let us summarize:
- The fine-tuned model returned logit values for each chunk.
- We applied the softmax function and slicing to get the probability that the review is positive.
- We obtained the list of probabilities for each chunk: [0.9997, 0.9996, 0.5399, 0.9994, 0.9995, 0.9975, 0.9987]
- Finally, we can apply some pooling function (mean or maximum) to obtain one aggregated probability for the entire review.
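The pooling step itself is a one-liner; using the probabilities listed above:

```python
# Per-chunk positive-class probabilities from the example review.
probs = [0.9997, 0.9996, 0.5399, 0.9994, 0.9995, 0.9975, 0.9987]

mean_pooled = sum(probs) / len(probs)  # every chunk influences the result
max_pooled = max(probs)                # the most confident chunk wins
print(round(mean_pooled, 4), max_pooled)  # 0.9335 0.9997
```

Which pooling function is preferable depends on the task: the mean is more robust to a single outlier chunk, while the maximum is useful when one strongly positive passage should decide the label.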
In this part, I presented how to use an already fine-tuned BERT classifier on arbitrarily long texts. However, what should we do when we want to fine-tune it ourselves? I will answer this question in Part 2 of my series, which will be published soon.