
Fine-tuning BERT model for arbitrarily long texts, Part 1
Author: Michał Brzozowski

Models based on the transformer architecture have become the state-of-the-art solution in NLP. The word “transformer” is indeed what the letter “T” stands for in the names of the famous BERT, GPT-3 and the massively popular ChatGPT. A common obstacle when applying these models is the constraint on input length. For example, the BERT model cannot process texts longer than 512 tokens (roughly speaking, one token corresponds to one word). A method to overcome this issue was proposed by Jacob Devlin (one of the authors of BERT) in a discussion thread. In this