Language Models, Pre-training and Transformers

CellStrat
3 min read · Jan 16, 2022

Natural Language Processing (NLP) is the domain of Computer Science concerned with enabling machines to understand and learn the natural language that we humans speak. Language Models (LMs) are one of the subdomains within NLP: they use statistical and probabilistic techniques to predict sequences of words or to generate text. As in many other areas, the earlier techniques used in LMs are being replaced by newer and improved machine learning methods. Today, deep learning architectures can capture context in natural language, which has made remarkable results possible in the form of deep generative language models. Combined with transfer learning, these models gave rise to pre-trained models that can perform language tasks such as summarization, translation, and question answering, even in a zero-shot setting.

The advent of pre-trained models was also rooted in the ideas of deep learning, and it helped researchers achieve unparalleled results. But all of this comes at an expense: even though these models show excellent performance, they require intensive training that demands large amounts of data, computation time, and money. Word embeddings were the first kind of pre-trained representations used in NLP; they are fixed and context-independent in nature. Applying transfer learning to NLP made it possible to pre-train networks with a self-supervised objective. This started with the less complex RNN-based architectures and gradually took off from there; ULMFiT and ELMo are examples of this RNN-based generation of pre-trained models. Then came the next generation of pre-trained models, each outperforming its predecessors: OpenAI's GPT, BERT, GPT-2, XLNet, and T5. Fine-tuning these models on task-specific objectives has shown even better outputs. Thus they became the means to cheaper, faster, and easier practical application of Natural Language Processing (NLP).
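As a rough illustration of how accessible these pre-trained models have become, the sketch below loads publicly released checkpoints through the Hugging Face transformers library (an assumption on my part; the article itself does not name a toolkit) and runs two of the tasks mentioned above with a few lines of code. The models chosen by the default pipelines are illustrative only.

```python
# A minimal sketch (not from the article): using pre-trained checkpoints
# via the Hugging Face `transformers` library. Default pipeline models
# are downloaded on first use and are illustrative choices only.
from transformers import pipeline

# Summarization with a pre-trained encoder-decoder model.
summarizer = pipeline("summarization")
print(summarizer(
    "Transformers avoid recurrence and rely entirely on attention to draw "
    "dependencies between input and output, which makes training highly "
    "parallelizable on modern hardware.",
    max_length=30, min_length=10,
)[0]["summary_text"])

# Extractive question answering with a pre-trained BERT-style model.
qa = pipeline("question-answering")
print(qa(
    question="What do transformers rely on instead of recurrence?",
    context="The transformer architecture avoids recurrence and relies "
            "entirely on an attention mechanism.",
)["answer"])
```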

For language modelling problems (put simply, predicting the next word) and for translation systems, LSTM and GRU architectures together with the attention mechanism were the reigning approaches in the past. These architectures take a sentence and process each word sequentially, so as the sentence length increases, so does the overall runtime. The transformer architecture, first introduced in the paper “Attention is All You Need”, does away with recurrence altogether and instead relies entirely on an attention mechanism to draw dependencies between input and output. A transformer is a stack of encoder and decoder layers. Taking English-to-German translation as an example, the encoder layers encode the English sentence into a numerical representation with the help of the attention mechanism, while the decoder uses the encoded information from the encoder layers to produce the German translation.
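To make the encoder-decoder structure concrete, here is a minimal PyTorch sketch (my own illustration, not code from the paper) that wires up a small transformer for a toy English-to-German setup. The vocabulary sizes and dimensions are arbitrary assumptions, and positional encodings (discussed next) are omitted for brevity.

```python
import torch
import torch.nn as nn

# Toy dimensions -- purely illustrative assumptions.
SRC_VOCAB, TGT_VOCAB, D_MODEL = 1000, 1200, 512

class ToyTranslationModel(nn.Module):
    """Embeds source (e.g. English) and target (e.g. German) token ids,
    passes them through a stack of encoder and decoder layers, and
    projects decoder outputs back onto the target vocabulary."""
    def __init__(self):
        super().__init__()
        self.src_embed = nn.Embedding(SRC_VOCAB, D_MODEL)
        self.tgt_embed = nn.Embedding(TGT_VOCAB, D_MODEL)
        # 6 encoder + 6 decoder layers, 8 attention heads, as in the paper.
        self.transformer = nn.Transformer(
            d_model=D_MODEL, nhead=8,
            num_encoder_layers=6, num_decoder_layers=6,
            batch_first=True,
        )
        self.out_head = nn.Linear(D_MODEL, TGT_VOCAB)  # task-specific head

    def forward(self, src_ids, tgt_ids):
        src = self.src_embed(src_ids)   # real models also add positional encodings here
        tgt = self.tgt_embed(tgt_ids)
        # Causal mask so the decoder cannot peek at future target tokens.
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        decoded = self.transformer(src, tgt, tgt_mask=tgt_mask)
        return self.out_head(decoded)   # logits over the German vocabulary

model = ToyTranslationModel()
src = torch.randint(0, SRC_VOCAB, (2, 7))   # batch of 2 "English" sentences
tgt = torch.randint(0, TGT_VOCAB, (2, 5))   # shifted "German" targets
print(model(src, tgt).shape)                # torch.Size([2, 5, 1200])
```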

Consider the encoder: it contains six encoder layers stacked on top of each other. Each layer has two primary components, a multi-head self-attention layer and a position-wise fully connected feed-forward network. The attention layer runs several attention heads in parallel, which is why it is called multi-head attention. Positional encodings help the model make use of the order of the sequence by providing information about the relative or absolute position of the tokens. Both the encoder and the decoder use residual (skip) connections, which help information and gradients flow through a deep neural network. The decoder likewise contains six decoder layers, and each layer in the stack consists of three sub-layers: a masked multi-head self-attention layer, a multi-head attention layer over the encoder output, and a position-wise fully connected feed-forward network. Masking hides future positions from the decoder, so the network can never look at subsequent words while predicting the current one. Finally, a task-specific output head is added on top of the decoder output by stacking a linear layer and a softmax, giving a probability distribution over all the words in the German vocabulary.
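The sketch below (again my own illustration, with arbitrarily chosen sizes) spells out two of the pieces described above: the sinusoidal positional encodings, and scaled dot-product self-attention with the causal mask that the decoder's masked attention layer uses to hide subsequent words.

```python
import math
import torch

def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings from "Attention is All You Need": even columns
    use sine, odd columns use cosine, at geometrically spaced frequencies."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (L, 1)
    idx = torch.arange(0, d_model, 2, dtype=torch.float32)          # (D/2,)
    angle = pos / torch.pow(10000.0, idx / d_model)                 # (L, D/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

def scaled_dot_product_attention(q, k, v, causal=False):
    """softmax(QK^T / sqrt(d_k)) V, optionally masking future positions."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)               # (L, L)
    if causal:
        mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Toy usage: 5 tokens, model width 16 (arbitrary illustrative numbers).
x = torch.randn(5, 16) + positional_encoding(5, 16)        # add position info
out = scaled_dot_product_attention(x, x, x, causal=True)   # masked self-attention
print(out.shape)  # torch.Size([5, 16])
```

In a full multi-head layer this attention is computed in parallel over several learned projections of the inputs and the results are concatenated, but the single-head version above is the core operation each head performs.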
