Building GPT From First Principles: Code and Intuition
Author(s): Akhil Shekkari

Originally published on Towards AI.

Figure 0

The main goal of this blog post is to understand each component inside GPT with intuition and to be able to code it in plain PyTorch. Please have a look at the two figures below. Our implementation will closely follow Figure 1, and I will take lots of ideas and concepts from Figure 2 (taken from Anthropic's paper at https://transformercircuits.pub/2021/framework/index.html ). We will use Figure 2 for intuition and understanding.

Figure 1

Figure 2

For every component, I will first go over the required theory. This is important because we have to understand why that particular component/concept is used. Then I will go over the code.

Let's look at all the individual components of a Transformer:

1. Residual Stream (also known as skip connections)
2. Embedding Matrix
3. Layer Normalization
4. Positional Embedding
5. Self-Attention Mechanism (with causal masking)
6. Multi-Layer Perceptron
7. UnEmbedding Matrix

Before looking at the residual stream, it is always good to approach concepts with an example in mind. One of the main reasons people give for finding this hard to code is the problem of input and output dimensions: before and after every transformation, we should know how the vector and its dimensions change.

Let the example sentence be "Messi is the greatest of all time". This sentence has 7 tokens (1 word = 1 token, for simplicity). Let us say each token is represented in 50 dimensions; we call this d_model. The batch size is the number of examples we feed to the model at a given point in time; since we are working with a single demo example, let us take a batch_size of 1. Let us assume the maximum length of any sentence in our dataset is at most 10 tokens; we call this seq_len. Let the total number of tokens in our vocabulary be 5000; we call this d_vocab.

So the configuration of our toy example is:

d_model = 50
d_vocab = 5000
seq_len = 10
batch_size = 1

Note: the above config is only for the toy example. In our code, we will work with actual GPT-2 level configs (see below).

Let's define our config. Note: there are a lot of hyperparameters here that you haven't seen yet, but don't worry, we will cover all of them in later parts of the blog. We will use torch (imported as t), einops, and the Float/Int tensor annotations from jaxtyping throughout the code, so let's import them here as well.

```python
from dataclasses import dataclass

import torch as t
from torch import nn, Tensor
import einops
from jaxtyping import Float, Int

device = t.device("cuda" if t.cuda.is_available() else "cpu")


## let's define all the parameters of our model
@dataclass
class Config:
    d_model: int = 768
    debug: bool = True
    layer_norm_eps: float = 1e-5
    d_vocab: int = 50257
    init_range: float = 0.02
    n_ctx: int = 1024
    d_head: int = 64
    d_mlp: int = 3072
    n_heads: int = 12
    n_layers: int = 12


cfg = Config()
print(cfg)
```

Note that @dataclass simplifies a lot of stuff for us. With @dataclass we get a constructor and a clean output representation when we print the parameters of the class, with no need for huge boilerplate code. Without it, we would have to write the same class like this:

```python
class Config:
    def __init__(self, d_model=768, d_vocab=50257):
        self.d_model = d_model
        self.d_vocab = d_vocab

    def __repr__(self):
        return f"Config(d_model={self.d_model}, d_vocab={self.d_vocab})"
```

Some common implementation details for all the components:

1. For every component, we define a class.
2. Every class needs to subclass nn.Module. This is important for many reasons, like storing model parameters, using helper functions, etc. You can read more about this at https://pytorch.org/tutorials/beginner/nn_tutorial.html
3. super().__init__() makes sure the constructor of nn.Module gets called. https://www.geeksforgeeks.org/python-super-with-__init__-method/
4. We then pass the config object to each class to set the values of its parameters as required.
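To make the toy dimensions concrete, here is a small sketch that builds a hypothetical toy config mirroring the Messi example next to the default GPT-2 config. The field values here are my own illustrative choices (for example, 5 heads of d_head = 10); the dataclass makes overriding them trivial.

```python
# Hypothetical toy config matching the running example:
# d_model = 50, d_vocab = 5000, n_ctx (max seq_len) = 10, and 5 heads of d_head = 10
# (d_head = d_model / n_heads, as discussed in the attention section below).
toy_cfg = Config(d_model=50, d_vocab=5000, n_ctx=10, d_head=10, n_heads=5, d_mlp=200)

gpt2_cfg = Config()  # the defaults correspond to GPT-2 small

print(toy_cfg)   # Config(d_model=50, ..., n_ctx=10, d_head=10, d_mlp=200, n_heads=5, ...)
print(gpt2_cfg)  # Config(d_model=768, ..., n_ctx=1024, d_head=64, d_mlp=3072, n_heads=12, ...)
```

The later shape-check sketches in this post reuse this toy_cfg.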
What is an Embedding Matrix?

This is just a plain lookup table: you look up the embedding vector of a particular token.

Questions to ask before coding:

Q. What is the input to the Embedding Matrix?
A. Int[Tensor, 'batch position']. Here [batch, position] are the dimensions; position refers to the token position.

Q. What is the output of the Embedding Matrix?
A. Float[Tensor, 'batch seq_len d_model']. It returns the corresponding embedding vectors in this shape.

```python
class Embed(nn.Module):
    def __init__(self, cfg: Config):
        super().__init__()
        self.cfg = cfg
        self.W_E = nn.Parameter(t.empty((cfg.d_vocab, cfg.d_model)))
        nn.init.normal_(self.W_E, std=self.cfg.init_range)

    def forward(self, tokens: Int[Tensor, 'batch position']) -> Float[Tensor, 'batch seq_len d_model']:
        return self.W_E[tokens]
```

Every trainable parameter in a neural network needs to be tracked and updated based on gradients. PyTorch simplifies this with nn.Parameter(): any tensor wrapped in nn.Parameter is automatically registered as a trainable weight. nn.init.normal_ fills the tensor in-place with values drawn from a normal (Gaussian) distribution.

Our embedding matrix has shape (d_vocab, d_model). Intuitively, we can read it as: for every token in the vocabulary, the corresponding row of the matrix is its embedding vector.

What is a Positional Embedding?

This can also be thought of as a lookup table, but instead of token IDs, we index it with positions. A positional embedding is a learned vector assigned to each position (just like a token embedding). Think of it as the model learning that certain positions hold certain tokens and relationships between them, which is useful for the attention computations downstream.

Small clarification: in the original paper "Attention Is All You Need", the authors used positional encoding. That is not learned; it is a fixed function (based on sine and cosine) that you add to the input embeddings. In our GPT, we use a learned positional embedding.

More intuition: for the example "Akhil plays football.", the positional embeddings can evolve such that:

pos[0] → helps identify "Akhil" as the subject
pos[1] → contributes to verb detection
pos[2] → contributes to object prediction

Questions to ask before coding:

Q. What is the input to the Positional Embedding?
A. Int[Tensor, 'batch position']. Here [batch, position] are the dimensions; position refers to the token position.

Q. What is the output of the Positional Embedding?
A. Float[Tensor, 'batch seq_len d_model']. It returns the corresponding positional embedding vectors in this shape.

```python
class PosEmbed(nn.Module):
    def __init__(self, cfg: Config):
        super().__init__()
        self.cfg = cfg
        self.W_pos = nn.Parameter(t.empty((cfg.n_ctx, cfg.d_model)))
        nn.init.normal_(self.W_pos, std=self.cfg.init_range)

    def forward(self, tokens: Int[Tensor, "batch position"]) -> Float[Tensor, "batch position d_model"]:
        batch, seq_len = tokens.shape
        return einops.repeat(self.W_pos[:seq_len], "seq d_model -> batch seq d_model", batch=batch)
```

Here n_ctx is the context length of our model. That means at any given time, we will have at most n_ctx tokens to assign positions to. In the forward pass, we slice out the relevant position vectors from our learned embedding matrix and repeat them across the batch. This gives us a tensor of shape [batch, seq_len, d_model], so each token gets a learnable embedding for its position, which we can then add to the token embeddings.
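Here is a quick shape check (a sketch reusing the hypothetical toy_cfg from above; the token IDs are random placeholders, not the output of a real tokenizer) showing how token and positional embeddings combine before entering the residual stream:

```python
toy_cfg = Config(d_model=50, d_vocab=5000, n_ctx=10, d_head=10, n_heads=5, d_mlp=200)  # same toy config as before

embed = Embed(toy_cfg)
pos_embed = PosEmbed(toy_cfg)

# A batch of 1 sentence with 7 tokens, standing in for "Messi is the greatest of all time"
tokens = t.randint(0, toy_cfg.d_vocab, (1, 7))

tok_emb = embed(tokens)       # [1, 7, 50] -> [batch, seq_len, d_model]
pos_emb = pos_embed(tokens)   # [1, 7, 50] -> one learned vector per position, repeated over the batch
residual = tok_emb + pos_emb  # this sum is what enters the residual stream

print(tok_emb.shape, pos_emb.shape, residual.shape)
```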
What is a Residual Stream?

It is the straight path from Embed to UnEmbed in Figure 2. You can think of it as the central pathway of a Transformer. Information inside this stream flows forward; by forward, I mean from the embedding stage to the unembedding stage. The tokens are represented by their corresponding embeddings via the embedding table, and these embeddings then enter the residual stream.

We represent the example "Messi is the greatest of all time" inside the residual stream with the following dimensions:

[batch_size, seq_len, d_model] ==> [1, 10, 50]

(each token is a 50-dimensional vector and we have 7 tokens; we pad the remaining 3 positions with zeros to maintain the dimensions.)

Next steps:

In general, the input first gets sent to LayerNorm. Attention heads read information from this residual stream; they are responsible for moving information between tokens, based on the attention matrix (more on this in the attention section). The MLP performs explicit read and write operations (new vectors) onto this residual stream, and it can also delete information from the residual stream (more on this in later sections).

What is Layer Normalization?

The fundamental reason we normalize is to keep the data flowing nicely through the network, without gradients vanishing or exploding.

Figure 5

From Figure 5, we can see two learnable parameters: gamma (a scaling factor) and beta (a shifting factor). We normalize the values inside each embedding vector:

y = ((x − E[x]) / sqrt(Var[x] + ε)) * γ + β

Here E[x] is the mean and Var[x] is the variance along the embedding dimension. The γ and β parameters give the model a little room to scale and shift as training progresses, and the small epsilon avoids division-by-zero errors.

Questions to ask:

Q. What does LayerNorm take as input?
A. The current residual stream (for example, the residual after attention for the second LayerNorm in a block): [batch posn d_model]

Q. What does it return?
A. It just normalizes the existing values of each embedding vector; it doesn't add anything new. So it returns the normalized values in the same shape.

Note: dim = -1 means "perform the operation over the last dimension". Here the last dimension is d_model, so we take the mean and variance along the embedding vector of each token independently.

```python
### LayerNorm implementation
class LayerNorm(nn.Module):
    def __init__(self, cfg: Config):
        super().__init__()
        self.cfg = cfg
        self.w = nn.Parameter(t.ones(cfg.d_model))   ## gamma (scale), learnable
        self.b = nn.Parameter(t.zeros(cfg.d_model))  ## beta (shift), learnable

    def forward(self, residual: Float[Tensor, 'batch posn d_model']) -> Float[Tensor, 'batch posn d_model']:
        residual_mean = residual.mean(dim=-1, keepdim=True)
        residual_std = (residual.var(dim=-1, keepdim=True, unbiased=False) + self.cfg.layer_norm_eps).sqrt()
        residual = (residual - residual_mean) / residual_std
        residual = residual * self.w + self.b
        return residual
```
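A quick sanity check (a sketch, reusing the hypothetical toy_cfg): after LayerNorm, each token's vector should have roughly zero mean and unit variance along d_model, since w is initialized to ones and b to zeros.

```python
toy_cfg = Config(d_model=50, d_vocab=5000, n_ctx=10, d_head=10, n_heads=5, d_mlp=200)  # same toy config as before

ln = LayerNorm(toy_cfg)
residual = t.randn(1, 7, toy_cfg.d_model) * 3.0 + 5.0  # deliberately not normalized

normed = ln(residual)  # [1, 7, 50], same shape as the input

print(normed.mean(dim=-1))                 # ~0 for every token
print(normed.var(dim=-1, unbiased=False))  # ~1 for every token
```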
Multi-Head Attention

Okay, let's think in simple terms first. Before talking about multiple attention heads, let us understand what happens in a single attention head.

Questions to ask:

Q. What does an attention head get as input?
A. The attention head reads what is present in the residual stream, i.e. Float[Tensor, 'batch seq_len d_model']. From our toy setting, this might be the example "Messi is the greatest of all time".

Q. After the self-attention process (inside the attention block) completes, what does the output look like?
A. Float[Tensor, 'batch seq_len d_model']. The output has the same shape, but a lot of information has moved between positions. Let's go through that in detail.

Information Movement (Intuition)

Let's take two tokens from the example above (for convenience, we represent each token in 4 dimensions). Below is the state of the embedding vectors before entering the attention block:

Messi    → [0.1 0.9 2.3 7.1]
greatest → [2.1 4.4 0.6 1.8]

Once these tokens enter the attention block, each token starts attending to the tokens that came before it, in order to include more context and change its representation. This is called causal self-attention. In "Messi is the greatest of all time", when "greatest" wants to encode some context inside itself, it can only use context from the words "Messi", "is" and "the". Based on these words, the representation of "greatest" changes. After attention:

Messi    → [0.1 0.9 2.3 7.1]
greatest → [0.2 1.1 0.6 1.8]  (changed representation)

What does that mean? Look at the "greatest" vector: it now carries some "Messi" inside of it. It is as if, while constructing the embedding vector for "greatest", the model mixed in a bit of "Messi". This is information movement.

But we still want to know how this process happens exactly. Let me introduce a few matrices which are central to it. In the literature, these are named Queries, Keys and Values:

Q = Input * Wq
K = Input * Wk
V = Input * Wv

Here the input is our example "Messi is the greatest of all time" as it sits in the residual stream. The idea behind Q, K and V is to linearly transform the input into different spaces where it is represented in a more useful way.

Let's see the dimensions of these matrix multiplications on our toy example:

Input/residual = [1 10 50]  [batch seq d_model]

The dimensions of the Wq matrix depend on how many heads we want in our model. This is an important point: if we decide to have only one attention head, then Wq = [n_heads, d_model, d_model] ==> [1, 50, 50]. If we decide to have n_heads heads, each head works in a smaller space of dimension d_model / n_heads, which we call d_head. So if we want 5 heads, d_head = 50 / 5 = 10 and the dimensions of Wq are [n_heads, d_model, d_head] ==> [5, 50, 10].

Say we want 5 heads. Then:

Q = [1 10 50] * [5 50 10]
==> [batch seq d_model] * [n_heads d_model d_head]
==> [1 10 5 10]  [batch seq_len n_heads d_head]

The extra dimension at the front is the batch. The same applies to the K and V matrices. First, let's talk about K:

K = [1 10 50] * [5 50 10] ==> [1 10 5 10]

Attention scores are calculated by multiplying the matrices Q and K. Remember, the attention matrix is always a square matrix over positions. Please look at the diagram I made; I tried to communicate what these dimensions actually mean.

Figure 3

I took two example sentences: 1. "I good"  2. "You bad". In the left representation ([batch, seq, n_heads, d_head]), one batch holds the two examples, each with 2 tokens, and for each token all the heads are laid out side by side, which looks like the full d_model dimension. But we do not compute attention over that combined dimension; we want every batch and every head to process the tokens in parallel, each head with its own d_head-dimensional slice. The representation on the right enables exactly that, which is why we permute the shapes when computing attention (hope this helps!).

Note: don't worry, all of these transformations can be done very intuitively with einsum; you will see this in the code.
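To make those shapes concrete, here is a small einsum sketch using the hypothetical toy_cfg and random tensors, purely for shape bookkeeping (the real version, with learned weights, scaling and masking, comes in the Attention class below):

```python
toy_cfg = Config(d_model=50, d_vocab=5000, n_ctx=10, d_head=10, n_heads=5, d_mlp=200)  # same toy config as before

resid = t.randn(1, 10, toy_cfg.d_model)                                  # [batch seq d_model]
W_Q = t.randn(toy_cfg.n_heads, toy_cfg.d_model, toy_cfg.d_head) * 0.02   # [n_heads d_model d_head]
W_K = t.randn(toy_cfg.n_heads, toy_cfg.d_model, toy_cfg.d_head) * 0.02

q = einops.einsum(resid, W_Q, "batch seq d_model, n_heads d_model d_head -> batch seq n_heads d_head")
k = einops.einsum(resid, W_K, "batch seq d_model, n_heads d_model d_head -> batch seq n_heads d_head")
print(q.shape)  # torch.Size([1, 10, 5, 10])

# einsum handles the permutation for us: the scores come out as [batch n_heads q_seq k_seq]
scores = einops.einsum(q, k, "batch q_seq n_heads d_head, batch k_seq n_heads d_head -> batch n_heads q_seq k_seq")
print(scores.shape)  # torch.Size([1, 5, 10, 10])
```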
Now that we understand how attention scores are computed, let's get back to our Messi example. Earlier we talked about how "greatest" would attend to "Messi". For our toy example we get a [10, 10] matrix of every word attending to every other word. After getting this attention matrix, we apply a causal mask to prevent words from attending to future words: "greatest" cannot attend to "time". After that, we apply a softmax to the attention matrix. Softmax gives scores that sum to 1 along each row; for the word "greatest", the row tells us what fraction of its attention goes to "Messi", how much to "is", how much to "the", and how much to itself. (I took another example from Google to make things visually simple; you can easily connect it to our Messi example.)

Figure 4

Once this is done, the next step is to multiply this attention pattern with our value vectors. As discussed above, the value vectors are again just a linear transformation of the input into another space:

V = Input * Wv ==> [1 10 50] * [5 50 10] ==> [1 10 5 10]

Z = V * A
==> [batch seq_len n_heads d_head] * [batch n_heads q_seq k_seq]
==> [1 10 5 10] * [1 5 10 10]
==> [1 10 5 10]  [batch seq_len n_heads d_head]

Again, once you look at the einsum code, this is self-explanatory. Z holds the output of every head: for each of the 5 heads, every token gets a 10-dimensional vector, so Z is [1, 10, 5, 10]. Conceptually, we concatenate the outputs of the heads for each token into a [1, 10, 50] tensor and multiply by one final output matrix Wo, which can be intuitively thought of as learning how to combine the outputs of the different heads. In the code we keep the heads separate and contract Z directly with Wo of shape [n_heads, d_head, d_model], which is equivalent:

(Z from all the heads) * Wo
==> [1 10 5 10] * [5 10 50]  [n_heads d_head d_model]
==> [1 10 50]

This is how information is moved between tokens and written back into the residual stream. I know there are a lot of dimensions here, but this is the core part; once you grab the gist of it, everything looks straightforward. Look at the code implementing attention below. There are bias initializations as well, which are self-explanatory.

Note: I use "posn" and "seq_len" interchangeably; they are the same.

Implementation details: the causal mask is built with PyTorch's triu/tril functions; please look them up, they are straightforward. register_buffer creates temporary tensors that are part of the module but do not require gradient tracking; registering them as buffers also gives the nice functionality of moving them between CPU and GPU together with the model.
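Before the full class, here is a tiny standalone sketch of the causal-mask idea on a hypothetical 4×4 score matrix: t.triu picks out the strictly upper triangle (future positions), we fill those entries with -inf, and softmax then assigns them zero probability.

```python
scores = t.randn(4, 4)  # pretend attention scores for a 4-token sentence (one head)

# True above the diagonal = "query position attending to a future key position"
mask = t.triu(t.ones(4, 4), diagonal=1).bool()

masked_scores = scores.masked_fill(mask, float("-inf"))
pattern = masked_scores.softmax(dim=-1)

print(pattern)              # lower-triangular: future positions get exactly 0
print(pattern.sum(dim=-1))  # tensor([1., 1., 1., 1.])
```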
```python
class Attention(nn.Module):
    ### register your buffer here
    IGNORE: Float[Tensor, ""]

    def __init__(self, cfg: Config):
        super().__init__()
        self.cfg = cfg
        self.W_Q = nn.Parameter(t.empty((cfg.n_heads, cfg.d_model, cfg.d_head)))
        self.W_K = nn.Parameter(t.empty((cfg.n_heads, cfg.d_model, cfg.d_head)))
        self.W_V = nn.Parameter(t.empty((cfg.n_heads, cfg.d_model, cfg.d_head)))
        self.W_O = nn.Parameter(t.empty((cfg.n_heads, cfg.d_head, cfg.d_model)))
        self.b_Q = nn.Parameter(t.zeros((cfg.n_heads, cfg.d_head)))
        self.b_K = nn.Parameter(t.zeros((cfg.n_heads, cfg.d_head)))
        self.b_V = nn.Parameter(t.zeros((cfg.n_heads, cfg.d_head)))
        self.b_O = nn.Parameter(t.zeros((cfg.d_model)))
        nn.init.normal_(self.W_Q, std=self.cfg.init_range)
        nn.init.normal_(self.W_K, std=self.cfg.init_range)
        nn.init.normal_(self.W_V, std=self.cfg.init_range)
        nn.init.normal_(self.W_O, std=self.cfg.init_range)
        # register the -inf constant as a buffer: no gradients, and it moves with the module's device
        self.register_buffer("IGNORE", t.tensor(float("-inf"), dtype=t.float32, device=device))

    def forward(self, normalized_resid_pre: Float[Tensor, "batch posn d_model"]) -> Float[Tensor, "batch posn d_model"]:
        ### calculate query, key and value vectors and follow the attention formula
        q = (
            einops.einsum(
                normalized_resid_pre, self.W_Q,
                "batch posn d_model, nheads d_model d_head -> batch posn nheads d_head",
            )
            + self.b_Q
        )
        k = (
            einops.einsum(
                normalized_resid_pre, self.W_K,
                "batch posn d_model, nheads d_model d_head -> batch posn nheads d_head",
            )
            + self.b_K
        )
        v = (
            einops.einsum(
                normalized_resid_pre, self.W_V,
                "batch posn d_model, nheads d_model d_head -> batch posn nheads d_head",
            )
            + self.b_V
        )

        attn_scores = einops.einsum(
            q, k,
            "batch posn_Q nheads d_head, batch posn_K nheads d_head -> batch nheads posn_Q posn_K",
        )
        attn_scores_masked = self.apply_causal_mask(attn_scores / self.cfg.d_head**0.5)
        attn_pattern = attn_scores_masked.softmax(-1)

        # Take the weighted sum of value vectors, according to the attention probabilities
        z = einops.einsum(
            v, attn_pattern,
            "batch posn_K nheads d_head, batch nheads posn_Q posn_K -> batch posn_Q nheads d_head",
        )

        # Calculate the output (apply W_O and sum over heads, then add the bias b_O)
        attn_out = (
            einops.einsum(
                z, self.W_O,
                "batch posn_Q nheads d_head, nheads d_head d_model -> batch posn_Q d_model",
            )
            + self.b_O
        )
        return attn_out

    def apply_causal_mask(
        self, attn_scores: Float[Tensor, "batch n_heads query_pos key_pos"]
    ) -> Float[Tensor, "batch n_heads query_pos key_pos"]:
        """Applies a causal mask to attention scores, and returns the masked scores."""
        # Define a mask that is True for all positions we want to set probabilities to zero for
        all_ones = t.ones(attn_scores.size(-2), attn_scores.size(-1), device=attn_scores.device)
        mask = t.triu(all_ones, diagonal=1).bool()
        # Apply the mask to the attention scores, then return the masked scores
        attn_scores.masked_fill_(mask, self.IGNORE)
        return attn_scores
```

Important takeaway: what information gets copied depends on the source token's residual stream, but this doesn't mean it only depends on that token's identity. The residual stream can store much more information than the token identity; the whole purpose of the attention heads is to move information between vectors at different positions in the residual stream. What does that mean? In "Messi is the greatest of all time", when "greatest" attends back to "Messi", it doesn't just see the literal token "Messi". The residual stream at that position also stores things like "Messi is the subject", "Messi is a person", and so on.
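A quick smoke test of the module (a sketch with the hypothetical toy_cfg and random inputs): the attention block reads a [batch, posn, d_model] residual and writes back a tensor of the same shape, ready to be added to the residual stream.

```python
toy_cfg = Config(d_model=50, d_vocab=5000, n_ctx=10, d_head=10, n_heads=5, d_mlp=200)  # same toy config as before

attn = Attention(toy_cfg).to(device)
normalized_resid_pre = t.randn(1, 10, toy_cfg.d_model, device=device)

attn_out = attn(normalized_resid_pre)
print(attn_out.shape)  # torch.Size([1, 10, 50]) -> [batch, posn, d_model]
```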
Now the input goes into the MLP.

Multi-Layer Perceptron (MLP layer)

This is a very important layer: roughly two-thirds of the model's parameters live in the MLPs. The MLP is responsible for a non-linear transformation of the input vectors. The main intuition of this layer is to form rich projections and to store facts. There is a very intuitive video by 3Blue1Brown about this, and it's a must-watch: https://www.youtube.com/watch?v=9-Jl0dxWQs8&t=498s

Intuition: you can loosely think of the MLP as working like a key → value function, where:

Input = "key" (what the token currently holds in the residual stream)
Output = "value" (what features we want to add to the residual stream)

For example, the key is the token's current context vector coming from the residual stream; it represents the meaning of the token so far (including attention context). The value is a non-linear mix of learned features, which could be things like:

1. "This is a named entity"
2. "This clause is negated"
3. "A question is being asked"
4. "Boost strength-related features"
5. "Trigger next layer's copy circuit"

So the MLP says: "Oh, you're a token that's the subject of a sentence AND you were just negated? Cool. Let me output features relevant to that situation." Hope you got the intuition.

The first linear layer projects to 3072 neurons; we call this d_mlp, and it is declared in our config. The second linear layer projects back down to d_model space. These are W_in and W_out in the code. We use the GELU non-linearity.

```python
class MLP(nn.Module):
    def __init__(self, cfg: Config):
        super().__init__()
        self.cfg = cfg
        self.W_in = nn.Parameter(t.empty(cfg.d_model, cfg.d_mlp))
        self.b_in = nn.Parameter(t.zeros(cfg.d_mlp))
        self.W_out = nn.Parameter(t.empty(cfg.d_mlp, cfg.d_model))
        self.b_out = nn.Parameter(t.zeros(cfg.d_model))
        nn.init.normal_(self.W_in, std=self.cfg.init_range)
        nn.init.normal_(self.W_out, std=self.cfg.init_range)

    def forward(self, normalized_resid_mid: Float[Tensor, 'batch posn d_model']) -> Float[Tensor, 'batch posn d_model']:
        ## this is a per-token matmul
        pre = einops.einsum(normalized_resid_mid, self.W_in, 'batch posn d_model, d_model d_mlp -> batch posn d_mlp') + self.b_in
        # gelu_new is GPT-2's tanh-approximation GELU (e.g. from transformer_lens.utils);
        # t.nn.functional.gelu(pre, approximate="tanh") is an equivalent built-in.
        post = gelu_new(pre)
        mlp_out = einops.einsum(post, self.W_out, 'batch posn d_mlp, d_mlp d_model -> batch posn d_model') + self.b_out
        return mlp_out
```

With this, we have completed one layer of what we call a Transformer block. There are 12 such layers in GPT-2, and there are 12 attention heads in the GPT we are implementing; therefore n_heads = 12 and n_layers = 12, which are already set in the config. Our GPT model uses d_model = 768 dimensions and a vocabulary (d_vocab) of 50,257 tokens. So this Transformer block is repeated 12 times. The code for TransformerBlock just connects LayerNorm + Attention + MLP with skip connections:

```python
class TransformerBlock(nn.Module):
    def __init__(self, cfg: Config):
        super().__init__()
        self.cfg = cfg
        self.ln1 = LayerNorm(cfg)
        self.attn = Attention(cfg)
        self.ln2 = LayerNorm(cfg)
        self.mlp = MLP(cfg)

    def forward(self, resid_pre: Float[Tensor, 'batch posn d_model']) -> Float[Tensor, 'batch posn d_model']:
        resid_mid = self.attn(self.ln1(resid_pre)) + resid_pre     ### skip connection
        resid_post = self.mlp(self.ln2(resid_mid)) + resid_mid     ### skip connection
        return resid_post
```

Here the skip connections are nothing but adding the input directly back into the residual stream along with the attention and MLP outputs. resid_pre is the residual before normalization, i.e. the raw input to the block; resid_mid is the residual after attention, and it gets added back in again after the MLP. This is done in order to keep training stable over long runs.
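Another shape check (a sketch with the hypothetical toy_cfg, plus a stand-in definition of gelu_new in case it isn't already defined in your environment): a TransformerBlock reads the residual stream and returns a tensor of exactly the same shape, just enriched.

```python
def gelu_new(x):
    # stand-in: GPT-2's gelu_new is the tanh approximation of GELU
    return t.nn.functional.gelu(x, approximate="tanh")

toy_cfg = Config(d_model=50, d_vocab=5000, n_ctx=10, d_head=10, n_heads=5, d_mlp=200)  # same toy config as before

block = TransformerBlock(toy_cfg).to(device)
resid_pre = t.randn(1, 10, toy_cfg.d_model, device=device)

resid_post = block(resid_pre)
print(resid_post.shape)  # torch.Size([1, 10, 50]) -> the residual stream shape never changes
```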
UnEmbed

The UnEmbed matrix maps the learned representations back to scores over all the tokens in the vocabulary.

Questions to ask:

Q. What input does it take?
A. The residual stream token vectors: [batch posn d_model]

Q. What does it give out?
A. For each position, it gives out a logit for every token in the vocabulary, i.e. a score for how likely each token is to come next, as a tensor of size [batch posn d_vocab]. (These logits become probabilities after a softmax.) Look at the code for how precisely it is calculated.

```python
class UnEmbed(nn.Module):
    def __init__(self, cfg: Config):
        super().__init__()
        self.cfg = cfg
        self.W_U = nn.Parameter(t.empty(cfg.d_model, cfg.d_vocab))
        nn.init.normal_(self.W_U, std=self.cfg.init_range)
        self.b_U = nn.Parameter(t.zeros((cfg.d_vocab), requires_grad=False))

    def forward(self, normalized_resid_final: Float[Tensor, 'batch posn d_model']) -> Float[Tensor, 'batch posn d_vocab']:
        logits = einops.einsum(normalized_resid_final, self.W_U, 'batch posn d_model, d_model d_vocab -> batch posn d_vocab') + self.b_U
        return logits
```

Transformer

Finally, we arrive at the last part. Here we just need to put together all the components we have seen. Let's do that!

```python
class Transformer(nn.Module):
    def __init__(self, cfg: Config):
        super().__init__()
        self.cfg = cfg
        self.embed = Embed(cfg)
        self.posembed = PosEmbed(cfg)
        self.blocks = nn.ModuleList([TransformerBlock(cfg) for _ in range(cfg.n_layers)])
        self.ln_final = LayerNorm(cfg)
        self.unembed = UnEmbed(cfg)

    def forward(self, tokens: Int[Tensor, 'batch posn']) -> Float[Tensor, 'batch posn d_vocab']:
        residual = self.embed(tokens) + self.posembed(tokens)
        for block in self.blocks:
            residual = block(residual)
        logits = self.unembed(self.ln_final(residual))
        return logits
```

Here we go from taking tokens as input to running the residual stream through the 12 Transformer blocks. Implementation detail: since each Transformer block has its own parameters to track, we need to define them in an nn.ModuleList; this is the proper way of initializing a list of blocks. Each block takes its input from the residual stream and writes its contribution back into the residual stream. A quick end-to-end smoke test of the assembled model is given at the very end of the post.

That's it! I hope you have gained a ton of knowledge on how to build your own GPT. Support and follow me for more blogs!

Thanks to Neel Nanda and Callum McDougall! I have learnt a lot from their materials and videos, and this blog is inspired by their work.

Connect with me on: https://www.linkedin.com/in/akhilshekkari/
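Here is that end-to-end smoke test (a sketch: the token IDs are random placeholders rather than real tokenizer output, gelu_new is assumed to be defined as in the MLP section, and the hypothetical toy config with only 2 layers keeps it fast):

```python
toy_cfg = Config(d_model=50, d_vocab=5000, n_ctx=10, d_head=10, n_heads=5, d_mlp=200, n_layers=2)

model = Transformer(toy_cfg).to(device)
tokens = t.randint(0, toy_cfg.d_vocab, (1, 7), device=device)  # stand-in for "Messi is the greatest of all time"

logits = model(tokens)
print(logits.shape)  # torch.Size([1, 7, 5000]) -> [batch, posn, d_vocab]

# Greedy next-token prediction from the last position. The model is untrained, so the
# prediction is meaningless; this only demonstrates how the pieces fit together.
next_token = logits[0, -1].argmax()
print(next_token)
```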