GPT explained visually

summarized

TLDR

GPT (Generative Pre-trained Transformer) is the core architecture behind modern LLMs from labs like OpenAI, Anthropic, and DeepSeek, each tuning it for faster token generation, longer context, better tool calling, and more intelligence. The architecture builds from token embedding (giving tokens internal representation) and positional embedding, through multi-headed attention (Q, K, V vectors for relational communication), to feed-forward networks, layer normalization, and residual connections—all stacked in blocks to predict the next token.

Key points

  • GPT's token embedding gives each token a multi-dimensional vector to store internal meaning beyond a simple ID, enabling nuanced language understanding.
  • The attention mechanism uses three vectors, Q (query), K (key), and V (value), to compute relevance scores between tokens, enabling communication within the sequence.
  • Multi-headed attention splits the embedding into multiple sub-spaces (heads), allowing the model to capture different linguistic features such as grammar, short-range relationships, and long-range relationships.
  • Positional embedding is added to token embeddings to give the model awareness of token order, which is critical for meaning in sequences like 'love your job' vs 'job your love'.
  • Feed-forward networks provide additional capacity ('breathing room') after attention to learn deeper representations.
  • Layer normalization stabilizes numerical values across stacked layers, preventing wild fluctuations in the output.
  • Residual connections bypass layers and add the original input back later, preventing distortion and instability as the model depth increases.
  • GPT is trained by passing billions/trillions of tokens through the architecture, using methods like stochastic gradient descent to shape the model's behavior.

Tools mentioned

  • GPT
  • OpenAI
  • Anthropic
  • DeepSeek
  • Claude
  • Cursor
  • Claude Code
  • ChatGPT

Techniques

  • token embedding
  • positional embedding
  • multi-headed attention
  • feed-forward network
  • layer normalization
  • residual connections
  • decoder-only transformer
  • stochastic gradient descent
  • optimizers

Takeaways

  • GPT's core components—token embedding, positional embedding, multi-headed attention, feed-forward layers, normalization, and residual connections—work together to enable next-token prediction.
  • Multi-headed attention allows the model to view sequences from multiple perspectives (e.g., grammar, short/long-range relationships).
  • Training GPT requires massive amounts of data and iterative optimization to shape the model's understanding.
  • Different AI labs optimize various parts of the GPT architecture to meet application-layer demands like speed, context length, and tool use.
Transcript (captions)
Given how fast LLMs are advancing these days, we should really pause to understand how GPT actually works. GPT is what powers LLMs, and frontier labs like OpenAI and Anthropic, and even Chinese labs like DeepSeek and MiniMax, all use GPT, each tuned in its own way. The reason these labs tune their own version of GPT is that, when we look at the application layer here, there's downward pressure: agentic applications at this layer want faster token generation, longer context windows, better tool calling, and more intelligent models. So everything comes down to GPT. What is this Generative Pre-trained Transformer, and how does it actually work under the hood? Let's start with the Galton board as an analogy. I'm sure by now we've all seen this demonstrated before. What initially seems like it should be random actually starts to show a predictable pattern like this, where you have a higher concentration in the middle. Now imagine we label the bottom row with the entire English alphabet, including numbers and special characters, and we use this board to start generating a sentence left to right, where each ball that drops determines what the next token is. In this case the sentence will likely sound like gibberish, mostly consisting of letters that happen to be in the middle of the board. Not great. So how do we tweak the inner mechanics so that the board can actually construct a sentence that sounds plausible? What if we swap out the inner mechanics and just insert GPT instead? This way, every ball that drops goes through GPT, and we simply let GPT decide what the next token should be given the current token. The first thing we need to do is get a bunch of data to train the model. Let's use the script that I'm reading right now for this video to train our GPT to sound like me. We first need to chunk our dataset so we can feed it into the model in pieces instead of dropping in the entire dataset all at once.
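To make the Galton board analogy above concrete, here is a minimal sketch in plain Python. It assumes a board whose bottom row is labeled with the 26 lowercase letters plus a space (the board in the video also carries digits and special characters), and the names `LABELS`, `drop_ball`, and `generate` are made up for this illustration. Each character comes from an independent ball drop, so the output clusters around the middle labels, which is the gibberish described above.

```python
import random
import string
from collections import Counter

# Assumption: one bin per lowercase letter plus a space (27 bins in total).
LABELS = string.ascii_lowercase + " "


def drop_ball(rng, n_bins):
    """Drop one ball: at each of n_bins - 1 pegs it goes left (0) or right (1)."""
    return sum(rng.random() < 0.5 for _ in range(n_bins - 1))


def generate(rng, length):
    """Build text where every character is an independent ball drop."""
    return "".join(LABELS[drop_ball(rng, len(LABELS))] for _ in range(length))


rng = random.Random(0)
sample = generate(rng, 2000)
counts = Counter(sample)
middle = LABELS[len(LABELS) // 2]   # the label on the middle bin ('n')

print(sample[:40])                  # mostly letters near the middle of the alphabet
print(counts.most_common(5))
```

Nothing in this board looks at the previous characters, which is the gap GPT is meant to fill: the next token should depend on what came before.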
Since we have to assume that this data is going to be huge, in our example we can start by setting up four batches that all train in parallel, and each batch holds a block of eight tokens. So we're essentially sampling 32 tokens total from the training data and splitting them four ways to train in parallel. Okay, let's keep going. Now that we've chunked our dataset to start feeding it into the model, what we need to do is create what's called a token embedding. Now, you might be wondering why we need a token embedding in the first place. Since we have the entire character set labeled at the bottom, these labels help us know what the token actually is. But the token itself doesn't actually mean anything. In other words, the character A all the way to the left here just means position one, B means position two, and so on. So these IDs don't actually help us understand the nuance behind the token, like whether the letter A tends to appear after a space, whether it tends to repeat itself, or whether punctuation tends to appear before it. All of these are important nuances to learn from, and the ID by itself doesn't have enough room to store this kind of information. So instead of having just the ID to label the tokens, we can simply give each token more room to represent its internal meaning. This breathing room that we just created for each token is why we need a token embedding in the first place. So in our example, we're going to give each token 32 dimensions to start storing its internal representation beyond just the ID. Pretty cool. So let's do a quick check on what we have so far. There are 128 possible labels, assuming that we're using the ASCII set, and under each token we give 32 dimensions to store its internal representation. So we have a token embedding table of size 128x32.
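The chunking and embedding steps above can be sketched as follows, using only the Python standard library. The numbers (4 batches, 8 tokens per block, 128 ASCII tokens, 32 dimensions) come from the transcript. The sample text, the random initialization, and the targets shifted by one position are assumptions for this illustration, and `get_batch` is a hypothetical helper name.

```python
import random

VOCAB_SIZE = 128   # ASCII code points 0..127
N_EMBD = 32        # dimensions of internal representation per token
BATCH_SIZE = 4     # sequences processed in parallel
BLOCK_SIZE = 8     # tokens per sequence

rng = random.Random(0)

# Assumption: a short stand-in for the video script; any ASCII text works.
text = "Given how fast LLMs are advancing, we should pause to understand GPT. " * 4
data = [ord(ch) for ch in text]   # token ID = ASCII code of the character

# Token embedding table: one 32-dim row per token ID, random before training.
tok_emb = [[rng.gauss(0.0, 0.02) for _ in range(N_EMBD)] for _ in range(VOCAB_SIZE)]


def get_batch():
    """Sample BATCH_SIZE random windows; targets are the inputs shifted by one."""
    xs, ys = [], []
    for _ in range(BATCH_SIZE):
        start = rng.randrange(len(data) - BLOCK_SIZE)
        xs.append(data[start:start + BLOCK_SIZE])
        ys.append(data[start + 1:start + BLOCK_SIZE + 1])
    return xs, ys


xb, yb = get_batch()

# Embedding lookup: swap every token ID for its row in the table.
x_emb = [[tok_emb[t] for t in seq] for seq in xb]
print(len(x_emb), len(x_emb[0]), len(x_emb[0][0]))   # 4 8 32
```

Each of the 4 batches ends up as an 8x32 grid of numbers, which is the shape the attention step works on.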
Now that we've created the plumbing to allow token representation, we have to look at the data that's coming in. Earlier we split our data into four batches, each containing a block of eight tokens, which means the tokens in each batch can now look up their vectors in the token embedding table we just created. This means each batch will have an 8x32 matrix of vectors sampled directly from the embedding table, where 32 is the depth we added earlier to give each token the ability to hold its internal representation. And now that we have that, we're still missing something critical here, which is attention. But why? Before we even try to understand what attention is, why do we need attention in the first place? What we just built is only token representation, not how the tokens within a block actually communicate with each other. In other words, we need to start capturing how these tokens communicate with each other within the blocks themselves. This is handled outside the token embedding table, by something called attention. I'm sure by now you've heard the term attention mentioned all the time, along with the highly acclaimed paper "Attention Is All You Need" from 2017 by Google. Here's how the basic attention mechanism actually works. Our goal is to find out how the tokens communicate with each other. Just like earlier, when we created breathing room for each token's internal representation, we need to create breathing room for attention as well. But this time we're creating three different tables instead of one. Now you might be wondering why exactly three. Having just one table, like earlier, won't give attention enough breathing room to take place, because paying attention to something means you have to start by separating the act of looking from the act of being looked at.
So we have to have one vector that does the searching, called Q (query), another vector that does the labeling, called K (key), and finally another vector that contains the value itself, called V (value). Now, this might sound very unintuitive at first, because it seems like we're just making up random vectors and hoping that attention will happen somehow, but once you look at how the actual math is done, it becomes easier to see how each of the three components plays its own role in attention. Feel free to skip this math part, but I promise you it's not that difficult. We start by creating vectors that are the same size as, or smaller than, the sequence embedding. Now that we have three randomly initialized vectors Q, K, and V, we can start calculating the attention score. The attention score helps us find the strength of relevance within the sequence. Since our sequence is 8 tokens long, we need to end up with an 8x8 table where each intersection tells us how relevant one token is to another. We can get there by multiplying Q by the transpose of K, which gives us the 8x8 matrix of attention scores. And since this multiplication might end up exploding into large values, we scale the output down by dividing by the square root of the head size, which in our case is 32, so roughly 5.66. Now that we have the scaled attention scores, we can convert these numerical values into percentages with softmax to get a probability distribution, after first setting the future tokens to negative infinity, since we're using a decoder-only transformer, which means a token can't see future tokens. So all this work so far helps us find the strength of communication between tokens. Now we need to find the actual information each token should receive, by multiplying the attention probabilities with the V vectors. This is why we need three different vectors Q, K, and V.
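The attention math above is usually written as softmax(Q·Kᵀ / √d)·V, where d is the head size. Below is a minimal single-head sketch in plain Python using the transcript's numbers (8 tokens, head size 32). Q, K, and V are filled with random numbers here purely as an assumption for the demo; in a trained model they are learned projections of the token representations. The helper names are made up for this example.

```python
import math
import random

T, HEAD_SIZE = 8, 32   # sequence length and head size
rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def softmax(row):
    m = max(row)                             # subtract the max for stability
    exps = [math.exp(v - m) for v in row]    # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(q, k, v):
    scale = math.sqrt(len(q[0]))             # sqrt(32), roughly 5.66
    weights = []
    for i in range(len(q)):
        scores = []
        for j in range(len(k)):
            if j > i:
                scores.append(-math.inf)     # token i must not see future token j
            else:
                dot = sum(a * b for a, b in zip(q[i], k[j]))
                scores.append(dot / scale)
        weights.append(softmax(scores))      # each row sums to 1
    # Every position receives a weighted sum of the value vectors it may see.
    out = [[sum(w[j] * v[j][d] for j in range(len(v))) for d in range(len(v[0]))]
           for w in weights]
    return weights, out


q = rand_matrix(T, HEAD_SIZE)
k = rand_matrix(T, HEAD_SIZE)
v = rand_matrix(T, HEAD_SIZE)
weights, out = causal_attention(q, k, v)
print(len(weights), len(weights[0]), len(out), len(out[0]))   # 8 8 8 32
```

The first token can only attend to itself, so its row of weights is 1 followed by zeros and its output is simply its own value vector. Later tokens blend the values of everything before them.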
And the entire operation that we just did can be summarized by this magical equation that changed the course of history when it comes to scaling the LLMs that we have today. Pretty cool, right? So let's do a quick mental review and look at what we've built so far. We started by labeling the bottom of the board with the alphabet. Then we gave the tokens more breathing room to store internal representation by using what's called a token embedding table. And then we added the attention mechanism to capture the relational side of how each token communicates, by creating three new vectors Q, K, and V to calculate the attention scores and find the weighted information. Perfect. What more could we need? Turns out, in order to call what we just built a GPT model, we actually need a few more components in place, because what we have here isn't really good enough. But what could be missing? It seems like we already have enough to start training the model with our data. But first, thanks to Outskill for sponsoring this video. Anthropic launched Fable 5 recently, and the US government quickly restricted foreign access because of how capable these frontier models are becoming. Whether you're using Claude, ChatGPT, Cursor, or Claude Code, one clear thing is that these models are getting more and more powerful, and most people still use them like a search engine while power users are using them to build, automate, and ship faster than ever. So if you want to learn how to actually maximize these tools, you can check out the Claude AI mastery workshop from Outskill, which is a full deep dive into Claude and its use cases. The workshop is happening this weekend from 10:00 a.m. to 7:00 p.m. Eastern time, and you'll be joining a community of millions of learners with a 4.9 rating on Trustpilot.
In two days, you won't just learn Claude conceptually, you'll build your own artifacts and dashboards, create full presentations, and get hands-on experience with Claude Code, including how to work with real projects, GitHub repos, and developer workflows. Outskill is also including three free bonuses: an AI prompt library, 50 Claude Code prompts, and a personalized AI toolkit builder. So if you want to get better at using Claude beyond basic prompting, you can sign up and join the WhatsApp community before seats close. Link in the description below. Turns out what we've built so far is good, but the model is incapable of knowing the order of the tokens that are coming in. In other words, we have the internal representations that carry the deeper meaning of each token, and we also have the attention mechanism to learn how the tokens communicate with each other. But there's no inherent mechanism to store the positional order of the sequence itself. For example, we know that the phrase "love your job" carries a vastly different meaning than "job your love". So we also need to give the model breathing room to learn the positional embedding of the sequence. We go ahead and create a vector that represents the position of each token in the sequence and add it directly on top of the token embedding we sampled. This way the combined vector mixes the token embedding with the positional embedding. Now, there's actually a bit more groundwork we need to do when we zoom into the attention portion of the architecture. The attention mechanism we built earlier has three separate vectors Q, K, and V, and in our example their size was the same as the sequence embedding. Turns out there's a huge benefit in segmenting the attention mechanism to essentially divide and conquer.
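The positional embedding step described above can be sketched like this. It assumes a learned position table with one row per position, which is one common choice and not necessarily what any particular lab uses. The word-level token IDs `LOVE`, `YOUR`, and `JOB` are hypothetical and chosen only to mirror the "love your job" example; the model in the transcript works on characters.

```python
import random

VOCAB_SIZE, N_EMBD, BLOCK_SIZE = 128, 32, 8
rng = random.Random(0)


def table(rows, cols):
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


tok_emb = table(VOCAB_SIZE, N_EMBD)   # one row per token ID
pos_emb = table(BLOCK_SIZE, N_EMBD)   # one row per position 0..7


def embed(token_ids):
    """Return token embedding + positional embedding for every position."""
    return [[t + p for t, p in zip(tok_emb[tok], pos_emb[pos])]
            for pos, tok in enumerate(token_ids)]


# Hypothetical word-level IDs, for illustration only.
LOVE, YOUR, JOB = 1, 2, 3
a = embed([LOVE, YOUR, JOB])   # "love your job"
b = embed([JOB, YOUR, LOVE])   # "job your love"

print(a[1] == b[1])   # True: "your" is at position 1 in both phrases
print(a[0] == b[2])   # False: "love" at position 0 differs from "love" at position 2
```

Without the position table, both phrases would hand the model the same three vectors, just in a different order.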
And you're going to love this, because now we have not just one set of Q, K, and V vectors, but multiple Q, multiple K, and multiple V vectors, each responsible for its own slice of the embedding. I know, so many acronyms to keep track of now. And you're probably wondering, why on earth are we making this more complicated than it needs to be? This kind of segmentation is called multi-headed attention. And it turns out that splitting the model this way results in the model looking at the same sequence with multiple perspectives in mind. So if you have four attention heads, for example, one might look at how the tokens communicate in terms of grammar, another might look at short-range relationships, another at long-range relationships, and so on. The original GPT paper takes this specialized approach rather than a single generalized one, using multi-head attention. Now let's zoom back out from attention and look at the GPT architecture we've built so far. Not bad. We have the token embedding, the positional embedding, and a solid attention mechanism. What more do we want? Comparing our architecture with the paper, we're actually missing three components: first, the feed-forward network, second, layer normalization, and third, residual connections. Gosh, I know. So many things to keep track of, but the individual components aren't that difficult to learn. As you've seen so far, it's keeping track of all these components and uniting them into what's called GPT that takes a lot of mental work. Let's start with this blue box called the feed-forward network. Why do we even need this thing in the first place? Even though we just implemented our multi-headed attention to keep track of how tokens communicate with each other, we haven't really given the tokens enough room to actually think about what they've learned.
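The multi-headed attention described above can be sketched as follows: the 32 embedding dimensions are divided among 4 heads of 8 dimensions each, every head runs its own causal attention, and the results are concatenated back to 32 dimensions. The projection matrices are random here as an assumption for the demo, and the output projection that real implementations usually apply after concatenation is left out to keep the sketch short. All helper names are made up.

```python
import math
import random

T, N_EMBD, N_HEAD = 8, 32, 4
HEAD_SIZE = N_EMBD // N_HEAD        # 8 dimensions per head
rng = random.Random(0)


def rand_matrix(rows, cols, std=1.0):
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def head(x, wq, wk, wv):
    """One causal attention head working in its own 8-dim subspace."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scores = matmul(q, [list(c) for c in zip(*k)])       # Q times K transposed: (T, T)
    scale = math.sqrt(HEAD_SIZE)
    masked = [[s / scale if j <= i else -math.inf for j, s in enumerate(row)]
              for i, row in enumerate(scores)]
    return matmul([softmax(r) for r in masked], v)       # (T, HEAD_SIZE)


def multi_head(x, heads):
    outs = [head(x, *w) for w in heads]
    # Concatenate the heads along the feature axis: 4 * 8 = 32.
    return [sum((o[t] for o in outs), []) for t in range(len(x))]


x = rand_matrix(T, N_EMBD)
# Each head gets its own (Wq, Wk, Wv), random here and learned in practice.
heads = [tuple(rand_matrix(N_EMBD, HEAD_SIZE, 0.1) for _ in range(3))
         for _ in range(N_HEAD)]
y = multi_head(x, heads)
print(len(y), len(y[0]))   # 8 32
```

Because each head has its own projections, each one can learn to score relevance differently, which is where the "multiple perspectives" come from.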
Similar to earlier, when we added the token embedding to give more breathing room at the token level, we have to add breathing room at the attention level as well. A simple feed-forward neural network gives the model exactly this breathing room to learn much deeper information. That's why we need this blue box called the feed-forward network after attention. We also have this term in the yellow box called layer normalization. Same interrogative question here. Why do we need this exactly? Whenever we stack operations like attention and feed-forward networks one after another, the actual numerical values can start to get out of hand really fast. So in order to reduce the crazy fluctuations in values that might exist in the output, we normalize the values to a much more manageable range. Sweet. Now that we have almost everything covered, the only thing left to do is make the model deeper by stacking these blocks. A block might contain an attention layer followed by a feed-forward layer. So essentially, you can repeat these blocks multiple times, chained together, to add more depth to the model for deeper understanding, until you get to the final layer, where you project your output back to the token vocabulary, and that is then used to predict the next token on the board we used earlier. There's one last piece of the architecture that's quite easy to miss, which is this arrow that you're seeing here. This arrow allows the input to bypass the normal path, skip the step entirely, and be added back in later. This is called a residual connection. Because we're making the model deeper and deeper, it has a tendency to get very unstable, since every change that we make at each block essentially gets amplified as it passes through all the layers.
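The layer normalization and feed-forward steps above can be sketched per token as follows. This is a simplified illustration with several assumptions: the layer norm has no learned gain or bias, the hidden layer is 4 times wider than the embedding, and ReLU is used as the non-linearity (GPT-style models commonly use GELU). The function and variable names are made up.

```python
import math
import random

N_EMBD = 32
rng = random.Random(0)


def layer_norm(vec, eps=1e-5):
    """Rescale one token vector to mean 0 and variance close to 1."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def make_linear(n_in, n_out):
    """Weights stored as n_out rows of n_in values, plus a zero bias."""
    w = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out


def linear(vec, weight, bias):
    return [sum(v * w for v, w in zip(vec, row)) + b for row, b in zip(weight, bias)]


W1, B1 = make_linear(N_EMBD, 4 * N_EMBD)   # expand 32 -> 128 (4x is an assumption)
W2, B2 = make_linear(4 * N_EMBD, N_EMBD)   # project back 128 -> 32


def feed_forward(vec):
    hidden = [max(0.0, h) for h in linear(vec, W1, B1)]   # ReLU non-linearity
    return linear(hidden, W2, B2)


# A token vector whose values have drifted far out of range.
wild = [rng.gauss(0.0, 500.0) for _ in range(N_EMBD)]
normed = layer_norm(wild)
out = feed_forward(normed)

print(max(abs(v) for v in wild))     # in the hundreds or more
print(max(abs(v) for v in normed))   # a few units at most
print(len(out))                      # 32
```

The feed-forward network works on each token independently, so it adds capacity without mixing information between positions. That mixing is the job of attention.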
And this residual connection gives us a mechanism where you're essentially teaching the model to apply a modification on top of the input rather than modifying the input itself. This prevents the output from being distorted by the time it reaches the end. This entire thing that we just built is now finally called GPT. Now we can go back to the Galton board, replace the inner mechanism, and insert GPT instead. All we have to do is train the GPT architecture with our data. In other words, just because we've added plumbing that resembles GPT doesn't mean it will do the job that it's meant to do. We have to actually put in the work to pass billions or trillions of tokens of data through the model to have the model behave like our dataset. There's a huge rabbit hole in the training methods here as well, as we get into different techniques like stochastic gradient descent and optimizers, which all play an important role in shaping a better GPT model. And it's in this exact GPT architecture that many labs optimize essentially every component, including the training methods. Each lab might choose to optimize different portions of this architecture at the model layer to meet what it thinks are the important demands at the application layer, while working with different constraints that might exist at other layers, like the infrastructure and chip layers. The decisions made at the model layer not only affect the agentic layer in terms of context window, speed, and intelligence, but also determine what hardware is needed to get interactivity and throughput at the lower layers.
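The residual connection described above boils down to x + f(x): each sublayer's output is added on top of its input instead of replacing it. The sketch below stacks 12 blocks (the depth is an assumption) and uses tiny hypothetical stand-ins, `tiny_attention` and `tiny_ffn`, in place of the real sublayers, just to show that the input survives the trip through the stack with only small modifications.

```python
def residual(x, sublayer):
    """Add the sublayer's output on top of the untouched input: x + f(x)."""
    return [xi + fi for xi, fi in zip(x, sublayer(x))]


def block(x, attention, feed_forward):
    x = residual(x, attention)       # tokens communicate
    x = residual(x, feed_forward)    # then each token "thinks"
    return x


# Hypothetical stand-ins for the real sublayers: each proposes a small update.
def tiny_attention(x):
    return [0.01 * v for v in x]


def tiny_ffn(x):
    return [-0.01 * v for v in x]


x0 = [1.0, -2.0, 0.5, 3.0]
x = x0
for _ in range(12):                  # stack 12 blocks
    x = block(x, tiny_attention, tiny_ffn)

# Largest change in any value after passing through all 12 blocks.
drift = max(abs(a - b) for a, b in zip(x, x0))
print(drift)
```

A sublayer that outputs all zeros leaves the input exactly as it was, so every block starts out close to a pass-through and only has to learn the change it wants to make.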
