TLDR
VibeThinker 3B, a small model from Weibo AI Lab, beats giant models like Gemini 3 Pro and Claude Opus on math and code reasoning tasks by using reinforcement learning from verifiable rewards. It is based on Qwen 2.5 Code 3B and focuses on structured reasoning rather than memorizing broad knowledge. However, it struggles with tasks requiring general knowledge and is primarily a research project.
Key points
- VibeThinker 3B is a 3B parameter model that outperforms much larger models on math and code reasoning benchmarks.
- The model is built on Qwen 2.5 Code 3B and uses a post-training recipe with RLVR and MGPO.
- The training involves a two-stage curriculum SFT and multi-domain reinforcement learning to encourage long-horizon reasoning.
- The model excels at verifiable reasoning tasks but lacks broad knowledge, performing poorly on tasks like SVG generation or long essays.
- The paper proposes that not all intelligence needs the same number of parameters; structured reasoning can be done with smaller models.
- The presenter tests the model locally and finds it produces very long chains of thought even for simple tasks.
- The model is not production-ready but offers interesting research insights for improving open models.
Tools mentioned
- VibeThinker 3B
- Qwen 2.5 Code 3B
- Weibo AI Lab
- Dell Max Pro
- RTX Pro 6000
Techniques
- RLVR (Reinforcement Learning from Verifiable Rewards)
- MGPO (MaxEnt-Guided Policy Optimization)
- Curriculum Supervised Fine-Tuning
- Multi-domain RL
- Long-to-short math RL
- Claim-level reliability (CLR)
- Spectrum to Signal principle
- Distillation with diversity
- Test-time compute scaling
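To make the first technique in this list concrete: RLVR only works where a program can check the answer. The sketch below is a minimal, hypothetical reward function for math problems, not the paper's implementation. It assumes the model writes its final answer inside `\boxed{...}`; that convention and the names `extract_final_answer` and `verifiable_reward` are illustrative assumptions.

```python
import re
from fractions import Fraction
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in the completion, if any.

    Assumption: answers are reported in \\boxed{...}; nested braces are not handled.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None


def verifiable_reward(completion: str, reference: str) -> float:
    """Binary reward: 1.0 if the extracted answer matches the reference, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    try:
        # Compare numerically so that "0.5" and "1/2" count as the same answer.
        return 1.0 if Fraction(answer) == Fraction(reference) else 0.0
    except (ValueError, ZeroDivisionError):
        # Non-numeric answers fall back to an exact string comparison.
        return 1.0 if answer == reference.strip() else 0.0
```

A reward like this can be computed for math and code (where unit tests play the same role), which is why those benchmarks benefit, while open-ended knowledge questions have no comparable checker.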
Takeaways
- Small models can rival large ones on specific reasoning tasks if trained with verifiable rewards.
- The model is not for general use; it's a research project showing potential for future open models.
- Techniques like RLVR and MGPO can significantly improve reasoning in small models.
- The model's long chain-of-thought can be inefficient for simple tasks.
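The long-to-short math RL stage listed under Techniques is described in the transcript as first optimizing for accuracy, then rewarding shorter correct answers and penalizing long ones. A minimal sketch of such a length-aware reward follows. The function name, the token budget, and the penalty weight are all hypothetical values chosen for illustration, not numbers from the paper.

```python
def length_aware_reward(
    is_correct: bool,
    num_tokens: int,
    budget: int = 8000,    # hypothetical token budget
    penalty: float = 0.5,  # hypothetical maximum deduction for long answers
) -> float:
    """Reward correct answers, with a deduction that grows with length.

    Incorrect answers always get 0.0, so brevity never outranks correctness.
    Setting penalty=0.0 recovers a pure accuracy reward (the earlier stage).
    """
    if not is_correct:
        return 0.0
    overuse = min(num_tokens / budget, 1.0)  # fraction of the budget used, capped at 1
    return 1.0 - penalty * overuse
```

With these defaults a correct answer always scores at least 0.5, so a long correct answer still beats any wrong one.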
Transcript (captions)
Okay, so how does a 3B model beat Gemini 3 Pro? In fact, not only Gemini 3 Pro: how does it beat Claude Opus, GLM 5.1, and DeepSeek V3.2 on things like hard math? In this video I'm going to look at this model. I think it's very easy to be dismissive of the claims that they're making, but it's important to understand that they're claiming this can beat those big models on a number of very specific tasks, specifically related to reasoning. And that's something where we've always basically presumed the bigger models are going to do better. So, what I want to do in this video is look at the claims that they're making and look at how they've actually built this model, because I think in some ways that's perhaps more interesting than the benchmarks. And then we'll have a play with the model. I'm going to show you running it locally and we can do some tests on it.
All right, so what is the actual model? The model is VibeThinker 3B, and this comes out of the Weibo AI Lab. For those of you who don't know, Weibo is like the Chinese version of Twitter, a sort of social network site there. And this particular LLM comes out of a group based in Singapore that's part of their AI lab. Now, the model itself is not something that they've trained from scratch. They've taken Qwen 2.5 Code 3B, which is actually quite an old model now, so it's kind of interesting that they've obviously been working on that. And they've developed a post-training recipe that has allowed them to take that base model and create this VibeThinker model.
Up until now, the general consensus around small models is that you fine-tune them for very narrow tasks and they become really good at one specific thing. In the process of doing that, though, they generally lose their ability to reason and generalize.
So, what the team here is doing is trying to train the model so that it can generalize, at least to things like math and code. And the thinking behind what they're doing is actually very interesting. They're proposing the idea that not all intelligence needs the same number of parameters, or needs to use parameters in the same way. They're saying that tasks that use verifiable reasoning, so things like math and code, are mostly a matter of search and constraint satisfaction, along with error correction. And in this structured space, you don't need to memorize all the facts and stuff like that. It's really all about building an engine that is good at figuring things out.
Now, this is an idea that's not new. Even Andrej Karpathy himself has proposed that perhaps one day we get to a 1 billion parameter model that doesn't really store facts or knowledge in it. It basically just has a set of core reasoning principles that it can use with tools like search to be able to generate a whole bunch of different ideas. And this is where this paper is going. They talk about verifiable reasoning as one kind of intelligence, and the other one being broad knowledge. That includes things like your random science and long-tail facts. And they're basically saying that for those kinds of tasks, there really is no shortcut. You need a lot of raw parameter capacity to store all of that.
Now, if we come in and look at their benchmarks, we can see that they're actually competing with a lot of models that are around 300 times bigger. So, the thing I would say is even if they're not winning on every benchmark, the fact that they're in similar ranges is kind of amazing considering the size of this.
Now, if we look at those benchmarks, we can see that on things like the math benchmarks of AIME and AIME 26, they're actually on par with, if not beating, a lot of the models like Claude Opus 4.5, Kimi 2.5, GLM 5, Gemini 3 Pro, etc. Then you've got coding benchmarks in here where they're doing really well too. And if we look at them compared to other small models, you can see that they're just miles ahead of even things like the Gemma 4 12B model, the Olmo models, and the newer Qwen 3 and 3.5 models. But clearly those math tests and coding tests are things that you can use reinforcement learning with verifiable rewards on. The knowledge ones, you can't. And sure enough, if you look at GPQA Diamond here, they're really not that different compared to the open models. In fact, they're behind the larger open reasoning models, and obviously a long way behind the proprietary models. So they're not necessarily making the claim that this model is generally better than Opus 4.5, Gemini 3 Pro, or even the Kimi 2.5 and GLM models. It's that they can take RLVR, the reinforcement learning from verifiable rewards, and improve that as much as possible. And it looks like they've gone for something a little bit like GRPO, but with their own flavor and take on that.
Okay, so the cool thing is they have released a very thorough paper here with a good amount of detail about what they actually did. The whole thing follows this spectrum-to-signal principle that they propose, where they first build a bunch of synthetic data with diverse sets of solution strategies. That's what they call the spectrum. And then they use reinforcement learning to amplify the correct ones in there, which is the signal. So if we look at their training pipeline, we can see they start off with a two-stage curriculum supervised fine-tuning.
So, stage one is the broad coverage across math, code, STEM topics, and chat, and stage two is retraining only on harder, long problems. They actually mention here that they throw out reasoning traces that are under 5,000 tokens and anything that's an easy problem. So what they're going for is trying to force this deep, long-horizon reasoning instead of any sort of shallow pattern matching that the model can do.
They're also doing multi-domain RL with what they're calling MGPO, MaxEnt-guided policy optimization. This seems to be a variation on GRPO. And the idea seems to be that they weight the training so that they don't want examples that are too easy, but they also don't want things that are too hard, where the model is perhaps not at the level yet to be able to take on those things.
The other thing they're doing here is a version of distillation where they're really going for diversity. They don't want the model to just converge on one way to do each thing. They want to have multiple checkpoints that they can sample from, and then they look at merging them. So it seems like the idea is that this basically empowers the instruction reinforcement learning, giving it the ability to get diverse answers out and then perhaps also to home in on the ones that are right.
They're also doing some things where they've got a long-to-short math RL. So first they optimize for accuracy, and then later on they give rewards to shorter correct answers and penalize the long ones. This is one of the things that we've seen with the proprietary models: having gone for really long reasoning, they're now trying to scale that back so the model gets the answer correct in the fewest tokens possible.
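The difficulty weighting described for MGPO can be illustrated with a small sketch. This is one plausible way to down-weight problems that are too easy or too hard, not the formula from the paper: it assumes binary rewards, uses the binary entropy of a problem's empirical pass rate as the weight (an assumption), and applies it to GRPO-style group-normalized advantages. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def pass_rate(rewards: Sequence[float]) -> float:
    """Fraction of sampled answers in the group that earned a positive reward."""
    return sum(1 for r in rewards if r > 0) / len(rewards)


def difficulty_weight(p: float) -> float:
    """Binary entropy of pass rate p, in bits.

    Equals 1.0 at p = 0.5 (the model is maximally uncertain) and falls to 0.0
    as p approaches 0 or 1 (problem too hard or too easy). Assumed weighting,
    chosen only to illustrate the idea.
    """
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def weighted_advantages(rewards: Sequence[float]) -> List[float]:
    """Group-normalized advantages (as in GRPO), scaled by the difficulty weight."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    if std == 0.0:
        # Every sample got the same reward: the group carries no learning signal.
        return [0.0] * n
    weight = difficulty_weight(pass_rate(rewards))
    return [weight * (r - mean) / std for r in rewards]
```

For a group of four samples with rewards `[1.0, 0.0, 1.0, 0.0]` the weight is 1.0 and the advantages are `[1.0, -1.0, 1.0, -1.0]`, while a group the model always solves contributes nothing to the update.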
And they've got some other things in here like CLR, which is claim-level reliability. This is basically their test-time trick for how you generate lots of answers and then narrow them down to decide which one is actually the right answer. And you can see looking at the benchmarks that it boosts their results by a decent amount. Often they're just under the proprietary models or the other big open-weight models, and using this test-time compute technique is what gets them over. So in some ways you could say that's not really fair, because the other big models are perhaps not doing that. And if they are, then, for example, you should really be comparing to the Gemini Deep Think model, not just the Gemini Pro model.
Anyway, overall, it's really interesting to see what is actually here and get a sense of some of the techniques that might be able to unlock better reasoning in open models. I believe the thing that's allowing the proprietary models to stay ahead is that they've got techniques that are better at generating reasoning chains of thought of much higher quality than the open models.
So, let's just jump in and have a look at the responses that we actually get out of the model. Okay, so I'm going to be using my model testing suite here. This is running on a Dell Max Pro with an RTX Pro 6000 in it, so it has no problem with running the model at all. Thank you to Dell for sponsoring the computing here. Now, we can try it out, and if we try it with what they're saying it's good at, sure enough, we find that for things like coding, math, and even logic, it definitely has extremely long chains of thought where it will basically try to optimize something out. And generally with code, it does pretty well. The thing is, you're getting very long chains of thought in here.
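The test-time trick mentioned earlier, generating many answers and then deciding which one to keep, can be illustrated with plain majority voting over sampled final answers. This is a deliberately simpler stand-in, not the paper's claim-level reliability method, whose details are not given in the video. The `sample` callable is a hypothetical hook that runs the model once and returns its final answer, or `None` if no answer could be parsed.

```python
from collections import Counter
from typing import Callable, Optional


def majority_vote(sample: Callable[[], Optional[str]], n: int = 16) -> Optional[str]:
    """Draw n answers from the model and return the most frequent one.

    Unparseable samples (None) are ignored; returns None if every sample failed.
    Ties are broken in favor of the answer that appeared first.
    """
    answers = [a for a in (sample() for _ in range(n)) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]
```

Each extra sample costs a full generation, so with chains of thought this long the compute bill grows quickly, which is part of why comparing against single-shot results from larger models is not like for like.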
So the big thing then is to see, okay, for things that don't need that, does it still use really long chains of thought? I see that even for the simple logic test, it's using a much larger number of tokens than the previous model, GLM 5.2, used for this. So that's something I noticed: because it's been trained to basically want to do long chains of thought, it really doesn't have the flexibility that the bigger models perhaps have of being able to work out that, oh okay, this problem doesn't need a huge amount. That could be because of a few different things. One, this is actually a much older model now; we're talking about the base model for pre-training being the Qwen 2.5 3B model. Two, it's obviously a 3B model, so it's perhaps not going to pick up the nuances and stuff like that. And I guess that's not surprising if we look at the domains where the long chain of thought does well. That's kind of to be expected, given the model has clearly been trained for coding, logic, math, that kind of thing.
If we give it something like the long essay task, it totally understands that I want something that's 5,000 words. And it will certainly come up with a good plan. It's used a lot of tokens for the thinking in here, but the actual length of the essay is often a lot shorter. Now, this one it's done pretty well. It's used 5,000 tokens in total. You can see it comes back saying that due to platform constraints, this response contains an overview of the 5,000 words. We've given it 32,000 tokens, so it should be able to generate that out. Not surprising though, right? This is something that up until recently most models would really kind of fail at: this idea of a self-understanding of how many tokens we've actually generated and whether we should still be generating.
It does get the concept, and I find in the thinking tokens it seems to do quite well. But even if I go back to the raw view from the markdown here, you'll see that often it's just doing kind of weird predictions on certain things. Occasionally it will go into Chinese, then it'll go back to English, and stuff like that. This is not a general model, and I guess they're not saying it is, right?
Another good example of that is the SVG test. If I ask it to draw the SVG for a pelican, the few times that I've tested it so far, it consumes a huge number of tokens for thinking and then draws a very small, not great SVG. And again, this is not necessarily a huge fault of the model. This is perhaps not what they're going for. They're certainly not claiming that this is better than Opus and so on for generalization. It's just for these very specific tasks. And so you see there, we've got 5,000 or 6,000 tokens on the thinking and then a very small SVG. If we look at the SVG in here, well, this time it's done really badly. Often it will get the bicycle wheels, it will get something kind of like that. Let me just run it again.
Okay, so running it again, again we've got a huge number of tokens in the thinking, a lot of effort in the thinking here trying to work out how it's supposed to do this. Clearly it hasn't been trained on anything like this, though. When we come to the preview, you can see we're at best getting two wheels. It's done better than this in some of them, but it shows that it just doesn't have that depth of knowledge. We kind of know that this model is not remembering knowledge, and that knowledge isn't always necessarily going to be facts and figures. In this case, it's just having some representation of what a pelican on a bicycle actually looks like.
All right, if we ask it to do HTML kinds of things, we've got a split here, in that the coding part, the understanding of coding, is very much in its domain, but the understanding of design and things like that is not. It's not really what it's been trained on. And I don't see this as a huge criticism of the model and the paper, because you could imagine that someone could take one of these models and make one that's just for coding up websites, etc. So, if we look at the website, it's got the elements in there. It's got an understanding of what a web page is. It's certainly gotten the HTML code right, but the design elements are just not here. And this is what I'm saying: you could imagine that someone makes something like this just for design, where they really focus on that, and it would become a really powerful, unique model for when you want an agent to go off and do design. So this is definitely shining at what it's good at, but the moment you cross a little bit over the boundary into what it's not good at, you see that you're running into problems.
If we give it some decent long context, so this is basically an article, and this is an article about 5 coder, and we've then asked it some questions. The questions are at the top: explain parametric compression. We've got some things about the models being accused of benchmark saying, and let's see what it actually comes back with. Now notice again it's huge amounts of thinking for this. If we gave the same task to GLM, and I'll do that just after this so we can check, we're going to see far less thinking before it actually gets to the answer, probably because it's got that deeper sense of knowledge. Now, it's not a fair comparison at all, right? We're talking about something that's like, what, 250 or 300 times bigger.
It's a huge difference in size. And so I certainly don't want to say that as a criticism of the model, but you can see that we're at this many thousand thinking tokens to actually get it to generate an output. Now, looking at it, it certainly understands some of the key things in here, and this answer is quite nice for what it is. So it's not that it can't do these things at all. It's just that it's not very good once you get out beyond what it's good at.
Just to show you that exact same thing with GLM 5.2, we can see we literally had like 15 tokens in thinking: "Let me answer each question carefully using only the article." And that shows, I guess, the confidence in this kind of task. It's definitely got a sense of understanding that's just far better than the VibeThinker model here.
So, this is certainly not a model that I'd use for production or for anything like this. It is a research project. That's to be expected. I certainly think that the ideas they've proposed could end up working out much better for a 9B model, or perhaps for doing more post-training on something like the Gemma 12B model, or even going up to the 30B models. You could imagine that we could get some really good models there, where we do have elements of the knowledge in there as well, but they've had this unique way of doing the RL to make them better at this kind of task. So overall I'd say it's definitely an interesting project. It's worth a look. It's not something that you're going to be using in production, and I don't think it's really meant for that. It's a very interesting piece of research that you could imagine being used in the future to make much better open models that we could end up actually using in production. So, anyway, check it out if you're interested in that. Thanks again to Dell for sponsoring this video.
As always, if you like the video, please click like and subscribe, and I will talk to you in the next video. Bye for now.