TLDR
OpenAI's Chief Research Officer Mark Chen discusses the enduring power of scaling laws, the importance of replication for developing research taste, and the company's principled focus on pre-training, RL, and alignment as a stable high-level roadmap. He addresses the evals crisis, the need for new benchmarks, and the shift toward vibe research where models handle more execution. The conversation also covers leadership via meritocracy, the jagged frontier of model capabilities, and the underrated value of connecting research primitives to real-world agentic use cases.
Key points
- Scaling laws continue to hold despite periodic narratives that pre-training is dead, with new engineering and research insights consistently breaking past bottlenecks.
- Developing research taste is best achieved through replication: fully reproducing a paper's training curves and reported results.
- OpenAI maintains a stable high-level research roadmap focused on pre-training, RL, and alignment, while implementation details, sequencing, and compute allocation are reassessed at set checkpoints such as compute planning.
- Reinforcement learning works best in domains with objective ground truth (math, coding) and faces headwinds in subjective fields like creative writing where grading is difficult.
- The field faces an evals crisis with too few canonical gold-standard benchmarks; partnering with external organizations and separating eval teams from model optimization teams helps avoid benchmark overfitting.
- Vibe research is emerging where researchers focus on idea generation and orchestration while models handle implementation execution, though models still lack human-level research taste.
- Taking high-risk bets is a key differentiator for OpenAI, and even failed bets produce valuable write-ups that prevent others from pursuing dead ends.
- Pre-training is underrated, and connecting research primitives to real-world agentic use cases is also underappreciated.
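As a rough illustration of the first key point: a scaling law is commonly modeled as a power law, loss = a * compute^(-b), which becomes a straight line in log-log space. The sketch below fits that line to synthetic data with the standard library only. The data points, the constants, and the names `fit_power_law` and `predict_loss` are illustrative assumptions, not anything described in the interview.

```python
import math

def fit_power_law(compute, loss):
    """Least-squares fit of loss = a * compute**(-b), done in log-log space."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # log(loss) = log(a) - b * log(compute)  =>  a = exp(intercept), b = -slope
    return math.exp(intercept), -slope

def predict_loss(a, b, compute):
    """Extrapolate the fitted law to a new compute budget."""
    return a * compute ** (-b)

# Synthetic (made-up) points generated from a known law with a=10, b=0.05.
compute = [10.0 ** k for k in range(1, 8)]
loss = [10.0 * c ** -0.05 for c in compute]

a, b = fit_power_law(compute, loss)
extrapolated = predict_loss(a, b, 1e10)  # three orders of magnitude past the data
```

Real training runs are noisy and may need an irreducible-loss term, so a fit like this is only a starting point for deciding whether a trend is still holding.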
Tools mentioned
Techniques
- Scaling laws
- Reinforcement learning (RL)
- Pre-training
- Post-training
- Benchmarking
- Benchmaxing
- Needle-in-a-haystack evaluation
- Context compaction
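To make the needle-in-a-haystack entry concrete, here is a minimal sketch of such an evaluation: a short "needle" fact is hidden at different depths in filler text and the model is asked to retrieve it. Everything here is an assumption for illustration. `toy_model` is a hypothetical stand-in that only reads the last 2000 characters of its prompt, mimicking a limited effective context; a real run would substitute a call to an actual model.

```python
from typing import Callable

FILLER = "The sky was clear and the market was quiet that day. "
NEEDLE = "The secret passcode is 7421. "
QUESTION = "What is the secret passcode?"

def build_prompt(total_sentences: int, depth: float) -> str:
    """Place the needle at a relative depth (0.0 = start, 1.0 = end)."""
    sentences = [FILLER] * total_sentences
    sentences.insert(round(depth * total_sentences), NEEDLE)
    return "".join(sentences) + QUESTION

def run_eval(model: Callable[[str], str], total_sentences: int, depths: list) -> dict:
    """Return, for each depth, whether the model's answer contains the passcode."""
    results = {}
    for depth in depths:
        answer = model(build_prompt(total_sentences, depth))
        results[depth] = "7421" in answer
    return results

def toy_model(prompt: str, window: int = 2000) -> str:
    """Hypothetical stand-in model: only 'reads' the last `window` characters."""
    visible = prompt[-window:]
    marker = "passcode is "
    idx = visible.find(marker)
    if idx == -1:
        return "I don't know."
    start = idx + len(marker)
    return visible[start:start + 4]

results = run_eval(toy_model, total_sentences=200, depths=[0.0, 0.5, 1.0])
```

With this toy model only the needle placed at the very end is retrieved, which is the kind of depth-dependent failure the evaluation is meant to surface.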
Takeaways
- Scaling laws are still reliable; don't bet against pre-training.
- Research taste is built through hands-on replication, not just intuition.
- Evals are in crisis; invest in new benchmarks and separate eval teams from model teams.
- The shift to vibe research means researchers will increasingly orchestrate while models execute.
Transcript (captions)
Yeah. Like that. I know. >> That feels like already the situation I'm in in real life. Cheers. >> Cheers. >> Hey guys, welcome to the Latent Space Cooking Series, where we invite founders and researchers and just let them cook. Today we have a very special guest, the chief research officer of OpenAI, Mark Chen. Welcome. >> Thanks for inviting me, Alan. >> Yeah, thank you for coming. I mean, to begin, this all started from the inspiration after hearing a story that Mark Zuckerberg would make soup to try to poach researchers, and in response you brought soup to researchers. Is this true? Did this happen? Did it work? >> Oh, you know, it's absolutely a true story. And I have brought soup to our own researchers. I think that made us calm down a little bit. I think we came out on top. But yeah, still a very funny story in the craziness of how AI's evolved. >> How often do you cook? Is it something you're familiar with? >> Well, you know, I do enjoy cooking, but I don't have the luxury of doing that so often. I usually have a work dinner every night of the week. And, you know, maybe post-AGI, this is going to be my hobby. I've always joked I'm going to start a noodle stand once it's all over. >> Yeah. You know, post-AGI, hopefully that'll still be there. Great. And I guess looking at what we have in front of us, do you have an idea generally of what we'll probably be making? >> Korean tofu soup, maybe. >> Yeah, that's generally what it is. So, we were inspired by the story of you bringing soup to researchers. We're making a Korean tofu stew, and then we have prawns that we'll be cooking. Are you ready to go? >> Yeah, let's do it. Let's do it. >> Great. Okay. So, the first thing we should probably do is separate the veggies, and then we can cut them. Basically what we want to do is just cut off the dirty part with the dirt and then separate that across. >> That I know. >> Yeah. Okay.
And then just have that there. So, we could do that. And while that's going, I guess I could ask more about your background. So in a previous life, you were once a trader and even Sam, I think last year in April also tweeted about how if you're a high frequency trader, you should consider joining OpenAI because, you know, build AGI. So do you think there's a relation between, you know, being a trader and being a researcher or do you think it's just like a very technical and competitive area where a lot of great employees can come from? really the most important thing is um there are a lot of researchers who just uh started out without a formal training in machine learning or AI research. Um >> we've very much believed in training people up to do this. I I think the real hard thing is the ability to creatively solve problems and think outside of the box and it's not so much you know you have to do a PhD even though that does bring a valuable skill set. Um, with trading in particular, I mean I I don't know that it's that special of a profession like um I kind of think of it as uh you know we've had great mathematicians join, we had great you know physicists join but trading is something where you know it's like >> it's very unhackable you know you're um >> how you can't kind of cheat the real world right like uh you know it's a it's a hard metric to optimize. Um, and there's also a lot of characteristics like uh >> it's it's a field where attention to detail really matters and um you know it's kind of the brutal hard optimization and squeezing out the juice of a system. Um and some of those >> uh those skills transfer over. >> Gotcha. Yeah. And I guess for people who want to get into research who let's say don't have a PhD, what do you think are the main attributes or things that they can learn to develop research taste? because I guess that's the main part of um getting into this field that may be very foreign to them. 
>> Yeah, I mean, I think it's a little bit overrated. It is something you have to develop, but the best mechanism I've found for developing that is really just replication. So I think you should take papers that you really look up to and just try to fully replicate them. I think a lot of replications stood out in my mind. You know, back in 2018 there was ResNet, there were PixelCNNs, and I learned so much just trying to replicate the training curves exactly, to get to the exact training loss or perplexity that the papers hinted towards. It teaches you a lot of techniques that people don't really talk about, but once you dive in a couple of layers deeper, you learn those techniques. And yeah, I think really the first thing that got me into the field was when AlphaGo played Lee Sedol. I think that was the turning point for so many people. It was inspirational, and the first big project that I really went after was: can I get a DQN working? >> Yeah. >> Yeah. Yeah, that's true. I think it was move 37, in one of the games. It was pretty insane watching it happen and seeing all that develop, and also seeing where we have gotten to today, especially with research. I mean, isn't it crazy that you're seeing move 37s in almost every field now? There are move 37s in math, in computer science, and in coding. >> I think even, yeah, it feels like a lot of people woke up at the start of this year and were like, man, agents are working in my profession, and they're essentially realizing that these models can just do long-horizon, meaningful work for them. >> Yeah. No, that's true. It is very impressive to see, even just using it in my own work. But okay, the next thing we could do is as simple as cutting the onion. So what we have to do here is just dice it.
Do you think um there's jobs that RL maybe will have like a much harder time to kind of break into? So for example coding maybe easier since a lot of the context is accessible whether it be the code bases or even the work you're trying to do. But let's say if you're trying to do the job that a junior consultant may do where all the context is a little scattered maybe a little more difficult. How do you view through like those different scenarios? Is there a way that you kind of assess what can be the right approach? >> Yeah, I mean I I think it's RL's traditionally had um headwinds when it's come to fields that you know it's more um kind of subjective than objective. So if you kind of think of like you know one one kind of >> you know uh example of this is creative writing where you know you can take two pieces of creative writing and two experts can have wildly different opinions. So it's these fields where things are hard to grade >> um where >> you know RL has the least amount of ability to kind of go and um and directly apply there. Yeah, >> I know a lot of people are developing techniques to apply RL in these um these settings, but um for now it's just where there's cold hard truth things like math and computer science where you implement it correctly or wrong. >> Um that's where you kind of see it really taking off. >> Yeah. No, that actually brings up a thought on in terms of evaluating those fields. So um you know as models get much much more powerful and they even saturate for example solving like the IMO questions. Yeah. Yeah. >> Um, how do you view evaluating like superhuman intelligence? Like get to a point where it's so good at things that even the top what 0.1% of humans can do that like you know how can we push past that frontier of intelligence? 
No, it's kind it's kind of crazy and I feel like um a lot of it centers in on on kind of interfacing with the real world and um when when we've thought about how to evolve past things like programming context in the past um >> I think a lot of the initial direction we took was you should move it to real world research right and we've seen that the models uh they've gotten a lot better at uh just kind of discovering novel theorems and uh pushing the frontiers of of hard sciences but even today right that's no longer a surprise I think like we we almost take it for granted now that um these these models can solve very very difficult problems. They can make contributions and even kind of draw relationships between um fields that you know um that are that are novel and insightful. Yeah. So I think um you know we we think of coding co-working um as really a a domain for that that tests if our models can learn in high context settings and in real world long horizon settings. >> Gotcha. Okay. Yeah, that makes sense. And since you're done with all the vegetables, we can now do the next step which is sautéing it. So cool. >> Yeah, we can use the imple stove which we've seen before and very powerful. Let me turn it on. Um, and yeah, so we'll just sauté it. Great. >> With some oil. So I put the pan in the front burner. And then simple stoves. >> Yeah, you could use oil to pour some in. >> And then we can also >> Yeah, just a good doll. Perfect. >> And yeah, and then we could turn on the stove. So just press it. And then um yes, spin the knob. >> Great. Perfect. And then while that heats up, we can just wait and then add the vegetables. But yeah, I guess more so on views for research are there, I guess, you know, commonly accepted ideas that are, you know, you disagree with, whether it be like pre-training is dead or language models will never get us to AGI. I think there's a lot of takes out there that are very ambiguous and obviously haven't been proven out yet. 
And I guess from your perspective, as the one leading research at OpenAI, like, I think those... >> I mean, I firmly believe in being on the exponential and in scaling laws. So I fairly strongly disagree with any of these bear takes. You know, when it comes to "pre-training is dead", I think the funny thing is this narrative only started spreading more widely in, let's say, the last one or two years or so. But many times in the history of developing LLMs, people have been saying this, right? There have always been some bottlenecks where people say, well, you can't scale past this because of this bottleneck. And we've always found some kind of technique, whether it be better engineering or some new research insight, that helps you break past the boundary. So I think it's just more and more of the same, right? More careful research engineering, more careful data engineering, more careful scaling, and it always unlocks that next ability to scale further. I mean, it's held for almost 10 orders of magnitude, and there's no reason it should not keep holding. >> Yeah, that's a very fair point. And I guess on research bets that have helped you scale beyond: were there specific ideas that you can remember in the early days where everyone was saying that this is not going to work? >> Well, yeah, I mean, I think of reasoning as one of the biggest examples of this. The first breakthrough that we launched to the world here was o1, but it wasn't easy to get that off the ground, because the world we were living in back then was one where pre-training plus post-training felt like such a promising paradigm. Yeah.
And so even at a company like OpenAI, you would have people naturally ask, why do something else when you have a machine that works? And fundamentally, it's to the credit of Jakub, Ilya, and many of the people who really had conviction and vision in this space that we started pushing on this in earnest. Even then, it took a lot of steering to get the whole company behind this as a fundamental bet. >> Gotcha. Yeah. And how do you develop that ability to motivate researchers? I assume that's a big part of taking a lot of bets, where some will pan out, but still building the trust in the team to know that eventually some of these will actually have power-law effects. >> You know, what's really cool about OpenAI is that research feels like a meritocracy. Oftentimes the research managers are the people who have done the best research in the past. And so I think a lot of steering can come top-down, right? If your manager says, "Hey, I'm really convinced this is the path forward," generally people will take that into heavy consideration. This person who you've respected for their research taste and execution for so long is now very excited by this idea. It's definitely something that people take into account. So I think there's good top-down steering. At the same time, one really cool thing about OpenAI is that there are bottom-up elements. We like to be convinced that we're wrong, right? Someone can just come with cold, hard evidence, and many things like that have turned into core parts of our research roadmap: things that no one was really trying to steer, but that some researcher on the ground had a heavy conviction in. And that's also a really big delight to see. >> Yeah. No, absolutely.
I heard in a recent interview that you gave that your internal research roadmap hasn't really changed um even through all that we've seen you know with model development and even other companies I guess how often do you guys assess that reassess that even like act proactively I assume it's not a lot of reactive you know decision-m as like other models come out but how do you like think through that process especially as everything around you just continues to get better >> yeah so I think the thing is um the high level research road map should be stable Right. I think people need something to ground in. People need to see a path to what we're building. Um, and I've been very happy that we've stayed the course for for a while, but the implementation details can change over time, right? And I think um it's important to kind of like the sequencing will matter, right? The relative resourcing will matter and the the kind of exact threats on the ground will matter. So what we do is um I think we have >> kind of points in time that force us to reconsider these things. So uh one example is when we do compute. >> Yeah. Um, one of the parts of the job is just figuring out how to allocate compute to projects. >> Okay. >> And um, it's it's a time to kind of question like are we really putting compute to use and people to use at the highest priority events? >> Yeah. >> And I guess could you clarify more what you mean by oil? >> Oh yeah. Yeah. I was like coffee adding oil. But clarify what you mean by the higher level versus like the more implementation details like high level as general as like AGI that's like our northstar or is it more like granular than that? >> Um well yeah I mean at the very highest level right we have an or that focuses on pre-training right which is you know giving models a lot of world knowledge. We focus on RL like teaching the models how to reason with that knowledge how to chain the little insights together. Yeah. 
And then finally alignment and post-training, right? And we're always looking at both how to scale the mainline in each of these domains and also new bets that fundamentally unlock either different scaling properties or more aggressive scaling properties. >> Gotcha. Mhm. And so even in that, I heard that every one to two months you go through, what, like 300 research projects that could be followed through on. Is there a way that you hone that decision-making, as you decide what to actually double down on and what not to? I assume there are a lot of talented researchers who provide possible ideas to pursue. >> Yeah. So I think really in the spirit of focus, one narrative you might have heard is that we're really focusing our bets at OpenAI, and we're also trying to do a little bit more directive compute allocation as well. I don't like micromanaging my managers. I think one important thing is to empower them, but to just give big swaths of compute to the big bets you want to make. >> That's what's meant by directive? >> Yeah, yeah. But then also give them flexible pools of compute, which they can freely allocate to things that they believe in, or just fudge with the allocations that we prescribe. So I think it's just tying a small number of bets, say three to five bets from each org, into the main research roadmap, and then really letting the managers and org leads take things from there. >> Gotcha. Okay, that makes sense.
And I guess for rising researchers so um let's say in an interview setting are there specific tells or ways that you can identify okay this person has some you know potential of becoming a researcher to impact an org in a specific way or is it like just looking at the previous research that they've done and then that is what heavily dictates whether they can actually continue on. >> Um it's a hard problem before someone comes to open AAI. I think that's that's genuinely true. Um I I think um for a lot of the best research managers, you know, they work with so many researchers over time um where you kind of develop an intuition like you the things that they say, the ideas that they bring up. >> Um some >> are are those kind of do they hit hit the same mark or like are they the things that you would be thinking about personally too? And so there's this gut check of like you know does their intuition match um the same intuition that you have. >> Yeah. Um, but it is really hard to tell out out of the gate. Usually in, you know, let's say 6 months to a year, it's >> pretty clear who who's, you know, has the strongest trajectory and who's going to make a lot of impact. Um so yeah honestly um I I I think it's a hard problem but just having seen a lot of people you know go through research development um at OpenAI you develop an intuition for um you know who's more PP in different areas. >> Yeah. >> And one thing to kind of like just mention there is not every researcher is the same. I think there's a lot of different types of impact. >> Yeah. >> They're the people who just take an idea it's very clear and they'll just implement it before anyone else. M >> they're also the people who just come up with the kind of like crazy >> almost too crazy but >> moonshot type. >> Yeah. But somehow not that crazy and and and they they really convince you in a in a different way of seeing the world or or or another completely different type of project. 
So there's a lot of ways to make impact. >> Yeah. No, that's helpful. And so I guess elaborating on that, would you say there's similarities that you would see between let's say like top engineers um and top researchers? Like I often hear top engineers even at like small companies and startups are ones who can take an iOS um through like the product or do you think it's more so they're focusing solely on the research not considering like the end design how it's used by the customer? Well, I yeah, I mean I guess the thing about research is many times the path forward is unclear and so what differentiates researchers is >> how often they're pointed in the right direction. How like like you say taste, right? I think in engineering there are certain patterns that work like you know if you want to build a product that looks this way um the engineering principles can be pretty similar. Yeah. Um, but for research, I think the thing that's slightly different is just this ability to, >> you know, have good research taste to convince other people that, um, what you're doing is promising. >> Um, >> and then, yeah, um, again to just kind of integrate it into the core research room now. >> Gotcha. >> Yeah. >> Great. Okay, it seems like we're done with the vegetables. So, now we have to multitask. So, we're going to pour some water into our pots to get the base of the soup going. So, in the top right. Yeah. Here. And then just twisting this off. So let me pour some here. Um you can use some as well. And while we have this simmer and we'll add the veg, we'll cook our prawns here. Okay. >> Um yeah. So let me clean this up real quick. >> Looks looking great so far. I feel like sauté got some color on the onions and mushrooms. So let's turn this on. >> I guess one aspect or area that seems very interesting are evals. 
Um, and more specifically, have there been instances where you've seen, just through vibe checks, that a model is really good, but on the actual benchmarks it performs very poorly? Or do you think it's heavily correlated, so that if your SWE-Bench Pro score is a high number, then your vibe check on it doing coding tasks is also really high? >> No, no. I mean, I think there is this phenomenon. I think internally, I'm not sure if this is an externally used word, but yeah, just benchmaxing. I think you can overfit onto certain distributions, and it won't be reflective of how well you generalize, right? Because, I mean, an easy way to do this is you take a benchmark, you find very, very similar types of instances to the benchmark, and you overtrain on those instances. So I think beyond that, the other scary thing in the field is that the number of canonical gold-standard benchmarks is low. We really are in an evals crisis, right? All the really great evals that we all know, like, growing up, taking the SAT, those are all fully saturated, and we really need to find good new ways to benchmark the models. I think one great thing about tools like Codex is they've really enabled fast iteration on evals. We're able to have one person very quickly put together a very high-quality eval. Another interesting thing about being able to deploy your models is that you can see them evaluated as people are doing things with them, right? One of the great things is, in math, in coding, and in software, you get a sense for where they fall over and what task horizon they can handle from this very broad-based deployment. >> Yeah. No, that's helpful. Now we'll just add the prawns to the oil and get some color on them.
>> Um, yeah. And so I guess double-clicking on that, how do you balance doing well on these benchmarks but also not benchmaxing, as you said? I assume you want to be honest and not cheat the system. But if you have a lower score than a competitor or other models, the consumer may think, wait, your scores aren't that great, so the model probably isn't that good. How do you balance both of those? >> Yeah, I mean, I think the thing is you just really have to operate over representative mixtures of evals and always invest in creating new evals. And really, there's this philosophy that once an eval is out in the world, it's already not a good eval. And I think one other thing is partnering with external organizations to create evals. So in many of the hard math and science evals, we've partnered with external organizations, and they've been able to craft gold standards there for us. I think there's also an interesting philosophy of separating the teams that are creating the evals from the teams that are optimizing the models themselves, because that way you don't co-incentivize them, right? The way the evals team can work is they're trying to build evals that are hard for the model. So there's this inherently adversarial process where you're not cheating yourself, right? The incentives are aligned in the right way between the two teams. >> Yeah. And do you also contribute and help in the ideation process, or even in deciding which evals you should work with a third party to develop? >> Yeah. So I mean, I think a lot of the work that Jakub and I do also involves steering the direction the evals go.
I think we'll notice certain gaps, right, or certain capabilities that we want, and every capability, on the flip side, is an eval, right? You need some kind of eval that measures if you've elicited that capability well. So yeah, it takes a lot of steering, and just getting everyone on the same page with evals is also a lot of prep work. >> Yeah. No, that's it. I guess on Jakub, you said in a previous interview that he's a very funny guy. >> Yeah. >> Do you have any fun stories that maybe you haven't shared about working with him? You also say that you guys align very well, so your discussions, even on research, are very efficient and help a lot when driving towards the frontier. I guess on the opposite side of being very funny, are there things that you share? >> You asked about a funny story. Well, he told me this joke yesterday which I thought was very funny. I mean, in many ways we jointly manage the research efforts, and apparently some researcher came up to him and was like, you know, it feels like I now just have an army of really dumb IMO gold medalists. >> Yeah. >> And Jakub's like, "That feels like already the situation I'm in in real life." So, yeah. No, he's just brutally sarcastic and funny. >> No, that's great. It's great to have humor in the workplace, you know, to balance things out, especially as you're pushing the frontier on very important work. But that also brings to mind one kind of weird scenario, where models can perform very well on, let's say, the IMO or even the IOI but may struggle with some more mundane task that a human can easily do. So I guess how do you deal with that? >> Yeah. I mean, ultimately I think what's intuitive for the models is often not that intuitive for humans.
Like, there's a lot made of this jagged frontier analogy, where there are some things that the model is just inherently good at, based maybe on the data it sees or the things that we can teach it more easily. I actually think a lot of it also boils down to context, right? The models don't have a lot of the context that a human has. Vision, of course, is something that's more naturally, biologically wired for humans. And so yeah, I think there are just certain jagged capabilities that models are better at than humans, and vice versa. But I also think context, just being able to take a single task, learn lessons from it, and apply them to future tasks, that capability is something that a lot of people are working towards right now, but is very natural for humans. >> Yeah. Mhm. And on the context point, a very low-hanging-fruit example that many people give is just to increase the context window to provide more examples so the model can perform. But I assume there's more complexity in how to actually enable that. Even with a large context window and a lot of context, there could be bloat or even just a lot of context rot, as people have said. So how do you go through that process of navigating it? >> I think there's the canonical way you would solve for very long-horizon learning, which is, you know, you just naively increase your context window, right? And I mean, that makes sense. I think there's a difference between implementing long context and implementing long context well, like you said. And there are a lot of needle-in-the-haystack style evals to measure that. But I do think beyond that, there are also a lot of, in some sense, engineering and research shortcuts that you could take.
Um so you know like uh many many coding products today have features like compaction right where um you can compress kind of uh either insights or um working state and >> um stuff like that you know it just shortcuts a lot of the >> the very brutally difficult and expensive um primitives that you have to build with just native long. >> Gotcha. Great. Okay, now we're going to do the fun part. So, um let's lower the heat a little bit and then add a little bit more oil to the pan. Um and then we'll torch the the shrimp the ponds to get a little more flavor in there. >> Yep. >> So, I'll first do it on on my pan to show you what it looks like. >> But, >> one shot learning. >> Yeah. >> Yeah, indeed. Wait, I didn't pour any bourbon. >> Okay. Wait. So, let's pour like a fourth. Okay. And then pour it in. Heat is off. And then torch it. >> Awesome. >> Great. >> All right. Okay. I think I got this. >> Okay. Yeah. So do don't do it with your Yeah. >> There you go. >> So pour it to like half of the fourth cup. >> Okay. >> And then once you have that, I can give this to you. >> Perfect. >> Great. So it's off and then once that's off, we can turn it back on. Perfect. Okay. And then now do you want to hold on this and just press this button >> to uh fire it up? Yeah. Cool. Yeah. Perfect. Great. Okay. Bomb bang it. It's a little light, but yeah. Great. And then we could turn on the heat again. >> Great. >> And then we'll just cook off the alcohol. >> Okay. Great. Great. >> Great. How how you feeling? You know, great. Great. >> Basically there cooking everything out. >> Okay. Awesome. Yeah. And I guess in terms of research ideas and what to work towards, do you think there's still a lot of lowhanging fruit or ideas that can still be improved a lot through just optimizing small parts of already implemented work or do you think right now there has to be a lot of research that are completely new bets that people take? Um, yeah, that's a really great question. 
I feel like there are new bets, but probably not that many. In some sense, >> hopefully you feel like AGI is coming soon, right? And I think everyone sees that these models are getting really capable. And >> I think if you really imagine the implications of that, we're getting closer and closer to a world where the models can come up with more of the innovations on their own. Yeah. They can kind of do self-sustained research. This is one of the big goals that we've set for our research work. And so I think >> what really matters is: are there big bets before that point in time? And I think the window is small, but there are still some fairly significant ideas we're trying out. >> Yeah. I mean, there have been some researchers who have stated that to get to AGI, we still need, let's say, two or three more breakthroughs, which could be continual learning or some other ideas. Do you follow that same perspective, or do you think it's not as drastic as coming up with three completely different paradigms? >> Yeah, I mean, I don't know. I don't know about that same framing, where continual learning is a basic primitive that you have to unlock. >> There are so many different techniques. I think we're trying a lot of implementations of it. I don't know what you would consider a breakthrough versus not, but I think there are clearly many shots on goal, and I'm pretty sure it'll work. >> Great. Okay. >> Great. Okay. So the shrimp is basically done. >> Awesome. >> Okay. So do you want to do the flambé thing again to get more color? >> Let's do it. Yeah. I'll turn on the heat a little bit and then let me get some more oil. >> Great. >> Yeah, because we want to get some dark color. I think mine has in here. >> Great. >> And then we can hopefully get another shot. >> Okay. >> Yeah. Do you want to go first?
>> Um >> same amount as >> Yeah, same amount. We could hopefully get... because I think the heat wasn't as... Yeah. Okay. Can you put it in? Yeah. Advance. There you go. Just touch the button. >> Great. Wow. There it is. >> It's good stretch. >> Yeah, indeed. And it adds some good flavor here. Let me try it over here. >> Yeah. Let's see. Great. All right. Let's see. Let's go. There it is. All right. But we're in the final stretch. We have our shrimp all cooked. Yep. Some fire. So, now we can kind of >> Yeah, cook it off a little bit, and then we should add our veg to the water. >> I'm impressed by your multitasking abilities, you know. I think that's actually one thing we need our models to get better at. Like, it should just be able to do some thread like this and also just have a conversation with people. >> Yeah. No, that also reminds me: do you think images and audio and video and even text should all be under one model, or do you think it'll break out into specific, specialized models, like an audio model, or >> Well, yeah. I mean, I think for a research lab, there are a lot of advantages to it being under one. So, you just have to maintain one infrastructure stack, for instance. The cost of maintaining and scaling many infrastructure stacks at once, I think that's something that >> you shouldn't underestimate. So I think there are a lot of benefits: you do some core research in >> your fundamental stack, and that just carries over to whatever modality or whatever thing you want. So >> I think there's a strong bias for us to keep it >> in as few different architectures as possible. >> Gotcha. >> Yeah. Great. You know, that makes a lot of sense. I think the architecture as well is something that isn't often considered but is very important. >> But one other term that I've been seeing a lot, that you've also kind of mentioned, is a vibe researcher.
You know, we have vibe coders, obviously. >> Yeah. >> But I guess on vibe researching, what do you think is the end state? Do you think the main value of a vibe researcher is just the research taste of coming up with the right idea, or do you think it's more so the execution, going through and following through on the actual research? Yeah, so I think we're actually moving towards this world very quickly, right? I think both at OpenAI and at other labs, you're starting to see a lot of the work become mostly orchestration-focused, right? Like, the researcher is coming up with the ideas, and the model's great enough to do the implementation and execution by itself. So I think >> when it comes down to the value of coming up with ideas versus execution, yeah, both are still important, but it does feel like there's this marked shift towards just being able to come up with a lot of ideas, and then >> the model can actually do the execution and orchestration for you. So I think it's very much going to be the future of doing research. >> Yeah. >> Um >> we also said earlier, you know, the models don't quite have the taste yet, and >> um >> that's why you still need the researchers coming up with the ideas. Yeah, it's going to be hard to teach the models good taste. We noticed that. But in terms of actually accelerating the research, there are clear, tangible benefits already. >> Yeah. Do you think there'll ever be parity in terms of research taste with models? >> I think so. I mean, when we look at our three-year roadmap, right, the end goal that we want to reach is one where >> the models are just doing end-to-end research, and >> I think a part of that problem is just being able to have the model come up with good taste. You point at some, you know, just generic benchmark or something, and it finds the right solutions. >> Yeah. Yeah.
No, that's helpful. >> And in terms of research done by humans at OpenAI, >> how do you guys go about, I guess, the postmortem process of, let's say, a research bet that didn't turn out well? >> I assume that a lot of it is taking these, you know, best bets, and some don't turn out well. Well, I would say that is a big part of OpenAI's alpha, because I think >> one thing that differentiates us from other labs is we take a lot of high-risk bets. I think it's what's allowed us to stay at the frontier so consistently over time. But it also means that some of the bets are not going to pan out, and >> a hard corollary of that is, when a bet doesn't pan out, you have to not delude yourself into thinking that this is something that will work, and kind of disconnect from it. So I think there are certain calls you have to make, right? Like, look back and say, well, this was a promising idea at the time, but actually it's less important than we thought; there's some other approach that works better, or >> you know, um >> there's something else that we discovered. But I think >> much of that work is also very fruitful. So what we realized is that even sometimes when people fail at proving out a technique, their write-ups are very important, because it will often be a natural idea, and you can save a lot of people from going through the same thing. >> Yeah. No, that's helpful. >> Yeah. >> So I guess when it comes to this positive view on failure, how do you balance that with, you know, a researcher, let's say, who takes a lot of consecutive bets and none of them pan out? Because I assume at a certain point you'd want a researcher to eventually have contributions that are actually beneficial, compared to only taking bets that maybe pan out to being not a good space.
>> Just through experience, I've definitely seen some people fall into this. >> But I've also had several cases where, you know, it's just bet after bet that doesn't pan out, and just when you're at the brink of frustration, you have something that's a mega hit. And this has happened enough. So it really just depends on: are the ideas themselves sound? They can be ambitious, but they still have to be sound. And there's a certain kind of person who will just take a lot of those ideas, and it's okay because they're somewhat on the riskier frontier, >> but they only have to justify it once in a while for it to make sense. Maybe a very trading >> kind of lens on the world, but yeah, it's just on expectation, right? Like, they need to add value. >> Yeah. No, that's great. Okay, so we're basically assembled. Now it's the finishing touches. So you can taste your soup, and then we just add soy sauce if it's not salty enough. >> And if it's too salty, we can add some water. We can lower this down. >> But let's see our final creation. How is it? >> It's pretty good. >> Good. >> Yeah. >> Healthy. Okay. Mine's good. >> Yeah. >> Mine needs a little bit of water. Could you pass water? >> Absolutely. >> Looks great. So how was that? How did that feel? Um, this is a student distillation. You're clearly better than I am at this point. >> No, no, no. I feel like you did a very great job, especially even with the shrimp and flaming. >> Yeah. Smells great. Wow. Okay, that sounds good. I guess just more generally, I'm kind of curious: are there areas in research or topics that you think are right now overrated and underrated? Like, what would you categorize under each? Well, I think if you still have a pre-training-is-dead view of the world, >> I think pre-training is definitely... Yeah. Yeah. Yeah. Not dead. It's underrated. >> Yeah.
And honestly, I think products, and thinking about end uses and how you tie all the primitives you build in research to real agentic use cases in the world, that's also underrated. I think you really can't just build everything in a vacuum and not connect things to utility. >> Yeah. No, that's great. Great. Awesome. I think we are ready to taste. So, do we want to give it a go? >> Let's do it. >> We can move this. And I think there should be some plating. Yeah, I do want to take those plates over here. >> Okay, shrimp looks great. But everything else here. Great. And then we could just use these pots. >> So, we have our shrimp. >> Yep. >> You want to try it? Cheers. Cheers. >> Cheers. Let's see. It may be a little sweet. >> That's a little too sweet because we flambéed way too much. >> It's good for me, I would say. Cheers. >> That's good. Okay, great. Um, Sean, do you want to come and taste our tofu soup? >> Smells so good. By the way, you guys can't smell it, but um >> we'll pretend you're a researcher that's trying to approach >> and I'm Zuck, and he's, you know, trying to get you. So, do you want to grab the spoon over here? >> Soup. Soup is really going to sway a decision here. >> Quality of soup. >> Yeah. >> All right. >> Try both. >> All right. What was the artistic direction here? >> Artistic direction? Well, it was mimicry, I think. There's great art in mimicry. >> Yeah. Just letting them cook. >> Good. >> Wow. Yeah. >> Strong. >> I mean, I think one thing I definitely get is the savory and spice that go together, but then also the sort of seafoodness >> kind of really goes into it. >> Great. >> Okay. It's fine. Mine is mine. >> Am I supposed to pick a winner or what? >> No, no, no. You're supposed to try it and >> No, you do pick a winner. Yes. >> Okay. This is an eval. >> Yes, >> switchbench. >> External evals. >> Okay. I got to say, >> I feel like there's too much water in this.
I think, like, the >> I think it's also the pot, because this is a big pot. >> Okay. I would say I have to go for this. Just, like, you're our respected guest, but I want to be objective. >> Of course. Of course. Yes. Yes. And, like, yeah, the density, I think, really, the flavor, really. >> I'll do half the water. Half the water, >> probably. >> Okay. Makes solid sense. >> I mean, I think it's very personal, right? Like >> Yeah. I think it's also very personal taste. You know, even when you do a lot of cooking, taste. >> No, no, no, no. >> Okay. I know a couple of recipes. I follow them to the T. If you tell me, oh, cook something slightly different, I'm completely lost. >> Right. Right. >> Yeah. >> Mhm. Yeah. Yeah. Oh, I'm not going to lie. I kind of looked up in chat a couple of things beforehand, just as prep, but >> no worries. But yeah, it was great having you. I feel like you're always leading the field with a lot of research taste as well, and it's great seeing, yeah, the work. So hopefully it was fun. >> A lot of fun. Yeah.