The Agent Cloud: Databricks’ Bet on the Future of AI — Matei Zaharia and Reynold Xin

One of the fuses we have is actually once you can get the data in the right place, the AI models are becoming pretty good. The generic agents are fairly [music] I mean Ali talked about AGI already here. They have pretty good reasoning capabilities. Actually, I think many of the traditional software will be sort of rewritten with this new paradigm which is just get the data to be there and then let's slap some AGI on top. Magic will come out. >> Yeah. >> Um but without the right data, you can't really do that. >> Before we get into today's episode, I just have a small message for listeners. Thank you. We would not be able to bring you the AI engineering science and entertainment contents that you so clearly want if you didn't choose to also click in and tune into our content. We've been approached by sponsors on an almost daily basis, but fortunately enough of you actually subscribe to us to keep all this sustainable without ads and we want to keep it that way. But I just have one favor to ask all of you. The single most powerful completely free thing you can do is to click that subscribe button. It's the only thing I'll ever ask of you and it means absolutely everything to me and my team that works so hard to bring the In Space to you each and every week. If you do it, I promise you we'll never stop working to make this show even better. Now let's get into it. Matei Zaharia from Databricks. Welcome to In Space. >> Thanks for having us. >> Yeah, thanks so much. >> Uh thanks for taking time out. You you have your Databricks Data AI Summit going on. You were just telling me how the first summit that you guys ran was just 50 people. >> Mhm. >> Yeah, it was a little meet-up at Berkeley, I think. We put together a Yeah. >> We did these tutorials and yeah, just teach people Spark. >> Yeah. You know, obviously now it's like I think I had like a the headline number is like 100,000 people around the world, 30,000 in person. Uh it's a crazy community. Well, I mean I just saw the keynote. >> [laughter] >> Ali is just Did you know that was it obvious that back when that Ali would be like such a great like CEO? Like he's a great presenter. >> What do you think? >> Uh I mean, I think among our group of founders, it was clear that I think he'd be the best at this. And and yeah, it turned out great. And he's I mean, he's ramped up on so many topics going a company. He would just go in and like study at and, you know, become talked all the experts. Like, even if you can't hire the person, you know, learn enough about like finance and sales and whatever it was. Um, and you know, I'd go from there. Yeah. >> I mean, he's obviously very high IQ and have very high EQ, but it wasn't like Ali today is quite different from Ali from like 10 years [laughter] ago. I think he there's a lot of work that he put in to get to this point. >> Yeah. I mean, no. I mean, to me the the most appealing thing about him is that he's funny. And like it You know, it's >> It's true. Yeah. >> It's hard to make jokes about, you know, data about serious topics and security and what have you. >> Oh, yeah. That's for sure. >> Yeah. So, you you guys launched a whole bunch of things. I'll just go to name check briefly the stuff because we were not going to cover everything. Omnigents, your your baby. El Tap, your baby. Your Dream Engine. Uh, we're also going to cover Genie, cover Customer League. You acquired Panther, Open Sharing, and there's Unity AI Gateway. A lot of these I think like are things that you would expect a Databricks to do. It's like part of the the road map. Everyone in your category has has similar things. But I think probably the two of you are leading the two most unique and differentiated initiatives in the landscape. Maybe we'll start with uh with Omnigents and then we'll we'll go into it. >> I do think that a lot of people are exploring this sort of meta harness concept. What led you to it? >> Yeah. There were actually a couple of like converging lines, which is I think is a good sign that you need something new. So, on the one hand, there's all the coding agent in for internally. We have really great dev infra team. They built something called Isaac that's basically like a wrapper on cloud code and and code X and lets you use them either on the web in like sandboxes or just on your dev machine or on your laptop or whatever. And then you know, they were adding all kinds of stuff there and we saw all the the sort of more advanced engineers like were building their own workflows with tons of agents and they were building their own UI's and stuff on top of even on top of that. And then the other one was that like us building agents, we shipped this like data science agent called Genie on the research team which I co-lead basically. We also build a lot of internal ones for various things and then we have all the customer ones and all of them were running into this thing of like, oh I need to switch model and harness and so on you know, every few months. Plus the agent is like completely useless if you can't share sessions with someone and have history and have search and all this like layer on top of it for collaboration. I thought a bit about it from both contexts and at first people thought it was weird and like why are you doing coding agents and custom agents in the same thing but I said it's it's it's basically the same problems and you you just want to build the stuff that lets you deliver the agent maybe control it if you care about security and make it portable across things. And then we prototyped some things as experiments. We said, yeah, actually we can make it work and then we you know, we sort of built it for real. >> I'm wondering if this kind of let's call it architecture >> Yeah. >> maps to anything in your careers in the past. You know, like I always think about how a lot of things actually just tie back to operating systems. >> Mhm. >> [laughter] >> A lot of operating systems tie back to databases so or the other way around. >> So the thing I do think it ties a lot to like network protocols, you know, internet protocol. We also did Yeah, we did stuff with like data sharing also which is probably most viewers probably won't know unless they >> Yeah, open protocol is the term for sharing. Open sharing. >> Open sharing. Yeah, so it's like you have a company, you maintain some kind of table like like let's say like Walmart or something. They have like the you know, inventory and what's been sold in each store. And then you also have suppliers and they would love to produce more things and ship them like exactly the moment you need them. So they would love like real-time access to your table. So instead of like sending emails around or Excel sheets or phone calls, why can't you share like a view of that table in real-time with them? Then they query, they you know, join it with their data and they decide what to send. So it's it's one of these things where you you like that you you might ask like today since we can vibe code anything so fast, why do we even need to design like protocols or APIs or software? Why can't you just vibe code things on demand? But actually for this type of interoperability where multiple parties that are moving at different speeds are building stuff and you still want some layer on top to coordinate, you do want to design it and build it. So it reminds me of that like agents talking to each other and users talking to agents and and tools. >> Do we know of any other comments or alternative viewpoints? >> I think by the way, we had a debate on exactly we said the benefits with Matter a lot and I I think around the time we decided to do this thing, I was telling Matei, "Hey, it just happened to be there's a particular week that I was coding non-stop from the moment I woke up to like the moment I went to bed. I was like looking at my cloud sessions, my Codex sessions and and one of the things particularly annoying was having to keep my laptop open." >> Mhm. >> I was actually driving to a doctor's appointment and I remember because I want to make sure the whole thing continues working. By the way, it's so comforting to hear you say that because I'm like I don't know if I'm a clown and I'm doing this or like Yeah, yeah, like honestly, I I was driving and I was tethering my laptop to my phone. Keeping on the side, whenever I hit a red light, I started looking at what's going on on my laptop. And I just felt that was ridiculous. >> Yeah. >> It felt like we went back to the dark ages of programming. >> I mean, the productivity you gain from all this coding age is amazing, but uh Like, have you heard of cloud? >> [laughter] >> It was crazy to me. >> Was it the thing we were working on was the sandboxes or was this before that? >> It was a sandbox. >> Okay. So, you you were >> And so, I was approaching from very different angle. I wanted to like, "Hey, we're going to have cloud sandboxes that actually doesn't shut down. You can get one very quickly, but not just for running agentic sessions. It's actually also for running development." So, I was actually personally building that that week and through building that, I ran into all these issues. And then, I wrote actually a document for my case that, "Here's my wish list of what the actual environment should do." And I think he actually ended up almost implementing every single one of them. >> Yeah, I remember Reynold saying cuz my first prototype of this had just chats with your agent and he said, "I have to be able to open a a shell like my own shell and like list files and like tail them and stuff." So, I was >> Is this an SSH into a mainframe? >> Yeah, actually it has [laughter] that. >> Telling my logs. >> Yeah. Yeah. >> And also, another thing I think I asked was uh I I I had I still use cursor for the sole purpose of rendering markdown files. >> Uh-huh. Yes. >> give me a way to see my markdown files and render them properly. I don't need a separate tool anymore." Yeah. I think You also built that in. >> we did that. Yeah. Yeah, we had a lot of engineers building, you know, their own vibe coding setup. But then, the other thing they all said is like, "Hey, I built something that's amazing for me, but like no one else on the team can use that cuz I I don't have a server to collaborate." And And this is This is why we tried to set up Omnigen so you can have a server and have the security uh set up in there. So, you know, like log in with Google or whatever and like actually securely share stuff. Would And that's why we've seen a lot of other agents like hit things like people think they prototyped an awesome agent, but you know, it's not allowed to connect to like some really important data or whatever because of the security team. So, yeah. >> Yeah, at this point So, for those watching along on YouTube, we're going to bring up a image of the structure here and we can talk through a little bit of the architecture. I think I just I just want to have people understand cuz like when we're talking about software, it can be very abstract and like like here's actually what we're talking about. You've worked out in open source this entire platform basically and there's a runner component and server component with a sort of uniform API that you've you figured out. Any other sort of element and obviously you can plug in all these persistence layers and and compute layers. This is a whole cloud. This is Agent Cloud. >> Yeah, it's I mean it's got these components to work with it. The you know, a lot of the action happens like on the machine where you deploy your agent to. So, whatever you've got on there, you can run. But yeah, it's I think it's sort of the minimal thing you you want to have hosted like collaborative agents and to have that server. And one of the reasons we open sourced it is anyone building agents, this gives them an app they can start with and customize, which we were seeing in Databricks too like someone would make a nice, you know, agent app and then other teams would ask, "Oh, can I just use yours for my agent?" >> Yeah, I think we had like five or six different Agentic frameworks built by every different team. They do all do more or less the same thing. >> Yeah, you need to basically people want to take something that works in Forkit and you might as well have something open source. Yeah, which which also was another question that which is interesting for a Databricks like what do you choose to open source, what do you choose to make it proprietary? It's It's I mean this goes back to Spark, right? >> Yeah. >> [laughter] >> One So, I mean one of the reasons to open source something is if you think it's a layer that will actually there'll be some network effect. It'll benefit from many people collaborating on it. So, for example, with Spark, I don't know if you if you know what when when Spark came out, we we also focused a lot on letting you have libraries on top. So, like they used to be different distributed computing engines for like machine learning and graph computation. We said they should all be libraries that you can compose, and we made it super easy to add connectors to data sources, too. And then we benefit because, you know, we we don't have the time to write like connectors to like, you know, a thousand like different databases and and file formats. But we can just use the ones people make, and of course they benefit from joining, you know, kind of this this thing. So, that's like one of these as a Another way to think about it is I can imagine, you know, we our thing wasn't open. We had some kind of agent hosting thing, but it's not open, and then there's an open one. If you're Which one's going to win in the long run? So, like here, because there is this benefit from like people writing integrations, it'll be it'll be that. And then there are other things that like you just can't even deliver as open source that are things the company does. Like, for example, how do you make sure your like streaming, you know, jobs or your your lake-based database doesn't like, you know, lose all your data at night? Well, that requires an an operational team that's going to sit there. There's no way it has to be a service. So, like we want to make sure as a company we're really good at those infra services, and then we're as open as as we can in terms of like what you build on top. >> I mean, speaking from a benefits, I think we're already seeing pull request and the ecosystem integration, even though it was only released on Saturday. >> Yeah, Saturday. Yeah, so someone >> Let's see what's going on. Yeah. >> Yeah, you can look at the merge lines. I actually asked some legend this this morning about the >> 400 merge already? >> Yeah, I I think quite I would guess around half are not from my team. But for example, someone added support for running it on Kubernetes. People added many cloud sandboxes. So, this can launch a cloud sandbox and run your agent in there, which is great for sharing, too, cuz it's not like on your laptop and someone's like running sky code on there. Um so, yeah, many startups have put those in and we expect to see more of them. We also have more agent harnesses already, cursor, CLI, and anti-gravity, also. >> Yeah. >> That's all uh beautiful. I I you know, I I I feel like the last time this happens, there was the rise of the modern data stack. I don't know if it was that useful. I'm I'm actually kind of curious in your your postmortem. I I think most people [clears throat] will agree that it is finally dead. Uh but it maybe this arises to a new modern AI stack that like does the same thing. >> [laughter] >> I don't know. >> I mean, I think the modern data stack was a pretty useful thing, probably even up until this day. I I think what maybe for the audience who don't actually understand the history. I think the modern data stack is effectively decomposed into you need a layer to ingest the data in. You need a layer to transform your data. Uh then all of this are run and then you need a layer to maybe visualize your data. And all of this runs on some sort of data warehouse or later on as we're doing data warehouse, also lakehouse. I think that concepts are all very powerful and very useful. It's sort of enabled a lot of workloads. What people eventually run into is kind of a question of unification and consolidation. It's, "Hey, do you really need to chop all of this into different pieces and work with so many different vendors and platforms in order to get like a very simple visualization done." Right? So, I think like over time everybody start realizing that customers are pushing us. We start to realize that, so we start building more and more capabilities and trying to consolidate. And at the end of the day now, customers don't have to worry about having to hook up five different systems in order to produce a chart. But the I think I honestly something like this is probably happening um in how many different frameworks do you want to hook up together in order to produce like do a very simple agent. >> Just to be clear, I would say the core of this is this common API on top of all the harnesses. So the API is basically like you've got an agent session and you can send in a message or like a file. Basically, that's what you can send in and then you get out, you know, these streams as it's streaming text or as it's doing tool calls. And uh or the other thing you can send in is you can like tell it to cancel or turn. So that's the API. Now, the thing we did is we we could get you that on top of like Claude code, running in a terminal, Codex, you know, Phi, OpenAI SDK, all that stuff. We map them all to that same interface. So that is something that you'd have to maintain yourself if you built your own like agent orchestrator. And then whenever Claude changes its API, you got to, you know, tweak your thing or it's going to lose some messages. So that's the thing that's valuable to maintain. Then on top of that, like we built a few apps. I think we we built a pretty cool UI and stuff, but that's um and and we built the security and control piece which which I'm excited about. But it's that common interface. So we don't we it doesn't try to be a stack. And in fact, you could plug in your your own UI on top of this uh server. That That's one of the use cases we care a lot about cuz we want to use this in in our own products. >> Yeah, it should be everywhere. >> Yeah. >> I think one of those things that is is really interesting to me is well, first of all, I'll I'll endeavor to do everything and not call it the modern AI stack because I think [laughter] we have a name. But like yes, like you know, uh so one of the first people that told me about compute uh sandboxing was Nikita from Neon. Cuz a lot of people think about uh Neon as like, well, it's serverless Postgres with like the separation of the compute and storage and uh you know, instant branching and all those things. But actually, every database company is also a compute company. >> Yeah. >> And so he was actually showing to me his whole his sandboxing solution. I don't think he ever launched it. >> So our sandbox solution, the reason we could have built it so quickly was because we realized if you just take the actual lake base architecture and remove the database from it, by the way it was coming from you. Now there are some differences. For example, in the ones that support this particular workload is important to have local persistence. Um because you want your state to persist. Your libraries you don't have to install your library every time. Right? Whereas the Neon architecture, because of the separation of storage from compute, you don't need persistent local disk. So there's some differences. But the at the end of the day, yeah, it's it's uh >> Yeah, so this is when you run like a a coding sandbox. Like if I use that we had the dev infra internally at Databricks, there's like many many like tens of gigabytes of data just for like all the source code and like artifacts and stuff that I built and I want that to come back next time. So but yeah. >> Before the show, we was talking about some statistics that might be surprising at the adoption. It could be internal, could be external, whatever comes to mind. Just to impress people the scale this is happening. >> So we on the analytics side, I think we launch maybe 50 or 60 million virtual machines a day across our three clouds. So we're one of the biggest compute orchestrator out there. So that's for sure for CPU compute. >> Yeah. >> Um the and all of those process, I think exabytes of data. I joked about depending on which time zone you are, typically before you have breakfast, Databricks would have processed processed exabytes of data already on that day. Um and on Neon it's it's actually pretty interesting queue it's launching I think 13 million databases a day now. >> Yeah, that to me that was like a big >> And that's just like >> What do you what do you mean? >> [laughter] >> And then a lot of those were thanks to agent agents and branching experimentation. >> Yeah. >> Because we made it so easy and so quickly and thanks a lot to Nikita's team to launch databases. It's uh that So it's changing the way people use databases. >> Yeah. Okay, we're going to go into more database talk in a bit, but I want to make sure we close up anything on Omnigens. Uh you mentioned uh you're excited about the security and control side. Uh a lot of companies are figuring that out right now, as well as the spend side. Um what have you found there? >> Yeah, so I spent quite a bit of time talking to internal users, uh developers, security team, you know, uh managers, and also lots of customers. And there's a few things. Like first of all, one thing, you know, that immediately was became obvious as for security, uh you know, there's this tension between like usability and security. And um the way people do like a lot of coding agents today have very basic things like you can tell me which tool patterns are allowed or disallowed or whatever. It's like yes or no. But that puts you in a very tough spot. So, just as an example, like should my agent be able to read, you know, some confidential documents? Or or let's say should it be able to install new packages from NPM, which you know, maybe it's a it's compromised. Yes or no. Like maybe maybe I want to allow it. Should my agent be able to publish stuff to the company website? Well, if I'm using at the code on the website, yes. But should it be able to do both? So, it can like grab a confidential document and be prompt injected and leak it? Probably not. So, the thing we decided we need is stateful or what we call contextual policies, where you keep track of the state of that session. It's not like is it allowed to push to the marketing site or not? But like hey, if it did a risky thing, like it installed, you know, a one-day-old package from NPM, or it it read like a thousand confidential docs, then no. Then don't don't do it. Otherwise, maybe it's okay. That's one example of like moving that trade-off. So, it's both more secure and more useful by having a more powerful engine essentially. This requires tracking sessions. The other piece that was interesting there is like there are these very low-level events it's doing and you want some libraries on top that parse them. Like for example, we have a MCP server on Google Drive internally. It's got 60 API calls. Like how do I know which of those like will share a document with stuff on the internet and which ones won't? It's it's annoying. So we designed an Omnigen the policy layer so that it's it's functions and you can have libraries. Like someone can make something that maps the low-level events to high-level ones and then you write a policy about the high-level things that came out. So and that was related to the Panther and Yeah, Panther is will help with that. Panther is kind of a similar idea on the event processing side and it's Python based versus a weird custom language. This is sort of more as in real time. >> [laughter] >> Those things are happening. Yeah. So yeah, but but these are the cool things. I think the contextual or stateful part and then the the way it can be libraries and that was another reason to make it open source because others will write libraries and like we and our customers can use them. And the final thing because it's stateful, one of the states we track is how much you spent in that session. So I can I've had like I I asked an agent to debug something and it spent $500 because it decided to read a lot of log files and burn a lot of tokens. But I can literally say, "Okay, launch a sub agent to do this and cap it to spending $5. Like ask me for permission if it needs more." And because we're counting that within that session, it'll pop up and tell me, "Okay, you spent five $5. Do you want to go on?" >> So I'm more than context here. Matei spent the last five years, a lot of his time was architecting Unity Catalog at Databricks, which is the governance layer for data. >> That's right. Yeah. Yeah. >> And it's sort of combining expertise at that layer together with all the AI governance here. >> Yes. Yeah, but I also spent a lot of time being annoyed by coding agents and and getting bombs. And also as the CTO, I don't want to end up on the front page as like I installed some weird NPM package and leaked all the code. So, I'm especially paranoid, but also I have very little time. So, I don't want to sit there approving like, "Do you want to run a 20-line, you know, bash script? Yes or no?" So, that's why I spend a lot of time figuring out like, how can I make it as safe as possible and not annoying. >> Yeah. Is safety and let's call it security a bigger concern than token maxing or token budgets, you know, which one is like >> Oh, yeah, they're both there. I mean, I don't know. I I guess it depends on the type of company you are. So, I think some companies like the budget is is limited and you know, they they really care >> about that. Or I mean, you can be Uber and still be concerned, you know. >> Yeah, oh yeah, totally. Yeah, [laughter] yeah, yeah. If you have Yeah. Yeah, for for us security is is absolutely critical as a as a, you know, cloud provider. It's it's the most important thing. And token maxing, you know, we we we're not so worried about it yet, but but I've seen the like for example, I talked to some consulting companies. They have like 100,000 employees who are all coding for customers. If those each spend like an extra $1,000 a month, that's that's not fun. You know, we have like only a few thousand engineers. >> What's the policy in Databricks? Is it is it just unlimited or >> unlimited, but we do you know, we use our own product to like analyze the traces and stuff and we have a team that's you know, looking to optimize and and to see if anyone's doing something weird. And we actually had some really cool insights just from analyzing current traces like which models are better at say Rust versus like, you know, TypeScript or whatever. So, yeah, at least in our code base. >> Yeah, amazing. Obviously, I have to ask the token maxing question. Obviously, I think it's a it's a key thing, but but yes, security and control above that and figuring out a sane layer that you can have some autonomy, but not too much. >> Yeah. [laughter] Yeah, and we want to make it super easy. As an engineer, you should set the thing. So, in Omnigen, you can ask your agent set up a policy on yourself to do this. So, it >> If there's anything I should be showing, I I I don't I don't see it on the GitHub, but you know, this is >> put in the docs there. So, you can look at it later. Just look in the docs on contextual policies if you want to see. Um >> I just like to show people >> policies. >> Yeah, if you want to, you know, follow up on this, this is exactly where to look, right? Yeah. >> Yeah. Yeah. Uh yeah, and the story of these is like I just wrote uh you know, like I wrote a doc with like 10 ideas for things before as you were working on them. Well, it did That was like my wish list of things people asked, and I told the team like, "Hey, can you do like at least five of these for the launch?" And then they just got back with all of them. So, >> Oh, wow. >> Um so, you can come up with with more, but then some of them are just meant to be examples. Really, you can intercept like any event the agent is making, and you can then either block or force it to ask the user or like allow, and you can update state to keep track of stuff. >> Yeah, cuz you know, ultimately, I think I think of you as like a systems designer, you let people plug in, right? That's the whole modus operandi of what you do. >> Yeah. Yeah, and we and we care a lot about also composability. Like, can someone else write a library that others use, which this is meant to >> There's also a batteries included philosophy here, probably very similar to how you did Spark, which is you could just start using. Yeah. >> Yeah, that's right. It has to be good out of the box at certain things, and then you can build your own things on top you know, we don't want to do but you know, in Spark, if you just want to like, I don't know, like read a table or do like aggregation, it should be awesome out of that out of the box. >> Yeah. People want to catch up on Omnigen, they should watch your keynote, uh they should go through the GitHub in the docs. If they wanted to contribute or they want to build on this ecosystem, where would you call out as the most high-leverage places to get involved? >> Yeah, do get involved in the Discord and then get hub our team is is there is monitoring and some of the things people ask for we just built ourselves. Some of them, you know, we're we're collaborating with with them to build that and also tell us like how you would like to use that cuz I think especially for developers like everyone wants it to work their own way and a really good developer tool like you have to hear the feedback on all the ways and figure out the abstractions and how to let people customize. So we love to hear like if you think hey I you know I don't want it to work this way uh tell us. We really just want to get that compatibility layer across agents and then let you do stuff on top. >> Yeah. Is there any you know in terms of like the startup side I'm a founder I want to I see an opportunity I want to get in front of you. What's your request for like a startup that like you know I wish someone someone was working on this. >> Oh for a startup. >> Yeah. Like you know you got your own startup it's doing well but like you know if you weren't working on your own startup what what is like obvious that you should >> [laughter] >> You advise many startups too obviously. >> I mean I do think just as a company with a lot of engineers like anything that helps me make sense of how people are using coding agents and and spend but also quality or like you should write you know you should add this skill or you should write this thing or your agents are really horrible at tasks involving this service or like go spend time. That would be nice. Yeah. >> Yeah the closest I found is this team get AI. >> Mhm. Oh cool yeah. >> They they started with like we will just do code and human attribution but they're basically building the analytics layer on top of that. I do think like there are a bunch of like artificial analysis is obviously doing super well with with their stuff. So there's there will be people I think I think this is like the domain of consultants first but then people will actually build software that is like they have like the management plane for coding agents. >> Yeah I think there'll be a lot of insights there. You have it in other areas. >> Okay, well, and then the other big thing is your dream engine. >> [laughter] >> If you want to tell the the story of of you know, OLTP. >> And our background was I'm going to make people listen to our Ankur Goyal episode where we talked about single store, HTAP, and all that history. >> Yeah, yeah. The OLTP idea is actually pretty simple. So, people have heard of the Ankur's talk about HTAP, it's effectively the world of databases. Sorry, there's like maybe a lot of context that needs to be injected here. The world of databases >> to be the database podcast that I'm forcing people to like learn your databases, guys. You cannot vibe code with just markdown files. >> It's one of the most important fundamentals of [laughter] systems technologies out there. But, the world of databases is effectively split into roughly two halves. There's what we call OLTP databases, which are transactional, and think of your Postgres, your MySQL, your Oracle databases. And the other side is what we call analytics, and sometime might refer to the term OLAP. And the differences is on OLTP, you typically have maybe run some transactions on event that looks up at one specific row, we update that row, right? It's a very row-oriented data structure. And on analytics, you're trying to reason on the data, you're trying to compute, "Hey, what's my revenue per store? What's my How's my website doing every day?" And then you eventually want to probably end up running machine learning on it to predict, "Hey, how will my maybe sales be going in the future?" They are so very different architecture, and everybody start with OLTP databases cuz every app, when it becomes serious enough, that needs more than markdown files, you need to have a database. You don't want to lose your data. You want to have some transactional [snorts] consistency. But, once you want to reason on the data, if you only have like 100 rows, it's probably okay to run it on your Postgres or your on your MySQL database. But, once you have more data and want to run more complicated analysis, the very analysis might crush your save post crash database. So you start doing getting data out of the >> Replication replicate them into the analytic systems. >> Yeah, which is for people elastic search is like a big >> Yeah, so some of them actually get into elastic search for like log analysis. A lot of our customers obviously get into data bricks to run more sophisticated things. And there's this term called CDC. >> CDC change data capture. >> Um and what it does it reads the bin log of the database. And if you don't understand what bin log is fine. The but it's a little delta of the data and then reconstruct based on the delta the state of the database um on the analytic side. But CDC is like a very painful thing. It it's how basically standard in the industry. Everybody uses it. But um it ends up being so if I think many data engineers ends up being woken up at like 3:00 a.m. um because of some pipeline thing. >> You know, my my explanation is like everybody is like a you know, became a $5 billion company just doing CDC. >> Yeah, exactly. CDC is like a very it's one of the most boring but one of the most fundamental operations like powering modern society. But it's so brittle that uh we joke that it's should be called continuous data corruption because you might change your schema on your OLTP database and then the CDC pipeline fails to handle the schema change. And then everything goes out. >> I mean there's all sorts of tricks that you can do like you add in like some versioning or whatever. But >> Yeah, but it's a very in general very complicated. Like I think at my keynote I asked the audience put up their hand if they love their CDC pipeline. Only like maybe two people put it up. So if single store like about maybe a decade ago I think the industry had this idea, "Hey, what if I built a single database that can handle both workloads?" >> Which like by the way every database person ever has ever always dreamed about this. >> Yes, yes, this is the the holy grail of database engineering. Why not build a single system that can do both of this? But it ends up just being a lot of compromises. Um one, I think one of the first issue is that hey, each they say Postgres has a massive ecosystem. Right? You want to be using the tools that's built for Postgres. And Spark, for example, had a massive ecosystem. There's a lot of libraries you want to use. If you were to create now a new thing, you don't have an ecosystem. You tend to create a new smaller proprietary API and you're lacking both. And it's also very difficult to make it performance-wise to be a sort of comparable on either side. So it ends up being actually sucking up both. And our whole idea of LTab is kind of obviously a word play on the term HTab, is that we think this is HTab done right. HTab wants to build a single engine for both. We think you can get 99% of what you need by unifying the storage. And just have a single storage layer. And once you have the single storage layer, if your Postgres databases are writing data in a column-oriented format, everything analytics can just go read that data directly without any delay. Right? There's no pipeline in between, so all the data would immediately be available for reasoning analytics. I think I was telling some customers earlier, hey, when we talked about this is going to be super useful for agents. I actually at first didn't really believe in it myself, even though we wrote that positioning. But then last night I was having dinner with Australian customer and they actually told me, oh hey, one of the big issue we have is we have all these logs from our services and we see SLA dips and want to investigate. But then there's no way for those agents to even understand what's going on in the actual databases themselves. All we see is just like product telemetry of the database and the services. You would actually make those agent 10 times more powerful if you understand, for example, who's actually placing those orders. Um what is happening? What exactly are they doing? So now I'm actually sold on our own message. >> Yeah. >> Um I think it's really kind of it gets you basically the almost all of the benefits of the HTAP holy grail, which is hey, make the data available immediate for reasoning analytics. >> Yeah, I think you know in the way that humans are generally intelligent and want to have the ability and access to to query anything. Even while they do the work, they also need history and they need context and like where else do they get context? That's an analytical workload. >> Exactly. >> [laughter] >> Yeah, and I remember when we had incidents with with our databases and the engineer said, well, I can't just run a giant query on it to see what's going on because that's going to bring down the database and it hurt it even more. Like that's the kind of stuff that this gets rid of because you spin up a whole separate fleet of machines that's doing the analytics. You're not overloading like the main database that's still trying to serve stuff. >> Yeah. >> So, this has been a dream for a while. What had to get done in order to get to today? Like, you know, yeah. I feel like you have announced variants of this several times. But it wasn't as clear as L-TAP. >> Yeah. >> I think L-TAP is like a like, okay, we've got it guys. >> I was talking to somebody at Meta and then he was asking me, hey, what's the catch? Why is it possible now? And I think the reality is we took a lot of time to actually worked on the lake base architecture. I mean, obviously a lot of it came from the Neon team, which is hey, separation of storage from compute. And it turned out it was just a tiny little step away. Going from that to this L-TAP idea, which is hey, we just in the Neon architecture and in the lake base architecture, we're writing data in row oriented format to the open data lake. But in there, we're writing in Postgres pages. Actually, Ali and I were spending a lot of time debating, hey, can we actually just change that to write in column oriented format? And we're just debating and then one day, one of our engineers, who's actually super smart, came in and said, hey, I just prototyped it, it works. >> Wait, prototype what? >> Prototype instead of storing the data in the data lake in the row-oriented format. >> Like Postgres payments, write them in parquet. >> Yeah. >> Um and he just make the observation that hey, our storage fleet is has a lot of extra idle CPUs. And we could use those CPUs to do the transcoding from row to column, where row is good for OLTP, but column's good for analytics. Um so let's do the transcoding at that time. And as a matter of fact, once you transcode the data, the data compresses better. So from those services writing to, for example, S3 or other data lake like object stores, you can actually write them faster cuz now they are now smaller. >> Yeah. >> So there's no overhead, it's no compromise in performance. >> overhead. >> Yeah, but because but we had extra CPUs anyway. >> fleet anyway, yeah. >> So the the debate ended. I mean, it's one of the classics of the tech issue of a lot of debate, but then somebody actually went ahead and just tried to prototype it and it worked. >> But like something this strategic and important to the company, I expect there to be like a kickoff thing, like a design doc, nothing like that. >> Nothing like that. He He just We We were debating [laughter] in many many meetings, and then we're just debating whether it's possible or not from first principle, and then somebody just did it. >> Yeah, I mean, if you set yourself up so people do that, that would be great. And that happened a bit with Omni John too. I think if if I just had a doc and like we can make these together, everyone would you know, would think oh, what about this? What about this? But then you if you try it out, it helps. And then if you have real users and they bash it and like it's still working or in this case, if you have the workload, you you know what the workload looks like, you can just test the same pattern and yeah. Tech aside, which is very cool, this is like the most important thing, the culture of innovation. And you don't have to ask my permission, you don't have to like do a whole form formal process, just do it, you know. >> Well, especially these days, I think with AI, it's actually easier to do that. >> I think you are very real I mean, I've met a lot of C-suite of like large companies and like I think that at scale is things slow down and I'm sure you felt it already but somehow you have this core of people that like are exempt. >> How? >> I think we hire and we work with really really good people and that's a very important part of it and empowering them but also spending a lot of time maybe us in the trenches matter a lot also. >> Yeah, I think I mean I think first people could kind of adapt to being in the larger company so that helps and we we want to make sure they know that they can try stuff and and settle debates and and have a lot of examples of how it was done before or launch a thing in beta or whatever and then the other thing I do think as a company like despite the size we don't launch that many like products we try to keep it pretty coherent that's that was actually the whole like sort of theory of the company was like instead of having like 20 Amazon services you need to set up like a analytics and machine learning stack you just have one and it's like the same API, the same semantics across all of them, the same copy of the data so that requires like unification and then we basically added one more thing at a time like we added storage with Delta Lake we didn't used to do any storage then we added sequel you know we added machine learning platform stuff so but yeah don't don't do too many but do those things well and and that also helps you know helps keep it manageable. >> Yeah. >> The other thing we kind of encourage a lot is instead of building to boil the ocean for everything let's figure out how do we do it incrementally how how do we do it very quickly like many of our products they're built in the span of weeks and then we go to hey like usually my first question to whoever team is building is who's the target customer who are you working with are you on a first name basis with them are you texting with them? I think having that very tight loop. >> Can you bring up another launch that comes to mind with in this kind of thing I just to I just want to give you some background on that way. >> The uh >> Who's the customer? >> Yeah, I'm even was more of an internal thing actually because we would use that for our developer like yeah, basically the whole like AI team got access to it and was using it and we made sure it works from the beginning with our internal code base which is a mono repo that's like enormous. We gave them like some infrastructure. We gave them lots of like token capacity. So it's all the developers. Yeah. We had others. I don't This is I think a public story but >> I was going to ask marketplace open sharing all of them had I I just don't remember exactly which ones publicly referenced it. >> Yeah, they had other Well, very very early in the company there was like Delta Lake which is the transactional storage layer we did. We had our largest customer at the time said like okay, I need some I want something in the cloud cuz you know, I I if the rest of our network is compromised like this thing needs to be separate to store and query the events and then talked to us. He said okay, this is the rate of events per second. This is like the freshness I want. Can you do it? So that was like way larger than any workload we had and we had our engineer working on that my Michael Armbrust and he worked just to make this work and once it worked for them, you know, it worked for everyone else. Yeah, this was early in the company probably like four years in or so. >> 20 2018? >> Yeah, 17 18. >> [laughter] >> Yeah, clean room which is basically how you share data in a way without sharing underlying data but you allow specific operations. Those were done effectively initially just for two customers. I think the industry has a sense of hey, maybe if you overfit to like one or two customers going to be really bad for you but I think the downside of overfitting is much smaller than me the upside itself. And if you sort of try to be too ambitious and boil the ocean it's a much bigger problem. >> Yeah. >> Cuz you might end up actually having no customer. >> Yeah, that's more that's the more likely outcome. Then then you can sort of pivot from there. I do think there is such a thing as a bad customer that sometimes you should fire. >> They could exist sometimes if you drive Well, one of the challenge I think we we probably see and maybe many AI so newer generation companies are seeing is so tech companies are very very different from non-tech companies or traditional enterprises. And if you optimize everything just for tech companies, you might have very challenges scaling them outside of tech companies. >> Okay, what are like what are like top three differences that you always think about? >> Governance is a big >> I think yeah, a big one is like yeah, security, you know, data privacy, governance, all that stuff. So usually if you're building some kind of like B2B or developer tool, like your biggest market is going to be enterprises, but it's just very different a company that's existed for like, you know, it's had some form of IT for like 30 years. They have so many legacy systems or they operate in a regulated space. Whereas a startup or even like you know, like sort more recent tech company, all the everything is new and sort of pristine. So yeah, it's just different and if you've never worked with enterprises or been in one, you you just won't know about it. >> And the procurement process is probably quite different. There's actually far more stakeholders. >> is one. Yeah. Another piece that's interesting is I think some tech companies you know, people will say oh I can build that myself, right? I I I I I I'll just build that myself. So so then you go >> I don't think people say that about Databricks. >> Uh >> [laughter] >> Yeah, they do. They do. They do. They do. They do. >> Yeah, I mean yeah, and it depends on the the teams and things. But on the other hand like many of the enterprises say actually I don't. I never want to be in the business of building that. Like I don't want my you know, whatever I'm a retailer or something, I never want to be down because like some weird like nerd like couldn't get streaming pipelines working. That's is what I'm >> Yeah, this makes them great customers to be honest, right? >> But you have to understand that it's hard without having worked there and stuff. Like you may not appreciate. >> Look, I think they're all great. Don't get me wrong. They have different challenges. But the many of the tech companies, for sure there's a lot far more DIY. >> On the flip side, you have people who are they're very much experts in their domain. Like they're building airplanes. They're, you know, designing medicines, whatever. And they just want a bridge to the knowledge where like they don't want to learn, you know, databases or whatever as cool as we think it is even as interesting as the average software engineer might think it is to read a little bit. Like they just never want to know. They just say I have you know, giant like, you know, matrix or whatever with my clinical data. Like how do I, you know, how do I like cluster it or whatever. So, yeah. >> Yeah. Yeah, that's true. Okay, so and then I wanted to actually build out the the sort of dream engine vision. >> [laughter] >> Where does this all lead? So one of the thing we realized maybe a couple of years back is that actually every single database engine out there, especially on the analytic side, are kind of a decade old. Um pretty much everything that have reasonable traction are about a decade old. And they all started targeting some very specific narrow use cases. And then over time it's become more and more successful. They've grown in their ambition. And then they try to support more and more use cases. But the fastest way to support those use cases tend to be hack around the that were initially created. They were not for those use cases. And then but you can kind of support them more or less okay. And before you know it, after 10 years of organic evolution that way, it becomes a gigantic pile of Um the and but then that includes Databricks. And very, very few company or very few systems I think have the gut to say let's go start from scratch. Let's go back to the drawing board and design knowing everything we know today after a decade of workflows and probably billions in revenue, let's attempt to rewrite it from scratch and actually make sure it will work and it can support all these use cases. So, we started doing that. But, it's a very ambitious project. Uh by the way, you can search on Wikipedia this thing called second system syndrome. >> Yeah, I know that. >> Or second system effect. >> Every developer must know what a second system >> It's basically you build your first thing and it works out great and the second one's bound to fail because too ambitious. And then you ask someone >> you know, you think you know everything and then you're like I'm going to design the perfect system this time. >> Yeah. And it turned out it's not perfect and then it start failing and you're too ambitious never launch um and you get killed. The um the engineering team that actually started this, they were brilliant. I think we hired some of the best database engineers um on the planet into Databricks and they were brilliant. Thank god it's not their second system. Many of them have built more than two in the past. >> Nice. >> But, they were still worried about this. "Hey, building a database engine from scratch, I think the conventional wisdom is going to take like 5 years to mature. This will be a very long-term project. It could fail." Um I think one of the engineers kind of joking and say, "Hey, maybe we just call it Project Stream Engine." If we name after co-founder, maybe we don't get cancelled or killed. But, I I think they built something pretty remarkable. Um they went back to they they kind of changed the way the database engines were built from a paradigm point of view. Usually when you build a database engine, you read a lot of academic papers, you try to understand what the latest algorithms and data structures and you put them together and see if they work or not. And there's a high risk of failure there also because whatever that looks really good on paper might work out might might actually look really good in 70% of the workloads, but then it backfires on the other 30%. Um they actually went built a more of a factory for building the databases. So, they spent more time building this factory and the factory takes the decade of traces we have. I think they count as a quadrillion data points in the trace table. >> You don't drop anything? Or you see sample? >> We for sure sample, but the there's like massive amount of things. And the and they use that to build a model. Um like a machine learning model. Not an LM, machine learning model. Machine learning model basically it can very very quickly tell us how any algorithm and how any implementation will perform for any specific type of queries with very very high fidelity. And based on that they can uh pick the most likely algorithm and data structure that will actually help with the different kinds of workloads. >> Mhm. >> Both at runtime as well as at implementation time. >> Mhm. >> Because there's like unlimited number of >> I mean it sounds like you want to like route to different data structures. >> Yeah, I mean if you think about it a single database has many things implemented together, but you want to make sure they all work well with each other and then for any given operation there might be more than one implementation. So we make it actually one really really Reality is things algorithm that works super well for example for very very low latency might not work very well for say scanning through petabytes of data. Yeah. Right. Actually most often there's a trade-off there between throughput and latency. >> What are the key dimensions like scale, throughput, latency, what what what anything else? >> and and the distribution of data. >> Yeah. >> Right. How sparse the data is. That matters very a lot. Um how frequently do you hit the same data? >> Yeah. How many distinct values and stuff like that? >> Those things matter a lot. Like number of distinct value basically impacts the memory consumption of your aggregation. Your hash like at some point there's a hash table. >> That somebody I'm going to in my write-up I'm going to try to list all these out because I really want a taxonomy. To me taxonomies are so helpful because it covers everything they should think about. >> I think if you actually try to list it out probably like a million different features. >> [laughter] >> I always want like okay give me like 12. >> [laughter] >> Uh you know >> you know like a someone did like I think a Oracle paper in like 40 years ago did like the This is the eight fallacies of distributed systems. That kind of thing is super useful. >> Yeah, that's it. >> It's like, okay, think through these eight. >> But let me give you a very weird example, but it actually has a profound implication on performance, which is like it's your string is ASCII or does it have Unicode in it? How should I encode this? >> I mean strings are the most complex data types. >> [laughter] >> So the that like for example if strings are super dense, you could actually convert every string into a like imagine I have to do a aggregation instead of having a hash table, you could actually have an array because if your string is dense enough, if you only have 256 options, you don't need a hash table. You can just do array >> Yeah. >> lookup. >> Yeah. >> Um and that that'll be far faster. >> like a country code or something. >> Yeah. >> Yeah. >> So it's actually like probably millions of uh features in that model. But using that, they can one basically prioritize the different algorithms that might actually impact in practice. And many of them are very counterintuitive. It isn't actually things that you think it might work super well actually don't work that well in practice. But also more importantly at runtime, you can dispatch the right algorithm and structure. >> I'm listening to to the the dream. I feel like Databricks is is doing a really good job with the incremental evolution. Do you have to hard cut to a new system at any point or like >> We designed it in a way that it can be incremental. So first we're releasing a new endpoint. Uh but but this goes to the broader ocean versus What we wanted to do is one to the by design, this new engine should be able to do everything we're able to do before and better. Right? It's been particular the better part refers to very low latency low latency workloads that can finish in tens of milliseconds. But we want to roll it out incrementally with incremental capabilities so it doesn't take like 5 years to actually see the light at the end of the tunnel. >> I think that's a heroic task. I don't know what what what other way to say it. I am really interested in any any sort of new workload and new databases. I mean, obviously, I think if I've maybe established that I'm a little bit of a database nerd. The transactional databases, sorry, the accounting databases, like the the TigerBeetles. I don't know if you've seen those. >> What What do they do? >> Dual entry accounting database. Like it's just meant to really model like financial accounts and credit systems and >> It's like a very specific >> very high throughput, yeah. Yeah. No, it's so when you were talking about how everyone like starts with a thing and then they they scale up and then they tack on other things. It's exactly that. And then I use recently interviewed Simon from TurboFifo, same thing. Like and Chroma as well. Like they all the vector database companies of 2023 all are suddenly now just we're just generally general storage of blob storage. >> Especially a database should have never been a separate category. >> [laughter] >> I I think that used to be a hot take. Now now it's like the the conventional wisdom nowadays. What What should be a separate category? You know, if everything becomes ELT, like what's >> [laughter] >> I think the thesis of ELT is we're not collapsing the databases at the actual query layer. We're just collapsing the storage layer. Yeah. And that's a I think a very important part. And we actually don't think it makes sense to collapse the query layer into a single like HTAP style database. And part of it By the way, the other thing I think a lot of people had is hey, it would be nice if there was only one query language I have to worry about instead of worrying about Postgres SQL and maybe Spark SQL, why not just one? But I don't think that's an issue for agents. Agents are very eloquent in Postgres SQL or Spark SQL. It's never going to get confused. As long as the data is there and it's accessible, agents will do fine. That that might have been So, >> Yeah, and the >> 5 years ago might have been a problem for humans. >> That could arise over time also, but it should and this is leads to how to do things incrementally, right? Like you we realize you don't need it right now. We don't need to solve that problem to have a lot of value from from the current Delta app. >> Yeah, okay. I'm going to end the pod with a little bit of more of sort of spicier things. >> [laughter] >> Everyone has like had the receive within a separation of storage and compute and try to build you know the the clouds. I had the same pitches from Snowflake. How have you succeeded where they failed? >> [laughter] >> Rough. >> Well, I mean respect that they are a competitor. Objectively, you have outpaced them. >> What is the the core insight from your point of view that you you guys just went different directions? >> Probably [snorts] the biggest fundamental difference. Both companies started around the same time. Both went to the cloud. Both focused on storage from compute architecture. But the biggest difference one is open. Like Databricks had never had a proprietary format, right? We started with the open sort of ecosystem. Started with Parquet and then evolved into Delta and Iceberg and all that. It's like one big thing. I think that matters a lot. The other one is AI. >> Um I mean before 2022 October 2022 when ChatGPT came out, we had always pitched Databricks as a machine learning plus data. And a lot of the platform were built with machine learning use cases in mind. And obviously AI is a little bit different. And Matei is like spent far more time there than I do. But the whole platform was we never felt hey we're just a data infrastructure platform. >> Like Databricks only yeah. >> I think that they started with a like they thought okay, we'll just manage the most valuable data and try to make it really fast. For that, we'll have our own storage, you know, which is optimized with the engine. And then we'll just target like the small amount of data that like the managers and whatever, you know, finance people and so on look at and make that super fast to serve. And you know, it was a different space. Whereas we started with like we'll do the bulk processing and ingest. Like you got a bunch of, you know, JSON log files, you got whatever, we do that very large-scale stuff cuz that's what Spark was was for, the large-scale batch processing like stuff. And then, we'll keep the the the data in an open format. Might be slower, but like it's already out there, you can consume it downstream. And it turned out that, you know, it's easier to to go from that batch thing that's really good at the scale and ingesting and super low cost and create versions in it that have the speed and features of the, you know, super easy-to-use like smaller data for business users thing. >> And then there's then optimize. >> Yeah, start open and start large. Like in some sense, we started upstream of them. And there was a time actually when we both like sort of listed each other as partners because you said if you use both solutions together, use Databricks for like your ingest and compute and then serve the tables out of Snowflake, you get all the visualization, all the way fast stuff. Like that's great. And then, you know, we both realized like customers were telling us like, "Why do I need this other thing? Why can't I just query your tables?" And we said, "No, we're horrible at that. Like please use our partner for the SQL warehouse stuff." And then they realized like, "Wait a minute, so much of the compute is moving upstream into this this other thing." >> You have to go into each other's territory, yeah. >> But I think we did start with like the bigger scope and with the open thing. And And that's important actually. Like as a kind of ghost enterprises, like if your company's existed for like 30 years, you've experienced, you know, being locked into Oracle and like all kinds of like crazy things. And if you're the CTO there and you're setting up the architecture for the future for your company, you're going to want to pick a foundation that's open. And you only want like one way to manage data in your company ideally. You don't want like seven different systems. >> But I think the data format have won. Like, I think now every enterprise wants to put data in open data format. But, uh it was actually very controversial like back then. I think five, six When exactly as one of the Snowflake co-founders actually wrote a blog called Choosing Open Wisely, which basically argued against >> [laughter] >> Yeah, yeah. >> I think they might have taken it down. You have to find the archive now. >> Oh, I mean, it's it's it's never going away now. Uh no, no, it's still there. I love the the sort of perspective of that only you guys will have because obviously you you you run the company. Uh and I I Thank you for indulging this. It's a incredible uh perspective. >> Maybe one last one. Um as you were talking, I think I have to give Ali a lot of credit. >> Mhm. >> He's an incredible CEO. I think he's the perfect combination IQ, EQ, technology obsession, execution, business acumen. >> Mhm. >> Um and uh and he's also a founder, which makes a lot may him a lot easier for him >> Yeah. >> to mobilize and execute. Um I think that's uh >> Oh, that was that was it. So, did uh you have Ali and he they don't like okay. >> Well, there's >> [laughter] >> a whole lot of other things, but I think Ali played a pretty big role in the >> I was I was I was I was I thought he was there was like going to be some technical uh choice that he he contributed to. >> I he he pushed for a lot of these. Like, there were sort of forks in the road where he pushed for like one way and then it it became clear that like that was the right way. Uh yeah. >> Uh I mean, there's a whole book that needs to be written about how like the eight of you like, you know, work together and all that. I think there's been profiles that people have done. Second one, not a clear questioning again. Uh Mosaic. Mosaic. A lot of people in our community are in are curious on like what's the sort of the model story of Databricks, right? Like, when you guys bought Mosaic, like the the thing was like, "Okay, well, we can do fine-tuning. We're going to do in-house model." cuz they had uh the Mosaic models. And it seems like you're you're not doing that. And it seems like you're going towards more of the Alt App and and uh and the Harness stuff. What's the story there? You know, just >> Yeah, I guess what when Mosaic started I think it was it was well known or became most well known for releasing open source LLMs early on and and they were general models. Actually before that they were doing other things. They were about optimizing uh training systems basically. So they had the fastest like image model training stack in in the world and stuff like that. And then they decided to do LLMs which which was smart. They they moved into it before ChatGPT. So they had some of the first open source LLMs. >> Yeah. Um we interviewed John Franco and Abby for MPC-7B. >> Yeah, exactly. Yeah. Oh yeah, very cool. Yeah. Yeah. So we uh decided, you know, even though we we did launch a open source model DBRX and you know, we we went up to like sort of above the Llama 3 scale, we decided that we really want to focus on there'll be so many people releasing models and instead of doing the general model where like, you know, a big part of the recipe is just throwing a lot of compute and just scale, uh we want to focus on like the next step also of let's say you have the very smart model. How do you make it, you know, useful? For us it was a lot about automating like how like like making it very good at querying data. That's the the first party agents we have called Genie. Uh so it's like a virtual data scientist. Imagine, you know, there's someone who already knows all the stuff in your company inside out and knows all the machine learning libraries, all the data libraries, you know, all the stuff on the web and you can ask them questions, you know. That's um that's what we wanted to do first. So that meant like let's not focus as much on like let's just train some kind of frontier model, but let's build a system using either external models or or um fine-tuned uh customized components. Um we're still doing quite a bit of model training though and in fact we're always, you know, we're procuring like lots of GPUs and stuff all the time to do it. Um and there's a few places where we're doing that. One is uh, there are many high volume use cases where if you have a specialized model, it's just so much better than any of the of the general models you get. Uh, nice example of that is understanding like documents, like PDF, Word documents, stuff like that, parsing them. If you've ever tried to do that, it's frustrating cuz you send it to like, you know, like Claude Fable or whatever, it like almost gets it, but it gets some things wrong. And it's super expensive. You just burned a huge amount of tokens plopping in an image into there. So, our team uh, built this uh, document uh, sort of vision model that takes a page and gives you back a nice JSON with all the components. And it's very competitive. It's like probably like 100x cheaper than those uh, frontier models and still better. And that's actually done by one of the researchers who came from DeepMind, was a co-founder of Adapt, like very early LLM scaling person, but um, but focused on on this. Um, likewise, we have um, uh, we're doing specialized sub-agents for part of what the coding agent does. And if you've seen the stuff on advisor models uh, from Harvey, um, also from >> And Anthropic, Commission also, yeah. >> And UC Berkeley, actually, one of my grad students there uh, wrote a paper called advisor models. I think before those came out. I mean, I'm sure others had the idea at the same time, but that's uh, something that helps a ton. So, yeah, we actually showed some some stuff just today at the keynote on uh, >> Is it Parf? Oh, you know Parf? >> Parf, yeah, yeah, Parf is >> speaking at my thing. Uh, he's speaking at Continual Learning Bench. >> Yes, yes, yeah. I'm one of his advisors at Adapt, yeah. >> interviewed his brother, Chai, cuz he's also at Adapt. >> Yeah, yeah, yeah. >> Uh, that that family's very smart. >> Yeah, yeah, yeah. [laughter] They're they're awesome, yeah. So, yeah, so we're doing some of that and and as we get experience with these in the first-party agents, we're also doing them with customers. So, my my feeling is like um, customizing models is actually going to get way easier over time. That's what we're finding cuz the base models are smarter, so they generate better traces in RL already, and then RL is about learning from your own past traces. And then synthetic data generation is way better, way easier now. Uh, we have pipelines just using open-source models. Like the same model generates training environments and trains itself and beats like Opus and GPT 5.5 and stuff at a task. So, I do think it's going to pick up. Like, you know, the thing is the ease of training the algorithms is only going to go up over time. There's a question of when it crosses into mainstream. Like, instead of just like, you know, specialized document parsing thing we did, where like you need a hardcore LLM researcher, when does it get easy enough that anyone can like sort of plop in some stuff and describe a task. >> Yeah. >> Yeah. >> Well, you know what makes it easy? Interfaces. And uh, unified APIs. Cuz obviously if it's not interoperable, then you cannot switch. >> That's what we're seeing with these like with with Omnigent and the composable agents. Like you can have sub-agents with specialized models, and then you can train the whole thing. I think that'll help a lot, too. Yeah. >> The last thing I was going to leave, actually, this is I'm sequencing this, so I'm actually kind of proud of myself. Satya, uh, is uh, is uh, you know, talking about this. I interviewed him at Microsoft Build a couple weeks ago. And then he wrote this essay, which I'm sure you've seen, which is the whole mobile building frontier ecosystem. He sounded, uh, when I was talking to him, more like a Databricks CEO >> [laughter] >> Uh-huh. >> Um, is there is there a I mean, uh, this thing presumably went viral in my circles. I don't know if it did in your circles. What what's the sort of theory of like, uh, you know, I guess tokens as IP, building up the context, you know, he basically said everything but data is the new oil or context is the new oil. Some some version of that, you know, that that you guys have heard before. >> Yeah, I agree. I think that the data you have, as you get better technology around that, like you can just do more in your domain with it. It's not even just about AI. Even when people, uh, you know, started collecting stuff in real time, like I remember all the power companies put like the smart meters and stuff, and all the car manufacturers started putting like sensors and cameras and stuff. Any technology like makes data more valuable and can give you some advantage. Anything that helps you do something with it and make some decisions. And AI is the same way. Like you had all this stuff that's just sitting there. Now you can have an agent automatically tell you. Like for example, you know, instead of I discovered as what a feature in my product is broken cuz a customer complained, the agent tells me, "I noticed no one is like uploading files anymore cuz they they got errors or whatever." And as you saw with like Raiden, like as a database company, because we have all these the history of all the queries and all the table layouts and like how they work, we can build a new engine very quickly that, uh, actually is is good and we're confident that it's going to be good. So, I think I think this is right. I think the the question is exactly how it will, uh, land, but I do think like custom model customization which Sajid talked about is going to get easier over time. >> Yeah. >> Which is why, by the way, I brought up the model thing cuz they have their MEI things and you guys don't. That's the That was the to be the mental question. >> Yeah, [laughter] we we do have, um, we're doing like RL fine-tuning as a service and uh, with with a bunch of customers. We don't have like basically, you know, we we have like preview customers and we have a general something called AI runtime that's like we got you GPU clusters on demand with, uh, software stack in there that makes it easy to do training. So, we didn't like sort of but that's existed for a while. We've had like GPU compute for a while and that's where a lot of the Mosaic, uh, stack went to to help scale that. >> Yeah. >> But yeah, we found that the engagements, like some of the There's two types of customers. There's some who just want GPUs and libraries to like get data in and out and monitors. So that's what AI runtime is. And then there's some that say, "Hey, you know, can you actually work with me, build evals, build synthetic data, >> Yeah. The more forward deployed solutions architects. >> that's what we're doing. And and as and more things will transition from like being custom to to not. But um that's sort of how it is today. >> Going back to original question, I think one of the thesis we have is actually the ones you can get the data in the right place. The AI models are becoming pretty good. The generic agents are fairly I mean I think I heard you talking about AGI is already here. They have pretty good reasoning capabilities. Actually, I think many of the traditional software will be sort of rewritten uh with this new paradigm, which is just get the data to be there. And then they slap some agent on top. Magic will come out. >> Yeah. >> Um but without the right data, you can't really do that. And it's actually our approach going to security and our approach to going to the uh customer data platform space. >> Yeah. >> Is uh like we we launched two products at Data and AI Summit. One targeting sort of security teams and the other one targeting marketing teams. And those all are have a lot of existing technologies out there. And our I think our approach is just, "Hey, once you get the data in, everything is a lot easier with agents on top." >> Yeah. Yeah. Well, and you guys have been fantastic guests. I just love this discussion. I just love the ability to dive in on the tech side, but also culture and strategy. I hope this isn't the last time we chat. Like I mean, congrats on all the success so far. >> Thank you. Congrats on your success also. [clears throat] >> Yeah. >> Yeah. Yeah, I mean uh David's actually supporting my uh event, which is a it's a So I run Yeah, I run conference. And it is actually I was I I've been attendee of Data AI Summit for a long time. And I noticed that it was like kind of This is back in 2022. It was like 90% data and then 10% AI. And I was just like, "Well, okay, like we need a we need the community thing that is like just 90% AI. Which like not everybody is. >> Yeah, yeah, that works. >> So yeah, yeah, so Databricks will be will be at the conference and you know, I I it's just amazing to see you guys build out the most like interesting like cloud that I have ever seen outside of like the you know, the the the the big three and like it's amazing how far you've grown. Like one of the one of the most insightful like you know, I don't I'm not a VC but I play one on TV. Like Ben Horowitz like when he was talking to you guys advising you on just like where is this company going? He was like don't sell until 100 billion or some some version of that story, right? >> It was like the the company should be worth a trillion dollars. You're underselling it for 10 billion. >> And like he doesn't do that for everyone, you know, like >> [laughter] >> for some some reason like you know, I I think he saw the vision but also the infinite runway that you have. >> We're lucky to have Ben. Yeah, he's a big supporter. >> Yeah, amazing. Okay, well, thank you so much. >> All right, thank you so much, Sweaks. >> [music]

Jobs for this video
Stage	Status	Last error	Updated
summarize	done	—	2026-06-24 22:00:58.585723+00:00
transcript	done	—	2026-06-24 22:00:40.504979+00:00
metadata	done	—	2026-06-24 22:00:20.174668+00:00

The Agent Cloud: Databricks’ Bet on the Future of AI — Matei Zaharia and Reynold Xin

TLDR

Key points

Tools mentioned

Techniques

Takeaways

Jobs for this video