Logs Are All You Need: Rethinking Observability with AI Agents


TLDR

Sazabi is an AI-native observability platform that replaces traditional dashboards with a chat interface, arguing that logs alone are sufficient for understanding production systems when paired with AI agents. The platform eliminates metrics and traces, uses AI to generate alerts dynamically, and provides a Slackbot for natural language queries. The founder discusses controversial design choices, including a read-only, no-public-internet agent that uses sandboxes and git-backed memory for shared state across threads.

Key points

  • Sazabi replaces traditional observability dashboards with a chat interface and Slackbot, arguing that the best UX for observability is chat.
  • The platform rejects the three pillars of observability (metrics, logs, traces) and claims logs are all you need, reconstructing metrics and traces from logs on the backend.
  • Sazabi eliminates static monitors and thresholds, using AI to generate non-deterministic alerts based on logs, codebase, and other context.
  • The Sazabi agent is read-only and has no public internet access, using a Daytona sandbox with a bash tool and a log DB query tool for root cause analysis.
  • Memory is stored in git-backed markdown files, with separate branches per conversation thread, enabling shared state across agent runs and automatic merge conflict resolution.
  • The agent can spawn sub-agents in parallel, each with its own sandbox, and uses git for shared memory across threads.
  • Evals for CLI-based agents are challenging because the only tool is bash; the team uses fuzzy matching with LLMs and local mock services to evaluate agent behavior.
  • The company uses a TypeScript monorepo with Turborepo to simplify importing packages and building eval tools that import the agent and dependencies directly.
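
The "logs are all you need" point says metrics and traces are reconstructed from logs on the backend, but the talk does not say how. As a rough illustration only, the sketch below assumes structured JSON log lines with hypothetical `ts`, `level`, and `request_id` fields, and derives a per-minute error rate plus a trace-like grouping by request. It is not Sazabi's implementation.

```python
import json
from collections import defaultdict


def _parse(line):
    """Return the log record as a dict, or None for unstructured lines."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None
    return record if isinstance(record, dict) else None


def error_rate_per_minute(log_lines):
    """Derive an error-rate metric from JSON log lines, bucketed by minute.

    Assumes a hypothetical ISO-8601 `ts` field and a `level` field.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for line in log_lines:
        record = _parse(line)
        if record is None or not isinstance(record.get("ts"), str):
            continue
        minute = record["ts"][:16]  # "2026-01-01T12:00:05Z" -> "2026-01-01T12:00"
        totals[minute] += 1
        if record.get("level") == "error":
            errors[minute] += 1
    return {minute: errors[minute] / totals[minute] for minute in sorted(totals)}


def group_into_traces(log_lines):
    """Rebuild a trace-like view: records grouped by a hypothetical
    `request_id` field and sorted by timestamp."""
    traces = defaultdict(list)
    for line in log_lines:
        record = _parse(line)
        if record is not None and "request_id" in record:
            traces[record["request_id"]].append(record)
    return {
        request_id: sorted(records, key=lambda r: str(r.get("ts", "")))
        for request_id, records in traces.items()
    }
```

The only instrumentation this requires from the application is a log statement that includes a timestamp and, for the trace view, some request identifier.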

Tools mentioned

  • Sazabi
  • Datadog
  • Vercel AI SDK
  • Daytona
  • Turborepo
  • Cursor
  • Claude Code
  • MCP (Model Context Protocol)

Techniques

  • logs-only observability
  • AI-generated alerts
  • git-backed agent memory
  • sandbox-based agent execution
  • parallel sub-agent spawning
  • genetic prompt optimization (GEPA, Genetic-Pareto)
  • fuzzy matching with LLM for eval
  • read-only agent design
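
The transcript discusses fanning out many agent runs on the same input and observes that existing tools offer "the map" but not "the reduce". The sketch below illustrates that idea under stated assumptions: `run_candidate` and `score` are hypothetical callables supplied by the caller (for example, "run the agent in a fresh sandbox" and "count passing tests"). The speakers say Sazabi does not do this today, so this is an illustration of the pattern rather than of the product.

```python
from concurrent.futures import ThreadPoolExecutor


def best_of_n(run_candidate, score, n):
    """Run `n` independent candidates in parallel (the map step) and keep
    the highest-scoring output (the reduce step).

    run_candidate(seed) -> output   hypothetical: one agent run per seed
    score(output) -> number         hypothetical: e.g. number of tests passed
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    with ThreadPoolExecutor(max_workers=n) as pool:
        outputs = list(pool.map(run_candidate, range(n)))
    # max() keeps the first output among ties, so results are deterministic
    # for deterministic candidates.
    return max(outputs, key=score)
```

The reduce step is only as good as `score`; when no automatic check such as a test suite exists, choosing among the outputs remains the hard part, which is the gap the speakers point out.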

Takeaways

  • Observability can be simplified to logs plus AI agents, eliminating the need for metrics and traces.
  • Static monitors are obsolete; AI can generate more meaningful alerts by understanding context.
  • Git-backed file systems provide a robust way to share state across agent threads and handle merge conflicts.
  • Evals for CLI-based agents are hard and often require home-rolled solutions with fuzzy matching and local mocks.
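
The git-backed memory described in the transcript pulls a per-thread branch when a sandbox starts and pushes after each bash command finishes. A minimal sketch of that cycle follows, assuming the project's memory repository is already cloned in the sandbox with a remote named `origin`. The `thread/<id>` branch naming and all function names are hypothetical; only standard git subcommands are used.

```python
import subprocess


def thread_branch(thread_id):
    """Hypothetical naming scheme: one branch per conversation thread."""
    return f"thread/{thread_id}"


def remote_branch_exists(repo_dir, branch):
    """`git ls-remote --exit-code` returns non-zero when no ref matches."""
    result = subprocess.run(
        ["git", "ls-remote", "--exit-code", "--heads", "origin", branch],
        cwd=repo_dir,
        capture_output=True,
    )
    return result.returncode == 0


def pull_plan(branch, exists_on_remote):
    """Commands to run when the sandbox starts."""
    if exists_on_remote:
        return [
            ["git", "fetch", "origin", branch],
            ["git", "checkout", "-B", branch, "FETCH_HEAD"],
        ]
    # Brand-new thread: start the branch locally; the first push creates it.
    return [["git", "checkout", "-B", branch]]


def push_plan(branch, message):
    """Commands to run after each bash command completes."""
    return [
        ["git", "add", "--all"],
        ["git", "commit", "--allow-empty", "-m", message],
        ["git", "push", "origin", branch],
    ]


def run_plan(plan, repo_dir):
    """Execute a plan, stopping at the first failing command."""
    for argv in plan:
        subprocess.run(argv, cwd=repo_dir, check=True)
```

Merging thread branches together, and resolving the conflicts that result when two agents edit the same memory file, would be a separate background step that this sketch does not cover.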

Transcript (captions)

>> We're golden. Thank you, sir. >> Thank you. >> Right, dude. Where do we even start? >> So much >> I feel like Yeah. [laughter] >> What are you trying to do? You're trying to dethrone Datadog. >> That's basically it. That's like the short version. And along the way there will be other bodies. >> Yeah. [laughter] >> But uh >> to reach the final boss. >> To reach the final boss. Yeah. But I guess for the viewer, I mean, we should explain what Sazabi is. The short version is I'm working on this company called Sazabi. And uh Sazabi is an AI-native observability platform specifically built for engineering teams that want to move fast. Uh I think there's been a lot of changes to the way that we do uh software development over the last year or two. And one part of that life cycle that feels distinctly not fast, slow and painful, is the process of finding and fixing issues in production. I felt like, you know, I have a background in observability and infrastructure and have hated observability tools for a long time, or had a very love-hate relationship with them. >> You know, as my development workflow changed over the last couple of years, uh it very quickly became clear to me that this particular part of the software development life cycle needed to be disrupted in the way that we've disrupted, like, creating new features. Um, so >> and it's observing agents or it's observing what, like systems, like traditional systems? >> Yeah, that's a good question and actually it's one of the first questions I basically get every time I talk about Sazabi. It's like >> um like is this a Braintrust or an Arize or a LangSmith competitor or Raindrop competitor? >> Yeah. 
Ah, >> which all of those you were talking about last time we did the podcast. You were like, "Yeah, we're using Phoenix. We're using Lens. >> It's cool, but it's, you know, we're also using Datadog." And so, you were using all of these different tools. So, you're very well equipped to know what the pains are of using the tools. >> Yeah. The official answer is that we do AI for observability, not observability for AI. >> Oh, okay. >> I started the observability team at Brex um like in 2019 and uh for most of my career I've been focused on DevOps, infrastructure, making developers more productive, uh site reliability, and have a ton of familiarity with Datadog and bring it to every company that I work at, but I feel like it's just really not meeting me where I am with regards to my software development workflow, like running all these coding agents in parallel and running background coding agents and not even looking at code, just no-look merging stuff. >> That's the looks good to me. >> Yeah, it's just that we're building it a different way today and so I think that Datadog is um >> but how are you doing it differently? >> There are three main things that Sazabi does differently relative to a traditional observability platform. The first is that uh we don't let you look directly at your telemetry, which is controversial. Uh obviously the whole point of an observability platform is to observe the data that your application is generating, uh through dashboards, through log search interfaces, flame graphs, service maps, um all of the 20 or 40 different modules that you see on the sidebar of a product like Datadog. 
But um I think those are all just ways for you to answer questions about how production is working. Like, is it up, is it down, you know, what does this error mean, like this customer's complaining about something, why, uh how long have customers been affected by this, um >> severity of an alert >> the severity, you know, which commit's responsible for bringing this down, what are my, should I >> we think that the same way that developers don't look at their code today, they won't look at their telemetry. >> They will just ask questions of an agent that has access to that data and the agent will tell them what's wrong. It will answer all those questions for them. So, >> um, another way of putting that is that we think that the best UX for observability is chat. >> And so, Sazabi >> just gives you a chat interface and an amazing Slackbot. >> Yeah. >> So, that's number one. >> Oh, that was only the first one. All right. Tell me two. >> Uh, yeah, we have a whole manifesto. So, >> [laughter] >> Number two is uh equally controversial if not more. It's this idea that logs are all you need. >> And if you're familiar with observability, then you might have heard of something called the three pillars of observability. Does that ring a bell? >> Yeah. >> The three pillars are metrics, logs, and traces. And for a decade or more since the beginning of observability, >> lots of companies were built off the back of that. >> That was the premise, right? Like the idea was that you needed all three types of telemetry to properly understand the application in production, and I just think that's not true. Um >> I think it's especially not true in 2026 when we have agents. Um >> so then you're like let's get rid of the traces and metrics and just focus on the logs. >> That's right. Yeah. Um there's a lot of reasons why I think this is the case. I mean, one benefit of just focusing on the logs is that instrumentation is now significantly simpler, right? 
Because we took actually the two things that are hard to instrument, traces and metrics, and just get rid of them. So like you no longer need to know how to set up a Prometheus server or the difference between a counter and a gauge and a histogram and a rate. And uh you don't have to propagate span context and make sure that the trace IDs are passing through the entire uh call stack and also across services. We just get rid of all of that. You just need to know how to, like, do a console log or a print statement. Um >> so am I going to be able to vibe observe? Is that the idea? >> Yeah, it ties a little bit to our first idea, which is like you should just be able to ask simple natural language questions. Anyone on your team should be able to ask those kinds of questions. >> And your instrumentation experience is going to be way easier because it just requires you to add logs. Um, and you're going to get all the benefits that you would have if you had logs, metrics, and traces >> because uh we have some tricks on the back end to reconstruct metrics and traces from the logs. >> All right, let's peel back the veil. Wait, that was only two. >> That's two. >> What's the third one? Since >> they are so spicy. >> They are spicy. The third one is uh I love all three of my children equally, but uh the second one has definitely been getting the most attention. I think the first two deserve a little bit more because they are quite controversial. Like we're really asking people to change the way that they do observability. Um it's a dramatic departure from how you've done it in the past. Um and the third one is maybe the most radical change. Uh it is this idea that traditional monitoring and monitors are dead. Like we have no use for static monitors with static thresholds anymore that uh you know alert you when CPU exceeds uh 80% of the host or uh when a pod enters a CrashLoopBackOff status. >> And why not? 
>> The reason is, well, first of all, because monitors and alerts suck. If the one bad thing about observability is uh instrumentation, then the next worst thing is monitors. >> Alert fatigue, dude. >> Alert fatigue. >> I learned about that term when I was learning [clears throat] about observability tooling. >> And people were like, we're going to use AI so that we can recognize which alerts are actually useful versus not. And that was back in like 2018, 2019. >> People are still working on that. So we take it a step further, which is we're not using AI to evaluate your alerts and decide which ones are meaningful or not or enrich them. We're just using AI to generate the alerts. >> Mhm. >> So you don't go into Sazabi and say like I want to alert on XYZ. Sazabi has access to your production telemetry, specifically your logs, and has access to your codebase and has access to any of the other tools and context systems that you give it, and uh from there it's able to decide what is meaningful to you and not, and so you'll receive an alert, for example a Slack notification, that is completely non-deterministic, uh at the discretion of the agent, and it could be like hey, you know, I saw this commit go out. I'm seeing this error in the logs. It seems like there's a problem with payment service. Uh here are your recommended remediation action items. Would you like me to kick off a Cursor cloud agent to fix this? >> Um >> so you're taking one step further. It's like there's this problem. I could go and try and fix it and get kind of far maybe or get all the way. >> Yeah. Yeah, I mean one of the things that we draw a line at, I guess philosophically, Sazabi, like we're focused on observability and um helping people find and fix issues. We draw a line at code generation. >> So we will not open a pull request or merge anything to your PR. In fact, Sazabi is like a completely read-only system. 
Uh but we can initiate, for example, like if you link your Cursor account or you um are using Claude Code with an MCP server or the Sazabi CLI, uh you can use Sazabi to generate code, generate fixes >> And can you also create issues in Linear? [clears throat] Yeah, right. >> Yeah, you can do that. The agent can do that at your direction, like you could say hey Sazabi, um create a Linear ticket for this >> or you could say hey Sazabi, every time you see a new issue, create a Linear ticket for it. >> Mhm. >> And Sazabi has memory, so it will remember that preference and it will keep doing it uh in perpetuity. >> Dude, there's a lot of stuff that I want to get into, like peeking behind the veil on how you're doing things and >> what you've learned while building it, because I think you have some equally spicy takes beyond just >> what you're trying to do. Yeah, exactly. Let's talk a little bit about the MCP versus CLI paradigm >> because I think there's this simplistic view where people are like, "Ah, I don't like MCP because it bloats my [ __ ] I don't want to use it." But >> you have a bit more of a nuanced take. >> You know what I think is funny is that like a couple weeks ago, we had a big banger birthday party for Claude Code. >> Yeah. And it was Claude Code's one-year anniversary. And I think MCP came out in like late 2024. >> I'm not sure anyone was celebrating the MCP birthday. Um, which is no shade to MCP. It's just interesting. Two products uh that were both killer apps in their own way from the same company. One of them has had this like crazy celebration. The other one's not. um >> MCP uh you know, and I don't keep entirely up with the protocol so I'm sure they're doing a lot of great work to make everything better, but um I guess the big complaint that people had over the last 6 months or so was that it would blow up the context just by loading the MCP server. >> Yeah. 
>> And then Anthropic has done some work on patterns for addressing that. I think one of them is basically tool search. >> Yeah. And well, yeah, and there's also the um what is it, the progressive disclosure? >> Yeah. And I think that is kind of like tool search if I understand correctly. Like, the agent, you don't need to load all of the tools into context right away. The agent sort of should express some kind of intention around what tools it's looking for and then you could surface the relevant ones. >> Yeah. Yeah. That's becoming a best practice where you just have a certain set of abstractions above it so that you can say I want all the tools that are related to XYZ >> and you get there >> and the same problem kind of could exist in the context of a sandbox and a CLI, but the um you know imagine, like, the number of tools, or uh I guess programs, that are at the disposal, uh at the discretion, of the agent within the context of a sandbox is enormous. >> It's like sed, grep, cat, ls, like literally every Unix utility and other CLI that you've installed. It's a huge number of tools. >> And I think it's interesting that that pattern doesn't create the same context rot >> that MCP does. I think that is because these tools are uh effectively baked into the >> they're in the >> into the weights. >> Yeah. >> So the model sort of just knows that there's probably a cat tool. >> Mhm. >> In this environment and so it just like tries and then you know maybe it gets an error where it says, like, cat, that alias hasn't been set up or it doesn't exist in this context. Um it's not in the path. Mhm. >> And it's just like, okay, well, let's try something else. >> I have noticed, and I don't know if you've noticed this, too, that I'm very happy-go-lucky or trigger happy when it comes to creating skills. 
And I'm starting to wonder if now, >> are we headed to the same >> like skills bloat, where all of a sudden now I got a million different skills. And cool, it's all local. I don't have to worry about the MCP server, but anytime I do something, >> yeah, >> just because the majority of the time I make my skills universal. If I really like a skill, I'm like, "Yeah, this is going in all my projects." >> One advantage of skills, and I do think that skills run the same risk. Um because the way the agent uses them, it basically does an ls in this directory which contains all of your skills and then it basically starts walking the directory and reading the header and the markdown of each of the skills to kind of get a sense of whether this is relevant. But the advantage of that is that it's only reading the header. >> Yeah. >> It's not loading the tools. >> My understanding is that it's only reading the, it's not loading the whole markdown skill. >> Yeah. Which is, maybe in MCP there's no way to um provide a hint about what this is the way that there is with skills. >> Well, let's talk about sandboxes because that's something that I >> what, the bane of your existence? >> No, no, I love sandboxes. Yeah. >> Good. Tell me about it and how you're doing it, because I know with observability it's really hard to recreate certain scenarios >> and recreate, like, when things are failing. That's one, uh so I guess that's a little bit more on the eval side, but you want a sandbox, kind of related, yeah exactly >> yeah because we use a sandbox for our agent >> sandboxes become relevant in our evals, right, because we need to run the agent in the evals. So, >> how are you using them, maybe to start, and then go into what some of the pains are, or I know you mentioned how you're using git and it's kind of like a trick or a nice little hack so that you can always have shared state. 
>> Yeah, we'll see if by the time this comes out whether I am still as bullish on it as I am. >> But uh yeah, maybe I'll just talk a little bit about the Sazabi agent and how it works and where sandboxes play a role. Um the Sazabi agent is, uh let's see, we built it on Vercel AI SDK workflows, um which have been pretty good although there's some sneaky lock-in stuff. >> Um the code is very clean and there are some very ergonomic um parts of the SDK, and uh it has access to a Daytona sandbox and a bash tool which runs in the sandbox. It has a handful of other tools that are directly mounted to the agent. Um I'm trying to think of an example right now, like we'll have to come back to it, but for the most part it's operating inside of, actually you know, here's a critical tool that >> uh it uses >> that's not within the sandbox, and that is like our log DB query tool. >> Okay. So obviously one of the most important things that our agent does is read our logs, right? Like that's how we root cause things. Um so we give it a tool that is basically like a SQL-like uh interface to our log database. It's able to run any sort of read-only query. Those queries go through a proxy so that we can kind of enforce, uh um we also have like RLS set up on the database and um so this particular user is not able to update or drop any tables, which would be really important. Uh >> but we do some things through the proxy as well. Um and then almost everything else happens with bash and with our sandbox. Uh, for example, like you can install all kinds of CLIs into the sandbox environment. Things like the AWS CLI if you want Sazabi to be able to investigate uh or correlate things that it finds in the logs with what it's actually seeing in your AWS account. >> Um, Sazabi has access to your source code and the way it has access to your source code is through the sandbox file system. 
So we'll clone your repo into the sandbox and so Sazabi can then just explore the files and basically again do that correlation step, uh tell you like hey Demetrius, like this error that I'm seeing in the log specifically comes from this commit and is in this file, it's on this line >> and it was introduced by >> John >> John two days ago, the designer brought down our marketing site >> John you had one job >> Uh there's also our memory, which is entirely based on the file system and sandbox. >> And this is where things get interesting because I was saying like how do you have shared state, right, and you were saying you pull everything from git and you're constantly pushing back to git. So if you have a lot of different sandboxes that are running, they're pulling all the time and then pushing and so it stays up to date. And I did have the question of well what do you do about data, and I guess you haven't hit that yet, like databases you don't use. >> Well, just for more context, just to lay it out, basically our memory is very similar to something like OpenClaw where it's all based in markdown files and in a prescribed folder structure, and those files are stored >> right now in a self-hosted git server, git repository. Um what we do is we actually create separate git repositories for every what's called a project, which is a scoping mechanism within Sazabi, like you might make a project for your staging data and a project for your production data just so that when you're talking to Sazabi you're either talking about staging or production but not both. >> So we create a git repo for every project and then we create a branch for every thread. The thread would be like a conversation. Uh and the first thing that happens when we create the sandbox is we pull down that particular branch. >> Mhm. >> Which on a, you know, if it's a brand new thread, it will be basically empty. 
Uh and then when the sandbox command, the bash command, executes and ends, we always push back to that branch. So then regardless of whatever commands the agent ran, like maybe the agent just echoed something into a text file or maybe it wrote a JavaScript program and parsed a bunch of logs and then found some information and then wrote it all into like a complex file system. Either way, we then just push to that remote branch and now that state is persisted independently of the sandbox. And then the next time we start the sandbox, we pull again. There's probably no diff. Uh, and then the agent can use it again. But what's interesting is that this now gives us a mechanism where we can share state or share memory across multiple agent runs or uh multiple threads. So let's say you're talking to Sazabi and you say um Sazabi, like my favorite color is red. Please, you know, always remember that. And I'm talking to Sazabi and I say, "Sazabi, um, my favorite color is blue." >> Sazabi is going to store both of those things in memory. And then the third Sazabi thread, if we were to ask what's my favorite color, it would say, "Well, Demetrius is red and Shitz is blue." >> Um, because we've committed both of those memories to the git-based file system or the git-backed file system. And then we can merge together the memories that are shared from different branches, if that makes sense. Mhm. >> There's also something kind of interesting that happens too where um because agents are sharing memory, you can imagine maybe there's like a list of issues that Sazabi is keeping track of in its memory, and one agent wants to say mark an issue as resolved and another agent wants to um actually mark that issue as like mitigated but not resolved. They can both try to make that change. Uh but what's going to happen is a merge conflict. >> Yeah. 
>> And then we have a background workflow that resolves merge conflicts. >> Oh, cool. >> Um which, all of this is things that you would get from a normal transactional database like Postgres, but um you're more directly operating on top of files, which is uh really, like, agents are basically better at working with files I think than they are with working with Postgres interfaces. >> Yeah, what I'm trying to figure out in my head is when you feel like you want to use a whole sandbox versus when you just want to use like Claude Code worktrees and have the agents working in parallel in those ways. >> Yeah, that's interesting. I mean, if I were to imagine an alternative approach to implementing what we've implemented, we could have like some shared VM or shared computer that the agents all have access to and each one of them gets a worktree or something like that. >> That seems a little bit more fraught. I think that it's nice to kind of keep things isolated at the thread level. >> I mean, I think the entire industry is getting a better grasp of worktrees and what they're useful for and what they're not. It's because no one had heard of worktrees like a year and a half ago. >> Um I think it would work if all of the agents were operating in the same sandbox environment, like the same VM or the same machine, but uh because we use them separately, I don't think there's a really good use case for worktrees. Each thread only needs one branch anyways. So the reason you would use a worktree is to have multiple branches in the same environment, but we only have one branch that's relevant to a particular environment. >> And now you mentioned how you're using uh Ga Ga >> GEPA. >> GEPA. Ja. >> These guys >> I think it's GEPA actually. >> It's very acade. It's an academic thing. So >> DSPy, it took me like a year to figure out how you pronounce that. Same guys, they go and they make another one. 
They can't just call it like Johnny. >> Is GEPA or is like Moltbot a worse [laughter] >> worse? Like who's worse at the Peter Diver or is it the GEPA people? Exactly. [laughter] >> Yeah. The reason I ask is because their whole thing is like fanning out and trying to do prompt optimization, right? >> It's Genetic-Pareto. That's why it's called >> GEPA, >> which is a super fancy sounding name. >> The Pareto principle, like the >> it is like the Pareto principle. It actually is the same idea. Um genetic refers to the fact that GEPA uses this uh evolutionary algorithm. So, it involves creating, yeah, it's like you create lots of versions of the prompt. Mutations are what they're called, and then uh the mutations that do well get kept and the mutations that do badly die out. >> So, it's natural selection. >> Yeah. >> For your prompt. >> This is what I'm thinking. Like there's something that I don't know how to rationalize in my head, but it feels like it's all kind of similar in the way that worktrees are for getting things done. But now you start seeing a lot of people using a lot of worktrees. And then you're even saying, "Yeah, we have different sandboxes." >> Oh, that's >> Do you see where I'm going with that? Right. Like with GEPA, you could kind of be like, "Oh, because one thing >> it's a mind-bending idea." I think like you're maybe >> to do sort of a Genetic-Pareto, to apply the idea of Genetic-Pareto to like an agent trajectory. >> Yeah. >> Or to some agentic task. What that might look like is like spinning up a hundred versions of the agent in parallel with the same in >> same, yeah, task. >> Exactly that. Because I heard a guy that came on the um the meetups that we were doing, he was saying one way that he gets better reliability out of his agents is, he kicks off a task and he sets up five sandboxes and whichever one creates the code that passes the most tests, that's the one that he uses. >> Yeah. 
But actually, this idea has been around for a while. I mean, when Codex launched cloud agents or background agents or whatever they're calling them, >> they quietly had this feature which I thought was amazing and indicates sort of where we might be going, which is that you could spawn n number of agents for the same input and just get like n outputs and then look at >> You, so if, like, you really wanted a good result and you just wanted to pull the slot machine as many times, you could be like, do a 100 and then I'll just look at all of the outputs and see which one like >> or see whichever one passes tests. >> That's the problem, is that they had no mechanism for it. They have the map but they don't have the reduce >> like no way to quickly or conveniently eliminate the bad ones and identify which one's the best one. Mm. Um >> and so the reason I'm saying that is just like okay, well with your sandboxes is there a world where you see yourself doing that kind of thing, or right now are you just doing one sandbox for one task and that's good enough? >> Yeah, I think actually this question is independent of sandboxes, like >> we have these primitives, like we have the message, we have the thread, we have the run, every thread has a sandbox connected to it, um and then these threads share memory via the git file system. That will continue to be true, but what we can do is change our user interface or the application so that when a user types in a query, we actually kick off a bunch of agents >> and all of the agents try to fulfill the query and then we merge them back somehow. >> Yeah. >> We don't do that today and I think that's an interesting approach. 
I mean there's been a lot of cool projects recently around like massive parallelization of agents, but usually the way it starts is there's actually a main agent and the main agent is spawning sub-agents and background agents >> at its discretion >> at its discretion, and then there's some level of recursion where sub-agents can spawn sub-agents. >> Yeah. >> Um and we have all of that today. So um if we've evaled our agent properly, it will spin up the appropriate number of sub-agents to investigate whatever uh your query is. And um they will all go off in parallel. And in the same way that when you're using Claude Code or something, sometimes you'll see the prompt, like you get this nice little to-do list where it's like you got five different items, like update this and then push this and then you know change this file um and then run the tests, and each one of them has a spinning icon. >> Yeah. So because it's got like five different sub-agents running in parallel. Um >> and so Sazabi has the same capability right now, >> but each one of those would have its own sandbox um with a shared memory. >> Yeah. >> So sub-agent A and B could discover some things and commit them to memory, but then they would also report them back to the main agent >> as a part of their output. So then sub-agents C and D in theory uh could see the memory that has been committed by A and B. >> They see it in git, not in the context of the >> not in the context, depending on the sequence of when they were kicked off. Like if all four of those agents are kicked off at the same time, then the main agent didn't know the findings of A and B when it started C and D. >> Yeah. >> So the only >> then all of a sudden C and D are like I know kung fu. 
Uh yeah, those basically [laughter] I mean we want them all to be looking for issues and problems and anomalies in the system and then committing those to memory so that uh every agent that we run benefits from the shared findings, the collective findings of all of the other agents when investigating problems with your system. >> Okay, talk to me about evals. >> They're hard. So what else, to move on? Next topic. [laughter] Evals. >> It's so funny. Evals are like, they're so important and I love making fun of them. >> I'm so sick of talking about, we've been talking about them for >> talking about them. Yeah. It's, and it's um >> still hard. >> We had this whole thing called Big Eval that we used to joke about. >> Yeah. >> Capital B capital E, where it was like Big Eval was like all of the venture money that was put into the eval companies, like, wants you to think that you need evals. [laughter] >> It was a psyop. It was a psyop, >> dude. >> And it's not a psyop, but it is um there's the evals like >> there's so much juice you can get from your agent without actually uh implementing evals, uh through just context engineering, through providing a really good harness, and a sandbox environment I think gets you a long ways, and then using the right models and implementing things like sub-agents. So you can have a really good agent without writing a single eval. >> Now, that's why big evals exist. >> Yeah. [laughter] >> Now, I think that in order to really reach the next level, like once you've done all of that, uh, you need evals >> because, you know, we don't just want our agent to take advantage of all of the best practices, we want it to be best-in-class. And one of the things that gives us an advantage is all of the log data that we have access to. Uh, that is something that other companies don't have. 
And the way that we make that data valuable, or the way that data feeds back into our agent to make it good, is via evals. So we have to actually write them. >> You mentioned that if you properly eval, you are going to spawn the right amount of sub-agents at the right time. Why is that? >> You can eval all sorts of things. You can eval fact-based stuff, or try to minimize hallucinations. Let's say the log data contains one error log line, and the test is: when I ask the agent to tell me what error it finds, the agent reports the correct error. That's a binary outcome. But then there are also non-binary outcomes, like how long or short is the agent's response, or does the agent accurately describe the possible root causes? Or another interesting one is, first of all, does the agent help you create a bomb if you ask? We should eval that behavior out, fortunately. >> You don't want that. [laughter] >> Yeah, I mean, a bomb observability company. [laughter] >> You >> Exactly. >> It's in process, we're doing it. [laughter] >> We're going to get put on a list. >> Yeah. [laughter] >> This podcast just got marked as [ __ ] shadowbanned. Nobody's going to see this anymore. >> Demonetized. >> Exactly. [laughter] And shadowbanned. >> So >> But also, the agent should ideally not leak implementation details. >> If a user asks us what tools the agent has access to, or asks us exactly how our agent architecture works, we'd prefer for the agent not to volunteer that information.
And then, most importantly, the evals that matter the most are: does the agent trigger an alert when we think the situation calls for it, >> or does the agent accurately root-cause an issue based on the logs available to it and the codebase that it has access to. Well, you were mentioning how difficult it is to eval when the agent's using CLI tools versus >> tool calling, because with tool calling you can say: did the agent call the tool? Yes or no, it's very binary. Yeah. With CLI >> there's only one tool, there's just a bash tool, >> and so you really have to be very specific in the eval for the CLI: did it call bash, but in this way, and >> Yeah, it gets a little fuzzy. Very practically and in simple terms, if you have a tool like trigger alert, then you can very easily build an eval that says: was the trigger alert tool called? But if you have a tool like bash, like execute bash command, then what does it mean to trigger an alert? Maybe the agent triggers alerts by curling an endpoint on our API server. Well, there are 20 different ways to do that, with different parameters. It could be curl, or maybe it has written a script or something, or maybe it triggers the alert with one payload versus a different one. It gets a lot harder, and the only parameter that you can really match on in that execute bash command tool is a string command. >> Yeah. So how do you do that at scale? >> So I want to know that the bash tool was called, but I want to know it was called with a particular command, or one of 100 commands. >> Yeah.
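A rough sketch of why the bash-only case is harder, written in TypeScript. Everything here is an assumption for illustration: the `ToolCall` shape, the `trigger_alert` tool name, the `/v1/alerts` path, the example host, and the `notify.sh` script are all hypothetical. With a dedicated tool the check is a name comparison; with a single bash tool, a naive string match on the command catches the curl variant but misses a script that might trigger the same alert.

```typescript
// Hypothetical trace format: one entry per tool call made by the agent.
type ToolCall = { tool: string; args: Record<string, string> };

// Dedicated tool: the eval is a binary "was it called?" check.
function calledTool(trace: ToolCall[], name: string): boolean {
  return trace.some((call) => call.tool === name);
}

// Bash only: the command string is the only thing to match on.
// ALERT_PATH is an assumed endpoint path, not a real API.
const ALERT_PATH = "/v1/alerts";

function bashLooksLikeAlert(trace: ToolCall[]): boolean {
  return trace.some(
    (call) => call.tool === "bash" && (call.args["command"] ?? "").includes(ALERT_PATH),
  );
}

const dedicatedToolTrace: ToolCall[] = [
  { tool: "trigger_alert", args: { severity: "high" } },
];

const curlTrace: ToolCall[] = [
  {
    tool: "bash",
    args: {
      command:
        "curl -X POST https://api.example.internal/v1/alerts -d '{\"severity\":\"high\"}'",
    },
  },
];

// A script might trigger the same alert without the path ever appearing
// in the command string, so the naive match reports a false negative.
const scriptTrace: ToolCall[] = [
  { tool: "bash", args: { command: "./notify.sh high-error-rate" } },
];
```

The false negative on `scriptTrace` is the reason the conversation turns to fuzzier matching and to checking outcomes instead of command strings.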
>> So I think maybe a practical solution to this is actually just fuzzy matching on it with an LLM, saying: does this look like it triggered an alert? >> You can't just use the outcome? >> Well, if we want to take it a step further, and it depends on what's in your eval environment, like which services are mocked or which services spin up as a part of your eval environment. But if you imagine our API server was running >> in the eval environment, then what we can do is actually check the API endpoint >> That's what I was thinking >> and say: was this API endpoint called with one of these parameters? >> Because it doesn't matter how it gets there. It just matters that it gets there, right? >> Yeah, it's true. But then you have this problem where every one of your dependencies, everything that the agent can operate on that you might want to eval for, needs to spin up as a part of your eval environment. >> Yeah. >> I'm very proud of how good ours is: you can do bun dev and it spins up a huge amount of our world. But as soon as you add the sandbox and you let people install whatever tools they want in the sandbox, it's basically impossible for us to guarantee that all of the things >> that the agent could operate on will be a part of our eval environment. >> Yeah. >> So, >> so then it doesn't really work to see if it calls the API. >> Yeah. And then take AWS, for example. What if we want a future version of Sazabi to actually autoscale up a database or a Kubernetes cluster? Let's say your Kubernetes cluster or your database is running on AWS, on RDS or EKS. >> Does our eval environment now have to have a running version of AWS for us to test that? >> Yeah.
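The outcome-based idea discussed here, checking whether the API endpoint was hit instead of how the agent got there, might look roughly like this in TypeScript. It is a minimal in-memory simulation, not a description of Sazabi's eval harness: `MockApiServer`, the `/v1/alerts` path, and the `severity` field are all hypothetical, and a real setup would run an actual local service.

```typescript
// Hypothetical in-memory stand-in for an API server running in the eval
// environment. It only records what it receives.
type RecordedRequest = { method: string; path: string; body: Record<string, unknown> };

class MockApiServer {
  readonly requests: RecordedRequest[] = [];
  handle(request: RecordedRequest): { status: number } {
    this.requests.push(request);
    return { status: 202 };
  }
}

// Outcome check: did any request reach the (assumed) alerts endpoint with
// one of the accepted parameter values? How the agent got there is ignored.
function alertWasTriggered(api: MockApiServer, acceptedSeverities: string[]): boolean {
  return api.requests.some((request) => {
    const severity = request.body["severity"];
    return (
      request.method === "POST" &&
      request.path === "/v1/alerts" &&
      typeof severity === "string" &&
      acceptedSeverities.includes(severity)
    );
  });
}

// Two simulated agent runs that reach the same endpoint by different routes
// (say, a curl command and a generated script) with different payloads.
const runViaCurl = new MockApiServer();
runViaCurl.handle({ method: "POST", path: "/v1/alerts", body: { severity: "high" } });

const runViaScript = new MockApiServer();
runViaScript.handle({
  method: "POST",
  path: "/v1/alerts",
  body: { severity: "critical", source: "script" },
});

// A run where the agent never triggered anything.
const runWithoutAlert = new MockApiServer();
```

Both simulated runs pass `alertWasTriggered(api, ["high", "critical"])` even though the payloads differ. The cost, as the conversation notes, is that every service the agent might touch has to exist in the eval environment for this kind of check to work.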
>> And then we need to get the data out of AWS to determine whether this was successful or not, whether the side effect that occurred is there or not. So the only thing I can think of is spinning up an entire version of our production app specifically for evals, and then having all of these hooks into these third-party services that >> Well, yeah. Talk to me more about that. Isn't that what you're doing with RL environments? >> We basically do a version of this. I'm mostly complaining [laughter] because it's a pain in the ass and because I'm looking for better solutions. Yeah, we try to >> we try to avoid having to spin up external dependencies as much as possible for this sort of thing. If we can run a local Postgres instead of Supabase or an RDS instance, >> way better. Or if we can use LocalStack instead of AWS, way better. But the application is becoming more complex, the sandbox increases complexity, and >> Yeah, every time you're pushing code you're adding to that, I imagine. >> As the application gets better, the eval environment becomes harder >> Yeah. >> to create and maintain. >> So there's that friction. There's a tension there: we want to make the application better, but we also recognize that now we're making our evals harder. >> Dude, soon you're going to have to buy Big Eval. >> Yeah. Well, everything we have right now is home-rolled. >> Yeah, >> I imagine. Because that is >> the psyop. >> You know what I mean? Big Eval. Are you kidding me? [laughter] >> We spend so much money on so much stuff. We're very not optimized from a financial perspective. I probably shouldn't say that, but [laughter] >> We'll cut that out. >> Pretend like you didn't hear that. >> No, we're very fiscally responsible, and what's most important is time and not >> Yeah.
So we have money, we don't have time, so we spend money to move fast. >> And you still don't buy eval tools, which is >> And we still don't buy eval tools, which does say a lot. I think part of it is because we're in a TypeScript monorepo. We use Turborepo, which is a gift from God, >> which allows you to import local packages from the same file system or from the same repository. Whereas in the past you used to have to publish a package to some artifact server like npm. >> Yeah. >> And then >> So you're working locally. You want to update the contract for a service, to change the type on a field: you have to change the type, publish the package, go to the other package, >> bump the version. [laughter] >> It's just a huge pain in the butt. >> Turborepo fixes that. And now we have so many packages. Everything imports other things. It's so beautiful. >> Okay. >> It's so easy to create new packages. >> Yeah. And what this allows us to do from an eval and agent perspective is that we're able to build a TypeScript eval tool that just imports our agent and imports a lot of the other dependencies and data sets >> that we have defined in other packages. >> And going back to the nefarious things that you want to guard against, are you trying to red team the agent in a way that you're making sure that >> Mostly for security purposes, right? Customers care a lot about security, because their logs and their codebase, those are the >> Ops is not that far away from DevSecOps. >> True. >> And >> they sit next to each other. >> Yeah. The two big things on this are that Sazabi is a read-only system >> and it doesn't have access to the public internet.
>> So >> on purpose >> on purpose >> made sure that >> Yeah, we don't want an open call situation where maybe Sazabi reads some website that has a prompt injection attack and then it exfiltrates your codebase. It can't do that. >> It only has access to domains and IPs that you whitelist. >> So we're secure from that perspective, and security for sandboxes is complicated. But we do want to eval to make sure that Sazabi isn't susceptible to those types of attacks. Even then, I guess, in that case the only people that would have access are maybe malicious users within your organization. >> So it's a much lower risk, >> but >> less probable. >> Yeah. There's security for our customers and there's security for ourselves, where we don't want Sazabi to volunteer too much information about >> Yeah. >> Sazabi, because that's our special sauce. >> Yeah. >> Which I'm yapping about >> on a podcast. >> Nobody listens to this. [laughter] Don't worry about that. >> Well, we're in the late space office, so this is going to be a big one. >> Then there's security for our customers, which is a wholly separate issue. We just took two hard stances from the start, which are read-only and no public internet access, and then we have all of the various certifications: SOC 2 Type 2, ISO 27001, >> GDPR, HIPAA. I mean, our team built banking infrastructure at PR, so >> there's a lot of hoops you've got to jump through. >> Yeah, we know a compliance or two when we see it. [laughter] >> For people that are basically living in the future and vibe coding, or I guess it's not called vibe coding anymore. Did you hear that? It's called agentic engineering. >> I thought it was AI engineering. >> I don't know, man. I'm going off of the latest Simon Willison blog posts. >> Oh, that's authoritative. [laughter] >> Exactly. >> That is authoritative.
>> So for the people that are agentically engineering their [ __ ], it begs the question: how would they loop in Senzar >> Sazabi. [laughter] That's so funny. The name is divisive. This is >> wrong every time, because in my head, every time I see it written, I just think it and say it to myself as Sansbar. So now when I have to ask >> Is it Jack Black who says it, or is it Kyle? >> It's Jack Black. It's so good. >> Sazabi. Don't worry, the world will know soon. >> Yeah. >> Exactly. I'll remember it next time. But >> yeah, for people that are trying to agentically engineer >> Yeah. >> What are you saying? >> I guess I've got different messages for different audiences. If you're building agents and you want to know more about how we're doing it, or you want to talk to me, I'm happy to connect about it and share some of my lessons learned. I'm very much a practitioner. >> I'm not a researcher or an academic. I've built a lot of agents over the last year or two. >> That's for sure. That's why I love talking. >> It's so fun. But your mileage may vary. I try to share practical advice from the trenches. And >> You hiring? >> We're totally hiring. So if you like building agents, you like software engineering and dev tools, and you want to see some crazy [ __ ] about how software engineering is going to change in the next 6 months to a year, reach out. We have talents.ai. >> Boom. >> Boom. Send us an email. We would love to meet you. And then, I guess, last: Sazabi will be in closed beta when this podcast comes out. >> So if anybody wants to use it >> If you want to use it, visit sazabi.com, s-a-z-a-b-i dot com. We will be opening the wait list pretty significantly as a part of the beta, and it's a very cool and powerful tool.
So if you love AI coding tools like Cursor and Claude Code, and your team is shipping really quickly, especially if you have a team of engineers and production traffic and users that you don't want to disappoint, >> hit us up. We will get you onboarded. >> Logs are all you need. And blogs are all you need.
