Voice Agent Use Cases

TLDR

Voice agents for customer support work best with a hybrid cascaded architecture that balances control, latency, and quality, rather than pure speech-to-speech or naive chaining. The key technical challenges are accurate transcription, turn-taking, and latency masking. Separately, non-technical operators such as support managers need interfaces that resemble the way they already manage human agents. In production, a constellation of models is common: smaller ones for quick responses and larger ones for complex tasks, so the conversation keeps flowing naturally.

Key points

  • Voice agents face unique challenges over chat, including transcription accuracy, background noise, multi-speaker diarization, and turn-taking, which increase failure points.
  • Turn-taking can be improved with a hybrid approach combining acoustic features (pitch, RMS energy) for fast detection and neural models for accuracy, reducing latency.
  • A constellation of models is used in production: a small foreground model handles simple turns and masks latency while a larger model performs complex reasoning or tool calls in the background.
  • Non-technical users (e.g., customer support managers) need interfaces that mirror human-agent workflows, such as defining SOPs and providing natural language feedback for continuous improvement.
  • Fusing ASR and LLM layers can produce more intelligent transcriptions that ignore backchanneling or irrelevant speech, improving the user experience in dictation and email drafting.
  • Common voice agent use cases include inbound customer support, inbound sales lead qualification, outbound sales for long-tail customers, and appointment booking for services like plumbing or restaurants.
  • The debate between cascaded and speech-to-speech architectures is resolved by intermediate patterns that offer flexibility, control, and reliability, especially for enterprise customers who need to swap models or fix issues.
  • Conversational TTS models that consider historical context are important for maintaining appropriate tone and emotion across multiple turns in customer support dialogues.

Tools mentioned

  • ElevenLabs
  • Sierra
  • Decagon
  • Vapi
  • LangChain
  • Pipecat
  • Ultravox
  • MLflow

Techniques

  • Hybrid turn-taking using acoustic features and neural models
  • Constellation of models with foreground/background architecture
  • Latency masking with smaller models and background deep research
  • Fusing ASR and LLM layers for intelligent transcription
  • Conversational TTS with historical context
  • Preference tuning from expert agent trajectories
  • SOP-based agent behavior definition for non-technical users

Takeaways

  • Voice agents benefit from a hybrid cascaded architecture that balances control, latency, and quality, rather than pure speech-to-speech or naive chaining.
  • Accurate transcription and intelligent turn-taking are critical for voice agent success, especially in noisy environments.
  • Non-technical users should be able to define agent behavior using familiar interfaces like SOPs and natural language feedback.
  • A constellation of models—small foreground model plus larger background model—is a practical pattern for production voice agents.

Transcript (captions)

How do we make cascaded systems feel like speech-to-speech but still have enough controls? >> There's a challenge with controls: since there are so many of them, if you have, say, a developer team within a product-focused company do this, then what I often see with customers is that if you give too many controls and they don't know exactly how to set them up, it's like, oh, this just doesn't work, right? So I think there's always a tension between pre-configuration and giving more flexibility. But if you give more flexibility, then sometimes people ask why the latency is so high for speech generation. That's because we maintain a certain buffer size to make sure that the speech generation has enough context to be prosody-aware, to have the right tone, to maintain consistency in tone, and the higher the buffer size, the higher the latency. But if you let them reduce the buffer size, the speech quality goes down, and then they're complaining, right? So I think that is a general challenge: what is the right abstraction? We should also talk about the general abstraction of agents. Many of the companies building orchestration, like Vapi, or on the text side LangChain, are tools meant for software engineers; you build agents over them. But in customer support, which is more of a cost center for a company, you'll have fewer engineers. The real customer is the operations leader, who is managing the workforce and wants to run the end-to-end operation, but the current interfaces of our existing orchestration systems are not meant for them, right? 
And that is where we have recently released a product called ElevenLabs for support, like Sierra, Decagon, where we define these interfaces for non-technical authors to define the behavior of agents, right? Even before AI agents you would see this. I was at Amazon, and if we have, let's say, 100K support agents across the world, many of them are not full-time, right? They can come and they can leave at any point, which means there's not really an opportunity to train them on specific support. So customer support always had these pages of documents called SOPs, and every time you call someone and they say, oh, give me a moment to look up your information, they're literally reading that, right? Oh, this person is reaching out for a refund, what should I do? I should look for information about Demetrius, then I should check his eligibility for a refund, his past purchase behavior, blah blah blah, and that takes time, right? So these SOPs were written by the operations leaders or customer support managers for human agents to follow, right? And then the second mode of interaction between them would be that sometimes the human agents would not follow those SOPs; they would break the rules or be non-compliant, and the operations manager has a mechanism to listen to calls and then give them feedback, right? So my thought process is: if we need agents to be usable by non-technical authors, which is most of the population who would be using agents, and they still need autonomy over the behavior of the agent, what interfaces do we need to build? So in customer support specifically, can I keep the interface the same, right? 
So instead of a human agent, can I have an AI agent, but if I'm the support manager, can I keep the interface the same, where I still define SOPs just like I used to? Now I can go very detailed and long, because I don't have the human constraint of not being able to read long documents or large tables, right? So I think that can change, but the regular interface can be the same. And then the second is: just like I was providing feedback to human agents, can I also provide feedback to the AI agent? Oh, you provided wrong information there, or you broke this rule, or you did not follow the SOP, or sometimes the SOP is not clear enough, so the agent got confused; even humans get confused. How do we take that feedback and iterate on the specification and procedure? So I think that is quite interesting: how do we keep the interface of interaction for domain experts, who may not be software engineers, the same as how they were working with human agents, right? And if you were to do that, then there's a lot more heavy lifting to be done, which means continual learning, which is a pretty hot topic right now: how do we make these agents improve based on natural language feedback, right? So we can talk about some design patterns for that, how we are doing it, how companies are doing it. >> Why do you feel like voice is harder than chat? >> Yeah, so even chat is not easy, right? Chat has not been solved, and I've spent more time doing chat than voice, to be honest. First, there are no real benchmarks for end-to-end dialogues. Tau-bench is one, but it is so much easier than the real-world situation. So for example... >> Why is that, just because you have so many random sounds? >> Exactly. >> Different accents, that type of thing? 
Yes so I think first if you start with chat dialogues and dialogues are often multi-turn so longer a dialogue there are the chances of failure increases right? And then for chats the challenge is that how does the agent recover right? If you change your mind you change your intent you said something that was unclear I misunderstood right? The agent misunderstood and then you know what are the self-correcting abilities? What are the abilities to ask right follow-up question at the right time? Because if it's not the right follow-up question at the right time people have this historical baggage of not trusting agents right? Like the AI agents or chat agents right? So uh which means that since they don't always trust you have to be very precise like when you should ask a follow-up question to for clarification right? When you think you misunderstood them and like you know pivot. So these variations are present in the text stack and now you add voice to that right? Voice is okay where are you sitting? Like most people are not in such a studio setting right? Like quiet often you know especially in like and and and and another important factor is that voice the distribution of how many what percentage of population interacts with support or any of these concierges through voice is vastly different between US and Europe and let's say India right? Like in India probably 75% of conversations customer conversations happen on voice right? While that number could be around 30 to 35% in the US right? So so the difference is very stark. And now when most people are calling from phone they're usually outside there is background noise there could be multiple speakers present how do you sort of filter that? 
People don't talk enough about transcription. When voice agents come up, I think people want to hear the most beautiful voice from text-to-speech, but in terms of reliability for specific domains like customer support, you need very accurate transcription, right? So for example, if you call me and I ask you for your order number, your email, or your name, and I get any of that wrong and I don't have mechanisms to correct it, or my transcription is not super accurate, then there's no way the conversation will succeed; it just fails. So, like we discussed, in chat agents there are a few degrees of freedom where it could fail, and that significantly increases with voice because of, let's say, transcription errors, background noise, multi-speaker diarization. Turn-taking is still an open problem. >> Yeah, because I heard that with turn-taking you can't just set a number for how long you wait, or say that if x amount of time has gone by, oh cool, now I can talk. So there's like a probability? >> Yes. >> Is it a whole separate model that you're using? >> Yes. >> Tell me about that. >> Yes, so you can have a separate turn-taking model, based on transformer architectures, fine-tuned for the turn-taking task, where you collect data. The data should have enough variation in terms of different accents, but also different pace and intonation; let's say I speak very fast, but someone else could take longer pauses, right? >> Yes. >> So having all those variations across all accents and languages, I don't think those rich datasets exist today in voice for training turn-taking. A large part of turn-taking has moved to these neural models, but I do feel, and we can talk more about that, that the old-school ways of extracting voice features like the pitch and the RMS energy from your voice, and using simpler heuristics over those features, are still helpful. >> Like what? 
In what Yeah, so using these features to determine when to stop, like, you know, when the user has stopped, correct? So largely the turn taking models are transformer based architecture, but I do see like a hybrid as well, right? So because if you look at turn taking models, most of them take speech and text both as input. So imagine in a cascaded world, you speak something, then I transcribe it, then I pass the speech as well as text. It already adds a bit to the latency. >> Yeah. But let's say, you know, in some cases you are your features of voice are strong enough, right? Like where these acoustic features are with a very high confidence I can say that like Demetrius has just stopped speaking. Then I don't even need to like wait until the transcription is over, right? To to to detect the turn. So I think this also helps the hybrid approach also helps in reducing latency end to end. But when you're running those models, it's two separate models you're running, right? Yes, yes. But I think the the old school extracting acoustic feature is fast enough compared to the neural models. And you have something set up so that the acoustic models are going to be the ones that always override. So if that says, "Hey, we're done." Mhm. At a very high sort of confidence threshold, right? Like otherwise you still sort of use neural models for large number of use cases, right? So I think that is one pattern to somewhat reduce latency, right? Like through turn taking. And also have more confidence like using both these approaches. But still a open problem. There are not enough good benchmarks for turn taking open source, something I want to work on once I have more time. And then there are not enough good open source models I know PipeCat has done some work with smart turn V2 model where I think they created this interface to publicly collect conversation data. And then use that to fine tune our turn train our turn taking model. 
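
The hybrid pattern described above can be sketched roughly as follows. This is a minimal illustration, not the speaker's implementation: it uses only RMS energy (pitch is left out), and the function names, thresholds, and the `neural_model` callable are all hypothetical stand-ins.

```python
import math
from typing import Callable, Sequence, Tuple


def rms_energy(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one audio frame (samples assumed in [-1, 1])."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def acoustic_end_confidence(
    frames: Sequence[Sequence[float]],
    silence_rms: float = 0.01,       # assumed silence threshold
    needed_silent_frames: int = 10,  # assumed amount of trailing silence
) -> float:
    """Fraction of the required trailing silence observed so far, capped at 1.0."""
    silent = 0
    for frame in reversed(frames):
        if rms_energy(frame) < silence_rms:
            silent += 1
        else:
            break
    return min(1.0, silent / needed_silent_frames)


def end_of_turn(
    frames: Sequence[Sequence[float]],
    neural_model: Callable[[], float],  # hypothetical: returns P(end of turn)
    acoustic_threshold: float = 0.95,
    neural_threshold: float = 0.5,
) -> Tuple[bool, str]:
    """Use the cheap acoustic check only when it is very confident;
    otherwise fall back to the slower neural turn-taking model."""
    if acoustic_end_confidence(frames) >= acoustic_threshold:
        return True, "acoustic"
    return neural_model() >= neural_threshold, "neural"
```

The point of the design is that the acoustic path can fire before transcription finishes, while the neural model still decides every case where the acoustic signal is ambiguous.
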
But it is often very hard to get datasets from the real distribution, right? Because when I'm talking to customer support, more often than not you will not have access to that data. So many of the conversations on which these models are trained are either synthetic or come from data collection companies that probably pay users to do that. But when both of us are talking and I know what the purpose is, it could change things, right? Like subconsciously. I feel that is a very interesting area: how do you collect this more natural data and use it to tune your turn-taking? Also, it depends on the domain a lot, right? The distribution of how people talk differs for the same person talking to a friend versus talking to a customer support agent versus, you know, maybe talking to an AI agent; the pitch and intonation can change, right? So I feel like fine-tuning on domain data is very useful, and I think that is still possible, because when we are working with some enterprise customer, you can say, "Okay, you know, we can have turn-taking specifically for your customer support." There's a certain pattern: let's say most of your customers are in the US, and there's a general pattern to that sort of conversation and its noise levels, versus, let's say, someone sitting in India or in London, right? So fine-tuning definitely helps on a specific dataset. So I do believe that when we don't have great benchmarks here, creating internal company benchmarks on real data is a lot more powerful than just benchmarking on open source, because those datasets have the challenges I was talking about. All right, y'all. This episode is brought to you by the good folks at MLflow, the open source platform for developers who want to build production-ready AI applications. 
Enhance your AI applications with end-to-end AI observability, all in a single integrated platform. With MLflow's GenAI capabilities, you can evaluate AI applications using a suite of built-in or custom judges, visualize trace executions and agentic analytics, and continuously monitor evaluations, all while tracking every run in one place. Ship better agents faster, you know that's the name of the game. Get started at mlflow.org. We had Zach in here from Sierra uh a few months ago and he was talking about how they're running a constellation of models. You've seen that that's very common Yes. >> Can you break down how many models and what what that constellation looks like? Yeah, so that is true and we can talk about constellation of models in different context. So one of the example was like, you know, for for turn taking was like I said like, you know, one could be a simpler acoustic features based model. Second is neural, right? Now you go to LLMs, right? That is a big challenge because in the easiest thing in chat is that somehow there is a there is a higher tolerance for latency where you're chatting with something and there's like dot dot dot and thinking, right? But in voice that is if that is complete silence, people would just drop the call, right? And I think there are like more historical reasons for that because voice agents or voice chatbots traditionally were like so dumb that people generally don't trust them. So now if you're putting like 2 seconds of pause, they'll be like, "Okay, I don't think I'll I'll get anything out of it." and they'll disconnect, right? So but there's also a need from our customers to use the best models, right? Like if if if Claude Opus is there or Sonnet is there they still want to use that as a intelligence layer. How do you do that? 
And I feel like there's something in between the like dumb cascaded way of just chaining things together and like speech to speech where you come up with like more intelligent patterns of cascaded architecture or whatever you call it. For example, let's say when when two of us are talking and you ask me a hard question, right? Which requires me to think more, right? I would not go completely silent after you ask me, right? I would often either ask you to give me a moment, right? And let's say if it's taking more time, I'll be like, "Hey Demetrius, I apologize, right? But give me 5 more seconds, right?" And that's better. But even better is that you ask me something, let's say about voice agents, right? And you asked a very open-ended question. I had to go go and then do a deep research to give you a comprehensive answer. But can I keep the conversation going, right? Can I start with something very basic about voice agents, what they are, where are they used, right? And for that do I need the most expensive model? No, right? So I think what we often see working with our customers but also in general is like I can have a like a smaller model, right? That is for these more cursory conversations, right? High level chit chat or are basically like turns that require low intelligence, Mhm. where these models keep you engaged and then they they do delegate the the more intense task in background to a to a more expensive model so that now there's no complete silence. I have a model which is trying to be helpful and at some point if it doesn't have enough context of what you're asking about, then it can ask you to sort of wait and you know, still keep the conversation going while the background tool or background more expensive model returns the result. This is a very common model. And now you can have like more than two models in this, right? So I think this is one very common use case of using multiple models. 
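
As a rough illustration of the foreground/background pattern just described, here is a minimal asyncio sketch. Everything in it is an assumption made for the example: `background_model` and `foreground_reply` are hypothetical stand-ins for a slow, expensive model and a small, fast one, and the delays are simulated.

```python
import asyncio
from typing import Callable


async def background_model(question: str) -> str:
    """Stand-in for a slow, expensive model or tool call (latency is simulated)."""
    await asyncio.sleep(0.2)
    return f"Detailed answer to: {question}"


def foreground_reply(question: str) -> str:
    """Stand-in for a small, fast model that keeps the caller engaged."""
    return "Good question. Let me start with the basics while I look into the details."


async def handle_turn(question: str, speak: Callable[[str], None]) -> None:
    # Kick off the expensive work first so it runs while the quick reply is spoken.
    task = asyncio.create_task(background_model(question))
    speak(foreground_reply(question))
    # If the background work is still running after a short wait, ask for
    # patience instead of going silent.
    done, _ = await asyncio.wait({task}, timeout=0.05)
    if not done:
        speak("Give me a few more seconds.")
    speak(await task)


if __name__ == "__main__":
    asyncio.run(handle_turn("What are voice agents?", print))
```

In a real deployment `speak` would stream audio through TTS, and the time spent speaking the first reply is itself what hides most of the background latency.
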
Second is um what I've seen is in customer support specifically that And at Amazon we we went back and forth multiple times where there was 20 2024 was very confusing where like you see these great model releases from GPT or Claude. When you chat with them like they are just like amazing, but then on on task-oriented dialogue tasks like customer support, where I don't really care about the overall general intelligence of the model, but like I want higher reliability on those tasks, there they would like often like the prompt-based approach or a react-style looping method would often fail. So then we would have like a constellation of models again, right? Like where for certain tasks, let's say tool calling, Claude was excellent even in like late 2024. I would probably rely on Haiku, right? But like response generation is something where I want to have more control because I'm building models for Amazon's customer support, which is in across so many languages, impacts probably 2 billion plus interactions. I cannot like let it loose on Haiku and just prompt the heck out of it, right? So there I would use like a smaller sort of fine-tuned model for response generation. Tool calling I will get the best from Haiku. So, I think that's another sort of reason to use a constellation of models. Yeah. Huh. I wonder how many is too many if there ever is too many. Yes, and too many shouldn't be too many in my opinion because very hard to sort of update them, improve them. And So, that was a challenge earlier. And there also we went back and forth even in early 2023 where we were like, okay, we should given that LLMs are here and we were already using a constellation before pre-LLM because generative capabilities were not there. So, for intent detection you use like you fine-tune a BERT style classifier. For response generation you really don't use generative models in production then, right? 
We we would have these like ranking models where you still use the understanding ability of BERT style models, but to keep the dialogue contained and not not say anything. Like we define for each domain, let's say refund from our historical transcripts we only see 500 types of like responses that are possible, right? Given that it's a constrained domain and long tail it's fine. We can transfer it to a human, right? And we ask the model to basically understand what the customer is saying and then like pick the right template and fill that template with Hi Demetrius, right? Right? And and I can put the placeholders and choose the template, right? And then we were using like a bunch of models together. And then we were like, okay, LLMs are here. We should not they are intelligent. They should be able to do sort of multiple things together. So, what we did was our first iteration was fine-tuning of Flan and Mistral on on uh our own data for multiple tasks which was action prediction. These days it's called tool calling, but that back then response generation, intent detection uh dialogue state tracking which is still very common. So, and we would have like bespoke models for each of them and we tried to consolidate them. And then you will see that like there's an interference like if one task sort of um if the model regresses on one task, right? Then I add let's say more data to the post training mix for that and then it ends up impacting my other tasks as well. It's Whac-A-Mole. Yeah, it's a Whac-A-Mole and then, you know, of course, you have to still release stuff in production. So, like we went from a like constellation of models to a multitask to a constellation, but like a smaller constellation. And I do believe in many production system I haven't seen one model doing everything specially for like more complex situation. If it's like a small to medium size business, largely a Q&A use case, couple of tool calls, right? I think I think just one model is just fine. 
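
The pre-LLM approach mentioned above (pick one of a fixed set of response templates, fill its placeholders, and hand the long tail to a human) can be sketched like this. The keyword-overlap scorer is only a toy stand-in for the BERT-style ranking model the speaker describes, and the template names, keywords, and `TRANSFER_TO_HUMAN` marker are invented for the example.

```python
from typing import Dict

# Hypothetical templates for a constrained "refund" domain.
TEMPLATES: Dict[str, str] = {
    "refund_status": "Hi {name}, the refund for order {order_id} is being processed.",
    "refund_request": "Hi {name}, I have started a refund for order {order_id}.",
}

# Toy keyword sets standing in for a learned ranking model.
KEYWORDS: Dict[str, set] = {
    "refund_status": {"status", "where", "when"},
    "refund_request": {"want", "request", "return"},
}

TRANSFER_TO_HUMAN = "TRANSFER_TO_HUMAN"


def score(utterance: str, keywords: set) -> int:
    """Count how many template keywords appear in the utterance."""
    return len(set(utterance.lower().split()) & keywords)


def respond(utterance: str, slots: Dict[str, str], min_score: int = 1) -> str:
    """Pick the best-scoring template and fill it; escalate if nothing matches."""
    best = max(KEYWORDS, key=lambda name: score(utterance, KEYWORDS[name]))
    if score(utterance, KEYWORDS[best]) < min_score:
        return TRANSFER_TO_HUMAN  # long-tail request: hand off to a person
    return TEMPLATES[best].format(**slots)
```

Because the agent can only ever say one of the predefined templates, the dialogue stays contained, which is the control property the speaker was after.
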
Yeah. Yeah. But then, for the various models in the constellation, what are the different use cases of the different models? You mentioned deep research. I imagine there's, okay, you kick off some tool calling. Maybe you want to go and grab some data from somewhere. >> Yes. >> Those are all pretty obvious ones. Are there others that I'm missing? >> It's not just deep research. Deep research is the most common use case for doing that, but sometimes there are just APIs that take longer. In voice agents, if you see demos from different companies, they'll be like, oh, you know, in 500 milliseconds we do everything, right? But in the real world, when you're deploying in production, the APIs themselves take a second or sometimes a couple of seconds, right? So it is not only a constellation of models; often you also use these models to hide the latency of those more expensive tool calls or legacy APIs, right? >> Yeah. Yeah. >> Another constellation is retrieval, right? If you still use retrieval today, for example, I still need to come up with a quick answer. So I will have a fast retriever, which is largely keyword-based or grep-based, to give you a quick answer, and then a more expensive retrieval that does its job more like deep research. >> And so this is something where you would experience it as the voice agent saying, "Okay, let me talk to you about that more." >> Yes. >> And then boom, by the time it finishes that phrase. >> Yes, exactly. Or give you something more useful than that, right? So if you ask me what voice agents are, and I just grep my documentation and search for voice agents, then even with a smaller model I will still have a reasonable answer to give you. >> Yeah. >> So I don't need to ask you to wait. I can give you something and then I'm like, "Hey Demetrius, do you want me to go deeper?" 
But I was already going deeper in the background, right? And then fetch the response for you. So, by the time you say yes, I get enough time to to to basically, like, you know, use that response from a deeper deep research or a more expensive retrieval to to serve And that's where you're masking the latency. >> like latency masking. Yeah. Yeah. Yeah. Oh, yeah. That's awesome. Okay, so, what else? I know there are so many things that >> And then there there are other examples of like having constellation. For example, in support most of my examples are going to be from support because >> That's your knee-deep in it these days. So, it makes sense. >> For example, like if you just do like the cookie-cutter way of building agents where you put a knowledge base and some tools, right? At best it's going to match the performance of a level one customer support, right? Which is like usually not the best quality in a given organization. Every organization has level one and level two. Level two are probably more like permanent employees, domain experts, right? No matter what company it is today, it's so hard to beat their performance even today, right? Yeah. So, one of the ideas we were like, okay, you know, how do we make these agents smarter, get closer to these level? Can we sort of learn from their trajectories, right? Like how are they pivoting in a conversation? How are they sort of taking decisions on exceptions and not? What are they saying to customers that is more reassuring that the customer doesn't disconnect, right? And how do we sort of fetch that context and then also like, you know, give that as an example to our agent to sort of follow that. And so, for that we had like another model that basically processes them because these conversations look very similar, right? But just based on like one or two values of like, let's say, uh age of the account or location of the account, the policies could be very different. So, the model can make mistakes. 
So, we had sort of another sub-agent to like fetch these conversations that are high quality within similar context from these expert agents, give it to an LLM to see if there's something it has to learn from them, right? Pass it to as a context to the main LLM, right? Or our main agent. Um and that was another sort of multi-agent setting where it was helpful to use that. >> this dynamically. It's not like the training. No. You you and we did end up training that because if you the conversation that you sort of retrieve like these trajectories of conversation, the top five are so similar, right? When you retrieve them, that you would even we would not be able to find which is the correct one and which is not according to the policy and the SOP, right? These domains are very compliance or SOP heavy, right? So, you still need like some kind of a like fine-tuning over that to make it understand the difference between the correct and the wrong one based on some minor details, let's say, the the loyalty status of the customer or the tier of the customer or their credit usage or whatnot, right? And for that I think you need to still sort of do some kind of a preference tuning to make the model understand those like very specific differences. Yeah, these variables Yes. are only understood by the subject matter expert. >> Because like two of our conversations could be very similar, but then in one of the conversation probably the location is different or or some attribute is different and that makes all the difference, right? Whether I should be giving you refund or not giving you a refund, right? So, if you don't have like something that that learns to have this sort of discrimination, and out of the box I haven't seen anything working. >> Yeah. That discretion is so important. >> Yeah. Yeah. And for some reason these level two agents have that discretion. Uh-huh. Right? And I think it's still an open problem. 
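
One way to picture the preference-tuning data described above: pair a compliant expert action with a non-compliant one taken in the same context, so the model learns which small attribute differences matter. This is only a sketch under assumptions; the `Trajectory` fields, the compliance labels, and the prompt format are hypothetical, and the actual tuning step (for example a DPO-style objective) is out of scope here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Trajectory:
    intent: str                 # e.g. "refund"
    attributes: Dict[str, str]  # policy-relevant details, e.g. tier, location
    action: str                 # what the agent decided
    compliant: bool             # assumed label from SOP review


def build_preference_pairs(trajectories: List[Trajectory]) -> List[Dict[str, str]]:
    """Pair each compliant action with non-compliant ones from the same context."""
    pairs = []
    for good in trajectories:
        if not good.compliant:
            continue
        for bad in trajectories:
            same_context = (
                bad.intent == good.intent and bad.attributes == good.attributes
            )
            if bad.compliant or not same_context:
                continue
            context = ", ".join(f"{k}={v}" for k, v in sorted(good.attributes.items()))
            pairs.append(
                {
                    "prompt": f"intent={good.intent}; {context}",
                    "chosen": good.action,
                    "rejected": bad.action,
                }
            )
    return pairs
```

Note that a trajectory with a different attribute value (say, a different customer tier) never pairs with these, which mirrors the point that one changed value can flip the correct decision.
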
How do we learn from their behavior and make the current state of agents, which are at best level one, get closer to that? It is also about context capture, right? These days agents are largely working on internal knowledge bases and tools, but the judgment of these experts, a lot of that information is in their heads. >> But do you think they know why? >> Sorry? >> Do you think the experts know why, and could they explain it if they were asked to? >> Yes, and that is something I'm very passionate about: how do we capture that data, right? I don't think they log it anywhere, but with voice becoming so much better, can you ask them to just talk about their reasoning for doing that and make them log at least some of it, and then close the gap through that information, but also by observing their behavior and emulating it? I think that's a very important area of applied research: how do we get this information that's in their heads or in their experience, either by observing their past trajectories or, like you said, by explicitly asking them, while making it easier for them to say it, because agent handle time is a big constraint. If I'm a human support agent, especially an expert one, the amount of money I make depends on how many calls I take. >> Yeah. >> Which means if I'm spending more time recording my rationale, I'm taking fewer calls. So I think the incentive structure has to change there, or there has to be a more natural way of capturing that feedback. >> Yeah, you almost have to give them as much time or monetary compensation for doing this as if they were talking to a human. 
>> There's also a huge tension between the operations org in a company and the org that is building AI agents to make those operations more efficient, because for the operations org, if I want to use the expertise of these humans, it increases their handle time and increases their cost, at least in the short term, while without this information I would not be able to make my agents better. So there is quite a back and forth I've seen across different places. >> Yeah, there's a little bit of a catch-22. >> Yes, yes. >> You know what always fascinates me with voice is how much more willing I am to give extra context. When I interact with someone through voice, I will say and explain much more than if it's with text. >> Yes, I feel that is the real unlock, in my opinion. At a very high level, people say, oh, voice is the most natural way of communication. Of course it is, but how does it help? It manifests in exactly what you said. Because it is a more natural way, I can speak more. I can provide more context, right? I can correct you more. While my patience level when I'm just chatting is so much lower: I will just type the minimum amount of words, and that is going to be ambiguous, and the agent is going to get confused. But in voice, as long as you don't frustrate the heck out of me, I'm happy to provide more context to you. >> Yeah, and I would say that I naturally provide extra without being prompted, because as I'm thinking through it, I'm also talking through it. >> Yes, yes. >> Normally when I write with text, I'm thinking through something, but I'll think so much faster and then I'll type out the summary. >> Yes, exactly. And it's the same for other voice applications, right? This is just chat or a real customer conversation, but in customer support, think of how much time these associates spend writing an email, right? 
That is a significant part of the handle time, sometimes as much as the time they spend helping you out. They have to do some compliance stuff, add information, frame an email, right? And I think LLMs have already helped quite a bit with that part. And now with voice that becomes so much easier, right? If you have to frame that email, you just talk. Or if an LLM pre-populates that email based on the summary of the conversation and you need to correct it, you can just say it, right? >> Yeah, and the thing I've noticed is that I am starting to get really lazy when it comes to typing. I think the wow moment a lot of people have with Wispr Flow is when you can say "oh no, I didn't mean to say that," or when it automatically formats with bullet points. That's incredible, because normally dictation is really shitty. When you say "oh, I didn't mean to say that," it will write all of that. >> Exactly. And that is where I feel there's another area, which I haven't personally explored but which can get important: fusing the ASR and the LLM layer, right? The reason most people don't like that in production is that they want full control over their LLM. But for applications like this, I don't want it to transcribe verbatim everything I'm saying. It needs to be intelligent enough to have the context that whenever I was back-channeling, or stopped for a moment, or said something irrelevant in between, it shouldn't transcribe that, right? You can literally fine-tune an open-source model that takes multimodal inputs, speech and text both, to basically fuse the first two layers, right?
So I know there are companies and some customers doing that to manage latency, but I think it matters for these personal use cases, right? Where you don't want the transcription to be completely verbatim, and you also need to understand the context of the situation before you, let's say, send an email. So you can basically bake the ASR and the intelligence layer together. >> Yeah. I hadn't realized that, but it makes a lot of sense. >> I think there's this voice AI startup called Ultravox. They used to be called fixie.ai, in Seattle. >> Oh, yeah. >> They follow this approach. >> Really? >> Yes, yes. >> I didn't realize they were doing that, because I remember back in the day they were trying to do like a LinkedIn competitor. >> Yeah, I know. From what I know from my last talk, and things change very fast these days, they are now a voice agents platform. But their general thesis was that somewhere in between speech-to-speech and completely cascaded is where you basically fuse two components, right? So you make it less cascaded. But again, it's a spectrum, and then you lose some control, right? >> Yeah. What other use cases do you see a lot of with voice agents? >> Customer support, inbound sales, outbound sales. There are companies building on top of us for outbound sales as well. And inbound is much easier, because if you look at the sales cycle, there are different types of customers. Some are strategic, key accounts, where maybe you want to deploy your best salesperson, right? But then there's a long tail. Maybe eventually you still want a human there, at least with the current capabilities, but take the initial basic conversations. If you look at your LinkedIn or my LinkedIn, there would be GTM folks from some startups reaching out: we are launching this product. Are you interested?
Happy to set up a call with our founder or whatever, right? Like I think that initial call or initial email, that is a very common use case with with voice agents or text agents today where the the first part of the outbound sales cycle is like can be handled by that, which is a long tail of customers. >> Don't you think that would piss people off though if they get on a call and then it's like oh, this is just a voice agent. >> Yes, and that is why I feel like inbound is more common. Inbound is like when customer is reaching out to you that I want to use your product and I want to do some kind of a lead qualification that okay, is Demetrius the right customer? Yeah. Instead of having them fill out a form, you get them on a call and you just walk through Right? So that is a very clear one. Outbound, I know a few startups which are specifically companies built for outbound sales agent. They are building over us and they seem to be doing quite well. So I'm pretty sure there are like they have found a segment of customers who are just fine with this early calls being with voice agents, especially with how natural the voice has become. And to be honest like it was never about talking to a at least for me, talking to a human or to a to an AI. I I'm fine talking to an AI if I have enough trust that it's going to solve my problem, right? Just speaking to a startup founder where they're solving this for plumbing, right? Where and I've seen that problem myself and friends you are you are looking for a plumbing contractor. Today you go on the website, you submit your information, you ask a quote, and then couple of days later someone reviews that and gets back to you, ask for you the best time to talk, and but I want that like thing to be fixed today, right? Especially plumbing, you have some serious problems. >> Exactly, right? So I don't think that works. 
So for plumbing for example, it is like so much easier now and that's a great use case in my opinion where like if I can basically get a quote immediately, right? And it could basically find and I think plumbing also has this problem of like seasonality. So if there is like a layer above plumbing which is like because I many cases I don't I'm not fixated on a contractor. I want a contractor who is charging me like reasonable and is available today, right? So, something that can basically make calls to contractors on my behalf, submit that what I'm willing to pay for it, and then like get me something like tonight or tomorrow, right? >> It reminds me of there is these folks who will run Google Ads for like landscaping businesses. >> But they don't have a landscaping business. What they have is when somebody clicks on there and says, "I want to get a quote." >> They then take that lead and they'll sell it to an actual landscaping business. >> Yes. And then booking and reservation is like super sort of common use case. If you are a restaurant and if I'm a authenticated user, if you get a call from me to book a reservation, as long as you are sure that like I'm the person calling, do you care if I if it's an AI agent on my behalf or it's me? No. No. So, booking from the customer's side. >> Yes. Yeah, booking any appointment, huh? >> Yes, any appointment. Which means in future I do imagine to have like once the whole authentication layer, the payment part is is taken care of and I know companies like Visa and others are building a layer for that for like payment authentication with agents where I do have ability to authorize my agent to take XYZ actions on my behalf. It could be shopping, it could be making a reservation. And I think once and I think it will come like very soon. And then I can have my representation, right? That can make these calls, book these appointments. 
Because today what happens is, especially in SF, I call a restaurant for a reservation and they don't pick up the phone. So I think this solves both sides of the problem, right? The user side, but also the vendor or restaurant side, because there's one person who is serving and also taking calls. They will not pick up the calls, right? But if you have an agent, it's very simple, right? I know how many people are reserved for today and how many spots I have, and then I could basically book a reservation automatically. Or worst case, if that information is not there yet, I could still ask the person to call you back when they are available, so that I don't have to call back, right? So in both cases I feel we are spending significant time on this, and we are so unproductive doing it, that this should already be solved by now. >> Yeah. Dude, this is great. What else do you want to talk about? What are some other things? I know you rattled off a ton at the beginning, and I can't remember everything because you said so much. >> I think the whole cascaded versus speech-to-speech debate, right? That is the most common question that comes from customers, including customers who are more abstracted out or at leadership levels, who don't want to get into the details but clearly feel that when they talk to speech-to-speech APIs, it sounds cool. Why do we need this complex orchestration? Why can't one model do everything, right? So I was talking in an informal setting with an OpenAI researcher, and I think they are still working on how to keep the speech quality good, because the model has to be smaller to meet the latency. >> To be fast. >> Right. It has to be fast. But then we already know from scaling laws, right?
A bigger model is better, right? So I think that tension would always exist. But more than that, even if these speech-to-speech models keep getting better, right? Even with the latest release of, let's say, Opus, it will not satisfy me 100% of the time if I'm an enterprise, which means in some cases I need the ability to fix it, right? Sometimes the fix could just be the prompt, or adding tools, or using a different model, like the constellation of models we talked about. So I think this whole cascaded architecture gives me that flexibility, as long as we figure out the interface. We talked about how there are different layers of users for voice agents today. There are non-technical business owners who want to automate their concierge experience, right? Taking bookings and stuff like that. What is the interface for them? They want us to do all the heavy lifting. No customization needed, right? Then there's a layer below, which is customers who are small but have a few engineers, right? They want a little more control. There we do offer some controls, but I think it's still up for debate what the right amount of control is, right? Because if it is all the control, then there are just too many knobs to change. And if you don't do it right, your experience is bad and you churn. >> It's witchcraft. >> It's witchcraft after that, right? >> Yeah. Yeah. >> The biggest challenge of cascaded is not that you can't make it work like real-time speech-to-speech in terms of voice quality, but with better intelligence. It's more about the number of knobs and the effort it takes to do that. And that is why I feel we see more and more customers who want to control the tech stack and offload the voice orchestration to us. >> Mhm. >> That is getting very common.
And I think there's a spectrum between pure dumb cascaded, which is just chaining them, and speech-to-speech. And there are different patterns, like we spoke about. You can fuse the ASR and your LLM layer. You can fuse the ASR and the turn-taking layer, right? They can be the same model. >> Oh, yeah. >> They don't need to be different, right? Then you have the TTS. A couple of weeks back we released these expressive-mode TTS models. TTS models were already very good, but they were not trained or fine-tuned for long conversations. By long conversations I mean 5 to 10 turns, which is pretty standard in customer support and other areas. These models were just trained on: okay, this is the text, with these emotional cues like laughing or whispering, generate speech based on this, right? But in real-world applications of these models, in a more enterprise setting rather than the creative side of things, I want to make sure that when I generate speech, it has the context of my conversation so far, right? It should not suddenly start laughing when the conversation has been quite serious. So far in this pipeline, all of that was offloaded to the LLM, right? The LLM was responsible for generating a sentence in such a way that the TTS would not laugh: either by using the right emotion tags or by keeping the language such that the TTS model knows this is not a happy moment, right? Conversational TTS is nothing but training it in a different way, where you pass it the final utterance it needs to speak, but also the historical context of the conversation. And that's super important in domains like customer support, booking, and sales, for example, right?
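To make the conversational TTS idea concrete, here is a minimal sketch of a request that carries both the final utterance and recent conversation history. Everything in it is an assumption for illustration: the class names, the `text` and `context` payload keys, and the 10-turn window are hypothetical and do not describe any specific vendor's API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Turn:
    speaker: str  # "user" or "agent" (hypothetical labels)
    text: str


@dataclass
class ConversationalTTSRequest:
    """Hypothetical request shape: utterance to speak plus prior turns."""
    utterance: str
    history: List[Turn] = field(default_factory=list)
    max_history_turns: int = 10  # assumed window, matching "5 to 10 turns"

    def to_payload(self) -> Dict[str, Any]:
        # Keep only the most recent turns so the request stays small.
        if self.max_history_turns > 0:
            recent = self.history[-self.max_history_turns:]
        else:
            recent = []
        return {
            "text": self.utterance,  # what the TTS model should say now
            "context": [             # what was said before, for tone
                {"speaker": t.speaker, "text": t.text} for t in recent
            ],
        }
```

The design choice this illustrates is that tone control moves from the LLM (emotion tags, careful wording) to the TTS model, which sees the earlier turns and can infer that a serious conversation should stay serious.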
So, there are some of these design patterns. TTS models are getting more conversational, taking the full context of the conversation rather than just the immediate turn. ASR models and the LLM, you can fuse together. Then there's the foreground and background approach: you want to use the expensive model, the best model, but you also want to keep it natural. So you have a small model, you can call it a masking model, that is not so dumb as to just say "give me a moment" every time. It's smarter than that, but less smart than the big model. So I feel there are enough design patterns in between, right? They could make it sound as good as speech-to-speech, but with so much more flexibility for when an enterprise customer says, "Why am I seeing hallucinations at XYZ?" or "There's an outage on Google Gemini and I want to change the model now." How do you do that with speech-to-speech? What if I'm using a speech-to-speech API from Gemini or OpenAI and there's an outage? What do I do there? >> You're screwed. >> I'm screwed, right? So I feel, and that's my personal opinion, that for some time now this somewhere-in-between is the right approach, right? Not the old cascaded way of just chaining them. That doesn't work. You will have so many edge cases that it would end up sounding as dumb as the old chatbots, despite improvements in ASR and TTS. So it's based on your appetite, right? Do you have a small enough use case that you can use an open-source model and don't need to rely on, let's say, Claude? In that case, can you fuse the ASR and the LLM? Or maybe the foreground model, the smaller model, you fuse with the ASR, and the bigger model stays separate and is accessed as a tool call, right? That is one way of doing it.
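The foreground/background pattern and the "swap the model during an outage" point can be sketched together with `asyncio`. This is a minimal illustration under stated assumptions, not how any particular platform implements it: `respond`, `call_with_fallback`, the 10 ms masking threshold in the demo, and the fake models are all hypothetical stand-ins for real model and TTS calls.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, Tuple

ModelFn = Callable[[str], Awaitable[str]]


async def call_with_fallback(models: Sequence[ModelFn], prompt: str) -> str:
    """Try each background model in order; move on if one fails (e.g. an outage)."""
    last_error = None
    for model in models:
        try:
            return await model(prompt)
        except Exception as exc:  # broad on purpose: any provider failure
            last_error = exc
    raise RuntimeError("all background models failed") from last_error


async def respond(user_text: str, foreground: ModelFn,
                  background_models: Sequence[ModelFn],
                  speak: Callable[[str], Awaitable[None]],
                  mask_after_s: float = 0.3) -> str:
    # Start the big model immediately, in the background.
    task = asyncio.create_task(call_with_fallback(background_models, user_text))
    # Give it a short head start; if it is not done, mask the latency.
    done, _ = await asyncio.wait({task}, timeout=mask_after_s)
    if not done:
        await speak(await foreground(user_text))  # small model fills the gap
    answer = await task
    await speak(answer)
    return answer


async def demo() -> Tuple[List[str], str]:
    spoken: List[str] = []

    async def speak(text: str) -> None:  # stand-in for a TTS call
        spoken.append(text)

    async def foreground(_: str) -> str:  # stand-in for the masking model
        return "One moment, let me check availability."

    async def primary(_: str) -> str:  # simulated provider outage
        raise ConnectionError("provider outage")

    async def backup(_: str) -> str:  # slower but available model
        await asyncio.sleep(0.2)
        return "Your table for two is booked for 7 pm."

    answer = await respond("Book a table for two", foreground,
                           [primary, backup], speak, mask_after_s=0.01)
    return spoken, answer
```

Running `asyncio.run(demo())` shows the filler line spoken first and the backup model's answer second. With a single speech-to-speech API there is no equivalent place to put the fallback list, which is the flexibility argument made above.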
Second, how do you get a more conversational TTS? You can potentially also combine the turn-taking with this. So there are multiple permutations, and I feel the right developer platform, and we are also improving on that side, would, at least for the core developer layer, give these options, right? Making it easier to implement and experiment with these design patterns. I feel most platforms have a long way to go there. >> Yeah. I haven't heard that from anyone. >> Because it is non-trivial, right? We talked about these ideas, but it's easier said than done. I also implement them, and in this whole async event loop, making all these events tie up together at exactly the same time is non-trivial. So I think the dev platform layer needs to provide this flexibility, and then once people experiment with these permutations, they can choose. Choose within the spectrum, not just between cascaded and speech-to-speech. There are options in between that give them enough trade-offs between, let's say, control versus quality, or reliability versus quality. >> Yeah.
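As a rough illustration of the "multiple permutations" mentioned here, the four patterns discussed (fusing ASR with the LLM, fusing ASR with turn-taking, conversational TTS, and a foreground masking model) can be treated as independent switches. The class and field names below are hypothetical, and treating the switches as fully independent is a simplifying assumption; a real platform may not support every combination.

```python
from dataclasses import dataclass
from itertools import product
from typing import List


@dataclass(frozen=True)
class PipelineDesign:
    """One point on the spectrum between naive chaining and speech-to-speech."""
    fuse_asr_llm: bool
    fuse_asr_turn_taking: bool
    conversational_tts: bool
    foreground_masking_model: bool

    def label(self) -> str:
        # List the enabled patterns in field order.
        parts = [name for name, on in vars(self).items() if on]
        return " + ".join(parts) if parts else "naive cascade (plain chaining)"


def all_designs() -> List[PipelineDesign]:
    # Enumerate every on/off combination of the four patterns.
    return [PipelineDesign(*flags) for flags in product([False, True], repeat=4)]
```

Enumerating the options this way gives 16 candidate designs to experiment with, rather than a binary choice between cascaded and speech-to-speech.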
