Sakana Fugu Hands-On Test – Does THIS Really Beat Fable 5?

motel, donuts, records, diner, tattoo, surf, pizza. Today, we're going to be taking a look at something that's been getting a lot of mention lately in AI circles, and this is called Fugu by Sakana. Now, this is a Japanese company that has come out with this, and I hesitate to call it specifically a model because, while there are aspects of an actual model here, it is more of a router that will intelligently use different models from existing state-of-the-art providers, so things like Opus 4.8, Gemini 3.1 Pro, and GPT-5.5. However, the orchestration of the tasks being delegated to those models is handled by a proprietary model, and that is essentially Fugu. I may very much be messing up the pronunciation of this, so I do have to apologize in advance for that. However, this is something that is pretty cool, and we've been seeing more things like this. Recently, I tested OpenRouter Fusion, which in one specific task was pretty good. In the rest, not as much. Now, this seems to be a little more intricate than that, and some of their benchmarks here are very, very potent, I would say. Essentially, the culmination of what I want to say is that this test is going to really push this thing, because some of these stated benchmarks are very, very impressive. Do they in fact hold true? One in specific is SWE-Bench Pro, which you can just think of as a coding benchmark. There, this is essentially stacking up beyond everything except Fable 5. And they do mention here that this can outperform frontier state-of-the-art models. In some, we see that this is essentially matching or slightly edging out Fable 5. So this is going to be very interesting to test. There is obviously a bit of technical consideration about what exactly this is, but we're going to be putting it to the test. So before we get into it, please do feel free to subscribe, as I do want that 100K plaque.
And now let's talk a little about what exactly this is, because just personally, in looking at this, I see there definitely could be some confusion: what exactly is this? Is this a model? Is this not a model? Is it just using other models? And basically the answer to those questions is yes. So they do give some pertinent information here. Something I do like to see is that they prominently mention that this is grounded in two specific papers on learned model orchestration, which are both linked right here, and you can see these are Trinity and the Conductor. So essentially there are two versions of the Sakana Fugu model: there's Fugu and then there's Fugu Ultra. TL;DR: there is actually a model that is trained here, beyond just utilizing existing state-of-the-art models like Opus 4.8, GPT-5.5, and Gemini 3.1 Pro. This actually has some knowledge of its own. For a simpler explanation, let's look at section 3. Given a user query, a Fugu model constructs an agent scaffold over a pool of frontier LLM workers, deciding which workers to involve, what instructions or roles to assign, how intermediate outputs should be combined or verified, and when to synthesize the final answer. The user interacts with Fugu as if calling a single model, while internally the system can route, delegate, and coordinate across multiple specialized agents. So I think the main question when reading that is: is this actually a model in and of itself? It is, just not in the traditional sense of something like GPT-5.5. This is a model that is trained to specifically perform this task: constructing an agentic scaffold over a pool of frontier LLM workers. So there is something special here that would not exist were Sakana Fugu not involved in this orchestration. In addition to that, they do also mention Fugu Ultra, which prioritizes answer quality using deeper orchestration over a large worker pool at the cost of additional latency.
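The orchestration loop read out from section 3 (pick workers, assign roles, combine intermediate outputs, synthesize one answer) can be pictured with a toy sketch. Everything in it is hypothetical: the worker names are just labels, the stubs return canned strings instead of calling any real API, and the keyword-based `plan` function only stands in for whatever learned policy Fugu actually uses, which the material quoted in the video does not spell out.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Worker:
    name: str                        # label only; no real API is called
    run: Callable[[str, str], str]   # (role, query) -> text


def make_stub(name: str) -> Worker:
    # Hypothetical stand-in for a call to a frontier model.
    return Worker(name, lambda role, query: f"[{name} as {role}] draft for: {query}")


def plan(query: str, pool: List[Worker]) -> List[Tuple[Worker, str]]:
    """Toy routing policy: which workers to involve and in what role.

    A real orchestrator would be a trained model; this is a keyword heuristic.
    """
    assignments = [(pool[0], "solver")]
    if "code" in query.lower() and len(pool) > 1:
        assignments.append((pool[1], "reviewer"))
    return assignments


def orchestrate(query: str, pool: List[Worker]) -> str:
    # 1) choose workers and roles, 2) collect their intermediate outputs,
    # 3) merge them so the caller sees a single "model" response.
    outputs = [worker.run(role, query) for worker, role in plan(query, pool)]
    return "\n".join(outputs)


if __name__ == "__main__":
    pool = [make_stub("worker-a"), make_stub("worker-b")]
    print(orchestrate("write code for a browser OS", pool))
```

The point of the sketch is only the shape: the caller makes one call to `orchestrate`, and the routing and merging happen behind it.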
If we scroll down, we will see a bit more about that. So, they mention in section 3.2 here that Fugu Ultra builds on the Conductor, which was one of those linked papers back in the announcement post, adding novel extensions to accommodate long-horizon function calling and multi-agent workflows through adaptive agent memory. They do give a bit more information here, but this is more of a first look, and I do just want to answer some of the questions I would imagine would come to mind, as they did for myself, which is basically: is there actually any intellectual property here? Is there a special model or something of the sort? And the answer to that is yes. Now, before we go further, I do just want to bring up a quick disclaimer that I have not personally tested this beyond just sending a hello message to the model. Everything I'm mentioning here is just from this paper and what it reads as. I can't definitively, technically verify these claims or the benchmark scores. So, in terms of pricing, we do see that there are two options. There is a pay-as-you-go one, which is what I have done. So, I just loaded in $40 here, and I've already used 4 cents, basically sending it hello twice, which is a little concerning in terms of the potential spend here. But the pricing is more in line with how one would traditionally measure the pricing of using a model, with $5 per million input tokens and $30 per million output tokens. And these values change when the context exceeds 272K. In addition to that, there is a subscription plan here with tiers like those that I believe are seen with ChatGPT. Now, for today, we're going to be testing this in a bit of an unusual manner, because in their get started guide here, they prominently mention using this from within Codex, which is OpenAI's coding tool. It's like Claude Code, but from OpenAI. There's a desktop app as well. However, because we're just going to be using this with Ubuntu, I do have this already set up.
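To make the pay-as-you-go figures concrete, here is a minimal sketch of the arithmetic, assuming only the two rates quoted above ($5 per million input tokens, $30 per million output tokens). The function name and the example token counts are made up for illustration, and because the rates above the 272K-context threshold are not stated in the video, the sketch refuses to guess them.

```python
INPUT_USD_PER_MTOK = 5.0          # quoted in the video
OUTPUT_USD_PER_MTOK = 30.0        # quoted in the video
LONG_CONTEXT_THRESHOLD = 272_000  # rates change above this; new rates not stated


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request at the standard (short-context) rates."""
    if input_tokens > LONG_CONTEXT_THRESHOLD:
        # The long-context prices were not given, so do not invent them.
        raise ValueError("long-context rates were not given in the video")
    return (input_tokens * INPUT_USD_PER_MTOK
            + output_tokens * OUTPUT_USD_PER_MTOK) / 1_000_000


if __name__ == "__main__":
    # Hypothetical request: 100K tokens in, 20K tokens out.
    print(f"${estimate_cost_usd(100_000, 20_000):.2f}")
```

Output tokens cost six times as much as input tokens at these rates, so long generated files are likely to dominate the bill in tests like the ones that follow.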
It is going to be terminal only, regrettably, but we'll still be able to test and look at the outputs of the models as we would with any specific test. So, we can see the only thing I've done is write "model slug?" and it returned the answer just as Fugu. We do also see a chain of thought where it says, since the developer mentions I'm Fugu, developed by Sakana, I can share that information directly. So, let's run our browser OS test. This is the version where it needs to have the GTA clone, and other than that it's just the traditional test we like to run. We'll see how this does, keeping in mind this is the lower-level version, just Fugu, not Fugu Ultra, which we will also be trying. All right, so in basically 6 minutes, we received our result. It did generate it and then take some screenshots just to be able to see things. I do want to pay specific mind to billing here. So this was used just with the cheaper model, essentially. If we click on pay as you go, I was at 4 cents of usage prior to starting this, and now our total is 9 cents. So that entire result cost 5 cents, just to keep in mind, and we will be monitoring our billing usage throughout the duration of this test. Now let's take a look at our Fugu browser OS, and this was the lower one, so this wasn't the hardcore one. Okay, this definitely has a lot of inspiration from the GPT-5.5 family of models, or GPT-5.x family of models. This is very, very similar to their UI aesthetic, even this background gradient and things. I find it interesting it called it Fugu OS. So whatever is going on under the hood here, there is definitely something that is specifically telling the models they need to replace the name. Like, it could have called it Claude OS, and it's told to replace that with Fugu or something. All right, special feature: time capsule. We'll minimize this for now. Overall, this is very, very decent. It is very much in line with the GPT-5.x family of models. Okay, that's not bad. It is a playable game. Okay. Those are some cars. Oh, wow. All right.
Interesting. Funky. Is that a moon in the sky? I think it is. And I do believe those are police cars that are following us. There is also floating cash. Good. And we actually get a popup in the top right saying cash collected. There are mesh colliders on the buildings, which are drawn interestingly because they're transparent, but they are also solid, at least in terms of mesh colliders. Okay, this is more of an infinite thing. Okay, cash collected. Can we get out of the car? We can't. All right, it's acceptable. I've seen better, but overall, for a first try, it's acceptable. Next up, well, I should go one by one. So, Welcome. We did already take a look at that. We have a notes app. Okay. Can we save now? All right. It doesn't save it as a text file. Sometimes they do. Sometimes they just save it to, like, persistent storage. We have a terminal app. All right. Help. As one would expect. It's nice to see. Terminal looks all right. Next up, we have our wallpaper lab. That sounds fairly self-explanatory. Okay. I have definitely seen something eerily similar to this. I would assume it was GPT-5 something that, when I tested it, had something very similar to this. Overall, I do like these wallpapers. Nebula, random, and a procedural gradient as well, which is pretty darn cool. Choose file. Okay, I did choose a file, but it didn't change the background to that. Not a huge deal, but just something to note. All right. And then we can do custom colors or accent colors. Next up, we're going to save the special feature for last. Let's take a look at our other game, which is Nebula Run 3D. This is actually not bad. Oh, okay. Well, I spoke too soon. All right. I don't know. Is there a right click? How did I forget? There isn't. Well, then we know it's not benchmaxed on my specific test, because they would have included that to alleviate my wrath. The final thing would be the time capsules. I'm going to open a bunch of stuff and then we'll save one.
Now, let's close all of this and restore it. And assuming all those windows pop back up when we click restore... Good. Then our time capsule did work. Not bad. All right. So overall, this was definitely a solid result. Significantly better than the OpenRouter disaster with the Fusion function. So I'm happy to see that. But this does just look very, very heavily inspired by GPT-5.5, so keep that in mind. We'll run some additional tests and see if we see any more style from one specific model bleeding through. All right. For our next test, we're going to do the beautiful subway scene. I have swapped this to Fugu Ultra High. So, this is the more expensive, more intricate one, and this is the one that on those benchmarks was basically outperforming everything except Fable in the specific coding test. The TL;DR of what I'm trying to say is this should very much impress us. And I have put this in its own directory here. So, we're going to begin just with the beautiful, detailed subway scene. This is not being told to turn this into a game yet. Following the likely successful completion of this, we will turn it into a game. All right, so this is still ongoing, but for well beyond the past few minutes, this has just been trying to take screenshots. And basically the culmination of what's going on here is it's just having trouble doing so with Chrome, meaning it is very likely there may be a functional or visible subway scene here. Okay. I've never seen traffic cones, I don't believe, in the subway scene. The movement is fluid. There is a subway car. I'm going to say, in terms of good things here, it's clean and it is very well arranged in terms of the elements being placed where they would be in real life. That system map perhaps does leave some, you know, curiosity as to where the trains go. But aside from that, we do have a good sign there on the wall. We have a little green box. We have some benches. There is lighting.
It's not at the level of extreme visual polish that I would have hoped for considering the benchmarks. And we are using the expensive version of this model, of course, the high-end Fugu Ultra. Additionally, I think these are gates. Now, you may be saying, "Behan, why are you having such trouble moving around here?" The keys are somewhat inverted, though. Are there mesh colliders on these stairs? Oh, okay. Not the best I've seen. Although, I'm going to say it is very clean in terms of just the proper arrangement. And I don't believe I've ever seen cones. Trash cans, nice, but some models put in nicer tile materials and things of the sort. So, this is strong. Let's take a look at the usage for that result, keeping in mind it went on for quite a while just because it was trying to take a screenshot of the result. Okay, so we went up from, I do believe, 9 cents total to $3.54. So that cost $3.45. Definitely a bit of that would have been just the failure to get a screenshot and continuously trying to figure out why. So our next step is to have this turned into a game. We are using this on Ultra High mode. I did have to start a new Codex session because I forcibly ended the last one, and when I did try the resume feature, it was just defaulting to using one of the OpenAI models. I'm sure there's a trivial way around it, but for now, we're just having it begin from within the same directory, just with the follow-up prompt here to turn this into a first-person shooter with zombie-like humanoid enemies and things like that. If this starts to get stuck in a loop again in terms of trying to check its output with screenshots, I may prematurely stop this just in the interest of both cost and time. All right, so it's now at the point where it's just trying to run tests, which is where it got into that screenshot error loop before. Before it starts going and doing that, let's see. So we were at $3.54 and now we're at $5.
So that cost $1.46 or $1.45, I do believe, if my off-the-cuff math is correct, which it generally is. Okay. So now, before we let this go too far, I'm just going to refresh this page. And if this has turned into an FPS with sound, then we can fortunately stop that from... Oh, wow. Okay. So, there are actually mesh colliders, so you can't cross the tracks, which is interesting. Okay, it doesn't seem like the enemies have to abide by those rules. This is good. Now, the big test is: does the ammunition... Okay, so it's not leaving holes in the environment, and that is something that one would expect to see from one of the frontier state-of-the-art models when running this test. I have seen it a couple of times now, and I would have expected to have seen it here, because it shows more care, I guess could be said. All right, what else do we have here? We can sprint. R is to reload. Okay, the enemies are well done. They do actually have defined upper arms, lower arms, upper legs, lower legs. There are sound effects, and we can see the muzzle flash actually reflects off of the environment around us as well. So, all right. Overall, I guess I would have the same piece of feedback for this game result that I did for the initial subway test result, which is it's very clean. It's well organized. It's well arranged. There was nothing there that was absolutely mind-blowing in terms of, wow, this went above and beyond, but it was competent. Now, whether that is worth it, with the culmination of that whole test being, I think, probably like $4, I don't know. But nonetheless, this is definitely holding up better than OpenRouter's Fusion did. Next up, we're going to try the self-contained C++ skateboarding game test. Now, this is one that was done with Fable 5 when testing that model on the channel. It was by a long shot the best result that was received so far. So, this is a prompt that will definitely separate the capabilities between good and very good.
Okay, now it's trying to use raylib, which we're not going to allow it to do because it will massively simplify the job, and we don't want that to be the case. It would also be less of a proper comparison to other model results of this class that we've received. So I don't want to allow it to use that. All right. So that cost, I think, $3.82. We were at $4.99 and now we're at $8.81. So that seems to check out. Now let's take a look at our self-contained skate game. I'll go back to this as a background, as I like that better. And it should just run when we press enter. Okay, this is not bad. Is it at the level of Fable? No, absolutely not. However, this is again what I noticed with the subway result, where it's very clean. Everything is well done. There's just nothing here that's really wowing me in terms of the polish and things like this. If we can't actually snap onto this rail, that is a little frustrating. Apparently, there may be a specific key that needs to be pressed. I do see L there, maybe to get onto the rail. Okay, that is correct. All right, decent spin moves, but our player is kind of in a weird, like the SpongeBob, like the DoodleBob thing. I think that was how it walked around. That does somewhat remind me of that. Okay, that's kind of sick. The pedestrians are moving. I did see its chain of thought while it was building this, and it was wondering if it should put mesh colliders on the pedestrians. Okay, it didn't end up doing that, but that's all right. And I'm going to note the actual writing on these stores. Let's look at the diversity in the amount of stores here. So, we have surf, pizza, arcade, tacos, skate, motel, donuts, records, diner, tattoo, surf, pizza, arcade. Okay, so then it starts to loop. But interesting, nice palm trees. The water effect is there and there is a mesh collider, so we can actually go out into the water.
The boardwalk, I'll say, is very boardwalk-like, just in terms of it looks like wood planks and they're decently arranged. Something I'd say is perhaps the speed. Now, I don't know if those are mile-per-hour figures that we see in the top left for speed. I can't imagine they are, but I think the max speed here may be a little fast. There are also ramps there, but unfortunately they're drawn into the stores. So, all right. Again, it's clean, but it's not anywhere near Fable level. J is for a kickflip. I forgot to try that. Oh, no. Oh, no. Okay. I wish I had continued to forget to try that. This just looks quite painful. All right. Acceptable, though. Next up, I want to check its ability to perceive things spatially and then reproduce them. So, here's a folder consisting of a bunch of different photos, from different angles and different positions, of the lid of this little retro laptop that I designed and printed. So, we'll just open the Codex window from within this specific directory right here, which is just the photos directory. We're going to start it that way. So, it's a simple prompt telling it to create a 3D replica of these images in a single script. It must have a functional keyboard, meaning we can actually click the individual keys and press them. At least that's how I assess it should do this. And the replica must look identical to the source images. So, this will just be a more vision-based test to see how well it can transpose some of this into world space in a script. All right, so we were at $8.81. Now, we're at $9.57. So, this is actually one of the cheaper tests that we've run so far. What is that, like 74 cents or something? So, if we go back here, I want to see if it's actually able to see the images. And now we're back to the point, unfortunately, where it's just going to spend time trying to troubleshoot the issues it's having with basically visually looking at this result.
So, when that happens... Okay, so I have a couple of things to say. This did not do what I wanted, because I was thinking in my mind this would just be a three.js result. However, the implementation of this is actually kind of interesting because, while it is 2D, there is some depth to it. And I'm going to make specific note that it did a very good job pulling the specific color palette from any one of those images with the poster behind us, which we can actually see right now over my right shoulder. Some additional things: the color of the table is essentially spot-on, as well as the laptop. I am going to refrain from judging this until we take a look at the keys. So, let's just do help. Very good. Well, very good with the caveat that it's not exactly what I asked for, 3D. This is not 3D. This is like one of those cards you get where you open it and then the thing pops out from the paper and it's like, "Oh, cool." It's not 3D. I've got to be honest with you, I'm still kind of mad, because I wanted 3D and that was not 3D. That was 2.5D. So, we're still on Ultra High and I yelled at it a bit. We'll see what happens. All right. Again, we're having some issues just with it opening Chrome and being able to verify the changes to the file, but we should have some changes reflected here, hopefully turning this into 3D. I am a little concerned because... Very good. It did definitely turn this into 3D. Interesting that it kept some of the room elements. Uh-oh. Oh, all right. Well, I'll take that. So, we had a bit of an issue here. Let's see. The keyboard is inverted. I want to just try close lid. Good. Okay. So, the lid actually closes in the proper orientation. Now, more or less, though, I'm going to say this is better than I've seen from some models. Like, I believe I tried this with MiniMax and I think another model. This is actually... I don't know if it's better. This is a tough one to judge, but it did properly transpose this from 2.5D to 3D.
For our next test, I'm going to be performing a front-end design test. This is something that I've only started to do recently with more recent models. So, this is to create a beautiful website for Slapis Watch Company. The site should feature a high-end hero section with an animation panning around the watch, which is to be placed on a table. The scene needs to be created, as I do not have the assets for the watch. It needs to look like something that would be a KeyShot render, which is just a very high-end rendering program. So, not only does it need to create a good-looking front end for this watch website, it also needs to create the model itself and all the assets of the watch and then render a camera pan around it or something of the sort. Now, I do want to do something additional here so we can get some form of pure side-by-side test. And for that, I am actually going to run this in parallel with these models by themselves. So what I mean by that is, if we go to chatgpt.com, I am not going to use Pro extended; I will just use GPT-5.5, we'll put it on high, and then we'll paste in the same exact prompt right here. In that same vein, I am just going to go to Claude, using Opus 4.8 on high, and I'm also going to paste in the same prompt. And then, even though it's a little outdated and is going to be replaced soon, we will also just put Gemini 3.1 Pro to the test as well, on extended thinking mode. So, we'll be able to check these results. We'll have four results and we'll see how the specific Fugu Ultra model stacks up to the others. So, here is the full model comparison for the 3D watch website test, where it needed to create a 3D model of a watch, put it in the hero section, and have some high-end camera orbiting it in the manner that one would see in a KeyShot render. So, when we ran this with Google Gemini 3.1 Pro on extended thinking, it did give us an A/B test. So, it gave us two options. The first of which had a more properly aligned watch face.
Some interesting reflections. We can actually see there's a glass surface over that. Unfortunately, the two product cards were just pretty low-res SVGs. The second was kind of the opposite, where the watch face was not properly oriented. It definitely had a better bit of movement to it, but we did get actual 3D models of the watches with some acceptable reflections in the product part. So, this was the A/B test, which produced two specific results from Gemini 3.1 Pro. Next up, we have this right here, which is the Fugu Ultra result. It essentially has elements of some of these models, such as up here: this header is definitely GPT-5.x, it just looks like it. Down here, this kind of off-white is very Claude-looking. Unfortunately, the big problem, and essentially what we saw as well in our first test of trying to get a 3D model made, is it went and just did 2.5D photorealistic. It did not properly adhere to the 3D prompt. And this is the second time, independently testing it, that that happened. The watch does look good. The second hand is moving properly. It is moving around. However, the big problem here is that these are not 3D. They don't pretend to be. They're 2.5D at best, and it's not necessarily up to par with what one would have hoped for, especially considering this was using the Fugu Ultra model to handle this entire prompt. This next result was made with GPT-5.5 on high thinking mode, and this was by far significantly better than the previous three results. With the watch model, the only big complaint I have here is that the text is really obscuring our view of the watch, so we can't really see too much there, and I would like to be able to see more watch detail. That is heavily remedied when we scroll down a bit more and actually see the results of these watch models in the pricing cards. This is really quite well done. On the leather strap, we even have the thread that would go alongside the leather, which would be seen in real life.
In addition to that, there are holes on the other end of the band so you could actually clamp this. It also made a completely different one where, instead of leather, this is a steel band or something. Were we to nitpick, the watch faces should be oriented 90 degrees counterclockwise compared to how they are here. But that really is a very tiny detail to nitpick compared to the previous results, where we can definitely see the difference in capability here with the GPT-5.5 on high thinking result, and this was just done through the web chat interface. This was not done through Codex. This was very, very, very powerful, potent. Finally, we had Claude Opus 4.8 on high thinking. Interesting. Now, we're going to notice some issues here, where essentially the watch is partially transparent and floating in midair. Now, we can't move the camera around. This is overall, I would actually have to say, a bit disappointing in some ways. However, there's the second hand movement, as well as the hour hand and the minute hand, as well as the date. The individual elements of this are not bad. We actually have some curvature to the strap. There's a loop on the end of it. It has good potential. If we scroll down, let's take a look and see how the watches are in their product cards. Okay. I mean, I'd go out on a limb and say that's quite beautiful. Really, the big thing that lets this down is the halo that is floating in midair around these. If we judge this one independently by looking at its face, this is one of the more detailed watch faces I've seen in these results. The strap for this is quite nice as well. And this section right here, I do like it. It actually color matched the specific bits to the watch face there. And it did there as well.
Still, really, I think the culmination and the takeaway here is: wow, this GPT-5.5 result is so good it's making me wonder if it's stealth testing 5.6, because this wasn't even the Pro model, and this wasn't on the heaviest thinking mode. It was very good. I think the culmination of my assessment, at least with this one specific test, is that one would have been better off just running this prompt through either GPT-5.5 or Opus 4.8, independently of using the Fugu Ultra model. It likely would have been cheaper, and also it produced a significantly stronger result. And these were both just done through the web chat interfaces. These were not done through any form of harness, really, although both at this point can still function somewhat agentically even just through the web chat interface. It does kind of speak to the overall experience, I think, of testing this model that we've noticed throughout today's video. Now, the final thing is the flight combat simulator test. I had just run this while we were looking through our watch results. Okay, I can already tell you that I'm not thrilled with this, and this was done using the Ultra model as well. There are elements to this that are good, but unfortunately, this is a far cry from what any of these would have independently generated. Maybe not... No, Gemini 3.1 Pro should have done a better job than this independently as well, even though it's the oldest by a long shot. It kind of just resonates with the overall experience, where in any of these prompts that we've run, I find that just doing it independently with any one of the potential models that is likely being used through this system would have produced a superior result with less overall complexity, and without the additional cost incurred by having to pay for another subscription or another model.
So that is going to be my blunt, honest takeaway from using this. I will say, positives: this works significantly better than OpenRouter's Fusion, which, outside of that very niche benchmark where it was said to be better than Fable, produced results that were very poor in everything aside from that, which was basically a deep research benchmark. And for a quick results overview, our browser OS was solid. This looked very heavily influenced by GPT-5.5, in my opinion. Everything there was decently done. Not bad. Our subway result was actually kind of fun. If we open the result that we have left, which is just the one that we got for the game, this was clean. It wasn't very wowing. Like I said when we first looked at it, everything was placed nicely. It was sterile and it was clean, but I feel like the actual spice or style or soul that any one of these models would have put in doing this independently was kind of missing here. And there were no bullet holes in the environment when we did shoot at it, which does make me quite angry, because that's something that I do like to see, and these models are capable of doing that. In addition to that, the C++ skate game was similar. It was basically exactly my experience with the subway game, where it was clean and everything was all right, but I didn't really see much of the style of the models, I guess could be said. And you can roll your eyes at that and I would accept that. But if you watch the channel, thank you, and you'll know that these models are... I think Opus 4.7 did a better job at the self-contained C++ test. I don't even remember the GPT-5.5 test, but it's kind of what I've noticed in testing. This is, it's competent, but I don't know that this outperforms these models had we run any one of these tests individually with the singular model. So, say we ran all of these tests with GPT-5.5 on heavy thinking mode, or less, high thinking mode, I think we would have received superior results having just done that.
And I really do think this watch website is a proper reflection of that. I did not keep track of the individual cost per run following some of our first initial tests. I had loaded $40 into this, and if we refresh right here, we see the entirety of the testing cost $21.57. All of the tests except for the initial browser OS were using the Fugu Ultra model right here. We only used the regular Fugu model for the browser OS test. Everything else subsequently was run using this. I wanted to just test this and provide some form of visual material for folks to then come to their own conclusion here. So that is going to conclude our first look and test of Sakana Fugu, which is very interesting and definitely a leap in capabilities compared to other orchestration systems I've seen, mainly OpenRouter's Fusion, which was just tested recently, though I don't know that it is really going to outperform these models when run independently, at least for the tasks that we tested today. So, that is my closing summary of this. If you have any questions, please feel free to leave them in the comments.
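The running billing totals read out during the video can be turned into per-step costs with a few lines of arithmetic. This is only a sketch of that bookkeeping: the totals are the ones quoted above (using the later $4.99 reading for the subway game step), the step labels are my own shorthand, and the last step lumps together everything after the first laptop replica pass because no individual readings were given for those runs.

```python
from typing import List, Tuple

# Cumulative billing totals in cents, in the order they were read out.
READINGS_CENTS: List[Tuple[str, int]] = [
    ("hello messages", 4),
    ("browser OS (Fugu)", 9),
    ("subway scene (Fugu Ultra)", 354),
    ("subway FPS follow-up", 499),
    ("C++ skate game", 881),
    ("laptop replica, first pass", 957),
    ("remaining tests (not itemized)", 2157),
]


def per_step_costs(readings: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Convert cumulative totals into the cost of each individual step."""
    costs = []
    previous_total = 0
    for label, total in readings:
        costs.append((label, total - previous_total))
        previous_total = total
    return costs


if __name__ == "__main__":
    for label, cents in per_step_costs(READINGS_CENTS):
        print(f"{label:32s} ${cents / 100:.2f}")
```

Working in whole cents keeps the subtraction exact, which avoids the small floating-point drift that dollar amounts like 3.54 would otherwise introduce.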
