episode 216
The Apparent Meaninglessness of AI Benchmarks, plus How to Explain AI Opportunities to Others
Every week brings a new AI benchmark. Higher scores. Bigger claims. Louder voices insisting this changes everything. And yet, when you put AI in front of a real business problem, none of that noise seems to help. In this episode, Rob and Justin dig into why AI benchmarks often feel strangely meaningless in practice and why that disconnect is the point. Benchmarks aren’t useless. They’re just answering a different question than the one most businesses are asking.
This isn’t just random conjecture either. Rob walks through what he’s learned building actual AI workflows and why a twenty percent improvement on a leaderboard rarely translates into anything you can feel on the job. They talk about why model choice usually isn’t the bottleneck, why swapping models should be easy if you’ve built things the right way, and why the most successful AI work rarely shows up as a flashy demo. Most of the value is happening quietly, off screen, inside systems that look a lot more like normal software than artificial intelligence.
Rob and Justin also talk about why explaining AI is often harder than building it. The first demo people see tends to stick, even when it’s the wrong one. Consumer AI feels magical. Business AI face plants unless it’s built with intent, structure, and real context. This episode gives leaders better language for that gap, without hype or panic. If you’re done chasing benchmarks and just want a way to think about AI that survives contact with reality, this episode’s for you.
Episode Transcript
Announcer (00:04): Welcome to Raw Data with Rob Collie. Real talk about AI and data for business impact. And now CEO and founder of P3 Adaptive, your host, Rob Collie.
Rob Collie (00:20): All right. Well, Justin, I apologize for being a few minutes late to this recording. I was playing with my Christmas present.
Justin Mannhardt (00:25): Christmas is like 14 days from now.
Rob Collie (00:28): I know, I know. But when you're no longer a child, you no longer have children in the house, and presents come via UPS and FedEx and all of that, it's just kind of like the 12 days of Christmas. You just sort of open the packages as they show up. No wrapping. The wrapping is the Amazon box, or whatever, right? Well, every year my mom sends me dry ice for Christmas.
Justin Mannhardt (00:52): Really?
Rob Collie (00:52): To protect the dry ice, she has them pack it with frozen pizzas.
Justin Mannhardt (00:56): Oh, smart.
Rob Collie (00:57): Frozen gluten-free pizzas that are good and tasty, but the dry ice is the star of the show. So I was up there dumping all of it in the sink and running hot water over it, and making the waves of fog go all over the kitchen and everything, and videoing it and everything. And I'm like, "Oh, shit. Podcast."
(01:19): The oxygen content on the first floor of our house is now lower than it should be, right? All this carbon dioxide is displacing it. We packed coolers full of dry ice one time for a big, big, big cross country trip, and didn't really think about the implications of it. But as you would drive along with the windows closed and everything, slowly the CO2 level in the car kept creeping up, and we started to feel funny. You've got to ventilate when you use dry ice.
Justin Mannhardt (01:46): I feel like I could use some air.
Rob Collie (01:48): Yeah. Anyway, carbon dioxide, a lot of fun, especially when you freeze it solid and run hot water over it once a year.
Justin Mannhardt (01:55): This is a subtle plug for my college days. I was in a band in college. And most of our songs were instrumental. Occasionally we would throw in a lyric.
Rob Collie (02:05): Were you sneaker gazers? Those kind of deep, brooding, stare at your shoelaces kind of musicians.
Justin Mannhardt (02:13): Yeah, more like a hippie jam band.
Rob Collie (02:16): All right, I can dig it.
Justin Mannhardt (02:17): Probably our most famous lyric of all time was the phrase, "If you breathe in the air, you will breathe out the carbon dioxide."
Rob Collie (02:23): Oh, that's deep.
Justin Mannhardt (02:29): It's the best we could come up with. Guitar solo.
Rob Collie (02:33): We are the human race. We were sent to make carbon dioxide. That is our purpose in life. And what do the trees do? They take carbon dioxide, turn it into oxygen so that we can turn it back into... It's like a war. We're at war-
Justin Mannhardt (02:47): We're at war with-
Rob Collie (02:47): ... with the trees.
Justin Mannhardt (02:48): ... the trees. We're not trying to save the environment. We're trying to defeat it.
Rob Collie (02:51): Yeah.
Justin Mannhardt (02:52): Oh God.
Rob Collie (02:53): Didn't I just turn this oxygen into carbon dioxide like five minutes ago? Oh, it's that kind of day. All right. We should probably talk about some stuff. What do you think?
Justin Mannhardt (03:03): Let's talk about some stuff.
Rob Collie (03:05): Well, there's a lot of stuff going on.
Justin Mannhardt (03:06): Truth.
Rob Collie (03:07): Even just in the real-world time that will have elapsed between our last episode and when people hear this one, I don't feel like the same person in that brief window. Because we took a week off of releasing recordings.
Justin Mannhardt (03:25): For Thanksgiving, yep.
Rob Collie (03:26): Right. And then we had the episode where Rui was our guest. Great conversation.
Justin Mannhardt (03:29): Thanks, Rui.
Rob Collie (03:30): So it's kind of like three weeks elapsed since we last recorded. And kind of a lot's happened since then. I don't even know where to begin.
Justin Mannhardt (03:37): Just to put a bold button on it, I feel like I'm a different person from a week ago; so much has happened. On December the 2nd, I believe it was, all the news started breaking about Sam Altman declaring code red at OpenAI because of Gemini 3. Do you remember all this?
Rob Collie (03:56): I kind of do. Remember, I keep myself in a relatively news isolated state for my own productivity. But yeah, I do remember this, yes.
Justin Mannhardt (04:03): So just to give the context, right: Gemini 3 was amazing, and all its benchmarks were really, really good, far surpassing anything that had come before. And so OpenAI, Sam says code red, meaning we're going to double down, we're going to make our models better, we're going to cancel some of these other projects. It's nine days later, and OpenAI has shipped ChatGPT 5.2 today. It happened the day that we're recording this. And it's far surpassing all the previous benchmarks of anything else. So this game of leapfrog that's happening is insane. For nine days, everybody stood back and said, "Ah, OpenAI is doomed." And they might have other problems, right? I'm not going to give them a pass there. But the pace at which these things keep happening is unreal.
Rob Collie (04:52): And it's not a realistic pace for human beings to keep up with while we're trying to build things, while we're trying to refine our wisdom, our skills, our muscles. I haven't touched either of those LLMs. I haven't used Gemini 3, and at this point, anyway, I didn't even know that 5.2 was out.
Justin Mannhardt (05:10): Couple hours. Old news.
Rob Collie (05:12): Yeah. I've been having a very long-term, meaningful, and semi-monogamous relationship with Anthropic. Within that universe, I've been kind of doing both the Sonnet and the Opus thing. And really just not having any problems that would make me go looking for another LLM. That's really the thing. Whatever I'm struggling with or striving for, et cetera, the LLM itself doesn't seem like a crucial impediment. I guess it's one of those unknown unknowns. Oh, well, if you were doing this with Gemini 3, it would take you five less steps.
(05:50): My question is, I wonder how much the benchmarks actually relate to real life. If one LLM outperforms another LLM by 20% on a particular benchmark, can we even tell the difference? Ironically, the place where I have run into the LLM seeming like it could be better is the least serious of all of my agents, and I've been building a bunch in parallel now. The fantasy football recap and preview writer, Tudy, is the one where the LLM's clumsiness continues to be a problem despite all kinds of best efforts.
Justin Mannhardt (06:40): And that's due to the amount of quantitative work or the rigor on understanding the hard data?
Rob Collie (06:47): All told, it's just too much for the LLM to handle all at once.
Justin Mannhardt (06:52): The context is too massive.
Rob Collie (06:53): Everything about it is exhausting for the LLM. There's league background information, like who all the people are. There's league history, who's won titles in the past, and all of that kind of stuff. There's also the nuances and the details of what happened this last week in the league. And then it's got to go produce like a 2,000 word narrative. It's longer than anything that would appear on ESPN.com.
(07:18): And along the way, fencing it in is just damn near impossible because it's a long tail of mistakes that it makes. It knows, for example, that in this other league I was champion in 2023, but then it calls me the defending champ, which I'm not because I didn't win in 2024. I told it that these two guys in this one league are brothers and this other person is their father, and that's in the league history. And it forgets, and calls one of the sons and the father brothers.
(07:50): When I kick this off, I'm as structured as I possibly can be. The whole workflow here builds a raw recap script, like a description of everything that happened that week, so that the LLM doesn't have to go looking. It's feeding it the skeleton and the facts of the recap, as well as league context and stuff like that. I'm using Opus, not even Sonnet. I'm using the big bad apex predator, the expensive big context window, because I did notice the difference when I gave this task to Opus versus Sonnet: fewer hallucinations, fewer clumsy mistakes.
(08:28): But even Opus, I can't keep it fenced in. And I'm into diminishing returns here. If it were commercially important that this 2,000-word recap be hyper accurate, but also creative and entertaining, I could probably figure it out, but that would be like another month of my life to get it there. Whereas in other cases, the things I've been building are delivering real-world ROI, and they're world beaters. It's a complete inversion of what I would have expected, that the fantasy football recap writer would be the most difficult. But once you get close to the problem and you think about it for a while, and you get some experience with it, you say, okay, now I do know why this is the hardest. But going in, I wouldn't have expected it.
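To make the workflow Rob describes concrete, here is a minimal sketch of the pattern: regular code assembles the facts and league context deterministically, and the LLM is only asked to write prose that stays inside those facts. This is not code from the episode; the data shapes are illustrative assumptions, and the call_llm callable stands in for whichever provider SDK you use.

```python
# Minimal sketch: ordinary code builds the "raw recap script" so the LLM never
# has to go looking for facts on its own.

from typing import Callable

def build_recap_facts(league_history: dict, week_results: list[dict]) -> str:
    """Deterministically flatten league context and this week's results into plain text."""
    lines = [
        f"Champions by year: {league_history['champions']}",
        f"Relationships between owners: {league_history['relationships']}",
    ]
    for game in week_results:
        lines.append(
            f"{game['home']} {game['home_score']} vs. {game['away']} {game['away_score']}"
        )
    return "\n".join(lines)

SYSTEM_PROMPT = (
    "You are a fantasy football recap writer. Use ONLY the facts provided below. "
    "Do not infer titles, family relationships, or records that are not stated."
)

def write_recap(league_history: dict, week_results: list[dict],
                call_llm: Callable[[str, str], str]) -> str:
    """call_llm(system, user) -> str is supplied by the caller; any provider works."""
    facts = build_recap_facts(league_history, week_results)
    return call_llm(SYSTEM_PROMPT, f"Facts:\n{facts}\n\nWrite a roughly 2,000 word recap.")
```

The point of the sketch is that the hard accuracy work lives in build_recap_facts, which is ordinary, testable code; the model only gets to be creative inside that fence.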
Justin Mannhardt (09:14): Yeah. It's interesting the experience you've had yourself that you're describing building things, and then you mentioned the benchmarks and their relevance to real world experience. And I'm not an expert in these benchmarks at all. But oh, okay, the LLM got this amazing score on some type of test. And some of these benchmarks seem to try and measure, well, how close are we to this idea of super intelligence, or AGI or whatever? But something I ran into on the web recently was the CEO of Databricks, Ali, I think it's Ali Ghodsi is how you'd say it, said that while everyone is racing at this idea of AGI or super intelligence, the state of AI today already provides companies everything that they actually need. And so it's feeding this narrative like, oh, it's not there yet. It's not there yet. But the reality is we don't need super intelligence. We need things like VendorBot that do a very menial but important activity that has high ROI.
Rob Collie (10:15): Yeah. And the LLMs are well beyond capable of that if you're able to assemble what we're calling the regular Lego brick infrastructure around them to customize them for your business. And that's what we've been finding. So I'm picking on the fantasy football example because it's the one that hasn't worked. I send that out to this fantasy league and people chat about it, and the text thread comes alive and people interact with each other, which is really the whole purpose of the thing, to drive more interaction between these friends. So mission accomplished, it's a success. But I'm looking at it and going, 98% accurate is 2% wrong, and 2% wrong is not acceptable. I don't know. Maybe I should go try... It's an easy swap, right? Instead of calling Opus to write the recap, I can give the same information to GPT 5.2 and to Gemini 3, and see if either performs better.
Justin Mannhardt (11:10): It's a difficult situation, I think, for a lot of people. Just even between the big three players for a consumer level subscription. Google, Anthropic, OpenAI. Which one do you pick? And they all have their different plans, and they have different models within each one. And I was texting with somebody this morning, and they're like, "Hey, are you using this thing?" And I was like, "No, I'm mainly using this other thing." He was like, "Well, are you going to try it?" And I was like, "I don't know. Why should I?" And then take that up to kind of what a company is dealing with, whether it's a small company or a big company, it's like, who are we going to choose to use?
Rob Collie (11:50): Yeah, you keep your options open. And that's the wild thing. If your solution's built mostly out of regular Lego bricks, swapping out Anthropic's magic Lego brick for OpenAI or Google's magic Lego brick is really not that hard. You might quickly discover that, oh, the amount of the system prompts that I wrote that I slowly refined, I didn't realize I was actually optimizing them for Anthropic the way that it understands things.
Justin Mannhardt (12:16): Right. Because you're reacting to the output.
Rob Collie (12:19): Yeah, that's what happened, right? And so I might find that if the fantasy football agent, if Tudy were mission critical, honestly, like right now, the very next thing I would be doing is exactly that. I'd be swapping out for the other LLMs and running some tests and seeing what happens. Because it's not mission critical and I've kind of had my fun there, I'm moving on to other things. But I guess the scientific experimental part of me should go back for science. So maybe I go back into the lab for the good of humanity.
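For readers who want to see why the swap Rob describes is "really not that hard," here is a minimal sketch that isolates the one provider-specific call behind a single function. SDK usage follows the publicly documented Anthropic and OpenAI Python clients; the model IDs are placeholders, not recommendations.

```python
# The "magic Lego brick" swap: only this function knows which LLM is in use.
# Model IDs below are placeholders; substitute whatever is current.

import anthropic
import openai

def generate(provider: str, system: str, user: str) -> str:
    if provider == "anthropic":
        client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
        resp = client.messages.create(
            model="claude-opus-4-5",    # placeholder model ID
            max_tokens=4000,
            system=system,
            messages=[{"role": "user", "content": user}],
        )
        return resp.content[0].text
    if provider == "openai":
        client = openai.OpenAI()        # reads OPENAI_API_KEY from the environment
        resp = client.chat.completions.create(
            model="gpt-5.2",            # placeholder model ID
            messages=[
                {"role": "system", "content": system},
                {"role": "user", "content": user},
            ],
        )
        return resp.choices[0].message.content
    raise ValueError(f"Unknown provider: {provider}")

# Everything upstream (building the recap facts) and downstream (posting to the
# league text thread) stays identical regardless of provider.
```

The caveat Rob raises still applies: system prompts refined against one model may quietly be tuned to that model's quirks, so a swap is a test to run, not a drop-in guarantee.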
Justin Mannhardt (12:48): Well, it's interesting, I was on the phone with a client yesterday, and we were talking about this idea of evaluations. And the benchmark thing came up. And it's like, it's not really relevant to your business application. If anything, it's just telling you, okay, these things are seemingly getting more and more smart all the time.
(13:13): But even at that point, it's like, well, what are you really looking for in the output because you got this probabilistic system and not a deterministic one. And so having a framework to evaluate, do we like this model more than that model more than that model? So far in my experience, you're starting to get into almost like a personal preference territory. There are things where you can tell one model is maybe better at a certain type of task than the other. I don't know if you felt this way, like you swap Sonnet out for Opus, and you feel like there's a little bit of a different person on the other end of this phone line.
Rob Collie (13:50): Yeah. I was talking to, I had dinner with Dave Gaynor a couple of weeks back, and one of the things he mentioned that I wasn't really aware of is that OpenAI, for instance, has got like 100 finance stock trading types that all they do is just sit there all day, every day helping to train ChatGPT models on being better at that domain. He said they've also done similar things with health.
(14:18): And so there is a sense of domain specific differentiation between these tools. You've seen that ChatGPT is running ads on football aimed very much at the consumer market. It makes sense that they're making domain specific investments in things like health and finance and things like that. I know that Anthropic and Claude have also been advertising, but they're advertising more in the same way that Microsoft does or IBM. Just like general brand awareness. And it's still really aimed at corporate customers, at business customers, not at the individual like the ChatGPT ads are.
(14:55): And so Anthropic isn't necessarily investing in the same ways. They're more interested in coding or whatever. I don't know what the relative portfolios are of these two companies, or Gemini for that matter, in terms of where they're making their deeper investments in particular domains. But if the benchmark happens to be heavy in a particular domain, that's going to skew things. Experimentation, I think, is really the only thing we've got for developing our own sort of personal heuristics.
Justin Mannhardt (15:25): I've had measured success in some of the things I've built trying to replicate this idea of a unit test. For example, I can spin up 10 simultaneous chats with the same chat agent, ask it the same thing, and evaluate the responses. And if you do that, you'll see, okay, I 100% love what it did seven times, I hate what it did one time, and the two in the middle, I could take it or leave it. You got to sort of say, okay, how aggressive do I want to be on the system instruction or the knowledge base to try and enforce how often it is going to respond in a certain way?
(16:12): Now that we've got some of these things out in the wild here at P3, I've seen some really interesting examples where two people will ask it effectively the same question, and you can get sort of different responses that are defensible from a logical perspective, but what would you really want to be happening most of the time? That's been an interesting puzzle to try and wrestle with.
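A rough sketch of the "unit test" idea Justin describes: run the same question through the same agent several times in parallel, then review the spread of answers by hand. The ask_agent callable is a hypothetical stand-in for however your chat agent is invoked.

```python
# Sample the same agent with the same question N times, then grade the spread.

from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def sample_agent(ask_agent: Callable[[str], str], question: str, n: int = 10) -> list[str]:
    """Fire the identical question at the identical agent n times in parallel."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(lambda _: ask_agent(question), range(n)))

def review(ask_agent: Callable[[str], str], question: str) -> None:
    for i, answer in enumerate(sample_agent(ask_agent, question), start=1):
        print(f"--- run {i} ---\n{answer}\n")
    # Grade each run by hand: love it, take it or leave it, or hate it. If too many
    # runs miss, tighten the system instruction or the knowledge base and sample again.
```

The grading step stays human in this sketch; the point is just to see the spread of behavior before deciding how aggressively to clamp down with instructions or knowledge.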
Rob Collie (16:34): Yeah. The freeform chat advice agents offer a lot of surface area for that kind of problem. No one really sits around and talks about the accuracy of traditional software. It adds numbers, it gets them right. If it's not accurate, it's a bug, and then you fix the bug and now it's accurate. So it's always accurate. It's either accurate, or it's not in the marketplace.
(16:57): But with AI solutions, there is such a thing as accuracy, and certain use cases are much more sensitive, much more likely to result in inaccurate answers than others. It has nothing to do with the quality of the regular software, the regular Lego bricks you've deployed. It has nothing to do with the quality of the LLM being used. In some use cases, you're just not really at risk. Even though AI is making judgments, AI is doing AI things, you're just not at risk in certain places. But in other cases, where it can literally give logically defensible advice that is 180 degrees apart from one answer to the next, that's not what you want. If you go to a human expert over and over again and ask them that question, they're not going to give people 180-degree-different advice.
Justin Mannhardt (17:48): You might find two different people that have divergent points of view.
Rob Collie (17:51): But then you identify that mismatch in your organization. You sit down with these people and say, "Listen folks." LLMs, they have their pre-training data, the stuff that's baked in, the stuff that it doesn't need to search the web for. It just knows. And then it has all the stuff that you tell it, the stuff that you put in the context window, this little tiny, tiny, tiny, tiny context window. It's like infinitely tiny compared to what it already knows. It's already sitting there with all the world's oceans of knowledge. And then you walk up with your eyedropper, and say, "How about adding this?" And it goes, "No, no, too much."
(18:26): So one of the things that humans still have as a tremendous advantage is that our pre-training data and our context window are constantly being reintegrated and merged. You tell me some new information, I put that into my equivalent of my pre-training data. It goes into the main storage banks. It's not held apart from everything else. And again, some people are more reliable than others. Some people listen better than others.
(18:52): But if you tell someone like, "No, in these sorts of situations from now on, you give answer A and not answer B," that person's going to be able to generalize from that in a way that maybe the LLM will, maybe the LLM won't. There's an art, there's nuance, there's discovery. But really you could harvest these exceptions and these things we're talking about and use them to generate breathless sound bites of like AI's not reliable, AI's not worth investing in. And that would miss the point. It is so stinking useful that those cases are ones that are well worth confronting and working your way through a solution for. It's not the reason to stay away. You shouldn't be hanging back saying, "I got to wait until they figure that out." They're not going to figure that out until there's a complete qualitative change in how they build these models. The trajectory we're on right now looks like the way it looks. And I'm frankly pretty skeptical that these benchmarks are really telling us a whole lot of useful information about how good these things are.
Justin Mannhardt (19:58): Because it's been so gradual over time, it's probably like the way you experience your children, if you have children. My oldest son is nine, and my brain can't comprehend how far he's come since he was three, because I've been with him every single day on that journey. In the same way, I've been using AI, just chat-based general AI, for so long that it's hard to remember how much different or worse it was. Some of the more obvious examples are things like video generation and image generation. These things are so incredibly capable already, today, right now, of doing the vast majority of things we would want them to do within the context of a company. And a lot of those are the tedious, high-human-capital-intensity things that we just don't do because the ROI's never been there.
(20:56): If I could get everybody unlimited access to our CEO's point of view about sales or operations or whatever, AI can do those things, and it can do a very good job at them. Can it split the atom? No. Can it tell you if your vendor's going to miss its shipment? Yes.
Rob Collie (21:17): Let's talk about one more thing.
Justin Mannhardt (21:18): So when we at P3 realized, hey, we're going to make a different type of move into AI, there was a version of us that was like, oh, we're going to become AI powered and we're going to get really good at using AI to build dashboards and data models and reports and data lakes and pipelines and all that. And then we realized, wait, no, we can actually develop AI solutions, like workflows and agents and all these things.
(21:43): And so building that experience and that capability has been a big focus for us. So we've made a lot of progress, but it's been like building the technical capacity to understand the stuff, and what's the context window? What is that? How do we influence that? And so as I've been talking with more and more clients, a capability that we're realizing we need to build more so than we did, I think with Power BI, is the capability to navigate the sheer volume of possibilities.
(22:16): When Power BI was on the scene, it was like, oh, we need a dashboard. We all kind of understood what that is. But when you realize, oh, I could do 100 different things with AI, figuring out how to land those conversations with clients so they're not just paralyzed on the starting line, for me, that's been way more important than it was in the BI era.
Rob Collie (22:38): Yeah. I mean, it's night and day more important. So I was on a call today with a prospective customer. When you start talking to someone and it's your responsibility to help them see the road ahead, it becomes so much more real and in the lane. And it really galvanizes the critical nature of this for you in a way that just talking about it at a distance doesn't.
(23:02): So yeah, I found myself in the headlights today, and it was helpful. So we started to show them, for example, this is one example, we started to show them one of our AI demos. A chat with data demo where it's using our P3 AI chat interface website to interact with a Power BI model behind the scenes. It's not being seen. So we were asking business questions and getting business answers, and it's really, really cool.
(23:27): But I stopped us and said, "Look, this is the first AI demo you're probably seeing that's sort of aimed at just you at the moment. And I do not want you to think that this is what AI is. I don't want you to think that every single AI solution is going to rhyme with this. Every application of AI is going to be radically different from each other. Because in different places, it's going to take a completely different form. And it's not going to be asking questions of data all the time."
(23:57): Eventually in my career, I became partly an educator, like a trainer and an author, and all that kind of stuff. And honestly, when I worked on software, I didn't realize this in the first half of my career working at Microsoft, but in the second half of my career, it started to dawn on me that I was also an educator there too. Every piece of software you build needs to educate the user on what it can do and how it can do it. So I was sort of a trainer at a distance even then.
(24:21): So I realized in that moment that we're showing them this demo, and I put myself in their shoes and I know what's going to happen. They're going to start thinking that this is AI. And their imagination for it is going to be completely anchored to this one style of example. And I'm like, "This is a really cool example. I want you to embrace it. There's a lot of relevance to you. But do not anchor your thoughts here. I want you to keep open because the next thing we're going to talk about is completely different." Even little subtle things like that, such a big deal. Whatever you see first, it's going to carry outsized weight unless you're sort of like preconditioned to think... Because they are, they're all over the map. Whereas dashboards are not all over the map.
Justin Mannhardt (25:04): Not to take anything away from all the cool things we've done with dashboards, but this is the blankest canvas you could possibly stare into. It's an interesting irony. You go back to conversations we've had maybe six, eight months ago, and you talk about how the dashboard put a constraint on the conversation. And now the chat experience is a constraint on the possibilities with AI even. And it's fair. Most really effective implementations of something AI related that we're either doing ourselves or hearing about or working with clients on, a lot of them aren't chat.
Rob Collie (25:46): A lot of them aren't chat. A lot of them aren't going to be sexy demos. They are incredibly useful and move the bottom line, but they're not as visible as you would think. Most of the really smart stuff is happening off screen, and you're only kind of seeing the output of it. And sometimes you're not even seeing that. Sometimes it's just magically lifting the boats. So there's that.
(26:18): Also, with 10 minutes to go before this call today... We've had a lot of conversations about AI with customers, but because I knew the people we were talking to this morning, it made sense for me to be primary. And so this was my first time being primary on one of these art-of-the-possible AI pitch calls. I was like, "What am I going to tell them?" I've certainly got plenty of thought on the topic, but I'm not going to tell them to go listen to 24 hours of the podcast. So I'll just tell you what I told them and get your reaction because, again, I developed this with 10 minutes to go.
Justin Mannhardt (26:57): Inspired thinking usually does cool things.
Rob Collie (26:59): Yeah. I mean, it's stuff that we've already been talking about. It's not like I made it all up. But I told them, "Look, I'm sure you've been hearing two different stories. AI is revolutionizing everything. You're way behind. And then you've been hearing equally breathless takes of it's a fraud, it's a bubble, 95% of projects fail." And I said, "I actually believe that 90 to 95% of projects fail. It's just that we should be paying more attention to the five to 10. And we at P3 have been down the rabbit hole on this, and have really kind of cracked the riddle of what makes the difference between the successful and the unsuccessful.
(27:34): "What makes the difference between successful and unsuccessful at a high level is a very, very simple story. That's good. I can tell you that, and you'll be able to understand it, and that'll help you anchor your thinking about all of this stuff. Now, the details of being in the successful five to 10%, those details are nerdy. Those require data gene people. Those require people who are willing to sit down and collaborate with a machine to build something. And that's not most people. And that's good news for P3, because we have a role to play in this new world. But it's also really good news that the simple story at the top is understandable by all, I think."
(28:14): The simple story I told is that everyone's had this consumer experience of ChatGPT, the off-the-shelf experience, and it's amazing, but then the moment you try to use it for real business, it just face plants. And so you don't even know where to go. You're hearing these vague stories of business success, but how do you even get to there? What do they even look like? That chasm we were talking about, how do you enter that void and get to the other side?
(28:35): The answer is, first of all, you have to customize it. You cannot use chatgpt.com. You cannot use the consumer front ends for this. The LLMs behind the scenes, the ones those consumer front ends like chatgpt.com are calling into, those things are valuable, but you need to have your own experience, whatever it is, your own system that's calling into that LLM, not the consumer interface. You have to customize it.
(29:01): And then the second thing is, as you're customizing it, you've got to feed it the right information at the right time, and/or give it access to information that it can ask questions of itself. Because it knows nothing about your business, it needs to have that kind of access. And so most of what you're doing is building these regular Lego brick, non-AI structures around this magic Lego brick. That's where things start to get nuanced and chewy and technical.
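A minimal sketch of the two-part story Rob tells: your own system calls the LLM API rather than a consumer front end, and regular, non-AI code gathers the business context and hands it over at the right moment. The data-access helpers and the vendor-risk framing (a nod to the VendorBot example mentioned earlier) are illustrative assumptions, not a real implementation; generate is any function that takes a system prompt and a user prompt and returns text, such as the provider wrapper sketched above.

```python
# Regular Lego bricks around the magic Lego brick: ordinary code fetches the
# business context, and the LLM is called once with exactly what it needs.
# The stub data below is for illustration only.

from typing import Callable

def fetch_open_orders(vendor_id: str) -> list[dict]:
    # Regular Lego brick: in real life, query your ERP or warehouse system here.
    return [{"po": "12345", "due": "2025-12-20", "status": "not yet shipped"}]

def fetch_vendor_history(vendor_id: str) -> dict:
    # Regular Lego brick: past lead times, missed shipments, notes.
    return {"late_shipments_last_12_months": 3, "avg_lead_time_days": 21}

def assess_vendor(vendor_id: str, generate: Callable[[str, str], str]) -> str:
    orders = fetch_open_orders(vendor_id)
    history = fetch_vendor_history(vendor_id)
    prompt = (
        "Based only on the data below, flag any open orders at risk of shipping late "
        "and explain why.\n\n"
        f"Open orders: {orders}\n\nVendor history: {history}"
    )
    # The single "magic Lego brick" call; everything else here is ordinary software.
    return generate("You are a concise, factual supply-chain assistant.", prompt)
```

The shape mirrors what Rob describes: most of the code is plain data plumbing, and the LLM shows up once, with the business context it would otherwise know nothing about.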
(29:30): And we talked briefly about context windows on this call because there was one guy on the call who had never heard about it before, but he's the sort that could absorb it. I was really impressed. This was their first really serious conversation about AI where it became tangible for them, and they were keeping up. They understood that story. It made sense to them. I think it was really useful to them. It gave them something to anchor themselves to so that when we had the rest of the chat, they could interpret it. Having a skeleton to attach the story to is super, super, super important to people. Anyway, so successful test of the high level to simple-
Justin Mannhardt (30:06): The magic Lego brick rides again.
Rob Collie (30:10): Well, one other thing that I found coming out of my mouth on that call, I didn't know it until I heard myself say it. I said, "I've kind of reached the conclusion that my career has been building for this moment. I thought forever that my career was about data, but I'm starting to come around to the idea that that was the warmup."
Justin Mannhardt (30:28): Careers have a tendency to do that. It's kind of interesting.
Rob Collie (30:31): I mean, that's the first time I've thought about my career as being fundamentally qualitatively different in a very long time. I was a software professional, then I was a data professional. I've only had those two sort of labels. But I think in the end, I will mostly remember, we're at the very beginning of this journey, but I think it's pretty high probability that I'm going to look back and go, yeah, that was an AI journey.
Justin Mannhardt (30:58): I have a potentially spiked, silly story in response to that. Something that I always assumed was completely useless in my career journey is when I had to get my Azure solution architecture expert level certification.
Rob Collie (31:15): That's a lot of words.
Justin Mannhardt (31:16): Yeah, I know. The certification was required through the partner programs with Microsoft and everything, and so I went and did it. But I had to learn about all kinds of stuff that had very little to do with what we actually did. Yeah, we were doing things in Azure Synapse and notebooks and ETL, but I was learning about pieces of that world that I thought were just completely irrelevant to me. So I've been vibe coding this application that I'm going to be using for myself at P3, and something wasn't quite working right. I understood why it wasn't working because of some BS I had to learn about.
Rob Collie (31:48): Really?
Justin Mannhardt (31:49): Yeah. And so I was able to spot the problem and fix it.
Rob Collie (31:53): Wow. Yeah.
Justin Mannhardt (31:55): For me, a takeaway I had this week, and even this conversation really helped me with it, is keeping myself grounded in the noise. Yes, there's this race of capability that the LLM builders are engaged in. There's this war for dominance at their level. There are news cycles that are concerned about hype and bubbles and all these narratives. But I keep reminding myself that what we have today, and this has been true for a while, we could stop innovation right now, and we have everything we need to make things better. The best use cases don't need to be sexy. They don't need to be flashy.
Rob Collie (32:40): The real advancements in AI right now are the regular Lego brick advancements. The magic Lego brick is magic enough. It passed the magic enough threshold a long time ago. And whatever these benchmarks are telling us, I mean, maybe they matter in some cases. Again, I don't know yet. We'll find out. But building your muscles on how to apply these things, that's the breakthrough. And most of the really game changing things we've seen in AI in the last year have been, even from the AI companies themselves, have been regular Lego brick breakthroughs.
(33:11): Claude Code is a software product wrapped around an LLM. It is a brilliant product wrapped around that LLM, but everything about that product is a normal software product that's making a bunch of API calls to LLMs. It's the same thing. And so the things that really have been blowing people's minds, the things that are really changing the world or signaling some change in ROI and dramatic impact, have been, for the most part, just creative and well-executed ways to involve an LLM in a larger workflow. 99% of our short-to-medium-term ROI in this space is going to be there. 1% is going to be these billionaires one-upping each other over and over again for code red bragging rights, until one of them stumbles on something qualitatively different, which is bound to happen at some point.
Justin Mannhardt (34:08): Yeah.
Rob Collie (34:09): In the meantime.
Justin Mannhardt (34:10): See you next week.