🐱 Using Cat Pics to Explain a Surprising LLM Pitfall

One of the many mysteries about the workings of Large Language Models is their unexpected sensitivity to what seem like small changes in prompting. As a result, there are many different guides and acronym-laden frameworks for writing effective prompts — including one from me!

Writing a good prompt makes the difference between a meh, run-of-the-mill boilerplate answer and a dynamic banger of a response unlike anything anyone else could get. But why?

Try this. Go to your favorite chatbot and ask it “make me a picture of a cat”. I got this from DALL-E via Bing:

Looks a bit alien, but it is a cat, I suppose. Go ahead, try making your own cat picture now. I’ll wait.

Did I use a good prompt? Not particularly. Would I have been happy with a cartoon rendering of a tiger? An Egyptian feline hieroglyph rendered in black and white? A picture of Wilfred the Cat, internet-famous for his unusual appearance? Most people would agree that any of these would feel weird, in an I-know-it-when-I-see-it kind of way.

This is because LLMs, like image models, are trained to respond in a particular way to underspecification. If you asked a human artist simply “make me a picture of a cat” they’d probably have questions: What kind of cat? You like grey tabbies? Do you have a photo of the cat you want drawn? Quick sketch or more photorealistic? What aspect ratio? Etc. etc.

LLMs are not trained to question us (except when specifically asked to do so).

In response to an underspecified request — and let’s be honest, with the level of detail models are capable of producing in an image or textual response, pretty much all requests are underspecified — they are optimized satisficers. They’re following their programmed rules, and the rules say to come up with something middle-of-the-road. Good enough. Average. Inoffensive to the largest swath of the population. They paper over specification gaps with mediocrity. Any other behavior gets thoroughly trained out of them before release.
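To make the gap concrete, here is a minimal sketch of the two asks side by side. (It assumes the OpenAI Python SDK and a DALL-E 3 model name purely for illustration; the prompt wording is mine, not a recipe.)

```python
# A minimal sketch, assuming the OpenAI Python SDK ("pip install openai") and an
# OPENAI_API_KEY in the environment. The model name and prompt wording are
# illustrative assumptions, not a recommendation.
from openai import OpenAI

client = OpenAI()

# The underspecified ask: the model satisfices its way to an "average" cat.
vague = client.images.generate(
    model="dall-e-3",
    prompt="make me a picture of a cat",
    size="1024x1024",
)

# The same ask with the artist's questions answered up front:
# breed, medium, lighting, framing, aspect ratio.
specified = client.images.generate(
    model="dall-e-3",
    prompt=(
        "A grey tabby cat curled up on a sunlit windowsill, "
        "soft photorealistic style, shallow depth of field, "
        "warm evening light, square crop, shot at eye level"
    ),
    size="1024x1024",
)

print(vague.data[0].url)
print(specified.data[0].url)
```

Same model, same API call; the only thing that changed is how much of the specification gap you closed before the model had to fill it.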

Now, think about what this means for code generation models.

Imagine yourself riding in a self-driving car or airplane. What kind of code would you prefer be running the system? Something middle-of-the-road? Or something crafted?

The architecture requirements of any realistic software system are underspecified. There are simply too many external variables to ever explicitly track them all. (Well, maybe at NASA they come close. Sometimes.) Skilled architects don’t paper over specification gaps with mediocrity — they apply good taste and experience. They know when to keep options open (“things that might change”) and when to lock in to the simplest design that could possibly work (“things that won’t change”). A dash of abstraction in just the right place is a thing of beauty.

One hallmark of a good software architecture is that, by separating concerns and breaking down code into modules/components (among many other techniques), a lot of the code can be (and should be) anti-clever. Vanilla. Simple. Boring. This kind of structure, though, requires careful forethought.
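To picture what that looks like in practice, here is a tiny, made-up sketch (hypothetical names, toy domain): one small abstraction at the seam that is likely to change, and deliberately boring code everywhere else.

```python
# An illustrative sketch with hypothetical names; the point is the shape, not the domain.
from typing import Protocol


class PaymentGateway(Protocol):
    """The one seam where change is expected (providers come and go),
    so it gets a small abstraction."""
    def charge(self, amount_cents: int, card_token: str) -> str: ...


def order_total_cents(prices_cents: list[int], tax_rate: float) -> int:
    """Deliberately boring: no options kept open, because arithmetic won't change."""
    subtotal = sum(prices_cents)
    return round(subtotal * (1 + tax_rate))


def checkout(gateway: PaymentGateway, prices_cents: list[int],
             tax_rate: float, card_token: str) -> str:
    """Plain glue: easy to read, easy to test, nothing clever to untangle later."""
    total = order_total_cents(prices_cents, tax_rate)
    return gateway.charge(total, card_token)


class FakeGateway:
    """A stand-in implementation, which the small abstraction makes trivial to swap in."""
    def charge(self, amount_cents: int, card_token: str) -> str:
        return f"receipt:{card_token}:{amount_cents}"


print(checkout(FakeGateway(), [1200, 800], 0.08, "tok_test"))
```

The abstraction sits exactly where options are worth keeping open; everything else is the simplest thing that could possibly work.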

This is why “vibe coding” is a delightful way to waste an evening but, applied at scale, results in plummeting code quality, a sharp uptick in source control reversions, and a complexity stew that quickly exceeds the model’s own ability to deal with it.

(A wise person once said — and this is still true of model-generated code: Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?)

It’s a newbie mistake to “jump straight into code” without thinking through a problem first. And this is still a bad idea when a tool is generating the code. Maybe even worse. An emergent architecture is still an architecture, just probably not a very useful one, especially for the long term.

If you ever get a chance to work with a software architect in the getting-up-to-speed phase of a project, pay close attention to the nature and kinds of questions they ask. You may find it enlightening.

And if you’re thinking of adding or increasing the amount of AI code generation used in production, make sure you have at least a fractional architect on call. (Like, dare I say, me! Contact info.)

If you appreciate this kind of discussion, you can get it regularly delivered. Go here to find out the #1 mistake that is quietly destroying projects that use AI code generation.
