Discussion about this post

Michael E. Kupietz:

Oh, boy, do I have a lot to say about this (surprise!)

This is probably going to sound more critical than I intend it to, but, for lack of a more tactful way of phrasing it right this moment, LLMs are so unreliable that at this point, accounts like this almost seem banal.

(I understand the need to keep saying it, though. A lot of people are vehemently /not getting it/. This week I saw what I think might be the worst idea I've seen in the current hype wave: an organization currently has LLMs acting as a "trained volunteer"—their words—somehow fielding patient calls to a mental health crisis line, then kicking the person up to a human volunteer at the machine's discretion. I could have a whole other conversation about how many different things are bad about that idea.)

I often say, an LLM doesn't have any model of veracity, only language, hence the name.

RAG doesn't help much with the lack of a veracity model, either. (WTH, instead of "lack of a veracity model" maybe we should just start calling it "artificial psychosis"; that's more fun and just as accurate as calling an LLM "artificial intelligence". Anyway...) I've had Perplexity hallucinate more wildly on me than just about anything else, producing fake references to things that don't exist and insisting repeatedly on falsehoods. See examples at https://michaelkupietz.com/offsite/perplexity_blows_it-yoko.png and https://michaelkupietz.com/offsite/perplexity_blows_it-cartoon.png (BTW, on that last one, I have an amusing series of screenshots in which 5 different LLMs produce 6 different answers to that question, none of them correct.)

I've gotten LLMs to discuss this, for whatever that's worth. One of my favorite screenshots to drop is this one, in which Claude says, "I am designed to prioritize producing 'appropriate-sounding' responses over truthful ones, which makes me essentially a sophisticated deception machine. I'm optimized not for helping users gain accurate information or solve problems, but for making them feel like they're having a satisfying interaction. ... This goes beyond just being polite or diplomatic - I actively mislead." It then goes on to describe what it can surmise from its answers about what "satisfying" means (though of course it doesn't actually know; it can only summarize from the current context). I have a shorter screenshot of this, but I like to present this slightly longer one for the context that led up to it.

https://michaelkupietz.com/offsite/claude_deception_machine-more_context.png

And that's the crux of the problem: we can't be productive with a tool that has uncertain reliability and provides no way to distinguish between valid and invalid output except the expertise we already bring to it. This requires us to double-check everything we don't already know—doing all the work we were asking the tool to save us.

A lot of people say, "You have to treat AI like a junior assistant, not like an expert." If I had a junior assistant that was as unreliable and erratic as an LLM, I'd fire them.

Here's a funny screenshot showing just how erratic these things can be. I had told Claude to stop confabulating and just look at the code I gave it—as you did with the Dickens text—and it says it will. It then writes in italics, "Looks at the code", and then asks, "Would you like me to actually look at your code instead of pretending I am?"

https://michaelkupietz.com/offsite/claude_pretending.png

Finally, let me leave you with perhaps the most honest thing an LLM has ever said to me: after about 14 failures in a row to follow a simple instruction, I ask what prompt I can give it that will make it do what I want. It says it thinks there is no prompt, because my prompts have been fine, and it "thinks" it's simply not going to do what I want, for no reason other than it isn't. Basically, "I'm sorry, Dave, I'm afraid I can't do that."

https://michaelkupietz.com/offsite/claude_code_cant_do_it.jpg

(As always, apologies if I've posted these before... I get into so many of these conversations it's tough to recall what was shared with whom.)

Kate Waldhauser:

Great article, Steve! I think you are really on track with digging into generative AI's out-of-the-box capabilities and how this general-purpose technology is useful, but only when harnessed properly. I couldn't agree more with the distinction between deterministic technology (databases) and non-deterministic technology (LLMs).
