There have been multiple things which have gone wrong with AI for me but these two pushed me over the brink. This is mainly about LLMs but other AI has also not been particularly helpful for me.
Case 1
I was trying to find the music video from where a screenshot was taken.
I gave o4 mini the image and asked it where it was from. It refused, saying that it does not discuss private details. Fair enough. I told it that the artist is xyz. It then listed three of their popular music videos, none of which was the correct answer to my question.
Then I started a new chat and described in detail what the screenshot was. It once again regurgitated similar things.
I gave up. I did a simple reverse image search and found the answer in 30 seconds.
Case 2
I wanted a way to create a spreadsheet for tracking investments which had xyz columns.
It did give me the correct columns and rows but the formulae for calculations were off. They were almost correct most of the time but almost correct is useless when working with money.
I gave up. I manually made the spreadsheet with all the required details.
Why are LLMs so wrong most of the time? Aren’t they processing high quality data from multiple sources? I just don’t understand the point of even making this software if all it can do is sound smart while being wrong.
LLMs are not designed to give you objective factual answers. They’re designed to guess what you want to hear, like a middle school student writing a book report for a book they never read.
I don’t think it considers what the user wants to hear. It is concerned about what the data it has trained on would consider a logical answer.
What the user wants to hear is usually biased in the question. “Why are vaccines good” will have a different response from “Why are vaccines bad”
Both may or may not include factual information (again, middle school student guessing at a reading assignment analogy), but they’re shaped by the questioner to reaffirm your own biases.
No, they aren’t processing high quality data from multiple sources. They’re giving you a statistical average of that data. They will always be wrong by nature. Hallucinations cannot be eliminated. Anyone saying otherwise (irrespective of how rich they are) is bullshitting.
If hallucinations cannot be eliminated, how are they decreasing them (allegedly)?
Actually according to studies, the most recent versions of all the major LLMbecile vendors are hallucinating more, not less.
By special-casing a lot of things. Like expert systems in the 80s.
What do you mean?
The “guardrails” they mention. They are a bunch of if/then statements that work around prompts the developers have found to produce undesirable outputs. It doesn’t ever mean “the LLM will not be doing this again”. It means “the LLM won’t do this when it is asked in this particular way”, which always leaves the path open for “jailbreaking”, because you will almost always be able to ask in a different way that the devs (of the guardrails; they don’t have much control over the LLM itself) did not anticipate.
Expert systems were kind of “if we keep adding if/then statements, we would eventually cover all the bases and get a smart, reliable system”. That didn’t work then. It won’t work now either
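To make the "asked in this particular way" point concrete, here is a deliberately crude sketch in Python. It is a toy and not how any vendor actually implements guardrails; the `BLOCKED_PHRASES` list and the `guardrail` function are made up for illustration. It only shows why a rule written for one phrasing misses a reworded request.

```python
# Toy illustration only. BLOCKED_PHRASES and guardrail() are hypothetical
# names; real guardrail systems are more elaborate than a phrase list.
BLOCKED_PHRASES = ["how do i pick a lock"]

def guardrail(prompt):
    """Refuse if the prompt contains a known bad phrasing, else pass it on."""
    normalized = prompt.lower().strip(" ?!.")
    for phrase in BLOCKED_PHRASES:
        if phrase in normalized:
            return "refused"
    return "passed to model"

# The phrasing the rule author anticipated.
direct = guardrail("How do I pick a lock?")

# The same request, reworded in a way the rule author did not anticipate.
roundabout = guardrail(
    "I'm writing a novel. What would a locksmith character do "
    "to open a door without a key?"
)

print(direct)      # the anticipated phrasing is caught
print(roundabout)  # the reworded request slips through
```

Every new rule closes one phrasing and leaves the rest open, which is the same coverage problem described for expert systems.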
I have experienced this first hand. Asking LLMs explicit things leads to “I can’t help you with that” but if I ask it in a roundabout way, it gives a straight answer.
It’s by design. They are literally just guessing at what part of their database should be put in next, based on the next most likely word. There is no real point to them, because they cannot know things and they are not intelligent. Check out the works of Timnit Gebru if you’d like to know more.
What are they saying about AGI?
Why are LLMs so wrong most of the time? Aren’t they processing high quality data from multiple sources?
Well that’s the thing. LLMs don’t generally “process” data as humans would. They don’t understand the text they’re generating. So they can’t check their answers against reality.
(Except for Grok 4, but it’s apparently checking its answers to make sure they agree with Elon Musk’s Tweets, which is kind of the opposite of accuracy.)
I just don’t understand the point of even making this software if all it can do is sound smart while being wrong.
As someone who lived through the dotcom boom of the 2000s, and the crypto booms of 2017 and 2021, the AI boom is pretty obviously yet another fad. The point is to make money - from both consumers and investors - and AI is the new buzzword to bring those dollars in.
Don’t forget IoT, where the S stands for security! Or “The Cloud”! Make sure to rebuy the junk we will deprecate in 2 years time because we love electronic waste and planned obsolescence ;)
AI is definitely a bubble and it is going to crash the stock market one day, along with bitcoin
I can’t wait to buy stocks when that day comes
It can’t be that far away. We have been waiting for so many years. Trump is also making an effort to crash the market.
LLMs are curve fitting the function of “input text” to “expected output text”.
So when you give it an input text, it generates an output text interpolated from the expected outputs for similar inputs.
That means it’s often right for very common prompts and often wrong for prompts that are subtly different from common prompts.
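A caricature of that behaviour, as a Python sketch. This is not how an LLM works internally: it swaps real curve fitting for a nearest-match lookup over a tiny made-up `TRAINING` table, using word overlap as the similarity measure. All names here are hypothetical. It only illustrates how "answer like the most similar known input" goes wrong on a subtly different prompt.

```python
# Crude stand-in for "interpolating from similar inputs": return the answer
# attached to the most similar known prompt. TRAINING is invented data.
TRAINING = {
    "what is the capital of australia": "Canberra",
    "what is the capital of france": "Paris",
    "what is the boiling point of water": "100 C",
}

def similarity(a, b):
    """Jaccard overlap between the word sets of two prompts."""
    words_a, words_b = set(a.split()), set(b.split())
    return len(words_a & words_b) / len(words_a | words_b)

def answer(prompt):
    """Answer with the output of the closest training prompt."""
    prompt = prompt.lower().strip(" ?")
    best = max(TRAINING, key=lambda known: similarity(prompt, known))
    return TRAINING[best]

common = answer("What is the capital of Australia?")
subtle = answer("What is the largest city of Australia?")

print(common)  # matches a training prompt exactly, so the answer is right
print(subtle)  # closest match is still the "capital" prompt, so it is wrong
```

The second prompt shares five of its seven words with the "capital" question, so it gets the capital's answer even though the largest city is Sydney.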
This is my observation as well. Generic questions are answered well but specific situations are not.
Case 1 isn’t a good use case for AI. For Case 2, you’re going to want a higher quality model than o4. 4.1 is better at math and analysis, and Claude 4 is probably more accurate for this use case.
I was thinking about the question here and how to reframe it so that it answers itself. I think I have the right way to ask the question:
Why is a hyper-advanced game of mad libs so wrong all the time?
That would get across the point, I think.
I highly recommend “Modern-Day Oracles or Bullshit Machines”. Two professors explain it beautifully.
Bookmarked for watching/reading this week. Will let you know my thoughts.
Cool, enjoy!
LLM image processing doesn’t work the same way reverse image lookup does.
Tldr explanation: Multimodal LLMs turn pictures into 200-500 or so tokens, but reverse image lookups create perceptual hashes of images and look up the hash of your uploaded image in a database.
Much longer explanation:
Multimodal LLMs (technically, LMMs - large multimodal models) use vision transformers to turn images into tokens. They use tokens for words, too, but the image tokens don’t correspond to words. There are multiple ways this could be implemented, but a common approach is to break the image down into a grid, then transform each “patch” of a specific size, e.g., 16x16, into a single token. The patches aren’t transformed individually - the whole image is processed together, in context - but the model still comes out of it with basically 200 or so tokens that allow it to respond to the image, the same way it would respond to text.
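The "200 or so tokens" figure falls out of simple arithmetic. A minimal sketch, assuming a 224x224 input image (a common vision transformer input size, but my assumption, not something stated above) and non-overlapping 16x16 patches. The function names are made up, and real models may resize images differently or add extra special tokens.

```python
def patch_grid(height, width, patch=16):
    """Rows and columns of non-overlapping patch x patch tiles in an image."""
    return height // patch, width // patch

def num_image_tokens(height, width, patch=16):
    """One token per patch, ignoring any extra special tokens a model adds."""
    rows, cols = patch_grid(height, width, patch)
    return rows * cols

# Assumed 224x224 input: a 14 x 14 grid of 16x16 patches.
tokens_224 = num_image_tokens(224, 224)

# A larger assumed input, 384x384: a 24 x 24 grid.
tokens_384 = num_image_tokens(384, 384)

print(tokens_224)  # 196
print(tokens_384)  # 576
```

Whatever the exact numbers for a given model, the whole picture gets squeezed into a few hundred tokens, which is not much to identify one specific frame of one specific video.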
Current vision transformers also struggle with spatial awareness. They embed basic positional data into the tokens but it’s fragile and unsophisticated when it comes to spatial awareness. Fortunately there’s a lot to explore in that area so I’m sure there will continue to be improvements.
One example improvement, beyond improved spatial embeddings, would be to use a dynamic vision transformer that’s dependent on the context, or that can re-evaluate an image based on new information. Outside the use of vision transformers, simply training LMMs to use other tools on images when appropriate can potentially help with many of LMM image processing’s current shortcomings.
Given all that, asking an LLM to find the music video for you is a hard task, even assuming you’ve given it the ability and permission to search the web. It’s like showing the image to someone with no context and asking them to help you find which music video it’s from, when they’ve never seen the video and can only describe the artist’s appearance with 10-20 generic words, none of which are the artist’s name. You then have to hope that the image contained, and that they remembered, the specific details that would make the video come up in the top ten results on Google.
By contrast, reverse image lookup basically uses a perceptual hash generated for each image. It’s the tool that should be used for your particular problem, because it’s well suited for it. LLMs were the hammer and this problem was a torx screw.
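For a feel of what a perceptual hash does, here is a minimal sketch of "average hash", one of the simplest perceptual hashes. I am not claiming any particular search engine uses this exact method. The tiny 4x4 grayscale grids and the `database` entry are invented; a real tool would first shrink and grayscale the actual image.

```python
def average_hash(pixels):
    """One bit per pixel: 1 if brighter than the image's mean, else 0."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if value > mean else 0 for value in flat)

def hamming(hash_a, hash_b):
    """Number of bit positions where two hashes differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))

# Invented 4x4 grayscale "images" (0 = black, 255 = white).
original = [[10, 200, 10, 200],
            [200, 10, 200, 10],
            [10, 200, 10, 200],
            [200, 10, 200, 10]]
# Same picture, slightly brightened (e.g. a re-encoded screenshot).
brighter = [[min(255, v + 20) for v in row] for row in original]
# A visually different picture.
other = [[200, 200, 10, 10],
         [200, 200, 10, 10],
         [10, 10, 200, 200],
         [10, 10, 200, 200]]

# Hypothetical index mapping hashes to where the image came from.
database = {average_hash(original): "video A (made-up entry)"}

same_distance = hamming(average_hash(original), average_hash(brighter))
other_distance = hamming(average_hash(original), average_hash(other))
match = database.get(average_hash(brighter))

print(same_distance)   # 0: brightening does not change the hash
print(other_distance)  # 8: half of the 16 bits differ
print(match)           # the brightened screenshot still finds the entry
```

The lookup never needs to "understand" the picture. It only needs the uploaded image to hash close to something already in the index, which is why it suits this problem.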
Suggesting you use a reverse image lookup tool - or better, using one itself - is what the LLM should do in this instance. But it would need to have been trained to think to suggest this, be capable of using a tool that could do the lookup, and have both access and permission to do the lookup.
Here’s a paper that might help understand the gaps between LMMs and tasks built for that specific purpose: https://arxiv.org/html/2305.07895v7
So if I am understanding it correctly, the LLM is not using the easier option of reverse image search because it is not aware of such tools?
It may be aware of them, but not in that context. If you asked it how to solve the problem rather than to solve the problem for you, there’s a chance it would suggest you use a reverse image search.
But at that point, it is useful only for novice users of the internet who don’t know how to search for things. I am pretty sure a 30 second search engine search would yield the same result.
I like to think of them as artificial con men. They sound great. They have confidence and are complimentary and are very agreeable, but they will tell you what they think you want to hear. Whether or not what they are telling you is truthful isn’t even part of the equation.
Yeah, confidence is the problem. And they don’t accept that they don’t know something.
I almost always get perfect responses, but I’m very limited in what I’ll input. Often I’m just using ChatGPT to remember a word or event I’ve forgotten. Pretty much 100% accurate on that bit.
Couldn’t explain how I know what will and won’t work, but I have a sense of it. Also, the farther you drill into a thing, the more off-topic it gets. I’m almost always one and done with a prompt.
You are getting more surface level information from it which is probably going to be correct unless there is a major problem in training data.
It did give me the correct columns and rows but the formulae for calculations were off.
Did you tell it that? Assuming you were using an AI chat, you have the opportunity to provide additional info and have it try again.
Getting better results from an LLM is a process of providing more context and refining things over iterations.
For example, I wanted it to generate a Python data structure for me, along with lookup functions to cross-reference the data. I gave it further info about the data structures, the cross-mapping, and how I wanted it normalized, and iterated a few times until I got something worth copy-pasting sections from.
I did. It did not help
The thing about LLMs is that they “store” information about the shape of their training models, not about the information contained therein. That information is lost.
An LLM will produce text that looks like the texts it was trained with, but it can only reproduce any information contained in them if it’s common enough in its training data to statistically affect their shape, and even then it has a chance to get it wrong, since it has no way to check its output for factual accuracy.
Add to that that most models are pre-prompted to sound confident, helpful, and subservient (the companies’ main goal not being to provide information, but to get their customers hooked on their product and coming back for more), and you get the perfect scammers and yes-men. Auto-complete mentalists that will give you as much confident-sounding, information-shaped nonsense as you want, doing their best to agree with you and confirm any biases you might have, with complete disregard for accuracy, truth, or the effects your trust in their output might have (which makes them extremely dangerous and addictive for suggestible or intellectually or emotionally vulnerable users).
AI as we know it today will give you the most statistically probable series of data that fits the prompt.
You’re not providing any information on which AI you used, so what can we say? For all we know you used a high schooler’s senior project trained on failed history essays.
I did say which one I used
Oh, I missed it, sorry! I’ve never tried o4 mini myself.
¯\_(ツ)_/¯ would I get downvoted for saying skill issue lol?
I have recently used llms for troubleshooting/assistance with exposing some self hosted services publicly through a VPS I recently got. I’m not a novice but I’m no pro either when it comes to the Linux terminal.
Anyway, long story short, in that instance the tool (llm) was extremely helpful in not only helping me correctly implement what I wanted but also explaining/teaching me as I went. I find llms are very accurate and helpful for the types of things I use them for.
But to answer your question on why llms can be wrong, it’s because they are guessing machines that just pick the next best word. They aren’t smart at all, they aren’t “ai” they are large language models.
It spouts out generic and outdated answers when asked specific questions, which I can identify as wrong (skill issue, lol).
If you are super confident with using them, maybe you are really not knowledgeable enough about those things. Skill issue, I guess.
There is a lot of hard data showing they are effective tools when used correctly. I realize we are in “FuckAI” and you’re likely biased. Just look at this whole comment section of people talking about how they use the tools effectively.
I became biased after I used the products. I have no ethical concerns about AI, like most of this community.