
I've found NotebookLM summaries to be too high-level and oversimplified to be useful. Hopefully in a few years they can go deeper.

You can also use NotebookLM notebooks as a source for the Gemini app and ask it to do more in-depth summaries with custom prompting.

This somewhat makes NotebookLM itself less useful, but still.


I am kind of amazed at how many commenters respond to this result by confidently asserting that LLMs will never generate 'truly novel' ideas or problem solutions.

> AI is a remixer; it remixes all known ideas together. It won't come up with new ideas

> it's not because the model is figuring out something new

> LLMs will NEVER be able to do that, because it doesn't exist

It's not enough to say 'it will never be able to do X because it's not in the training data,' because we have countless counterexamples to this statement (e.g. 167,383 * 426,397 = 71,371,609,051, or the above announcement). You need to say why it can do some novel tasks but could never do others. And it should be clear why this post or others like it don't contradict your argument.

If you have been making these kinds of arguments against LLMs and acknowledge that novelty lies on a continuum, I am really curious why you draw the line where you do. And most importantly, what evidence would change your mind?


I might as well answer my own question, because I do think there are some coherent arguments for fundamental LLM limitations:

1. LLMs are trained on human-quality data, so they will naturally learn to mimic our limitations. Their capabilities should saturate at human or maybe above-average human performance.

2. LLMs do not learn from experience. They might perform as well as most humans on certain tasks, but a human who works in a certain field/code base etc. for long enough will internalize the relevant information more deeply than an LLM.

However I'm increasingly doubtful that these arguments are actually correct. Here are some counterarguments:

1. It may be more efficient to just learn correct logical reasoning, rather than to mimic every human foible. I stopped believing this argument when LLMs got a gold medal at the Math Olympiad.

2. LLMs alone may suffer from this limitation, but RL could change the story. People may find ways to add memory. Finally, it can't be ruled out that a very large, well-trained LLM could internalize new information as deeply as a human can. Maybe this is what's happening here:

https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...


I studied philosophy focusing on the analytic school and proto-computer science. LLMs are going to force many people to start getting a better understanding of what "Knowledge" and "Truth" are, especially the distinction between deductive and inductive knowledge.

Math is a perfect field for machine learning to thrive because theoretically, all the information ever needed is tied up in the axioms. In the empirical world, however, knowledge only moves at the speed of experimentation, which is an entirely different framework and much, much slower, even if there is some room to catch up on previously published experimental outcomes.

Having a focus in philosophy of language is something I genuinely never thought would be useful. It’s really been helpful with LLMs, but probably not in the way most people think. I’d say that folks who are curious should all be reading Quine, Wittgenstein’s Investigations, and probably Austin.


I think we may have similar perspectives. Regarding empirical knowledge, consider the case where the knowledge concerns chaotic systems. Characterize chaotic systems, at least, as systems where inaccurate observations of the past and present, while useful for predicting the future, nevertheless see their errors grow very quickly when used to predict a future state. Then indeed, prediction is difficult.

There is one domain of knowledge I think you have yet to mention: fundamentally computationally hard problems. The ones that come to mind that are nevertheless of practical benefit are physics simulations, materials simulations, and fluid simulations, but there exist problems that are more provably computationally difficult. It seems to me that with these systems, the chaotic nature is such that even if you have one infinitely precise observation of a deterministic system, accessing a future state of the system is difficult as well, even though once accessed, memorizing it seems comparatively trivial.


> distinction between deductive and inductive knowledge

There's also intuitive knowledge btw.

Anyway, the recent developments in AI make a lot of very interesting things practically possible. For example, our society is going to want a way to reliably tell whether something is AI generated, and a failure to do so pretty much settles the empirical part of the Turing test issue. Or alternatively, if we actually find something that AI can't reliably mimic in humans, that's going to be a huge finding. With millions of people wondering whether posts on social media are AI generated, we have inadvertently been conducting the largest-scale Turing test ever.

The fact that AI seems to be able to (digitally) do anything we ask for is also very interesting. If humans are not bogged down by the small details or cost of implementation concerns, and we can just say what we want and get what we wished for (digitally), what level of creativity can we reach?

Also once we get the robots to do things in the physical space...


I don't want to do the thing where we fight on the internet. I don't know your background, but I'll push back here just because this is the type of comment that non-philosophers tend to present to me, and it misses a lot of the points I'm trying to make.

(1) "intuitive knowledge" - whether or not you want to take "intuitive knowledge" as a type of knowledge (I don't think I would) is basically immaterial. The deductive-inductive framework dynamic is for reasoning frameworks, not knowledge. The reasoning frameworks are pointed in opposite directions. The deductive framework is inherited from rationalist tradition, it's premises are by definition arbitrary and cannot be justified, and information is perfect (excepting when you get rare truth values, like something being undecidable). Inductive/empirical framework is quite the opposite. Its premises are observations and absolutely not arbitrary, the information is wholly imperfect (by necessity, thanks Popper), and there is always a kind of adjustable resolution to any research conducted. Newton vs Einsteinian physics, for example, shows how zooming in on the resolution of experimentation shows how a perfectly workable model can fail when instruments get precise enough. I'll also note here that abduction is another niche reasoning framework, but is effectively immaterial to my point here.

(2) The Turing Test is not, and has never been, a philosophically rigorous test. It's effectively a pointless exercise. The literature about "philosophical zombies" has covered this, but the most important work here is Searle's "Chinese Room."

>The fact that AI seems to be able to (digitally) do anything we ask for is also very interesting.

I don't even know how to respond to this. It's trivially, demonstrably false. Beyond that, my entire point is that philosophy of language actually presents such hard problems with regard to what meaning actually is that they might end up creating a kind of uncertainty principle for this line of thinking in the long run. Specifically Quine's indeterminacy of translation.


Searle's Chinese Room is a fallacious mess ... see the works of Larry Hauser, e.g., https://philpapers.org/rec/HAUNGT and https://philpapers.org/rec/HAUSCB-2 The importance of Searle's Chinese Room is how such extraordinarily bad argumentation has persuaded so many people open to it.

And the literature about philosophical zombies is contentious, to say the least, and much of it is also among the worst arguments in philosophy--Dennett confided in me that he thought it set back progress in Philosophy of Mind for decades, along with that monstrosity of misdirection, "the hard problem". Chalmers (nice guy, fun drunk at parties, very smart, but hopelessly deluded) once admitted to me on the Psyche-D list that his argument in The Conscious Mind that zombies are conceivable is logically equivalent to denying that physicalism is conceivable, so it's no argument against physicalism ... he said he used the argument to till the soil to make people more susceptible to his later arguments against physicalism (which I consider unethical)--all of which are bogus, like the Knowledge Argument--even Frank Jackson who originated it admits this.

Similarly, Robert Kirk, who coined the phrase "philosophical zombie" in 1974, wrote his book Zombies and Consciousness "as penance", he told me when he signed my copy.

> I don't want to do the thing where we fight on the internet.

Nor me ... I've had these "fights" too many times already and I know how they go, and I understand why people believe what they believe and why they can't be swayed, so I won't comment further ... I just want to put a dent in this "I'm a philosopher" argumentum ad verecundiam.


I would hope that philosophy would be exempt from accusations of arguments from authority. I say I don’t want to fight exactly because I don’t want to come off like a jerk because I’m arguing. If the Chinese Room is a mess, I welcome the argument, and will happily read the paper.

I’m less open to push back against philosophical zombies, as the argument seems trivially plausible, from a position of solipsism.


Philosophy may be exempt from accusations of arguments from authority--because that's a category mistake--but philosophers certainly aren't.

Hauser's papers are just a part of a large literature rejecting/refuting Searle's Chinese Room, but he has probably taken Searle more seriously than most. After Searle's well known response that waves away numerous objections, many people dismissed him as acting in bad faith. (It would have been even worse if they had known about the accusations of sexual assault. Sure, that would be ad hominem and intellectually dishonest, but we're talking about human beings, same as with arguments from authority.) See, e.g., https://www.nybooks.com/articles/1995/12/21/the-mystery-of-c... where Daniel Dennett writes:

> For his part, he has one argument, the Chinese Room, and he has been trotting it out, basically unchanged, for fifteen years. It has proven to be an amazingly popular number among the non-experts, in spite of the fact that just about everyone who knows anything about the field dismissed it long ago. It is full of well-concealed fallacies. By Searle’s own count, there are over a hundred published attacks on it. He can count them, but I guess he can’t read them, for in all those years he has never to my knowledge responded in detail to the dozens of devastating criticisms they contain; he has just presented the basic thought experiment over and over again. I just went back and counted: I am dismayed to discover that no less than seven of those published criticisms are by me (in 1980, 1982, 1984, 1985, 1987, 1990, 1991, 1993).

etc. If you've never read any of this literature yet can facilely write what you did above about Searle's discussion of the Chinese Room being "the most important work here", I don't expect you to start now ... but at least reconsider posing as a philosopher who is knowledgeable about such things.

Your reason to be less open to "push back against" (an odd formulation--the burden is on those who claim that they are conceivable, and therefore physicalism is false) philosophical zombies seems to hinge on another radical failure to understand the issue and unfamiliarity with the literature.

Philosophical zombies are completely independent of solipsism. The conceivability of zombies says that, if this is a world in which you are the sole inhabitant and you are conscious, then there is a possible world that is physically identical to this world and has the same physical laws, but the sole inhabitant (scoofy'), while physically identical to you and behaving identically, isn't conscious. That is, consciousness is not a consequence of physical laws and contingencies but is some sort of ethereal goop that accompanies physical entities. Of course Chalmers and other modern dualists don't subscribe to Descartes' substance dualism, but their attempts to formulate "process dualism" or some other nonsense solely because they need some alternative to physicalism--which they reject because they are hopelessly confused about the nature of consciousness and "qualia"--are frankly incoherent.

Maybe read Kirk's book and learn something about the subject. Here's a review that gives you a peek at what you'll find there: https://view.officeapps.live.com/op/view.aspx?src=https%3A%2...

Over and out.


Where can I read about how LLMs have changed epistemology? Is there a field of philosophy that tries to define and understand 'intelligence'? That sounds very interesting.

There is already philosophy of mind, but it was pretty young when I was in grad school, which was really at the dawn of deep learning algorithms.

I’d say the two most important topics here are philosophy of language (understanding meaning) and philosophy of science (understanding knowledge).

I’ve already mentioned the language philosophers in an edit above, but in philosophy of science I’d add Popper as extremely important here. The concept of negative knowledge as the foundation of empirical understanding seems entirely lost on people. The Black Swan, by Nassim Taleb is a very good casual read on the subject.


Also, we can do thought experiments, simulations in our heads, that often are as good as doing them for real - this has limitations and isn't perfect, but it does work often. Einstein used to purposely doze off in a weird position so that something would hit his leg, or something like that, to nudge him half awake so he could remember his half-dreaming state - which is where he discovered some things.

> Math is a perfect field for machine learning to thrive because theoretically, all the information ever needed is tied up in the axioms.

Not really; the normal way that math progresses, just like everything else, is that you get some interesting results, and then you develop the theoretical framework. We didn't receive the axioms; we developed them from the results that we use them to prove.


There are ways to go beyond the human-quality data limitation. AI can be trained on data of better quality than average human output, because for many problems it is easy to verify the solutions. For example, in theory, reinforcement learning with an automatic grader on competitive programming problems can lead to an LLM that is better than humans at them.
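A minimal sketch of what such an automatic grading loop could look like, assuming Python; the solution format, the test-case format, and the five-second limit are made up for illustration, and a real RL pipeline would sandbox the execution:

  import subprocess

  def grade_solution(source_code: str, test_cases: list[tuple[str, str]]) -> float:
      # Hypothetical grader: run the model's program on known input/output
      # pairs and return the fraction passed, usable directly as an RL reward.
      passed = 0
      for stdin_text, expected_stdout in test_cases:
          try:
              result = subprocess.run(
                  ["python", "-c", source_code],
                  input=stdin_text, capture_output=True, text=True, timeout=5,
              )
              if result.stdout.strip() == expected_stdout.strip():
                  passed += 1
          except subprocess.TimeoutExpired:
              pass  # a timeout counts as a failed test
      return passed / len(test_cases)

No human labels are needed anywhere in that loop, which is why verifiable domains are where models can plausibly exceed the quality of their training data.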

It's also possible that there can be emergent capabilities. Perhaps a little obtuse, but you can say that humans are trained on human-quality data too and yet brilliant scientists and creative minds can rise above the rest of us.


The idea that they don’t learn from experience might be true in some limited sense, but ignores the reality of how LLMs are used. If you look at any advanced agentic coding system the instructions say to write down intermediate findings in files and refer to them. The LLM doesn’t have to learn. The harness around it allows it to. It’s like complaining that an internal combustion engine doesn’t have wheels to push it around.

LLMs are notoriously terrible at multiplying large numbers: https://claude.ai/share/538f7dca-1c4e-4b51-b887-8eaaf7e6c7d3

> Let me calculate that. 729,278,429 × 2,969,842,939 = 2,165,878,555,365,498,631

Real answer is: https://www.wolframalpha.com/input?i=729278429*2969842939

> 2 165 842 392 930 662 831

Your example seems short enough to not pose a problem.


Modern LLMs, just like everyone reading this, will instead reach for a calculator to perform such tasks. I can't do that in my head either, but a Python script can, so that's what any tool-using LLM will (and should) do.
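As a trivial sketch of what that tool call amounts to (the actual tool-calling protocol varies by provider, so this is only illustrative):

  # The "calculator" boils down to: the model emits an expression, the
  # harness evaluates it exactly, and the result goes back into context.
  def multiply(a: int, b: int) -> int:
      return a * b  # Python ints are arbitrary precision, so no overflow

  print(multiply(729_278_429, 2_969_842_939))
  # -> 2165842392930662831, matching the Wolfram Alpha result above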

This is special pleading.

Long multiplication is a trivial form of reasoning that is taught at the elementary level. Furthermore, the LLM isn't doing things "in its head" - the headline feature of GPT LLMs is attention across all previous tokens, all of its "thoughts" are on paper. That was Opus with extended reasoning, it had all the opportunity to get it right, but didn't. There are people who can quickly multiply such numbers in their head (I am not one of them).

LLMs don't reason.


I tried this with Claude - it has to be explicitly instructed to not make an external tool call, and it can get the right answer if asked to show its work long-form.

i assert that by your evidentiary standards humans don't reason.

presumably one of us is wrong.

therefore, humans don't reason.


Mathematics is not the only kind of reasoning, so your conclusion is false. The human brain also has compartments for different types of activities. Why shouldn't an AI be able to use tools to augment its intelligence?

I used the mathematics example only because the GP did. There are many other examples of non-reasoning, including some papers (as recent as Feb).

> Furthermore, the LLM isn't doing things "in its head" - the headline feature of GPT LLMs is attention across all previous tokens, all of its "thoughts" are on paper

LOL, talk about special pleading. Whatever it takes to reshape the argument into one you can win, I guess...

> LLMs don't reason.

Let's see you do that multiplication in your head. Then, when you fail, we'll conclude you don't reason. Sound fair?


I can do it with a scratch pad. And I can also tell you when the calculation exceeds what I can do in my head and when I need a scratch pad. I can also check a long multiplication answer in my head (casting 9s, last digit etc.) and tell if there’s a mistake.
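For readers unfamiliar with it, the casting-out-nines check mentioned above is easy to state: a product's digit sum mod 9 must equal the product of the factors' digit sums, mod 9. A small sketch in Python, using the numbers quoted earlier in this thread (the check catches most errors, though not ones that happen to preserve the residue):

  def digit_sum_mod_9(n: int) -> int:
      # "casting out nines": n mod 9, computed from the digit sum
      return sum(int(d) for d in str(abs(n))) % 9

  a, b = 729_278_429, 2_969_842_939
  llm_answer  = 2_165_878_555_365_498_631  # Claude's answer quoted upthread
  real_answer = 2_165_842_392_930_662_831  # the Wolfram Alpha answer

  expected = (digit_sum_mod_9(a) * digit_sum_mod_9(b)) % 9
  print(expected, digit_sum_mod_9(llm_answer), digit_sum_mod_9(real_answer))
  # prints "8 7 8": the LLM's answer fails the check, the real one passes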

The LLMs also have access to a scratch pad. And importantly don’t know when they need to use it (as in, they will sometimes get long multiplication right if you ask them to show their work but if you don’t ask them to they will almost certainly get it wrong).


The conclusion that LLMs don't reason is not a consequence of them not being able to do arithmetic, so your argument isn't valid.

Also, see https://news.ycombinator.com/newsguidelines.html

"Be kind. Don't be snarky. Converse curiously; don't cross-examine. Edit out swipes.

Comments should get more thoughtful and substantive, not less, as a topic gets more divisive.

When disagreeing, please reply to the argument instead of calling names. "That is idiotic; 1 + 1 is 2, not 3" can be shortened to "1 + 1 is 2, not 3."

Don't be curmudgeonly. Thoughtful criticism is fine, but please don't be rigidly or generically negative."

etc.


Plenty of humans can't do arithmetic. Can they also not reason?

Reasoning isn't a binary switch. It's a multidimensional continuum. AI can clearly reason to some extent even if it also clearly doesn't reason in the same way that a human would.


LLMs don't use tools. Systems that contain LLMs are programmed to use tools under certain circumstances.

I thought it might do better if I asked it to do long-form multiplication specifically rather than trying to vomit out an answer without any intermediate tokens. But surprisingly, I found it doesn't do much better.

Other comments indicate that asking it to do long multiplication does work, but the varying results make sense: LLMs are probabilistic, and you probably rolled an unlikely result.

This doesn’t address the author’s point about novelty at all. You don’t need 100% accuracy to have the capability to solve novel problems.

It does address the GP comment about math.

This hasn't been true for a while now.

I asked Gemini 3 Thinking to compute the multiplication "by hand." It showed its work and checked its answer by casting out nines and then by asking Python.

Sonnet 4.6 with Extended Thinking on also computed it correctly with the same prompt.


LLMs can generate anything by design. LLMs can't understand what they are generating, so it may be true, it may be wrong, it may be novel, or it may be a known thing. It doesn't discern between them, just looks for the best statistical fit.

The core of the issue lies in our human language and our human assumptions. We humans have implicitly assigned phrases "truly novel" and "solving unsolved math problem" a certain meaning in our heads. Some of us at least, think that truly novel means something truly novel and important, something significant. Like, I don't know, finding a high-temperature superconductor formula or creating a new drug, etc. Something which involves real intelligent thinking and not randomizing possible solutions until one lands. But formally there can be a truly novel way to pack the most computer cables in a drawer, or a truly novel way to tie shoelaces, or indeed a truly novel way to solve some arbitrary math equation with enormous numbers. Which are formally novel things, but we really never needed any of that and so relegated these "issues" to the deepest backlog possible. Utilizing LLMs we can scour for the solutions to many such problems, but they are not that impressive in the first place.


> It doesn't discern between them, just looks for the best statistical fit

Of course at the lowest level, LLMs are trained on next-token prediction, and on the surface, that looks like a statistics problem. But this is an incredibly reductionist viewpoint and I don't see how it makes any empirically testable predictions about their limits. LLMs 'learned' a lot of math and science in this way.

> "truly novel" and "solving unsolved math problem"

OK again if novelty lies on a continuum, where do you draw the line? And why is it correct to draw it there and not somewhere else? It seems like you are just naming exceptionally hard research problems.


>LLMs 'learned' a lot of math and science in this way.

Did they? Or is it begging the question?


This is why I put 'learned' in quotes. They started from a state of not being able to solve algebra problems or produce basic steps of scientific reasoning to being able to. Operationally, that is what I mean by learning and they unambiguously do it.

If LLMs can come up with formally truly novel solutions to things, and you have a verification loop to ensure that they are actual proper solutions, I don't understand why you think they could never come up with solutions to impressive problems, especially considering the thread we are literally on right now? That seems like a pure assertion at this point that they will always be limited to coming up with truly novel solutions to uninteresting problems.

"Truly novel" is fast becoming a True Scotsman.

No True Novelty, No True Understanding, etc.

The problem with these bromides is not that they're wrong, it's that they're not even wrong. They're predictive nulls.

What observable differences can we expect between an entity with True Understanding and an entity without True Understanding? It's a theological question, not a scientific one.

I'm not an AI booster by any means, but I do strongly prefer we address the question of AI agent intelligence scientifically rather than theologically.


We've tested this in the small with AI art. When people believe they're viewing human-made art which is later revealed to be AI art, they feel disappointed. The actual content is incidental, the story that supports it is more important than the thing itself.

It's the same mechanism behind artisanal food, artist struggles, and luxury goods. It is the metaphysical properties we attach to objects or the frames we use to interpret strips of events. We author all of these and then promptly forget we've done so, instead believing they are simply reality.


>The actual content is incidental, the story that supports it is more important than the thing itself.

The actual content of a work of art is the expression of lived experience. Not its form.


There are already people dealing with AI intelligence scientifically. That's what benchmarks do.

It's the "it's just a stochastic parrot!" camp that's doing the theological work. (and maybe also those in the Singularity camp...)

That said, I do think there's value in having people understand what "Understanding" means, which is kinda a theological (philosophical :D) question. IMHO, in every-day language there's a functional part (that can be tested with benchmarks), and there's a subjective part (i.e. what does it feel like to understand something?). Most people without the appropriate training simply mix up these two things, and together with whatever insecurities they have with AI taking over the world (which IMHO is inevitable to some extent), they just express their strong opinions about it online...


Well said. That's exactly what has been rubbing me the wrong way with all those "LLMs can never *really* think, ya know" people. Once we pass some level of AI capability (which we perhaps already did?), it essentially turns into an unfalsifiable statement of faith.

Agreed. We should be asking what the machines measurably can or can't do. If it can't be measured, then it doesn't matter from an engineering standpoint. Does it have a soul? Can't measure it, so it doesn't matter.

That's a bit too pessimistic. Oftentimes you can productively find some measurable proxy for the thing you care about but can't measure. Turing's test is a famous example of that.

Sometimes you only have a one-sided proxy. Eg I can't tell you whether Claude has a soul, but I'm fairly sure my dishwasher ain't.


> Turing's test is a famous example

Ironically, the Turing test is the OG functionalist approach. The GP's comment basically sums up what the Turing test was designed for.


Yes, but I interpret Turing's paper not as saying "souls don't matter", but as "here's a good proxy that we can actually measure".

(I don't know what Turing's opinion on souls is, and it doesn't matter for that paper!)


Claude has neither a soul nor a warbleflupper.

It probably can, but it won't realize that, and it won't be efficient at it. An LLM can shuffle tokens for an enormous number of tries and eventually come up with something super impressive, though as you yourself have mentioned, we would need a mandatory verification loop to filter slop from good output, and how to do that outside of some limited areas is a big question. But assume we have these verification loops and are running LLMs for years to look for something novel. It's like running the energy grid of a small country to change a few dozen database entries per hour. Yes, we can do that, but it's a kinda weird thing to do. But it is novel, no argument about that. Just inefficient.

We never had a big demand to define how humans are intelligent or conscious etc., since it is too hard and was relegated to some frontier researchers. And with LLMs we now do have such a demand, but the science wasn't ready. So we are all collectively searching in the dark, trying to define whether we are different from these programs, and if so, how. I certainly can't do that. I do know that LLMs are useful, but I also suspect that AI (aka AGI nowadays) has not yet been reached.


How can people look at

- clear generalizability

- insane growth rates (go back and look at where we were maybe 2 years ago and then consider the already signed compute infrastructure deals coming online)

And still say with a straight face that this is some kind of parlor trick or monkeys with typewriters.

We don’t need to run LLMs for years. The point is: look at where we are today, and consider that performance gets 10x cheaper every year.

LLMs and agentic systems are clearly not monkeys with typewriters regurgitating training data. And they have and continue to grow in capabilities at extremely fast rates.


I was talking about the highest-difficulty problems only, in the scope of that comment. Sure, at mundane tasks they are useful, and we are optimizing that constantly.

But for super hard tasks, there is no situation where you just dump a few papers in for context, add a prompt, and the LLM will spit out the correct answer. It's likely that a lead on such a project would need to additionally train an LLM on their local dataset, then parse through a lot of experimental data, then likely run multiple LLMs for many iterations homing in on the solution, verifying intermediate results, then repeat the cycle again and again. And in parallel other team members would do the same. All in all, for such a huge hard task a year of cumulative machine-hours is not something outlandish.


This is just not true. Maybe it will be true if you increase the problem difficulty in concert with model performance? You don't need fine tuning for this and you haven't for years now. Reasoning performance for now may be SOMEWHAT brittle but again look at where we have come from in like 2 years. Then also consider the logical next steps

- better context compression (already happening) + memory solutions that extend the effective context length [memory _is_ compression]

- continual learning systems (likely already prototyped)

- these domains are _verifiable_ which I think just seems to confuse people. RL in verifiable domains takes you farther and farther. Training data is a bootstrap to get to a starting point, because RL from scratch is too inefficient.

agents can already deal with large codebases and datasets, just like any SWE, DS or researcher.

and yes! If you throw more compute at a problem you will get better solutions! But you are missing the point: for the frontier solutions, which change with every model update, you of course need to eke out as much performance as you can, which requires a large amount of test time compute. But what you can do _without_ this is continually improving. The pattern _already in place_ is that at first you need an extreme amount of compute, then the next model iterations need far less compute to reach that same solution, etc etc. The costs + compute requirements to perform a particular task decrease exponentially.


> We never had a big demand to define how humans are intelligent or conscious etc., since it is too hard and was relegated to some frontier researchers. And with LLMs we now do have such a demand, but the science wasn't ready. So we are all collectively searching in the dark, trying to define whether we are different from these programs, and if so, how. I certainly can't do that. I do know that LLMs are useful, but I also suspect that AI (aka AGI nowadays) has not yet been reached.

Alternative perspective: the science may not have been ready, so instead we brute-forced the problem, through training of LLMs. Consider what the overall goal function of LLM training is: it's predicting tokens that continue the given input in a way that makes sense to humans - in the fully general meaning of this statement.

It's a single training process that gives LLMs the ability to parse plain language - even if riddled with 1337-5p34k, typos, grammar errors, or mixing languages - and extract information from it, or act on it; it's the same single process that makes it equally good at writing code and poetry, at finding bugs in programs, inconsistencies in data, corruptions in images, possibly all at once. It's what makes LLMs good at lying and spotting lies, even if input is a tree of numbers.

(It's also why "hallucinations" and "prompt injection" are not bugs, but fundamental facets of what makes LLMs useful. They cannot and will not be "fixed", any more than you can "fix" humans to be immune to confabulation and manipulation. It's just the nature of fully general systems.)

All of that, and more, is encoded in this simple goal function: if a human looks at the output, will they say it's okay or nonsense? We just took that and threw a ton of compute at it.


> (It's also why "hallucinations" and "prompt injection" are not bugs, but fundamental facets of what makes LLMs useful. They cannot and will not be "fixed", any more than you can "fix" humans to be immune to confabulation and manipulation. It's just the nature of fully general systems.)

This is spot on and one of the reasons why I don't think putting LLMs or LLM based devices into anything that requires security is a good idea.


> It doesn't discern between them, just looks for the best statistical fit.

Why is this not true for humans?


We can't tell yet if that is true, partially true, or false for humans. We do know that an LLM can't do anything else besides that (I mean as a fundamental operating principle).

Why is it important? “Statistical fit” is what you want…not understanding this is indicative of a limited understanding of what statistics is. What do you think it means to truly understand something? I don’t get it: read Probability Theory by Jaynes. It doesn’t really matter if the brain does Bayesian updates, but that’s what’s optimal…

"Statistical fit" to environment is arguably what all life does.

> LLMs can't understand what they are generating

You don't understand what "understanding" means. I'm sure you can't explain it. You are probably just hallucinating the feeling of understanding it.

> Some of us at least, think that truly novel means something truly novel and important, something significant. Like, I don't know...

Yeah.


> Which are formally novel things, but we really never needed any of that

The history of science and maths is littered with seemingly useless discoveries being pivotal as people realised how they could be applied.

It's impossible to tell what we really "need"


I've been working on a utility that lets me "see through" app windows on macOS [1] (I was a dev on Apple's Xcode team and have a strong understanding of how to do this efficiently using private APIs).

I wondered how Claude Code would approach the problem. I fully expected it to do something most human engineers would do: brute-force with ScreenCaptureKit.

It almost instantly figured out that it didn't have to "see through" anything and (correctly) dismissed ScreenCaptureKit due to the performance overhead.

This obviously isn't a "frontier" type problem, but I was impressed that it came up with a novel solution.

[1]: https://imgur.com/a/gWTGGYa


That's actually pretty cool. What made you think of doing this in the first place?

Thanks! I've been doing a lot of work on a laptop screen (I normally work on an ultrawide) and got tired of constantly switching between windows to find the information I need.

I've also added the ability to create a picture-in-picture section of any application window, so you can move a window to the background while still seeing its important content.

I'll probably do a Show HN at some point.


Was it a novel solution for you or for everyone? Because that's a pretty big difference. A lot of stuff that's novel for me would be something someone has been doing for decades somewhere.

Unless you worked on the macOS content server directly you’d have no idea that my solution was even possible.

The fact that Claude skipped over all the obvious solutions is why I used the word novel.


How confident are you that this knowledge was not part of the training data? Were there no Stack Overflow questions/replies about it, no tech forum posts, private knowledge bases, etc.?

Not trying to diminish its results, just that one should always assume that LLMs have a rough memory of pretty much the whole of the internet/human knowledge. Google itself was very impressive back then in how it managed to dig out stuff interesting to me (though it's no longer good at finding a single article with almost exact keywords...), and what makes LLMs especially great is that they combine that with some surface-level transformation to make that information fit the current, particular need.


Do you think AlphaGo is regurgitating human gameplay? No it’s not: it’s learning an optimal policy based on self play. That is essentially what you’re seeing with agents. People have a very misguided understanding of the training process and the implications of RL in verifiable domains. That’s why coding agents will certainly reach superhuman performance. Straw/steel man depending on what you believe: “But they won’t be able to understand systems! But a good spec IS programming!” also a bad take: agents absolutely can interact with humans, interpret vague desiderata, fill in the gaps, ask for direction. You are not going to need to write a spec the same way you need to today. It will be exactly like interacting with a very good programmer in EVERY sense of the word.

How does AlphaGo come into the picture? It works in a completely different way altogether.

I'm not saying that LLMs can't solve new-ish problems not part of the training data, but they sure as hell did not get some Apple-specific library call from divine revelation.


AlphaGo comes into the picture to explain that in fact coding agents in verifiable domains are absolutely trained in very similar ways.

It’s not magic: they can’t access information that’s not available. But they are not regurgitating or interpolating training data. That’s not what I’m saying. I’m saying: there is a misconception stemming from a limited understanding of how coding agents are trained that they somehow are limited by what’s in the training data or poorly interpolating that space. This may be true for some domains but not for coding or mathematics. AlphaGo is the right mental model here: RL in verifiable domains means your gradient steps are taking you in directions that are not limited by the quality or content of the training data, which is used only because starting from scratch using RL is very inefficient. Human training data gives the models a more efficient starting point for RL.


Well said.

Why is ScreenCaptureKit a bad choice for performance?

Because you can't control what the content server is doing. SCK doesn't care if you only need a small section of a window: it performs multiple full window memory copies that aren't a problem for normal screen recorders... but for a utility like mine, the user needs to see the updated content in milliseconds.

Also, as I mentioned above, when using SCK, the user cannot minimize or maximize any "watched" window, which is, in most cases, a deal-breaker.

My solution runs at under 2% cpu utilization because I don't have to first receive the full window content. SCK was not designed for this use case at all.


What was the solution?

Well, I'm not going to share either solution as this is actually a pretty useful utility that I plan on releasing, but the short answer is: 1) don't use ScreenCaptureKit, and 2) take advantage of what CGWindowListCreateImage() offers through the content server. This is a simple IPC mechanism that does not trigger all the SCK limitations (i.e., no multi-space or multi-desktop support). In fact, when using SCK, the user cannot even minimize the "watched" window.

Claude realized those issues right from the start.

One of the trickiest parts is tracking the window content while the window is moving - the content server doesn't, natively, provide that information.
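For anyone curious about the public-API side being referenced, here is roughly what a single-window snapshot via CGWindowListCreateImage() looks like, shown through PyObjC purely for illustration. This is not my actual solution (the interesting parts use undocumented calls I'm not sharing), the "Safari" target is just a made-up example, and note that Apple has deprecated this call on recent macOS versions in favor of ScreenCaptureKit:

  # Requires the pyobjc-framework-Quartz package; capturing other apps'
  # windows needs the Screen Recording permission on modern macOS.
  import Quartz

  def find_window_id(owner_name):
      # Window ID of the first on-screen window owned by owner_name.
      info_list = Quartz.CGWindowListCopyWindowInfo(
          Quartz.kCGWindowListOptionOnScreenOnly, Quartz.kCGNullWindowID)
      for info in info_list:
          if info.get(Quartz.kCGWindowOwnerName) == owner_name:
              return info[Quartz.kCGWindowNumber]
      return None

  def snapshot_window(window_id):
      # One CGImage of a single window, without grabbing the whole screen.
      return Quartz.CGWindowListCreateImage(
          Quartz.CGRectNull,                          # window's own bounds
          Quartz.kCGWindowListOptionIncludingWindow,  # just this one window
          window_id,
          Quartz.kCGWindowImageBoundsIgnoreFraming)

  wid = find_window_id("Safari")  # hypothetical target app
  image = snapshot_window(wid) if wid is not None else None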


Huh, Claude one-shotted it out of a single message from me. Man, LLMs have gotten good.

No it didn't. Like I said... it may have gotten something that worked but there is no way Claude got it to work while supporting multi-spaces, multi-desktops, and using under 2% cpu utilization. My solution can display app window content even when those windows are minimized, which is not something the content server supports.

My point was that Claude realized all the SCK problems and came up with a solution that 99% of macOS devs wouldn't even know existed.


> it may have gotten something that worked but there is no way Claude got it to work while supporting multi-spaces, multi-desktops, and using under 2% cpu utilization.

Maybe, but that's the magic of LLMs - they can now one-shot or few-shot (N<10) you something good enough for a specific user. Like, not supporting multi-desktops is fine if one doesn't use them (and if that changes, few more prompts about this particular issue - now the user actually knows specifically what they need - should close the gap).


And now it does.

Sorry, "now it does", what?

The things it didn't, which you then helpfully spelled out.

Do you believe my brief overview of the problem will help Claude identify the specific undocumented functions required for my solution? Is that how you think data gets fed back into models during training?

Yes. I don't think you appreciate just how much information your comments provide. You just told us (and Claude) what the interesting problems are, and confirmed both the existence of relevant undocumented functions, and that they are the right solution to those problems. What you didn't flag as interesting, and possible challenges you did not mention (such as these APIs being flaky, or restricted to Apple first-party use, or such) is even more telling.

Most hard problems are hard because of huge uncertainty around what's possible and how to get there. It's true for LLMs as much as it is for humans (and for the same reasons). Here, you gave solid answers to both, all but spelling out the solution.

ETA:

> Is that how you think data gets fed back into models during training?

No, one comment chain on a niche site is not enough.

It is, however, how the data gets fed into prompt, whether by user or autonomously (e.g. RAG).


> Yes. I don't think you appreciate just how much information your comments provide

Lol... no. You don't know how I solved the problem and you just read everything that Claude did.

Absolutely nothing in the key part of my solution uses a single public API (and there are thousands). And you think that Claude can just "figure that out" when my HN comments get fed back in during training?

I sincerely wish we'd see less /r/technology ridiculousness on HN.


I wonder how many 'ideas guys' will now think that with LLMs they can keep their precious to themselves while at the same time bragging about them in online fora. Before they needed those pesky programmers negotiating for a slice of the pie, but this time it will be different.

Next up: copyright protection and/or patents on prompts. Mark my words.


I'm pretty sure a large fraction of the vibecoded stuff out there is from the "ideas guys." This time will be different because they'll find out very quickly whether their ideas are worth anything. The term "slop" substantially applies to the ideas themselves.

I don't think there will be copyright or patents on prompts per se, but I do think patents will become a lot more popular. With AI rewriting entire projects and products from scratch, copyright for software is meaningless, so patents are one of the very few moats left. Probably the only moat for the little guys.


It one-shotted what exactly?

Because LatencyKills is clearly describing a broader set of requirements related to their solution.


> 167,383 * 426,397 = 71,371,609,051 ... You need to say why it can do some novel tasks but could never do others.

Model interpretability gives us the answers. The reason LLMs can (almost) do new multiplication tasks is because it saw many multiplication problems in its training data, and it was cheaper to learn the compressed/abstract multiplication strategies and encode them as circuits in the network, rather than memorize the times tables up to some large N. This gives it the ability to approximate multiplication problems it hasn't seen before.


> This gives it the ability to approximate multiplication problems it hasn't seen before.

More than approximate. It straight up knows the algorithms and will do arbitrarily long multiplications correctly. (Within reason. Obviously it couldn't do a multiplication so large the reasoning tokens would exceed its context window.)

I had ChatGPT 5.4 do 1566168165163321561 * 115616131811365737 without tools; after multiplying out a lot of coefficients, it eventually answered 181074305022287409585376614708755457, which is correct.

At this point, it's less misleading to say it knows the algorithm.


Why are we reducing AIs to LLMs?

Claude, OpenAI, etc.'s AIs are not just LLMs. If you ask it to multiply something, it's going to call a math library. Go feed it a thousand arithmetic problems and it'll get them 100% right.

The major AIs are a lot more than just LLMs. They have access to all sorts of systems they can call on. They can write code and execute it to get answers. Etc.


Yup, I agree with this. So based on this, where do you draw the line between what will be possible and what will not be possible?

Which is exactly how humans learn many things too.

E.g. observing a game being played to form an understanding of the rules, rather than reading the rulebook

Or: Observing language as a baby. Suddenly you can speak grammatically correctly even if you can't explain the grammar rules.


Most inventions are an interpolation of three existing ideas. These systems are very good at that.

My take as well. Furthermore, most innovations come relatively shortly after their technological prerequisites have been met, so that suggests the "novelty space" that humans generally explore is a relatively narrow band around the current frontier. Just as humans can search through this space, so too should machines be capable of it. It's not an infinitely unbounded search which humans are guided through by some manner of mystic soul or other supernatural forces.

Indeed. Every time someone complains that LLMs can't come up with anything new, I'm assaulted with the depressing remembrance that neither can I.

I can't even find a good example of an invention that is not an interpolation.

The inclined plane, the wheel, shall I keep going?

Stand on a fallen log on a hillside and you'll interpolate pretty hard.

> asserting that LLMs will never generate 'truly novel' ideas or problem solutions

I don't think I've had one of these my entire life. Truly novel ideas are exceptionally rare:

- Darwin's On the Origin of Species
- Gödel's incompleteness theorems
- Buddhist detachment

Can't think of many.


The hardest part about any creativity is hiding your influences

This is poetry.

People rarely create things that are wholly new.

Most created things are remixes of existing things.

Hallucinations are “something new”. And like most new things, useless. But the truth is the entire conversation is a hallucination. We just happen to agree that most of it is useful.


I think "novel" is ill defined here, perhaps. LLMs do appear to be poor general reasoners[0], and it's unclear if they'll improve here.

It would be unintuitive for them to be good at this, given that we know exactly how they're implemented - by looking at text and then building a statistical model to predict the next token. From this, if we wanted to commit to LLMs having generalizable knowledge, we'd have to assume something like "general reasoning is an emergent property of statistical token generation", which I'm not totally against but I think that's something that warrants a good deal of evidence.
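To make "a statistical model to predict the next token" concrete, here is a toy, illustrative version of that output step: a bigram model fit by counting. Real LLMs condition on the entire context with a neural network rather than on one previous word, but the final step is the same in kind - produce a distribution over next tokens and sample from it:

  import random
  from collections import Counter, defaultdict

  corpus = "the cat sat on the mat and the cat sat on the sofa".split()

  # Count how often each token follows each other token.
  counts = defaultdict(Counter)
  for prev, nxt in zip(corpus, corpus[1:]):
      counts[prev][nxt] += 1

  out = ["the"]
  for _ in range(6):
      dist = counts[out[-1]]
      if not dist:              # token never seen with a successor: stop
          break
      tokens, weights = zip(*dist.items())
      out.append(random.choices(tokens, weights=weights)[0])

  print(" ".join(out))          # e.g. "the cat sat on the mat and"

The question being debated is how much more than this kind of pattern completion emerges when the model and its training are scaled up.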

A single math problem being solved just isn't rising to that level of evidence for me. I think it is more on you to:

1. Provide a theory for how LLMs can do things that seemingly go beyond expectations based on their implementation (for example, saying that certain properties of reasoning are emergent or reduce to statistical constructs).

2. Provide evidence that supports your theory and ideally cannot be just as well accounted for by another theory.

I'm not sure if an LLM will never generate "novel" content because I'm not sure that "novel" is well defined. If novel means "new", of course they generate new content. If novel means "impressive", well I'm certainly impressed. If "novel" means "does not follow directly from what they were trained on", well I'm still skeptical of that. Even in this case, are we sure that the LLM wasn't trained on previous published works, potentially informal comments on some forum, etc, that could have steered it towards this? Are we sure that the gap was so large? Do we truly have countless counterexamples? Obviously this math problem being solved is not a rigorous study - the authors of this don't even have access to the training data, we'd need quite a bit more than this to form assumptions.

I'm willing to take a position here if you make a good case for it. I'm absolutely not opposed to the idea that other forms of reasoning can reduce to statistical token generation, it just strikes me as unintuitive and so I'm going to need to hear something to compel me.

[0] https://jamesfodor.com/2025/06/22/line-goes-up-large-languag...


> I think "novel" is ill defined here

That's exactly my point. When people say "LLMs will never do something novel," they seem to be leaning on some vague, ill-defined notion of novelty. The burden of proof is then to specify what degree of novelty is unattainable and why.

As for evidence that they can do novel things, there is plenty:

1. I really did ask Gemini to multiply 167,383 * 426,397 before posting this question. It answered correctly.

2. SVGs of pelicans riding bicycles

3. People use LLMs to write new apps/code every day

4. LLMs have achieved gold-medal performance on Math Olympiad problems that were not publicly available

5. LLMs have solved open problems in physics and mathematics [0,1]

That is as far as they have advanced so far. What's next? Where is the limit? All I want to say is that I don't know, and neither do you :).

[0] https://www.reddit.com/r/Physics/comments/1n77h10/and_severa... (Mark Van Raamsdonk is a pretty famous researcher in high-energy physics, not just some random guy)

[1] https://mathstodon.xyz/@tao/115855840223258103

[2] https://news.ycombinator.com/item?id=47497757


Actually here's an even better list of progress on a number of open math problems, with plenty of caveats and exposition:

https://github.com/teorth/erdosproblems/wiki/AI-contribution...


This is great observational data but it's an early "step 1", I'd definitely need to see an actual analysis of these cases and likely want to have that analysis involve a review of relevant training data.

The “good deal of evidence” is everywhere. The proof is in the pudding. Of course you can find failure modes, the blog article (not an actual paper?) rightfully derides benchmarks and then…creates a benchmark? Designed to elicit failure modes, ok so what? As if this is surprising to anyone and somehow negates everything else?

Anyone who says that “statistical models for next token generation” are unlikely to provide emergent intelligence I think is really not understanding what a statistical model for next token generation really means. That is a proxy task DESIGNED to elicit intelligence because in order to excel at that task beyond a certain point you need to develop the right abstractions and decide how to manipulate them to predict the next token (which, by the way, is only one of many many stages of training). That’s indistinguishable from intelligence. It’s like saying “I think it’s surprising that a jumble of trillions of little cells zapping each other would produce emergent intelligence” while ignoring the fact that brains are clearly intelligent.


> The “good deal of evidence” is everywhere. The proof is in the pudding.

I'm open! Please, by all means.

> the blog article (not an actual paper?) rightfully derides benchmarks and then…creates a benchmark?

The blog article is a review of benchmarking methodologies and the issues involved by a PhD neuroscientist who works directly on large language models and their applications to neuroscience and cognition, it's probably worth some consideration.

> Anyone who says that “statistical models for next token generation” are unlikely to provide emergent intelligence I think is really not understanding what a statistical model for next token generation really means.

Okay.

> That is a proxy task DESIGNED to elicit intelligence because in order to excel at that task beyond a certain point you need to develop the right abstractions and decide how to manipulate them to predict the next token (which, by the way, is only one of many many stages of training).

This isn't a great argument. It seems to say that in order for LLMs to do well they must have emergent intelligence. That is not evidence for LLMs having emergent intelligence, it's just stating that a goal would be to have it.

As I said, a theoretical framework with real tests would be great. That's how science is done, I don't really think I'm asking for a lot here?

> It’s like saying “I think it’s surprising that a jumble of trillions of little cells zapping each other would produce emergent intelligence” while ignoring the fact that brains are clearly intelligent.

Well, it is a bit surprising. But we have an extremely robust model for exactly that - there are fields dedicated to it, we can create simulations and models, we can perform interventative analysis, we have a theory and falsifying test cases, etc. We don't just say "clearly brains are intelligent, therefor intelligence is an emergent property of cells zapping" lol that would be absurd.

So I'm just asking for you to provide a model and evidence. How else should I form my beliefs? As I've expressed, I have reasons to find the idea of emergent logic from statistical models surprising, and I have no compelling theory to account for that nor evidence to support that. If you have a theory and evidence, provide it! I'd be super interested, I'm in no way ideologically opposed to the idea. I'm a functionalist so I fundamentally believe that we can build intelligent systems, I'm just not convinced that LLMs are doing that - I'm not far though, so please, what's the theory?


> The “good deal of evidence” is everywhere. The proof is in the pudding. I'm open! Please, by all means.

sure here are but a few: [1] you get smooth gains in reasoning with more RL train-time compute and more test-time compute (o1)

[2] DeepSeek-R1 showed that RL on verifiable rewards produces behavior like backtracking, adaptation, reflection, etc.

[3] SWE-Bench is a relatively decent benchmark and perf here is continually improving — these are real GitHub issues in real repos

[4] MathArena — still good perf on uncontaminated 2025 AIME problems

[5] the entire field of reinforcement learning, plus successes in other fields with verifiable domains (e.g. AlphaGo); Bellman updates will give you optimal policies eventually

[6] Anthropic's cool work looking effectively at the biology of a large language model: https://transformer-circuits.pub/2025/attribution-graphs/met... — if you trace internal circuits in Haiku 3.5 you see what you expect from a real reasoning system: planning ahead, using intermediate concepts, operating in a conceptual latent space (above tokens). And that's Haiku 3.5!!! We’re on Opus 4.6 now…

people like to move goalposts whenever a new result comes out, which is silly. Could AI systems do this 2 years ago? No. I don’t know how people can look at robust trends in performance improvement, combined with verifiable RL rewards, and not understand where things are going.

> The blog article is a review of benchmarking methodologies and the issues involved by a PhD neuroscientist who works directly on large language models and their applications to neuroscience and cognition, it's probably worth some consideration.

Appeals to authority are a fine prior, but lo and behold I also have a PhD and have worked on and led benchmark development professionally for several years at an AI lab. That’s ultimately no reason to really trust either of us. As I said, the blog post rightfully decries benchmarks but it then presents a new benchmark as though that isn’t subject to all of the same problems. It’s a good article! I think they do a good job here! I agree with all of their complaints about benchmarks! It rightfully identifies failure modes, and there are plenty of other papers pointing out similar failure modes. Reasoning is still brittle, lots of areas where LLMs/agentic systems fail in ways that are incredible given their talent in other areas. But you pretend as though this is definitive evidence that “LLMs are poor general reasoners”. This is just not true, but it is true that they are brittle and fallible in weird ways, today.

> This isn't a great argument. It seems to say that in order for LLMs to do well they must have emergent intelligence. That is not evidence for LLMs having emergent intelligence, it's just stating that a goal would be to have it.

"They do well, therefore intelligence" is not an argument, sure. But that’s also not what I’m saying. The Occam’s razor here is that reasoning-like computation is the best explanation for an increasing amount of the observed behavior, especially in fresh math and real software tasks where memorization is a much worse fit.

> As I said, a theoretical framework with real tests would be great. That's how science is done, I don't really think I'm asking for a lot here?

I would encourage you to read Kuhn's The Structure of Scientific Revolutions. "That's how science is done" is a bit of an oversimplification of how the sausage is made here. Real science moves forward in a messy mix of partial theory + better measurements + interventions, long before anyone has some sort of grand unified framework. Neuroscience is no different here. And I would say at this point with LLMs we now do have pretty decent tests: fresh verifiable-task evals, mechanistic circuit tracing, causal activation patching, and scaling results for RL/test-time compute. The claim that there is no framework + no real tests is just not true anymore. It's not like we have some finished theory of reasoning, but that's a bit of an unfair demand at this point and is asymmetrical as well.

> It’s like saying “I think it’s surprising that a jumble of trillions of little cells zapping each other would produce emergent intelligence” while ignoring the fact that brains are clearly intelligent.

>> Well, it is a bit surprising. But we have an extremely robust model for exactly that - there are fields dedicated to it, we can create simulations and models, we can perform interventative analysis, we have a theory and falsifying test cases, etc. We don't just say "clearly brains are intelligent, therefore intelligence is an emergent property of cells zapping" lol that would be absurd.

>> So I'm just asking for you to provide a model and evidence. How else should I form my beliefs? As I've expressed, I have reasons to find the idea of emergent logic from statistical models surprising, and I have no compelling theory to account for that nor evidence to support that. If you have a theory and evidence, provide it! I'd be super interested, I'm in no way ideologically opposed to the idea. I'm a functionalist so I fundamentally believe that we can build intelligent systems, I'm just not convinced that LLMs are doing that - I'm not far though, so please, what's the theory?

The model is: reasoning is not inherently human, it’s mathematical. It falls easily within the purview of RL, statistics, representation, optimization, etc, and to claim otherwise would require evidence.

What is the robust model for reasoning in humans again? Simulations and models — what are these? Interventative analysis — we can’t do this with LLMs? Falsifying test cases — what would satisfy you here beyond everything I’ve presented above? Also I’m confused by your last part. You say “brains are intelligent” ==> “intelligence is an emergent property of cells zapping” is absurd, but why? You start from the position that brains are intelligent, so why is this absurd within your argument? Brains _are_ made up of real, physical atoms organized into molecules organized into cells organized into a coordinated system, and…that’s it? What’s missing here?


Thanks, this is great and I'll have quite a bit to read here.

> People like to move goalposts whenever a new result comes out, which is silly. Could AI systems do this 2 years ago? No. I don't know how people can look at robust trends in performance improvement, combined with verifiable RL rewards, and not understand where things are going.

I don't think it's goal post moving to acknowledge improvements but still reject the conclusion that AI has reached a specific milestone if those improvements don't justify the position. I doubt anyone sensible is rejecting improvements.

> But you pretend as though this is definitive evidence that “LLMs are poor general reasoners”.

I don't think I've ever made any definitive claims at all, quite the contrary - I've tried to express exactly how open I am to what you're saying. As I've said, I'm a functionalist, and I already am largely supportive of reductive intelligence, so I'm exactly the type of person who would be sympathetic to what you're saying.

> "That’s how science is done" is a bit of an oversimplification

Of course, but I don't think it's too much to ask for a theory and evidence. I don't need a lined-up series of papers that all start with perfect syllogisms and then map to well-controlled RCTs or whatever. Just an "I think this accounts for it, here's how I support that".

> The claim that there is no framework + no real tests is just not true anymore.

I didn't say it wasn't true, to be clear, I asked for it. Again, I'm sympathetic to the view at a glance so I simply need a way to reason about it.

No need for a complete view, I'd never expect such a thing.

> The model is: reasoning is not inherently human, it’s mathematical.

Well, hand wringing perhaps, but I'd say it's maybe mathematical, computational, structural, functional, whatever - I think we're on the same page here regardless.

> It falls easily within the purview of RL, statistics, representation, optimization, etc, and to claim otherwise would require evidence.

Sure, but I grant that, in fact I believe it entirely. But that doesn't mean that every mathematical construct exhibits the function of intelligence.

> What is the robust model for reasoning in humans again? Simulations and models — what are these? Interventative analysis — we can’t do this with LLMs? Falsifying test cases — what would satisfy you here beyond everything I’ve presented above?

Sorry, I'm not fully understanding this framing. We can do those things with LLMs, and it's hard to say what would satisfy me. In general, I'd be satisfied with a theory that (a) accounts for the data, (b) has supporting evidence, and (c) does not contradict any major prior commitments. I don't think (c) will be an issue here.

> You say “brains are intelligent” ==> “intelligence is an emergent property of cells zapping” is absurd,

Because intelligence could have been a property of our brains being wet, or roundish, or it could have been a property of our spines, or maybe some force we hadn't discovered, or a soul, etc. We formed a theory, it accounted for observations, we performed tests, we've modeled things, etc, and so the theories we've adopted have been extremely successful and I think hold up quite well. But certainly we didn't go "the brain has electricity, the brain is intelligent, therefore electricity in the brain is what drives intelligence".

> Brains _are_ made up of real, physical atoms organized into molecules organized into cells organized into a coordinated system, and…that’s it? What’s missing here?

Certainly nothing on my world view.


Beliefs are not rooted in facts. Beliefs are a part of you, and people aren't all that happy to say "this LLM is better than me"

I'm very happy to say calculators are far better than me at calculations (to a given precision). I'm happy to admit computers are so much better than me in so many aspects. And I have no problem saying LLMs are very helpful tools able to generate output so much better than mine in almost every field of knowledge.

Yet, whenever I ask it to do something novel or creative, it falls very short. But humans are ingenious beasts and I'm sure sooner or later they will design an architecture able to be creative - I just doubt it will be Transformer-based, given the results so far.


But the question isn't whether you can get LLMs to do something novel, it's whether anyone can get them to do something novel. Apparently someone can, and the fact that you can't doesn't mean LLMs aren't good for that.

When it comes to LLMs doing novel things, is it just the infinite monkey theorem[0] playing out at an accelerated rate, helped along by the key presses not being truly random?

Surely if we tell the LLM to do enough stuff, something will look novel, but how much confirmation bias is at play? Tens of millions of people are using AI and the biggest complaint is hallucinations. From the LLM's perspective, is there any difference between a novel solution and a hallucination, other than the dumb luck of the hallucination being right?

[0] https://en.wikipedia.org/wiki/Infinite_monkey_theorem


This argument doesn't go the way you want it to go. Billions of people exist, but maybe a few tens of thousands produce novel knowledge. That's a much worse rate than LLMs.

I’m not sure how we equate the number of humans to AI to determine a success rate.

We also can't ignore that it was humans who thought up this problem to give to the AI. Thinking has two parts, asking and answering questions. The AI needed the human to formulate and ask the question to start. AI isn't just dropping random discoveries on us that we haven't even thought of, at least not that I've seen.


To have a proper discussion we would have to define the word "novel", and that's a challenge in itself. In any case, millions of people have tried asking LLMs to do something creative and the results were bland. Hence my conclusion that LLMs aren't good for that. But I'm also open to the possibility that they can be an element of a longer chain that could demonstrate some creativity - we'll see.

Novel is a tricky word. In this case, the LLM produced a python program that was similar to other programs in its corpus, and this python program generated examples of hypergraphs that hadn't been seen before.

That's a new result, but I don't know about novel. The technique was the same as earlier work in this vein. And it seems like not much computational power was needed at all. (The article mentions that an undergrad left a laptop running overnight to produce one of the previous results; that's absolute peanuts when compared to most computational research.)


I have never seen a human produce a Python program that wasn't similar to other programs they'd seen.

So? I certainly have.

Truly novel? All art is derivative.

If all art is derivative then the earlier statement is a tautology.

People still call things other people do novel. There's clear social proof that humans do things that other humans consider novel. Otherwise the word would probably not exist.

Just today I wrote a python program that did not resemble anything I'd written before, nor had I seen anything similar. I had to reason it out myself. That passes the test that the original comment set.


Your threshold for "resemble" is obviously quite high, which is fair, but assuming that you're an encultured programmer your python code resembles other people's python code. It might be doing something novel, but that thing it's doing is interacting with, in response to, or otherwise relative to existing concepts you learned or saw elsewhere. All art is derivative; we can do things other people haven't done before, but all of it derives from the works of others in some way.

Anyway, I've coded all kinds of wacky shit with claude that I guarantee nobody has implemented before, if only because they're stupid and tedious ideas. They can't all be winners, but they were novel, and yet claude code implemented them as confidently as if they were yet another note taking app. They have no problem handling novel ideas, and although the novel ideas in this case were my own, it's easy to see how finding new ideas could be automated by exploring the combinatorial space of existing ideas.


I'm not talking about wacky. My barrier for novel is 1) new capabilities 2) useful, and 3) end-to-end tested.

For example, what I referred to having written is a dynamic storage solution for n-dimensional grids, which can grow arbitrarily in any direction and is locally dense (organized into spatially indexed blocks of contiguous data).

I had never considered this problem before, and I certainly had never seen a solution before (even though there may well be one).

I worked it out on paper, considering how integer lattices can be partitioned and indexed, and then I transformed that into a design which I then implemented. Working purely from the design, not considering existing solutions.
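
For the curious, here's a minimal sketch of the kind of structure I mean (illustrative only, not my actual implementation; the block size and row-major flattening are arbitrary choices):

    # Sparse map from block coordinate -> dense, contiguous list of cells.
    class BlockGrid:
        def __init__(self, ndim, block_size=16, fill=0):
            self.ndim = ndim
            self.block_size = block_size
            self.fill = fill
            self.blocks = {}

        def _split(self, index):
            # Global index -> (block coordinate, offset within that block).
            block = tuple(i // self.block_size for i in index)
            offset = tuple(i % self.block_size for i in index)
            return block, offset

        def _flat(self, offset):
            # Row-major flattening of the in-block offset.
            f = 0
            for o in offset:
                f = f * self.block_size + o
            return f

        def __setitem__(self, index, value):
            block, offset = self._split(index)
            if block not in self.blocks:
                self.blocks[block] = [self.fill] * (self.block_size ** self.ndim)
            self.blocks[block][self._flat(offset)] = value

        def __getitem__(self, index):
            block, offset = self._split(index)
            if block not in self.blocks:
                return self.fill
            return self.blocks[block][self._flat(offset)]

    # Negative indices floor-divide cleanly in Python, so the grid grows in any direction.
    g = BlockGrid(ndim=3)
    g[(-100, 5, 7)] = 42
    print(g[(-100, 5, 7)], g[(0, 0, 0)])  # 42 0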


It's not possible to know something without believing it to be true. https://en.wikipedia.org/wiki/Belief#/media/File:Classical_d...

This is objectively wrong. If that were the case, every scientist performing a test would always have had their expectations and beliefs proven true. Likewise, if you're trying to disprove something because you believe it to be wrong, you would never be proven wrong.

re-read your post - it's just a bunch of nonsense, no actual reasoning in there

> e.g. 167,383 * 426,397 = 71,371,609,051

They may be wrong, but so are you.



You missed the point.

I missed it too, care to explain?

Not sure what you mean?

Can you elaborate?


You could have just checked the math yourself, you know.

My pocket calculator says the same thing and it doesn't even have training data.

Huh, casually flexing with a sentient pocket calculator!!!

It is like not trusting someone who attained the highest score in some exam by memorizing the whole textbook to do the corresponding job.

Not very hard to understand.


Yet we do that all the time by hiring based on GPA/degree.

Do you hire or screen based on them?

>>AI is a remixer; it remixes all known ideas together. It won't come up with new ideas

I always found this argument very weak. There isn't that much that is truly new anyway. Creativity is often about mixing old ideas. Computers can do that faster than humans if they have a good framework. Especially with something as simple as math - a limited set of formal rules and easy-to-verify results - I find the belief that computers won't beat humans at it very naive.


It's fear.

Do we know for a fact that LLMs aren't now configured to pass simple arithmetic like this to a simple calculator, to add the illusion of actual insight?

The major AIs have access to all sorts of tools, including a math library. I thought this was well-known. There's no "illusion of actual insight" - they're just "using a calculator" (in the sense that they call a math library when needed). AIs are not just LLMs.

You can train an LLM on just multiplication and test it on problems it has never seen before; it's nothing particularly magical.

It's not 'magic', but LLMs have previously performed very badly on longer multiplication. 'Insight' is the wrong word, but I'm saying maybe they're not wildly better at this kind of calculation... maybe they are just optimising these well-known jagged edges.

Ximm's Law applies ITT: every critique of AI assumes to some degree that contemporary implementations will not, or cannot, be improved upon.

Especially the lemmas:

- any statement about AI which uses the word "never" to preclude some feature from future realization is false.

- contemporary implementations have almost always already been improved upon, but are unevenly distributed.


Anti-Ximm's Law: every response to a critique of AI assumes as much arbitrary level of future improvement as is necessary to make the case.

When I read through what they're doing, it sure doesn't sound like it's generating something new as people typically think of it. In the link, they provide a very well-defined problem and they just loop through it.

I think you're arguing with semantics.


Yes! I call these the "it's just a stochastic parrot" crowd.

Ironically, they are the stochastic parrots, because they're confidently repeating something that they read somewhere and haven't examined critically.


That would not be stochastic, just parroting

It's not deterministic therefore it's stochastic.

I guess when it can't be tripped up by simple things like multiplying numbers, counting to 100 sequentially or counting letters in a string without writing a python program, then I might believe it.

Also no matter how many math problems it solves it still gets lost in a codebase


LLMs are bad at arithmetic and counting by design. It's an intentional tradeoff that makes them better at language and reasoning tasks.

If anybody really wanted a model that could multiply and count letters in words, they could just train one with a tokenizer and training data suited to those tasks. And the model would then be able to count letters, but it would be bad at things like translation and programming - the stuff people actually use LLMs for. So, people train with a tokenizer and training data suited to those tasks, hence LLMs are good at language and bad at arithmetic.
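
To make the tokenizer point concrete, here's a tiny illustration (assumes the tiktoken package is installed; the exact splits depend on the tokenizer and may differ):

    # Common BPE tokenizers chunk numbers into multi-digit tokens, so the model
    # never sees a clean digit-by-digit representation of the operands.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for text in ["167383", "426397", "167383 * 426397"]:
        pieces = [enc.decode([t]) for t in enc.encode(text)]
        print(repr(text), "->", pieces)

    # A model trained with a digit-level tokenizer (one token per digit) would find
    # long multiplication easier, at the cost of longer sequences everywhere else.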


Arguments like "but AI cannot reliably multiply numbers" fundamentally misunderstand how AI works. AI cannot do basic math not because AI is stupid, but because basic math is an inherently difficult task for otherwise smart AI. Lots of human adults can do complex abstract thinking but when you ask them to count it's "one... two... three... five... wait I got lost".

> fundamentally misunderstand how AI works

Who does fundamentally understand how LLMs work? Many claims flying around these days, all backed by some of the largest investments ever collectively made by humans. Lots of money to be lost because of fundamental misunderstandings.

Personally, I find that AI influencers conveniently brush away any evidence (like inability to perform basic arithmetic) about how LLMs fundamentally work as something that should be ignored in favor of results like TFA.

Do LLMs have utility? Undoubtedly. But it’s a giant red flag for me that their fundamental limitations, of which there are many, are verboten to be spoken about.


You're not doing yourself a favor when you point out "but they can't do arithmetic!" as if anyone says otherwise. Yes, we all know they can't do arithmetic, and that's just how they work.

I feel like I'm saying "this hammer is so cool, it's made driving nails a breeze" and people go "but it can't screw screws in! Why won't anyone talk about that! Hammers really aren't all they're cracked up to be".


Maybe because society has invested $trillions into this hammer and influencers are trying to convince CEOs to fire everyone and buy a bunch of hammers instead.

My comment even said “LLMs have utility”. I gave an inch, and now the mile must be taken.


Saying that the fundamental limitations are things like counting the number of rs in strawberry is boring, though. That's how tokens work and it's trivial to work around.

Talking about how they find it hard to say they aren't sure of something is a much more interesting limitation to talk about, for example.


> Talking about how they find it hard to say they aren't sure of something is a much more interesting limitation to talk about, for example.

Sure, thank you for steelmanning my argument. I didn’t think I needed to actually spell out all of the fundamental limitations of LLMs in this specific thread. They are spoken at length across the web, but are often met with pushback, which was my entire point.

Here's another one: LLMs do not have a memory property. Shut off the power and turn it back on and you lose all context. Any "memory" feature implemented by companies that sell LLM wrappers is a hack on top of how LLMs work, like seeding a context window before letting the user interact with the LLM.
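
Roughly, the hack looks like this sketch (all names here are hypothetical placeholders, not any vendor's actual API):

    # "Memory" as context seeding: stored notes are prepended to the prompt before
    # each stateless model call; nothing persists inside the model itself.

    def load_saved_notes(user_id: str) -> list[str]:
        # In a real wrapper this would read from a database or vector store.
        return ["Prefers concise answers.", "Is working on a Rust project."]

    def build_messages(user_id: str, user_prompt: str) -> list[dict]:
        memory_block = "\n".join(load_saved_notes(user_id))
        return [
            {"role": "system", "content": "Known facts about the user:\n" + memory_block},
            {"role": "user", "content": user_prompt},
        ]

    # Every call starts from these messages alone, so the "memory" lives entirely
    # outside the weights and vanishes the moment you stop supplying it.
    print(build_messages("u123", "Pick up where we left off yesterday."))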


But that's also like saying "humans don't have a memory property, any 'memory' is in the hippocampus". It's not useful to say that "an LLM you don't bother to keep training has no memory". Of course it doesn't, you removed its ability to form new memories!

So why then do we stop training LLMs and keep them stored at a specific state? Is it perhaps because the results become terrible and LLMs have a delicate optimal state for general use? This sounds like an even worse case for a model of intelligence.

Nope, it's not that, but it's nice of you to offer a straw man. Makes the argument flow better.

Not entirely a straw man. What is the purpose of storing and retrieving LLMs at a fixed state if not to guarantee a specific performance? Wouldn’t a strong model of intelligence be capable of, to extend your analogy, running without having its hippocampus lobotomized?

Given the precariousness of managing LLM context windows, I don’t think it’s particularly unfair to assume that LLMs that learn without limit become very unstable.

To steelman, if it’s possible, it may be prohibitively expensive. But somehow I doubt it’s possible.


It is, indeed, prohibitively expensive. But it's not impossible. The proof is in the fact that you can fine-tune LLMs.

Because no one owns a $300 billion hammer that literally runs on fancy calculators.

Ok, I'll bite. Show me an LLM that comes up with a new math operator. Or one that comes up with the theory of relativity when only Newtonian physics is in its training dataset. That it can remix existing ideas in ways that lead to novel insights is expected; however, the current LLMs can't come up with paradigm shifts that require novel insights. Even humans have a rather limited window in which they can come up with novel insights (when they are young, capable of latent thinking, not yet ossified by the existing formalization of science, and their brain is still energetically capable, without the vascular and mitochondrial dysfunction common as we age).

How many humans have been born until now and how many Einsteins have been born? And in how many hundreds of thousands of years?

The point is that humans do have some edge compared to current LLMs which are essentially next token predictors. If we all start relying on current AI and stop thinking, we would only be able to "exhaust the remix space" of existing ideas but won't be able to do any paradigm jumps. Moreover, it's quite likely that current training sets are self-contradictory, containing Dutch books, carrying some innate error in them.

It takes a lot of intelligence to "essentially predict" next token when you're doing a math proof. Or writing code.

Their 'Open Problems page' linked below gives some interesting context. They list 15 open problems in total, categorized as 'moderately interesting,' 'solid result,' 'major advance,' or 'breakthrough.' The solved problem is listed as 'moderately interesting,' which is presumably the easiest category. But it's notable that the problem was selected and posted here before it was solved. I wonder how long until the other 3 problems in this category are solved.

https://epoch.ai/frontiermath/open-problems


I’d hope this isn’t a goal post move - an open math problem of any sort being solved by a language model is absolute science fiction.

That's been achieved already with a few Erdős problems, though those tended to be ambiguously stated in a way that made them less obviously compelling to humans. This problem is obscure; even the linked writeup admits that perhaps ~10 mathematicians worldwide are genuinely familiar with it. But it's not unfeasibly hard for a few weeks' or months' work by a human mathematician.

FWIW https://github.com/teorth/erdosproblems/wiki/AI-contribution... in particular the disclaimers are very interesting.

It is not. You're operating under the assumption that all open math problems are difficult and novel.

This particular problem was about improving the lower bound for a function tracking a property of hypergraphs (undirected graphs where edges can contain more than two vertices).

Both constructing hypergraphs (sets) and lower bounds are very regular, chore type tasks that are common in maths. In other words, there's plenty of this type of proof in the training data.

LLMs kind of construct proofs all the time, every time they write a program. Because every program has a corresponding proof. It doesn't mean they're reasoning about them, but they do construct proofs.

This isn't science fiction. But it's nice that the LLMs solved something for once.


> nice that the LLMs solved something for once.

That sentence alone needs unpacking IMHO, namely that no LLM suddenly decided that today was the day it would solve a math problem. Instead a couple of people who love mathematics, doing it either for fun or professionally, directly ask a model to solve a very specific task that they estimated was solvable. The LLM itself was fed countless related proofs. They then guided the model and verified until they found something they considered good enough.

My point is that the system itself is not the LLM alone, as that would be radically more impressive.


I 100% agree. The LLM was just used to autocomplete a ready-made strategy.

But human researchers are also remixers. Copying something I commented below:

> Speaking as a researcher, the line between new ideas and existing knowledge is very blurry and maybe doesn't even exist. The vast majority of research papers get new results by combining existing ideas in novel ways. This process can lead to genuinely new ideas, because the results of a good project teach you unexpected things.


This is a way too simplistic model of the things humans provide to the process. Imagination, Hypothesis, Testing, Intuition, and Proofing.

An AI can probably do an 'okay' job at summarizing information for meta studies. But what it can't do is go "Hey that's a weird thing in the result that hints at some other vector for this thing we should look at." Especially if that "thing" has never been analyzed before and there's no LLM-trained data on it.

LLMs will NEVER be able to do that, because it doesn't exist. They're not going to discover and define a new chemical, or a new species of animal. They're not going to be able to describe and analyze a new way of folding proteins and what implications that has UNLESS you are basically constantly training the AI on random protein folds.


I think you are vastly underestimating the emergent behaviours in frontier foundational models and should never say never.

Remember, the basis of these models is unsupervised training, which, at sufficient scale, gives them the ability to detect pattern anomalies out of context.

For example, LLMs have struggled with generalized abstract problem solving, such as "mystery blocks world" that classical AI planners dating back 20+ years or more are better at solving. Well, that's rapidly changing: https://arxiv.org/html/2511.09378v1


No idea how underestimated things are, but marketing terms like "frontier foundational models" don't help to foster trust in a hyper-hyped domain.

That is, even if there are cool things that LLMs now make more affordable, the level of bullshit marketing attached to them is also very high, which makes it far harder to build a noise filter.


>Hey that's a weird thing in the result that hints at some other vector for this thing we should look at

Kinda funny because that looked _very_ close to what my Opus 4.6 said yesterday when it was debugging compile errors for me. It did proceed to explore the other vector.


> Especially if that "thing" has never been analyzed before and there's no LLM-trained data on it.

This is the crucial part of the comment. LLMs are not able to solve stuff that hasn't been solved in that exact or a very similar way already, because they are prediction machines trained on existing data. They are very able to spot outliers where they have been found by humans before, though, which is important, and is what you've been seeing.


""Hey that's a weird thing in the result that hints at some other vector for this thing we should look at." "

This is very common already in AI.

Just look at the internal reasoning of any high thinking model, the trace is full of those chains of thought.


But just like how there were never any clips of Will Smith eating spaghetti before AI, AI is able to synthesize different existing data into something in between. It might not be able to expand the circle of knowledge but it definitely can fill in the gaps within the circle itself

> LLMs will NEVER be able to do that, because it doesn't exist.

I mean, TFA literally claims that an AI has solved an open Frontier Math problem, described as "A collection of unsolved mathematics problems that have resisted serious attempts by professional mathematicians. AI solutions would meaningfully advance the state of human mathematical knowledge."

That is, if true, it reasoned out a proof that does not exist in its training data.


It generated a proof that was close enough to something in its training data to be generated.

That may be, and we can debate the level of novelty, but it is novel, because this exact proof didn't exist before, something which many claim was not possible with AI. In fact, just a few years ago, based on some dabbling in NLP a decade ago, I myself would not have believed any of this was remotely possible within the next 3 - 5 decades at least.

I'm curious though, how many novel Math proofs are not close enough to something in the prior art? My understanding is that all new proofs are compositions and/or extensions of existing proofs, and based on reading pop-sci articles, the big breakthroughs come from combining techniques that are counter-intuitive and/or others did not think of. So roughly how often is the contribution of a proof considered "incremental" vs "significant"?


Well, for one the proof would have to use actual proof techniques.

What really happened here was that the LLM produced a python script that generated examples of hypergraphs that served as proof by example.

And the only thing that has been verified are these examples. The LLM also produced a lot of mathematical text that has not been analyzed.
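
For a sense of what "proof by example via a python script" amounts to, here is a rough sketch (the property checked below is a stand-in, not the actual condition from the problem):

    # A combinatorial "proof by example": enumerate small hypergraphs, check a
    # property, and keep any witness that certifies a bound. The witness itself
    # is the verifiable artifact; the search code needs no trust at all.
    from itertools import combinations

    def all_hypergraphs(n_vertices, edge_size, n_edges):
        edges = list(combinations(range(n_vertices), edge_size))
        return combinations(edges, n_edges)

    def pairwise_intersecting(hypergraph):
        # Stand-in property: every pair of edges shares at least one vertex.
        return all(set(a) & set(b) for a, b in combinations(hypergraph, 2))

    best, witness = 0, None
    for n_edges in range(1, 7):
        for hg in all_hypergraphs(n_vertices=5, edge_size=3, n_edges=n_edges):
            if pairwise_intersecting(hg):
                best, witness = n_edges, hg
                break  # one witness is enough to certify this size

    print(best, witness)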


I see, thanks for the explanation!

Do you know that from reading the proof, or are you just assuming this based on what you think LLMs should be capable of? If the latter, what evidence would be required for you to change your mind?

- Edit: I can't reply, probably because the comment thread isn't allowed to go too deep, but this is a good argument. In my mind the argument isn't that coding is harder than math, but that the problems had resisted solution by human researchers.


1) This is a proof by example. 2) The proof is conducted by writing a python program constructing hypergraphs. 3) The consensus was this was low-hanging fruit ready to be picked, and tactics for this problem were available to the LLM.

So really this is no different from generating any python program. There are also many examples of combinatoric construction in python training sets.

It's still a nice result, but it's not quite the breakthrough it's made out to be. I think that people somehow see math as a "harder" domain, and are therefore attributing more value to this. But this is a quite simple program in the end.


One of the possible outcomes of this journey is that “LLMs can never do X”. Another is that X is easier than we thought.

Or that some quixotic problems nobody cared about enough to actually work on do have some solution.

>But human researchers are also remixers.

Some human researchers are also remixers, to some degree.

Can you imagine AI coming up with refraction & separation like Newton did?


That sets a vastly higher bar than what we're talking about here. You're comparing modern AI to one of the greatest geniuses in human history. Obviously AI is not there yet.

That being said, I think this is a great question. Did Einstein and Newton use a qualitatively different process of thought when they made their discoveries? Or were they just exceedingly good at what most scientists do? I honestly don't know. But if LLMs reach super-human abilities in math and science but don't make qualitative leaps of insight, then that could suggest that the answer is 'yes.'


AI does not have a physical body to make experiments in the real world and build and use equipment

Maybe not, but more than 99.999999% of humans would also not come up with that.

Or even gravity to explain an apple falling from a tree, when almost all of the knowledge until then realistically suggested nothing about gravity?

Speaking as a researcher, the line between new ideas and existing knowledge is very blurry and maybe doesn't even exist. The vast majority of research papers get new results by combining existing ideas in novel ways. This process can lead to genuinely new ideas, because the results of a good project teach you unexpected things.

My biggest hesitation with AI research at the moment is that they may not be as good at this last step as humans. They may make novel observations, but will they internalize these results as deeply as a human researcher would? But this is just a theoretical argument; in practice, I see no signs of progress slowing down.


This is my take as well. A human who learns, say, a Towers of Hanoi algorithm will be able to apply it and use it next time without having to figure it out all over again. An LLM would probably get there eventually, but would have to do it all over again from scratch the next time. This makes it difficult to combine lessons in new ways. Any new advancement relying on that foundational skill relies on, essentially, climbing the whole mountain from the ground.

I suppose the other side of it is that if you add what the model has figured out to the training set, it will always know it.
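
For reference, here's the standard recursive solution the comment alludes to; the point is that a person internalizes this once, while a stateless model has to rediscover or retrieve it each session:

    # Towers of Hanoi: move n disks from source to target using spare.
    def hanoi(n, source, target, spare):
        if n == 0:
            return
        hanoi(n - 1, source, spare, target)   # clear the way
        print("move disk", n, ":", source, "->", target)
        hanoi(n - 1, spare, target, source)   # stack the rest on top

    hanoi(3, "A", "C", "B")  # 2**3 - 1 = 7 moves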


I got Gemini to find a polynomial-time algorithm for integer factoring, but then I mysteriously got locked out of my Google account. They should at least refund me the tokens.

That sounds like the start of a very lucrative career. Are you sure it was Gemini and not an AI competitor offering affiliate commission? ;)

For number of mathematicians familiar with and actively working on the problem, modern mathematics research is incredibly specialized, so it's easy to keep track of who's working on similar problems. You read each other's papers, go to the same conferences etc.

For "how long an expert would take" to solve a problem, for truly open problems I don't think you can usually answer this question with much confidence until the problem has been solved. But once it has been solved, people with experience have a good sense of how long it would have taken them (though most people underestimate how much time they need, since you always run into unanticipated challenges).


I hope any hold-outs who aren't convinced yet will be after reading this comment!

Did you wind up sticking with Windows (or Mac) for a long time after this? How long until you tried again?


Yeah this reduces the time required to crack a password from

(# available characters) ^ (password length)

to

(# available characters) * (password length).

If you were patient you could crack someone's passwords by hand.
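
For a concrete feel of the gap, take a 62-character alphabet (a-z, A-Z, 0-9) and an 8-character password (numbers are just illustrative):

    # Exponential vs. linear search space when each character can be confirmed
    # independently instead of the whole password at once.
    chars, length = 62, 8
    print(chars ** length)  # 218_340_105_584_896 guesses in the worst case
    print(chars * length)   # 496 guesses in the worst case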


100%. Few things on this site have resonated with me so much.
