
I know I'm running a bit late to the party here, but maybe someone can provide some color on something that I (being on the slightly older end of the spectrum when it comes to this) don't fully understand.

When people talk about leaving their agents to run overnight, what are those agents actually doing? What limited utility I've gotten from agent-supported software development has required a significant amount of hand-holding, maybe because I'm in an industry with limited externally available examples to build a model off of (though all of the specifications are public, I've yet to see an agent build an appropriate implementation).

So it's much more transactional...I ask, it does something (usually within seconds), I correct, it iterates again...

What sort of tasks are people putting these agents to? How are people running 'multiple' of these agents? What am I missing here?




My impression so far is that the parallel agent story is a fabrication of "ai influencers" and the labs themselves.

I might run 3-4 claude sessions because that's the only way to have "multiple chats" to e.g. ask unrelated things. Occasionally a task takes long enough to keep multiple sessions busy, but that's rather rare, and if it happens it's because the agent is running a long-running task like the whole test suite.

The story of running multiple agents to build full features in parallel... doesn't really add up in my experience. It kinda works for a bit if you have a green field project where the complexity is still extremely low.

However, once you have a feature interaction matrix that is larger than, say, 3x3, you have to hand-hold the system so it doesn't make stupid assumptions. Or you prompt very precisely, but this also takes time and prevents you from ever running into the parallel situation.

The feature interaction matrix size is my current proxy "pseudo-metric" for when agentic coding might work well and at which abstraction level.
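To make that pseudo-metric concrete: with n features there are roughly n*(n-1)/2 pairwise interactions to keep straight, so the hand-holding overhead grows quadratically. A toy illustration (feature names made up):

    # Rough illustration of the "feature interaction matrix" pseudo-metric:
    # with n features there are n*(n-1)/2 pairwise interactions to reason about.
    features = ["auth", "caching", "rate-limiting", "audit-log", "multi-tenancy", "export"]
    n = len(features)
    print(f"{n} features -> {n * (n - 1) // 2} pairwise interactions")  # 6 features -> 15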


This is exactly my experience as well. The feature interaction matrix is growing as models get better, and I tend to build "prompt library components" for each project which saves time on "you prompt very precisely but this also takes time".

But so far that doesn't change the reality - I can't find any opportunities to let an agent run for more than 30 minutes at best, and parallel agents just seem to confuse each other.


idk, I haven't really hit the point with any LLM where it comes up with useful abstractions on its own unless those abstractions were already in the training data.

E.g. imagine building a Google Docs clone where you have different formatting options. Claude would happily build bold and italic for you, but if afterwards you add headings, tables, colors, font size, etc., it would just produce a huge if/else tree instead of building a somewhat sensible text-formatting abstraction.

Tbf I wouldn't actually know how to build this myself, but e.g. bold and italic work together, while a "code block" thing should probably not work with font color, and putting a table inside one also makes no sense.

Claude might get some of these interactions intuitively correct but at some point you'll have so many NxM interactions between features that it just forgets half of them and then the experience becomes sloppy and crashes on all edge cases.

The point of good software engineering is to simplify the matrix to something you can keep reasoning about, e.g. classify formatting options into categories, and then you only have to argue and think about how those categories interact.
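As a rough sketch of what I mean by categories (the names and rules here are invented, not a real editor design): instead of branching on every feature pair, you declare which categories may nest and keep the per-feature exceptions small.

    # Sketch: compatibility by category instead of a per-feature if/else tree.
    # Categories, features and rules are invented for illustration.
    INLINE, BLOCK, CONTAINER = "inline", "block", "container"

    CATEGORY = {
        "bold": INLINE, "italic": INLINE, "font_color": INLINE,
        "heading": BLOCK, "code_block": BLOCK,
        "table": CONTAINER,
    }

    # Which categories may appear inside which: the only table you argue about.
    ALLOWED_INSIDE = {
        INLINE: {INLINE},            # inline marks combine with each other
        BLOCK: {INLINE},             # blocks contain inline marks
        CONTAINER: {INLINE, BLOCK},  # containers contain blocks and inline marks
    }

    # Per-feature exceptions stay explicit and small.
    EXCEPTIONS = {("code_block", "font_color")}

    def can_nest(outer: str, inner: str) -> bool:
        if (outer, inner) in EXCEPTIONS:
            return False
        return CATEGORY[inner] in ALLOWED_INSIDE[CATEGORY[outer]]

    assert can_nest("heading", "bold")
    assert not can_nest("code_block", "font_color")
    assert not can_nest("code_block", "table")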

This is the kind of thing LLMs just aren't really good at if the problem space isn't in the training data already, i.e. doing anything remotely novel. And I haven't seen this improve across releases either.

Maybe this kind of engineering will eventually be dead because claude can just brute force the infinitely growing if/else tree and keep it all in context but that does not seem very likely to me. So far we still have to think of these abstraction levels ourselves and then for the sub-problems I can apply agentic coding again.

Just need to make sure that Claude doesn't breach these abstractions, which it also happily does to take short cuts btw.


FWIW I’ve used LLMs to invent new things. Not super groundbreaking fundamental research, but they were able to use physics to design a device that didn’t exist yet, from first principles.

Would you share a bit more?

Pics or it didn't happen

More seriously, what in the world "novel" physics device did you invent?


I didn’t say “novel physics” or “physics device”.

Okay, so rereading it as pedantically as you seem to insist:

You "invented" ("Designed") a "device" "using physics", and nobody has designed that "device" before, making it novel.

"From first principles" is a fun statement because people like Aristotle also thought they were reasoning from "first principles" and look how far it got them. The entire point of science is that "first principles" are actually not something we have access to, so we should instead prioritize what literally happens and can be observed. It's not possible as far as we know to trick mother nature into giving us the answer we want rather than the real answer.

Did you ever actually build or test this "device"?


Same. The only situation where I've consistently gotten a system to run for 20+ minutes was a data-analysis task with tight guardrails and explicit multi-phase operations.

Outside that I'm juggling 2-3 sessions at most with nothing staying unattended for more than 10 minutes.


I might be able to shine a little light on this.

I came from embedded, where I wasn't able to use agents very effectively for anything other than quick round trip iterative stuff. They were still really useful, but I definitely could never envision just letting an agent run unattended.

But I recently switched domains into vaguely "fullstack web" using very popular frameworks. If I spend a good portion of my day going back and forth with an agent, working on a detailed implementation plan that spawns multiple agents, there is seemingly no limit* to the scope of the work they are able to accurately produce. This is because I'm reading through the whole plan and checking for silly gotchas and larger implementation mistakes before I let them run. It's also great because I can see how the work can be parallelized at certain parts but blocked at others, and how much can run in parallel at once.

Once I'm ready, I can usually let it start with not even the latest models, because the actual implementation is so straightforwardly prompted that it gets it close to perfectly right. I usually sit next to it and validate it while it's working, but I could easily imagine someone letting it run overnight to wake up to a fresh PR in the morning.

Don't get me wrong, it's still more work than just "vibing" the whole thing, but it's _so_ much more efficient than actually implementing it, especially when it's a lot of repetitive patterns and boilerplate.

* I think the limit is how much I can actually keep in my brain and spec out in a well-thought-out manner that doesn't let any corner cases through, which is still a limit, but not necessarily one coming from the agents. Once I have one document implemented, I can move on to the next with my own fresh mental context, which makes the work a lot easier.
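To make "parallelized at certain parts, but blocked at others" concrete, this is roughly how I think about grouping a plan into waves (task names here are hypothetical):

    # Sketch: group plan tasks into "waves" that can run in parallel.
    # Task names and dependencies are hypothetical.
    plan = {
        "db-migration": [],
        "api-endpoints": ["db-migration"],
        "frontend-forms": ["api-endpoints"],
        "frontend-list-view": ["api-endpoints"],
        "e2e-tests": ["frontend-forms", "frontend-list-view"],
    }

    done: set[str] = set()
    wave = 1
    while len(done) < len(plan):
        ready = [t for t, deps in plan.items()
                 if t not in done and all(d in done for d in deps)]
        if not ready:
            raise ValueError("cycle in the plan")
        print(f"wave {wave}: run in parallel -> {ready}")
        done.update(ready)
        wave += 1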


The amount of boilerplate people talk about seems like the fault of these big modern frameworks honestly. A good system design shouldn't HAVE so much boilerplate. Think people would be better off simplifying and eliminating it deterministically before reaching for the LLM slot machine.

I'm not so sure I agree. To me it's somewhat magical that I can write even this amount of code and have this stuff just magically work on pretty much every platform via docker, the web platform, etc. Maybe this again is me having started with embedded, but I am blown away at the ratio of actual code to portability we currently have.

> To me it's somewhat magical that I can write even this amount of code

It's because you're not writing it, you adopted the role of Project Manager or Chief Engineer. How much cognitive debt are you accumulating?


Interesting. What would you say is your ratio of "sit down and make the implementation" time to "multi-agent system builds the thing" time?

I had a few useful examples of this. In order to make it work you need to define your quality gates and a rather complex spec. I personally use https://github.com/probelabs/visor for creating the gates. A gate can be a code review, a check of how well the implementation aligns with the spec, etc., and basically it makes the agent loop until it passes.

One tip, especially when using Claude Code, is to explicitly ask it to create "tasks", and also to use subagents. For example, if I want to validate and restructure all my documentation, I would ask it to create a task to research the state of my docs, then a task per specific detail, then a task to re-validate quality after it has finished. You can also play around with gates using simpler tooling, for example https://probelabs.com/vow/
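The loop itself is nothing fancy, roughly this shape (the agent and gate commands below are placeholders, not Visor's actual CLI):

    # Rough shape of a gate loop: the agent keeps iterating until the quality
    # gate passes or we give up. Command names are placeholders.
    import subprocess

    def run_agent(prompt: str) -> None:
        subprocess.run(["my-agent", "--prompt", prompt], check=True)

    def gate_passes() -> bool:
        # e.g. a code-review gate, spec-alignment check, tests, linters...
        return subprocess.run(["my-gate-check"]).returncode == 0

    run_agent("Implement the task described in TASK.md")
    for attempt in range(5):                # cap the retries
        if gate_passes():
            print("gate passed")
            break
        run_agent("The quality gate failed; read its output and fix the issues.")
    else:
        print("gave up after 5 attempts; needs a human")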

Hope it helps!


> One tip, especially when using Claude Code, is to explicitly ask it to create "tasks", and also to use subagents. For example, if I want to validate and restructure all my documentation, I would ask it to create a task to research the state of my docs, then a task per specific detail, then a task to re-validate quality after it has finished.

This is definitely a way to keep those who wear Program and Project manager hats busy.


That is interesting. Never considered trying to throw one or two into a loop together to try to keep it honest. Appreciate the Visor recommendation, I'll give it a look and see if I can make this all 'make sense'.

Not a dev but doing some side projects.

As I build with agents, I frequently run into new issues that aren't in scope for the task I'm on and would cause context drift. I have the agent create a GitHub issue with a short problem description and keep going on the current task. In another terminal I spin up a new agent and just tell it "investigate GH issue 123" and it starts diving in, finds the root cause, and proposes a fix. Depending on what parts of the code the fix touches and what other agents I've got going, I can have 3-4 agents more or less independently closing out issues/creating PRs for review at a time. The agents log their work in a work log - what they did, what worked, what didn't, problems they encountered using tools - and about once a day I have an agent review the work log and update the AGENTS.md with lessons learned.
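Roughly the shape of what I kick off (assumes the `gh` CLI; the agent invocation and the per-issue checkouts are placeholders, adjust for whatever tool you use):

    # Sketch of the "one agent per GitHub issue" pattern.
    import json
    import subprocess

    issues = json.loads(subprocess.run(
        ["gh", "issue", "list", "--state", "open", "--json", "number,title"],
        capture_output=True, text=True, check=True,
    ).stdout)

    for issue in issues[:4]:  # keep it to a handful of parallel agents
        prompt = (
            f"Investigate GH issue {issue['number']} ({issue['title']}): find the "
            "root cause, propose a fix, open a PR, and append what you did and "
            "what didn't work to WORKLOG.md."
        )
        # each agent gets its own checkout so they don't collide
        subprocess.Popen(["claude", "-p", prompt], cwd=f"../checkout-{issue['number']}")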


What are you using as the environment for this? I'm running into similar issues; I can't really spin up a second agent because they would collide. Just a newly cloned repo?

With 5.3 Codex, the execplans skill and a well-specified implementation task, you can get a good couple of hours' work in a single turn. That's already in the scope of "set it up before bed and review it in the morning".

If you have a loop set up, e.g., using OpenClaw or a Ralph loop, you can stretch that out further.

I would suggest that when you get to that point you really want some kind of adversarial system set up, with code reviews (e.g., provided by CodeRabbit or Sourcery) and automation to feed that back into the coding agent.
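The simplest version of the loop is just re-invoking the agent with the same plan until it declares itself done (the agent command below is a placeholder):

    # Minimal "run it in a loop" sketch: same prompt every iteration,
    # stop on a done-marker or an iteration cap. Agent command is a placeholder.
    import pathlib
    import subprocess

    prompt = pathlib.Path("PLAN.md").read_text()

    for _ in range(50):
        subprocess.run(["my-agent", "--non-interactive", "--prompt", prompt], check=True)
        if pathlib.Path("DONE").exists():  # the plan tells the agent to create DONE when finished
            break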


> what are those agents actually doing?

Providing material for attention-grabbing headlines and blog posts, primarily. Can't (in good conscience, at least) claim you had an agent running all night if you didn't actually run an agent all night.


Maybe it's the programmer equivalent of rolling coal.

Fellow Midwesterner?

If you visualize it as AI agents throwing a rope to wrangle a problem, and then visualize a dozen of these agents throwing their ropes around a room, and at each other -- very quickly you'll also visualize the mess of code that a collection of agents creates without oversight. It might even run, and some might say that's the only true point, but... at what cost in code complexity, performance waste, cascading bugs, etc.?

Is it possible? Yes, I've had success with having a model output a 100-step plan that tried to deconflict work among multiple agents. Without re-creating 'Gas Town', I could not get the agents to operate without stepping on each other's toes. With _me_ as the grand coordinator, I was able to execute and replicate a SaaS product (at a surface level) in about 24 hrs. Output was around 100k lines of code (not counting CSS/JS).

Who can prove that it works correctly though? An AI enthusiast will say "as long as you've got test coverage blah blah blah". Those who have worked on large-scale products know that tests passing is basically the bare minimum. So you smoke test it, hope you've hit all the paths, and toss it up and try to collect money from people? I don't know. If _this_ is the future, it will collapse under the weight of garbage code, security and privacy breaches, and who knows what else.


I will give you an example I heard from an acquaintance yesterday - this person is very smart but not strictly “technical”.

He is building a trading automation for personal use. In his design he gets a message on whatsapp/signal/telegram and approves/rejects the trade suggestion.

To define specifications for this, he defined multiple agents (a quant, a data scientist, a principal engineer, and trading experts - "warren buffett", "ray dalio") and let the agents run until they reached a consensus on what the design should be. He said this ran for a couple of hours (so not strictly overnight) after he went to sleep; in the morning he read and amended the output (the equivalent of tens of pages) and let it build.

This is not a strictly-defined coding task, but there are now many examples of emerging patterns where you have multiple agents supporting each other, running tasks in parallel, correcting/criticising/challenging each other, until some definition of “done” has been satisfied.
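(The shape, as I understand it, is something like the sketch below; I haven't seen his actual setup, and ask_model() is a stand-in for whichever LLM API you use.)

    # Sketch of the "personas debate until consensus" pattern.
    def ask_model(prompt: str) -> str:
        raise NotImplementedError("plug in your LLM API call here")

    personas = ["quant", "data scientist", "principal engineer", "warren buffett", "ray dalio"]
    draft = "Initial design: approve/reject trade suggestions via a messaging app."

    for _ in range(10):  # bounded number of debate rounds
        objections = [ask_model(f"As the {p}, critique this design; reply OK if you have no objections:\n{draft}")
                      for p in personas]
        if all(o.strip() == "OK" for o in objections):
            break  # consensus reached
        draft = ask_model("Revise the design to address these objections:\n" + "\n".join(objections))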

That said, personally my usage is much like yours - I run agents one at a time and closely monitor output before proceeding, to avoid finding a clusterfuck of bad choices built on top of each other. So you are not alone my friend :-)


This is my experience of it too. Perhaps if it was chunking through a large task like upgrading all of our repos to the latest engine supported by our cloud provider, I could leave it overnight. Even then it would just result in a large daylight backlog of "not quite right" to review and redo.

I think that's the issue I have with using these tools so far (definitely professionally, but even in pet projects for embedded systems). The mental load of having to go back through and make sure all of the lines of code do what the agent claims they do, even with tests, is significantly more than it would take to just write the implementation myself.

I can see the utility in creating very simple web-based tools where there's a monstrous wealth of public resources to build a model off of, but even the most recent models from Anthropic, OpenAI, or MSFT still seem to land just short of right. And every time I find an error I'm left wondering what other bugs I'm not catching.


What I tell my kids is: You know how when you ask AI about something you know very well, how its answers are always somewhat wrong? It's like that for things you do not know very well too.

This is very dependent on what kind of work you're asking the agent to do. For software, I've had quite a bit of success providing detailed API specifications and asking an LLM to build a client library for that. You can leave it running unattended as long as it knows what it's supposed to build and it won't need a lot of correction since you're providing the routes, returned statuses and possible error messages.

Do some people just create complete SaaSlop apps with it overnight? Of course, just put together a plan (by asking the LLM to write the plan) with everything you want the app to do and let it run.
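To give an idea, the "spec" I hand over is often just a structured list of routes, statuses and error shapes, something like this (endpoints invented for illustration):

    # The kind of spec I hand the agent: routes, expected statuses, error shapes.
    # Endpoints and fields are invented for illustration.
    API_SPEC = [
        {
            "method": "GET", "path": "/v1/widgets/{id}",
            "returns": {200: "Widget", 404: {"error": "widget_not_found"}},
        },
        {
            "method": "POST", "path": "/v1/widgets",
            "body": {"name": "str", "size_mm": "int"},
            "returns": {201: "Widget", 422: {"error": "validation_failed", "fields": "list[str]"}},
        },
    ]
    # Prompt: "Generate a typed client with one method per route, raising a
    # typed exception for every non-2xx status listed above."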


> it won't need a lot of correction since you're providing the routes, returned statuses and possible error messages.

Wouldn’t it be better to set up API docs (Postman, RapidAPI, …), extract an OpenAPI spec from that, and then use a generator for your language of choice (NSwag, …)?


I have agents run at night to work through complicated TTRPG campaigns. For example, I have a script that runs all night simulating NPCs before a session. The NPCs have character sheets + motivations, and the LLMs do one prompt per NPC in stages so combat can happen after social interactions. If you run enough of these and the prompts are well written, you can save a lot of time. You can't, like... simulate the start of a campaign and then jump in. It's more like: you know there is a big event, you already have characters, you can throw them in a folder to see how things would cook all else being equal, and then use that to riff off of when you actually write your notes.

I think of my agents like golems from Discworld: they are defined by their script. Adding texture to them improves the results, so I usually keep a running tally of what they have worked on and add that to the header. They are a prompt in a folder that a script loops over and sends to Gemini (spawning an agent and moving to the next golem script).
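The nightly loop is basically just this (folder layout is mine; call_gemini() is a stand-in for however you call the model):

    # Sketch of the nightly "golem" loop: one prompt per NPC, run in stages,
    # results appended back into the NPC's folder as texture for the next stage.
    import pathlib

    def call_gemini(prompt: str) -> str:
        raise NotImplementedError("plug in your Gemini client here")

    STAGES = ["social interactions", "combat", "aftermath"]

    for stage in STAGES:
        for npc_dir in sorted(pathlib.Path("npcs").iterdir()):
            sheet = (npc_dir / "sheet.md").read_text()   # character sheet + motivations
            log = npc_dir / "log.md"
            history = log.read_text() if log.exists() else ""
            result = call_gemini(f"Stage: {stage}\n\n{sheet}\n\nSo far:\n{history}")
            with log.open("a") as f:
                f.write(f"\n## {stage}\n{result}\n")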

I also was curious to see if it could be used for developing some small games. Whenever I would run into a problem I couldn't be bothered to solve, or needed a variety of something, I would let a few LLMs work on it so in the morning I had something to bounce off. I had pretty good success with this for RTS games and shooting games, where variety is well documented and creativity is allowed. I imagine there could be a use here; I've been calling it dredging because I imagine myself casting a net down into the slop to find valuables.

I did have an idea where all my sites and UI would be checked against some UI heuristic like Oregon State's inclusivity heuristic but results have been mixed so far. The initial reports are fine, the implementation plans are ok but it seems like the loop of examine, fix, examine... has too much drift? That does seem solvable but I have a concern that this is like two lines that never touch but get closer as you approach infinity.

There is some usefulness in running these guys all night, but I'm still figuring out when it's useful and when it's a waste of resources.


Spin up a mid-sized Linux VM (or any machine with 8 or 12 cores, at least 16 GB RAM, and an NVMe drive will do). Add 10 users. Install Claude 10 times (one per user). Clone the repo 10 times (one per user). Have a centralized place to get tasks from (db, Trello, txt, etc.) - this is the memory. Have a cron wake up every 10 minutes and call your script. Your script calls Claude in non-interactive mode + auto accept. It grabs a new task, takes a crack at it and creates a pull request. That is 6 tasks per hour per user, times 12 hours. Go from there and refine the harnesses/skills/scripts that the Claudes can use.

In my case, I built a small API that Claude can call to get tasks. I update the tasks on my phone.

The assumption is that you have a semi-well-structured codebase already (ours is 1M LOC C#). You have to use languages with strong typing + a strict compiler. You have to force Claude to frequently build the code (hence the CPU cores + RAM + NVMe requirement).

If you have multiple machines doing work, have a single one act as the master and give Claude ssh access to the others so it can configure them and invoke work on them directly. The use case for this is when you have a beefy Proxmox server with many smaller containers (think .NET + Debian). Give the main server access to all the "worker servers". Let Claude document this infrastructure too and the different roles each machine plays. Soon you will have a small ranch of AIs doing different things, on different branches, making pull requests and putting feedback back into the task manager for you to upvote or downvote.
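The per-user cron script is roughly this shape (the task API endpoints are hypothetical, and the Claude invocation assumes its non-interactive mode; add your auto-accept flag of choice):

    # Sketch of the script cron runs every 10 minutes for one worker user.
    import json
    import subprocess
    import urllib.request

    TASK_API = "http://tasks.internal/api"   # hypothetical task queue

    with urllib.request.urlopen(f"{TASK_API}/next") as resp:
        task = json.load(resp)               # e.g. {"id": 42, "prompt": "..."} or null

    if task:
        prompt = (
            f"{task['prompt']}\n\n"
            "When done: build the solution, run the tests, commit on a new branch, "
            f"open a pull request, then report back to {TASK_API}/done/{task['id']}."
        )
        subprocess.run(["claude", "-p", prompt], cwd="/home/worker1/repo", check=True)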

Just try it. It works. Your mind will be blown what is possible.


So is this something you do with a monthly subscription or is this using API tokens?

At first we used Claude Max x5, but we are using the API now.

We only give it very targeted tasks, no broad strokes. We have a couple of "prompt" templates, which we select when creating tasks. The new Opus model one-shots about 90% of the tasks we throw at it. We're getting a ton of value from diagnostic tasks; it can troubleshoot really quickly (by ingesting logs, exceptions, some db rows).


Thanks. In your example, are you saying you had 10 Claude accounts, or were all 10 user accounts working within the allotment of a single Claude subscription? I've only ever dealt with the API, and it got way too expensive quickly for the quality I was getting back.

There has only been one instance of coding where I let the agent run for like 7 hours: to generate Playwright tests. Once the scaffolding is done, it is just a matter of writing a test for each component. But yeah, even for that I didn't just fire and forget.

I wrote a program to classify thousands of images, but that was using a model running on my gaming PC. Took about 3 days to classify them all. Only cost me the power, right?

Power, gaming rig, internet, somewhere to store the rig, probably pay property taxes too.

You can draw the line wherever you want. :) Personally, I wish I'd built a new gaming rig a year ago so I could mess with local models and pay all these same costs.


> what are those agents actually doing

Generating material for yet another retarded Twitter hype post.




