
I often feel these types of blogposts would be more helpful if they demonstrated someone using the tools to build something non-trivial.

Is Claude really "learning new skills" when you feed it a book, or does it just present it that way because your prompting encourages that sort of response behavior? I feel like you'd have to demo Claude with the new skills against Claude without them.

Maybe I'm a curmudgeon, but most of these types of blogs feel like marketing pieces. The important bit is that so much is left unsaid and not shown that they come off like a kid trying to hype up their own work without the benefit of nuance or depth.




> Important: there is a lot of human coding, too.

I'm not highlighting this to gloat or to prove a point. If anything in the past I have underestimated how big LLMs were going to be. Anyone so inclined can take the chance to point and laugh at how stupid and wrong that was. Done? Great.

I don't think I've been intentionally avoiding coding assistants; as a matter of fact, I have been using Claude Code since the literal day it first previewed. And yet it doesn't feel, not even one bit, like you can take your hands off the wheel. Many are acting as if writing any code manually means "you're holding it wrong", which I feel is just not true.


Yeah, my current opinion on this is that AI tools make development harder work. You can get big productivity boosts out of them but you have to be working at the top of your game - I often find I'm mentally exhausted after just a couple of hours.


My experience with AI tools is the opposite. The biggest energy thieves for me are configuration issues, library quirks, or trivial mistakes that are hard to spot. With AI I can often just bulldoze past those things and spend more time on tangible results.

When using it for code or architecture or design, I’m always watching for signs that it is going off the rails. Then I usually write code myself for a while, to keep the structure and key details of whatever I’m doing correct.


For me, LLMs always, without fail get important details wrong.

- incessantly duplicating already existing functionality: utility functions, UI components etc.

- skipping required parameters like passing current user/actor to DB-related functions

- completely ignoring large and small chunks of existing UI and UI-related functionality like layouts or existing styles

- using ad-hoc DB queries or even iterating over full datasets in memory instead of setting up proper DB queries

And so on and so forth.

YMMV of course, depending on language and project.


Sounds to me like you'd benefit from providing detailed instructions to LLMs about how they should avoid duplicating functionality (which means documenting the functionality they should be aware of), what kind of parameters are always required, setting up "proper DB queries" etc.

... which is exactly the kind of thing this new skills mechanism is designed to solve.
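For reference, a skill in this mechanism is essentially a folder containing a SKILL.md file that the agent loads on demand. A hedged sketch of what the complaints above could look like as one (the name, paths, and rules are hypothetical, not taken from any real project):

```markdown
---
name: codebase-conventions
description: Conventions for this repo. Consult before writing any new code.
---

# Codebase conventions

- Before adding a helper, search `src/utils/` and `src/components/` for an
  existing one; do not duplicate functionality.
- All DB-access functions take the current user/actor as a parameter.
- Use the query builders in the data layer; never iterate over full
  datasets in memory.
- Reuse existing layouts and styles; do not introduce ad-hoc CSS.
```

Whether the model then follows it reliably is, as discussed elsewhere in this thread, a separate question.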


> Sounds to me like you'd benefit from providing detailed instructions to LLMs about how they should avoid duplicating functionality

That they routinely ignore.

> which means documenting the functionality they should be aware of

Which means spending inordinate amounts of time writing down every single function, component, CSS rule, and style that could otherwise be easily discovered by just searching, or by looking at adjacent files.

> which is exactly the kind of thing this new skills mechanism is designed to solve.

I tried it yesterday. It immediately duplicated functionality, ignored existing styles and components, and created ad-hoc queries. It did feel like there were fewer times when it did that, but it's hard to quantify.


100%. It’s like managing an employee that always turns their work in 30 seconds later; you never get a break.

I also have to remember all of the new code that’s coming together, and keep it from re-inventing other parts of the codebase, etc.

More productive, but hard work.


I have a similar experience. It feels like riding your bike in a higher gear - you can go faster but it will take more effort and you need the potential (stronger legs) to make use of it


It's more like shifting from a normal to an electric bike.

You can go further and faster, but you can get to a point where you're out of juice miles from home, and getting back is a chuffing nightmare.

Also, you discover that you're putting on weight and not getting that same buzz you got on your old pushbike.


Hey, that's a great analogy, 10/10! This explains in a few words what an entire article might explain.


Considering the last 2 years, has it become harder or easier?


Definitely harder.

A year ago I was using GitHub Copilot autocomplete in VS Code and occasionally asking ChatGPT or Claude to help write me a short function or two.

Today I have Claude Code and Codex CLI and Codex Web running, often in parallel, hunting down and resolving bugs and proposing system designs and collaborating with me on detailed specs and then turning those specs into working code with passing tests.

The cognitive overhead today is far higher than it was a year ago.


Also better and faster though!! It's close to a Daft Punk type situation.


Copilot is the perfect name.


Woah, that's huge coming from you. This comment itself is worth an article. Do it. Call it "AI tools make development harder work".

P.S. I always thought you were one of those irrational AI bros. Later, I found that you were super reasonable. That's the way it should be. And thank you!


In fact, I've been writing more code myself since these tools exist - maybe I'm not a real developer but in the past I might have tried to either find a library online or try to find something on the internet to copypaste and adapt, nowadays I give it a shot myself with Claude.

For context, I mainly do game development so I'm viewing it through that lens - but I find it easier to debug something bad than to write it from scratch. It's more intensive than doing it yourself but probably more productive too.


> Many are acting as if writing any code manually means "you're holding it wrong", which I feel is just not true.

It's funny because not far below this comment there is someone doing literally this.


LLMs are autonomous driving level 2.


This was a fun read.

I’ve similarly been using spec.md and running to-do.md files that capture detailed descriptions of the problems and their scoped history. I mark each of my to-do’s with informational tags: [BUG], [FEAT], etc.

I point the LLM to the exact to-do (or section of to-do’s) with the spec.md in memory and let it work.

This has been working very well for me.


Do you mind linking to example spec/to-do files?


Sure thing. Here is an example set of the agent/spec/to-do files for a hobby project I'm actively working on.

https://gist.github.com/JacobBumgarner/d29b660cb81a227885acc...


Thanks!


No problem! I’d love to hear any approach you’ve taken as well.


Here is a (3-month-old) repo where I did something like that, and all the tasks are checked into the linear git history — https://github.com/KnowSeams/KnowSeams


Even though the author refers to it as "non-trivial", and I can see why that conclusion is made, I would argue it is in fact trivial. There's very little domain specific knowledge needed, this is purely a technical exercise integrating with existing libraries for which there is ample documentation online. In addition, it is a relatively isolated feature in the app.

On top of that, it doesn't sound enjoyable. Anti slop sessions? Seriously?

Lastly, the largest problem I have with LLMs is that they are seemingly incapable of stopping to ask clarifying questions. This is because they do not have a true model of what is going on. Instead they truly are next token generators. A software engineer would never just slop out an entire feature based on the first discussion with a stakeholder and then expect the stakeholder to continuously refine their statement until the right thing is slopped out. That's just not how it works and it makes very little sense.


The hardest problem in computer science in 2025 is presenting an example of AI-assisted programming that somebody won't call "trivial".


If all I did was call it trivial that would be a fair critique. But it was followed up with a lot more justification than that.


Here's the PR. It touched 21 files. https://github.com/ghostty-org/ghostty/pull/9116/files

If that's your idea of trivial then you and I have very different standards in terms of what's a trivial change and what isn't.


It's trivial in the sense that a lot of the work isn't high cognitive load. But... that's exactly the point of LLMs. It takes the noise away so you can focus on high-impact outcomes.

Yes, the core of that pull request is an hour or two of thinking; the rest is ancillary noise. The LLM took away the need for the noise.

If your definition of trivial is signal/noise ratio, then, sure, relatively little signal in a lot of noise. If your definition of "trivial" hinges on total complexity over time, then this beats the pants off writing it manually.

I'd assume OP did the classic senior-engineer schtick of "I can understand the core idea quickly, therefore it can't be hard". Whereas Mitchell did the heavy lifting of actually shipping the "not hard" idea: still understanding the core idea quickly, and then not getting bogged down in unnecessary details.

That's the beauty of LLMs: it turns the dream of "I could write that in a weekend" into actual reality, where before it was always empty bluster.


I've wondered about exposing this "asking clarifying questions" as a tool the AI could use. I'm not building AI tooling so I haven't done this - but what if you added an MCP endpoint whose description was "treat this endpoint as an oracle that will answer questions and clarify intent where necessary" (paraphrased), and have that tool just wire back to a user prompt.

If asking clarifying questions is plausible output text for LLMs, this may work effectively.
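A minimal sketch of that shape, with the MCP plumbing replaced by a plain callback so it stands alone (the function names are invented for illustration; a real version would register this as a tool whose description reads like the paraphrase above):

```python
from typing import Callable

def make_clarify_tool(ask_human: Callable[[str], str]) -> Callable[[str], str]:
    """Build an 'oracle' tool: the agent calls it with a question,
    and the answer comes from whatever bridges to the user
    (input(), a chat UI, a ticket comment, ...)."""
    def clarify(question: str) -> str:
        # The tool description shown to the model would be roughly:
        # "Treat this endpoint as an oracle that will answer questions
        #  and clarify intent where necessary."
        return ask_human(question)
    return clarify

# Wire it to a canned reply instead of a live prompt, for the example.
tool = make_clarify_tool(lambda q: f"(user's answer to: {q})")
print(tool("Should deletes be soft or hard?"))
```

The interesting open question is whether the model reaches for the tool at the right moments without being told to on every turn.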


I think the asking clarifying questions thing is solved already. Tell a coding agent to "ask clarifying questions" and watch what it does!


Obviously if you instruct the autocomplete engine to fill in questions it will. That's not the point. The LLM has no model of the problem it is trying to solve, nor does it attempt to understand the problem better. It is merely regurgitating. This can be extremely useful. But it is very limiting when it comes to using as an agent to write code.


You can work with the LLM to write down a model for the code (aka a design document) that it can then repeatedly ingest into the context before writing new code. That's what “plan mode” is for. The technique of maintaining a design document and a plan/progress document that get updated after each change seems to make a big difference in keeping the LLM on track. (Which makes sense… exactly the same thing works for human team members too.)
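In practice that pair of documents can be as simple as the following sketch (the file names and contents are purely illustrative, not a prescribed format):

```markdown
# design.md — ingested at the start of every session
Architecture: CLI tool; storage is a single SQLite file; no network calls.
Invariants: all timestamps UTC; every DB function takes the current actor.

# plan.md — updated after each change
- [x] Step 1: schema + migrations
- [ ] Step 2: import command   <- current; do not touch export yet
- [ ] Step 3: export command
```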


Every time I hear someone say something like this, I think of the pigeons in the Skinner box who developed quirky superstitious behavior when pellets were dispensed at random.


> that it can then repeatedly ingest into the context

1. Context isn't infinite

2. Both Claude and OpenAI get increasingly dumb after 30-50% of context had been filled


Not sure how that's relevant... I haven't seen many design documents of infinite size.


"Infinite" is a handy shortcut for "large enough".

Even the "million token context window" becomes useless once it's filled to 30-50% and the model starts "forgetting" useful things like existing components, utility functions, AGENTS.md instructions etc.

Even a junior programmer can search and remember instructions and parts of the codebase. All current AI tools have to be reminded to recreate the world from scratch every time, and promptly forget random parts of it.


I think at some point we will stop pretending we have real AI. We have a breakthrough in natural language processing but LLMs are much closer to Microsoft Word than something as fantastical as "AGI". We don't blame Microsoft Word for not having a model of what is being typed in. It would be great if Microsoft Word could model the world and just do all the work for us but it is a science fiction fantasy. To me, LLMs in practice are largely massively compute inefficient search engines plus really good language disambiguation. Useful, but we have actually made no progress at all towards "real" AI. This is especially obvious if you ditch "AI" and call it artificial understanding. We have nothing.


I've added "amcq means ask me clarifying questions" to my global Claude.md so I can spam "amcq" at various points in time, to great avail.


> A software engineer would never just slop out an entire feature based on the first discussion with a stakeholder and then expect the stakeholder to continuously refine their statement until the right thing is slopped out. That's just not how it works and it makes very little sense.

Didn’t you just describe Agile?


Who hurt you?

Sorry, couldn’t resist. Agile’s point was getting feedback during the process rather than after something is complete enough to be shipped, thus minimizing risk and avoiding wasted effort.

Instead people are splitting up major projects into tiny shippable features and calling that agile while missing the point.


I've never seen a working scrum/agile/sprint/whatever product/project management system and I'm convinced it's because I've just never seen an actual implementation of one.

"Splitting up major projects into tiny shippable features and calling that agile" feels like a much more accurate description of what I've experienced.

I wish I'd gotten to see the real thing(s) so I could at least have an informed opinion.


Yea, I think scrum etc is largely a failure in practice.

The manager for the only team I think actually checked all the agile boxes had a UI background, so she thought of mock-ups, backend, and polishing as different tasks and was constantly getting client feedback between each stage. That specific approach isn’t universal; the feedback as part of the process definitely should be, though.

What was a little surreal is the pace felt slow day to day but we were getting a lot done and it looked extremely polished while being essentially bug free at the end. An experienced team avoiding heavy processes, technical debt, and wasted effort goes a long way.


People misunderstand the system, I think. It's not holy writ, you take the parts of it that work for your team and ditch the rest. Iterate as you go.

The failure modes I've personally seen are an organization that isn't interested in cooperating, or a person running the show who is more interested in process than people. But I'd say those teams would struggle no matter what.


I put a lot of the responsibility for the PMing failures I've seen on the engineering side not caring to invest anything at all into the relationship.

Ultimately, I think it's up to the engineering side to do its best to leverage the process for better results, and I've seen very little of that (and it's of course always been the PM side's fault).

And you're right: use what works for you. I just haven't seen anything that felt like it actually worked. Maybe one problem is people iterating so fast/often they don't actually know why it's not working.


I've seen the real thing and it's pretty much splitting major projects into tiny shippable bits. Picking which bits and making it so they steadily add up to the desired outcomes is the hard part.


Agile’s point was to get feedback based on actual demoable functionality, and iterate on that. If you ignore the “slop” pejorative, in the context of LLMs, what I quoted seems to fit the intent of Agile.


There’s generally a big gap between the minimum you can demo and an actual feature.


If you want to use an LLM to generate a minimal demoable increment, you can. The comment I replied to mentioned "feature", but that's a choice based on how you direct the LLM. On the other hand, LLM capabilities may change the optimal workflow somewhat.

Either way, the ability to produce "working software" (as the manifesto puts it) in "frequent" iterations (often just seconds with an LLM!) and iterate on feedback is core to Agile.


Using LLMs for coding complex projects at scale over a long time is really challenging! This is partly because defining requirements alone is much more challenging than most people want to believe. LLMs accelerate any move in the wrong direction.


My analogy is LLMs are a gas pedal. Makes you go fast, but you still have to know when to turn.


True


Having the LLM write the spec/workunit from a conversation works well. Exploring a problem space with a (good) coding agent is fantastic.

However, for complex projects, IMO one must read what was written by the LLM … every actual word.

When it ‘got away’ from me, in each case I had left something in the LLM-written markdown that I should have removed.

99% “I can ask for that later” and 1% “that’s a good idea i hadn’t considered” might be the right ratio when reading an llm generated plan/spec/workunit.

Breaking work into single context passes (50-60k tokens in Sonnet 4.5) has typically given me fantastic results.

My side project uses Lean 4, and a carelessly left-in ‘validate’ rather than ‘verify’ led down a hilariously complicated path equivalent to matching an output against a known string.

I recovered, but it wasn’t obvious to me that it was happening. However, I would not be able to write Lean proofs myself, so diagnosing and fixing the problem is a small price to pay to be able to mechanically verify that part of my software is correct.


One should know the end-to-end design and architecture, and should stop the LLM when it starts adding complex, fancy things.


Agreed. The methodology needed here is something like an A/B test, with quantifiable metrics that demonstrate the effectiveness of the tool. And to do it not just once, but many times under different scenarios so that it demonstrates statistical significance.

The most challenging part when working with coding agents is that they seem to do well initially on a small code base with low complexity. Once the codebase gets bigger with lots of non-trivial connections and patterns, they almost always experience tunnel vision when asked to do anything non-trivial, leading to increased tech debt.
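A minimal sketch of what such a comparison could look like: a permutation test on per-task completion times for two groups, one with the tool and one without. All the numbers below are invented for illustration.

```python
import random

def permutation_test(a, b, n_iter=10_000, seed=0):
    """Two-sided p-value for the difference in means between samples a and b."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)  # re-split the pooled data at random
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
            hits += 1
    return hits / n_iter

# Invented per-task completion times in minutes, for illustration only.
with_tool = [41, 38, 45, 50, 39, 44, 42, 40]
without_tool = [55, 60, 52, 58, 61, 54, 57, 59]
p = permutation_test(with_tool, without_tool)
print(f"p = {p:.4f}")  # a small p means the gap is unlikely under the null
```

The statistics are the easy part; the expensive part is recruiting enough developers and controlling for task difficulty, which is exactly what most blog posts skip.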


The problem is that you're talking about a multistep process where each step beyond the first depends on the particular path the agent starts down, along with human input that's going to vary at each step.

I made a crude first stab at an approach that at least uses similar steps and structure to compare the effectiveness of AI agents. My approach was used on a small toy problem, but one that was complex enough the agents couldn't one-shot and required error correction.

It was enough to show significant differences, but scaling this to larger projects and multiple runs would be pretty difficult.

https://mattwigdahl.substack.com/p/claude-code-vs-codex-cli-...


What you're getting at is the heart of the problem with the LLM hype train though, isn't it?

"We should have rigorous evaluations of whether or not [thing] works." seems like an incredibly obvious thought.

But in the realm of LLM-enabled use cases they're also expensive. You'd need to recruit dozens, perhaps even hundreds of developers to do this, with extensive observation and rating of the results.

So rather than actually try to measure the efficacy, we just get blog posts with cherry-picked examples of "LLM does something cool". Everything is just anecdata.

This is also the biggest barrier to actual LLM adoption for many, many applications. The gap between "it does something REALLY IMPRESSIVE 40% of the time and shits the bed otherwise" and "production system" is a yawning chasm.


It's the heart of the problem with all software engineering research. That's why we have so little reliable knowledge.

It applies to using LLMs too. I guess the one big difference here is that LLMs have few enough companies, with abundant enough money, pushing them that it would be trivial for those companies to run a test like this. So the fact that they aren't doing that also says a lot.


> What you're getting at is the heart of the problem with the LLM hype train though, isn't it?

> "We should have rigorous evaluations of whether or not [thing] works." seems like an incredibly obvious thought.

Heh, I'd rephrase the first part to:

> What you're getting at is the heart of the problem with software development though, isn't it?


The UK government ran a study with thousands of developers quite recently: https://www.gov.uk/government/publications/ai-coding-assista...


I don't necessarily think the conclusions are wrong, but this relies entirely on self-reported survey results to measure productivity gains. That's too easy to poke holes in, and I think studies like this are unlikely to convince real skeptics in the near term.


At this point it's becoming clear from threads similar to this one that quite a lot of the skeptics are actively working not to be convinced by anything.


Do you have a study to back that up? /s

I agree. I think there are too many resources, examples, and live streams out there for someone to credibly claim at this point that these tools have no value and are all hype. I think the nuance is in how and where you apply it, what your expectations and tolerances are, and what your working style is. They are bad at many things, but there is tremendous value to be discovered. The loudest people on both sides of this debate are typically wrong in similar ways imo.


I am not a software engineer, but I am using my own vibe-coded video FX software, my own vibe-coded audio synth, my own vibe-coded art generator. These aren't software products, though. No one else is ever going to use them. The output is what matters to me.

Even I can see that committing LLM-generated code at your software job is completely insane. The only way to get a productivity increase is to not bother understanding what the program is doing. If you need to understand what is going on, then why not just type it in yourself?

My productivity increase is immeasurable because I wouldn't be able to write this video player I made. I have absolutely no idea how it works. That is exactly why I am not a software engineer. Professionals claiming a productivity boost have to be doing something along the lines of not understanding what the program is doing, in proportion to the claimed productivity increase. I don't see how you can have it both ways unless someone is just that slow of a typist.


Woah, finally something with actual metrics instead of vibes!

> Trial participants saved an average of 56 minutes a working day when using AICAs

That feels accurate to me, but again I'm just going on vibes :P


Before you get into the expensive part, how do you get past "non-deterministic blackbox with unknown layers in between imposed by vendors"?


You can measure probabilistic systems that you can't examine! I don't want to throw the baby out with the bathwater here - before LLMs became the all-encompassing elephant in the room we did this routinely.

You absolutely can quantify the results of a chaotic black box, in the same way you can quantify the bias of a loaded die without examining its molecular structure.
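A sketch of the die example, with a simulated die standing in for the vendor's black box (the bias and sample size are invented for illustration):

```python
import random
from collections import Counter

# Simulate the black box: a die secretly biased toward six.
rng = random.Random(42)
weights = [1, 1, 1, 1, 1, 3]  # hidden from the "measurer"
rolls = rng.choices(range(1, 7), weights=weights, k=60_000)

# Quantify it from outputs alone, the same way you'd estimate an
# LLM's per-task pass rate: sample, count, estimate.
counts = Counter(rolls)
estimates = {face: counts[face] / len(rolls) for face in range(1, 7)}
for face in sorted(estimates):
    print(f"P({face}) ~ {estimates[face]:.3f}")
# With this bias, the estimate for six lands near 3/8 = 0.375, not 1/6.
```

Non-determinism just means you need enough samples for the estimates to stabilize, not that measurement is impossible.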


> The methodology needed here is something like an A/B test, with quantifiable metrics that demonstrate the effectiveness of the tool. And to do it not just once, but many times under different scenarios so that it demonstrates statistical significance.

If that's what we need to do, don't we already have the answer to the question?


> "Maybe I'm a curmudgeon but most of these types of blogs feel like marketing pieces with the important bit is that so much is left unsaid and not shown, that it comes off like a kid trying to hype up their own work without the benefit of nuance or depth."

C'mon, such self-congratulatory "Look at My Potency: How I'm using Nicknack.exe" fluffies always were and always will be a staple of the IT industry.


Still, the best such pieces are detailed and explanatory.


Why not just use claude code and come to your own conclusion?


Yeah, I was reading this to see if there was something he'd actually show that would be useful, what pain point he is solving, but it's just slop.



