
I often feel these types of blogposts would be more helpful if they demonstrated someone using the tools to build something non-trivial.

Is Claude really "learning new skills" when you feed it a book, or does it just present it that way because your prompting encourages that sort of response behavior? I feel like you'd have to demo Claude with the new skills against Claude without them.

Maybe I'm a curmudgeon, but most of these types of blogs feel like marketing pieces. The important bit is that so much is left unsaid and not shown that they come off like a kid trying to hype up their own work without the benefit of nuance or depth.




> Important: there is a lot of human coding, too.

I'm not highlighting this to gloat or to prove a point. If anything in the past I have underestimated how big LLMs were going to be. Anyone so inclined can take the chance to point and laugh at how stupid and wrong that was. Done? Great.

I don't think I've been intentionally avoiding coding assistants; as a matter of fact, I have been using Claude Code since the literal day it first previewed. And yet it doesn't feel, not even one bit, like you can take your hands off the wheel. Many are acting as if writing any code manually means "you're holding it wrong", which I feel is just not true.


Yeah, my current opinion on this is that AI tools make development harder work. You can get big productivity boosts out of them but you have to be working at the top of your game - I often find I'm mentally exhausted after just a couple of hours.


My experience with AI tools is the opposite. The biggest energy thieves for me are configuration issues, library quirks, or trivial mistakes that are hard to spot. With AI I can often just bulldoze past those things and spend more time on tangible results.

When using it for code or architecture or design, I’m always watching for signs that it is going off the rails. Then I usually write code myself for a while, to keep the structure and key details of whatever I’m doing correct.


For me, LLMs always, without fail get important details wrong.

- incessantly duplicating already existing functionality: utility functions, UI components etc.

- skipping required parameters like passing current user/actor to DB-related functions

- completely ignoring large and small chunks of existing UI and UI-related functionality like layouts or existing styles

- using ad-hoc DB queries or even iterating over full datasets in memory instead of setting up proper DB queries

And so on and so forth.

YMMV of course, depending on language and project.


Sounds to me like you'd benefit from providing detailed instructions to LLMs about how they should avoid duplicating functionality (which means documenting the functionality they should be aware of), what kind of parameters are always required, setting up "proper DB queries" etc.

... which is exactly the kind of thing this new skills mechanism is designed to solve.
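For reference, a skill in this mechanism is essentially a folder containing a SKILL.md file that the agent loads on demand. A hedged sketch of what the complaints above could look like as one (the name, paths, and rules are hypothetical, not taken from any real project):

```markdown
---
name: codebase-conventions
description: Conventions for this repo. Consult before writing any new code.
---

# Codebase conventions

- Before adding a helper, search `src/utils/` and `src/components/` for an
  existing one; do not duplicate functionality.
- All DB-access functions take the current user/actor as a parameter.
- Use the query builders in the data layer; never iterate over full
  datasets in memory.
- Reuse existing layouts and styles; do not introduce ad-hoc CSS.
```

Whether the model then follows it reliably is, as discussed elsewhere in this thread, a separate question.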


> Sounds to me like you'd benefit from providing detailed instructions to LLMs about how they should avoid duplicating functionality

That they routinely ignore.

> which means documenting the functionality they should be aware of

Which means spending inordinate amounts of time writing down every single function, component, CSS rule, and style that could otherwise be easily discovered by just searching, or by looking at adjacent files.

> which is exactly the kind of thing this new skills mechanism is designed to solve.

I tried it yesterday. It immediately duplicated functionality, ignored existing styles and components, and created ad-hoc queries. It did feel like there were fewer times when it did that, but it's hard to quantify.


100%. It’s like managing an employee that always turns their work in 30 seconds later; you never get a break.

I also have to remember all of the new code that’s coming together, and keep it from re-inventing other parts of the codebase, etc.

More productive, but hard work.


I have a similar experience. It feels like riding your bike in a higher gear - you can go faster but it will take more effort and you need the potential (stronger legs) to make use of it


It's more like shifting from a normal to an electric bike.

You can go further and faster, but you can get to a point where you're out of juice miles from home, and getting back is a chuffing nightmare.

Also, you discover that you're putting on weight and not getting that same buzz you got on your old pushbike.


Hey, that's a great analogy, 10/10! This explains in a few words what an entire article might explain.


Considering the last 2 years, has it become harder or easier?


Definitely harder.

A year ago I was using GitHub Copilot autocomplete in VS Code and occasionally asking ChatGPT or Claude to help write me a short function or two.

Today I have Claude Code and Codex CLI and Codex Web running, often in parallel, hunting down and resolving bugs and proposing system designs and collaborating with me on detailed specs and then turning those specs into working code with passing tests.

The cognitive overhead today is far higher than it was a year ago.


Also better and faster though!! It's close to a Daft Punk type situation.


Copilot is the perfect name.


Woah, that's huge coming from you. This comment itself is worth an article. Do it. Call it "AI tools make development harder work".

P.S. I always thought you were one of those irrational AI bros. Later, I found that you were super reasonable. That's the way it should be. And thank you!


In fact, I've been writing more code myself since these tools exist - maybe I'm not a real developer but in the past I might have tried to either find a library online or try to find something on the internet to copypaste and adapt, nowadays I give it a shot myself with Claude.

For context, I mainly do game development so I'm viewing it through that lens - but I find it easier to debug something bad than to write it from scratch. It's more intensive than doing it yourself but probably more productive too.


> Many are acting as if writing any code manually means "you're holding it wrong", which I feel is just not true.

It's funny because not far below this comment there is someone doing literally this.


LLMs are autonomous driving level 2.


This was a fun read.

I’ve similarly been using spec.md and running to-do.md files that capture detailed descriptions of the problems and their scoped history. I mark each of my to-do’s with informational tags: [BUG], [FEAT], etc.

I point the LLM to the exact to-do (or section of to-do’s) with the spec.md in memory and let it work.

This has been working very well for me.


Do you mind linking to example spec/to-do files?


Sure thing. Here is an example set of the agent/spec/to-do files for a hobby project I'm actively working on.

https://gist.github.com/JacobBumgarner/d29b660cb81a227885acc...


Thanks!


No problem! I’d love to hear any approach you’ve taken as well.


Here is a (3-month-old) repo where I did something like that, and all the tasks are checked into the linear git history — https://github.com/KnowSeams/KnowSeams


Even though the author refers to it as "non-trivial", and I can see why that conclusion is made, I would argue it is in fact trivial. There's very little domain specific knowledge needed, this is purely a technical exercise integrating with existing libraries for which there is ample documentation online. In addition, it is a relatively isolated feature in the app.

On top of that, it doesn't sound enjoyable. Anti slop sessions? Seriously?

Lastly, the largest problem I have with LLMs is that they are seemingly incapable of stopping to ask clarifying questions. This is because they do not have a true model of what is going on. Instead they truly are next token generators. A software engineer would never just slop out an entire feature based on the first discussion with a stakeholder and then expect the stakeholder to continuously refine their statement until the right thing is slopped out. That's just not how it works and it makes very little sense.


The hardest problem in computer science in 2025 is presenting an example of AI-assisted programming that somebody won't call "trivial".


If all I did was call it trivial that would be a fair critique. But it was followed up with a lot more justification than that.


Here's the PR. It touched 21 files. https://github.com/ghostty-org/ghostty/pull/9116/files

If that's your idea of trivial then you and I have very different standards in terms of what's a trivial change and what isn't.


It's trivial in the sense that a lot of the work isn't high cognitive load. But... that's exactly the point of LLMs. It takes the noise away so you can focus on high-impact outcomes.

Yes, the core of that pull request is an hour or two of thinking; the rest is ancillary noise. The LLM took away the need for the noise.

If your definition of trivial is signal/noise ratio, then, sure, relatively little signal in a lot of noise. If your definition of "trivial" hinges on total complexity over time, then this beats the pants off writing it manually.

I'd assume OP did the classic senior-engineer schtick of "I can understand the core idea quickly, therefore it can't be hard". Whereas Mitchell did the heavy lifting of actually shipping the "not hard" idea: still understanding the core idea quickly, and then not getting bogged down in unnecessary details.

That's the beauty of LLMs: it turns the dream of "I could write that in a weekend" into actual reality, where before it was always empty bluster.


I've wondered about exposing this "asking clarifying questions" as a tool the AI could use. I'm not building AI tooling so I haven't done this - but what if you added an MCP endpoint whose description was "treat this endpoint as an oracle that will answer questions and clarify intent where necessary" (paraphrased), and have that tool just wire back to a user prompt.

If asking clarifying questions is plausible output text for LLMs, this may work effectively.
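A minimal sketch of that shape, with the MCP plumbing replaced by a plain callback so it stands alone (the function names are invented for illustration; a real version would register this as a tool whose description reads like the paraphrase above):

```python
from typing import Callable

def make_clarify_tool(ask_human: Callable[[str], str]) -> Callable[[str], str]:
    """Build an 'oracle' tool: the agent calls it with a question,
    and the answer comes from whatever bridges to the user
    (input(), a chat UI, a ticket comment, ...)."""
    def clarify(question: str) -> str:
        # The tool description shown to the model would be roughly:
        # "Treat this endpoint as an oracle that will answer questions
        #  and clarify intent where necessary."
        return ask_human(question)
    return clarify

# Wire it to a canned reply instead of a live prompt, for the example.
tool = make_clarify_tool(lambda q: f"(user's answer to: {q})")
print(tool("Should deletes be soft or hard?"))
```

The interesting open question is whether the model reaches for the tool at the right moments without being told to on every turn.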


I think the asking clarifying questions thing is solved already. Tell a coding agent to "ask clarifying questions" and watch what it does!


Obviously if you instruct the autocomplete engine to fill in questions it will. That's not the point. The LLM has no model of the problem it is trying to solve, nor does it attempt to understand the problem better. It is merely regurgitating. This can be extremely useful. But it is very limiting when it comes to using as an agent to write code.


You can work with the LLM to write down a model for the code (aka a design document) that it can then repeatedly ingest into the context before writing new code. That's what “plan mode” is for. The technique of maintaining a design document and a plan/progress document that get updated after each change seems to make a big difference in keeping the LLM on track. (Which makes sense… exactly the same thing works for human team members too.)
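In practice that pair of documents can be as simple as the following sketch (the file names and contents are purely illustrative, not a prescribed format):

```markdown
# design.md — ingested at the start of every session
Architecture: CLI tool; storage is a single SQLite file; no network calls.
Invariants: all timestamps UTC; every DB function takes the current actor.

# plan.md — updated after each change
- [x] Step 1: schema + migrations
- [ ] Step 2: import command   <- current; do not touch export yet
- [ ] Step 3: export command
```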


Every time I hear someone say something like this, I think of the pigeons in the Skinner box who developed quirky superstitious behavior when pellets were dispensed at random.


> that it can then repeatedly ingest into the context

1. Context isn't infinite

2. Both Claude and OpenAI get increasingly dumb after 30-50% of context had been filled


Not sure how that's relevant... I haven't seen many design documents of infinite size.


"Infinite" is a handy shortcut for "large enough".

Even the "million token context window" becomes useless once it's filled to 30-50% and the model starts "forgetting" useful things like existing components, utility functions, AGENTS.md instructions etc.

Even a junior programmer can search and remember instructions and parts of the codebase. All current AI tools have to be reminded to recreate the world from scratch every time, and promptly forget random parts of it.


I think at some point we will stop pretending we have real AI. We have a breakthrough in natural language processing but LLMs are much closer to Microsoft Word than something as fantastical as "AGI". We don't blame Microsoft Word for not having a model of what is being typed in. It would be great if Microsoft Word could model the world and just do all the work for us but it is a science fiction fantasy. To me, LLMs in practice are largely massively compute inefficient search engines plus really good language disambiguation. Useful, but we have actually made no progress at all towards "real" AI. This is especially obvious if you ditch "AI" and call it artificial understanding. We have nothing.


I've added "amcq means ask me clarifying questions" to my global Claude.md so I can spam "amcq" at various points in time, to great avail.


> A software engineer would never just slop out an entire feature based on the first discussion with a stakeholder and then expect the stakeholder to continuously refine their statement until the right thing is slopped out. That's just not how it works and it makes very little sense.

Didn’t you just describe Agile?


Who hurt you?

Sorry, couldn’t resist. Agile’s point was getting feedback during the process rather than after something is complete enough to be shipped, thus minimizing risk and avoiding wasted effort.

Instead people are splitting up major projects into tiny shippable features and calling that agile while missing the point.


I've never seen a working scrum/agile/sprint/whatever product/project management system and I'm convinced it's because I've just never seen an actual implementation of one.

"Splitting up major projects into tiny shippable features and calling that agile" feels like a much more accurate description of what I've experienced.

I wish I'd gotten to see the real thing(s) so I could at least have an informed opinion.


Yea, I think scrum etc is largely a failure in practice.

The manager for the only team I think actually checked all the agile boxes had a UI background, so she thought of mock-ups, backend, and polishing as different tasks and was constantly getting client feedback between each stage. That specific approach isn’t universal; the feedback as part of the process definitely should be, though.

What was a little surreal is the pace felt slow day to day but we were getting a lot done and it looked extremely polished while being essentially bug free at the end. An experienced team avoiding heavy processes, technical debt, and wasted effort goes a long way.


People misunderstand the system, I think. It's not holy writ, you take the parts of it that work for your team and ditch the rest. Iterate as you go.

The failure modes I've personally seen are an organization that isn't interested in cooperating, or a person running the show who is more interested in process than people. But I'd say those teams would struggle no matter what.


I put a lot of the responsibility for the PMing failures I've seen on the engineering side not caring to invest anything at all into the relationship.

Ultimately, I think it's up to the engineering side to do its best to leverage the process for better results, and I've seen very little of that (and it's of course always been the PM side's fault).

And you're right: use what works for you. I just haven't seen anything that felt like it actually worked. Maybe one problem is people iterating so fast/often they don't actually know why it's not working.


I've seen the real thing and it's pretty much splitting major projects into tiny shippable bits. Picking which bits and making it so they steadily add up to the desired outcomes is the hard part.


Agile’s point was to get feedback based on actual demoable functionality, and iterate on that. If you ignore the “slop” pejorative, in the context of LLMs, what I quoted seems to fit the intent of Agile.


There’s generally a big gap between the minimum you can demo and an actual feature.


If you want to use an LLM to generate a minimal demoable increment, you can. The comment I replied to mentioned "feature", but that's a choice based on how you direct the LLM. On the other hand, LLM capabilities may change the optimal workflow somewhat.

Either way, the ability to produce "working software" (as the manifesto puts it) in "frequent" iterations (often just seconds with an LLM!) and iterate on feedback is core to Agile.


Using LLMs for coding complex projects at scale over a long time is really challenging! This is partly because defining requirements alone is much more challenging than most people want to believe. LLMs accelerate any move in the wrong direction.


My analogy is LLMs are a gas pedal. Makes you go fast, but you still have to know when to turn.


True


Having the LLM write the spec/workunit from a conversation works well. Exploring a problem space with a (good) coding agent is fantastic.

However, for complex projects, IMO one must read what was written by the LLM … every actual word.

When it ‘got away’ from me, in each case I had left something in the LLM-written markdown that I should have removed.

99% “I can ask for that later” and 1% “that’s a good idea i hadn’t considered” might be the right ratio when reading an llm generated plan/spec/workunit.

Breaking work into single context passes (50-60k tokens in Sonnet 4.5) has typically given me fantastic results.

My side project uses Lean 4, and a carelessly left-in ‘validate’ rather than ‘verify’ led down a hilariously complicated path equivalent to matching an output against a known string.

I recovered, but it wasn’t obvious to me that it was happening. However, I would not be able to write Lean proofs myself, so diagnosing and fixing the problem is a small price to pay to be able to mechanically verify that part of my software is correct.


One should know the end-to-end design and architecture, and should stop the LLM when it starts adding complex, fancy things.


Agreed. The methodology needed here is something like an A/B test, with quantifiable metrics that demonstrate the effectiveness of the tool. And to do it not just once, but many times under different scenarios so that it demonstrates statistical significance.

The most challenging part when working with coding agents is that they seem to do well initially on a small code base with low complexity. Once the codebase gets bigger with lots of non-trivial connections and patterns, they almost always experience tunnel vision when asked to do anything non-trivial, leading to increased tech debt.
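A minimal sketch of what such a comparison could look like: a permutation test on per-task completion times for two groups, one with the tool and one without. All the numbers below are invented for illustration.

```python
import random

def permutation_test(a, b, n_iter=10_000, seed=0):
    """Two-sided p-value for the difference in means between samples a and b."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)  # re-split the pooled data at random
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
            hits += 1
    return hits / n_iter

# Invented per-task completion times in minutes, for illustration only.
with_tool = [41, 38, 45, 50, 39, 44, 42, 40]
without_tool = [55, 60, 52, 58, 61, 54, 57, 59]
p = permutation_test(with_tool, without_tool)
print(f"p = {p:.4f}")  # a small p means the gap is unlikely under the null
```

The statistics are the easy part; the expensive part is recruiting enough developers and controlling for task difficulty, which is exactly what most blog posts skip.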


The problem is that you're talking about a multistep process where each step beyond the first depends on the particular path the agent starts down, along with human input that's going to vary at each step.

I made a crude first stab at an approach that at least uses similar steps and structure to compare the effectiveness of AI agents. My approach was used on a small toy problem, but one that was complex enough the agents couldn't one-shot and required error correction.

It was enough to show significant differences, but scaling this to larger projects and multiple runs would be pretty difficult.

https://mattwigdahl.substack.com/p/claude-code-vs-codex-cli-...


What you're getting at is the heart of the problem with the LLM hype train though, isn't it?

"We should have rigorous evaluations of whether or not [thing] works." seems like an incredibly obvious thought.

But in the realm of LLM-enabled use cases they're also expensive. You'd need to recruit dozens, perhaps even hundreds of developers to do this, with extensive observation and rating of the results.

So rather than actually try to measure the efficacy, we just get blog posts with cherry-picked examples of "LLM does something cool". Everything is just anecdata.

This is also the biggest barrier to actual LLM adoption for many, many applications. The gap between "it does something REALLY IMPRESSIVE 40% of the time and shits the bed otherwise" and "production system" is a yawning chasm.


It's the heart of the problem with all software engineering research. That's why we have so little reliable knowledge.

It applies to using LLMs too. I guess the one big difference here is that LLMs have few enough companies, with abundant enough money, pushing them that it would be trivial for those companies to run a test like this. So the fact that they aren't doing that also says a lot.


> What you're getting at is the heart of the problem with the LLM hype train though, isn't it?

> "We should have rigorous evaluations of whether or not [thing] works." seems like an incredibly obvious thought.

Heh, I'd rephrase the first part to:

> What you're getting at is the heart of the problem with software development though, isn't it?


The UK government ran a study with thousands of developers quite recently: https://www.gov.uk/government/publications/ai-coding-assista...


I don't necessarily think the conclusions are wrong, but this relies entirely on self-reported survey results to measure productivity gains. That's too easy to poke holes in, and I think studies like this are unlikely to convince real skeptics in the near term.


At this point it's becoming clear from threads similar to this one that quite a lot of the skeptics are actively working not to be convinced by anything.


Do you have a study to back that up? /s

I agree. I think there are too many resources, examples, and live streams out there for someone to credibly claim at this point that these tools have no value and are all hype. I think the nuance is in how and where you apply it, what your expectations and tolerances are, and what your working style is. They are bad at many things, but there is tremendous value to be discovered. The loudest people on both sides of this debate are typically wrong in similar ways imo.


I am not a software engineer, but I am using my own vibe-coded video FX software, my own vibe-coded audio synth, my own vibe-coded art generator. These aren't software products, though. No one else is ever going to use them. The output is what matters to me.

Even I can see that committing LLM-generated code at your software job is completely insane. The only way to get a productivity increase is to not bother understanding what the program is doing. If you need to understand what is going on, then why not just type it in yourself?

My productivity increase is immeasurable because I wouldn't be able to write this video player I made. I have absolutely no idea how it works. That is exactly why I am not a software engineer. Professionals claiming a productivity boost have to be doing something along the lines of not understanding what the program is doing, in proportion to the claimed productivity increase. I don't see how you can have it both ways unless someone is just that slow of a typist.


Woah, finally something with actual metrics instead of vibes!

> Trial participants saved an average of 56 minutes a working day when using AICAs

That feels accurate to me, but again I'm just going on vibes :P


Before you get into the expensive part, how do you get past "non-deterministic blackbox with unknown layers in between imposed by vendors"?


You can measure probabilistic systems that you can't examine! I don't want to throw the baby out with the bathwater here - before LLMs became the all-encompassing elephant in the room we did this routinely.

You absolutely can quantify the results of a chaotic black box, in the same way you can quantify the bias of a loaded die without examining its molecular structure.
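A sketch of the die example, with a simulated die standing in for the vendor's black box (the bias and sample size are invented for illustration):

```python
import random
from collections import Counter

# Simulate the black box: a die secretly biased toward six.
rng = random.Random(42)
weights = [1, 1, 1, 1, 1, 3]  # hidden from the "measurer"
rolls = rng.choices(range(1, 7), weights=weights, k=60_000)

# Quantify it from outputs alone, the same way you'd estimate an
# LLM's per-task pass rate: sample, count, estimate.
counts = Counter(rolls)
estimates = {face: counts[face] / len(rolls) for face in range(1, 7)}
for face in sorted(estimates):
    print(f"P({face}) ~ {estimates[face]:.3f}")
# With this bias, the estimate for six lands near 3/8 = 0.375, not 1/6.
```

Non-determinism just means you need enough samples for the estimates to stabilize, not that measurement is impossible.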


> The methodology needed here is something like an A/B test, with quantifiable metrics that demonstrate the effectiveness of the tool. And to do it not just once, but many times under different scenarios so that it demonstrates statistical significance.

If that's what we need to do, don't we already have the answer to the question?


> "Maybe I'm a curmudgeon but most of these types of blogs feel like marketing pieces with the important bit is that so much is left unsaid and not shown, that it comes off like a kid trying to hype up their own work without the benefit of nuance or depth."

C'mon, such self-congratulatory "Look at My Potency: How I'm using Nicknack.exe" fluffies always were and always will be a staple of the IT industry.


Still, the best such pieces are detailed and explanatory.


Why not just use claude code and come to your own conclusion?


Yeah, I was reading this to see if there was something he'd actually show that would be useful, what pain point he is solving, but it's just slop.



