undefined | Better HN

idlewords11mo ago

An exponential curve looks locally the same at all points in time. For a very long period of time, computers were always vastly better than they were a year ago, and that wasn't because the computer you'd bought the year before was junk.

Consider that what you're reacting to is a symptom of genuine, rapid progress.

godelski11mo ago

> An exponential curve looks locally the same at all points in time

This is true for any smooth curve...

If your curve is differentiable, it is locally linear.

There's no use in talking about the curve being locally similar without the context of your window. Without the window you can't differentiate an exponential from a sigmoid from a linear function.

Let's be careful with naive approximations. We don't know which direction things are going and we definitely shouldn't assume "best case scenario"
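To make the window point concrete, here is a minimal Python sketch. The growth rate and the logistic ceiling are arbitrary assumptions chosen for illustration, not measurements of anything. It compares an exponential with a sigmoid (logistic) curve that starts at the same value, over a short window and a long one.

```python
import math


def exponential(t, rate=1.0):
    """Plain exponential growth, equal to 1 at t = 0."""
    return math.exp(rate * t)


def logistic(t, rate=1.0, ceiling=1000.0):
    """Logistic (sigmoid) curve, equal to 1 at t = 0, saturating at `ceiling`.

    `ceiling` is a made-up value for illustration only.
    """
    return ceiling / (1.0 + (ceiling - 1.0) * math.exp(-rate * t))


def max_relative_gap(t_start, t_end, steps=50):
    """Largest relative difference between the two curves inside the window."""
    gap = 0.0
    for i in range(steps + 1):
        t = t_start + (t_end - t_start) * i / steps
        e, s = exponential(t), logistic(t)
        gap = max(gap, abs(e - s) / e)
    return gap


early = max_relative_gap(0.0, 2.0)   # short window near the start
late = max_relative_gap(0.0, 10.0)   # window long enough to reach saturation

print(f"max relative gap on [0, 2]:  {early:.4f}")
print(f"max relative gap on [0, 10]: {late:.4f}")
```

Under these assumptions the two curves differ by well under 1% on the short window and by more than 90% on the long one, which is the sense in which the window decides whether you can tell them apart.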

Retr0id11mo ago

I don't think anyone's contesting that LLMs are better now than they were previously.

pera11mo ago

A flatline also looks locally the same at all points in time.

simonw11mo ago

tptacek wasn't making this argument six months ago.

LLMs get better over time. In doing so they occasionally hit points where things that didn't work start working. "Agentic" coding tools that run commands in a loop hit that point within the past six months.
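For anyone unsure what "run commands in a loop" means in practice, here is a rough Python sketch of the general shape. Everything in it is hypothetical: `scripted_model` stands in for a real LLM call, and the `run_tests` and `edit` tools are toy functions, not the interface of any actual coding agent.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str]           # (action name, observation or argument)
Tool = Callable[[str], str]      # a tool takes an argument and returns text


def run_agent(model: Callable[[List[Step]], Step],
              tools: Dict[str, Tool],
              task: str,
              max_steps: int = 10) -> List[Step]:
    """Ask the model for an action, run it, feed the result back, repeat."""
    transcript: List[Step] = [("task", task)]
    for _ in range(max_steps):
        action, argument = model(transcript)
        if action == "finish":
            transcript.append(("finish", argument))
            break
        tool = tools.get(action)
        observation = tool(argument) if tool else f"unknown tool: {action}"
        transcript.append((action, observation))
    return transcript


# --- Toy environment (hypothetical, for illustration only) ---
state = {"fixed": False}


def run_tests(_: str) -> str:
    return "all PASSED" if state["fixed"] else "1 FAILED"


def edit(_: str) -> str:
    state["fixed"] = True
    return "edited"


def scripted_model(transcript: List[Step]) -> Step:
    """Stand-in for an LLM: reacts only to the most recent observation."""
    kind, text = transcript[-1]
    if kind in ("task", "edit"):
        return ("run_tests", "")
    if "PASSED" in text:
        return ("finish", "tests pass")
    return ("edit", "apply fix")


transcript = run_agent(scripted_model,
                       {"run_tests": run_tests, "edit": edit},
                       "make the tests pass")
for step in transcript:
    print(step)
```

The point of the loop is that the model sees the output of each command before choosing the next one, rather than producing code once and hoping it works.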

If your mental model is "people say they got better every six months, therefore I'll never take them seriously because they'll say it again in six months time" you're hurting your own ability to evaluate this (and every other) technology.

cmdli11mo ago

> tptacek wasn't making this argument six months ago.

Yes, but other smart people were making this argument six months ago. Why should we trust the smart person we don't know now if we (looking back) shouldn't have trusted the smart person before?

Part of evaluating a claim is evaluating the source of the claim. For basically everybody, the source of these claims is always "the AI crowd", because those outside the AI space have no way of telling who is trustworthy and who isn't.

JohnKemeny11mo ago

But they say "yes, it didn't work 6 months ago, but it does now", and they say this every month. They're constantly resetting the goal post.

Today it works, it didn't in the past, but it does now. Rinse and repeat.

esperent11mo ago

I stopped paying attention for a few days so I'm way out of date. What is the state of the art for agentic coding now?

I've been using Cline and it can do a few of the things suggested as "agentic", but I'd have no idea how to leave it writing and then running tests in a VM and creating a PR for me to review. Or let it roam around in the file tree and create new files as needed. How does that work? Are there better tools for this? Or do I need to configure Cline in some way?

whoisthemachine11mo ago

Have the models significantly improved, or have we just developed new programs that take better advantage of them?

orionsbelt11mo ago

At what point would you be impressed by a human being if you asked it to help you with a task every 6 months from birth until it was 30 years old?

If you ask different people the above question, and if you vary it based on type of task, or which human, you would get different answers. But as time goes on, more and more people would become impressed with what the human can do.

I don't know when LLMs will stop progressing, but all I know is that they continue to progress at what is, to me, an astounding rate, similar to that of a growing child. Personally, I never used to use LLMs for anything, and since o3 and Gemini 2.5 Pro, I use them all the time for all sorts of stuff.

You may be smarter than me and still not impressed, but I'd try the latest models and play around, and if you aren't impressed yet, I'd bet money you will be within 3 years max (likely much earlier).

Velorivox11mo ago

> At what point would you be impressed by a human being if you asked it to help you with a task every 6 months from birth until it was 30 years old?

In this context, never. Especially because the parent knows you will always ask 2+2 and can just teach the child to say “four” as their first and only word. You’ll be on to them, too.

stouset11mo ago

I saw this article and thought, now's the time to try again!

Using Claude Sonnet 4, I attempted to add some better configuration to my golang project. An hour later, I was unable to get it to produce a usable configuration, apparently due to a recent v1-to-v2 config format migration. It took less time to hand-edit one based on reading the docs.

I keep getting told that this time agents are ready. Every time I decide to use them they fall flat on their face. Guess I'll try again in six months.

simonw11mo ago

If you share your conversation (with the share link in Claude) I'd be happy to see if there are any tweaks I can suggest to how you prompted it.

porridgeraisin11mo ago

Yes.

I made the mistake of procrastinating on one part of a project thinking "Oh, that is easily LLMable". By God, was I proven wrong. Was quite the rush before the deadline.

On the flip side, I'm happy I don't have to write the code for a matplotlib scatterplot for the 10000th time, it mostly gets the variables in the current scope that I intended to plot. But I've really not had that much success on larger tasks.

The "information retrieval" part of the tech is beautiful though. Hallucinations are avoided only if you provide an information bank in the context in my experience. If it needs to use the search tool itself, it's not as good.

Personally, I haven't seen any improvement from the "RLd on math problems" models onward (I don't care for benchmarks). However, I agree that deepseek-r1-zero was a cool result. Pure RL (plain R1 used a few examples) automatically leading to longer responses.

A lot of the improvements suggested in this thread are related to the infra around LLMs such as tool use. These are much more well organised these days with MCP and what not, enabling you to provide it the aforementioned information bank easily. But all of it is built on top of the same fragile next-token generator we know and love.

DarmokJalad170111mo ago

> It took less time to hand-edit one based on reading the docs.

You can give it the docs as an "artifact" in a project - this feature has been available for almost one year now.

Or better yet, use the desktop version + a filesystem MCP server pointing to a folder containing your docs. Tell it to look at the docs and refactor as necessary. It is extremely effective at this. It might also work if you just give it a link to the docs.
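As a rough sketch of the filesystem setup, the Python below builds the kind of JSON entry the desktop app reads for MCP servers. The `mcpServers` key and the `@modelcontextprotocol/server-filesystem` package name are my understanding of the reference filesystem server and should be checked against the current docs; the docs path is a placeholder.

```python
import json


def filesystem_mcp_config(docs_dir: str) -> dict:
    """Build a config entry exposing one folder through a filesystem MCP server.

    Assumptions: the top-level key is "mcpServers" and the server is started
    via npx from the reference filesystem package. Verify both before use.
    """
    return {
        "mcpServers": {
            "filesystem": {
                "command": "npx",
                "args": [
                    "-y",
                    "@modelcontextprotocol/server-filesystem",
                    docs_dir,  # only this folder is exposed to the model
                ],
            }
        }
    }


# Placeholder path: replace with the folder that holds your docs.
print(json.dumps(filesystem_mcp_config("/path/to/your/docs"), indent=2))
```

The printed JSON would go into the desktop app's MCP config file, after which you can ask the model to read the docs in that folder before refactoring.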

Yiin11mo ago

You can give LLM agents links to the docs instead of letting them work blindfolded with hardcoded assumptions.

mathgorges11mo ago

In my experience it's less about the latest generation of LLMs being better, and more about the tooling around them for integration into a programmer's workflow being waaaay better.

The article doesn't explicitly spell it out until several paragraphs later, but I think what your quoted sentence is alluding to is that Cursor, Cline et al can be pretty revolutionary in terms of removing toil from the development process.

Need to perform a gnarly refactor that's easy to describe but difficult to implement because it's spread far and wide across the codebase? Let the LLM handle it and then check its work. Stuck in dependency hell because you updated one package due to a CVE? The LLM can (often) sort that out for you. Heck, did the IDE's refactor tool fail at renaming a function again? LLM.

I remain skeptical of LLM-based development insofar as I think the enshittification will inevitably come when the Magic Money Machine breaks down. And I don't think I would hire a programmer that needs LLM assistance in order to program. But it's hard to deny that it has made me a lot more productive. At the current price it's a no-brainer to use it.

tho23j4o3j432411mo ago

It's great when it works, but half the time IME it's so stupid that it can't even use the edit/path tools properly, even when given inputs with line numbers prepended.

(I should know since I've created half-a-dozen tools for this with gptel. Cline hasn't been any better on my codebase.)
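For readers who haven't seen this kind of tool, here is a minimal Python sketch of the idea: show the model the file with line numbers prepended, then have it return an edit as a line range plus replacement text. The function names and the `N: ` prefix format are made up for illustration; this is not how gptel or Cline actually implement it.

```python
def number_lines(text: str) -> str:
    """Prefix each line with its 1-based line number, as shown to the model."""
    return "\n".join(f"{i}: {line}"
                     for i, line in enumerate(text.splitlines(), start=1))


def apply_edit(text: str, start: int, end: int, replacement: str) -> str:
    """Replace lines start..end (1-based, inclusive) with `replacement`."""
    lines = text.splitlines()
    if not (1 <= start <= end <= len(lines)):
        # Reject edits that point outside the file instead of guessing.
        raise ValueError(f"line range {start}-{end} is outside 1-{len(lines)}")
    lines[start - 1:end] = replacement.splitlines()
    return "\n".join(lines)


source = "a\nb\nc"
print(number_lines(source))
print(apply_edit(source, 2, 2, "B1\nB2"))
```

The failure mode described above is the model returning a range that doesn't match the numbers it was shown, which is why the sketch validates the range before touching the file.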

carpo11mo ago

I think they just meant it hit an inflection point. Some people were copying and pasting into ChatGPT and saying it was crap, while others were using agents that could see the context of the code and worked much, much better. It's the workflow used, not just the specific LLM.

libraryofbabel11mo ago

This isn't a particularly useful filter, because it applies to many very successful technologies as well. Early automobiles generated a lot of hype and excitement, but they were not very good (unreliable, loud, and dangerous, and generally still worse than horses). They got steadily better until eventually they hit an inflection point where the skeptics were dug in repeating the same increasingly old complaints, while Henry Ford was building the Model T.

anxoo11mo ago

name 5 tasks which you think current AIs can't do. then go and spend 30 minutes seeing how current AIs can do on them. write it on a sticky note and put it somewhere that you'll see it.

otherwise, yes, you'll continue to be irritated by AI hype, maybe up until the point where our civilization starts going off the rails

TheRoque11mo ago

Well, I'll try to do a sticky note here:

- they can't be aware of the latest changes in the frameworks I use, and so force me to use older features, sometimes less efficient

- they fail at doing clean DRY practices even though they are supposed to skim through the codebase much faster than me

- they bait me into nonexistent APIs, or hallucinate solutions or issues

- they cannot properly pick the context and the files to read in a mid-size app

- they suggest downloading random packages, sometimes low-quality or unmaintained ones

alisonatwork11mo ago

The problem with AI hype is not really about whether a particular model can - in the abstract - solve a particular programming problem. The problem with AI hype is that it is selling a future where all software development companies become entirely dependent on closed systems.

All of the state-of-the-art models are online models - you have no choice, you have to pay for a black box subscription service controlled by one of a handful of third-party gatekeepers. What used to be a cost center that was inside your company is now a cost center outside your company, and thus it is a risk to become dependent on it. Perhaps the risk is worthwhile, perhaps not, but the hype is saying that real soon now it will be impossible to not become dependent on these closed systems and still exist as a viable company.

apwell2311mo ago

> name 5 tasks which you think current AIs can't do.

For coding, it seems to back itself into a corner and never recover until I "reset" it.

AI can't write software without an expert guiding it. I cannot open a nontrivial PR to Postgres tonight using AI.

poincaredisk11mo ago

1. create a working (moderately complex) ghidra script without hallucinating.

Granted, I was trying to do this 6 months ago, so maybe a miracle has happened. But in the past I had very bad experiences using LLMs for niche things (i.e. things that were never mentioned on Stack Overflow).

AtlasBarfed11mo ago

Everyone keeps thinking AI improvement is linear. I don't know if this is correct, but my basic impression is that the current AI boost came from no longer being limited to the CPU and its throughput, and instead adding the massive amount of computing power in graphics cards.

But for each nine of reliability you want out of LLMs, everyone's assuming it's just linear growth. I don't think it is. I think it's polynomial at least.

As for your tasks, and maybe it's just because I'm using ChatGPT, I asked it to port sed, something with full open source code availability, tons of examples and test cases, and a fully documented user interface, and I wanted it moved to Java as a library.

And it failed pretty spectacularly. Yeah, it got only the very, very basic functionality of sed.

chinchilla202011mo ago

If AI can do anything, why can't I just prompt "Here is sudo access to my laptop, please do all my work for me, respond to emails, manage my household budget, and manage my meetings".

I've tried everything. I have four AI agents. They still have an accuracy rate of about 50%.

ipaddr11mo ago

Make me a million dollars

Tell me about this specific person who isn't famous

Create a facebook clone

Recreate Windows including drivers

Create a way to transport matter like in Star Trek.

I'll see you in 6 months.

someothherguyy11mo ago

Also, professional programmers have varying needs. These people are coding in different languages, with varying complexity, domains, existing code bases and so on.

People making arguments based on sweeping generalizations to a wide audience are often going to be perceived as delusional, as their statements do not apply universally to everyone.

To me, thinking LLMs can code generally because you have success with them and then telling others they are wrong in how they use them is making a gigantic assumptive leap.

spacemadness11mo ago

I just assume every blog post on HN starts with “As a web dev, TITLE”

dolebirchwood11mo ago

> Here’s the thing from the skeptic perspective: This statement keeps getting made on a rolling basis.

Dude, just try the things out. It's just undeniable in my day-to-day life that I've been able to rely on Sonnet (first 3.7 and now 4.0) and Gemini 2.5 to absolutely crush code. I've done 3 side projects in the past 6 months that I would have been way too lazy to build without these tools. They work. Never going back.

ryandrake11mo ago

Why can't reviews of AI be somewhere in the middle between "useless" and "the second coming"?

I tried Copilot a few months ago just to give it a shot, so I could discuss it with at least a shred of experience with the tool, and yeah, it's a neat feature. I wouldn't call it a gimmick--it deserves a little more than that, but I didn't exactly cream my pants over it like a lot of people seem to be doing. It's kind of convenient, like a smart autocomplete. Will it fundamentally change how I write software? No way. But it's cool.

killerstorm11mo ago

Bullshit. We have absolute numbers, not just vibes.

The top of SWE-bench Verified leaderboard was at around 20% in mid-2024, i.e. AI was failing at most tasks.

Now it's at 70%.

Clearly it's objectively better at tackling typical development tasks.

And it's not like it went from 2% to 7%.
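To spell out the arithmetic behind that last line, here is a quick Python sketch using the approximate figures quoted above (roughly 20% and 70%) next to the hypothetical 2% and 7%. Both pairs have the same ratio, but the absolute change and the remaining failure rate are very different.

```python
def compare(before_pct: float, after_pct: float) -> dict:
    """Summarise a change in pass rate, with all values in percent."""
    return {
        "ratio": after_pct / before_pct,         # relative improvement
        "point_gain": after_pct - before_pct,    # absolute gain in points
        "failure_before": 100 - before_pct,      # tasks still failing before
        "failure_after": 100 - after_pct,        # tasks still failing after
    }


big = compare(20, 70)    # approximate figures from the comment above
small = compare(2, 7)    # the hypothetical counterexample

print("20% -> 70%:", big)
print(" 2% ->  7%:", small)
```

In both cases the pass rate multiplies by 3.5, but in the first the failure rate drops from 80% to 30%, while in the second it only drops from 98% to 93%.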

lexandstuff11mo ago

Isn't SWE-bench based on public GitHub issues? Wouldn't the increase in performance also be explained by continuing to train on newer scraped GitHub data, aka training on the test set?

The pressure for AI companies to release a new SOTA model is real, as the technology rapidly becomes commoditised. I think people have good reason to be skeptical of these benchmark results.
