undefined | Better HN

0 pointssimonw8mo ago0 comments

I think the hardest problem in computer science right now may be coming up with an LLM demo that doesn't get called "pretty trivial".

0 comments

1dom8mo ago

I'm very pro LLM and AI. But I completely agree with the comment about how many pieces praising LLMs are doing so with trivial examples. Trivial might not be the right word, but I can't think of a better one that doesn't have a negative connotation, but this shouldn't be negative. Your examples are good and useful, and capture a bunch of tasks a software engineer would do.

I'd say your mandelbrot debug and the LLVM patch are both "trivial" in the same sense: they're discrete, well defined, clear-success-criteria-tasks that could be assigned to any mid/senior software engineer in a relevant domain and they could chip through it in a few weeks.

Don't get me wrong, that's an insane power and capability of LLMs, I agree. But ultimately it's just doing a day job that millions of people can do sleep deprived and hungover.

Non-trivial examples are things that would take a team of different specialist skillsets months to create. One obvious potential reason why there's few non-trivial AI examples is because non-trivial AI examples require non-trivial amount of time to be able to generate and verify.

A non-trivial example isn't an example you can look at the output and say "yup, AI's done well here". It requires someone spends time going into what's been produced, assessing it, essentially redesigning it as a human to figure out all the complexity of a modern non-trivial system to confirm the AI actually did all that stuff correctly.

An in depth audit of a complex software system can take months or even years and is a thorough and tedious task for a human, and the Venn diagrams of humans who are thinking "I want to spend more time doing thorough, tedious code tasks" and "I want to mess around with AI coding" is 2 separate circles.

edmundsauto8mo ago

Current state AI is a best fit for jobs that can be easily verified as correct. In my 20+ years, this is at least 75% of the work I’ve ever done. Maybe 99.999% (I have led a very boring career.)

There’s an enormous amount of value in doing this. For the harder problems you mentioned - most IC SWE are also incapable or unwilling to do the work. So maybe the current state has equivalent capabilities to 95% of coders out there? But it works faster, cheaper, and doesn’t object to tedious work like documentation. It doesn’t require labor law compliance, hiring, onboarding/offboarding, or cause interpersonal conflict.

sokoloff8mo ago

> ultimately it's just doing a day job that millions of people can do sleep deprived and hungover.

Doing for < $10 and under an hour what could be done in a few weeks by $10K+ worth of senior staff time is pretty valuable.

1dom8mo ago

If it's something a single senior staff member can do, then - personally - I'd consider it not complex, it's relatively trivial: it can be done by literally a single person.

I'm pro AI, I'm not saying it's not valuable for trivial things. But that's a distinct discussion to the trivial nature of many LLM examples/demos in relation to genuinely complex computer systems.

1 more reply

simonwOP8mo ago

> Non-trivial examples are things that would take a team of different specialist skillsets months to create.

Thank you for providing a spelled out definition of "non-trivial" there!

1dom8mo ago

Haha, it was made up on the spot, thank you though! I think your articles and notes are proof that there's a lot of value and use in "trivial" examples. They're very close to the sort of examples a lot of tech people can actually use as individual professional engineers.

I think the void where non-trivial examples should be is the same space where contrarians and the last remaining few LLMs-are-useless crowd hangout.

j458mo ago

There is a scale somewhere in these types of articles that will emerge.

It might be something being actually new (cutting edge) vs new to someone vs the human mind wanting to have it be novel and different enough as a comparable percentage of the experience of the first time using ChatGPT 4.

There is also the wiring of non-deterministic software frameworks and architectures compared to the deterministic only software development we're used to.

The former is a different thing than the latter.

sroussey8mo ago

LLMs are best demonstrated with greenfield examples.

j458mo ago

Plus, applying non-deterministic algorithms in a deterministic way might not always work the same. The software developers are also changing the frames and terms of reference.

fho8mo ago

Point in case: i've been trying for weeks now to generate a CFD solver that is more than the basic FDM "toy example".

The models clearly know the equations, but run into the same issues I had when implementing it myself (namely exploding simulations that the models try to paper over by applying more and more relaxation terms).

sundache8mo ago

I only see 148 lines of assembly and a dockerfile that's 7 lines long. Am I missing something or should that take a human less then several weeks.

dotancohen8mo ago

Depends on what's in those 148 lines.

sroussey8mo ago

Convert react-stockcharts to react v19. I’ve tried Claude Code and Cursor but only ended up with hilariously bad results.

simonwOP8mo ago

I had great success with o4-mini via ChatGPT for they kind of upgrade, since of can use its search tool to look up what's changed.

I used this prompt a few weeks ago:

> This code needs to be upgraded to the new recommended JavaScript library from Google. Figure out what that is and then look up enough documentation to port this code to it.

https://simonwillison.net/2025/Apr/21/ai-assisted-search/#la...

dust428mo ago

I have one for you: implement gemma 3n multimodel support in llama.cpp

fragmede8mo ago

I think Cloudflare's oauth library qualifies https://news.ycombinator.com/item?id=44159166

gen6acd60af8mo ago

This one?

>Claude's output was thoroughly reviewed by Cloudflare engineers with careful attention paid to security and compliance with standards.

>To emphasize, this is not "vibe coded". Every line was thoroughly reviewed and cross-referenced with relevant RFCs, by security experts with previous experience with those RFCs.

Some time later...

https://github.com/advisories/GHSA-4pc9-x2fx-p7vj / CVE-2025-4143

>The OAuth implementation in workers-oauth-provider that is part of MCP framework https://github.com/cloudflare/workers-mcp, did not correctly validate that redirect_uri was on the allowed list of redirect URIs for the given client registration.

kentonv8mo ago

Sorry, my code has bugs sometimes.

afro888mo ago

It coming from computer science might be the issue. There's a lot of open source repos out there that have tricky bugs, and todo lists of features that are too complex or time consuming for casual contributors to tackle. Adding significant value to an open source project is a pretty nice demo that won't get called "pretty trivial".

Can't be too far off!

cranium8mo ago

Instead of "pretty trivial", I'd say it's "well-defined and generally understood".

The implicit decisions it had to make were also inconsequential, eg. selection of ASCII chars, color or not, bounds of the domain,...

However, it shows that agents are powerful translators / extractors of general knowledge!

raxxorraxor8mo ago

The complexity of the problem masqerades the common problem of providing sensible context to your AI of choice to have it doing something constructive in your personal codebase. Or giving it tools to check the truth of one of its assertions. Something a developer does countless times.

th0ma58mo ago

Maybe you should try something other than demos? Have you tried creating a reliable system?

1 more reply

j458mo ago

Many big problems are made up of small problems.

jkhdigital8mo ago

No the hardest problem is teaching CS undergrads. I just started this year (no background in academia, just 75% of a PhD and well-rounded life experience) and I’ve basically torn up the entire curriculum they handed to me and started vibe-teaching.

skydhash8mo ago

Because they are trivial in a way that you can go on GitHub and copy one of those while not pretending LLM isn't a mashup of the internet.

What people agree on being non-trivial is working on a real project. There's a lot of opensource projects that could benefit from a useful code contribution. But they only got slop thrown at them.

1 more reply

kayge8mo ago

The "No True Scotsware" problem? :)

x0x08mo ago

I have one: features I've tried this on in my codebase. Because claude and gemini have both failed pretty badly.

So it's pretty stupid to just assume that critics haven't tried.

Example feature: send analytics events on app start triggered by notifications. Both Gemini and Claude completely failed to understand the component tree; rewrote hundreds of lines of code in broken ways; and even when prompted with the difficulty (this is happening outside of the component tree), failed to come up with a good solution. And even when deliberately prompted not to, like to simultaneously make cosmetic code changes to other pieces of the files they're touching.

pydry8mo ago

Really? This paper cut through the same kind of bullshit with puzzles: https://ml-site.cdn-apple.com/papers/the-illusion-of-thinkin...

What do you think is so difficult about doing the same thing with coding problems?

simonwOP8mo ago

I don't understand the connection between that paper and my comment.

pydry8mo ago

They created an environment to expose LLMs to problems and test their performance which were immune from benchmark hacking using puzzles.

Your comment was about how this was unreasonably hard (for coding challenges).

Anecdotally Ive seen LLMs do all sorts of amazing shit which was obviously drawn from their training set and fall flat on their faces doing simple coding tasks which are novel enough to not appear in the training set.

1 more reply

j / k navigate · click thread line to collapse

0 comments

1dom8mo ago

Don't get me wrong, that's an insane power and capability of LLMs, I agree. But ultimately it's just doing a day job that millions of people can do sleep deprived and hungover.

edmundsauto8mo ago

Current state AI is a best fit for jobs that can be easily verified as correct. In my 20+ years, this is at least 75% of the work I’ve ever done. Maybe 99.999% (I have led a very boring career.)

sokoloff8mo ago

> ultimately it's just doing a day job that millions of people can do sleep deprived and hungover.

Doing for < $10 and under an hour what could be done in a few weeks by $10K+ worth of senior staff time is pretty valuable.

1dom8mo ago

If it's something a single senior staff member can do, then - personally - I'd consider it not complex, it's relatively trivial: it can be done by literally a single person.

I'm pro AI, I'm not saying it's not valuable for trivial things. But that's a distinct discussion to the trivial nature of many LLM examples/demos in relation to genuinely complex computer systems.

1 more reply

simonwOP8mo ago

> Non-trivial examples are things that would take a team of different specialist skillsets months to create.

Thank you for providing a spelled out definition of "non-trivial" there!

1dom8mo ago

I think the void where non-trivial examples should be is the same space where contrarians and the last remaining few LLMs-are-useless crowd hangout.

j458mo ago

There is a scale somewhere in these types of articles that will emerge.

There is also the wiring of non-deterministic software frameworks and architectures compared to the deterministic only software development we're used to.

The former is a different thing than the latter.

sroussey8mo ago

LLMs are best demonstrated with greenfield examples.

j458mo ago

Plus, applying non-deterministic algorithms in a deterministic way might not always work the same. The software developers are also changing the frames and terms of reference.

fho8mo ago

Point in case: i've been trying for weeks now to generate a CFD solver that is more than the basic FDM "toy example".

sundache8mo ago

I only see 148 lines of assembly and a dockerfile that's 7 lines long. Am I missing something or should that take a human less then several weeks.

dotancohen8mo ago

Depends on what's in those 148 lines.

sroussey8mo ago

Convert react-stockcharts to react v19. I’ve tried Claude Code and Cursor but only ended up with hilariously bad results.

simonwOP8mo ago

I had great success with o4-mini via ChatGPT for they kind of upgrade, since of can use its search tool to look up what's changed.

I used this prompt a few weeks ago:

> This code needs to be upgraded to the new recommended JavaScript library from Google. Figure out what that is and then look up enough documentation to port this code to it.

https://simonwillison.net/2025/Apr/21/ai-assisted-search/#la...

dust428mo ago

I have one for you: implement gemma 3n multimodel support in llama.cpp

fragmede8mo ago

I think Cloudflare's oauth library qualifies https://news.ycombinator.com/item?id=44159166

gen6acd60af8mo ago

This one?

>Claude's output was thoroughly reviewed by Cloudflare engineers with careful attention paid to security and compliance with standards.

>To emphasize, this is not "vibe coded". Every line was thoroughly reviewed and cross-referenced with relevant RFCs, by security experts with previous experience with those RFCs.

Some time later...

https://github.com/advisories/GHSA-4pc9-x2fx-p7vj / CVE-2025-4143

kentonv8mo ago

Sorry, my code has bugs sometimes.

afro888mo ago

Can't be too far off!

cranium8mo ago

Instead of "pretty trivial", I'd say it's "well-defined and generally understood".

The implicit decisions it had to make were also inconsequential, eg. selection of ASCII chars, color or not, bounds of the domain,...

However, it shows that agents are powerful translators / extractors of general knowledge!

raxxorraxor8mo ago

th0ma58mo ago

Maybe you should try something other than demos? Have you tried creating a reliable system?

1 more reply

j458mo ago

Many big problems are made up of small problems.

jkhdigital8mo ago

skydhash8mo ago

Because they are trivial in a way that you can go on GitHub and copy one of those while not pretending LLM isn't a mashup of the internet.

What people agree on being non-trivial is working on a real project. There's a lot of opensource projects that could benefit from a useful code contribution. But they only got slop thrown at them.

1 more reply

kayge8mo ago

The "No True Scotsware" problem? :)

x0x08mo ago

I have one: features I've tried this on in my codebase. Because claude and gemini have both failed pretty badly.

So it's pretty stupid to just assume that critics haven't tried.

pydry8mo ago

Really? This paper cut through the same kind of bullshit with puzzles: https://ml-site.cdn-apple.com/papers/the-illusion-of-thinkin...

What do you think is so difficult about doing the same thing with coding problems?

simonwOP8mo ago

I don't understand the connection between that paper and my comment.

pydry8mo ago

They created an environment to expose LLMs to problems and test their performance which were immune from benchmark hacking using puzzles.

Your comment was about how this was unreasonably hard (for coding challenges).

1 more reply

j / k navigate · click thread line to collapse