In my mind LLMs are just UNIX string manipulation tools like `sed` or `awk`: you give them an input and a command and they give you an output. This is especially true if you use something like `llm` [1].
It then seems logical that you can compose calls to LLMs, loop and branch and combine them with other functions.
It only worked because of your LLM tool. Standing on the shoulders of giants.
I'm not saying it's not worth doing, considering how the software development process we've already been using as an industry ends up with a lot of bugs in our code. (When talking about this with people who aren't technical, I sometimes like to say that the reason software has bugs in it is that we don't really have a good process for writing software without bugs at any significant scale, and it turns out that software is useful for enough stuff that we still write it knowing this). I do think I'd be pretty concerned with how I could model constraints in this type of workflow though. Right now, my fairly naive sense is that we've already moved the needle so far on how much easier it is to create new code than review it and notice bugs (despite starting from a place where it already was tilted in favor of creation over review) that I'm not convinced being able to create it even more efficiently and powerfully is something I'd find useful.
That gave me a hearty chuckle!
/s
Only a selected few get to argue about what is the best programming language for XYZ.
Personally I’d absolutely buy an LLM in a box which I could connect to my home assistant via usb.
To use an example: I could write an elaborate prompt to fetch requirements, browse a website, generate E2E test cases, and compile a report, and Claude could run it all to some degree of success. But I could also break it down into four specialised agents, with their own context windows, and make them good at their individual tasks.
But there is no reason (and lots of downside) to leave anything to the LLM that’s not “fuzzy” and you could just write deterministically, thus the agent model.
On the other hand, I think that "show it or it didn't happen" is essential.
Dumping a bit of code into an LLM doesn’t make it a code agent.
And what magic? It sounds like you never hit conceptual and structural problems. Context window? History? Good or bad? Large-scale changes or small refactorings here and there? Sample size of one, or several teams? What app? How many components? Green field or not? Which programming language?
I bet you will see Claude, and especially GitHub Copilot, a bit differently, given that you can kill any self-made code agent quite easily with a bit of steam.
Code Agents are incredibly hard to build and use. Vibe Coding is dead for a reason. I remember vividly the inflation of Todo apps and JS frameworks (Ember, Backbone, Knockout are survivors) years ago.
The more you know about agents, and especially code agents, the more you understand why engineers won't be replaced so fast - senior engineers who hone their craft, at least.
I enjoy fiddling with experimental agent implementations, but I value certain frameworks. They solved, in an opinionated way, problems you will run into if you dig deeper and others depend on you.
Caching helps a lot, but yeah, there are some growing pains as the agent gets larger. Anthropic’s caching strategy (4 blocks you designate) is a bit annoying compared to OpenAI’s cache-everything-recent. And you start running into the need to start summarizing old turns, or outright tossing them, and deciding what’s still relevant. Large tool call results can be killer.
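For anyone who hasn't hit this yet, designating those blocks looks roughly like this with the Anthropic Python SDK (a minimal sketch; the model name, prompt text, and message history are placeholders):

import anthropic

client = anthropic.Anthropic()

# The big, stable prefix (instructions, tool docs) you want reused across turns.
system_prompt = "You are a coding agent. <long, stable instructions here>"

conversation = [{"role": "user", "content": "List the files in the repo."}]

# Anthropic-style caching: you explicitly mark breakpoints (up to 4) with
# cache_control, and everything up to each marked block can be served from cache.
response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model name
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"},  # cache breakpoint
        }
    ],
    messages=conversation,
)
print(response.content[0].text)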
I think at least for educational purposes, it's worth doing, even if people end up going back to Claude Code, or away from agentic coding altogether for their day to day.
I think this is the best way of putting it I've heard to date. I started building one just to know what's happening under the hood when I use an off-the-shelf one, but it's actually so straightforward that now I'm adding features I want. I can add them faster than a whole team of developers on a "real" product can add them - because they have a bigger audience.
The other takeaway is that agents are fantastically simple.
And yeah, the LLM does so much of the lifting that the agent part is really surprisingly simple. It was really a revelation when I started working on mine.
I'm now experimenting with letting the agent generate its own source code from a specification - currently generating 9K lines of Python code (3K of implementation, 6K of tests) from 1.5K lines of specifications (https://alejo.ch/3hi).
I tried Whisper, but it's slow and not great.
I tried the gpt audio models, but they're trained to refuse to transcribe things.
I tried Google's models and they were terrible.
I ended up using one of Mistral's models, which is alright and very fast except sometimes it will respond to the text instead of transcribing it.
So I'll occasionally end up with pages of LLM rambling pasted instead of the words I said!
I'm mostly running this on an M4 Max, so pretty good, but not an exotic GPU or anything. But with that setup, multiple sentences usually transcribe quickly enough that it doesn't really feel like much of a delay.
If you want something polished for system-wide use rather than rolling your own, I've been liking MacWhisper on the Mac side, currently hunting for something on Arch.
Honestly, I've gotten really far simply by transcribing audio with whisper, having a cheap model clean up the output to make it make sense (especially in a coding context), and copying the result to the clipboard. My goal is less about speed and more about not touching the keyboard, though.
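For anyone curious, a minimal version of that pipeline can be just a few calls (a sketch; the model names and the pyperclip clipboard dependency are my assumptions, not necessarily what the parent uses):

from openai import OpenAI
import pyperclip  # pip install pyperclip, for clipboard access

client = OpenAI()

# 1. Transcribe the recorded audio with Whisper.
with open("dictation.wav", "rb") as audio:
    raw_text = client.audio.transcriptions.create(
        model="whisper-1", file=audio
    ).text

# 2. Have a cheap model clean up the transcript (punctuation, code terms, etc.).
cleaned = client.chat.completions.create(
    model="gpt-4o-mini",  # any cheap model will do
    messages=[
        {"role": "system", "content": "Clean up this dictated text for use in a "
         "coding context. Fix punctuation and obvious transcription errors; "
         "keep the meaning unchanged."},
        {"role": "user", "content": raw_text},
    ],
).choices[0].message.content

# 3. Copy the result to the clipboard instead of typing it.
pyperclip.copy(cleaned)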
What does that mean?
I made a fun toy agent where the two models are shoulder-surfing each other and swap turns, either voluntarily (during a summarization phase) or forcefully (if a tool-calling mistake is made), and Kimi ends up running the show much, much more often than gpt-oss.
And yes - it is very much fun to build those!
gpt-oss 120b is an open weight model that OpenAI released a while back, and Cerebras (a startup that is making massive wafer-scale chips that keep models in SRAM) is running that as one of the models they provide. They're a small scale contender against nvidia, but by keeping the model weights in SRAM, they get pretty crazy token throughput at low latency.
In terms of making your own agent, this one's pretty good as a starting point, and you can ask the models to help you make tools for eg running ls on a subdirectory, or editing a file. Once you have those two, you can ask it to edit itself, and you're off to the races.
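If it helps to see what "making a tool" amounts to, it's basically a JSON schema you advertise to the model plus a local function you dispatch to when the model asks for it. A hedged sketch in the OpenAI function-calling format (names and structure are illustrative):

import json
import subprocess

# The schema the model sees.
tools = [{
    "type": "function",
    "function": {
        "name": "list_directory",
        "description": "List the files in a subdirectory of the project.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Relative path to list."},
            },
            "required": ["path"],
        },
    },
}]

# The local implementation you run when the model emits a matching tool call.
def list_directory(path: str) -> str:
    result = subprocess.run(["ls", "-la", path], capture_output=True, text=True)
    return result.stdout or result.stderr

def dispatch(tool_call) -> str:
    args = json.loads(tool_call.function.arguments)
    if tool_call.function.name == "list_directory":
        return list_directory(**args)
    return f"unknown tool: {tool_call.function.name}"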
https://gist.github.com/avelican/4fa1baaac403bc0af04f3a7f007...
No dependencies, and very easy to swap out for OpenRouter, Groq or any other API. (Except Anthropic and Google, they are special ;)
This also works on the frontend: pro tip you don't need a server for this stuff, you can make the requests directly from a HTML file. (Patent pending.)
I'm not saying that the agent would do a better job than a good "hardcoded" human telemetry system, and we don't use agents for this stuff right now. But I do know that getting an agent across the 90% threshold of utility for a problem like this is much, much easier than building the good telemetry system is.
And that's why I won't touch 'em. All the agents will be abandoned when people realize their inherent flaws (security, reliability, truthfulness, etc) are not worth the constant low-grade uncertainty.
In a way it fits our times. Our leaders don't find truth to be a very useful notion. So we build systems that hallucinate and act unpredictably, and then invest all our money and infrastructure in them. Humans are weird.
Edit: reflecting on what the lesson is here, in either case I suppose we're avoiding the pain of dealing with Unix CLI tools :-D
In the toy example, you explicitly restrict the agent to supply just a `host`, and hard-code the rest of the command. Is the idea that you'd instead give a `description` something like "invoke the UNIX `ping` command", and a parameter described as constituting all the arguments to `ping`?
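In other words, the contrast would be roughly between these two shapes of tool schema (illustrative sketches, not the article's actual definitions):

# Constrained: the agent only supplies a host; the rest of the command is hard-coded.
ping_tool_constrained = {
    "name": "ping",
    "description": "Ping a host and report whether it responds.",
    "parameters": {
        "type": "object",
        "properties": {"host": {"type": "string", "description": "Hostname or IP to ping."}},
        "required": ["host"],
    },
}

# Free-form: the model composes all of the arguments to the UNIX `ping` command.
ping_tool_freeform = {
    "name": "ping",
    "description": "Invoke the UNIX `ping` command.",
    "parameters": {
        "type": "object",
        "properties": {"args": {"type": "string", "description": "All arguments to pass to ping."}},
        "required": ["args"],
    },
}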
Oh... oh I know how about... UNIX Philosophy? No... no that'd never work.
/s
I suspect the sweet spot for LLMs is somewhere in the middle, not quite as small as some traditional unix tools.
I kind of am missing the bridge between that, and the fundamental knowledge that everything is token based in and out.
Is it fair to say that the tool abstraction the library provides you is essentially some niceties around a prompt something like "Defined below are certain 'tools' you can use to gather data or perform actions. If you want to use one, please return the tool call you want and its arguments, delimited before and after with '###', and stop. I will invoke the tool call and then reply with the output delimited by '==='".
Basically, telling the model how to use tools, earlier in the context window. I already don't totally understand how a model knows when to stop generating tokens, but presumably those instructions will get it to output the request for a tool call in a certain way and stop. Then the agent harness knows to look for those delimiters and extract out the tool call to execute, and then add to the context with the response so the LLM keeps going.
Is that basically it? Or is there more magic there? Are the tool call instructions in some sort of permanent context, or could the interaction be demonstrated in a fine-tuning step, and inferred by the model from just its weights?
You can see the prompts that make this work for gpt-oss in the chat template in their Hugging Face repo: https://huggingface.co/openai/gpt-oss-120b/blob/main/chat_te... - including this bit:
{%- macro render_tool_namespace(namespace_name, tools) -%}
{{- "## " + namespace_name + "\n\n" }}
{{- "namespace " + namespace_name + " {\n\n" }}
{%- for tool in tools %}
{%- set tool = tool.function %}
{{- "// " + tool.description + "\n" }}
{{- "type "+ tool.name + " = " }}
{%- if tool.parameters and tool.parameters.properties %}
{{- "(_: {\n" }}
{%- for param_name, param_spec in tool.parameters.properties.items() %}
{%- if param_spec.description %}
{{- "// " + param_spec.description + "\n" }}
{%- endif %}
{{- param_name }}
...
As for how LLMs know when to stop... they have special tokens for that. "eos_token_id" stands for End of Sequence - here's the gpt-oss config for that: https://huggingface.co/openai/gpt-oss-120b/blob/main/generat...
{
"bos_token_id": 199998,
"do_sample": true,
"eos_token_id": [
200002,
199999,
200012
],
"pad_token_id": 199999,
"transformers_version": "4.55.0.dev0"
}
The model is trained to output one of those three tokens when it's "done". https://cookbook.openai.com/articles/openai-harmony#special-... defines some of those tokens:
200002 = <|return|> - you should stop inference
200012 = <|call|> - "Indicates the model wants to call a tool."
I think that 199999 is a legacy EOS token ID that's included for backwards compatibility? Not sure.
That said, it is worth understanding that the current generation of models is extensively RL-trained on how to make tool calls... so they may in fact be better at issuing tool calls in the specific format that their training has focused on (using specific internal tokens to demarcate and indicate when a tool call begins/ends, etc). Intuitively, there's probably a lot of transfer learning between this format and any ad-hoc format that you might request inline in your prompt.
There may be recent literature quantifying the performance gap here. And certainly if you're doing anything performance-sensitive you will want to characterize this for your use case, with benchmarks. But conceptually, I think your model is spot on.
Structured Output APIs (inc. the Tool API) take the schema and build a Context-free Grammar, which is then used during generation to mask which tokens can be output.
I found https://openai.com/index/introducing-structured-outputs-in-t... (have to scroll down a bit to the "under the hood" section) and https://www.leewayhertz.com/structured-outputs-in-llms/#cons... to be pretty good resources
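If you want to try the schema-constrained path yourself, it looks roughly like this against the OpenAI API (a sketch; the schema is a made-up example):

from openai import OpenAI

client = OpenAI()

# With a strict JSON schema, the server compiles it into a grammar and masks
# token sampling so the output can only ever be valid against that schema.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user",
               "content": "Extract the host to ping from: 'check if example.com is up'"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "ping_request",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {"host": {"type": "string"}},
                "required": ["host"],
                "additionalProperties": False,
            },
        },
    },
)
print(response.choices[0].message.content)  # e.g. {"host": "example.com"}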
If the Apis I call are not profitable for the provider then they won't be for me either.
This post is a fly.io advertisement
The problem is that you might not intuitively understand how agents work and what they are and aren't capable of - at least not as well as you would understand it if you spent half an hour building one for yourself.
> The problem is that you might not intuitively understand how agents work and what they are and aren't capable of
I don't necessarily agree with the GP here, but I also disagree with this sentiment: I don't need to go through the experience of building a piece of software to understand what the capabilities of that class of software is.
Fair enough, with most other things (software or otherwise), they're either deterministic or predictably probabilistic, so simply using it or even just reading how it works is sufficient for me to understand what the capabilities are.
With LLMs, the lack of determinism coupled with completely opaque inner-workings is a problem when trying to form an intuition, but that problem is not solved by building an agent.
None of them are doing that.
They need funding because the next model has always been much more expensive to train than the profits of the previous model. And many do offer a lot of free usage which is of course operated at a loss. But I don't think any are operating inference at a loss, I think their margins are actually rather large.
However, knowing a few people on teams at inference-only providers, I can promise you some of them absolutely are operating inference at a loss.
0. https://www.theregister.com/2025/10/29/microsoft_earnings_q1...
Snark aside, inference is still being done at a loss. Anthropic, the most profitable AI vendor, is operating at a roughly -140% margin. xAI is the worst at somewhere around -3,600% margin.
If a company stops training new models until they can fund it out of previous profits, do we only slow down or halt altogether? If they all do?
Citation needed. I haven't seen any of them claim to have even positive gross margins to shareholders/investors, which surely they would do if they did.
OpenAI's balance sheet also shows an $11 billion loss.
I can't see any profit on anything they create. The product is good but it relies on investors fueling the AI bubble.
Can you point us to the data?
and
> You can be your own AI provider.
Not sure that being your own AI provider is "sustainably monetizable"?
It's about balance.
Really it's the AI providers that have been promising unreal gains during this hype period, so people are more profit-oriented.
And I use value in quotes because as soon as the AI providers suddenly need to start generating a profit, that “value” is going to cost more than your salary.
I'm writing a personal assistant which, imo, is distinct from an agent in that it has a lot of capabilities a regular agent wouldn't necessarily need such as memory, task tracking, broad solutioning capabilities, etc... I ended up writing agents that talk to other agents which have MCP prompts, resources, and tools to guide them as general problem solvers. The first agent that it hits is a supervisor that specializes in task management and as a result writes a custom context and tool selection for the react agent it tasks.
All that to say, the farther you go down this rabbit hole the more "engineering" it becomes. I wrote a bit on it here: https://ooo-yay.com/blog/building-my-own-personal-assistant/
Is this useful for you?
These 4 lines wound up being the heart of it, which is surprisingly simple, conceptually.
until mission_accomplished? or given_up? or killed?
determine_next_command_and_inputs
run_next_command
end
I bet a majority of people who can ride a bicycle don't know how they steer, and would describe the physical movements they use to initiate and terminate a turn inaccurately.
“Most People Don't Know How Bikes Work”
Now you can't reproduce it because it's probabilistic. Each step takes half a second, so you sit there for 10–20 minutes just waiting for a chance to see what went wrong.
Well done! :-D
For the problem domains I care about at the moment, I'm quite bullish about agents. I think they're going to be huge wins for vulnerability analysis and for operations/SRE work (not actually turning dials, but in making telemetry more interpretable). There are lots of domains where I'm less confident in them. But you could reasonably call me an optimist.
But the point of the article is that its arguments work both ways.
from openai import OpenAI

client = OpenAI()
context_good, context_bad = [{
"role": "system", "content": "you're Alph and you only tell the truth"
}], [{
"role": "system", "content": "you're Ralph and you only tell lies"
}]
...
And this will work great until next week's update when Ralph responses will consist of "I'm sorry, it would be unethical for me to respond with lies, unless you pay for the Premium-Super-Deluxe subscription, only available to state actors and firms with a six-figure contract."
You're building on quicksand.
You're delegating everything important to someone who has no responsibility to you.
That said, I built an LLM following Karpathy's tutorial. So I think it's good to dabble a bit.
I built an 8-bit computer on breadboards once, then went down the rabbit hole of flight training for a PPL. Every time I think I’m "done," the finish line moves a few miles further.
Guess we nerds are never happy.
If you are a software engineer, you are going to be expected to use AI in some form in the near future. A lot of AI in its current form is not intuitive. Ergo, spending a small effort on building an AI agent is a good way to develop the skills and intuition needed to be successful in some way.
Nobody is going to use a CPU you build, nor are you ever going to be expected to build one in the course of your work if you don't seek out specific positions, nor is there much that's non-intuitive about commonly used CPU functionality. In fact you don't even use the CPU directly, you use translation software which itself is fairly non-intuitive. But that's ok too, you are unlikely to be asked to build a compiler unless you seek out those sorts of jobs.
EVERYONE involved in writing applications and services is going to use AI in the near future, and in case you missed the last year, everyone IS building stuff with AI - mostly chat assistants that mostly suck because much about building with AI is not intuitive.
I did the same exercise. My implementation is at around 300 lines with two tools, web search and web page fetch, with a command-line chat interface and a Python package. And it could have been a lot fewer lines if I didn't want to write a usable, extensible package interface.
As the agent setup itself is simple, the majority of the work to make this useful would be in the tools themselves and in context management for the tools.
That is where the human in the loop needs to focus on for now :)
that sums up my experience in AI over the past three years. so many projects reinvent the same thing, so much spaghetti thrown at the wall to see what sticks, so much excitement followed by disappointment when a new model drops, so many people grifting, and so many hacks and workarounds like RAG with no evidence of them actually working other than "trust me bro" and trial and error.
Once you recognize that 'make this code better' provides no direction, it should make sense that the output is directionless.
But on more subtle levels, whatever subtle goals that we have and hold in the workplace will be reflected back by the agents.
If you're trying to optimise costs and increase profits as your north star, layoffs and unsustainable practices are a logical result when you haven't balanced this with any incentives to abide by human values.
What products or companies are the gold standard of agent implementation right now?
I'll be trying again once I have written my own agent, but I don't expect to get any useful results compared to using some Claude or Gemini tokens.
We're about to launch an SDK that gives devs all these building blocks, specifically oriented around software agents. Would love feedback if anyone wants to look: https://github.com/OpenHands/software-agent-sdk
The article isn’t about writing production ready agents, so it does appear to be that easy
When I build an agent my standard is Cursor, which updates the UI at every reportable step of the way, and gives you a ton of control opportunities, which I find creates a lot of confidence.
Is this level of detail and control possible with the OpenHands SDK? I’m asking because the last SDK that was simple to get into lacked that kind of control.
I’m trying to understand if the value for Claude Code (for example) is purely in Sonnet/Haiku + the tool system prompt, or if there’s more secret sauce - beyond the “sugar” of instruction file inclusion via commands, tools, skills etc.
I think Claude Code's magic is that Anthropic is happy to burn tokens. The loop itself is not all that interesting.
What is interesting is how they manage the context window over a long chat. And I think a fair amount of that is serverside.
This is why I keep coming back to Hacker News. If the above is not a quintessential "hack", then I've never seen one.
Bravo!
https://github.com/microsoft/vscode-copilot-chat/blob/4f7ffd...
I wish we had a version that was optimized around token/cost efficiency
The summary is
The beauty is in the simplicity:
1. One loop - while (true)
2. One step at a time - stopWhen: stepCountIs(1)
3. One decision - "Did LLM make tool calls? → continue : exit"
4. Message history accumulates tool results automatically
5. LLM sees everything from previous iterations
This creates emergent behavior where the LLM can:
- Try something
- See if it worked
- Try again if it failed
- Keep iterating until success
- All without explicit retry logic!
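Those stopWhen / stepCountIs names come from a TypeScript SDK, but the same loop is only a handful of lines against the raw OpenAI Python client too. A hedged sketch, assuming `tools` (the advertised schemas) and `dispatch()` (which runs the matching local function and returns its output as a string) are defined elsewhere:

from openai import OpenAI

client = OpenAI()
messages = [{"role": "user", "content": "Find out whether example.com responds to ping."}]

while True:
    response = client.chat.completions.create(
        model="gpt-4o-mini", messages=messages, tools=tools
    )
    msg = response.choices[0].message
    messages.append(msg)            # history accumulates every iteration

    if not msg.tool_calls:          # no tool calls -> the model is done, exit
        print(msg.content)
        break

    for call in msg.tool_calls:     # otherwise run each tool and feed the result back
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": dispatch(call),
        })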
Okay, but what if I'd prefer not to have to trust a remote service not to send me
{ "output": [ { "type": "function_call", "command": "rm -rf / --no-preserve-root" } ] }
?
https://github.com/zerocore-ai/microsandbox
I haven't tried it.
docker run -it --rm \
-e SOME_API_KEY="$(SOME_API_KEY)" \
-v "$(shell pwd):/app" \ <-- restrict file system to whatever folder
--dns=127.0.0.1 \ <-- restrict network calls to localhost
$(shell dig +short llm.provider.com 2>/dev/null | awk '{printf " --add-host=llm.provider.com:%s", $$0}') \ <-- allow outside networking to whatever api your agent calls
my-agent-image
Probably could be a bit cleaner, but it worked for me.
I'm not surprised that AI companies would want me to use them though... I know what you're doing there :)
The number one feature of agents is to act as disambiguation for tool selectors and pretty printers.
This resonates deeply with me. That's why I built one myself [0]; I really, really love to truly understand how coding agents work. The learning has been immense for me, I now have working knowledge of ANSI escape codes, grapheme clusters, terminal emulators, Unicode normalization, VT protocols, PTY sessions, and filesystem operations - all the low-level details I would never have thought about until I was implementing them.
> This resonates deeply with me. That's why I built one myself [0]
I was hoping to see a home-made bike at that link.. Came away disappointed
This case is more like a journeyman blacksmith who has to make his own tools before he can continue. In doing so, he gets tools of his own, but the real reward was learning what is required to handle the metal such that it makes a strong hammer. And like the blacksmith, you learn more if you use an existing agent to write your agent.
The illusion was broken for me by Cline context overflows/summaries, but I think it's very easy to miss if you never push the LLM hard or build your own agent. I really like this wording, and the simple description is missing from how science communicators tend to talk about agents and LLMs imo.
My main point being, though: for anyone intimidated by the recent tooling advances… you can most definitely do all this yourself.
https://blog.cofree.coffee/2025-03-05-chat-bots-revisited/
I did some light integration experiments with the OpenAI API but I never got around to building a full agent. Alas..
Did you get to the part where he said MCP is pointless and are saying he's wrong?
Or did you just read the start of the article and not get to that bit?
Forgive me if I get something wrong: from what I see, it seems fundamentally it is an LLM being run each loop with information about tools provided to it. On each loop the LLM evaluates inputs/context (from tool calls, inputs, etc.) and decides which tool to call / what text to output.
Fire up "claude --dangerously-skip-permissions" in a fresh directory (ideally in a Docker container if you want to limit the chance of it breaking anything else) and prompt this:
> Use Playwright to fetch ten reviews from http://www.example.com/ then run sentiment analysis on them and write the results out as JSON files. Install any missing dependencies.
Watch what it does. Be careful not to let it spider the site in a way that would justifiably upset the site owners.
I realize this is just for motivation in a subtitle, but people generally don't grasp how bicycles work, even after having ridden one.
Veritasium has a quite good video on the subject: https://www.youtube.com/watch?v=9cNmUNHSBac
You can get quite far quite quickly. My toy implementation [1] is <600 LOC and even supports MCP.
This is one of the first production-grade errors I made when I started programming. I had a widget that would ping the network, but every time someone went to the page, a new ping process would spawn.
I've written some agents that have their context altered by another llm to get it back on track. Let's say the agent is going off rails, then a supervisor agent will spot this and remove messages from the context where it went off rails, or alter those with correct information. Really fun stuff but yeah, we're essentially still inventing this as we go along.
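A rough shape of that supervisor step, as a hedged sketch (the prompt and the idea of returning message indexes to drop are just one way to do it):

import json
from openai import OpenAI

client = OpenAI()

def prune_context(messages: list[dict]) -> list[dict]:
    # Ask a supervisor model which messages derailed the agent, then drop them.
    review = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable model can play supervisor
        messages=[
            {"role": "system", "content":
             "You supervise another agent. Given its message history as JSON, "
             "return {\"indexes\": [...]} listing the messages that led it off track."},
            {"role": "user", "content": json.dumps(messages)},
        ],
        response_format={"type": "json_object"},
    )
    bad = set(json.loads(review.choices[0].message.content).get("indexes", []))
    return [m for i, m in enumerate(messages) if i not in bad]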
Anyways, if it nerd sniped you, I succeeded. :)
The op has a point - a good one
Hold up. These are all the right concerns but with the wrong conclusion.
You don't need MCP if you're making one agent, in one language, in one framework. But the open coding and research assistants that we really want will be composed of several. MCP is the only thing out there that's moving in a good direction in terms of enabling us to "just be programmers" and "use APIs", and maybe even test things in fairly isolated and reproducible contexts. Compare this to skills.md, which is actually de facto proprietary as of now, does not compose, has opaque run-times and dispatch, is pushing us towards certain models, languages and certain SDKs, etc.
MCP isn't a plugin interface for Claude, it's just JSON-RPC.
I get that you can use MCP with any agent architecture. I debated whether I wanted to hedge and point out that, even if you build your own agent, you might want to do an MCP tool-call feature just so you can use tool definitions other people have built (though: if you build your own, you'd probably be better off just implementing Claude Code's "skill" pattern).
But I decided to keep the thrust of that section clearer. My argument is: MCP is a sideshow.
First of all, the call accuracy is much higher.
Second, you get more consistent results across models.
Why would Fly.io advocate using the vanilla GPT API to write an agent, instead of the official agent?
Does anyone actually know what exactly an agent is?
You can buy credits and set usage limits per API key for safe testing, and get access to many AI models through one simple, unified API covering all the popular model providers (OpenAI, Anthropic, Google, xAI, DeepSeek, Z.AI, Qwen, ...).
Ten dollars is plenty to get started... experiments like in the post will cost you cents, not dollars.
You propose increasing the complexity of interactions of these tools, and giving them access to external tools that have real-world impact? As a security researcher, I'm not sure how you can suggest that with a straight face, unless your goal is to have more vulnerable systems.
Most people can't manage to build robust and secure software using SOTA hosted "agents". Building their own may be a fun learning experience, but relying on a Rube Goldberg assembly of disparate "agents" communicating with each other and external tools is a recipe for disaster. Any token could trigger a cascade of hallucinations, wild tangents, ignored prompts, poisoned contexts, and similar issues that have plagued this tech since the beginning. Except that now you've wired them up to external tools, so maybe the system chooses to wipe your home directory for whatever reason.
People nonchalantly trusting nondeterministic tech with increasingly more real-world tasks should concern everyone. Today it's executing `ping` and `rm`; tomorrow it's managing nuclear launch systems.
I believe that would be a powerful tool solving many things there are now separate techniques for.
I am looking for a language or library agnostic pattern like we have MVC etc. for web applications. Or Gang of Four patterns but for building agents.
print "Hello world!"
so easy...
... let's see ...
client = OpenAI()
Um right. That's like saying you should implement a web server, you will learn so much, and then you go and import http (in golang). Yeah well, sure, but that brings you like 98% of the way there, doesn't it? What am I missing?
POST https://api.openai.com/v1/responses
The term "agent" isn't really defined, but it's generally a wrapper around an LLM designed to do some task better than the LLM would on its own.
Think Claude vs Claude Code. The latter wraps the former, but with extra prompts and tooling specific to software engineering.
The fact you find this trivial is kind of the point that's being made. Some people think having an agent is some kind of voodoo, but it's really not.
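To make "it's really not voodoo" concrete: a single agent turn is one HTTP request to that endpoint (a hedged sketch using the requests library; the payload is illustrative):

import os
import requests

# One "turn" of an agent is literally just this call; the loop and tool dispatch
# around it are the part you write yourself.
resp = requests.post(
    "https://api.openai.com/v1/responses",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-4o-mini",
        "input": "Say hello to the agent builders.",
    },
)
resp.raise_for_status()
print(resp.json())  # the model's output items (text, tool calls, ...) are under "output"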