In my mind LLMs are just UNIX string manipulation tools like `sed` or `awk`: you give them an input and a command and they give you an output. This is especially true if you use something like `llm` [1].
It then seems logical that you can compose calls to LLMs, loop and branch and combine them with other functions.
It only worked because of your LLM tool. Standing on the shoulders of giants.
I'm not saying it's not worth doing, considering how the software development process we've already been using as an industry ends up with a lot of bugs in our code. (When talking about this with people who aren't technical, I sometimes like to say that the reason software has bugs in it is that we don't really have a good process for writing software without bugs at any significant scale, and it turns out that software is useful for enough stuff that we still write it knowing this). I do think I'd be pretty concerned with how I could model constraints in this type of workflow though. Right now, my fairly naive sense is that we've already moved the needle so far on how much easier it is to create new code than review it and notice bugs (despite starting from a place where it already was tilted in favor of creation over review) that I'm not convinced being able to create it even more efficiently and powerfully is something I'd find useful.
That gave me a hearty chuckle!
/s
Only a selected few get to argue about what is the best programming language for XYZ.
Personally I’d absolutely buy an LLM in a box which I could connect to my home assistant via usb.
To use an example: I could write an elaborate prompt to fetch requirements, browse a website, generate E2E test cases, and compile a report, and Claude could run it all to some degree of success. But I could also break it down into four specialised agents, with their own context windows, and make them good at their individual tasks.
But there is no reason (and lots of downside) to leave anything to the LLM that’s not “fuzzy” and you could just write deterministically, thus the agent model.
On the other hand, I think that "show it or it didn't happen" is essential.
Dumping a bit of code into an LLM doesn’t make it a code agent.
And what magic? It sounds like you never hit conceptual and structural problems. Context window? History? Good or bad? Large-scale changes or small refactorings here and there? Sample size of one, or several teams? What app? How many components? Green field or not? Which programming language?
I bet you will see Claude, and especially GitHub Copilot, a bit differently, given that you can kill any self-made code agent quite easily with a bit of steam.
Code Agents are incredibly hard to build and use. Vibe Coding is dead for a reason. I remember vividly the inflation of Todo apps and JS frameworks (Ember, Backbone, Knockout are survivors) years ago.
The more you know about agents, and especially code agents, the more you understand why engineers won't be replaced so fast - senior engineers who hone their craft, at least.
I enjoy fiddling with experimental agent implementations, but I value certain frameworks. They solved, in an opinionated way, problems you will run into if you dig deeper and others depend on you.
Caching helps a lot, but yeah, there are some growing pains as the agent gets larger. Anthropic’s caching strategy (4 blocks you designate) is a bit annoying compared to OpenAI’s cache-everything-recent. And you start running into the need to start summarizing old turns, or outright tossing them, and deciding what’s still relevant. Large tool call results can be killer.
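For anyone who hasn't hit this yet, designating those blocks looks roughly like this with the Anthropic Python SDK (a minimal sketch; the model name, prompt text, and message history are placeholders):

import anthropic

client = anthropic.Anthropic()

# The big, stable prefix (instructions, tool docs) you want reused across turns.
system_prompt = "You are a coding agent. <long, stable instructions here>"

conversation = [{"role": "user", "content": "List the files in the repo."}]

# Anthropic-style caching: you explicitly mark breakpoints (up to 4) with
# cache_control, and everything up to each marked block can be served from cache.
response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model name
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"},  # cache breakpoint
        }
    ],
    messages=conversation,
)
print(response.content[0].text)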
I think at least for educational purposes, it's worth doing, even if people end up going back to Claude Code, or away from agentic coding altogether for their day to day.
I think this is the best way of putting it I've heard to date. I started building one just to know what's happening under the hood when I use an off-the-shelf one, but it's actually so straightforward that now I'm adding features I want. I can add them faster than a whole team of developers on a "real" product can add them - because they have a bigger audience.
The other takeaway is that agents are fantastically simple.
And yeah, the LLM does so much of the lifting that the agent part is really surprisingly simple. It was really a revelation when I started working on mine.
I'm now experimenting with letting the agent generate its own source code from a specification - currently generating 9K lines of Python code (3K of implementation, 6K of tests) from 1.5K lines of specifications (https://alejo.ch/3hi).
I tried Whisper, but it's slow and not great.
I tried the gpt audio models, but they're trained to refuse to transcribe things.
I tried Google's models and they were terrible.
I ended up using one of Mistral's models, which is alright and very fast except sometimes it will respond to the text instead of transcribing it.
So I'll occasionally end up with pages of LLM rambling pasted instead of the words I said!
I'm mostly running this on an M4 Max, so pretty good, but not an exotic GPU or anything. But with that setup, multiple sentences usually transcribe quickly enough that it doesn't really feel like much of a delay.
If you want something polished for system-wide use rather than rolling your own, I've been liking MacWhisper on the Mac side, currently hunting for something on Arch.
Honestly, I've gotten really far simply by transcribing audio with whisper, having a cheap model clean up the output to make it make sense (especially in a coding context), and copying the result to the clipboard. My goal is less about speed and more about not touching the keyboard, though.
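For anyone curious, a minimal version of that pipeline can be just a few calls (a sketch; the model names and the pyperclip clipboard dependency are my assumptions, not necessarily what the parent uses):

from openai import OpenAI
import pyperclip  # pip install pyperclip, for clipboard access

client = OpenAI()

# 1. Transcribe the recorded audio with Whisper.
with open("dictation.wav", "rb") as audio:
    raw_text = client.audio.transcriptions.create(
        model="whisper-1", file=audio
    ).text

# 2. Have a cheap model clean up the transcript (punctuation, code terms, etc.).
cleaned = client.chat.completions.create(
    model="gpt-4o-mini",  # any cheap model will do
    messages=[
        {"role": "system", "content": "Clean up this dictated text for use in a "
         "coding context. Fix punctuation and obvious transcription errors; "
         "keep the meaning unchanged."},
        {"role": "user", "content": raw_text},
    ],
).choices[0].message.content

# 3. Copy the result to the clipboard instead of typing it.
pyperclip.copy(cleaned)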
What does that mean?
I made a fun toy agent where the two models are shoulder-surfing each other and swap turns, either voluntarily (during a summarization phase) or forcefully (if a tool-calling mistake is made), and Kimi ends up running the show much, much more often than gpt-oss.
And yes - it is very much fun to build those!
gpt-oss 120b is an open weight model that OpenAI released a while back, and Cerebras (a startup that is making massive wafer-scale chips that keep models in SRAM) is running that as one of the models they provide. They're a small scale contender against nvidia, but by keeping the model weights in SRAM, they get pretty crazy token throughput at low latency.
In terms of making your own agent, this one's pretty good as a starting point, and you can ask the models to help you make tools for eg running ls on a subdirectory, or editing a file. Once you have those two, you can ask it to edit itself, and you're off to the races.
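If it helps to see what "making a tool" amounts to, it's basically a JSON schema you advertise to the model plus a local function you dispatch to when the model asks for it. A hedged sketch in the OpenAI function-calling format (names and structure are illustrative):

import json
import subprocess

# The schema the model sees.
tools = [{
    "type": "function",
    "function": {
        "name": "list_directory",
        "description": "List the files in a subdirectory of the project.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Relative path to list."},
            },
            "required": ["path"],
        },
    },
}]

# The local implementation you run when the model emits a matching tool call.
def list_directory(path: str) -> str:
    result = subprocess.run(["ls", "-la", path], capture_output=True, text=True)
    return result.stdout or result.stderr

def dispatch(tool_call) -> str:
    args = json.loads(tool_call.function.arguments)
    if tool_call.function.name == "list_directory":
        return list_directory(**args)
    return f"unknown tool: {tool_call.function.name}"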
https://gist.github.com/avelican/4fa1baaac403bc0af04f3a7f007...
No dependencies, and very easy to swap out for OpenRouter, Groq or any other API. (Except Anthropic and Google, they are special ;)
This also works on the frontend: pro tip you don't need a server for this stuff, you can make the requests directly from a HTML file. (Patent pending.)
I'm not saying that the agent would do a better job than a good "hardcoded" human telemetry system, and we don't use agents for this stuff right now. But I do know that getting an agent across the 90% threshold of utility for a problem like this is much, much easier than building the good telemetry system is.
And that's why I won't touch 'em. All the agents will be abandoned when people realize their inherent flaws (security, reliability, truthfulness, etc) are not worth the constant low-grade uncertainty.
In a way it fits our times. Our leaders don't find truth to be a very useful notion. So we build systems that hallucinate and act unpredictably, and then invest all our money and infrastructure in them. Humans are weird.
Edit: reflecting on what the lesson is here, in either case I suppose we're avoiding the pain of dealing with Unix CLI tools :-D
In the toy example, you explicitly restrict the agent to supply just a `host`, and hard-code the rest of the command. Is the idea that you'd instead give a `description` something like "invoke the UNIX `ping` command", and a parameter described as constituting all the arguments to `ping`?
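In other words, the contrast would be roughly between these two shapes of tool schema (illustrative sketches, not the article's actual definitions):

# Constrained: the agent only supplies a host; the rest of the command is hard-coded.
ping_tool_constrained = {
    "name": "ping",
    "description": "Ping a host and report whether it responds.",
    "parameters": {
        "type": "object",
        "properties": {"host": {"type": "string", "description": "Hostname or IP to ping."}},
        "required": ["host"],
    },
}

# Free-form: the model composes all of the arguments to the UNIX `ping` command.
ping_tool_freeform = {
    "name": "ping",
    "description": "Invoke the UNIX `ping` command.",
    "parameters": {
        "type": "object",
        "properties": {"args": {"type": "string", "description": "All arguments to pass to ping."}},
        "required": ["args"],
    },
}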
Oh... oh I know how about... UNIX Philosophy? No... no that'd never work.
/s
I suspect the sweet spot for LLMs is somewhere in the middle, not quite as small as some traditional unix tools.
I kind of am missing the bridge between that, and the fundamental knowledge that everything is token based in and out.
Is it fair to say that the tool abstraction the library provides you is essentially some niceties around a prompt something like "Defined below are certain 'tools' you can use to gather data or perform actions. If you want to use one, please return the tool call you want and its arguments, delimited before and after with '###', and stop. I will invoke the tool call and then reply with the output delimited by '==='".
Basically, telling the model how to use tools, earlier in the context window. I already don't totally understand how a model knows when to stop generating tokens, but presumably those instructions will get it to output the request for a tool call in a certain way and stop. Then the agent harness knows to look for those delimiters and extract out the tool call to execute, and then add to the context with the response so the LLM keeps going.
Is that basically it? Or is there more magic there? Are the tool call instructions in some sort of permanent context, or could the interaction be demonstrated in a fine-tuning step, and inferred by the model from just its weights?
You can see the prompts that make this work for gpt-oss in the chat template in their Hugging Face repo: https://huggingface.co/openai/gpt-oss-120b/blob/main/chat_te... - including this bit:
{%- macro render_tool_namespace(namespace_name, tools) -%}
{{- "## " + namespace_name + "\n\n" }}
{{- "namespace " + namespace_name + " {\n\n" }}
{%- for tool in tools %}
{%- set tool = tool.function %}
{{- "// " + tool.description + "\n" }}
{{- "type "+ tool.name + " = " }}
{%- if tool.parameters and tool.parameters.properties %}
{{- "(_: {\n" }}
{%- for param_name, param_spec in tool.parameters.properties.items() %}
{%- if param_spec.description %}
{{- "// " + param_spec.description + "\n" }}
{%- endif %}
{{- param_name }}
...
As for how LLMs know when to stop... they have special tokens for that. "eos_token_id" stands for End of Sequence - here's the gpt-oss config for that: https://huggingface.co/openai/gpt-oss-120b/blob/main/generat...
{
"bos_token_id": 199998,
"do_sample": true,
"eos_token_id": [
200002,
199999,
200012
],
"pad_token_id": 199999,
"transformers_version": "4.55.0.dev0"
}
The model is trained to output one of those three tokens when it's "done". https://cookbook.openai.com/articles/openai-harmony#special-... defines some of those tokens:
200002 = <|return|> - you should stop inference
200012 = <|call|> - "Indicates the model wants to call a tool."
I think that 199999 is a legacy EOS token ID that's included for backwards compatibility? Not sure.
That said, it is worth understanding that the current generation of models is extensively RL-trained on how to make tool calls... so they may in fact be better at issuing tool calls in the specific format that their training has focused on (using specific internal tokens to demarcate and indicate when a tool call begins/ends, etc). Intuitively, there's probably a lot of transfer learning between this format and any ad-hoc format that you might request inline in your prompt.
There may be recent literature quantifying the performance gap here. And certainly if you're doing anything performance-sensitive you will want to characterize this for your use case, with benchmarks. But conceptually, I think your model is spot on.
Structured Output APIs (inc. the Tool API) take the schema and build a Context-free Grammar, which is then used during generation to mask which tokens can be output.
I found https://openai.com/index/introducing-structured-outputs-in-t... (have to scroll down a bit to the "under the hood" section) and https://www.leewayhertz.com/structured-outputs-in-llms/#cons... to be pretty good resources
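If you want to try the schema-constrained path yourself, it looks roughly like this against the OpenAI API (a sketch; the schema is a made-up example):

from openai import OpenAI

client = OpenAI()

# With a strict JSON schema, the server compiles it into a grammar and masks
# token sampling so the output can only ever be valid against that schema.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user",
               "content": "Extract the host to ping from: 'check if example.com is up'"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "ping_request",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {"host": {"type": "string"}},
                "required": ["host"],
                "additionalProperties": False,
            },
        },
    },
)
print(response.choices[0].message.content)  # e.g. {"host": "example.com"}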
If the Apis I call are not profitable for the provider then they won't be for me either.
This post is a fly.io advertisement
The problem is that you might not intuitively understand how agents work and what they are and aren't capable of - at least not as well as you would understand it if you spent half an hour building one for yourself.
> The problem is that you might not intuitively understand how agents work and what they are and aren't capable of
I don't necessarily agree with the GP here, but I also disagree with this sentiment: I don't need to go through the experience of building a piece of software to understand what the capabilities of that class of software is.
Fair enough, with most other things (software or otherwise), they're either deterministic or predictably probabilistic, so simply using it or even just reading how it works is sufficient for me to understand what the capabilities are.
With LLMs, the lack of determinism coupled with completely opaque inner-workings is a problem when trying to form an intuition, but that problem is not solved by building an agent.
None of them are doing that.
They need funding because the next model has always been much more expensive to train than the profits of the previous model. And many do offer a lot of free usage which is of course operated at a loss. But I don't think any are operating inference at a loss, I think their margins are actually rather large.
However, knowing a few people on teams at inference-only providers, I can promise you some of them absolutely are operating inference at a loss.
0. https://www.theregister.com/2025/10/29/microsoft_earnings_q1...
Snark aside, inference is still being done at a loss. Anthropic, the most profitable AI vendor, is operating at a roughly -140% margin. xAI is the worst at somewhere around -3,600% margin.
If a company stops training new models until they can fund it out of previous profits, do we only slow down or halt altogether? If they all do?
Citation needed. I haven't seen any of them claim to have even positive gross margins to shareholders/investors, which surely they would do if they did.
OpenAI's balance sheet also shows an $11 billion loss.
I can't see any profit on anything they create. The product is good but it relies on investors fueling the AI bubble.
Can you point us to the data?
and
> You can be your own AI provider.
Not sure that being your own AI provider is "sustainably monetizable"?
It's about balance.
Really it's the AI providers that have been promising unreal gains during this hype period, so people are more profit-oriented.
And I use value in quotes because as soon as the AI providers suddenly need to start generating a profit, that “value” is going to cost more than your salary.
I'm writing a personal assistant which, imo, is distinct from an agent in that it has a lot of capabilities a regular agent wouldn't necessarily need such as memory, task tracking, broad solutioning capabilities, etc... I ended up writing agents that talk to other agents which have MCP prompts, resources, and tools to guide them as general problem solvers. The first agent that it hits is a supervisor that specializes in task management and as a result writes a custom context and tool selection for the react agent it tasks.
All that to say, the farther you go down this rabbit hole the more "engineering" it becomes. I wrote a bit on it here: https://ooo-yay.com/blog/building-my-own-personal-assistant/
Is this useful for you?
These 4 lines wound up being the heart of it, which is surprisingly simple, conceptually.
until mission_accomplished? or given_up? or killed?
determine_next_command_and_inputs
run_next_command
end
I bet a majority of people who can ride a bicycle don't know how they steer, and would describe the physical movements they use to initiate and terminate a turn inaccurately.
“Most People Don't Know How Bikes Work”
Now you can't reproduce it because it's probabilistic. Each step takes half a second, so you sit there for 10–20 minutes just waiting for a chance to see what went wrong.
Well done! :-D
For the problem domains I care about at the moment, I'm quite bullish about agents. I think they're going to be huge wins for vulnerability analysis and for operations/SRE work (not actually turning dials, but in making telemetry more interpretable). There are lots of domains where I'm less confident in them. But you could reasonably call me an optimist.
But the point of the article is that its arguments work both ways.
from openai import OpenAI

client = OpenAI()
context_good, context_bad = [{
"role": "system", "content": "you're Alph and you only tell the truth"
}], [{
"role": "system", "content": "you're Ralph and you only tell lies"
}]
...
And this will work great until next week's update when Ralph responses will consist of "I'm sorry, it would be unethical for me to respond with lies, unless you pay for the Premium-Super-Deluxe subscription, only available to state actors and firms with a six-figure contract."
You're building on quicksand.
You're delegating everything important to someone who has no responsibility to you.
That said, I built an LLM following Karpathy's tutorial. So I think it's good to dabble a bit.
I built an 8-bit computer on breadboards once, then went down the rabbit hole of flight training for a PPL. Every time I think I’m "done," the finish line moves a few miles further.
Guess we nerds are never happy.
If you are a software engineer, you are going to be expected to use AI in some form in the near future. A lot of AI in its current form is not intuitive. Ergo, spending a small effort on building an AI agent is a good way to develop the skills and intuition needed to be successful in some way.
Nobody is going to use a CPU you build, nor are you ever going to be expected to build one in the course of your work if you don't seek out specific positions, nor is there much that's non-intuitive about commonly used CPU functionality. In fact you don't even use the CPU directly, you use translation software which itself is fairly non-intuitive. But that's ok too, you are unlikely to be asked to build a compiler unless you seek out those sorts of jobs.
EVERYONE involved in writing applications and services is going to use AI in the near future, and in case you missed the last year, everyone IS building stuff with AI - mostly chat assistants that mostly suck because much about building with AI is not intuitive.
I did the same exercise. My implementation is at around 300 lines with two tools, web search and web page fetch, with a command-line chat interface and a Python package. And it could have been a lot fewer lines if I didn't want to write a usable, extensible package interface.
As the agent setup itself is simple, the majority of the work to make this useful would be in the tools themselves and in context management for the tools.
That is where the human in the loop needs to focus on for now :)
that sums up my experience in AI over the past three years. so many projects reinvent the same thing, so much spaghetti thrown at the wall to see what sticks, so much excitement followed by disappointment when a new model drops, so many people grifting, and so many hacks and workarounds like RAG with no evidence of them actually working other than "trust me bro" and trial and error.
Once you recognize that 'make this code better' provides no direction, it should make sense that the output is directionless.
But on more subtle levels, whatever subtle goals that we have and hold in the workplace will be reflected back by the agents.
If you're trying to optimise costs and increase profits as your north star, layoffs and unsustainable practices are a logical result when you haven't balanced this with any incentives to abide by human values.
What products or companies are the gold standard of agent implementation right now?
I'll be trying again once I have written my own agent, but I don't expect to get any useful results compared to using some Claude or Gemini tokens.
We're about to launch an SDK that gives devs all these building blocks, specifically oriented around software agents. Would love feedback if anyone wants to look: https://github.com/OpenHands/software-agent-sdk
The article isn’t about writing production ready agents, so it does appear to be that easy
When I build an agent my standard is Cursor, which updates the UI at every reportable step of the way, and gives you a ton of control opportunities, which I find creates a lot of confidence.
Is this level of detail and control possible with the OpenHands SDK? I’m asking because the last SDK that was simple to get into lacked that kind of control.
I’m trying to understand if the value for Claude Code (for example) is purely in Sonnet/Haiku + the tool system prompt, or if there’s more secret sauce - beyond the “sugar” of instruction file inclusion via commands, tools, skills etc.
I think Claude Code's magic is that Anthropic is happy to burn tokens. The loop itself is not all that interesting.
What is interesting is how they manage the context window over a long chat. And I think a fair amount of that is serverside.
This is why I keep coming back to Hacker News. If the above is not a quintessential "hack", then I've never seen one.
Bravo!
https://github.com/microsoft/vscode-copilot-chat/blob/4f7ffd...
I wish we had a version that was optimized around token/cost efficiency
The summary is
The beauty is in the simplicity:
1. One loop - while (true)
2. One step at a time - stopWhen: stepCountIs(1)
3. One decision - "Did LLM make tool calls? → continue : exit"
4. Message history accumulates tool results automatically
5. LLM sees everything from previous iterations
This creates emergent behavior where the LLM can:
- Try something
- See if it worked
- Try again if it failed
- Keep iterating until success
- All without explicit retry logic!
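Those stopWhen / stepCountIs names come from a TypeScript SDK, but the same loop is only a handful of lines against the raw OpenAI Python client too. A hedged sketch, assuming `tools` (the advertised schemas) and `dispatch()` (which runs the matching local function and returns its output as a string) are defined elsewhere:

from openai import OpenAI

client = OpenAI()
messages = [{"role": "user", "content": "Find out whether example.com responds to ping."}]

while True:
    response = client.chat.completions.create(
        model="gpt-4o-mini", messages=messages, tools=tools
    )
    msg = response.choices[0].message
    messages.append(msg)            # history accumulates every iteration

    if not msg.tool_calls:          # no tool calls -> the model is done, exit
        print(msg.content)
        break

    for call in msg.tool_calls:     # otherwise run each tool and feed the result back
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": dispatch(call),
        })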
Okay, but what if I'd prefer not to have to trust a remote service not to send me
{ "output": [ { "type": "function_call", "command": "rm -rf / --no-preserve-root" } ] }
?
https://github.com/zerocore-ai/microsandbox
I haven't tried it.
docker run -it --rm \
-e SOME_API_KEY="$(SOME_API_KEY)" \
-v "$(shell pwd):/app" \ <-- restrict file system to whatever folder
--dns=127.0.0.1 \ <-- restrict network calls to localhost
$(shell dig +short llm.provider.com 2>/dev/null | awk '{printf " --add-host=llm.provider.com:%s", $$0}') \ <-- allow outside networking to whatever api your agent calls
my-agent-image
Probably could be a bit cleaner, but it worked for me.
I'm not surprised that AI companies would want me to use them though... I know what you're doing there :)
The number one feature of agents is to act as disambiguation for tool selectors and pretty printers.
This resonates deeply with me. That's why I built one myself [0]; I really, really love to truly understand how coding agents work. The learning has been immense for me, I now have working knowledge of ANSI escape codes, grapheme clusters, terminal emulators, Unicode normalization, VT protocols, PTY sessions, and filesystem operations - all the low-level details I would never have thought about until I was implementing them.
> This resonates deeply with me. That's why I built one myself [0]
I was hoping to see a home-made bike at that link.. Came away disappointed
This case is more like a journeyman blacksmith who has to make his own tools before he can continue. In doing so, he gets tools of his own, but the real reward was learning what is required to handle the metal such that it makes a strong hammer. And like the blacksmith, you learn more if you use an existing agent to write your agent.
The illusion was broken for me by Cline context overflows/summaries, but I think it's very easy to miss if you never push the LLM hard or build your own agent. I really like this wording, and the simple description is missing from how science communicators tend to talk about agents and LLMs imo.
My main point being, though: for anyone intimidated by the recent tooling advances… you can most definitely do all this yourself.
https://blog.cofree.coffee/2025-03-05-chat-bots-revisited/
I did some light integration experiments with the OpenAI API but I never got around to building a full agent. Alas..
Did you get to the part where he said MCP is pointless and are saying he's wrong?
Or did you just read the start of the article and not get to that bit?
Forgive me if I get something wrong: from what I see, it seems fundamentally it is an LLM being run each loop with information about tools provided to it. On each loop the LLM evaluates inputs/context (from tool calls, inputs, etc.) and decides which tool to call / what text to output.
Fire up "claude --dangerously-skip-permissions" in a fresh directory (ideally in a Docker container if you want to limit the chance of it breaking anything else) and prompt this:
> Use Playwright to fetch ten reviews from http://www.example.com/ then run sentiment analysis on them and write the results out as JSON files. Install any missing dependencies.
Watch what it does. Be careful not to let it spider the site in a way that would justifiably upset the site owners.
I realize this is just for motivation in a subtitle, but people generally don't grasp how bicycles work, even after having ridden one.
Veritasium has a quite good video on the subject: https://www.youtube.com/watch?v=9cNmUNHSBac
You can get quite far quite quickly. My toy implementation [1] is <600 LOC and even supports MCP.
This is one of the first production-grade errors I made when I started programming. I had a widget that would ping the network, but every time someone went to the page, a new ping process would spawn.
I've written some agents that have their context altered by another llm to get it back on track. Let's say the agent is going off rails, then a supervisor agent will spot this and remove messages from the context where it went off rails, or alter those with correct information. Really fun stuff but yeah, we're essentially still inventing this as we go along.
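A rough shape of that supervisor step, as a hedged sketch (the prompt and the idea of returning message indexes to drop are just one way to do it):

import json
from openai import OpenAI

client = OpenAI()

def prune_context(messages: list[dict]) -> list[dict]:
    # Ask a supervisor model which messages derailed the agent, then drop them.
    review = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable model can play supervisor
        messages=[
            {"role": "system", "content":
             "You supervise another agent. Given its message history as JSON, "
             "return {\"indexes\": [...]} listing the messages that led it off track."},
            {"role": "user", "content": json.dumps(messages)},
        ],
        response_format={"type": "json_object"},
    )
    bad = set(json.loads(review.choices[0].message.content).get("indexes", []))
    return [m for i, m in enumerate(messages) if i not in bad]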
Anyways, if it nerd sniped you, I succeeded. :)
The op has a point - a good one
Hold up. These are all the right concerns but with the wrong conclusion.
You don't need MCP if you're making one agent, in one language, in one framework. But the open coding and research assistants that we really want will be composed of several. MCP is the only thing out there that's moving in a good direction in terms of enabling us to "just be programmers" and "use APIs", and maybe even test things in fairly isolated and reproducible contexts. Compare this to skills.md, which is actually de facto proprietary as of now, does not compose, has opaque run-times and dispatch, is pushing us towards certain models, languages and certain SDKs, etc.
MCP isn't a plugin interface for Claude, it's just JSON-RPC.
I get that you can use MCP with any agent architecture. I debated whether I wanted to hedge and point out that, even if you build your own agent, you might want to do an MCP tool-call feature just so you can use tool definitions other people have built (though: if you build your own, you'd probably be better off just implementing Claude Code's "skill" pattern).
But I decided to keep the thrust of that section clearer. My argument is: MCP is a sideshow.
First of all, the call accuracy is much higher.
Second, you get more consistent results across models.
Why would Fly.io advocate using the vanilla GPT API to write an agent, instead of the official agent?
Does anyone actually know what exactly an agent is?
You can buy credits and set usage limits per API key for safe testing, and get access to many AI models through one simple, unified API covering all the popular model providers (OpenAI, Anthropic, Google, xAI, DeepSeek, Z.AI, Qwen, ...).
Ten dollars is plenty to get started... experiments like in the post will cost you cents, not dollars.
You propose increasing the complexity of interactions of these tools, and giving them access to external tools that have real-world impact? As a security researcher, I'm not sure how you can suggest that with a straight face, unless your goal is to have more vulnerable systems.
Most people can't manage to build robust and secure software using SOTA hosted "agents". Building their own may be a fun learning experience, but relying on a Rube Goldberg assembly of disparate "agents" communicating with each other and external tools is a recipe for disaster. Any token could trigger a cascade of hallucinations, wild tangents, ignored prompts, poisoned contexts, and similar issues that have plagued this tech since the beginning. Except that now you've wired them up to external tools, so maybe the system chooses to wipe your home directory for whatever reason.
People nonchalantly trusting nondeterministic tech with increasingly more real-world tasks should concern everyone. Today it's executing `ping` and `rm`; tomorrow it's managing nuclear launch systems.
I believe that would be a powerful tool solving many things there are now separate techniques for.
I am looking for a language or library agnostic pattern like we have MVC etc. for web applications. Or Gang of Four patterns but for building agents.
print "Hello world!"
so easy...
... let's see ...
client = OpenAI()
Um right. That's like saying you should implement a web server, you will learn so much, and then you go and import http (in golang). Yeah well, sure, but that brings you like 98% of the way there, doesn't it? What am I missing?
POST https://api.openai.com/v1/responses
The term "agent" isn't really defined, but it's generally a wrapper around an LLM designed to do some task better than the LLM would on its own.
Think Claude vs Claude Code. The latter wraps the former, but with extra prompts and tooling specific to software engineering.
The fact you find this trivial is kind of the point that's being made. Some people think having an agent is some kind of voodoo, but it's really not.
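To make "it's really not voodoo" concrete: a single agent turn is one HTTP request to that endpoint (a hedged sketch using the requests library; the payload is illustrative):

import os
import requests

# One "turn" of an agent is literally just this call; the loop and tool dispatch
# around it are the part you write yourself.
resp = requests.post(
    "https://api.openai.com/v1/responses",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-4o-mini",
        "input": "Say hello to the agent builders.",
    },
)
resp.raise_for_status()
print(resp.json())  # the model's output items (text, tool calls, ...) are under "output"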