I agree a dedicated layer for intent capture makes a lot of sense. I thought about that as well, I am just not fully convinced it has to be conversational (or free-form conversational). Writing a prompt to get the right spec change is still a skill in itself, and it feels like it'd just be shifting the problem upstream rather than actually solving it. A structured editing experience over specs feels like it'd be more tractable to me. But the explicit vs inferred distinction you mention is interesting and worth thinking through more.
It's just that we're lazy. After being able to chat, I don't see people going back. You can't just paste some error into the specs, you can't paste it image and say it make it look more like this. Plus however well designed the spec, something like "actually make it always wait for the user feedback" can trigger changes in many places (even for the sake of removing contradictions).
1. You can write a spec that builds something that is not what you actually wanted
2. You can write spec that is incoherent with itself or with the external world
3. You can write a spec that doesn't have sufficient mechanical sympathy with the tooling you have and so it requires you to all spec out more and more of the surrounding tech than you practically can.
All of those issues can be addressed by iterating on the spec with the help of agents. It's just an engineering practice, one that we have to become better at understanding
The third point is harder. You still need to know your tooling well enough to write a spec that works with it. That part hasn't gone away.
Say "I want HN client for mobile", "must notify me about comments", you see it and you add "should support dark mode". Can you see how that is much less than anything in any programming language?
I also make individual tasks md files (task.md) which makes them capable of carrying intent, plan, but not just checkbox driven "- [ ]" gates, they get annotated with outcomes, and become a workbook after execution. The same task.md is seen twice by judge agents which run without extra context, the plan judge and the implementation judge.
I ran tests to see which component of my harness contributes the most and it came out that it is the judges. Apparently claude code can solve a task with or without a task file just as well, but the existence of this task file makes plans and work more auditable, and not just for bugs, but for intent follow.
Coming back to user intent, I have a post user message hook that writes user messages to a project scoped chat_log.md file, which means all user messages are preserved (user text << agent text, it is efficient), when we start a new task the chat log is checked to see if intent was properly captured. I also use it to recover context across sessions and remember what we did last.
Once every 10-20 tasks I run a retrospective task that inspects all task.md files since last retro and judges how the harness performs and project goes. This can detect things not apparent in task level work, for example when using multiple tasks to implement a more complex feature, or when a subsystem is touched by multiple tasks. I think reflection is the one place where the harness itself and how we use it can be refined.
claude plugin marketplace add horiacristescu/claude-playbook-plugin
source at https://github.com/horiacristescu/claude-playbook-plugin/tree/mainThe judge finding is interesting though. Right now verification during build for each task in Ossature is command-based, compile, tests, that kind of thing. A judge checking spec-to-code fidelity rather than (or maybe in addition to?) runtime correctness is worth thinking about.
One interesting data point - I counted word count in my chat messages vs final code and they came out about 1:1, but in reality a programmer would type 10x the final code during development. From a different perspective I found I created 10x more projects since I relied on Claude and my harness than before. So it looks user intent is 10x more effective than manual coding now.