For pruning we landed on a last-touched timestamp + recall frequency counter per memory. Things not accessed in N sessions that were weakly formed to begin with get soft-deleted. Human review before hard delete is probably better UX if your setup allows it.
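For concreteness, a minimal sketch of that pass — field names like `last_touched_session` and `recall_count` are illustrative, not our actual schema:

```python
SESSION_WINDOW = 10   # N sessions with no access before a memory counts as stale
MIN_RECALLS = 3       # below this, the memory counts as "weakly formed"

def prune_pass(memories, current_session):
    """Soft-delete memories that are both stale and weakly formed.

    Soft-deleted entries stay around for human review before hard delete.
    """
    for mem in memories:
        stale = (current_session - mem["last_touched_session"]) > SESSION_WINDOW
        weak = mem["recall_count"] < MIN_RECALLS
        if stale and weak:
            mem["soft_deleted"] = True  # queued for review, not destroyed
    return [m for m in memories if not m.get("soft_deleted")]
```

The two conditions are ANDed on purpose: a strongly formed memory survives going cold, and a weak one survives if it keeps getting recalled.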
Curious what "dead ends" look like in yours: conversational chains that didn't resolve, or factual ones?
Yeah that makes total sense. I wonder (and am sure the labs are doing so) if the HitL output would be good for fine-tuning the models to do it autonomously?
I’m sticking with humans for the moment because I’m not sure where the boundaries lie: what actually makes it better and what makes it worse. It’s non-obvious so far.
Pruning “loops” has been pretty effective though: cases where a model gets stuck over N turns checking the same thing over and over, and doesn’t break out of it until way later. That has been good because it gives strong context-size benefits, but it’s also the most automatable, I think.
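A crude version of that loop detector can be plain text similarity over consecutive turns — `SequenceMatcher` is a stand-in here; a real one would probably want embeddings, and all the names and thresholds are made up:

```python
from difflib import SequenceMatcher

LOOP_WINDOW = 3    # flag when this many consecutive turns look alike
SIMILARITY = 0.9   # crude text-similarity threshold

def find_loop_spans(turns):
    """Return (start, end) index ranges of runs of near-duplicate turns,
    i.e. candidates for pruning out of the context window."""
    spans, start = [], 0
    for i in range(1, len(turns)):
        ratio = SequenceMatcher(None, turns[i - 1], turns[i]).ratio()
        if ratio < SIMILARITY:
            # run of similar turns ended at i-1
            if i - start >= LOOP_WINDOW:
                spans.append((start, i - 1))
            start = i
    if len(turns) - start >= LOOP_WINDOW:
        spans.append((start, len(turns) - 1))
    return spans
```

You'd keep the first and last turn of each flagged span (so the resolution survives) and drop the middle.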
Pruning factually incorrect turns is something I’m trying, and pruning “correct” but “not correct based on my style” as well. Building a dataset of it all is fun :)
The HitL fine-tuning angle is exactly right. The labeled dataset you're building (good/bad/stylistically-wrong memory events) is probably worth more than the compaction itself. Coherence preferences are surprisingly personal — what reads as "not correct based on my style" is hard to spec without examples.
The loop-pruning maps really cleanly to the contradiction detection in our setup. A model circling the same state N times is often because it stored an inconclusive result with the same confidence as a resolved one, so they look identical at recall time. Tagging memory entries with a status (open, resolved, or contradicted) before they go in cuts a lot of that.
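Sketch of that status tagging, assuming a simple dataclass shape (names hypothetical):

```python
from dataclasses import dataclass
from enum import Enum

class Status(Enum):
    OPEN = "open"
    RESOLVED = "resolved"
    CONTRADICTED = "contradicted"

@dataclass
class MemoryEntry:
    content: str
    status: Status
    confidence: float

def recall_order(entries):
    """At recall time, surface resolved entries first so an inconclusive
    check no longer looks identical to a settled one."""
    return sorted(entries, key=lambda e: e.status != Status.RESOLVED)
```

The point is less the ordering than that the status is written at store time, when the model still knows whether the check actually resolved.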
On the autonomy question: we ended up treating certainty as continuous rather than binary. Low-certainty memories stay soft, high-certainty ones get promoted. Automatic compaction only operates on the low end; higher-certainty entries are off-limits without explicit override. That lets you keep the autonomy without the coherence risk. The failure mode shifts from "deleted something important" to "kept something stale too long," which feels more recoverable.
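Roughly, with made-up thresholds:

```python
PROMOTE_AT = 0.8     # at or above this, the entry is "hard": off-limits to auto-compaction
COMPACT_BELOW = 0.3  # automatic compaction only touches this low-certainty tail

def compaction_candidates(memories, override=False):
    """Only the low-certainty tail is eligible for automatic compaction;
    touching anything else requires an explicit override."""
    if override:
        return list(memories)
    return [m for m in memories if m["certainty"] < COMPACT_BELOW]

def promote(mem):
    """Promote a high-certainty entry so it survives compaction passes."""
    if mem["certainty"] >= PROMOTE_AT:
        mem["hard"] = True
    return mem
```

Note the gap between the two thresholds: mid-certainty entries are neither compacted nor promoted, which is where the "kept something stale too long" failure mode lives.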
Would be curious what your pruning signal looks like at the turn level — are you scoring relevance per-turn retroactively, or flagging at write time?